Tag: pyspark

How can I write an SQL query as a template in PySpark?

I want to write a function that takes a column, a dataframe containing that column, and a query template as arguments, and outputs the result of the query when run on the column. Something like: func_sql(df_tbl, 'age', 'select count(distinct {col}) from df_tbl'). Here, {col} should get replaced with 'age' and the output should be the result of the query run on 'age', i.e.
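A minimal sketch of such a helper, assuming the template refers to a temp view named df_tbl and uses Python str.format-style placeholders (both assumptions, not stated in the original question):

```python
from pyspark.sql import SparkSession, DataFrame

def func_sql(df_tbl: DataFrame, col: str, query_template: str) -> DataFrame:
    """Run a templated SQL query against df_tbl, substituting {col}."""
    spark = SparkSession.builder.getOrCreate()
    # Register the dataframe under the name the template references (assumed to be "df_tbl").
    df_tbl.createOrReplaceTempView("df_tbl")
    # Substitute the column name into the template, e.g. {col} -> age.
    query = query_template.format(col=col)
    return spark.sql(query)

# Usage (hypothetical dataframe df with an 'age' column):
# func_sql(df, "age", "select count(distinct {col}) from df_tbl").show()
```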

Spark.sql Filter rows by MAX

Below is part of a source file, which you could imagine being much bigger. After the following code, I would like to obtain this result. The aim is to: select the dates on which each cityname has the MAX total (note: a city can appear twice if it has the MAX total for two different dates), sort by total descending, then by date.
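One common way to do this is a window over cityname that keeps every row matching the per-city maximum, so ties on different dates are all retained. The column names cityname, date, and total and the sample rows below are assumptions for illustration:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data; the column names are assumptions.
df = spark.createDataFrame(
    [("Paris", "2021-01-01", 10), ("Paris", "2021-01-02", 10),
     ("Lyon", "2021-01-01", 7), ("Lyon", "2021-01-02", 3)],
    ["cityname", "date", "total"],
)

# Keep every row whose total equals the maximum total for its city,
# then apply the requested sort.
w = Window.partitionBy("cityname")
result = (
    df.withColumn("max_total", F.max("total").over(w))
      .filter(F.col("total") == F.col("max_total"))
      .drop("max_total")
      .orderBy(F.col("total").desc(), "date")
)
result.show()
```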

Spark: How to transpose and explode columns with dynamic nested arrays

I applied the algorithm from the question Spark: How to transpose and explode columns with nested arrays to transpose and explode a nested Spark dataframe with dynamic arrays. I have added the row """{"id":3,"c":[{"date":3,"val":3,"val_dynamic":3}]}""" to the dataframe, with a new column c whose array has a new val_dynamic field that can appear on a random basis. I'm looking for required output 2 (Transpose and
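A rough sketch of the dynamic part, assuming the rows are JSON strings like the one above and that every struct field except date should be turned into (name, value) pairs; the field list is read from the inferred schema rather than hard-coded:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input mirroring the row from the question; val_dynamic
# only exists in some records.
rows = [
    '{"id":1,"c":[{"date":1,"val":5}]}',
    '{"id":3,"c":[{"date":3,"val":3,"val_dynamic":3}]}',
]
df = spark.read.json(spark.sparkContext.parallelize(rows))

# Explode the array of structs, then read the struct's field names from the
# inferred schema so new fields such as val_dynamic are picked up dynamically.
exploded = df.select("id", F.explode("c").alias("c"))
fields = [f for f in exploded.schema["c"].dataType.fieldNames() if f != "date"]

# Turn each remaining struct field into (name, value) pairs via a map + explode.
result = exploded.select(
    "id",
    F.col("c.date").alias("date"),
    F.explode(
        F.create_map(*[x for f in fields for x in (F.lit(f), F.col(f"c.{f}"))])
    ).alias("name", "value"),
)
result.show()
```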

Is there a method to connect to PostgreSQL (DBeaver) from PySpark?

Hello, I installed PySpark and I have a local Postgres database in DBeaver: how can I connect to Postgres from PySpark? I tried this but I got an error. Answer: You need to add the jars you want to use when creating the SparkSession. See this: https://spark.apache.org/docs/2.4.7/submitting-applications.html#advanced-dependency-management Either when you start pyspark or when you
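A minimal sketch of the jar-plus-JDBC setup the answer describes; the driver path, connection URL, table name, and credentials below are placeholders:

```python
from pyspark.sql import SparkSession

# Point Spark at the PostgreSQL JDBC driver jar when building the session
# (spark.jars.packages with Maven coordinates works as an alternative).
spark = (
    SparkSession.builder
    .appName("postgres-example")
    .config("spark.jars", "/path/to/postgresql-42.2.24.jar")
    .getOrCreate()
)

# Read a table over JDBC; all connection details here are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")
    .option("dbtable", "public.my_table")
    .option("user", "postgres")
    .option("password", "secret")
    .option("driver", "org.postgresql.Driver")
    .load()
)
df.show()
```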