New to PySpark, I'm trying to get a query to run. It seems like it SHOULD run, but I get an EOF issue and I'm not sure how to resolve it. What I'm trying to do is find all rows in blah.table where the value in the column "domainname" matches a value from a list of domains. Then I want to
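For that kind of lookup, a minimal sketch (assuming the table really is blah.table, that domainname is a string column, and that the domain list lives in a plain Python list) is to filter with isin rather than hand-building an IN clause inside a SQL string:

```python
# Minimal sketch, assuming a table blah.table with a string column "domainname"
# and a hypothetical list of domains to match against.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

domains = ["example.com", "example.org"]  # hypothetical values

matched = spark.table("blah.table").where(F.col("domainname").isin(domains))
matched.show()
```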
Tag: apache-spark-sql
filter stop words from text column – spark SQL
I'm using Spark SQL and have a DataFrame with user IDs and product reviews. I need to filter stop words from the reviews, and I have a text file with the stop words to filter. I managed to split the reviews into lists of strings, but I don't know how to filter them. This is what I tried to do: thanks!
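One possible sketch (not the asker's code) is to load the custom list and hand it to StopWordsRemover, assuming the reviews are already split into an array<string> column and the stop-word file has one word per line; the column and file names below are assumptions:

```python
# Sketch: remove custom stop words from a tokenized review column.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with an already-tokenized "review_tokens" column
df = spark.createDataFrame(
    [(1, ["this", "product", "is", "great"])],
    ["user_id", "review_tokens"],
)

# Assumed local file with one stop word per line
with open("stopwords.txt") as f:
    stop_words = [line.strip() for line in f if line.strip()]

remover = StopWordsRemover(inputCol="review_tokens", outputCol="filtered_tokens",
                           stopWords=stop_words)
remover.transform(df).show(truncate=False)
```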
Select rows from a table which contains at-least one alphabet in the column
I have a column called name in a table in Databricks. I want to find a way to select only those rows from the table which contain at least one alphabetic character in the name column. Example values in the column: Expected: I need to pick only those values which contain at least one alphabetic character. Or, in other words, I
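A minimal sketch of one way to do this, using an assumed table name my_table, is a regex filter that matches any letter:

```python
# Sketch: keep only rows whose "name" contains at least one alphabetic character.
# "my_table" is an assumed table name.
result = spark.sql("""
    SELECT *
    FROM my_table
    WHERE name RLIKE '[A-Za-z]'
""")
result.show()
```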
Duplicate a row based on a column that has 2 values in spark sql
I have a temporary view that looks like this. What I want is to duplicate a row by adding an ‘All’ value to Activity. The expected result would be: I tried to create it through Zeppelin, but I am not able to update a view. Is there any way to do it, please? Unfortunately I can only use SQL. Thanks
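Since only SQL is available, one sketch is to UNION the view with a copy of itself in which Activity is replaced by 'All'; the view name my_view and the column names are assumptions, because the real schema is not shown in the excerpt:

```python
# Sketch only: my_view, Name and Activity are assumed names. Every row is
# emitted twice, once with its original Activity and once with Activity = 'All'.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW my_view_with_all AS
    SELECT Name, Activity FROM my_view
    UNION ALL
    SELECT Name, 'All' AS Activity FROM my_view
""")
```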
Spark.sql Filter rows by MAX
Below is part of a source file, which you could imagine being much bigger: After the following code: I would like to obtain this result: The aim is to: select the dates on which each cityname has the MAX total (note: a city can appear twice if it has the MAX total for 2 different dates), sort by total descending, then date
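A hedged sketch of that logic, assuming columns cityname, date and total as described and that df is the DataFrame produced by the earlier code, uses a window to find each city's maximum total and keeps only the matching dates:

```python
# Sketch: keep, per cityname, the date(s) carrying the maximum total, then sort.
# Column names (cityname, date, total) are taken from the question.
from pyspark.sql import functions as F, Window

w = Window.partitionBy("cityname")
result = (
    df.withColumn("max_total", F.max("total").over(w))
      .where(F.col("total") == F.col("max_total"))
      .drop("max_total")
      .orderBy(F.col("total").desc(), "date")
)
result.show()
```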
Return active vs inactive records count against a given date in a single column
I currently have the following UsersData table, which gives aggregated historical data at a given particular date:

Date        UserID  Name  isActive
2021-10-01  1       Sam   1
2021-10-01  2       Dan   1
2021-10-08  1       Sam   0
2021-10-08  2       Dan   1

Requirement: my requirement is to create another aggregated data set which will show active vs inactive records for the above given dates –
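One sketch of such an aggregation is a conditional sum per date; the exact output layout is not shown in the excerpt, so the two-count shape below is an assumption:

```python
# Sketch: count active vs inactive records per date from the UsersData table.
# Table and column names come from the question; the output columns are assumed.
spark.sql("""
    SELECT Date,
           SUM(CASE WHEN isActive = 1 THEN 1 ELSE 0 END) AS active_count,
           SUM(CASE WHEN isActive = 0 THEN 1 ELSE 0 END) AS inactive_count
    FROM UsersData
    GROUP BY Date
    ORDER BY Date
""").show()
```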
Spark: How to transpose and explode columns with dynamic nested arrays
I applied an algorithm from the question Spark: How to transpose and explode columns with nested arrays to transpose and explode a nested Spark dataframe with dynamic arrays. I have added to the dataframe """{"id":3,"c":[{"date":3,"val":3, "val_dynamic":3}]}""" , with a new column c, where the array has a new val_dynamic field which can appear on a random basis. I'm looking for required output 2 (Transpose and
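A partial sketch of the explode-and-flatten step (not the full transpose from the linked question), built directly from the JSON line above:

```python
# Sketch: read the JSON row shown above, explode the nested array "c" and pull
# its struct fields (date, val and the optional val_dynamic) up as columns.
from pyspark.sql import functions as F

json_line = '{"id":3,"c":[{"date":3,"val":3,"val_dynamic":3}]}'
df = spark.read.json(spark.sparkContext.parallelize([json_line]))

exploded = (
    df.select("id", F.explode("c").alias("c"))
      .select("id", "c.*")
)
exploded.show()
```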
Missing rows in full outer join
I am trying to count how many users are observed on each of 3 consecutive days. Each of the 3 intermediate tables (t0, t1, t2) has 2 columns: uid (a unique ID) and d0 (or d1 or d2), which is 1 and indicates that the user is observed on that day. The following query: produces this output from spark.sql(q).toPandas().set_index(["d0","d1","d2"]): Two
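One common cause of dropped rows in chained full outer joins is joining the third table on only the first table's key; a sketch that coalesces the keys instead (table and column names taken from the question, the original query itself is not shown) looks like:

```python
# Sketch: full outer join t0, t1, t2 on uid, coalescing uids so that users
# absent from t0 can still match t2, then count users per (d0, d1, d2) pattern.
q = """
    SELECT t0.d0, t1.d1, t2.d2, COUNT(*) AS n_users
    FROM t0
    FULL OUTER JOIN t1 ON t0.uid = t1.uid
    FULL OUTER JOIN t2 ON COALESCE(t0.uid, t1.uid) = t2.uid
    GROUP BY t0.d0, t1.d1, t2.d2
"""
spark.sql(q).toPandas().set_index(["d0", "d1", "d2"])
```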
How to use LIMIT to sample rows dynamically
I have a table as follows:

SampleReq  Group  ID
2          1      _001
2          1      _002
2          1      _003
1          2      _004
1          2      _005
1          2      _006

I want my query to sample IDs based on the column SampleReq, resulting in the following output:

Group  ID
1      _001
1      _003
2      _006

The query should pick any 2 IDs from group
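One sketch of this uses a row_number window and keeps only SampleReq rows per group; the view name samples is assumed, and "any" rows are approximated here by ordering on ID inside the window:

```python
# Sketch: take SampleReq rows from each Group via a ROW_NUMBER window.
# "samples" is an assumed view name; Group is backquoted because it is a keyword.
spark.sql("""
    SELECT `Group`, ID
    FROM (
        SELECT `Group`, ID, SampleReq,
               ROW_NUMBER() OVER (PARTITION BY `Group` ORDER BY ID) AS rn
        FROM samples
    ) t
    WHERE rn <= SampleReq
""").show()
```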
Filter dictionary in pyspark with key names
Given a dictionary-like column in a dataset, I want to grab the value for one key given that the value of another key is satisfied. Example: say I have a column ‘statistics’ in a dataset, where each data row looks like: I want to get the value of ‘eye’ whenever hair is ‘black’. I tried: but it gives an
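A minimal sketch, assuming statistics is a MapType column on a DataFrame df (column and key names taken from the question):

```python
# Sketch: return statistics['eye'] only for rows where statistics['hair'] == 'black'.
# Assumes "statistics" is a map<string,string> column on DataFrame df.
from pyspark.sql import functions as F

eyes = (
    df.where(F.col("statistics").getItem("hair") == "black")
      .select(F.col("statistics").getItem("eye").alias("eye"))
)
eyes.show()
```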