New to PySpark, I'm trying to get a query to run. It seems like it SHOULD run, but I get an EOF issue and I'm not sure how to resolve it. What I'm trying to do is find all rows in blah.table where the value in the column "domainname" matches a value from a list of domains. Then I want to
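For that kind of lookup, a minimal sketch (assuming the table really is blah.table, that domainname is a string column, and that the domain list lives in a plain Python list) is to filter with isin rather than hand-building an IN clause inside a SQL string:

```python
# Minimal sketch, assuming a table blah.table with a string column "domainname"
# and a hypothetical list of domains to match against.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

domains = ["example.com", "example.org"]  # hypothetical values

matched = spark.table("blah.table").where(F.col("domainname").isin(domains))
matched.show()
```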
Tag: apache-spark-sql
filter stop words from text column – spark SQL
I'm using Spark SQL and have a DataFrame with user IDs and product reviews. I need to filter stop words from the reviews, and I have a text file with the stop words to filter. I managed to split the reviews into lists of strings, but I don't know how to filter them. This is what I tried to do: thanks!
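One possible sketch (not the asker's code) is to load the custom list and hand it to StopWordsRemover, assuming the reviews are already split into an array<string> column and the stop-word file has one word per line; the column and file names below are assumptions:

```python
# Sketch: remove custom stop words from a tokenized review column.
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.getOrCreate()

# Hypothetical data with an already-tokenized "review_tokens" column
df = spark.createDataFrame(
    [(1, ["this", "product", "is", "great"])],
    ["user_id", "review_tokens"],
)

# Assumed local file with one stop word per line
with open("stopwords.txt") as f:
    stop_words = [line.strip() for line in f if line.strip()]

remover = StopWordsRemover(inputCol="review_tokens", outputCol="filtered_tokens",
                           stopWords=stop_words)
remover.transform(df).show(truncate=False)
```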
Select rows from a table which contains at-least one alphabet in the column
I have a column called name in a table in Databricks. I want to find a way to select only those rows from the table which contain at least one alphabetic character in the name column. Example values in the column: Expected: I need to pick only those values which contain at least one alphabetic character. Or, in other words, I
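A minimal sketch of one way to do this, using an assumed table name my_table, is a regex filter that matches any letter:

```python
# Sketch: keep only rows whose "name" contains at least one alphabetic character.
# "my_table" is an assumed table name.
result = spark.sql("""
    SELECT *
    FROM my_table
    WHERE name RLIKE '[A-Za-z]'
""")
result.show()
```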
Duplicate a row based on a column that has 2 values in spark sql
I have a temporary view that looks like this. What I want is to duplicate a row by adding an ‘All’ value to Activity. The expected result would be: I tried to create it through Zeppelin, but I am not able to update a view. Is there any way to do it, please? Unfortunately I can only use SQL. Thanks
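Since only SQL is available, one sketch is to UNION the view with a copy of itself in which Activity is replaced by 'All'; the view name my_view and the column names are assumptions, because the real schema is not shown in the excerpt:

```python
# Sketch only: my_view, Name and Activity are assumed names. Every row is
# emitted twice, once with its original Activity and once with Activity = 'All'.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW my_view_with_all AS
    SELECT Name, Activity FROM my_view
    UNION ALL
    SELECT Name, 'All' AS Activity FROM my_view
""")
```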
Spark.sql Filter rows by MAX
Below is part of a source file, which you could imagine being much bigger: After the following code: I would like to obtain this result: The aim is to: select the dates on which each cityname has the MAX total (note: a city can appear twice if it has the MAX total for 2 different dates), sort by total descending, then date
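A hedged sketch of that logic, assuming columns cityname, date and total as described and that df is the DataFrame produced by the earlier code, uses a window to find each city's maximum total and keeps only the matching dates:

```python
# Sketch: keep, per cityname, the date(s) carrying the maximum total, then sort.
# Column names (cityname, date, total) are taken from the question.
from pyspark.sql import functions as F, Window

w = Window.partitionBy("cityname")
result = (
    df.withColumn("max_total", F.max("total").over(w))
      .where(F.col("total") == F.col("max_total"))
      .drop("max_total")
      .orderBy(F.col("total").desc(), "date")
)
result.show()
```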
Return active vs inactive records count against a given date in a single column
I currently have the following UsersData table, which gives aggregated historical data at a given particular date:

Date        UserID  Name  isActive
2021-10-01  1       Sam   1
2021-10-01  2       Dan   1
2021-10-08  1       Sam   0
2021-10-08  2       Dan   1

Requirement: my requirement is to create another aggregated data set which will show active vs inactive records for the above given dates –
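One sketch of such an aggregation is a conditional sum per date; the exact output layout is not shown in the excerpt, so the two-count shape below is an assumption:

```python
# Sketch: count active vs inactive records per date from the UsersData table.
# Table and column names come from the question; the output columns are assumed.
spark.sql("""
    SELECT Date,
           SUM(CASE WHEN isActive = 1 THEN 1 ELSE 0 END) AS active_count,
           SUM(CASE WHEN isActive = 0 THEN 1 ELSE 0 END) AS inactive_count
    FROM UsersData
    GROUP BY Date
    ORDER BY Date
""").show()
```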
Spark: How to transpose and explode columns with dynamic nested arrays
I applied an algorithm from the question Spark: How to transpose and explode columns with nested arrays to transpose and explode a nested Spark dataframe with dynamic arrays. I have added to the dataframe """{"id":3,"c":[{"date":3,"val":3, "val_dynamic":3}]}""" , with a new column c, where the array has a new val_dynamic field which can appear on a random basis. I'm looking for required output 2 (Transpose and
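A partial sketch of the explode-and-flatten step (not the full transpose from the linked question), built directly from the JSON line above:

```python
# Sketch: read the JSON row shown above, explode the nested array "c" and pull
# its struct fields (date, val and the optional val_dynamic) up as columns.
from pyspark.sql import functions as F

json_line = '{"id":3,"c":[{"date":3,"val":3,"val_dynamic":3}]}'
df = spark.read.json(spark.sparkContext.parallelize([json_line]))

exploded = (
    df.select("id", F.explode("c").alias("c"))
      .select("id", "c.*")
)
exploded.show()
```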
Missing rows in full outer join
I am trying to count how many users are observed on each of 3 consecutive days. Each of the 3 intermediate tables (t0, t1, t2) has 2 columns: uid (a unique ID) and d0 (or d1 or d2), which is 1 and indicates that the user is observed on that day. The following query: produces this output from spark.sql(q).toPandas().set_index(["d0","d1","d2"]): Two
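One common cause of dropped rows in chained full outer joins is joining the third table on only the first table's key; a sketch that coalesces the keys instead (table and column names taken from the question, the original query itself is not shown) looks like:

```python
# Sketch: full outer join t0, t1, t2 on uid, coalescing uids so that users
# absent from t0 can still match t2, then count users per (d0, d1, d2) pattern.
q = """
    SELECT t0.d0, t1.d1, t2.d2, COUNT(*) AS n_users
    FROM t0
    FULL OUTER JOIN t1 ON t0.uid = t1.uid
    FULL OUTER JOIN t2 ON COALESCE(t0.uid, t1.uid) = t2.uid
    GROUP BY t0.d0, t1.d1, t2.d2
"""
spark.sql(q).toPandas().set_index(["d0", "d1", "d2"])
```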
How to use LIMIT to sample rows dynamically
I have a table as follows:

SampleReq  Group  ID
2          1      _001
2          1      _002
2          1      _003
1          2      _004
1          2      _005
1          2      _006

I want my query to sample IDs based on the column SampleReq, resulting in the following output:

Group  ID
1      _001
1      _003
2      _006

The query should pick any 2 IDs from group
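One sketch of this uses a row_number window and keeps only SampleReq rows per group; the view name samples is assumed, and "any" rows are approximated here by ordering on ID inside the window:

```python
# Sketch: take SampleReq rows from each Group via a ROW_NUMBER window.
# "samples" is an assumed view name; Group is backquoted because it is a keyword.
spark.sql("""
    SELECT `Group`, ID
    FROM (
        SELECT `Group`, ID, SampleReq,
               ROW_NUMBER() OVER (PARTITION BY `Group` ORDER BY ID) AS rn
        FROM samples
    ) t
    WHERE rn <= SampleReq
""").show()
```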
Filter dictionary in pyspark with key names
Given a dictionary-like column in a dataset, I want to grab the value for one key given that the value of another key is satisfied. Example: say I have a column ‘statistics’ in a dataset, where each data row looks like: I want to get the value of ‘eye’ whenever hair is ‘black’. I tried: but it gives an
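A minimal sketch, assuming statistics is a MapType column on a DataFrame df (column and key names taken from the question):

```python
# Sketch: return statistics['eye'] only for rows where statistics['hair'] == 'black'.
# Assumes "statistics" is a map<string,string> column on DataFrame df.
from pyspark.sql import functions as F

eyes = (
    df.where(F.col("statistics").getItem("hair") == "black")
      .select(F.col("statistics").getItem("eye").alias("eye"))
)
eyes.show()
```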