I'm using Spark SQL and have a dataframe with user IDs and reviews of products. I need to filter stop words out of the reviews, and I have a text file with the stop words to filter. I managed to split the reviews into lists of strings, but I don't know how to do the filtering. This is what I tried to do: Thanks!
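A minimal sketch of one approach, assuming the reviews are already tokenized into an array column: pyspark.ml.feature's StopWordsRemover accepts a custom stop-word list, so the file can be read on the driver and passed in. The column names and the stopwords.txt path below are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: user IDs plus reviews already split into tokens.
df = spark.createDataFrame(
    [(1, ["this", "product", "is", "great"])],
    ["user_id", "words"],
)

# Load the stop-word file (one word per line) into a plain Python list.
with open("stopwords.txt") as f:  # hypothetical path
    stop_words = [line.strip() for line in f if line.strip()]

# StopWordsRemover drops every token that appears in the supplied list.
remover = StopWordsRemover(inputCol="words", outputCol="filtered",
                           stopWords=stop_words)
remover.transform(df).show(truncate=False)
```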
PySpark: Adding elements from a Python list into a spark.sql() statement
I have a list in Python that is used throughout my code: I also have a simple spark.sql() line that I need to execute: I want to replace the list of elements in the spark.sql() statement with the Python list, so that the last line in the SQL is … I am aware of using {} and str.format, but I am struggling.
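A minimal sketch of the str.format approach, assuming a hypothetical events table with a status column: render the list into a quoted, comma-separated string and splice it into the IN (...) clause.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical list used throughout the code.
values = ["a", "b", "c"]

# Quote each element and join with commas: 'a','b','c'
in_list = ",".join("'{}'".format(v) for v in values)

# Splice the rendered list into the SQL text before executing it.
query = "SELECT * FROM events WHERE status IN ({})".format(in_list)
df = spark.sql(query)
```

Plain string interpolation like this is only safe when the list contents are trusted, since the values go into the SQL text verbatim.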
Select rows from a table which contain at least one alphabetic character in the column
I have a column called name in a table in Databricks. I want to find a way to select only those rows from the table which contain at least one alphabetic character in the name column. Example values in the column: Expected: I need to pick only those values which contain at least one alphabetic character in them. Or in other words, I …
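A minimal sketch, assuming the data is loaded as a dataframe: rlike matches a regular expression anywhere in the string, so '[a-zA-Z]' keeps exactly the rows whose name contains at least one alphabetic character. The sample values are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample values: some purely numeric, some with letters.
df = spark.createDataFrame([("123",), ("abc1",), ("9-9",)], ["name"])

# The regex only needs to match once somewhere in the value.
df.filter(F.col("name").rlike("[a-zA-Z]")).show()
```

The SQL equivalent is WHERE name RLIKE '[a-zA-Z]'.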
Spark SQL: Filter rows by MAX
Below is part of a source file, which you could imagine being much bigger: After the following code: I would like to obtain this result: The aim is to: select the dates on which each cityname has the MAX total (note: a city can appear twice if it has the MAX total on two different dates), then sort by total descending, then by date.
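A minimal sketch using a window over cityname, with hypothetical sample rows: computing the per-city maximum and filtering on equality keeps ties, so a city can legitimately appear twice.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows: (cityname, date, total); London ties on two dates.
df = spark.createDataFrame(
    [("London", "2021-01-01", 10),
     ("London", "2021-01-02", 10),
     ("Paris", "2021-01-01", 5)],
    ["cityname", "date", "total"],
)

w = Window.partitionBy("cityname")
result = (
    df.withColumn("max_total", F.max("total").over(w))
      .filter(F.col("total") == F.col("max_total"))  # keeps ties
      .drop("max_total")
      .orderBy(F.col("total").desc(), "date")
)
result.show()
```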
Return active vs. inactive record counts against a given date in a single column
I currently have the following UsersData table, which gives aggregated historical data at a given date:

Date        UserID  Name  isActive
2021-10-01  1       Sam   1
2021-10-01  2       Dan   1
2021-10-08  1       Sam   0
2021-10-08  2       Dan   1

Requirement: my requirement is to create another aggregated dataset which will show the active vs. inactive record counts for the above given dates …
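A minimal sketch against the rows above: labelling each row active/inactive and counting per (date, status) puts all the counts in a single count column, which is one reading of the truncated requirement.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The UsersData rows from the table above.
users_data = spark.createDataFrame(
    [("2021-10-01", 1, "Sam", 1),
     ("2021-10-01", 2, "Dan", 1),
     ("2021-10-08", 1, "Sam", 0),
     ("2021-10-08", 2, "Dan", 1)],
    ["Date", "UserID", "Name", "isActive"],
)

counts = (
    users_data
    # Turn the isActive bit into a readable status label.
    .withColumn("status", F.when(F.col("isActive") == 1, "active")
                           .otherwise("inactive"))
    .groupBy("Date", "status")
    .count()
    .orderBy("Date", "status")
)
counts.show()
```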
Spark: How to transpose and explode columns with dynamic nested arrays
I applied an algorithm from the question "Spark: How to transpose and explode columns with nested arrays" to transpose and explode a nested Spark dataframe with dynamic arrays. I have added the row """{"id":3,"c":[{"date":3,"val":3,"val_dynamic":3}]}""" to the dataframe, with a new column c, where the array has a new val_dynamic field which can appear on a random basis. I'm looking for the required output 2 (transpose and …
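A minimal sketch for the explode step, assuming rows shaped like the JSON snippet above: after explode, selecting c.* expands whatever fields the struct actually has, so an optional field such as val_dynamic surfaces automatically (as null on rows that lack it). The full transpose logic from the referenced question is not reproduced here.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# One row with the dynamic field, one without, mirroring the question.
df = spark.read.json(spark.sparkContext.parallelize(
    ['{"id":3,"c":[{"date":3,"val":3,"val_dynamic":3}]}',
     '{"id":4,"c":[{"date":4,"val":4}]}']
))

exploded = (
    df.select("id", F.explode("c").alias("c"))
      .select("id", "c.*")  # expands date, val and, when present, val_dynamic
)
exploded.show()
```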
Average open bug life in days
I am looking to identify the average lifetime in days for open bugs, based on severity.

bug  severity  status           date_assigned
1    A         open             2021-9-14
1    A         in progress      2021-9-15
1    A         fixed            2021-9-16
1    A         verified         2021-9-17
1    A         closed           2021-9-18
2    B         opened           2021-10-18
2    B         in progress      2021-10-19
2    B         closed with fix  2021-10-20
3    C         open             …
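A minimal sketch, taking "life" to mean the days between a bug's first and last recorded status, averaged by severity; that interpretation is an assumption, since the question is truncated.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# A subset of the rows from the table above.
bugs = spark.createDataFrame(
    [(1, "A", "open", "2021-9-14"),
     (1, "A", "closed", "2021-9-18"),
     (2, "B", "opened", "2021-10-18"),
     (2, "B", "closed with fix", "2021-10-20")],
    ["bug", "severity", "status", "date_assigned"],
)

life = (
    bugs.withColumn("d", F.to_date("date_assigned"))
        # Days between each bug's first and last status change...
        .groupBy("bug", "severity")
        .agg(F.datediff(F.max("d"), F.min("d")).alias("days_open"))
        # ...then averaged per severity.
        .groupBy("severity")
        .agg(F.avg("days_open").alias("avg_days_open"))
)
life.show()
```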
Is there a method to connect to PostgreSQL (DBeaver) from PySpark?
Hello, I installed PySpark and I have a local Postgres database in DBeaver: how can I connect to Postgres from PySpark, please? I tried this, but I got an error. Answer: You need to add the jars you want to use when creating the SparkSession. See this: https://spark.apache.org/docs/2.4.7/submitting-applications.html#advanced-dependency-management Either when you start pyspark or when you …
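Following the answer, a minimal sketch that pulls the PostgreSQL JDBC driver when the session is created and reads a table over JDBC; the host, database, credentials and table name below are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("postgres-example")
    # Fetch the driver jar at session start; a local --jars path works too.
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.24")
    .getOrCreate()
)

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")  # placeholder DB
    .option("dbtable", "my_table")                           # placeholder table
    .option("user", "postgres")                              # placeholder creds
    .option("password", "secret")
    .option("driver", "org.postgresql.Driver")
    .load()
)
df.show()
```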
Create rows from columns in an Apache Spark Dataset
I'm trying to create rows from the existing columns of a dataset. Here is my case:

InputDataset
accountid           payingaccountid     billedaccountid     startdate                   enddate
0011t00000MY1U3AAL  0011t00000MY1U3XXX  0011t00000ZZ1U3AAL  2020-06-10 00:00:00.000000  NULL

And I would like to have something like this:

accountid           startdate                   enddate
0011t00000MY1U3AAL  2021-06-10 00:00:00.000000  NULL
0011t00000MY1U3XXX  2021-06-10 00:00:00.000000  NULL
0011t00000ZZ1U3AAL  2021-06-10 00:00:00.000000  NULL

In the input dataset, the columns billedaccountid and …
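A minimal sketch using the SQL stack() generator on the input row above: stack(3, ...) emits one output row per account-id column, and the other columns are carried along unchanged.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The single input row; an explicit schema keeps the NULL column typed.
df = spark.createDataFrame(
    [("0011t00000MY1U3AAL", "0011t00000MY1U3XXX", "0011t00000ZZ1U3AAL",
      "2020-06-10 00:00:00.000000", None)],
    "accountid string, payingaccountid string, billedaccountid string, "
    "startdate string, enddate string",
)

rows = df.select(
    # One row per listed column, all landing in a single accountid column.
    F.expr("stack(3, accountid, payingaccountid, billedaccountid) as accountid"),
    "startdate", "enddate",
)
rows.show(truncate=False)
```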
How to execute custom logic over a PySpark window partition
I have a dataframe in the format shown below, where there will be multiple entries per DEPNAME. My requirement is to set result = Y at the DEPNAME level if either flag_1 or flag_2 = Y; if both flags, i.e. flag_1 and flag_2, are N, the result is set to N, as shown for DEPNAME=personnel.
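A minimal sketch over a DEPNAME window, with hypothetical flag rows: since 'Y' sorts after 'N', the max of the greater flag across the whole partition is 'Y' exactly when at least one flag on any row is Y.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows: personnel has only N flags, sales has one Y.
df = spark.createDataFrame(
    [("personnel", "N", "N"),
     ("sales", "Y", "N"),
     ("sales", "N", "N")],
    ["DEPNAME", "flag_1", "flag_2"],
)

w = Window.partitionBy("DEPNAME")
result = df.withColumn(
    "result",
    # greatest() picks the higher flag per row; max() over the window
    # then lifts it to the DEPNAME level.
    F.max(F.greatest("flag_1", "flag_2")).over(w),
)
result.show()
```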