I currently have the following UsersData table, which gives aggregated historical data at a given date:

Date        UserID  Name  isActive
2021-10-01  1       Sam   1
2021-10-01  2       Dan   1
2021-10-08  1       Sam   0
2021-10-08  2       Dan   1

Requirement: My requirement is to create another aggregate that shows the active vs inactive record counts for each of the above dates –
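The excerpt above is cut off, but the aggregation it describes can be sketched in pyspark: group by Date and count rows where isActive is 1 versus 0. This is a minimal sketch, assuming the table is loaded as a DataFrame; the name users_df and the inline sample are placeholders, not the original poster's code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the table above (users_df is a placeholder name)
users_df = spark.createDataFrame(
    [("2021-10-01", 1, "Sam", 1), ("2021-10-01", 2, "Dan", 1),
     ("2021-10-08", 1, "Sam", 0), ("2021-10-08", 2, "Dan", 1)],
    ["Date", "UserID", "Name", "isActive"],
)

# Count active vs inactive records per date
summary = users_df.groupBy("Date").agg(
    F.sum(F.when(F.col("isActive") == 1, 1).otherwise(0)).alias("active"),
    F.sum(F.when(F.col("isActive") == 0, 1).otherwise(0)).alias("inactive"),
)
summary.orderBy("Date").show()
```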
Spark: How to transpose and explode columns with dynamic nested arrays
I applied an algorithm from the question Spark: How to transpose and explode columns with nested arrays to transpose and explode a nested Spark dataframe with dynamic arrays. I have added to the dataframe """{"id":3, "c":[{"date":3, "val":3, "val_dynamic":3}]}""", with a new column c, where the array has a new val_dynamic field that can appear on a random basis. I'm looking for required output 2 (Transpose and
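The required output is not shown in full here, but the usual way to keep such a transpose robust to dynamically appearing struct fields is to explode the array column and then flatten the struct with c.*, so any new field (such as val_dynamic) comes through without being listed explicitly. A minimal sketch under that assumption:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Two illustrative records: only the second one carries val_dynamic
df = spark.read.json(spark.sparkContext.parallelize([
    '{"id": 1, "c": [{"date": 1, "val": 1}]}',
    '{"id": 3, "c": [{"date": 3, "val": 3, "val_dynamic": 3}]}',
]))

# Explode the array, then flatten the struct with "c.*" so any
# dynamically appearing field (e.g. val_dynamic) is selected automatically
exploded = (df
            .select("id", F.explode("c").alias("c"))
            .select("id", "c.*"))
exploded.show()
```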
Average open bug life in days
I am looking to identify the average lifetime in days for open bugs, based on severity.

bug  severity  status           date_assigned
1    A         open             2021-9-14
1    A         in progress      2021-9-15
1    A         fixed            2021-9-16
1    A         verified         2021-9-17
1    A         closed           2021-9-18
2    B         opened           2021-10-18
2    B         in progress      2021-10-19
2    B         closed with fix  2021-10-20
3    C         open
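The excerpt cuts off before the expected output, so one plausible reading is: lifetime = days between the first assigned date and the closing date (or today, if the bug is still open), averaged per severity. The sketch below follows that assumption with a reduced sample; the column names and status values come from the table above, everything else is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Reduced sample; the real table has one row per status change
bugs = spark.createDataFrame(
    [(1, "A", "open", "2021-09-14"),
     (1, "A", "closed", "2021-09-18"),
     (2, "B", "opened", "2021-10-18"),
     (2, "B", "closed with fix", "2021-10-20"),
     (3, "C", "open", "2021-11-01")],
    ["bug", "severity", "status", "date_assigned"],
).withColumn("date_assigned", F.to_date("date_assigned"))

# Lifetime per bug: first assigned date up to the closing date,
# or up to today if the bug has not been closed yet
per_bug = bugs.groupBy("bug", "severity").agg(
    F.min("date_assigned").alias("opened_on"),
    F.max(F.when(F.col("status").startswith("closed"),
                 F.col("date_assigned"))).alias("closed_on"),
)
per_bug = per_bug.withColumn(
    "life_days",
    F.datediff(F.coalesce("closed_on", F.current_date()), "opened_on"),
)

# Average lifetime in days per severity
per_bug.groupBy("severity").agg(F.avg("life_days").alias("avg_life_days")).show()
```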
is there a method to connect to postgresql (DBeaver) from pyspark?
Hello, I installed pyspark and I have a local Postgres database in DBeaver: how can I connect to Postgres from pyspark, please? I tried this but I get an error. Answer: You need to add the jars you want to use when creating the SparkSession. See this: https://spark.apache.org/docs/2.4.7/submitting-applications.html#advanced-dependency-management Either when you start pyspark or when you
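The point of the linked answer is shipping the Postgres JDBC driver with the session. A minimal pyspark sketch, assuming a local Postgres on the default port; the database name, table, credentials, and driver version are placeholders:

```python
from pyspark.sql import SparkSession

# Connection details below are placeholders; adjust the URL, table, user,
# password, and driver version for your local Postgres
spark = (SparkSession.builder
         .appName("postgres-example")
         # Pulls the Postgres JDBC driver at startup (version is an assumption)
         .config("spark.jars.packages", "org.postgresql:postgresql:42.2.24")
         .getOrCreate())

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "postgres")
      .option("password", "secret")
      .option("driver", "org.postgresql.Driver")
      .load())

df.show()
```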
create rows from columns in an Apache Spark dataset
I'm trying, from a dataset, to create a row from existing columns. Here is my case:

InputDataset
accountid           payingaccountid     billedaccountid     startdate                   enddate
0011t00000MY1U3AAL  0011t00000MY1U3XXX  0011t00000ZZ1U3AAL  2020-06-10 00:00:00.000000  NULL

And I would like to have something like this:

accountid           startdate                   enddate
0011t00000MY1U3AAL  2021-06-10 00:00:00.000000  NULL
0011t00000MY1U3XXX  2021-06-10 00:00:00.000000  NULL
0011t00000ZZ1U3AAL  2021-06-10 00:00:00.000000  NULL

In the input dataset the columns billedaccountid and
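This is a classic unpivot of the three id columns. One way to sketch it in pyspark is to explode an array built from those columns; the DataFrame name and the inline sample are illustrative, not the original poster's code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Single illustrative row matching the example; the schema is given
# explicitly because enddate is NULL
input_ds = spark.createDataFrame(
    [("0011t00000MY1U3AAL", "0011t00000MY1U3XXX", "0011t00000ZZ1U3AAL",
      "2020-06-10 00:00:00.000000", None)],
    schema="accountid string, payingaccountid string, billedaccountid string, "
           "startdate string, enddate string",
)

# Unpivot: one output row per id column, keeping startdate and enddate
rows = input_ds.select(
    F.explode(F.array("accountid", "payingaccountid", "billedaccountid")).alias("accountid"),
    "startdate",
    "enddate",
)
rows.show(truncate=False)
```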
Pyspark: How to flatten nested arrays by merging values in spark
I have 10000 JSONs with different ids, and each has 10000 names. How do I flatten nested arrays by merging values by int or str in pyspark? EDIT: I have added the column name_10000_xvz to better explain the data structure. I have updated the Notes, input df, required output df, and input json files as well. Notes: The input dataframe has more than 10000 columns name_1_a,
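The excerpt is truncated, so the exact schema is not visible. As a rough sketch, assuming an id column plus many array-typed columns whose names start with name_, one common pattern is to melt the wide columns into rows dynamically and then explode the arrays; the tiny two-column sample below stands in for the 10000 real columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical narrow example standing in for the 10000 name_* columns
df = spark.createDataFrame(
    [(1, [1, 2], [3]), (2, [4], [5, 6])],
    "id int, name_1_a array<int>, name_2_b array<int>",
)

# Collect every name_* column dynamically
name_cols = [c for c in df.columns if c.startswith("name_")]

# Melt wide -> long: one row per (id, column name, array), then explode the array
melted = df.select(
    "id",
    F.explode(
        F.array(*[F.struct(F.lit(c).alias("name"), F.col(c).alias("values"))
                  for c in name_cols])
    ).alias("kv"),
).select("id", "kv.name", F.explode("kv.values").alias("value"))

melted.show()
```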
How to execute custom logic at pyspark window partition
I have a dataframe in the format shown below, where there will be multiple entries per DEPNAME. My requirement is to set result = Y at the DEPNAME level if either flag_1 or flag_2 is Y; if both flags, i.e. flag_1 and flag_2, are N, the result is set to N, as shown for DEPNAME=personnel
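One way to express this without custom UDF logic is a window aggregate over the DEPNAME partition: take the max of a per-row Y/N expression, so a single Y anywhere in the partition wins. A minimal sketch with made-up sample rows:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample: result should be Y for "sales" (one flag is Y)
# and N for "personnel" (all flags are N)
df = spark.createDataFrame(
    [("sales", "Y", "N"), ("sales", "N", "N"),
     ("personnel", "N", "N"), ("personnel", "N", "N")],
    ["DEPNAME", "flag_1", "flag_2"],
)

# If any row in the DEPNAME partition has flag_1 = Y or flag_2 = Y,
# the whole partition gets result = Y, otherwise N
w = Window.partitionBy("DEPNAME")
result = df.withColumn(
    "result",
    F.max(
        F.when((F.col("flag_1") == "Y") | (F.col("flag_2") == "Y"), "Y").otherwise("N")
    ).over(w),
)
result.show()
```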
How to use LIMIT to sample rows dynamically
I have a table as follows:

SampleReq  Group  ID
2          1      _001
2          1      _002
2          1      _003
1          2      _004
1          2      _005
1          2      _006

I want my query to pick IDs based on the column SampleReq, resulting in the following output:

Group  ID
1      _001
1      _003
2      _006

The query should pick any 2 IDs from group
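Plain LIMIT cannot vary per group, so a common substitute is row_number() over a randomly ordered partition, filtered against SampleReq. The sketch below shows that pattern in pyspark rather than the literal LIMIT the title asks about:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(2, 1, "_001"), (2, 1, "_002"), (2, 1, "_003"),
     (1, 2, "_004"), (1, 2, "_005"), (1, 2, "_006")],
    ["SampleReq", "Group", "ID"],
)

# Number the rows within each Group (rand() makes the pick arbitrary),
# then keep only as many rows as SampleReq asks for
w = Window.partitionBy("Group").orderBy(F.rand())
sampled = (df
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") <= F.col("SampleReq"))
           .select("Group", "ID"))
sampled.show()
```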
Spark SQL: keep a non-key row after join
I have two datasets as follows: and: I want to join the two datasets so that I can get the ingredient information for each smoothie whose price is lower than $15, but keep those rows even if the price is higher, filling in the ingredient field with the string "To be communicated". I tried smoothieDs.join(ingredientDs).filter(col(price).lt(15)) and it gives: But my expected
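One way to get this behaviour is to push the price condition into a left join and then coalesce the missing ingredient with the placeholder string. The excerpt uses Scala-style Datasets; the sketch below shows the same idea in pyspark, with invented column names and sample rows:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the two datasets (names and columns assumed)
smoothie_ds = spark.createDataFrame(
    [(1, "Berry Blast", 10), (2, "Tropical Deluxe", 20)],
    ["smoothie_id", "smoothie", "price"],
)
ingredient_ds = spark.createDataFrame(
    [(1, "strawberry"), (1, "blueberry")],
    ["smoothie_id", "ingredient"],
)

# Join only the cheap smoothies against ingredients; the left join keeps
# the expensive ones with a null ingredient, which is then filled in
joined = (smoothie_ds
          .join(ingredient_ds,
                on=(smoothie_ds.smoothie_id == ingredient_ds.smoothie_id)
                   & (smoothie_ds.price < 15),
                how="left")
          .select(smoothie_ds.smoothie_id, "smoothie", "price",
                  F.coalesce("ingredient", F.lit("To be communicated")).alias("ingredient")))
joined.show()
```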
Facing issue while writing SQL in pyspark
I am trying to convert the SQL code below to pyspark. Can someone please help me? Here, util, count, and procs are column names. While coding in pyspark, I can create a new column 'col' like this: Answer: You can use when for doing the equivalent of UPDATE:
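The original SQL and the rest of the answer are cut off here, so the condition and expression below are placeholders; the point is only the when/otherwise pattern that replaces a conditional UPDATE:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data using the column names mentioned in the question
df = spark.createDataFrame(
    [(90, 4, 2), (50, 8, 8)],
    ["util", "count", "procs"],
)

# Equivalent of a conditional SQL UPDATE: recompute 'col' where the
# (placeholder) condition holds, otherwise keep the fallback value
df = df.withColumn(
    "col",
    F.when(F.col("util") > 80, F.col("count") * F.col("procs"))
     .otherwise(F.col("procs")),
)
df.show()
```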