I currently have the following UsersData table, which gives aggregated historical data at a given date:

Date        UserID  Name  isActive
2021-10-01  1       Sam   1
2021-10-01  2       Dan   1
2021-10-08  1       Sam   0
2021-10-08  2       Dan   1

Requirement: My requirement is to create another aggregate that shows the active vs inactive record counts for each of the above dates –
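The excerpt above is cut off, but the aggregation it describes can be sketched in pyspark: group by Date and count rows where isActive is 1 versus 0. This is a minimal sketch, assuming the table is loaded as a DataFrame; the name users_df and the inline sample are placeholders, not the original poster's code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample rows matching the table above (users_df is a placeholder name)
users_df = spark.createDataFrame(
    [("2021-10-01", 1, "Sam", 1), ("2021-10-01", 2, "Dan", 1),
     ("2021-10-08", 1, "Sam", 0), ("2021-10-08", 2, "Dan", 1)],
    ["Date", "UserID", "Name", "isActive"],
)

# Count active vs inactive records per date
summary = users_df.groupBy("Date").agg(
    F.sum(F.when(F.col("isActive") == 1, 1).otherwise(0)).alias("active"),
    F.sum(F.when(F.col("isActive") == 0, 1).otherwise(0)).alias("inactive"),
)
summary.orderBy("Date").show()
```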
Spark: How to transpose and explode columns with dynamic nested arrays
I applied an algorithm from the question Spark: How to transpose and explode columns with nested arrays to transpose and explode a nested Spark dataframe with dynamic arrays. I have added to the dataframe """{"id":3, "c":[{"date":3, "val":3, "val_dynamic":3}]}""", with a new column c, where the array has a new val_dynamic field that can appear on a random basis. I'm looking for required output 2 (Transpose and
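The required output is not shown in full here, but the usual way to keep such a transpose robust to dynamically appearing struct fields is to explode the array column and then flatten the struct with c.*, so any new field (such as val_dynamic) comes through without being listed explicitly. A minimal sketch under that assumption:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Two illustrative records: only the second one carries val_dynamic
df = spark.read.json(spark.sparkContext.parallelize([
    '{"id": 1, "c": [{"date": 1, "val": 1}]}',
    '{"id": 3, "c": [{"date": 3, "val": 3, "val_dynamic": 3}]}',
]))

# Explode the array, then flatten the struct with "c.*" so any
# dynamically appearing field (e.g. val_dynamic) is selected automatically
exploded = (df
            .select("id", F.explode("c").alias("c"))
            .select("id", "c.*"))
exploded.show()
```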
Average open bug life in days
I am looking to identify the average lifetime in days for open bugs, based on severity.

bug  severity  status           date_assigned
1    A         open             2021-9-14
1    A         in progress      2021-9-15
1    A         fixed            2021-9-16
1    A         verified         2021-9-17
1    A         closed           2021-9-18
2    B         opened           2021-10-18
2    B         in progress      2021-10-19
2    B         closed with fix  2021-10-20
3    C         open
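The excerpt cuts off before the expected output, so one plausible reading is: lifetime = days between the first assigned date and the closing date (or today, if the bug is still open), averaged per severity. The sketch below follows that assumption with a reduced sample; the column names and status values come from the table above, everything else is illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Reduced sample; the real table has one row per status change
bugs = spark.createDataFrame(
    [(1, "A", "open", "2021-09-14"),
     (1, "A", "closed", "2021-09-18"),
     (2, "B", "opened", "2021-10-18"),
     (2, "B", "closed with fix", "2021-10-20"),
     (3, "C", "open", "2021-11-01")],
    ["bug", "severity", "status", "date_assigned"],
).withColumn("date_assigned", F.to_date("date_assigned"))

# Lifetime per bug: first assigned date up to the closing date,
# or up to today if the bug has not been closed yet
per_bug = bugs.groupBy("bug", "severity").agg(
    F.min("date_assigned").alias("opened_on"),
    F.max(F.when(F.col("status").startswith("closed"),
                 F.col("date_assigned"))).alias("closed_on"),
)
per_bug = per_bug.withColumn(
    "life_days",
    F.datediff(F.coalesce("closed_on", F.current_date()), "opened_on"),
)

# Average lifetime in days per severity
per_bug.groupBy("severity").agg(F.avg("life_days").alias("avg_life_days")).show()
```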
is there a method to connect to postgresql (DBeaver) from pyspark?
Hello, I installed pyspark and I have a local Postgres database in DBeaver: how can I connect to Postgres from pyspark, please? I tried this but I get an error. Answer: You need to add the jars you want to use when creating the SparkSession. See this: https://spark.apache.org/docs/2.4.7/submitting-applications.html#advanced-dependency-management Either when you start pyspark or when you
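The point of the linked answer is shipping the Postgres JDBC driver with the session. A minimal pyspark sketch, assuming a local Postgres on the default port; the database name, table, credentials, and driver version are placeholders:

```python
from pyspark.sql import SparkSession

# Connection details below are placeholders; adjust the URL, table, user,
# password, and driver version for your local Postgres
spark = (SparkSession.builder
         .appName("postgres-example")
         # Pulls the Postgres JDBC driver at startup (version is an assumption)
         .config("spark.jars.packages", "org.postgresql:postgresql:42.2.24")
         .getOrCreate())

df = (spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "public.my_table")
      .option("user", "postgres")
      .option("password", "secret")
      .option("driver", "org.postgresql.Driver")
      .load())

df.show()
```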
create rows from columns in an Apache Spark dataset
I'm trying, from a dataset, to create a row from existing columns. Here is my case:

InputDataset
accountid           payingaccountid     billedaccountid     startdate                   enddate
0011t00000MY1U3AAL  0011t00000MY1U3XXX  0011t00000ZZ1U3AAL  2020-06-10 00:00:00.000000  NULL

And I would like to have something like this:

accountid           startdate                   enddate
0011t00000MY1U3AAL  2021-06-10 00:00:00.000000  NULL
0011t00000MY1U3XXX  2021-06-10 00:00:00.000000  NULL
0011t00000ZZ1U3AAL  2021-06-10 00:00:00.000000  NULL

In the input dataset the columns billedaccountid and
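This is a classic unpivot of the three id columns. One way to sketch it in pyspark is to explode an array built from those columns; the DataFrame name and the inline sample are illustrative, not the original poster's code:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Single illustrative row matching the example; the schema is given
# explicitly because enddate is NULL
input_ds = spark.createDataFrame(
    [("0011t00000MY1U3AAL", "0011t00000MY1U3XXX", "0011t00000ZZ1U3AAL",
      "2020-06-10 00:00:00.000000", None)],
    schema="accountid string, payingaccountid string, billedaccountid string, "
           "startdate string, enddate string",
)

# Unpivot: one output row per id column, keeping startdate and enddate
rows = input_ds.select(
    F.explode(F.array("accountid", "payingaccountid", "billedaccountid")).alias("accountid"),
    "startdate",
    "enddate",
)
rows.show(truncate=False)
```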
Pyspark: How to flatten nested arrays by merging values in spark
I have 10000 JSONs with different ids, and each has 10000 names. How do I flatten nested arrays by merging values by int or str in pyspark? EDIT: I have added the column name_10000_xvz to better explain the data structure. I have updated the Notes, input df, required output df, and input json files as well. Notes: The input dataframe has more than 10000 columns name_1_a,
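The excerpt is truncated, so the exact schema is not visible. As a rough sketch, assuming an id column plus many array-typed columns whose names start with name_, one common pattern is to melt the wide columns into rows dynamically and then explode the arrays; the tiny two-column sample below stands in for the 10000 real columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical narrow example standing in for the 10000 name_* columns
df = spark.createDataFrame(
    [(1, [1, 2], [3]), (2, [4], [5, 6])],
    "id int, name_1_a array<int>, name_2_b array<int>",
)

# Collect every name_* column dynamically
name_cols = [c for c in df.columns if c.startswith("name_")]

# Melt wide -> long: one row per (id, column name, array), then explode the array
melted = df.select(
    "id",
    F.explode(
        F.array(*[F.struct(F.lit(c).alias("name"), F.col(c).alias("values"))
                  for c in name_cols])
    ).alias("kv"),
).select("id", "kv.name", F.explode("kv.values").alias("value"))

melted.show()
```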
How to execute custom logic at pyspark window partition
I have a dataframe in the format shown below, where there will be multiple entries per DEPNAME. My requirement is to set result = Y at the DEPNAME level if either flag_1 or flag_2 is Y; if both flags, i.e. flag_1 and flag_2, are N, the result is set to N, as shown for DEPNAME=personnel
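One way to express this without custom UDF logic is a window aggregate over the DEPNAME partition: take the max of a per-row Y/N expression, so a single Y anywhere in the partition wins. A minimal sketch with made-up sample rows:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample: result should be Y for "sales" (one flag is Y)
# and N for "personnel" (all flags are N)
df = spark.createDataFrame(
    [("sales", "Y", "N"), ("sales", "N", "N"),
     ("personnel", "N", "N"), ("personnel", "N", "N")],
    ["DEPNAME", "flag_1", "flag_2"],
)

# If any row in the DEPNAME partition has flag_1 = Y or flag_2 = Y,
# the whole partition gets result = Y, otherwise N
w = Window.partitionBy("DEPNAME")
result = df.withColumn(
    "result",
    F.max(
        F.when((F.col("flag_1") == "Y") | (F.col("flag_2") == "Y"), "Y").otherwise("N")
    ).over(w),
)
result.show()
```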
How to use LIMIT to sample rows dynamically
I have a table as follows:

SampleReq  Group  ID
2          1      _001
2          1      _002
2          1      _003
1          2      _004
1          2      _005
1          2      _006

I want my query to pick IDs based on the column SampleReq, resulting in the following output:

Group  ID
1      _001
1      _003
2      _006

The query should pick any 2 IDs from group
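Plain LIMIT cannot vary per group, so a common substitute is row_number() over a randomly ordered partition, filtered against SampleReq. The sketch below shows that pattern in pyspark rather than the literal LIMIT the title asks about:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(2, 1, "_001"), (2, 1, "_002"), (2, 1, "_003"),
     (1, 2, "_004"), (1, 2, "_005"), (1, 2, "_006")],
    ["SampleReq", "Group", "ID"],
)

# Number the rows within each Group (rand() makes the pick arbitrary),
# then keep only as many rows as SampleReq asks for
w = Window.partitionBy("Group").orderBy(F.rand())
sampled = (df
           .withColumn("rn", F.row_number().over(w))
           .filter(F.col("rn") <= F.col("SampleReq"))
           .select("Group", "ID"))
sampled.show()
```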
Spark SQL: keep a non-key row after join
I have two datasets as follows: and: I want to join the two datasets so that I can get the ingredient information for each smoothie whose price is lower than $15, but keep those rows even if the price is higher, filling in the ingredient field with the string "To be communicated". I tried smoothieDs.join(ingredientDs).filter(col(price).lt(15)) and it gives: But my expected
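One way to get this behaviour is to push the price condition into a left join and then coalesce the missing ingredient with the placeholder string. The excerpt uses Scala-style Datasets; the sketch below shows the same idea in pyspark, with invented column names and sample rows:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical stand-ins for the two datasets (names and columns assumed)
smoothie_ds = spark.createDataFrame(
    [(1, "Berry Blast", 10), (2, "Tropical Deluxe", 20)],
    ["smoothie_id", "smoothie", "price"],
)
ingredient_ds = spark.createDataFrame(
    [(1, "strawberry"), (1, "blueberry")],
    ["smoothie_id", "ingredient"],
)

# Join only the cheap smoothies against ingredients; the left join keeps
# the expensive ones with a null ingredient, which is then filled in
joined = (smoothie_ds
          .join(ingredient_ds,
                on=(smoothie_ds.smoothie_id == ingredient_ds.smoothie_id)
                   & (smoothie_ds.price < 15),
                how="left")
          .select(smoothie_ds.smoothie_id, "smoothie", "price",
                  F.coalesce("ingredient", F.lit("To be communicated")).alias("ingredient")))
joined.show()
```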
Facing issue while writing SQL in pyspark
I am trying to convert the SQL code below to pyspark. Can someone please help me? Here, util, count, and procs are column names. While coding in pyspark, I can create a new column 'col' like this: Answer: You can use when for doing the equivalent of UPDATE:
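The original SQL and the rest of the answer are cut off here, so the condition and expression below are placeholders; the point is only the when/otherwise pattern that replaces a conditional UPDATE:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data using the column names mentioned in the question
df = spark.createDataFrame(
    [(90, 4, 2), (50, 8, 8)],
    ["util", "count", "procs"],
)

# Equivalent of a conditional SQL UPDATE: recompute 'col' where the
# (placeholder) condition holds, otherwise keep the fallback value
df = df.withColumn(
    "col",
    F.when(F.col("util") > 80, F.col("count") * F.col("procs"))
     .otherwise(F.col("procs")),
)
df.show()
```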