I currently have the following UsersData table, which gives aggregated historical data as of a particular date:

Date        UserID  Name  isActive
2021-10-01  1       Sam   1
2021-10-01  2       Dan   1
2021-10-08  1       Sam   0
2021-10-08  2       Dan   1

Requirement: My requirement is to create another aggregate that shows active vs. inactive record counts for each of the above dates –
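A minimal PySpark sketch of one way to produce that summary, assuming the table is registered as UsersData and isActive is stored as 0/1 (both assumptions, since the rest of the question is cut off):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed: the table above is available as "UsersData" with isActive stored as 0/1.
users_data = spark.table("UsersData")

# Count active vs. inactive records per snapshot date.
summary = (
    users_data.groupBy("Date")
    .agg(
        F.sum("isActive").alias("active"),
        F.sum(1 - F.col("isActive")).alias("inactive"),
    )
    .orderBy("Date")
)
summary.show()
```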
QUALIFY equivalent in HIVE / SPARK SQL
I am trying to convert a Teradata SQL query into its Hive/Spark SQL equivalent. Is there any substitute for QUALIFY used together with COUNT? Answer Got it 🙂
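The original query is not shown, but the usual workaround is to move the window function into a subquery (or CTE) and filter on its alias in the outer query, since plain Spark/Hive SQL has no QUALIFY clause. A sketch with made-up table and column names (orders, customer_id):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Teradata:  SELECT * FROM orders QUALIFY COUNT(*) OVER (PARTITION BY customer_id) > 1
# Spark/Hive: compute the window function in a subquery, then filter on its alias.
dupes = spark.sql("""
    SELECT *
    FROM (
        SELECT o.*,
               COUNT(*) OVER (PARTITION BY customer_id) AS cnt
        FROM orders o
    ) t
    WHERE cnt > 1
""")
dupes.show()
```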
Spark: How to transpose and explode columns with dynamic nested arrays
I applied an algorithm from the question "Spark: How to transpose and explode columns with nested arrays" to transpose and explode a nested Spark DataFrame with dynamic arrays. I have added the record """{"id":3,"c":[{"date":3,"val":3,"val_dynamic":3}]}""" to the dataframe, with a new column c, where the array has a new val_dynamic field which can appear on a random basis. I'm looking for required output 2 (Transpose and …
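The full input and required output are cut off above, but one way to handle a field like val_dynamic that only sometimes appears is to explode the array and then select whatever struct fields the inferred schema reports. A rough sketch built only from the JSON snippet shown:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.json(spark.sparkContext.parallelize(
    ['{"id":3,"c":[{"date":3,"val":3,"val_dynamic":3}]}']
))

# Explode the array column, then select every field the struct actually has,
# so an optional field such as val_dynamic is included only when it exists.
exploded = df.select("id", F.explode("c").alias("c"))
fields = exploded.schema["c"].dataType.fieldNames()
flat = exploded.select("id", *[F.col(f"c.{f}").alias(f) for f in fields])
flat.show()
```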
Missing rows in full outer join
I am trying to count how many users are observed on each of 3 consecutive days. Each of the 3 intermediate tables (t0, t1, t2) has 2 columns: uid (unique ID) and d0 (or d1 or d2, which is 1 and indicates that the user was observed on that day). The following query produces this output from spark.sql(q).toPandas().set_index(["d0","d1","d2"]): Two …
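The query itself is missing from the excerpt, but a common cause of missing rows here is joining t1/t2 to t0 with inner or left joins; chaining full outer joins on uid and back-filling the absent flags with 0 keeps every combination. A sketch under those assumptions (how t0/t1/t2 are registered is also assumed):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

t0 = spark.table("t0")  # columns: uid, d0
t1 = spark.table("t1")  # columns: uid, d1
t2 = spark.table("t2")  # columns: uid, d2

# Full outer joins keep users that are missing from any of the three days;
# the absent flags come back as NULL, which we replace with 0 before counting.
joined = (
    t0.join(t1, "uid", "full_outer")
      .join(t2, "uid", "full_outer")
      .fillna(0, subset=["d0", "d1", "d2"])
)
counts = joined.groupBy("d0", "d1", "d2").agg(F.countDistinct("uid").alias("users"))
counts.show()
```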
Average open bug life in days
I am looking to identify the average lifetime in days of open bugs, based on severity.

bug  severity  status           date_assigned
1    A         open             2021-9-14
1    A         in progress      2021-9-15
1    A         fixed            2021-9-16
1    A         verified         2021-9-17
1    A         closed           2021-9-18
2    B         opened           2021-10-18
2    B         in progress      2021-10-19
2    B         closed with fix  2021-10-20
3    C         open             …
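One hedged reading of the requirement: take each bug's open date, its closed date if one exists (otherwise today's date), and average the difference per severity. A sketch assuming the rows are available as a bugs table and date_assigned is castable to a date:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed: the rows above are registered as a "bugs" table and date_assigned casts to DATE.
avg_life = spark.sql("""
    SELECT severity,
           AVG(DATEDIFF(COALESCE(closed_date, CURRENT_DATE), open_date)) AS avg_open_days
    FROM (
        SELECT bug,
               severity,
               MIN(CASE WHEN status IN ('open', 'opened')
                        THEN CAST(date_assigned AS DATE) END) AS open_date,
               MAX(CASE WHEN status LIKE 'closed%'
                        THEN CAST(date_assigned AS DATE) END) AS closed_date
        FROM bugs
        GROUP BY bug, severity
    ) t
    GROUP BY severity
""")
avg_life.show()
```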
Pyspark: How to flatten nested arrays by merging values in spark
I have 10000 JSONs with different ids, each of which has 10000 names. How can I flatten nested arrays by merging values by int or str in PySpark? EDIT: I have added the column name_10000_xvz to better explain the data structure. I have updated the Notes, input df, required output df and input JSON files as well. Notes: The input dataframe has more than 10000 columns name_1_a, …
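The actual input and required output are cut off above, so this is only a generic illustration of the melt-then-explode pattern often used for this kind of wide, array-valued layout; the miniature columns name_1_a and name_2_b are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up miniature of the wide layout: one id plus array-valued name_* columns.
df = spark.createDataFrame([(1, [10, 20], [30])], ["id", "name_1_a", "name_2_b"])

# Melt the wide name_* columns into (name, values) rows with stack(),
# then explode each array so every value gets its own row.
name_cols = [c for c in df.columns if c.startswith("name_")]
stack_expr = "stack({n}, {pairs}) as (name, values)".format(
    n=len(name_cols),
    pairs=", ".join(f"'{c}', {c}" for c in name_cols),
)
long_df = df.selectExpr("id", stack_expr).select(
    "id", "name", F.explode("values").alias("value")
)
long_df.show()
```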
How to use LIMIT to sample rows dynamically
I have a table as follows:

SampleReq  Group  ID
2          1      _001
2          1      _002
2          1      _003
1          2      _004
1          2      _005
1          2      _006

I want my query to sample IDs based on the column SampleReq, resulting in the following output:

Group  ID
1      _001
1      _003
2      _006

The query should pick any 2 IDs from group …
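LIMIT cannot take a per-group value, but a ROW_NUMBER window compared against SampleReq achieves the same effect. A sketch, assuming the rows live in a samples table and any IDs within a group are acceptable (hence the random ordering):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed: the rows above live in a table called "samples".
# ORDER BY rand() makes the pick arbitrary; order by ID instead for a deterministic result.
picked = spark.sql("""
    SELECT `Group`, ID
    FROM (
        SELECT `Group`,
               ID,
               SampleReq,
               ROW_NUMBER() OVER (PARTITION BY `Group` ORDER BY rand()) AS rn
        FROM samples
    ) t
    WHERE rn <= SampleReq
""")
picked.show()
```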
Filter dictionary in pyspark with key names
Given a dictionary-like column in a dataset, I want to grab the value for one key whenever the value of another key satisfies a condition. Example: Say I have a column 'statistics' in a dataset, where each data row looks like: I want to get the value of 'eye' whenever hair is 'black'. I tried: but it gives an …
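The row layout and the attempted code are cut off above, but if statistics is a MapType column, the keys can be addressed with getItem both in the filter and in the projection. A small sketch over made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up rows; statistics is inferred as a map<string,string> column.
df = spark.createDataFrame(
    [({"hair": "black", "eye": "brown"},), ({"hair": "red", "eye": "green"},)],
    ["statistics"],
)

# Keep rows where statistics['hair'] is 'black' and pull out statistics['eye'].
eyes = (
    df.filter(F.col("statistics").getItem("hair") == "black")
      .select(F.col("statistics").getItem("eye").alias("eye"))
)
eyes.show()
```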
Joining two tables with same keys but different fields
I have two tables, both with all the same fields except for one. I want to combine these two tables so that the resulting table has all the fields from both, including the two fields that are not the same in each table. I.e.: let's say I have a table order_debit with schema and a table order_credit with schema What I want is …
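The two schemas are cut off above, so as a hedged sketch: join on every column the tables share, which leaves the one field unique to each side in the combined result. The table names order_debit and order_credit come from the excerpt; how they are registered is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

order_debit = spark.table("order_debit")
order_credit = spark.table("order_credit")

# Join on every column the two tables share; the result keeps one copy of the
# shared columns plus the single field that is unique to each table.
common_cols = [c for c in order_debit.columns if c in order_credit.columns]
combined = order_debit.join(order_credit, on=common_cols, how="full_outer")
combined.printSchema()
```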
Facing issue while writing SQL in pyspark
I am trying to convert the below SQL code to PySpark. Can someone please help me? Here, util, count, and procs are column names. While coding in PySpark, I can create a new column 'col' like this: Answer You can use when for doing the equivalent of UPDATE:
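The original SQL and the attempted snippet are not shown, so the condition below is invented purely for illustration; the point is the withColumn plus when/otherwise pattern the answer refers to:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny made-up frame with the columns mentioned in the question.
df = spark.createDataFrame([(80.0, 4, 2), (50.0, 0, 2)], ["util", "count", "procs"])

# Hypothetical UPDATE: SET util = util / procs WHERE count > 0.
# when/otherwise plays the role of the UPDATE's WHERE clause; other rows keep util.
df = df.withColumn(
    "util",
    F.when(F.col("count") > 0, F.col("util") / F.col("procs"))
     .otherwise(F.col("util")),
)
df.show()
```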