I currently have the following UsersData table, which gives aggregated historical data as of a particular date:

Date        UserID  Name  isActive
2021-10-01  1       Sam   1
2021-10-01  2       Dan   1
2021-10-08  1       Sam   0
2021-10-08  2       Dan   1

Requirement: My requirement is to create another aggregate that shows active vs. inactive record counts for each of the above dates –
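A minimal PySpark sketch of one way to produce that summary, assuming the table is registered as UsersData and isActive is stored as 0/1 (both assumptions, since the rest of the question is cut off):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Assumed: the table above is available as "UsersData" with isActive stored as 0/1.
users_data = spark.table("UsersData")

# Count active vs. inactive records per snapshot date.
summary = (
    users_data.groupBy("Date")
    .agg(
        F.sum("isActive").alias("active"),
        F.sum(1 - F.col("isActive")).alias("inactive"),
    )
    .orderBy("Date")
)
summary.show()
```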
QUALIFY equivalent in HIVE / SPARK SQL
I am trying to convert a Teradata SQL query into its Hive/Spark SQL equivalent. Is there any substitute for QUALIFY used together with COUNT? Answer Got it 🙂
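The original query is not shown, but the usual workaround is to move the window function into a subquery (or CTE) and filter on its alias in the outer query, since plain Spark/Hive SQL has no QUALIFY clause. A sketch with made-up table and column names (orders, customer_id):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Teradata:  SELECT * FROM orders QUALIFY COUNT(*) OVER (PARTITION BY customer_id) > 1
# Spark/Hive: compute the window function in a subquery, then filter on its alias.
dupes = spark.sql("""
    SELECT *
    FROM (
        SELECT o.*,
               COUNT(*) OVER (PARTITION BY customer_id) AS cnt
        FROM orders o
    ) t
    WHERE cnt > 1
""")
dupes.show()
```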
Spark: How to transpose and explode columns with dynamic nested arrays
I applied an algorithm from the question "Spark: How to transpose and explode columns with nested arrays" to transpose and explode a nested Spark DataFrame with dynamic arrays. I have added the record """{"id":3,"c":[{"date":3,"val":3,"val_dynamic":3}]}""" to the dataframe, with a new column c, where the array has a new val_dynamic field which can appear on a random basis. I'm looking for required output 2 (Transpose and …
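The full input and required output are cut off above, but one way to handle a field like val_dynamic that only sometimes appears is to explode the array and then select whatever struct fields the inferred schema reports. A rough sketch built only from the JSON snippet shown:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.json(spark.sparkContext.parallelize(
    ['{"id":3,"c":[{"date":3,"val":3,"val_dynamic":3}]}']
))

# Explode the array column, then select every field the struct actually has,
# so an optional field such as val_dynamic is included only when it exists.
exploded = df.select("id", F.explode("c").alias("c"))
fields = exploded.schema["c"].dataType.fieldNames()
flat = exploded.select("id", *[F.col(f"c.{f}").alias(f) for f in fields])
flat.show()
```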
Missing rows in full outer join
I am trying to count how many users are observed on each of 3 consecutive days. Each of the 3 intermediate tables (t0, t1, t2) has 2 columns: uid (unique ID) and d0 (or d1 or d2, which is 1 and indicates that the user was observed on that day). The following query produces this output from spark.sql(q).toPandas().set_index(["d0","d1","d2"]): Two …
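The query itself is missing from the excerpt, but a common cause of missing rows here is joining t1/t2 to t0 with inner or left joins; chaining full outer joins on uid and back-filling the absent flags with 0 keeps every combination. A sketch under those assumptions (how t0/t1/t2 are registered is also assumed):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

t0 = spark.table("t0")  # columns: uid, d0
t1 = spark.table("t1")  # columns: uid, d1
t2 = spark.table("t2")  # columns: uid, d2

# Full outer joins keep users that are missing from any of the three days;
# the absent flags come back as NULL, which we replace with 0 before counting.
joined = (
    t0.join(t1, "uid", "full_outer")
      .join(t2, "uid", "full_outer")
      .fillna(0, subset=["d0", "d1", "d2"])
)
counts = joined.groupBy("d0", "d1", "d2").agg(F.countDistinct("uid").alias("users"))
counts.show()
```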
Average open bug life in days
I am looking to identify the average lifetime in days of open bugs, based on severity.

bug  severity  status           date_assigned
1    A         open             2021-9-14
1    A         in progress      2021-9-15
1    A         fixed            2021-9-16
1    A         verified         2021-9-17
1    A         closed           2021-9-18
2    B         opened           2021-10-18
2    B         in progress      2021-10-19
2    B         closed with fix  2021-10-20
3    C         open             …
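One hedged reading of the requirement: take each bug's open date, its closed date if one exists (otherwise today's date), and average the difference per severity. A sketch assuming the rows are available as a bugs table and date_assigned is castable to a date:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed: the rows above are registered as a "bugs" table and date_assigned casts to DATE.
avg_life = spark.sql("""
    SELECT severity,
           AVG(DATEDIFF(COALESCE(closed_date, CURRENT_DATE), open_date)) AS avg_open_days
    FROM (
        SELECT bug,
               severity,
               MIN(CASE WHEN status IN ('open', 'opened')
                        THEN CAST(date_assigned AS DATE) END) AS open_date,
               MAX(CASE WHEN status LIKE 'closed%'
                        THEN CAST(date_assigned AS DATE) END) AS closed_date
        FROM bugs
        GROUP BY bug, severity
    ) t
    GROUP BY severity
""")
avg_life.show()
```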
Pyspark: How to flatten nested arrays by merging values in spark
I have 10000 JSONs with different ids, each of which has 10000 names. How can I flatten nested arrays by merging values by int or str in PySpark? EDIT: I have added the column name_10000_xvz to better explain the data structure. I have updated the Notes, input df, required output df and input JSON files as well. Notes: The input dataframe has more than 10000 columns name_1_a, …
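The actual input and required output are cut off above, so this is only a generic illustration of the melt-then-explode pattern often used for this kind of wide, array-valued layout; the miniature columns name_1_a and name_2_b are made up:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up miniature of the wide layout: one id plus array-valued name_* columns.
df = spark.createDataFrame([(1, [10, 20], [30])], ["id", "name_1_a", "name_2_b"])

# Melt the wide name_* columns into (name, values) rows with stack(),
# then explode each array so every value gets its own row.
name_cols = [c for c in df.columns if c.startswith("name_")]
stack_expr = "stack({n}, {pairs}) as (name, values)".format(
    n=len(name_cols),
    pairs=", ".join(f"'{c}', {c}" for c in name_cols),
)
long_df = df.selectExpr("id", stack_expr).select(
    "id", "name", F.explode("values").alias("value")
)
long_df.show()
```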
How to use LIMIT to sample rows dynamically
I have a table as follows:

SampleReq  Group  ID
2          1      _001
2          1      _002
2          1      _003
1          2      _004
1          2      _005
1          2      _006

I want my query to sample IDs based on the column SampleReq, resulting in the following output:

Group  ID
1      _001
1      _003
2      _006

The query should pick any 2 IDs from group …
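LIMIT cannot take a per-group value, but a ROW_NUMBER window compared against SampleReq achieves the same effect. A sketch, assuming the rows live in a samples table and any IDs within a group are acceptable (hence the random ordering):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Assumed: the rows above live in a table called "samples".
# ORDER BY rand() makes the pick arbitrary; order by ID instead for a deterministic result.
picked = spark.sql("""
    SELECT `Group`, ID
    FROM (
        SELECT `Group`,
               ID,
               SampleReq,
               ROW_NUMBER() OVER (PARTITION BY `Group` ORDER BY rand()) AS rn
        FROM samples
    ) t
    WHERE rn <= SampleReq
""")
picked.show()
```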
Filter dictionary in pyspark with key names
Given a dictionary-like column in a dataset, I want to grab the value for one key whenever the value of another key satisfies a condition. Example: Say I have a column 'statistics' in a dataset, where each data row looks like: I want to get the value of 'eye' whenever hair is 'black'. I tried: but it gives an …
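The row layout and the attempted code are cut off above, but if statistics is a MapType column, the keys can be addressed with getItem both in the filter and in the projection. A small sketch over made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Made-up rows; statistics is inferred as a map<string,string> column.
df = spark.createDataFrame(
    [({"hair": "black", "eye": "brown"},), ({"hair": "red", "eye": "green"},)],
    ["statistics"],
)

# Keep rows where statistics['hair'] is 'black' and pull out statistics['eye'].
eyes = (
    df.filter(F.col("statistics").getItem("hair") == "black")
      .select(F.col("statistics").getItem("eye").alias("eye"))
)
eyes.show()
```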
Joining two tables with same keys but different fields
I have two tables, both with all the same fields except for one. I want to combine these two tables so that the resulting table has all the fields from both, including the two fields that are not the same in each table. I.e.: let's say I have a table order_debit with schema and a table order_credit with schema What I want is …
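The two schemas are cut off above, so as a hedged sketch: join on every column the tables share, which leaves the one field unique to each side in the combined result. The table names order_debit and order_credit come from the excerpt; how they are registered is assumed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

order_debit = spark.table("order_debit")
order_credit = spark.table("order_credit")

# Join on every column the two tables share; the result keeps one copy of the
# shared columns plus the single field that is unique to each table.
common_cols = [c for c in order_debit.columns if c in order_credit.columns]
combined = order_debit.join(order_credit, on=common_cols, how="full_outer")
combined.printSchema()
```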
Facing issue while writing SQL in pyspark
I am trying to convert the below SQL code to PySpark. Can someone please help me? Here, util, count, and procs are column names. While coding in PySpark, I can create a new column 'col' like this: Answer You can use when for doing the equivalent of UPDATE:
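The original SQL and the attempted snippet are not shown, so the condition below is invented purely for illustration; the point is the withColumn plus when/otherwise pattern the answer refers to:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Tiny made-up frame with the columns mentioned in the question.
df = spark.createDataFrame([(80.0, 4, 2), (50.0, 0, 2)], ["util", "count", "procs"])

# Hypothetical UPDATE: SET util = util / procs WHERE count > 0.
# when/otherwise plays the role of the UPDATE's WHERE clause; other rows keep util.
df = df.withColumn(
    "util",
    F.when(F.col("count") > 0, F.col("util") / F.col("procs"))
     .otherwise(F.col("util")),
)
df.show()
```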