
Tag: apache-spark-sql

Extract time from Date Time as a separate column

My table looks like this:

DateTime             ID
2010-12-01 08:26:00  34
2010-12-01 09:41:00  42

I want to extract the time from DateTime, create a third column from it, and then group it with frequency counts. Is there a way to do this in SQL? I'm using Apache Spark with inline SQL. I have achieved the equivalent using Spark functions…
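The excerpt cuts off before the answer, so here is a minimal runnable sketch of the idea using Python's stdlib sqlite3 (an illustration only — the table name and sample rows are assumptions; in Spark SQL the time part would typically be extracted with date_format(DateTime, 'HH:mm:ss') rather than sqlite's time()):

```python
import sqlite3

# Hypothetical rows mirroring the table in the question,
# plus one extra row so the frequency counts are non-trivial.
rows = [
    ("2010-12-01 08:26:00", 34),
    ("2010-12-01 09:41:00", 42),
    ("2010-12-02 08:26:00", 7),
]

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (DateTime TEXT, ID INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", rows)

# Extract the time part as its own column, then group by it to get
# frequency counts. In Spark SQL the analogous expression would be
# date_format(DateTime, 'HH:mm:ss').
result = con.execute(
    "SELECT time(DateTime) AS TimeOnly, COUNT(*) AS freq "
    "FROM t GROUP BY TimeOnly ORDER BY freq DESC"
).fetchall()
print(result)  # [('08:26:00', 2), ('09:41:00', 1)]
```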

Escaped single quote ignored in SELECT clause

Not sure why the escaped single quote doesn't appear in the SQL output. I initially tried this in a Jupyter notebook, but reproduced it in the PySpark shell below. The output shows Bobs home instead of Bob's home. Answer: Use a backslash, rather than another single quote, to escape the single quote. Alternatively, you can surround the string with double quotes, so that you…
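The backslash fix in the answer is Spark-specific. As a runnable illustration of quote escaping in general, here is a sketch with Python's stdlib sqlite3, which follows the standard SQL rule of doubling the quote (the Spark-only forms are shown in comments, not executed):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Standard SQL escapes a single quote inside a string literal by
# doubling it: '' becomes a literal '.
row = con.execute("SELECT 'Bob''s home'").fetchone()
print(row[0])  # Bob's home

# Spark SQL additionally accepts a backslash escape, or a
# double-quoted string literal, e.g.:
#   SELECT 'Bob\'s home'
#   SELECT "Bob's home"
```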

Change null to empty array in databricks SQL?

I have a JSON column whose value is sometimes entirely null in an Azure Databricks table. The full process to get to JSON_TABLE is: read the parquet, infer the schema of the JSON column, convert the column from a JSON string to a deeply nested structure, and explode any arrays within it. I am working in SQL with Python-defined UDFs (json_exists() checks the schema to…
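One common Spark SQL approach for this is coalesce(col, array()), which substitutes an empty array whenever the column is null. As a rough stdlib-Python sketch of that coalescing idea (the sample values are assumptions, and this is an analogy, not Databricks code):

```python
import json

# Hypothetical JSON column values: sometimes an array, sometimes null,
# and sometimes the cell itself is missing.
raw_values = ['[1, 2, 3]', 'null', None]

def as_array(raw):
    """Parse a JSON string and coalesce null (or a missing value) to [].

    Mirrors the effect of coalesce(col, array()) in Spark SQL.
    """
    if raw is None:
        return []
    parsed = json.loads(raw)
    return parsed if parsed is not None else []

print([as_array(v) for v in raw_values])  # [[1, 2, 3], [], []]
```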

How to use where clause referencing a column when querying a JSON object in another column in SQL

I have the following sales table with a nested JSON object:

sale_id                               sale_date   identities
41acdd9c-2e86-4e84-9064-28a98aadf834  2017-05-13  {"SaleIdentifiers": [{"type": "ROM", "Classifier": "CORNXP21RTN"}]}

To query the Classifier I do the following: This gives me the result:

Classifier
CORNXP21RTN

How would I go about using the sale_date column in a WHERE clause? For instance, this shows me a list of the classifiers in…
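The shape of the query — explode the SaleIdentifiers array, then filter on the sibling sale_date column — can be sketched in plain Python with the stdlib json module (the second row is a made-up example added so the filter has something to exclude; this illustrates the logic, not the actual Spark SQL answer):

```python
import json

# Hypothetical rows mirroring the sales table in the question.
sales = [
    ("41acdd9c-2e86-4e84-9064-28a98aadf834", "2017-05-13",
     '{"SaleIdentifiers": [{"type": "ROM", "Classifier": "CORNXP21RTN"}]}'),
    ("00000000-0000-0000-0000-000000000000", "2018-01-02",
     '{"SaleIdentifiers": [{"type": "ROM", "Classifier": "OTHER123"}]}'),
]

# Keep only rows matching the sale_date, then flatten each row's
# SaleIdentifiers array — the same shape as exploding the array and
# adding WHERE sale_date = '2017-05-13' in SQL.
classifiers = [
    ident["Classifier"]
    for sale_id, sale_date, identities in sales
    if sale_date == "2017-05-13"
    for ident in json.loads(identities)["SaleIdentifiers"]
]
print(classifiers)  # ['CORNXP21RTN']
```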
