I have a table with a lot of records (6+ million), but most of the rows per ID are all the same. Example:

Row  Date        ID  Col1  Col2  Col3  Col4  Col5
1    01-01-2021  1   a     b     c     d     e
2    02-01-2021  1   a     b     c     d     x
3    03-…
Tag: apache-spark-sql
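The question is cut off above, but a common way to collapse near-duplicate history like this is to keep only the rows where something actually changed per ID, using lag() over a date-ordered window. The sketch below is an assumption-laden illustration, not the asker's method: it presumes a live SparkSession named spark (as in the PySpark shell or a Databricks notebook) and a temp view hypothetically named my_table.

```python
# Hedged sketch: keep the first row per ID plus any row where the tracked
# columns changed. `my_table` and all column names are assumptions.
spark.sql("""
    SELECT Row, Date, ID, Col1, Col2, Col3, Col4, Col5
    FROM (
        SELECT *,
               LAG(struct(Col1, Col2, Col3, Col4, Col5))
                   OVER (PARTITION BY ID ORDER BY Date) AS prev
        FROM my_table
    ) t
    WHERE prev IS NULL                                  -- first row per ID
       OR prev <> struct(Col1, Col2, Col3, Col4, Col5)  -- or a value changed
""").show()
# Note: this assumes Date sorts chronologically; cast it first if it is
# stored as a dd-MM-yyyy string.
```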
Extract time from Date Time as a separate column
My table looks like this:

DateTime             ID
2010-12-01 08:26:00  34
2010-12-01 09:41:00  42

I want to extract the time from DateTime, create a third column from it, and then group it with frequency counts. Is there a way to do this in SQL? I’m using Apache Spark with inline SQL. I have achieved the equivalent using Spark functions.
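For reference, a minimal inline-SQL version of that Spark-functions approach, assuming a live SparkSession named spark and that the data is registered as a temp view (hypothetically named events):

```python
# date_format() pulls the time-of-day out of the timestamp; grouping on the
# same expression gives the frequency counts.
spark.sql("""
    SELECT date_format(DateTime, 'HH:mm:ss') AS Time,
           COUNT(*) AS Frequency
    FROM events
    GROUP BY date_format(DateTime, 'HH:mm:ss')
    ORDER BY Frequency DESC
""").show()
```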
How to group data by one column and distribute it across another column in HiveSQL?
I have the following data:

CompanyID  Department  No of People  Country
45390      HR          100           UK
45390      Service     250           UK
98712      Service     300           US
39284      Admin       142           Norway
85932      Admin       260           Germany

I want to know how many people belong to the same department across different countries. Required output:

Department  No of People  Country
HR          100           UK
Service     250           UK
            300           …
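One hedged reading of that required output is a per-department, per-country total, which is a plain two-column GROUP BY. The view name company_data is an assumption:

```python
# Backticks are needed because the column name contains spaces.
spark.sql("""
    SELECT Department,
           SUM(`No of People`) AS `No of People`,
           Country
    FROM company_data
    GROUP BY Department, Country
    ORDER BY Department, Country
""").show()
```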
Escaped single quote ignored in SELECT clause
Not sure why the escaped single quote doesn’t appear in the SQL output. I initially tried this in a Jupyter notebook, but reproduced it in the PySpark shell below. The output shows Bobs home instead of Bob’s home. Answer: Use a backslash instead of a single quote to escape a single quote. Alternatively, you can use double quotes to surround the string, so that you…
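A minimal reproduction of both fixes in the PySpark shell (note the doubled backslash in the first line, since Python consumes one level of escaping before Spark SQL ever sees the string):

```python
# Escape the quote with a backslash inside a single-quoted SQL string...
spark.sql("SELECT 'Bob\\'s home' AS address").show()   # -> Bob's home
# ...or surround the SQL string literal with double quotes instead.
spark.sql('SELECT "Bob\'s home" AS address').show()    # -> Bob's home
```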
Change null to empty array in Databricks SQL?
I have a value in a JSON column that is sometimes all null in an Azure Databricks table. The full process to get to JSON_TABLE is: read Parquet, infer the schema of the JSON column, convert the column from a JSON string to a deeply nested structure, and explode any arrays within. I am working in SQL with Python-defined UDFs (json_exists() checks the schema to…
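The usual Spark SQL idiom for the null-to-empty-array swap is COALESCE. This is a sketch with hypothetical view/column names; the type caveat in the trailing comment is what tends to bite in the deeply nested case described above:

```python
spark.sql("""
    SELECT COALESCE(items, array()) AS items   -- empty array instead of NULL
    FROM json_table
""").show()
# Caveat: COALESCE requires matching types. If `items` is an array of
# structs, build a typed empty array instead, e.g.
# COALESCE(items, from_json('[]', 'array<struct<type:string>>')).
```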
How to use a WHERE clause referencing a column when querying a JSON object in another column in SQL
I have the following sales table with a nested JSON object:

sale_id                               sale_date   identities
41acdd9c-2e86-4e84-9064-28a98aadf834  2017-05-13  {"SaleIdentifiers": [{"type": "ROM", "Classifier": "CORNXP21RTN"}]}

To query the Classifier I do the following: This gives me the result:

Classifier
CORNXP21RTN

How would I go about using the sale_date column in a WHERE clause? For instance, this shows me a list of the classifiers in…
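A hedged sketch of how the two pieces can combine: parse and explode the JSON in one query, with sale_date still available to the WHERE clause. The view name sales and the exact JSON schema are assumptions inferred from the sample row:

```python
spark.sql("""
    SELECT ident.Classifier
    FROM sales
    LATERAL VIEW explode(
        from_json(identities,
                  'struct<SaleIdentifiers:array<struct<type:string,Classifier:string>>>'
        ).SaleIdentifiers
    ) ids AS ident
    WHERE sale_date >= '2017-01-01'   -- ordinary column filter still applies
""").show()
```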
Pass a date string as a variable in Spark SQL
I am unable to pass a date string in Spark SQL. When I run this it works; however, I get an error when I want to pass the date string as a variable. I am not sure how to pass that variable. Answer: You can just add single quotes to the query and it should work for you.
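In a minimal form the fix looks like this, with the variable interpolated inside single quotes (table and column names are hypothetical):

```python
date_str = "2021-01-01"   # the date string being passed in
# The single quotes around {date_str} make it a SQL string literal.
spark.sql(f"SELECT * FROM my_table WHERE event_date = '{date_str}'").show()
```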
Similar to groupByKey() in Spark, but using SQL queries
I am trying to rewrite this using only SQL queries. It is similar to using groupByKey() in PySpark. Is there a way to do this? Answer: Just use conditional aggregation. One method is: In Postgres, this would be phrased using the standard FILTER clause:
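A small example of conditional aggregation in Spark SQL, with hypothetical table and column names (Spark 3.0+ also accepts the standard FILTER clause mentioned for Postgres):

```python
spark.sql("""
    SELECT user_id,
           SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END) AS clicks,
           SUM(CASE WHEN event = 'view'  THEN 1 ELSE 0 END) AS views
    FROM events
    GROUP BY user_id
""").show()
```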
How to Select IDs in SQL (Databricks) in which at least 2 items from a list are present
I’m working with patient-level data in Azure Databricks and I’m trying to build out a cohort of patients who have at least 2 diagnoses from a list of specific diagnosis codes. This is essentially what the table looks like: The list of ICD_CD codes of interest is something like [2500, 3850, 8888]. In this case, I would want to return…
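A hedged sketch of the usual pattern: restrict to the codes of interest, then keep patients with at least two distinct matches. Table and column names are assumptions; swap COUNT(DISTINCT ICD_CD) for COUNT(*) if repeat diagnoses of the same code should count toward the two.

```python
spark.sql("""
    SELECT patient_id
    FROM diagnoses
    WHERE ICD_CD IN (2500, 3850, 8888)
    GROUP BY patient_id
    HAVING COUNT(DISTINCT ICD_CD) >= 2
""").show()
```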
Spark SQL: filtering a table on records which appear in another table (two columns)?
I have several tables, and I would like to filter the rows in one of them based on whether two columns are present in another table. The data in each table is as follows. Table1: one hash can be associated with several articles; one article can be associated with several hashes.

User Hash  Article Name
Hash1      Article1
Hash1      …
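A common Spark SQL answer to this shape of problem is a LEFT SEMI JOIN on both columns, which keeps Table1 rows whose pair appears in the other table without duplicating them. This sketch assumes Table2 exposes the same two columns, which the truncated excerpt does not confirm:

```python
spark.sql("""
    SELECT t1.*
    FROM Table1 t1
    LEFT SEMI JOIN Table2 t2
      ON  t1.`User Hash`    = t2.`User Hash`
      AND t1.`Article Name` = t2.`Article Name`
""").show()
```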