I have a table with a lot of records (6+ million), but most of the rows per ID are all the same. Example:

Row  Date        ID  Col1  Col2  Col3  Col4  Col5
1    01-01-2021  1   a     b     c     d     e
2    02-01-2021  1   a     b     c     d     x
3    03-…
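One common approach (a minimal sketch; the column names Col1–Col5 come from the excerpt, and sorting by ID and Date is assumed) is to keep only the first row per ID plus any row whose value columns changed since the previous row. In Spark SQL this is typically done by comparing each column against lag(col) OVER (PARTITION BY ID ORDER BY Date). The same change-detection logic in plain Python:

```python
from itertools import groupby

rows = [  # rows assumed sorted by ID, then Date
    {"Row": 1, "Date": "01-01-2021", "ID": 1, "Col1": "a", "Col2": "b", "Col3": "c", "Col4": "d", "Col5": "e"},
    {"Row": 2, "Date": "02-01-2021", "ID": 1, "Col1": "a", "Col2": "b", "Col3": "c", "Col4": "d", "Col5": "x"},
    {"Row": 3, "Date": "03-01-2021", "ID": 1, "Col1": "a", "Col2": "b", "Col3": "c", "Col4": "d", "Col5": "x"},
]

VALUE_COLS = ["Col1", "Col2", "Col3", "Col4", "Col5"]

def changed_rows(rows):
    """Keep the first row per ID plus any row whose value columns differ
    from the previous row for that ID (the lag()-comparison pattern)."""
    kept = []
    for _, grp in groupby(rows, key=lambda r: r["ID"]):
        prev = None
        for r in grp:
            vals = tuple(r[c] for c in VALUE_COLS)
            if vals != prev:
                kept.append(r)
            prev = vals
    return kept

print([r["Row"] for r in changed_rows(rows)])  # rows 1 and 2 survive; row 3 is a repeat
```

On 6+ million rows the window approach stays in Spark and avoids collecting data to the driver; the Python version only illustrates the per-partition comparison.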
Tag: apache-spark
Extract time from Date Time as a separate column
My table looks like this:

DateTime             ID
2010-12-01 08:26:00  34
2010-12-01 09:41:00  42

I want to extract the time from DateTime, create a third column from it, and then group it with frequency counts. Is there a way to do this in SQL? I'm using Apache Spark with inline SQL. I have achieved the equivalent using Spark functions.
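In Spark SQL the time portion can be extracted with date_format(DateTime, 'HH:mm:ss') and grouped with COUNT(*). A plain-Python sketch of the same extract-and-count step (the third sample row is invented to make the counts non-trivial):

```python
from collections import Counter
from datetime import datetime

rows = [("2010-12-01 08:26:00", 34),
        ("2010-12-01 09:41:00", 42),
        ("2010-12-02 08:26:00", 7)]   # extra row invented for illustration

# Equivalent of: SELECT date_format(DateTime, 'HH:mm:ss') AS t, count(*)
#                FROM tbl GROUP BY t
times = [datetime.strptime(dt, "%Y-%m-%d %H:%M:%S").time().isoformat()
         for dt, _ in rows]
freq = Counter(times)
print(freq["08:26:00"])  # 2 -- two rows share the 08:26:00 time
```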
Escaped single quote ignored in SELECT clause
Not sure why the escaped single quote doesn't appear in the SQL output. I initially tried this in a Jupyter notebook, but reproduced it in the PySpark shell below. The output shows Bobs home instead of Bob's home. Answer: Use a backslash to escape the single quote: Alternatively, you can use double quotes to surround the string, so that you
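The forms mentioned in the answer can be shown side by side (a pure-Python illustration of what the query string must contain; the column alias is made up):

```python
# Backslash-escape the quote inside a single-quoted SQL literal:
q_backslash = "SELECT 'Bob\\'s home' AS place"
# Standard-SQL alternative: double the single quote:
q_doubled = "SELECT 'Bob''s home' AS place"
# Or surround the literal with double quotes (also accepted by Spark SQL):
q_dquoted = 'SELECT "Bob\'s home" AS place'

print(q_backslash)  # SELECT 'Bob\'s home' AS place
```

Note the doubled backslash in the Python source: Python's own escaping consumes one level before Spark ever sees the query.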
Change null to empty array in databricks SQL?
I have a value in a JSON column that is sometimes all null in an Azure Databricks table. The full process to get to JSON_TABLE is: read parquet, infer the schema of the JSON column, convert the column from a JSON string to a deeply nested structure, then explode any arrays within. I am working in SQL with Python-defined UDFs (json_exists() checks the schema to
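For the null-to-empty-array substitution itself, Spark SQL's coalesce is the usual tool, e.g. coalesce(json_col, array()) (the exact form may need a cast to match the array's element type). The logic reduces to this minimal Python sketch:

```python
def null_to_empty(value):
    # Mirrors SQL: coalesce(col, array()) -- replace a NULL array with an empty one
    return value if value is not None else []

print(null_to_empty(None), null_to_empty(["a", "b"]))  # [] ['a', 'b']
```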
How to use an alias in Hive?
I am trying to find unique cities using a window function, but I am not able to use an alias in this query. Answer: You cannot have a window function in the WHERE clause. Put it in a subquery and do the filtering afterwards:
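The subquery pattern looks like SELECT city FROM (SELECT city, row_number() OVER (PARTITION BY city ORDER BY city) AS rn FROM t) x WHERE rn = 1 (table and column names are assumptions, since the excerpt omits the query). The same keep-first-per-partition logic in plain Python:

```python
def unique_cities(cities):
    # Keep the first occurrence per city -- the analogue of filtering
    # rn = 1 on row_number() OVER (PARTITION BY city) in a subquery.
    seen, out = set(), []
    for city in cities:
        if city not in seen:
            seen.add(city)
            out.append(city)
    return out

print(unique_cities(["Oslo", "Lima", "Oslo"]))  # ['Oslo', 'Lima']
```

The key point is the same in both forms: the alias rn only exists to the outer query, so the filter has to live outside the subquery that computes it.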
How to use where clause referencing a column when querying a JSON object in another column in SQL
I have the following sales table with a nested JSON object:

sale_id                               sale_date   identities
41acdd9c-2e86-4e84-9064-28a98aadf834  2017-05-13  {"SaleIdentifiers": [{"type": "ROM", "Classifier": "CORNXP21RTN"}]}

To query the Classifier I do the following: This gives me the result:

Classifier
CORNXP21RTN

How would I go about using the sale_date column in a WHERE clause? For instance this shows me a list of the classifiers in
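Once identities is parsed (in Spark, via from_json or get_json_object plus explode), sale_date is an ordinary column and can sit in the WHERE clause next to the extracted Classifier. A plain-Python sketch of that filter, using the row from the excerpt:

```python
import json

rows = [
    ("41acdd9c-2e86-4e84-9064-28a98aadf834", "2017-05-13",
     '{"SaleIdentifiers": [{"type": "ROM", "Classifier": "CORNXP21RTN"}]}'),
]

def classifiers_on(rows, date):
    """Analogue of: SELECT Classifier FROM exploded WHERE sale_date = <date>."""
    out = []
    for sale_id, sale_date, identities in rows:
        if sale_date != date:          # the WHERE sale_date = ... part
            continue
        for ident in json.loads(identities)["SaleIdentifiers"]:
            out.append(ident["Classifier"])
    return out

print(classifiers_on(rows, "2017-05-13"))  # ['CORNXP21RTN']
```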
Pass date string as variable in spark sql
I am unable to pass a date string in Spark SQL. When I run this, it works. However, I get an error when I want to pass the date string as a variable. I am not sure how to pass that variable. Answer: You can just add single quotes around the variable in the query and it should work for you.
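With Python string interpolation, "adding single quotes" looks like the sketch below (the variable and table names are made up; for untrusted input, prefer DataFrame filters or parameterized queries over interpolation):

```python
date_str = "2021-05-01"

# Unquoted, Spark would parse 2021-05-01 as integer arithmetic (2021 - 5 - 1);
# wrapping the interpolated variable in single quotes makes it a string/date literal.
query = f"SELECT * FROM sales WHERE sale_date = '{date_str}'"
print(query)  # SELECT * FROM sales WHERE sale_date = '2021-05-01'
```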
Count number of weeks, days and months from a certain date in PySpark
So, I have a DataFrame of this type: And I want to create multiple columns containing, for each line, the current day, week, month and year from a certain date (simply a year, like 2020 for 2020-01-01). At first I thought of using something like this line of code, but unfortunately this wouldn't work correctly (except for year and month), since my
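In Spark SQL, datediff gives the day count, weeks follow as floor(days / 7), and months and years are calendar-field arithmetic; mixing the two kinds of arithmetic is exactly why a single expression only behaves for year and month. The same computations in plain Python (the 2020-01-01 origin is taken from the excerpt's example):

```python
from datetime import date

def elapsed(d, origin):
    """Days, whole weeks, calendar months, and calendar years since origin --
    the same arithmetic as datediff / floor(datediff / 7) / month-field math."""
    days = (d - origin).days
    weeks = days // 7
    months = (d.year - origin.year) * 12 + (d.month - origin.month)
    years = d.year - origin.year
    return days, weeks, months, years

print(elapsed(date(2020, 3, 15), date(2020, 1, 1)))  # (74, 10, 2, 0)
```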
How to Select IDs in SQL (Databricks) in which at least 2 items from a list are present
I’m working with patient-level data in Azure Databricks and I’m trying to build out a cohort of patients that have at least 2 diagnoses from a list of specific diagnosis codes. This is essentially what the table looks like: The list of ICD_CD codes of interest is something like [2500, 3850, 8888]. In this case, I would want to return
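In SQL this is a GROUP BY with HAVING COUNT(DISTINCT ICD_CD) >= 2 restricted to the code list, e.g. SELECT PATIENT_ID FROM dx WHERE ICD_CD IN (2500, 3850, 8888) GROUP BY PATIENT_ID HAVING COUNT(DISTINCT ICD_CD) >= 2 (PATIENT_ID and the sample rows below are assumptions; the excerpt's table is truncated). The same logic in plain Python:

```python
CODES = {2500, 3850, 8888}  # ICD_CD codes of interest, from the excerpt

rows = [  # (patient_id, icd_cd) pairs -- illustrative data only
    (1, 2500), (1, 3850), (1, 2500),
    (2, 2500),
    (3, 8888), (3, 3850),
]

def cohort(rows, codes, min_distinct=2):
    """Patients with at least `min_distinct` DISTINCT codes from `codes` --
    the analogue of GROUP BY ... HAVING COUNT(DISTINCT ICD_CD) >= 2."""
    per_patient = {}
    for pid, code in rows:
        if code in codes:               # the WHERE ICD_CD IN (...) part
            per_patient.setdefault(pid, set()).add(code)
    return sorted(pid for pid, s in per_patient.items() if len(s) >= min_distinct)

print(cohort(rows, CODES))  # [1, 3] -- patient 2 has only one matching code
```

Using a set per patient matters: COUNT(DISTINCT ...) ignores repeat diagnoses of the same code, as patient 1's duplicate 2500 shows.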
Spark SQL: filtering a table on records which appear in another table (two columns)?
I have several tables, and I would like to filter the rows in one of them based on whether two columns are present in another table. The data in each table is as follows. Table1: one hash can be associated with several articles; one article can be associated with several hashes.

User Hash  Article Name
Hash1      Article1
Hash1
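Checking that a pair of columns appears in another table is a two-column left semi join in Spark SQL (LEFT SEMI JOIN on both columns, or WHERE (hash, article) IN (SELECT hash, article FROM other)). In Python terms it is membership of a tuple in a set (the sample pairs below are invented around the excerpt's Hash1/Article1):

```python
table1 = [("Hash1", "Article1"), ("Hash1", "Article2"), ("Hash2", "Article1")]
table2 = {("Hash1", "Article1")}  # (hash, article) pairs that must exist

# Analogue of: SELECT * FROM table1 LEFT SEMI JOIN table2
#              ON table1.hash = table2.hash AND table1.article = table2.article
kept = [row for row in table1 if row in table2]
print(kept)  # [('Hash1', 'Article1')]
```

A semi join only filters; it never duplicates table1 rows even when table2 holds the same pair more than once, which is why it fits this use case better than an inner join.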