I have a table as follows:

SampleReq  Group  ID
2          1      _001
2          1      _002
2          1      _003
1          2      _004
1          2      _005
1          2      _006

I want my query to pick IDs based on the column SampleReq, resulting in the following output:

Group  ID
1      _001
1      _003
2      _006

The query should pick any 2 IDs from group
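The usual pattern for "pick the first N rows per group, where N is a column" is ROW_NUMBER() per group. A minimal sketch, with sqlite3 standing in for Spark SQL (the window syntax is the same); the table name is assumed, and `Group` is renamed `GroupId` here to avoid the reserved word:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (SampleReq INT, GroupId INT, ID TEXT)")
con.executemany("INSERT INTO t VALUES (?,?,?)", [
    (2, 1, "_001"), (2, 1, "_002"), (2, 1, "_003"),
    (1, 2, "_004"), (1, 2, "_005"), (1, 2, "_006"),
])

# Number the rows within each group, then keep only as many as SampleReq asks for.
rows = con.execute("""
    SELECT GroupId, ID FROM (
        SELECT GroupId, ID, SampleReq,
               ROW_NUMBER() OVER (PARTITION BY GroupId ORDER BY ID) AS rn
        FROM t
    ) WHERE rn <= SampleReq
    ORDER BY GroupId, ID
""").fetchall()
print(rows)  # [(1, '_001'), (1, '_002'), (2, '_004')]
```

Since the question says "any 2 IDs", the ORDER BY inside the window is arbitrary; this ordering yields _001/_002 for group 1 rather than the _001/_003 shown, which is equally valid.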
Tag: apache-spark-sql
Filter dictionary in pyspark with key names
Given a dictionary-like column in a dataset, I want to grab the value of one key given that the value of another key is satisfied. Example: say I have a column ‘statistics’ in a dataset, where each data row looks as: I want to get the value of ‘eye’ whenever hair is ‘black’. I tried: but it gives an
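A plain-Python illustration of the lookup logic; in PySpark the equivalent (assuming `statistics` is a MapType column) would be `df.filter(F.col("statistics")["hair"] == "black").select(F.col("statistics")["eye"])`:

```python
# Rows with a dict-valued "statistics" field (sample data assumed).
rows = [
    {"statistics": {"hair": "black", "eye": "brown"}},
    {"statistics": {"hair": "blond", "eye": "blue"}},
    {"statistics": {"hair": "black", "eye": "green"}},
]

# Keep the value of 'eye' wherever 'hair' is 'black'.
eyes = [r["statistics"]["eye"] for r in rows
        if r["statistics"].get("hair") == "black"]
print(eyes)  # ['brown', 'green']
```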
Joining two tables with same keys but different fields
I have two tables, both with the same fields except for one. I want to combine these two tables, with the resulting table having all the fields from both, including the two fields that differ between the tables. I.e., let's say I have table order_debit with schema and table order_credit with schema What I want is
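One standard approach is to pad each side with a NULL for the column it lacks and UNION ALL; a sketch with sqlite3 standing in for Spark SQL, with all schemas and column names assumed since the question's schemas are elided. In PySpark (3.1+), `order_debit_df.unionByName(order_credit_df, allowMissingColumns=True)` does the padding for you:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schemas: shared columns plus one field unique to each table.
con.execute("CREATE TABLE order_debit  (order_id INT, amount REAL, debit_ref TEXT)")
con.execute("CREATE TABLE order_credit (order_id INT, amount REAL, credit_ref TEXT)")
con.execute("INSERT INTO order_debit  VALUES (1, 10.0, 'D1')")
con.execute("INSERT INTO order_credit VALUES (2, 20.0, 'C1')")

# Pad each SELECT with a NULL for the column that side lacks, then stack them.
rows = con.execute("""
    SELECT order_id, amount, debit_ref, NULL AS credit_ref FROM order_debit
    UNION ALL
    SELECT order_id, amount, NULL AS debit_ref, credit_ref FROM order_credit
    ORDER BY order_id
""").fetchall()
print(rows)  # [(1, 10.0, 'D1', None), (2, 20.0, None, 'C1')]
```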
Only show rows in a table if something changed in previous row
I have a table with a lot of records (6+ million) but most of the rows per ID are all the same. Example:

Row  Date        ID  Col1  Col2  Col3  Col4  Col5
1    01-01-2021  1   a     b     c     d     e
2    02-01-2021  1   a     b     c     d     x
3    03-…
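The typical approach is to compare each row with the previous row for the same ID via LAG() and keep only the rows where something changed. A sketch with sqlite3 standing in for Spark SQL (identical window syntax); the third row's data is assumed, since the excerpt is truncated, and `Row` is renamed `RowNo` to avoid the keyword:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE t (RowNo INT, Date TEXT, ID INT,
                               Col1 TEXT, Col2 TEXT, Col3 TEXT, Col4 TEXT, Col5 TEXT)""")
con.executemany("INSERT INTO t VALUES (?,?,?,?,?,?,?,?)", [
    (1, "01-01-2021", 1, "a", "b", "c", "d", "e"),
    (2, "02-01-2021", 1, "a", "b", "c", "d", "x"),
    (3, "03-01-2021", 1, "a", "b", "c", "d", "x"),  # assumed: repeats row 2
])

# Concatenate the payload columns, LAG() them per ID, and keep the first row
# per ID plus any row that differs from its predecessor.
rows = con.execute("""
    SELECT RowNo FROM (
        SELECT RowNo,
               Col1 || '|' || Col2 || '|' || Col3 || '|' || Col4 || '|' || Col5 AS cur,
               LAG(Col1 || '|' || Col2 || '|' || Col3 || '|' || Col4 || '|' || Col5)
                   OVER (PARTITION BY ID ORDER BY Date) AS prev
        FROM t
    ) WHERE prev IS NULL OR cur <> prev
    ORDER BY RowNo
""").fetchall()
print(rows)  # [(1,), (2,)] -- row 3 is an unchanged repeat of row 2
```

Note the concatenation trick assumes no NULLs in the compared columns; with NULLs you would compare column-by-column or coalesce first.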
How to group by data in one column and distribute it in another column in HiveSQL?
I have the following data:

CompanyID  Department  No of People  Country
45390      HR          100           UK
45390      Service     250           UK
98712      Service     300           US
39284      Admin       142           Norway
85932      Admin       260           Germany

I wish to know how many people belong to the same department from different countries. Required output:

Department  No of People  Country
HR          100           UK
Service     250           UK
Service     300
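The required output is one row per (department, country) with the total head count, i.e. a GROUP BY on both columns. A sketch with sqlite3 standing in for HiveSQL (the GROUP BY is identical); `No of People` is renamed `NoOfPeople` for a valid identifier:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (CompanyID INT, Department TEXT, NoOfPeople INT, Country TEXT)")
con.executemany("INSERT INTO t VALUES (?,?,?,?)", [
    (45390, "HR", 100, "UK"), (45390, "Service", 250, "UK"),
    (98712, "Service", 300, "US"), (39284, "Admin", 142, "Norway"),
    (85932, "Admin", 260, "Germany"),
])

# One row per (department, country) with the summed head count.
rows = con.execute("""
    SELECT Department, SUM(NoOfPeople) AS People, Country
    FROM t
    GROUP BY Department, Country
    ORDER BY Department, Country
""").fetchall()
for r in rows:
    print(r)
```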
Change null to empty array in databricks SQL?
I have a value in a JSON column that is sometimes all null in an Azure Databricks table. The full process to get to JSON_TABLE is: read parquet, infer schema of JSON column, convert the column from JSON string to deeply nested structure, explode any arrays within. I am working in SQL with python-defined UDFs (json_exists() checks the schema to
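In Spark SQL the usual fix for this is `coalesce(json_col, array())`, which substitutes an empty array wherever the column is NULL. A plain-Python analogue of that substitution (the row data here is made up for illustration):

```python
# Rows whose array-valued field is sometimes None (NULL).
rows = [{"vals": [1, 2]}, {"vals": None}, {"vals": []}]

# coalesce(vals, array()): replace None with an empty list, leave the rest alone.
fixed = [r["vals"] if r["vals"] is not None else [] for r in rows]
print(fixed)  # [[1, 2], [], []]
```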
How to Select IDs in SQL (Databricks) in which at least 2 items from a list are present
I’m working with patient-level data in Azure Databricks and I’m trying to build out a cohort of patients that have at least 2 diagnoses from a list of specific diagnosis codes. This is essentially what the table looks like: The list of ICD_CD codes of interest is something like [2500, 3850, 8888]. In this case, I would want to return
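The standard pattern is to filter to the codes of interest, group by patient, and keep groups with at least 2 distinct codes via HAVING. A sketch with sqlite3 standing in for Databricks SQL; the patient table's name and columns are assumed, since the question's table is elided:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dx (patient_id INT, ICD_CD INT)")
con.executemany("INSERT INTO dx VALUES (?,?)", [
    (1, 2500), (1, 3850), (1, 2500),  # patient 1: two distinct codes of interest
    (2, 2500),                        # patient 2: only one code of interest
    (3, 9999), (3, 8888),             # patient 3: only one code from the list
])

# Keep patients with >= 2 DISTINCT codes from the list (repeats don't count twice).
rows = con.execute("""
    SELECT patient_id
    FROM dx
    WHERE ICD_CD IN (2500, 3850, 8888)
    GROUP BY patient_id
    HAVING COUNT(DISTINCT ICD_CD) >= 2
""").fetchall()
print(rows)  # [(1,)]
```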
Spark SQL : filtering a table on records which appear in another table (two columns)?
I have several tables, and I would like to filter the rows in one of them based on whether two columns are present in another table. The data in each table is as follows. Table1 (one hash can be associated with several articles; one article can be associated with several hashes):

User Hash  Article Name
Hash1      Article1
Hash1
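Filtering on a pair of columns appearing in another table is a semi join; Spark SQL supports it directly as `LEFT SEMI JOIN` (or `df1.join(df2, ["Hash", "Article"], "left_semi")` in PySpark). A sketch using the equivalent EXISTS form, with sqlite3 standing in for Spark SQL and both tables' contents assumed:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t1 (Hash TEXT, Article TEXT)")
con.execute("CREATE TABLE t2 (Hash TEXT, Article TEXT)")
con.executemany("INSERT INTO t1 VALUES (?,?)",
                [("Hash1", "Article1"), ("Hash1", "Article2"), ("Hash2", "Article1")])
con.executemany("INSERT INTO t2 VALUES (?,?)", [("Hash1", "Article1")])

# Keep t1 rows whose (Hash, Article) pair appears in t2 -- a semi join:
# only the matching pair survives, never duplicated by multiple t2 matches.
rows = con.execute("""
    SELECT t1.* FROM t1
    WHERE EXISTS (SELECT 1 FROM t2
                  WHERE t2.Hash = t1.Hash AND t2.Article = t1.Article)
""").fetchall()
print(rows)  # [('Hash1', 'Article1')]
```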
How can I select a column where another column contains a specific value
I have a pyspark data frame. How can I select values from one column where another column contains a specific value? Suppose I have n columns; for 2 columns A and B I have:

A  B
a  b
a  c
d  f

I want all of column B. …
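This is a filter on A followed by a select of B; in PySpark, `df.filter(F.col("A") == "a").select("B")`. The same logic in SQL, sketched with sqlite3 and the two-column sample above (the filter value "a" is assumed from the sample):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE df (A TEXT, B TEXT)")
con.executemany("INSERT INTO df VALUES (?,?)", [("a", "b"), ("a", "c"), ("d", "f")])

# Select B for the rows where A holds the wanted value.
rows = con.execute("SELECT B FROM df WHERE A = 'a' ORDER BY B").fetchall()
print(rows)  # [('b',), ('c',)]
```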
Converting query from SQL to pyspark
I am trying to convert the following SQL query into pyspark: The code I have in PySpark right now is this: However, this is simply returning the number of rows in the “data” dataframe, and I know this isn’t correct. I am very new to PySpark; can anyone help me solve this? Answer You need to collect the result into
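The original query is elided, but the symptom (getting a row count instead of the query's value) usually means the aggregate's single result row was counted rather than read. In PySpark the value is extracted with `spark.sql(...).collect()[0][0]` (or `.first()[0]`), whereas `.count()` on the result just counts its rows. A generic sketch of the distinction using sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE data (x INT)")
con.executemany("INSERT INTO data VALUES (?)", [(1,), (2,), (3,)])

# An aggregate query returns ONE row holding the value; counting the
# result's rows always gives 1, not the aggregate itself.
result_rows = con.execute("SELECT SUM(x) FROM data").fetchall()
n_result_rows = len(result_rows)  # analogous to .count() on the result
value = result_rows[0][0]         # analogous to .collect()[0][0]
print(n_result_rows, value)       # 1 6
```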