I have a table which requires filtering based on the dates.
| Group | Account || Values | Date_ingested | | -------- | -------- || -------- | -------- | | X | 3000 || 0 | 2023-01-07 | | Y | 3000 || null | 2021-02-22 |
The goal is to select the latest date when there is multiple data points like in the example above. The account 3000 in the dataframe occurs under two Groups but the up-to-date and correct result should only reflect the group X because it was ingested into Databricks very recently. Now, if I try to use the code below with grouping the code gets executed but the max function is ignored and in the results I get two results for account 3000 with group X and then Y.
Select Group, Account, Values, max(Date_ingested) from datatableX
If I choose to use the code without grouping, I get the following error
Error in SQL statement: AnalysisException: grouping expressions sequence is empty, and ‘datatableX.Account’ is not an aggregate function. Wrap ‘(max(spark_catalog.datatableX.Date_ingested) AS`max(Date_ingested))’ in windowing function(s) or wrap ‘spark_catalog.datatableX.Account’ in first() (or first_value) if you don’t care which value you get.
I can’t, however, figure out a way to do the above. Tried reading about the aggreate functions but I can’t grasp the concept.
Select Group, Account, Values, max(Date_ingested) from datatableX
or
Select Group, Account, Values, max(Date_ingested) from datatableX group by Group, Account, Values
Advertisement
Answer
You want the entire latest record per account, which suggests filtering rather than aggregation.
A typical approach uses rank()
to enumerate records having the same account by descending date of ingestion, then filters on the top-record per group in the outer query:
select * from ( select d.*, row_number() over(partition by account order by date_ingested desc) rn from datatableX ) d where rn = 1