Skip to content
Advertisement

How to get observation frequency counts from multiple dataset into one table?

I have a bunch of large datasets.

DS_1 (contains all unique IDs and names):

ID   Name
1    Apple
2    Banana
3    Cherry

DS_2:

ID   Observation
1    Apple detail
1    Apple detail
1    Apple detail
2    Banana detail
2    Banana detail
3    Cherry detail
3    Cherry detail
3    Cherry detail

DS_3:

ID   Observation
2    Banana detail
2    Banana detail
3    Cherry detail

I’m looking to create a new dataset that shows frequency counts across the datasets (and lastly calculates Total_Obs). I would output something like this:

ID   Name      DS_2    DS_3   Total_Obs
1    Apple     3       0      3
2    Banana    2       2      4
3    Cherry    3       1      4

The datasets are fairly large. Is there a more efficient way to do this apart from concatenating the datasets and doing a frequency table? Or having to creating a bunch of sorted frequency tables, then merging by ID across all the datasets?

Advertisement

Answer

You can do below –

Select t1.id As id
      ,t1.name As name
      ,coalesce(DS_2_obs,0) as DS_2_obs
      ,coalesce(DS_3_obs,0) as DS_3_obs
      ,coalesce(DS_2_obs,0) + coalesce(DS_3_obs,0) As Total_Obs
from DS_1 t1
left join (Select id, count(1) as DS_2_obs from DS_2 group by id) t2
on t1.id = t2.id
left join (Select id, count(1) as DS_3_obs from DS_3 group by id) t3
on t1.id = t3.id;

Also, you should always tag which Database you are using.

If the above SQL takes a lot of time, instead of t2 and t3 as inline queries you can consider creating aggregate observation tables with frequency/counts and having an index on id. That way when you join the observation aggregates with the primary table, the join can be faster based on indexes.

User contributions licensed under: CC BY-SA
4 People found this is helpful
Advertisement