I have a bunch of large datasets.
DS_1 (contains all unique IDs and names):
ID  Name
1   Apple
2   Banana
3   Cherry
DS_2:
ID  Observation
1   Apple detail
1   Apple detail
1   Apple detail
2   Banana detail
2   Banana detail
3   Cherry detail
3   Cherry detail
3   Cherry detail
DS_3:
ID  Observation
2   Banana detail
2   Banana detail
3   Cherry detail
I’m looking to create a new dataset that shows frequency counts per ID across the datasets (and, lastly, calculates Total_Obs). The output would look something like this:
ID  Name    DS_2  DS_3  Total_Obs
1   Apple   3     0     3
2   Banana  2     2     4
3   Cherry  3     1     4
The datasets are fairly large. Is there a more efficient way to do this than concatenating the datasets and building a frequency table, or than creating a bunch of sorted frequency tables and then merging by ID across all the datasets?
Answer
You can do the following:
select t1.id                                         as id
     , t1.name                                       as name
     , coalesce(DS_2_obs, 0)                         as DS_2_obs
     , coalesce(DS_3_obs, 0)                         as DS_3_obs
     , coalesce(DS_2_obs, 0) + coalesce(DS_3_obs, 0) as Total_Obs
from DS_1 t1
/* count observations per id in each detail dataset */
left join (select id, count(1) as DS_2_obs from DS_2 group by id) t2
  on t1.id = t2.id
left join (select id, count(1) as DS_3_obs from DS_3 group by id) t3
  on t1.id = t3.id;
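The left joins from DS_1 guarantee that every ID appears in the output even when it has no rows in one of the detail datasets (e.g. Apple in DS_3), and coalesce turns the resulting nulls into 0 so Total_Obs adds up correctly.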
Also, you should always tag which database you are using, since the exact syntax can depend on the SQL dialect.
If the above SQL takes a long time, then instead of using t2 and t3 as inline queries you can consider creating aggregate observation tables that hold the frequency counts, with an index on id. That way, when you join the observation aggregates to the primary table, the join can use the indexes.
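For example, here is a minimal sketch along those lines (the aggregate table names DS_2_agg/DS_3_agg and the index names are hypothetical, and the exact CREATE TABLE ... AS / CREATE INDEX syntax varies by database):

-- Materialize the per-ID counts once, instead of recomputing them inline
create table DS_2_agg as
select id, count(1) as DS_2_obs from DS_2 group by id;

create table DS_3_agg as
select id, count(1) as DS_3_obs from DS_3 group by id;

-- Index the join key so lookups against DS_1 are fast
create index idx_ds_2_agg_id on DS_2_agg (id);
create index idx_ds_3_agg_id on DS_3_agg (id);

-- The final query then joins the small aggregate tables directly
select t1.id, t1.name,
       coalesce(t2.DS_2_obs, 0) as DS_2_obs,
       coalesce(t3.DS_3_obs, 0) as DS_3_obs,
       coalesce(t2.DS_2_obs, 0) + coalesce(t3.DS_3_obs, 0) as Total_Obs
from DS_1 t1
left join DS_2_agg t2 on t1.id = t2.id
left join DS_3_agg t3 on t1.id = t3.id;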