Problem Description
I am looking for a SQL query that can create table 2 using table 1. I’ve thought of way to do this but it’s not very eloquent and would prefer something simpler if possible.
TABLE 1:
ID | Col_1 | Col_2 | Col_3 | Col_4 |
---|---|---|---|---|
01 | A | Q | R | U |
02 | A | Q | R | P |
03 | A | Q | R | U |
04 | A | Q | R | P |
05 | A | Q | T | U |
06 | A | Q | T | P |
07 | A | Q | T | U |
08 | A | Q | T | P |
09 | A | Q | R | U |
10 | B | Q | R | P |
11 | B | Q | R | U |
12 | B | Q | R | P |
13 | B | Q | T | U |
14 | C | Q | T | P |
15 | C | Q | T | U |
16 | C | Q | T | U |
17 | C | Q | R | P |
18 | C | Q | R | U |
19 | C | Q | R | P |
20 | C | Q | R | U |
TABLE 2:
ID | Col_1 | Col_2 | Col_3 | Col_4 |
---|---|---|---|---|
01 | 9 | 9 | 5 | 3 |
02 | 9 | 9 | 5 | 2 |
03 | 9 | 9 | 5 | 3 |
04 | 9 | 9 | 5 | 2 |
05 | 9 | 9 | 4 | 2 |
06 | 9 | 9 | 4 | 2 |
07 | 9 | 9 | 4 | 2 |
08 | 9 | 9 | 4 | 2 |
09 | 9 | 9 | 5 | 3 |
10 | 4 | 4 | 3 | 2 |
11 | 4 | 4 | 3 | 1 |
12 | 4 | 4 | 3 | 2 |
13 | 4 | 4 | 1 | 1 |
14 | 7 | 7 | 3 | 1 |
15 | 7 | 7 | 3 | 2 |
16 | 7 | 7 | 3 | 2 |
17 | 7 | 7 | 4 | 2 |
18 | 7 | 7 | 4 | 2 |
19 | 7 | 7 | 4 | 2 |
20 | 7 | 7 | 4 | 2 |
The numbers in Table 2 are generated as follows:
In Table 2’s Col_1, the number is the count of all IDs that have the same Col_1 value.
In Table 2’s Col_2, the number is the count of all IDs that have the same Col_1 AND Col_2 value.
In Table 2’s Col_3, the number is the count of all IDs that have the same Col_1 AND Col_2 AND Col_3 value.
In Table 2’s Col_4, the number is the count of all IDs that have the same Col_1 AND Col_2 AND Col_3 AND Col_4 value.
Example:
In Table 1, ID01’s Col_1 value is ‘A’. ‘A’ matches with 02, 03, 04, 05, 06, 07, 08, and 09 (8 other IDs) so the number is 9.
ID01’s Col_2 value is ‘Q’. ‘A’ in Col_1 AND ‘Q’ in Col_2 matches with 02, 03, 04, 05, 06, 07, 08, and 09 (8 other IDs) so the number is 9.
ID01’s Col_3 value is ‘R’. ‘A’ in Col_1 AND ‘Q’ in Col_2 AND ‘R’ in Col_3 matches with 02, 03, 04, and 09 (4 other IDs) so the number is 5.
ID01’s Col_4 value is ‘U’. ‘A’ in Col_1 AND ‘Q’ in Col_2 AND ‘R’ in Col_3 AND ‘U’ in Col_4 matches with 03 and 09 (2 other IDs) so the number is 3.
Essentially, we are trying to find the total number of identical IDs in Table 1 – with respect to the Col of interest.
What I’ve tried
I figured out a way to get the numbers in Table 2, with short SQL queries:
-- Somehow use in Table 2 SELECT col_1, col_2, col_3, col_4, COUNT() AS ct FROM sample_table GROUP BY col_1, col_2, col_3, col_4 ORDER BY ct DESC;
this returns
Col_1 | Col_2 | Col_3 | Col_4 | ct |
---|---|---|---|---|
A | Q | R | U | 3 |
A | Q | R | P | 2 |
A | Q | T | P | 2 |
A | Q | T | U | 2 |
B | Q | R | P | 2 |
C | Q | R | P | 2 |
C | Q | R | U | 2 |
C | Q | T | U | 2 |
B | Q | R | U | 1 |
B | Q | T | U | 1 |
C | Q | T | P | 1 |
which are the correct numbers (ct) for Col_4 of Table 2. I would just need to match the columns in this output and Table 1. I can also GROUP BY
fewer Columns to get the correct numbers for Col_3, Col_2, and 1. e.g.
SELECT col_1, col_2, col_3, COUNT() AS ct FROM samptbl GROUP BY col_1, col_2, col_3 ORDER BY ct DESC;
Col_1 | Col_2 | Col_3 | ct |
---|---|---|---|
A | Q | R | 5 |
A | Q | T | 4 |
C | Q | R | 4 |
B | Q | R | 3 |
C | Q | T | 3 |
B | Q | T | 1 |
I’m doing this in SQLite right now but I can use any other language if needed (PostgreSQL, Python). I would really like this to be 100% in a SQL if possible. The data sets that I’m working with are large, so good efficiency is desired.
Advertisement
Answer
You can do it with window function COUNT()
:
SELECT ID, COUNT(*) OVER (PARTITION BY Col_1) Col_1, COUNT(*) OVER (PARTITION BY Col_1, Col_2) Col_2, COUNT(*) OVER (PARTITION BY Col_1, Col_2, Col_3) Col_3, COUNT(*) OVER (PARTITION BY Col_1, Col_2, Col_3, Col_4) Col_4 FROM table1 ORDER BY ID
Without any changes it works for Postgresql also.
See the demo.