I’m trying to join a new column to my current query that uses aggregate functions. I create this column with a new query that also uses an aggregate function from a different table but I’m not sure if a JOIN
will work for me since I need to join it to its respective row.
TABLE A (employees that are enrolled or were enrolled in a project)
ID | DEPARTMENT | ENROLLED | PROJECT |
---|---|---|---|
1 | MARKETING | Yes | ARQ |
2 | MARKETING | Yes | TC |
3 | MARKETING | No | ARQ |
4 | MARKETING | No | TC |
5 | FINANCE | Yes | ARQ |
6 | FINANCE | Yes | TC |
7 | FINANCE | No | ARQ |
8 | FINANCE | Yes | TC |
This table has more departments and more projects, but I simplified.
TABLE B (relation with departments and employees)
ID | DEPARTMENT | TOTAL_EMPLOYEES |
---|---|---|
1 | MARKETING | 2 |
2 | MARKETING | 3 |
3 | FINANCE | 4 |
4 | FINANCE | 8 |
In my first query I was asked to achieve the following result – using only table A:
(employees enrolled) (employees not enrolled)
DEPARTMENT | ARQ_E | TC_E | TOTAL_ENROLLED | ARQ_N | TC_N | TOTAL_NOT_ENROLLED | TOTAL |
---|---|---|---|---|---|---|---|
MARKETING | 1 | 1 | 2 | 1 | 1 | 2 | 4 |
FINANCE | 1 | 1 | 2 | 1 | 1 | 2 | 4 |
Using the following query:
SELECT tableA.department, sum(case when enrolled = 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_E, sum(case when enrolled = 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_E, sum(case when enrolled = 'Yes' then 1 else 0 end) as TOTAL_ENROLLED, sum(case when enrolled != 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_N, sum(case when enrolled != 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_N, sum(case when enrolled != 'Yes' then 1 else 0 end) as TOTAL_NOT_ENROLLED, count (*) AS Total FROM tableA GROUP BY tableA.department;
My second query gets departments and their total employees from table B:
DEPARTMENT | TOTAL_EMPLOYEES |
---|---|
MARKETING | 5 |
FINANCE | 12 |
Using the following query:
SELECT tableB.department, sum(tableB.total_employees) AS TOTAL_EMPLOYEES FROM tableB GROUP BY tableB.department;
I need to add the column TOTAL_EMPLOYEES
to my first query, next to TOTAL
will be TOTAL_EMPLOYEES
. But it has to be placed with its respective department row. I need this to compare this 2 columns and see how many employees were not assigned to any project.
This is my expected result.
(employees enrolled) (employees not enrolled)
DEPARTMENT | ARQ_E | TC_E | TOTAL_ENROLLED | ARQ_N | TC_N | TOTAL_NOT_ENROLLED | TOTAL | T_EMPL |
---|---|---|---|---|---|---|---|---|
MARKETING | 1 | 1 | 2 | 1 | 1 | 2 | 4 | 5 |
FINANCE | 1 | 1 | 2 | 1 | 1 | 2 | 4 | 12 |
I have tried to achieve this using the following query:
SELECT tableA.department, sum(case when enrolled = 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_E, sum(case when enrolled = 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_E, sum(case when enrolled = 'Yes' then 1 else 0 end) as TOTAL_ENROLLED, sum(case when enrolled != 'Yes' and tableA.project = 'ARQ' then 1 else 0 end) as ARQ_N, sum(case when enrolled != 'Yes' and tableA.project = 'TC' then 1 else 0 end) as TC_N, sum(case when enrolled != 'Yes' then 1 else 0 end) as TOTAL_NOT_ENROLLED, count (*) AS Total, sum (tableB.total_employees) AS T_EMPL FROM tableA JOIN tableB ON tableA.department = tableB.department GROUP BY tableA.department;
But the numbers I get in my query are completely wrong since the JOINS repeat my rows and my SUMS duplicate.
I don’t know if I really need to use a join or a subquery to place my sum(tableB.department)
in its respective row.
I’m using PostgreSQL but since I’m using Standard 92 any SQL solution will help.
Advertisement
Answer
Your main issue stemmed from inadvertently multiplying rows with the join, and has already been addressed. See:
But use the standard SQL aggregate FILTER
clause. It’s shorter, cleaner, and noticeably faster. See:
- Aggregate columns with additional (distinct) filters
- For absolute performance, is SUM faster or COUNT?
SELECT * FROM ( SELECT department , count(*) FILTER (WHERE enrolled AND project = 'ARQ') AS arq_e , count(*) FILTER (WHERE enrolled AND project = 'TC') AS tc_e , count(*) FILTER (WHERE enrolled) AS total_enrolled , count(*) FILTER (WHERE NOT enrolled AND project = 'ARQ') AS arq_n , count(*) FILTER (WHERE NOT enrolled AND project = 'TC') AS tc_n , count(*) FILTER (WHERE NOT enrolled) AS total_not_enrolled , count(*) AS total FROM tableA a GROUP BY 1 ) a LEFT JOIN ( -- ! SELECT department , sum(total_employees) AS total_employees FROM tableB b GROUP BY 1 ) b USING (department);
enrolled
should be a boolean
column. Make it so if it isn’t. Then you can use it directly. Smaller, faster, cleaner, shorter code.
I replaced the [INNER] JOIN
with a LEFT [OUTER] JOIN
on a suspicion. Typically, you want to keep all results, even if the same department is not found in the other table. Maybe even a FULL [OUTER] JOIN
?
Also, USING (department)
as join condition conveniently outputs that column only once, so we can make do with SELECT *
in the outer SELECT
.
Finally, subqueries are shorter and faster than CTEs. Not much since Postgres 12, but still. See: