I have a query to generate a report based on a date series that is grouped by date
and employee_id
. The date should be based on a particular time zone, in this case ‘Asia/Kuala_Lumpur’. But this can change depending on where the user’s time zone is.
SELECT d::date AT TIME ZONE 'Asia/Kuala_Lumpur' AS created_date, e.id, e.name, e.division_id, ARRAY_AGG( a.id ) as rows, MIN(a.created_at) FILTER (WHERE a.activity_type = 1) as min_time_in, MAX(a.created_at) FILTER (WHERE a.activity_type = 2) as max_time_out, ARRAY_AGG( CASE WHEN a.activity_type = 1 THEN a.created_at ELSE NULL END ) as check_ins, ARRAY_AGG( CASE WHEN a.activity_type = 2 THEN a.created_at ELSE NULL END ) as check_outs FROM (SELECT MIN(created_at), MAX(created_at) FROM attendance) AS r(startdate,enddate) , generate_series( startdate::timestamp, enddate::timestamp, interval '1 day') g(d) CROSS JOIN employee e LEFT JOIN attendance a ON a.created_at::date = d::date AND e.id = a.employee_id where d::date = date '2020-11-20' and division_id = 1 GROUP BY created_date , e.id , e.name , e.division_id ORDER BY created_date , e.id;
Definition and sample data for table attendance
:
CREATE TABLE attendance ( id int, employee_id int, activity_type int, created_at timestamp with time zone NOT NULL ); INSERT INTO attendance VALUES ( 1, 1, 1,'2020-11-18 07:10:25 +00:00'), ( 2, 2, 1,'2020-11-18 07:30:25 +00:00'), ( 3, 3, 1,'2020-11-18 07:50:25 +00:00'), ( 4, 2, 2,'2020-11-18 19:10:25 +00:00'), ( 5, 3, 2,'2020-11-18 19:22:38 +00:00'), ( 6, 1, 2,'2020-11-18 20:01:05 +00:00'), ( 7, 1, 1,'2020-11-19 07:11:23 +00:00'), ( 8, 1, 2,'2020-11-19 16:21:53 +00:00'), <-- Asia/Kuala_Lumpur +8 should be in 20.11 (refer to the check_outs field in the results output) ( 9, 1, 1,'2020-11-19 19:11:23 +00:00'), <-- Asia/Kuala_Lumpur +8 should be in 20.11 (refer to the check_ins field in the results output) (10, 1, 2,'2020-11-19 20:21:53 +00:00'), <-- Asia/Kuala_Lumpur +8 should be in 20.11 (refer to the check_outs field in the results output) (11, 1, 1,'2020-11-20 07:41:38 +00:00'), (12, 1, 2,'2020-11-20 08:52:01 +00:00');
Here is a fiddle to test.
The query does not include rows 8-10 in the output for the time zone Asia/Kuala_Lumpur +8, though it should. Result shows “rows” field 11,12
.
How can I fix the query so that it generates the report based on dates of a given time zone? (Meaning I can change Asia/Kuala_Lumpur
to America/New_York
etc.)
I was told to do something like this:
where created_at >= timestamp '2020-11-20' AT TIME ZONE 'Asia/Kuala_Lumpur' and created_at < timestamp '2020-11-20' AT TIME ZONE 'Asia/Kuala_Lumpur' + interval '1 day'
But I am not sure how to apply it. Does not seem to work properly in this fiddle. It should include rows 8,9,10,11,12 but only rows 8,9,10 show up.
Advertisement
Answer
DB design
Consider some modifications to your setup:
CREATE TABLE employee ( id int PRIMARY KEY -- ! , name text -- do NOT use char(n) ! , division_id int ); CREATE TABLE attendance ( id int PRIMARY KEY --! , employee_id int NOT NULL REFERENCES employee -- FK! , activity_type int , created_at timestamptz NOT NULL );
Defining a PK makes it easier to aggregate rows, because the PK covers the whole row in the GROUP BY
clause. See:
I wouldn’t use “name” as column name. It’s not descriptive. Every other column could be named “name”. Consider:
- Any downsides of using data type “text” for storing strings?
- How to implement a many-to-many relationship in PostgreSQL?
Query
SELECT * FROM ( -- complete employee/date grid for division in range SELECT g.d::date AS the_date, id AS employee_id, name, division_id FROM ( SELECT generate_series(MIN(created_at) AT TIME ZONE 'Asia/Kuala_Lumpur' , MAX(created_at) AT TIME ZONE 'Asia/Kuala_Lumpur' , interval '1 day') FROM attendance ) g(d) CROSS JOIN employee e WHERE e.division_id = 1 ) de LEFT JOIN ( -- checkins & checkouts per employee/date for division in range SELECT employee_id, ts::date AS the_date , array_agg(id) as rows , min(ts) FILTER (WHERE activity_type = 1) AS min_check_in , max(ts) FILTER (WHERE activity_type = 2) AS max_check_out , array_agg(ts::time) FILTER (WHERE activity_type = 1) AS check_ins , array_agg(ts::time) FILTER (WHERE activity_type = 2) AS check_outs FROM ( SELECT a.id, a.employee_id, a.activity_type, a.created_at AT TIME ZONE 'Asia/Kuala_Lumpur' AS ts -- convert to timestamp FROM employee e JOIN attendance a ON a.employee_id = e.id -- WHERE a.created_at >= timestamp '2020-11-20' AT TIME ZONE 'Asia/Kuala_Lumpur' -- "sargable" expressions -- AND a.created_at < timestamp '2020-11-21' AT TIME ZONE 'Asia/Kuala_Lumpur' -- exclusive upper bound (includes all of 2020-11-20); AND e.division_id = 1 ORDER BY a.employee_id, a.created_at, a.activity_type -- optional to guarantee sorted arrays ) sub GROUP BY 1, 2 ) a USING (the_date, employee_id) ORDER BY 1, 2;
db<>fiddle here
Note that my query outputs local date and time for Asia/Kuala_Lumpur:
test=> SELECT timestamptz '2020-11-20 08:52:01 +0' AT TIME ZONE 'Asia/Kuala_Lumpur' AS local_ts; local_ts --------------------- 2020-11-20 16:52:01
Where to start? You need to understand the concepts of time zones and the Postgres data types timestamp with time zone
(timestamptz
) vs. timestamp without time zone
(timestamp
). Else, it will be confusion without end. Start here:
Most notably, timestamptz
does not store a time zone:
When simply casting timestamptz
to date
or timestamp
, the current time zone setting of the session is assumed. Not what you want. Provide a time zone explicitly with the AT TIME ZONE
construct to avoid this pifall. In your fiddle you have both:
... , generate_series( startdate::timestamp AT TIME ZONE 'Asia/Kuala_Lumpur', enddate::timestamp AT TIME ZONE 'Asia/Kuala_Lumpur', interval '1 day') g(d) ...
Also not doing what you want. After the (faulty!) cast to timestamp
, the AT TIME ZONE
construct translates the values back to timestamptz
.
Also, your query generates the complete Cartesian Product of all users and and the maximum range of days in the the table attendance
, only to reduce it back to a single day with:
where created_at >= timestamp '2020-11-20' AT TIME ZONE 'Asia/Kuala_Lumpur' and created_at < timestamp '2020-11-20' AT TIME ZONE 'Asia/Kuala_Lumpur' + interval '1 day'
The WHERE
clause finally does what it’s supposed to do. But it makes no sense to first generate the full range of days, only to throw away most of it. (Seems you copied that from my other fiddle in the meantime?)
I commented out the WHERE
clause and kept an optimized version of your generate_series()
in my query as proof of concept. Further reading: