I’m working in SQL Workbench in Redshift. We have daily event tables for customer accounts, the same format each day just with updated info. There are currently 300+ tables. For a simple example, I would like to extract the top 10 rows from each table and place them in 1 table.
Table name format is Events_001, Events_002, etc. Typical values are Customer_ID and Balance.
Redshift does not appear to support declare variables, so I’m a bit stuck.
Advertisement
Answer
You’ve effectively invented a kind of pseudo-partitioning; where you manually partition the data by day.
To manually recombine the tables create a view to union everything together…
CREATE VIEW events_combined AS SELECT 1 AS partition_id, * FROM events_001 UNION ALL SELECT 2 AS partition_id, * FROM events_002 UNION ALL SELECT 3 AS partition_id, * FROM events_003 etc, etc
That’s a hassle, you need to recreate the view every time you add a new table.
That’s why most modern databases have partitioning schemes built in to them, so all the boiler-plate is taken care of for you.
But RedShift doesn’t do that. So, why not?
In general because RedShift has many alternative mechanisms for dividing and conquering data. It’s columnar, so you can avoid reading columns you don’t use. It’s horizontally partitioned across multiple nodes (sharded), to share the load with large volumes of data. It’s sorted and compressed in pages to avoid loading rows you don’t want or need. It has dirty pages for newly arriving data, which can then be cleaned up with a VACUUM
.
So, I would agree with others that it’s not normal practice. Yet, Amazon themselves do have a help page (briefly) describing your use case.
So, I’d disagree with “never do this”. Still, it is a strong indication that you’ve accidentally walked in to an anti-pattern and should seriously re-consider your design.