I’m trying to optimise my query for when an internal customer only wants to return one result (and its associated nested dataset). My aim is to reduce the bytes processed by the query.
However, the bytes processed are identical whether I query 1 record (with a 48,000-element unnested array) or the whole dataset (10,000 records, 514,048,748 total array elements)!
So my table results for one record query:
SELECT test_id, value FROM <my_table_reference>, UNNEST(TimeSeries) AS timeseries WHERE test_id = "T0003" AND SignalName = "Distance";
looks like this:
test_id | value |
---|---|
T0003 | 1.0 |
T0003 | 2.0 |
T0003 | 3.0 |
T0003 | 4.0 |
(48000 rows)
This continues until value (Distance) = 48000 m, i.e. 48,000 rows for the single record matching test_id = 'T0003'.
Total process was 3.84GB
For whole table (~10,000 records):
SELECT test_id, value FROM <my_table_reference>, UNNEST(TimeSeries) AS timeseries WHERE SignalName = "Distance";
looks like this:
test_id | value |
---|---|
T0001 | 1.0 |
T0001 | 2.0 |
T0001 | 3.0 |
T0001 | 4.0 |
(514,048,748 rows)
Total process was 3.84GB
Why is the processed size the same for both queries, and how can I optimise this for single-row extractions?
Answer
This happens because a full table scan is still needed to find all the rows whose test ID equals the specified one.
It is not clear from your example which columns are part of the TimeSeries record. If test_id is not one of them, I would suggest clustering the table on the test_id column. With clustering, the data is automatically organised according to the contents of the test_id column, so a query that filters on that column no longer needs a full scan to find all matching values.
Read more about clustered tables in the BigQuery documentation.
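As a sketch of what that could look like (the dataset and table names below are placeholders standing in for `<my_table_reference>`, which you would replace with your own):

```sql
-- Create a clustered copy of the existing table. Rows are physically
-- organised into blocks by test_id, so filters on test_id can prune
-- blocks instead of scanning the whole table.
CREATE TABLE `my_dataset.my_table_clustered`
CLUSTER BY test_id
AS
SELECT *
FROM `my_dataset.my_table`;

-- The single-record query from the question, run against the clustered
-- table; BigQuery should now read only the blocks containing 'T0003'.
SELECT test_id, value
FROM `my_dataset.my_table_clustered`, UNNEST(TimeSeries) AS timeseries
WHERE test_id = 'T0003'
  AND SignalName = 'Distance';
```

Note that the reduction in bytes processed depends on how the rows fall into clustered blocks, so the savings for a single test_id are typically large but not exactly 1/10,000th of the full-table cost.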