I’m trying to optimise my query for when an internal customer only wants to return one result (and its associated nested dataset). My aim is to reduce the amount of data the query processes. However, the processed size appears to be exactly the same whether I’m querying for 1 record (with an unnested 48,000-length array) or the whole dataset (10,000 records with …
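One likely explanation: BigQuery bills a query by the columns it scans across the whole table, not by the rows it returns, so a LIMIT or a plain WHERE does not shrink the bytes processed. The usual workaround is to partition and/or cluster the table on the filter columns so BigQuery can prune storage blocks; selecting only the nested columns you actually need also cuts the scanned bytes. A minimal sketch, with made-up table and column names (my_dataset.events, created_at, customer_id):

-- Rebuild the table partitioned by day and clustered by customer
-- (all names here are illustrative assumptions).
CREATE TABLE my_dataset.events_partitioned
PARTITION BY DATE(created_at)
CLUSTER BY customer_id AS
SELECT * FROM my_dataset.events;

-- Filtering on the partition/cluster columns now prunes storage blocks,
-- so bytes processed shrink with the filter instead of staying constant.
SELECT *
FROM my_dataset.events_partitioned
WHERE DATE(created_at) = '2021-01-01'
  AND customer_id = 'abc123';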
SQL Server: change column datatype on ~1 billion records
I am running up against the column’s size limit in my table and want to change the column type from INT to BIGINT. The table has around 1 billion rows, but whenever I try to change the datatype it takes too much time and eats all of my machine’s disk space. What is the best and fastest way to …
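A pattern that usually avoids the long lock and the log blow-up is to add a new BIGINT column, backfill it in batches, then swap it in. A rough T-SQL sketch, assuming hypothetical names dbo.BigTable, OldCol and NewCol:

-- Step 1: add the BIGINT column (metadata-only change, fast).
ALTER TABLE dbo.BigTable ADD NewCol BIGINT NULL;
GO

-- Step 2: backfill in small batches so each transaction stays small.
DECLARE @rows INT = 1;
WHILE @rows > 0
BEGIN
    UPDATE TOP (50000) dbo.BigTable
    SET NewCol = OldCol
    WHERE NewCol IS NULL;   -- assumes OldCol is NOT NULL, otherwise this loop never ends
    SET @rows = @@ROWCOUNT;
END
GO

-- Step 3: swap the columns during a short maintenance window.
ALTER TABLE dbo.BigTable DROP COLUMN OldCol;
EXEC sp_rename 'dbo.BigTable.NewCol', 'OldCol', 'COLUMN';

Taking log backups between batches (or running under the SIMPLE recovery model) keeps the transaction log from consuming all the disk space during the backfill.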
Google BigQuery – Subtract SUMs of a column based on values in another column
Hi, I need one query to get the top 10 countries with the largest [total(imports) – total(exports)] for goods_type medicines between 2019 and 2020. The data sample is as below: The returned data should include country, goods_type, and the value of [total(imports) – total(exports)]. I have come up with the query below, but I don’t know if it’s right or wrong, …
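One way to express this is conditional aggregation: sum imports and exports in the same pass and order by the difference. A sketch that assumes column names country, goods_type, flow ('import'/'export'), amount and year, since the full sample isn’t shown here:

SELECT
  country,
  goods_type,
  SUM(IF(flow = 'import', amount, 0))
    - SUM(IF(flow = 'export', amount, 0)) AS import_minus_export
FROM `my_project.my_dataset.trade`
WHERE goods_type = 'medicines'
  AND year BETWEEN 2019 AND 2020
GROUP BY country, goods_type
ORDER BY import_minus_export DESC
LIMIT 10;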
Is there a way to filter rows in BigQuery by the contents of an array?
I have data in a BigQuery table that looks like this: My question is, how can I find all rows where “key” = “a” with “value” = 1, but also “key” = “b” with “value” = 3? I’ve tried various forms of UNNEST but I haven’t been able to get it right. The CROSS JOIN leaves me with one row …
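A pattern that tends to work better than CROSS JOIN here is one EXISTS subquery per condition, so a row is kept only if its array contains both key/value pairs. A sketch under an assumed schema where the array column is kv ARRAY<STRUCT<key STRING, value INT64>>:

SELECT *
FROM `my_project.my_dataset.my_table` AS t
WHERE EXISTS (SELECT 1 FROM UNNEST(t.kv) AS p WHERE p.key = 'a' AND p.value = 1)
  AND EXISTS (SELECT 1 FROM UNNEST(t.kv) AS q WHERE q.key = 'b' AND q.value = 3);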
Process several billion records from Redshift using custom logic
I want to apply custom logic over a dataset stored in Redshift. Example of input data:
userid, event, fileid, timestamp, …
100000, start, 120, 2018-09-17 19:11:40
100000, done, 120, 2018-…
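The excerpt doesn’t spell out the custom logic, but if it amounts to pairing each file’s ‘start’ and ‘done’ events per user, it can often stay inside Redshift as plain SQL instead of being pulled out record by record. A sketch with an assumed table name events:

-- Pair start/done events per (userid, fileid) and compute elapsed time.
SELECT
  userid,
  fileid,
  MIN(CASE WHEN event = 'start' THEN "timestamp" END) AS started_at,
  MAX(CASE WHEN event = 'done'  THEN "timestamp" END) AS finished_at,
  DATEDIFF(second,
           MIN(CASE WHEN event = 'start' THEN "timestamp" END),
           MAX(CASE WHEN event = 'done'  THEN "timestamp" END)) AS duration_sec
FROM events
GROUP BY userid, fileid;

If the logic genuinely can’t be expressed in SQL, the usual route is to UNLOAD the table to S3 and process the files with an external engine such as Spark.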
Storing a huge amount of points (x, y, z) in a relational database
I need to store a very simple data structure on disk – the Point. Its fields are just: Moment – a 64-bit integer representing a time with high precision. EventType – a 32-bit integer, a reference to another object. Value – a 64-bit floating point number. Requirements: the pair (Moment + EventType) is the unique identifier of the Point, so I suspect it to …
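A minimal relational layout matching that description uses the (Moment, EventType) pair as a composite primary key; the table name and Postgres-style types below are assumptions:

CREATE TABLE point (
    moment     BIGINT           NOT NULL,  -- 64-bit integer, high-precision time
    event_type INTEGER          NOT NULL,  -- 32-bit reference to another object
    value      DOUBLE PRECISION NOT NULL,  -- 64-bit floating point number
    PRIMARY KEY (moment, event_type)       -- the pair uniquely identifies a Point
);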
How to create a large pandas DataFrame from an SQL query without running out of memory?
I have trouble querying a table of more than 5 million records from an MS SQL Server database. I want to select all of the records, but my code seems to fail when selecting too much data into memory. This works: … but this does not work: It returns this error: I have read here that a similar problem exists when creating a …
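The usual fix is to stream the result in chunks instead of materialising every row at once; pandas.read_sql accepts a chunksize argument that yields the DataFrame piece by piece. The same idea expressed on the SQL side is keyset pagination, fetching one slice per round trip (the table and column names below are assumptions):

-- Fetch the next slice after the last Id already processed; the client runs this
-- repeatedly, feeding in the highest Id it has seen so far.
DECLARE @last_id_seen BIGINT = 0;

SELECT TOP (100000) *
FROM dbo.BigTable
WHERE Id > @last_id_seen
ORDER BY Id;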
Best way to delete millions of rows by ID
I need to delete about 2 million rows from my PG database. I have a list of IDs that I need to delete, but every way I’ve tried is taking days. I tried putting them in a table and deleting in batches of 100; four days later, it is still running, with only 297,268 rows deleted.
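Batches of 100 are usually far too small for this; a single set-based delete that joins against the list of IDs is typically dramatically faster. A sketch assuming the IDs are already loaded into a helper table ids_to_delete(id) and the target is big_table(id), both names made up for illustration:

-- Index and analyse the helper table so the planner can join it efficiently.
CREATE INDEX IF NOT EXISTS ids_to_delete_id_idx ON ids_to_delete (id);
ANALYZE ids_to_delete;

-- One set-based delete instead of millions of tiny ones.
DELETE FROM big_table b
USING ids_to_delete d
WHERE b.id = d.id;

Unindexed foreign keys in child tables that reference the deleted rows, and per-row triggers, are the other common reasons a delete like this takes days.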