I have spent a good portion of today and yesterday trying to decide whether to use a loop or cursor in SQL, or to figure out how to use set-based logic to solve the problem. I am not new to set logic, but this problem seems to be particularly complex.
The Problem
The idea is that if I have a list of all transactions (tens or hundreds of millions of rows) and the date each occurred, I can start combining some of that data into a daily totals table so that it is more rapidly viewable by reporting and analytic systems. The pseudocode for this is as follows:
foreach( row in transactions_table )
    if( row in totals_table already exists )
        update totals_table, add my totals to the totals row
    else
        insert into totals_table with my row as the base values
    delete ( or archive ) row
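To make this concrete, the cursor version I have been testing looks roughly like the sketch below (table and column names come from the schema in the Notes further down; the variable types and the assumption that Totals.id is an IDENTITY column are guesses on my part):

DECLARE @tx_id INT, @store_id INT, @machine_id INT,
        @tx_date DATE, @tx_type VARCHAR(3), @amount DECIMAL(18,2);

-- STATIC so that deleting rows inside the loop does not disturb the iteration
DECLARE tx_cursor CURSOR LOCAL STATIC FOR
    SELECT transaction_id, store_id, machine_id,
           CAST(transaction_date AS DATE), transaction_type, transaction_amount
      FROM dbo.Transactions;

OPEN tx_cursor;
FETCH NEXT FROM tx_cursor
    INTO @tx_id, @store_id, @machine_id, @tx_date, @tx_type, @amount;

WHILE @@FETCH_STATUS = 0
BEGIN
    IF EXISTS (SELECT 1 FROM dbo.Totals
                WHERE store_id = @store_id
                  AND machine_id = @machine_id
                  AND totals_date = @tx_date)
    BEGIN
        -- fold this transaction into the existing daily totals row
        UPDATE dbo.Totals
           SET total_in  = total_in  + CASE WHEN @tx_type = 'IN'  THEN @amount ELSE 0 END,
               total_out = total_out + CASE WHEN @tx_type = 'OUT' THEN @amount ELSE 0 END
         WHERE store_id = @store_id
           AND machine_id = @machine_id
           AND totals_date = @tx_date;
    END
    ELSE
    BEGIN
        -- no totals row yet for this store/machine/date, so create one
        INSERT dbo.Totals (totals_date, store_id, machine_id, total_in, total_out)
        VALUES (@tx_date, @store_id, @machine_id,
                CASE WHEN @tx_type = 'IN'  THEN @amount ELSE 0 END,
                CASE WHEN @tx_type = 'OUT' THEN @amount ELSE 0 END);
    END

    DELETE dbo.Transactions WHERE transaction_id = @tx_id;

    FETCH NEXT FROM tx_cursor
        INTO @tx_id, @store_id, @machine_id, @tx_date, @tx_type, @amount;
END

CLOSE tx_cursor;
DEALLOCATE tx_cursor;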
As you can tell, the body of the loop is relatively trivial to implement, as is the cursor/looping iteration itself. However, the execution time is quite slow and unwieldy, and my question is: is there a non-iterative way to perform such a task, or is this one of the rare exceptions where I just have to “suck it up” and use a cursor?
There have been a few discussions on the topic, some of which seem similar but are not usable here because of the if/else logic and the operations on another table. For instance:
How to merge rows of SQL data on column-based logic? This question doesn’t seem to be applicable, because it simply returns a view of all sums and doesn’t actually make logical decisions about additions or updates to another table.
SQL Looping seems to have a few ideas about selection with a couple of CASE statements, which seems possible, but I need two different operations performed depending on the state of another table, so this solution does not seem to fit.
SQL Call Stored Procedure for each Row without using a cursor This solution seems to be the closest to what I need to do, in that it can handle arbitrary numbers of operations on each row, but there doesn’t seem to be a consensus among that group.
Any advice on how to tackle this frustrating problem?
Notes
I am using SQL Server 2008
The schema setup is as follows:
Totals: (id int pk, totals_date date, store_id int fk, machine_id int fk, total_in, total_out)
Transactions: (transaction_id int pk, transaction_date datetime, store_id int fk, machine_id int fk, transaction_type (IN or OUT), transaction_amount decimal)
The totals should be computed by store, by machine, and by date, and should total all of the IN transactions into total_in and the OUT transactions into total_out. The goal is to get a pseudo data cube going.
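For reference, the per-day aggregation itself is easy enough. The sketch below (dbo schema assumed, column names from above) is roughly what I want per store, machine, and date; the hard part is folding these sums into existing Totals rows rather than just producing a result set:

SELECT store_id,
       machine_id,
       totals_date = CAST(transaction_date AS DATE),
       total_in    = SUM(CASE WHEN transaction_type = 'IN'  THEN transaction_amount ELSE 0 END),
       total_out   = SUM(CASE WHEN transaction_type = 'OUT' THEN transaction_amount ELSE 0 END)
  FROM dbo.Transactions
 GROUP BY store_id, machine_id, CAST(transaction_date AS DATE);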
Answer
You would do this in two set-based statements:
BEGIN TRANSACTION;

DECLARE @keys TABLE(some_key INT);

UPDATE tot
  SET tot.amount += tx.amount
  OUTPUT inserted.some_key   -- key values updated
  INTO @keys
  FROM dbo.totals_table AS tot WITH (UPDLOCK, HOLDLOCK)
  INNER JOIN
  (
    SELECT t.some_key, amount = SUM(t.amount)
      FROM dbo.transactions_table AS t WITH (HOLDLOCK)
      INNER JOIN dbo.totals_table AS tot
        ON t.some_key = tot.some_key
      GROUP BY t.some_key
  ) AS tx
  ON tot.some_key = tx.some_key;

INSERT dbo.totals_table(some_key, amount)
  OUTPUT inserted.some_key INTO @keys
  SELECT some_key, SUM(amount)
    FROM dbo.transactions_table AS tx
    WHERE NOT EXISTS
    (
      SELECT 1 FROM dbo.totals_table
        WHERE some_key = tx.some_key
    )
    GROUP BY some_key;

DELETE dbo.transactions_table
  WHERE some_key IN (SELECT some_key FROM @keys);

COMMIT TRANSACTION;
(Error handling, applicable isolation level, rollback conditions etc. omitted for brevity.)
You do the UPDATE first so you don’t insert new rows and then immediately update them, performing the work twice and possibly double counting. You could use OUTPUT in both cases into a temp table, perhaps, and then archive/delete those rows from the transactions table.
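For example, the final DELETE above could feed an archive table instead (a sketch only; it assumes a dbo.transactions_archive table with the same column layout, and that the archive table has no triggers or foreign keys, which OUTPUT ... INTO requires):

DELETE tx
  OUTPUT deleted.*                 -- the rows being archived
  INTO dbo.transactions_archive
  FROM dbo.transactions_table AS tx
  WHERE tx.some_key IN (SELECT some_key FROM @keys);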
I’d caution you not to get too excited about MERGE until they’ve resolved some of these bugs and you have read enough about it to be sure you’re not lulled into any false confidence about how much “better” it is for concurrency and atomicity without additional hints. The race conditions you can work around; the bugs you can’t.
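(For completeness, the race-condition workaround usually meant here is a HOLDLOCK hint on the MERGE target, sketched below with the same placeholder names as above. It does nothing about the bugs.)

-- sketch only: HOLDLOCK serializes the existence check and the insert/update
MERGE dbo.totals_table WITH (HOLDLOCK) AS tot
USING
(
    SELECT some_key, amount = SUM(amount)
      FROM dbo.transactions_table
     GROUP BY some_key
) AS tx
ON tot.some_key = tx.some_key
WHEN MATCHED THEN
    UPDATE SET amount = tot.amount + tx.amount
WHEN NOT MATCHED THEN
    INSERT (some_key, amount) VALUES (tx.some_key, tx.amount);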
Another alternative, from Nikola’s comment:
CREATE VIEW dbo.TotalsView
WITH SCHEMABINDING
AS
  SELECT some_key_column(s),
         total_amount = SUM(amount),
         row_count    = COUNT_BIG(*)   -- COUNT_BIG(*) is required in an indexed view with GROUP BY
    FROM dbo.Transaction_Table
    GROUP BY some_key_column(s);
GO
CREATE UNIQUE CLUSTERED INDEX some_key
  ON dbo.TotalsView(some_key_column(s));
GO
Now if you want to write queries that grab the totals, you can reference the view directly or – depending on query and edition – the view may automatically be matched even if you reference the base table.
Note: if you are not on Enterprise Edition, you may have to use the NOEXPAND hint to take advantage of the pre-aggregated values materialized by the view.
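A query against the view might then look like the sketch below (placeholder names as above; @some_value is just a stand-in, and total_amount is whatever alias you gave SUM(amount) in the view definition):

SELECT some_key_column(s), total_amount
  FROM dbo.TotalsView WITH (NOEXPAND)   -- force the optimizer to use the view's clustered index
  WHERE some_key_column(s) = @some_value;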