How to convert a loop in SQL to Set-based logic

I have spent a good portion of the last two days trying to decide between using a loop/cursor in SQL and figuring out how to solve the problem with set-based logic. I am not new to set-based logic, but this problem seems particularly complex.

The Problem

The idea is that, given a list of all transactions (tens or hundreds of millions of rows) and the date each occurred, I can combine some of that data into a daily totals table that reporting and analytic systems can query much faster. The pseudocode for this is:

foreach( row in transactions_table )
    if( row in totals_table already exists )
        update totals_table, add my totals to the totals row
    else
        insert into totals_table with my row as the base values
    delete ( or archive ) row

As you can tell, the body of the loop is trivial to implement, as is the cursor/loop iteration itself. However, execution is quite slow and unwieldy, and my question is: is there a non-iterative way to perform this task, or is this one of the rare exceptions where I just have to “suck it up” and use a cursor?

There have been a few discussions on the topic, some of which look similar but aren’t usable because of the if/else branch and the operations on another table. For instance:

How to merge rows of SQL data on column-based logic? This question doesn’t seem applicable because it simply returns a view of all sums, and doesn’t actually make logical decisions about inserts into or updates of another table.

SQL Looping has a few ideas about selection with a couple of CASE statements, which seems possible, but I need two different operations performed depending on the state of another table, so that solution doesn’t fit either.

SQL Call Stored Procedure for each Row without using a cursor. This one seems closest to what I need, in that it can handle an arbitrary number of operations per row, but there doesn’t seem to be a consensus among the answers there.

Any advice on how to tackle this frustrating problem?

Notes

I am using SQL Server 2008

The schema setup is as follows:

Totals: (id int pk, totals_date date, store_id int fk, machine_id int fk, total_in, total_out)

Transactions: (transaction_id int pk, transaction_date datetime, store_id int fk, machine_id int fk, transaction_type (IN or OUT), transaction_amount decimal)

The totals should be computed by store, by machine, and by date, and should total all of the IN transactions into total_in and the OUT transactions into total_out. The goal is to get a pseudo data cube going.
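To make the target concrete, the rollup I’m after could be expressed as a single aggregate query, something like the following (a sketch only; column types and the exact `'IN'`/`'OUT'` literals are guesses where the schema above omits them):

```sql
-- One row per store, per machine, per calendar day, with IN and OUT
-- amounts split into separate columns via conditional aggregation.
SELECT
    store_id,
    machine_id,
    CAST(transaction_date AS date) AS totals_date,
    SUM(CASE WHEN transaction_type = 'IN'  THEN transaction_amount ELSE 0 END) AS total_in,
    SUM(CASE WHEN transaction_type = 'OUT' THEN transaction_amount ELSE 0 END) AS total_out
FROM dbo.Transactions
GROUP BY store_id, machine_id, CAST(transaction_date AS date);
```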


Answer

You would do this in two set-based statements:

BEGIN TRANSACTION;

-- Collect the keys touched by either statement so the
-- corresponding source rows can be removed afterward.
DECLARE @keys TABLE(some_key INT);

-- 1) Add the aggregated transaction amounts to totals rows that already exist.
UPDATE tot
  SET totals += tx.amount
OUTPUT inserted.some_key -- key values updated
INTO @keys
FROM dbo.totals_table AS tot WITH (UPDLOCK, HOLDLOCK)
INNER JOIN 
(
  SELECT t.some_key, amount = SUM(amount)
  FROM dbo.transactions_table AS t WITH (HOLDLOCK)
  INNER JOIN dbo.totals_table AS tot
  ON t.some_key = tot.some_key
  GROUP BY t.some_key
) AS tx
ON tot.some_key = tx.some_key;

-- 2) Insert totals rows for keys that don't exist yet.
INSERT dbo.totals_table(some_key, amount)
  OUTPUT inserted.some_key INTO @keys
  SELECT some_key, SUM(amount)
  FROM dbo.transactions_table AS tx
  WHERE NOT EXISTS 
  (
    SELECT 1 FROM dbo.totals_table
    WHERE some_key = tx.some_key
  )
  GROUP BY some_key;

-- 3) Remove (or archive) the transactions that were just aggregated.
DELETE dbo.transactions_table
  WHERE some_key IN (SELECT some_key FROM @keys);

COMMIT TRANSACTION;

(Error handling, applicable isolation level, rollback conditions etc. omitted for brevity.)

You do the update first so you don’t insert new rows and then immediately update them, performing the work twice and possibly double counting. In both statements you can use OUTPUT into a table variable or temp table to capture the affected keys, which you then use to archive/delete the matching rows from the transactions table.
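Applied to the schema in the question, the same update-then-insert pattern might look like this. This is an untested sketch: it assumes the natural key of Totals is (totals_date, store_id, machine_id), pre-aggregates into a temp table so each statement does the grouping only once, and omits the locking hints, error handling, and isolation-level concerns mentioned above:

```sql
BEGIN TRANSACTION;

-- Pre-aggregate the pending transactions once, per store/machine/day.
SELECT store_id, machine_id,
       CAST(transaction_date AS date) AS totals_date,
       SUM(CASE WHEN transaction_type = 'IN'  THEN transaction_amount ELSE 0 END) AS sum_in,
       SUM(CASE WHEN transaction_type = 'OUT' THEN transaction_amount ELSE 0 END) AS sum_out
INTO #tx
FROM dbo.Transactions
GROUP BY store_id, machine_id, CAST(transaction_date AS date);

-- 1) Add to totals rows that already exist.
UPDATE tot
   SET tot.total_in  += t.sum_in,
       tot.total_out += t.sum_out
FROM dbo.Totals AS tot
INNER JOIN #tx AS t
   ON t.store_id    = tot.store_id
  AND t.machine_id  = tot.machine_id
  AND t.totals_date = tot.totals_date;

-- 2) Insert totals rows that don't exist yet.
INSERT dbo.Totals (totals_date, store_id, machine_id, total_in, total_out)
SELECT t.totals_date, t.store_id, t.machine_id, t.sum_in, t.sum_out
FROM #tx AS t
WHERE NOT EXISTS
(
    SELECT 1 FROM dbo.Totals AS tot
    WHERE tot.store_id    = t.store_id
      AND tot.machine_id  = t.machine_id
      AND tot.totals_date = t.totals_date
);

-- 3) Remove (or archive) the transactions just rolled up.
DELETE tx
FROM dbo.Transactions AS tx
INNER JOIN #tx AS t
   ON t.store_id    = tx.store_id
  AND t.machine_id  = tx.machine_id
  AND t.totals_date = CAST(tx.transaction_date AS date);

DROP TABLE #tx;

COMMIT TRANSACTION;
```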

I’d caution you not to get too excited about MERGE until they’ve resolved some of these bugs, and until you have read enough about it to be sure you’re not lulled into false confidence about how much “better” it is for concurrency and atomicity without additional hints. The race conditions you can work around; the bugs you can’t.

Another alternative, from Nikola’s comment

CREATE VIEW dbo.TotalsView
WITH SCHEMABINDING
AS
   -- Indexed views require named output columns and a COUNT_BIG(*).
   SELECT some_key_column(s),
          SUM(amount)  AS total_amount,
          COUNT_BIG(*) AS row_count
    FROM dbo.Transaction_Table
    GROUP BY some_key_column(s);
GO
CREATE UNIQUE CLUSTERED INDEX some_key ON dbo.TotalsView(some_key_column(s));
GO

Now if you want to write queries that grab the totals, you can reference the view directly or – depending on query and edition – the view may automatically be matched even if you reference the base table.

Note: if you are not on Enterprise Edition, you may have to use the NOEXPAND hint to take advantage of the pre-aggregated values materialized by the view.
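For example, a query against the hypothetical view above might read (sketch only):

```sql
-- NOEXPAND forces the pre-aggregated index on the view to be used,
-- which is necessary on non-Enterprise editions.
SELECT *
FROM dbo.TotalsView WITH (NOEXPAND);
```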
