Skip to content
Advertisement

Input selected row and ignore duplicate row

I have a problem that I need to insert a selected row from the table to another table based on last_update data or last_transaction_product. last_update is the data updated from the table, and last_transaction_product is the latest transaction from my shop, and then my row has duplicate data but has a different last_update column here, this duplicate data must not be inserted to my destination table here, because of the data has already inserted on destination table.

For example, I have 3 tables here.

  1. Source Table AS Table_Source
  2. Destination Table AS Table_Destination
  3. Log Table AS Table_Log

My problem is, my Source Table and Destination Table didn’t have any PK that can be used for the identity of the Row table.

  1. So here are my Table_Source
product_name shop_name last_transaction_product last_update
Clock Benefical 2020-04-15 2020-06-01
Monitor Hinezo Computer 2020-01-17 2020-06-01
Notebook LA Shop 2020-03-25 2020-06-01
  1. I have a query that inserts it to Table_Destination based on > @last_transaction_product or > @last_update, here are my Table_Destination after inserted from Table_Source, and I’m inserted it at 2020-06-02
product_name shop_name last_transaction_product last_update
Clock Benefical 2020-04-15 2020-06-01
Monitor Hinezo Computer 2020-01-17 2020-06-01
Notebook LA Shop 2020-03-25 2020-06-01
  1. After it inserted into my Table_Destination, it will save the record from the latest activity table to Table_Log, so here my Table_Log after inserted.
table_name last_transaction_product last_update last_run
Table_Source 2020-04-15 2020-06-01 2020-06-02
  1. And the next time, there was new data inserted at Table_Source
product_name shop_name last_transaction_product last_update
Clock Benefical 2020-04-15 2020-06-01
Monitor Hinezo Computer 2020-01-17 2020-06-01
Notebook LA Shop 2020-03-25 2020-06-01
Poster Maniac Shop 2020-04-27 2020-06-03
Clock Benefical 2020-04-15 2020-06-03
  1. As you can see from the newest table of Table_Source there was a duplicated data ('Clock', 'Benefical', '2020-04-15', '2020-06-03') that has a different last_update. In my case, I want to excluded this duplicated data to insert into my Table_Destination, so it will only insert the ('Poster', 'Maniac Shop', '2020-04-27', '2020-06-03'), and of course I used latest last_transaction_product and last_update from Table_Log as my parameter to inserting into Table_Destination . So my Table_Destination and my Table_Log will be like this.
product_name shop_name last_transaction_product last_update
Clock Benefical 2020-04-15 2020-06-01
Monitor Hinezo Computer 2020-01-17 2020-06-01
Notebook LA Shop 2020-03-25 2020-06-01
Poster Maniac Shop 2020-04-27 2020-06-03
table_name last_transaction_product last_update last_run
Table_Source 2020-04-15 2020-06-01 2020-06-02
Table_Source 2020-04-27 2020-06-03 2020-06-04

So here are my T-SQL now

BEGIN
    -- INSERT DESTINATION DATA --
    SET Table_Destination ON;
    INSERT INTO 
        Table_Destination
    SELECT 
        *
    FROM 
        Table_Source
    WHERE 
        last_transaction_product > (SELECT MAX(last_transaction_product) FROM Table_Log WHERE table_name = 'Table_Source')
        OR
        last_update > (SELECT MAX(last_update) FROM Table_Log WHERE table_name = 'Table_Source')

    SET Table_Destination OFF;

    -- INSERT LOG UPDATE --
    INSERT INTO Table_Log(table_name, last_transaction_product, last_update, last_run)
    VALUES('Table_Source', (SELECT MAX(last_transaction_product) FROM Table_Source WHERE table_name = 'Table_Source'), (SELECT MAX(last_update) FROM Table_Source WHERE table_name = 'Table_Source'),  @Date)
END;

Note:

  • This is the flow for inserting of my Data Data Inserting Flow
  • I don’t want this logic using the NOT IN or NOT EXISTS query, because in my case I have 10 million rows from the Table_Source and Table_Destination. Because it will make my database work too hard.

Advertisement

Answer

You should use NOT EXISTS, it is your best option because of the semi-join.

If you have problems with millions rows you should check your indexes. Table_Destination hahas no duplicates for product_name so indexing it (in both tables) is a good start and a join on it should not be terrific.

Also, there is something strange in your WHERE clause..

I will try this

EDIT

Added a condition in NOT EXISTS to get a product inserted in destination once per shop instead of only once.

Added also protection for 1st run when Table_Log is empty.

SET Table_Destination ON;

INSERT INTO 
    Table_Destination
SELECT 
    *
FROM 
    Table_Source ts,
    (
        SELECT 
            MAX(last_transaction_product) max_last_transaction_product, 
            MAX(last_update) max_last_update 
        FROM Table_Log 
        WHERE table_name = 'Table_Source'
    ) agg
WHERE 
    (
        last_transaction_product > ISNULL(max_last_transaction_product, 0)
        OR 
        last_update > ISNULL(max_last_update, 0)
    )
    AND
    NOT EXISTS (
        SELECT NULL 
        FROM Table_Destination td
        WHERE td.product_name = ts.product_name
        /* REMOVE BELOW LINE IF YOU NEED PRODUCTS IN DEST ONLY ONCE
        KEEP THE LINE IF YOU WANT PRODUCT REPEATED ONCE FOR EACH SHOP */
        AND td.shop_name = ts.shop_name 
    )

SET Table_Destination OFF;
User contributions licensed under: CC BY-SA
5 People found this is helpful
Advertisement