
How to fix slow running SSIS package

I am trying to insert a large amount of data from a table in the PROD DB into a table in the Archive DB. The tables have identical schemas; the archive table has its indexes dropped and "identity insert" turned on. I need to insert only the records that do not already exist in the Archive DB/table. I am using an "Execute SQL Task" in an SSIS sequence package and it works very, very slowly (with a 20,000-row insert batch size): 20,000 records are inserted in 10 minutes, and I need to insert 48,000,000 records in total. The SQL Server is 2016 Standard edition. Is there a solution for this?

The SQL query is:

SELECT TOP (@InsertBatchSize) s.ID, .....and other columns
FROM PRODDB.dbo.source_table AS s WITH (NOLOCK)
    INNER JOIN ArchiveDB.dbo.MissingIDsTable AS t WITH (NOLOCK)
        ON s.ID = t.ID
WHERE s.ID NOT IN (SELECT ID
                   FROM ArchiveDB..destination_table
                   WHERE IsUpdated IS NULL)


Answer

Execute SQL Task isn't suitable for data transfer: it can't batch or transform data. That's the job of a Data Flow task. A Data Flow task reads the source data with a firehose cursor and writes it to the destination using batched, minimally logged bulk operations. Its speed still depends on the source query, though: a slow query results in slow execution.

The question is missing a lot of information, such as the table schemas in both the source and target databases. From "Identity insert on" I suspect the tables have an ID column that is an IDENTITY and the primary key in the source. If you only care about new records, you can write a source query that reads only the data added since the last execution, e.g.:

SELECT s.ID, .....and other columns
FROM PRODDB.dbo.source_table AS s
WHERE s.ID > @maxId

Where @maxId is a query parameter supplied to the source query. There's no need for manual batching; SSIS handles that through the data source and destination settings (batch size, rows per batch, etc.).

This query should only be used for loading new data. For the initial copy, use a separate data flow whose source query doesn't filter anything. Something like WHERE ID > -1 would return all the data, but only after scanning the entire index; why do that when we're going to copy all the data anyway?
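
As a sketch (SomeColumn stands in for the real column list, which the question elides), the initial-load source query is just the unfiltered SELECT:

-- One-time initial load: no filter, just stream the whole table.
-- SomeColumn is a placeholder for the actual columns.
SELECT s.ID, s.SomeColumn
FROM PRODDB.dbo.source_table AS s;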

The ID should be a primary key in the target table as well. This will speed up the SELECT MAX(ID) FROM target operation that loads the parameter's value. It will also detect and prevent the inevitable duplicate-key errors; no matter how careful we are, someone will eventually make a mistake that results in duplicate data.
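
A minimal sketch of the query that loads the parameter (the table name comes from the question; ISNULL covers the first run against an empty target):

-- Run in an Execute SQL Task before the data flow and map the single-row
-- result to an SSIS variable (e.g. User::MaxId) that feeds @maxId.
SELECT ISNULL(MAX(ID), -1) AS MaxId
FROM ArchiveDB.dbo.destination_table;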

You can improve import performance by disabling the index before the insert and rebuilding it after the import completes (a disabled index is brought back with a rebuild, not a simple re-enable).
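
A sketch of that disable/rebuild pattern, assuming a hypothetical nonclustered index named IX_destination_table_ID on the target table:

-- IX_destination_table_ID is a placeholder; substitute the real index name.
-- Only disable nonclustered indexes: a disabled clustered index makes the
-- table inaccessible for inserts.
ALTER INDEX IX_destination_table_ID
    ON ArchiveDB.dbo.destination_table DISABLE;

-- ... run the SSIS data flow / bulk insert here ...

ALTER INDEX IX_destination_table_ID
    ON ArchiveDB.dbo.destination_table REBUILD;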

That’s just one way you can detect changes and copy data. Another technique is to enable Change Tracking in the source table and retrieve rows modified since the last job run.
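
A rough sketch of what Change Tracking looks like in T-SQL (database and table names come from the question; the retention settings are illustrative, and @last_sync_version would be stored by the job between runs):

-- Enable Change Tracking on the source database and table.
USE PRODDB;
ALTER DATABASE PRODDB
    SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.source_table
    ENABLE CHANGE_TRACKING;

-- Extract query: rows changed since the version saved by the previous run.
-- SomeColumn stands in for the real column list.
SELECT s.ID, s.SomeColumn
FROM CHANGETABLE(CHANGES dbo.source_table, @last_sync_version) AS ct
    INNER JOIN dbo.source_table AS s
        ON s.ID = ct.ID;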

You could also copy the modified data into a staging table and join it with the target to INSERT/UPDATE the changes; either the ID or Change Tracking can be used to find the modified rows. This has the benefit of freeing up the source quickly.
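
For example, assuming a hypothetical staging table ArchiveDB.dbo.staging_source_table loaded by the data flow, the apply step could be a MERGE (or an equivalent UPDATE plus INSERT):

-- staging_source_table and SomeColumn are placeholder names.
MERGE ArchiveDB.dbo.destination_table AS tgt
USING ArchiveDB.dbo.staging_source_table AS src
    ON tgt.ID = src.ID
WHEN MATCHED THEN
    UPDATE SET tgt.SomeColumn = src.SomeColumn   -- repeat for the other columns
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ID, SomeColumn)
    VALUES (src.ID, src.SomeColumn);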
