Kimball Method SCD Optimization

Basic Optimization

There are a couple things that the component itself will suggest you do to improve performance - sort the inputs according to the Business Key, and provide it with a guaranteed seed value for the Surrogate Key (should you configure the component to manage it).

Sorting Inputs

The component should work best when the inputs are sorted according to the business key. The sort order does not have to be exactly the same as the business key columns - but of course, making it that way is the simplest option. As always, it is a best practice to avoid using the SSIS Sort component - try to read in the flows in optimal sorted order.
In those cases where you can retrieve the data in some form of order - but that order does not exactly match the business keys, you may still be able to use it. If columns "A" and "C" are (in that order) the Business Key of the flow, having the flow sorted by columns "A, B, and C" is acceptable. However, the sort specification of one input must also be a subset of the sort specification of the other input. This means that if one input is sorted "A, B, C", then it's OK for the other input to be sorted "A, C" or "A, C, D", but it is not OK for the other input to be sorted "C, A".

Providing a Guaranteed Surrogate Key Seed

If the component is managing Surrogate Keys for you, then it will likely need to assign new surrogate keys - to all rows coming out of the New and New SCD2 outputs. In order to do so, it needs to know what values to use. To determine what seed value to use, it starts with the value you supply in the editor, then watches all the rows that enter the component from the Existing Dimension input, incrementing the seed you supplied if necessary. Since the component can't know if the seed it currently has won't result in a key collision unless it examines all of the rows on the Existing Dimension input - it holds up all processing until it's seen every row. This can result in very high memory usage, as all of the Existing Dimension input rows are cached.
In order to optimize that scenario, it's suggested that you determine the "next" Surrogate Key value through another Task prior to the Data Flow Task that contains the Kimball SCD. An Execute SQL Task with a SELECT MAX statement usually does the trick, but a Data Flow Task to extract the highest key value used may be necessary (or lowest key - depending on your keying mechanism). Store that "next" key value in a variable, supply it to the Kimball SCD component, and tell the component it's guaranteed. This frees the component from ensuring that surrogate key collisions won't happen, and allows it to process rows immediately.

Advanced Optimization

Buffer Sizes, etc.

The typical optmization advice to increase or reduce the buffer sizes may or may not have any effect on the Kimball SCD component. Internally, the component does not use SSIS buffers for caching information, and does not adhere to any of those settings. However, the component's interaction with SSIS buffers are governed by these settings, and adjusting them up or down may improve memory usage or performance - slightly. I would not expect changes to have any significant effect.

Coordinating Input Flow Arrival

The largest opportunity for improving data flow performance is in coordinating when rows arrive to the component. The effect of any adjustments here depend on many factors - here are some guidelines:

Last edited Jan 26, 2010 at 3:12 AM by toddmcdermid, version 1

Comments

Knallfrosch Sep 8, 2010 at 4:44 PM 
Hello Todd,

I'm interested in "Coordinating Input Flow Arrival", but i miss the guidelines.
Where are they gone? :)

Regards
Michael