Dimension Merge SCD Optimization

There are a few techniques that you can use to improve performance of the component. Some of these techniques will also work for other ways of processing SCDs.

Basic Optimization

The component itself will suggest that you do certain things to improve performance - sort the inputs according to the Business Key, and provide it with a guaranteed seed value for the Surrogate Key (should you configure the component to manage it).

Sorting Inputs

The component should work best when the inputs are sorted according to the business key. The sort order does not have to be exactly the same as the business key columns - but of course, making it that way is the simplest option. As always, it is a best practice to avoid using the SSIS Sort component - try to read in the flows in optimal sorted order.
In those cases where you can retrieve the data in some form of order - but that order does not exactly match the business keys, you may still be able to use it. If columns "A" and "C" are (in that order) the Business Key of the flow, having the flow sorted by columns "A, B, and C" is acceptable. However, the sort specification of one input must also be a subset of the sort specification of the other input. This means that if one input is sorted "A, B, C", then it's OK for the other input to be sorted "A, C" or "A, C, D", but it is not OK for the other input to be sorted "C, A".

Providing a Guaranteed Surrogate Key Seed

If the component is managing Surrogate Keys for you, then it will likely need to assign new surrogate keys - to all rows coming out of the New and New SCD2 outputs. In order to do so, it needs to know what values to use. To determine what seed value to use, it starts with the value you supply in the editor, then watches all the rows that enter the component from the Existing Dimension input, incrementing the seed you supplied if necessary. Since the component can't know if the seed it currently has won't result in a key collision unless it examines all of the rows on the Existing Dimension input - it holds up all processing until it's seen every row. This can result in very high memory usage, as all of the Existing Dimension input rows are cached.
In order to optimize that scenario, it's suggested that you determine the "next" Surrogate Key value through another Task prior to the Data Flow Task that contains the Dimension Merge SCD. An Execute SQL Task with a SELECT MAX statement usually does the trick, but a Data Flow Task to extract the highest key value used may be necessary (or lowest key - depending on your keying mechanism). Store that "next" key value in a variable, supply it to the Dimension Merge SCD component, and tell the component it's guaranteed. This frees the component from ensuring that surrogate key collisions won't happen, and allows it to process rows immediately.

Advanced Optimization

The following techniques require an intimate understanding of the SCD process, your business needs for the dimension table, the technical capabilities of your source system, and SSIS' Data Flow architecture. These optimization techniques are not guaranteed to work for you. If this stuff was easy, the tool(s) would already do it for you - it's not, so they can't.

Only Use Current Dimension Rows

Typical usage involves sending ALL rows from your dimension table through the Dimension Merge SCD to make persisting SCD type 1 changes easy by using the surrogate key. However, this can add unnecessary overhead. If only "current" records are retrieved from your warehouse, you can perform the type 1 updates by using the business key.

Use CDC

If your source system is on SQL Server 2008 or later, or you have a CDC (change data capture) technology or technique for your source system, you can use that to your advantage. (Maintaining a "last updated" column on OLTP tables is one such technique.)
If you supply the component with ONLY the changed rows from the source system, it will have (hopefully) much fewer rows to process. However, if configured normally, this will result in the otherwise unchanged dimension rows being sent out the "Expired" or "Deleted" outputs. There is no possible mitigation for this result without performing lookups on the source system, which defeats the purpose of CDC. Your usage of the component - and utility of the resulting dimension table - will be diminished because you can no longer expire records when they have truly been removed from the source system. (Unless you have a method of tracking this change separately!)

Hashing Business Keys

If your business key is large and/or has multiple columns, then the Dimension Merge SCD component has a lot of memory to move around when comparing rows to each other. Hashing business keys (with the Multiple Hash component, for example) may improve this comparison performance significantly.

Buffer Sizes, etc.

The typical optimzation advice to increase or reduce the buffer sizes may or may not have any effect on the Dimension Merge SCD component. Internally, the component does not use SSIS buffers for caching information, and does not adhere to any of those settings. However, the component's interaction with SSIS buffers are governed by these settings, and adjusting them up or down may improve memory usage or performance - slightly. I would not expect changes to have any significant effect.

Coordinating Input Flow Arrival

The largest opportunity for improving data flow performance is in coordinating when rows arrive to the component. The effect of any adjustments here depend on many factors - here are some guidelines:
  • If either your dimension table OR your source system flow has significantly fewer rows than the other, attempt to arrange for the smaller flow to completely arrive at the component as soon as possible. Once one of the inputs has completely arrived at the component (the immediately upstream component turns green), then the Dimension Merge SCD can optimize cache handling even more than it does normally. How do you do this? Make sure your smaller flow is as direct and high-performance as possible. For example, place any transformation of that flow into another Data Flow Task, and write to a RAW file. Read directly from the RAW file into the Dimension Merge.
  • Use the FlowSync component from SQLBI.EU to feed the component rows proportionally. The Dimension Merge SCD has to cache rows that it can't find matches for. If it receives rows (from sorted inputs) at about the same rate, then it's likely this caching can be reduced.

Last edited Apr 13, 2011 at 7:35 PM by toddmcdermid, version 1

Comments

No comments yet.