Deletions - having to load the complete data set?

Feb 11, 2010 at 6:57 PM

What is the Best practice when bringing data into the Kimball Component?

It appears you have to bring the complete data set of the Source and Dimension otherwise any records that don't appear from the Source will be deleted (marked as historic) records.

Is this the best method? my concern was relating to Performance as millions of rows coming into both the Existing Dimension and the Source, could this impact performance?


Feb 12, 2010 at 5:05 AM

That is the best method - that's what this component was designed for.  If you have an immense dimension and a (relatively) small "source" - and pulling the dimension records is prohibitively expensive, you may want to try the SCD Wizard.  It does lookups against the dimension table, therefore it doesn't load the whole thing.  However, that architecture may not work better than this.

I think you have it backwards though.  If you don't give the component all of the existing dimension records, you risk having it categorize source records as "new", not "deleted".

I believe that the best recipe for performance is to have both inputs sorted, and arrange (somehow) to deliver those inputs to the component in as "lock-step" a manner as possible, such that rows of matching business keys tend to arrive at the component at the same time.  The only mechanism I know of to actually control two separate flows within the same Data Flow is Alberto Ferrari's FlowSync component (see  The relative delivery rate of each input would depend heavily on your own situation.  In general, it is likely that keeping both in "percentage step" would work best - such that when the existing dimension input had received 10% of it's rows, the source system input should also have 10% of its rows.  But again, this depends heavily on your situation.

Feb 12, 2010 at 8:53 PM

Great!!! thanks for the advice. I will look at the methods you've suggested and also FlowSync.

yes, I meant if you have business keys in the Existing Dimension, if they don't come through in the Source then they will be deleted (marked as historic) records.


Feb 13, 2010 at 12:06 AM

This component will (correctly) mark rows that are on the Existing Dimension, but don't come in the Source System as "deleted" (expired).  The SCD Wizard will not do that.