Using Kimball SCD for ummm....creating fact trans for small fact tables

Apr 11, 2010 at 4:13 PM

Todd,

The Kimball SCD is absolutely a great timesaver and a high performing SCD object.  Thank you so much for making this available.  I am in the midst of upgrading from 1.4 to 1.5 and already look forward to the 1.6 release.

This is not a question, but rather just a brief comment on how I've used (or abused) the Kimball SCD for something it was not really intended.

At the risk of abusing the true intent of the Kimball SCD, I've morphed it to work with a smaller (> 2 million) fact table.   That is, I create a "staged" fact table and use it as the existing dimension, extract the source facts, apply the Kimball SCD, and use the outputs to create additive transactions to the real fact table. 

I used this method because I did not have another quick method for creating such transactions and this seems to fit.  My main issue with doing something else was that the source system has the capability to hard delete facts.  So, I was forced into comparing all the dimension keys for the fact in the same manner that the Kimball SCD handles business key comparisons.  

I have not moved this scheme to production yet, but the test harness seems to be working just fine in generating not only new transactions but reversing, additive transactions for any deleted measure that are discovered.

Tim

 

Coordinator
Apr 11, 2010 at 7:53 PM

Very interesting on several points...

Have you tried the TableDifference component?  I know that it's now under the auspices of CozyRoc, so it's no longer free - but it's still relatively cheap.  That component does a more "multipurpose" comparison of two "tables" (flows).

Your "off label" use is quite interesting... ;)

Apr 11, 2010 at 8:43 PM

I knew of TableDifference but as a Kimball Dimensional Modeling Stepford child the Kimball SCD component has everything I need for managing SCD's.   I really had no need to test drive TableDiff.  But, perhaps I should look at it for adding value as a generic table comparison component.

Apr 13, 2010 at 4:11 PM

OK, I'm ready to go into QA with this approach.  My single biggest finding is one that is relevant to anyone using this component on dimensions > 300k rows.  If you have the opportunity, using numeric keys greatly improves performance.  In my case the issue was exasperated because I'm going "off label" and using Kimball SCD to generate fact transactions.  By it's very nature the fact "business key" is typically a composite of more than 1 key, in my case, 7 alphameric keys.  Frankly, the component permanently hung (as in no CPU or any other thread activity) part way through the dimension comparison apparently choking on the Herculean task of hashing/comparing through so many alphameric combinations.  I was fortunate to be in a position where I could convert the 7 alphameric business keys into 7 numeric keys.  After the change, the component worked flawlessly with decent performance (throughput of 515,829 source and existing dimension rows in 5 minutes).

 

Coordinator
Apr 14, 2010 at 5:09 PM

The issue isn't necessarily either of what you've identified - but the end result is.  The "smaller" the business key you can provide, the better.  Since numeric fields are typically smaller (an integer is 4 bytes vs a 10 character unicode string which is 20 bytes) they can be stored and compared more quickly.  If your dimension has a large business key, you may want to look at the Multiple Hash component to assist in distilling that large key into a smaller surrogate BK.

The hang wasn't due to comparing keys - it's likely due to improper thread management on my part.  I hope to have that ironed out permanently with a simpler mechanism in v1.6.

Apr 14, 2010 at 5:21 PM

I confess that I at least stared at the Multiple Hash component while agonizing over what to modify in my routine.  The conclusion I came to for now is that I am not confident that hashing algorithms by themselves can avoid collisions and that I am not sure what I need to do in order to help it avoid collisions.  And, this particular component does not seem to have a high level of adoption yet.  But, I will definately keep this in mind as a future reference.  In a future release, is it possible to include a key collapsing routine into expose the output of the hashing algorithm the the Kimball SCD uses so that we could store  it as a "Kimball SCD" business key in the dimension table?

Coordinator
Apr 29, 2010 at 6:10 PM

I've heard a couple good reviews of the Multiple Hash component from people I know.  I'd recommend you take a second look at it.

And yes, having the KSCD use a hash in place of a business key is a requested feature: http://kimballscd.codeplex.com/WorkItem/View.aspx?WorkItemId=3543  But having the component "expose" its own internal key representation hasn't been one of those - although it's probably a good extension to that issue.

Apr 29, 2010 at 6:19 PM

thanks for the feedback.   What about the possibility of hashing collisions...they can occur, right?

Also, tomorrow we release our fact table for which I am using KSCDto process my facts as a "dimension".  By using KSCD, I can identify hard deletes and easily created ledger type, offsetting facts.  Also, if a measure "changes" for a prior time period, those attributes are defined as SCD 1.  From the SCD1 output I have the old staged, fact key as well as the new meaure value.  With this data, I can create new,  ledgered transactions to reflect that those measures changed by n amount today.  Good stuff for fact tables with a few million rows.

Apr 29, 2010 at 9:51 PM

Did some exploring on the whole hash collision thing...so need to answer.  Although I'm sure there are many references, for me a good concise link was found at http://sqlblog.com/blogs/michael_coles/archive/2010/04/17/find-a-hash-collision-win-100.aspx