Kimball SCD producing erratic results

Topics: Issues 3: Executing the Component (Run-Time)
Jul 27, 2010 at 3:23 PM
Hi
Configuration: Kimball SCD V1.5 32bit; SQL 2008 R2 installed locally running SSIS; Data stored on SQL2000 Server.

We have a dimension and staging table that contain identical records. The Kimball control was used to update an empty dimension table from the staging table, this works fine and at this stage both tables appear to be identical. When the package is run a second time (without the source date having been refreshed) we would expect all records to be output to 'Unchanged'. This is not the case, some records are 'Deleted' and as a result 'New' records are created. If the package is then run again the same happens but the number of records deleted, added and unchanged changes. We have run several iterations and the numbers of 'Unchanged', 'New' and 'Deleted' varies from each iteration. From the screenshots you will see that the package does nothing with the outputs other than to count them (the initial run exported the 'New' records to the then empty Dimension table.

Can this behaviour be explained?

Thanks

Roy Bland

 

Jul 28, 2010 at 2:55 PM

Roy,

I had this problem when I did not have both the dimension and source sorted correctly.

I hope this helps.

Jul 28, 2010 at 3:23 PM
Hi You're a star, I had not sorted the source data, this is working as expected now. I never thought to look for the obvious. Thanks Roy
Coordinator
Jul 31, 2010 at 1:53 AM

Wait, wait, wait!

I could understand this being the case in rsleight's scenario - IF what he means by not having the inputs "sorted correctly" he means that he had manually told SSIS the sort order of the data flow (via the Advanced Tab of an OLE DB Source for example) - but hadn't actually sorted the flow that way using an ORDER BY.

BUT - simply having the sources (either of them) unsorted shouldn't cause erratic behaviour.  I'd appreciate it if both of you could clarify.  Rsleight - did you mean what I said in the last paragraph?  Roy, was that your scenario as well?

Jul 31, 2010 at 7:18 PM
Todd, I saw the Wait, wait, wait! and could possibly add something - it has to do with SK assignments for me. Take this scenario: Simple Dimension Table for processing status: DimKey int (surrogate key, not an identity) BK1 int - (BusinessKey) Attr1 varchar(20) - (type code description) Attr2 varchar(20) - (Last Run Date) There is no type 2 required as it is running log of each processing event, so Dimension Unique Key is BK1 + Attr2 for Dimensional Purposes) Business Source: (Fields: BusinessKey, Desc, LastRunDate) 0, 'Code 0', '12/31/2010' 1, 'Code 1', '1/1/2010' 1, 'Code 1', '1/2/2010' 2, 'Code 2', '2/2/2010', Source Table SQL: select BusinessKey, Code, LastRunDate from BusinessSourceTable Order By BusinessKey, LastRunDate Special Key SQL: select -1,0,'Unknown','1/1/1901' When I load the table here are the dimension results: -1,0,'Unknown','1/1/1901' 0, 1, 'Code 1', '1/1/2010' 2, 0, 'Code 0', '12/31/2010' 1, 2, 'Code 2', '2/2/2010', 3, 1, 'Code 1', '1/2/2010' Was expecting: -1,0,'Unknown','1/1/1901' 0, 0, 'Code 0', '12/31/2010' 1, 1, 'Code 1', '1/1/2010' 2, 1, 'Code 1', '1/2/2010' 3, 2, 'Code 2', '2/2/2010', KSCD Configuration: 1. Set surrogate key in KSCD component to variable that was initialized with max(DimKey)+1 2. Set BusinessKey and LastRunDate as BusinessKeys in KSCD Component 3. Using Business Source, Existing Dimension Source, and Special Records Source 4. Using Updated and New output rows, all other outputs not set. Used Sort Task first in all input rows then with the order by in the OLE DB and no Sort Task in the source system flow (left the sorts in the other flows). Got the same results either way. I simplified this greatly, but the real source table has about 20 rows in one case and about 4,000 in another. The SK keys get set this way when reloading from scratch. This DW is not using SCD type 2 because the entire warehouse is snapshots of daily runs of the source data so in effect the SCD Type 2 changes are carried in the snapshot fact tables. Grain is 1 row for every run date for each combination of the key dimensional attributes. It is done this way so that the analytic front end can recreate the entire values of the source system on any day but stored dimensionally to save on space, albeit not as much as if we could do this with transaction facts. Its not an issue with regard to Kimball really - I use the surrogate keys as they are and so the cube builds just fine. It is kind of interesting to troubleshoot when you have to sort several different ways to find gaps in etl code. Marvin
Aug 2, 2010 at 8:15 AM

Hi Todd

That was our scenario, we had told SSIS that the data was sorted correctly but had not sorted the data using the ORDER BY clause.

Thank you for following up this enquiry.

Cheers

Roy

From: toddmcdermid [mailto:notifications@codeplex.com]
Sent: 31 July 2010 02:53
To: r.j.bland@shu.ac.uk
Subject: Re: Kimball SCD producing erratic results [kimballscd:221572]

From: toddmcdermid

Wait, wait, wait!

I could understand this being the case in rsleight's scenario - IF what he means by not having the inputs "sorted correctly" he means that he had manually told SSIS the sort order of the data flow (via the Advanced Tab of an OLE DB Source for example) - but hadn't actually sorted the flow that way using an ORDER BY.

BUT - simply having the sources (either of them) unsorted shouldn't cause erratic behaviour. I'd appreciate it if both of you could clarify. Rsleight - did you mean what I said in the last paragraph? Roy, was that your scenario as well?

Read the full discussion online.

To add a post to this discussion, reply to this email (kimballscd@discussions.codeplex.com)

To start a new discussion for this project, email kimballscd@discussions.codeplex.com

You are receiving this email because you subscribed to this discussion on CodePlex. You can unsubscribe on CodePlex.com.

Please note: Images and attachments will be removed from emails. Any posts to this discussion will also be available online at CodePlex.com

Aug 3, 2010 at 2:11 PM
Todd You are exactly right. In my scenario I had manually told ssis that I had sorted the output but then for got the order by clause.