Kimball SCD Historical Load

Apr 21, 2009 at 6:19 PM

PROBLEM

My existing dimension and source system are SQL Server 2008 tables.

In my existing dimension I added the following metadata attributes, nullable for now, to be used with the Kimball Slowly Changing Dimension component:

       [DateEffective]
       [DateExpires]
       [DimPostKey]
       [IsCurrent]
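
For reference, I added them with something along these lines (the dimension table name DimPost is just a placeholder here; the data types match the definition below):

    ALTER TABLE dbo.DimPost ADD
        [DateEffective] datetime   NULL,
        [DateExpires]   datetime   NULL,
        [DimPostKey]    int        NULL,
        [IsCurrent]     varchar(5) NULL;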

After assigning the different SCD types, my existing dimension table roughly looks like the following:

   [DimPostKey] [int] NULL,                                  SURROGATE KEY
   [PostSerialNumber] [int] NOT NULL,                        BUSINESS KEY
   [PostChangeEffectiveDate] [datetime] NULL,                SCD1
   [PostChangeEffectiveComp] [int] NULL,                     SCD1
   [PostProcessingDate] [datetime] NULL,                     SCD1
   [PostProcessingComp] [int] NULL,                          SCD1
   [PostOrganizationalUnitCode] [varchar](4) NULL,           SCD2
   [PostCategoryIDCode] [varchar](4) NULL,                   SCD2
   [PostGradeIDCode] [varchar](4) NULL,                      SCD2
   [PostTypeCode] [smallint] NULL,                           SCD2
   [PostEffectiveDate] [datetime] NULL,                      SCD2
   [PostExpirationDate] [datetime] NULL,                     SCD2
   [PostRemark] [varchar](255) NULL,                         SCD1
   [PostNatureCode] [smallint] NULL,                         SCD2
   [PostDoubleIncumbencyIndicator] [tinyint] NULL,           SCD1
   [PostPersonIndexNumber] [int] NULL,                       SCD1
   [PostFunctionalIDCode] [int] NULL,                        SCD1
   [PostFunctionalTitlePrefixCode] [smallint] NULL,          SCD1
   [PostModificationExpirationDate] [datetime] NULL,         SCD1
   [PostReportingLineSerialNumber] [int] NULL,               SCD1
   [PostCaseReportingLineSerialNumber] [int] NULL,           SCD1
   [PostAuditLogHeaderTransactionSerialNumber] [int] NULL,   SCD1
   [PostSiteID] [int] NULL,                                  SCD1
   [PostSiteDescription] [varchar](50) NULL,                 SCD1
   [PostInstallationIDCode] [varchar](4) NULL,               SCD1
   [numberank] [bigint] NULL,                                SCD1
   [FlagStatus] [varchar](5) NULL,                           SCD1
   [DateEffective] [datetime] NULL,                          SCD2 Effective Date
   [DateExpires] [datetime] NULL,                            SCD2 Expiry Date
   [IsCurrent] [varchar](5) NULL                             SCD1

To explain my data logic a little: my existing dimension is based on the uniqueness of [PostSerialNumber] (the unique identifier of a Post) plus some attributes (those with SCD2 type) on which I would like to capture historical data.

My business key is [PostSerialNumber].

Therefore my dimension can contain duplicate business keys [PostSerialNumber], but with different SCD2-type attribute values.

When I apply the Kimball SCD component, I would expect to get the following outputs:

-Number of New Rows
-Number of Updated SCD1 Rows
-Number of Expired SCD2 Rows
-Number of Unchanged Rows
-Surrogate Key Generated

But instead my results are:

-The only output populated is the invalid rows output.
-My surrogate key [DimPostKey] is not generated automatically, and neither are [DateExpires] and [DateEffective].

I'm aware that because of the repeated business key [PostSerialNumber] (even though the rows themselves are not duplicates), the Kimball SCD component sends the rows to the invalid output.

My questions:

1 - Is there a way to make the Kimball SCD component recognize row uniqueness so that I can load my historical data in one pass? The business key may be duplicated, but the rows differ because other attributes change for the same business key [PostSerialNumber].

2 - How do [DateExpires] and [DateEffective] need to be handled? I know [DateExpires] can be NULL, but what about [DateEffective]? Do I need to set up an effective date first, or can the Kimball SCD component populate it?

I have already applied configuration options such as "allow nulls" in the "Existing Dimension Column Usage" tab, since my table columns allow NULL values.

Of course, my source system has the same data structure.

Below is a data sample of my existing dimension, for reference.

I may be misunderstanding part of the process or missing something. Help/guidance will be much appreciated.
Thx,
Cyr.


PostSerialNumber PostChangeEffectiveDate PostChangeEffectiveComp PostProcessingDate PostProcessingComp PostOrganizationalUnitCode PostCategoryIDCode PostGradeIDCode PostTypeCode PostEffectiveDate PostExpirationDate PostRemark PostNatureCode PostDoubleIncumbencyIndicator PostPersonIndexNumber PostFunctionalIDCode PostFunctionalTitlePrefixCode PostModificationExpirationDate PostReportingLineSerialNumber PostCaseReportingLineSerialNumber PostAuditLogHeaderTransactionSerialNumber PostSiteID PostSiteDescription PostInstallationIDCode numberank FlagStatus DateEffective DateExpires DimPostKey IsCurrent
1 29/03/2009 0:00 118627200 01/04/2009 14:18 118316511 7074         P    3 4 29/03/2009 0:00 31/03/2009 0:00 Post created for R. Abaza 10 0 NULL 326 NULL 31/03/2009 0:00 NULL NULL 126659641 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL
1 01/07/2005 0:00 236736000 23/06/2005 15:11 237372500 6172          D    2 2 01/01/1976 0:00 30/06/2006 0:00 Extended STA No.9899-CUA-0178-ST01 dated 30/06/98 10 0 NULL 1 NULL 30/06/2006 0:00 NULL NULL 71482554 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL
1 01/07/2002 0:00 331430400 14/06/2002 14:46 332846014 1702          D    2 2 01/01/1976 0:00 30/06/2006 0:00 Extended STA No.9899-CUA-0178-ST01 dated 30/06/98 10 0 NULL 1 NULL 30/06/2003 0:00 NULL NULL 71482554 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL
4 01/01/1976 0:00 1167609600 16/09/1993 19:03 608705805 1702        ASG  1 2 01/01/1976 0:00 30/06/2006 0:00 NULL 2 0 NULL 1 NULL 30/06/1997 0:00 NULL NULL 71482554 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL
4 01/07/2002 0:00 331430400 14/06/2002 14:46 332846014                     1702            FS  6 2 01/01/1976 0:00 30/06/2009 0:00 Extended STA No.9899-CUA-0178-ST01 dated 30/06/98 10 0 NULL 1 NULL 30/06/2003 0:00 NULL NULL 113749520 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL
8 01/07/2008 0:00 142041600 09/06/2008 17:09 143880657 6172          FS   6 2 01/01/1976 0:00 30/06/2009 0:00 Extended STA No.9899-CUA-0178-ST01 dated 30/06/98 10 0 NULL 1 NULL 30/06/2009 0:00 NULL NULL 113749520 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL
8 01/07/2002 0:00 331430400 14/06/2002 14:46 332846014 1702          FS   6 2 01/01/1976 0:00 30/06/2007 0:00 Extended STA No.9899-CUA-0178-ST01 dated 30/06/98 10 0 NULL 1 NULL 30/06/2003 0:00 NULL NULL 84785798 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL
9 01/07/2006 0:00 205200000 08/06/2006 12:26 207142388 6172          FS   6 2 01/01/1976 0:00 30/06/2007 0:00 Extended STA No.9899-CUA-0178-ST01 dated 30/06/98 10 0 NULL 1 NULL 30/06/2007 0:00 NULL NULL 84785798 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL
9 01/07/2002 0:00 331430400 14/06/2002 14:46 332846014    1702             FS   6 2 01/01/1976 0:00 30/06/2005 0:00 Extended STA No.9899-CUA-0178-ST01 dated 30/06/98 10 0 NULL 1 NULL 30/06/2003 0:00 NULL NULL 59626641 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL
10 01/07/2004 0:00 268272000 15/06/2004 11:16 269613803  6172           FS   6 2 01/01/1976 0:00 30/06/2005 0:00 Extended STA No.9899-CUA-0178-ST01 dated 30/06/98 10 0 NULL 1 NULL 30/06/2005 0:00 NULL NULL 59626641 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL
10 10/09/2001 0:00 356832000 20/05/2002 16:01 335001505  152           FS   6 2 01/01/1976 0:00 30/06/2009 0:00 Extended STA No.9899-CUA-0178-ST01 dated 30/06/98 10 0 NULL 1 NULL 31/12/2001 0:00 NULL NULL 113749520 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL
10 01/07/2008 0:00 142041600 09/06/2008 17:09 143880656 6172           FS   6 2 01/01/1976 0:00 30/06/2009 0:00 CLC: as instructed by A. TANMIZI and K. St. LOUIS. Post is returned to the original Org Unit. 10 0 NULL 1 NULL 30/06/2009 0:00 NULL NULL 113749520 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL
10 01/01/2002 0:00 347068800 09/10/2002 10:49 322751451 6082            FS   6 2 01/01/1976 0:00 30/06/2009 0:00 CLC: as instructed by A. TANMIZI and K. St. LOUIS. 10 0 NULL 1 NULL 31/01/2002 0:00 NULL NULL 113749520 8 UNHQ - New York UNHQ 1 TRUE NULL NULL NULL NULL

 

Coordinator
Apr 21, 2009 at 11:51 PM
I'm a little confused - but let's see if we can work through this.  First, I want to understand what you're attempting to do here - the big picture.  I can't answer your questions in a straightforward manner - sorry - there's not enough as well as too much information here!

If I understand correctly, you already have a dimension table in SQL Server - with (roughly) the columns you've listed above.  You've got a package that retrieves all the rows from that dimension table using an OLE DB Source component (or equivalent).  You have attached the output of that to the "Existing Dimension" input of the Kimball SCD component.  You have added another Source component to retrieve all of the rows from your other SQL table that contains the data from your OLTP system, and you have directed the output of that component to the "Source System" input of the Kimball SCD component.  You have configured the Kimball SCD component's first tab - the "Existing Dimension Input Column Usage" tab - to identify the Surrogate Key and other column types as you've indicated above.  Please let me know if any of that description doesn't sound right to you.

First - I see a few things that may not be set correctly on your first tab.  Please review them, and let me know if you agree, or why you have identified them the way you have:
- [DimPostKey] is allowed to be NULL.  Typically, it's "very bad" for a Surrogate Key column to contain NULLs - I'd recommend you change your dimension table definition to require values in this column.
- Most of your other columns allow NULLs.  I've allowed that behaviour in the component for rare cases - but if you review the Kimball methodology, allowing NULL values in dimension attributes is generally frowned upon, as business users don't fully comprehend what it means, or what effects it has on totals.  I would recommend altering your dimension table definition to make all columns not nullable.
- I see an [IsCurrent] column - but it's marked as an SCD1.  This seems counter-intuitive to me.  Every row for the same business key in your dimension will have the same value for this column - so they'll all be "current" or none of them will be.  There is a special Usage Type that may be more appropriate - the SCD2 Current Record.
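
To make the first two of those changes, something along these lines would do it - purely illustrative, substitute your actual dimension table name (I'm using DimPost as a placeholder), and note that any existing NULLs in a column must be populated (or those rows removed) before the ALTER will succeed:

    ALTER TABLE dbo.DimPost ALTER COLUMN [DimPostKey] int NOT NULL;
    -- repeat for each attribute column you decide to make non-nullable, e.g.:
    ALTER TABLE dbo.DimPost ALTER COLUMN [PostRemark] varchar(255) NOT NULL;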

Second - you say that your source system has the same structure as the dimension table.  That is not typical, and I suspect that's not really the case.  Your source system should have no concept of "effective" and "expiry" dates, and should therefore only have ONE record per Business Key.  The Business Key should be a unique constraint in your source system.  Your source system should also NOT have a surrogate key column, for the same reasons, nor an [IsCurrent] column.

Your description of your data having multiple rows with the same Business Key is completely valid - that's what a dimension table typically looks like.  The group of rows that have the same Business Key should all have mutually exclusive time periods identified in the [Date Effective] and [Date Expiry] fields - none of the time periods should overlap.  You had a question about what the "first"/"oldest"/"earliest" record should have for its [Date Effective] value - that's up to you.  Configure what you want that value to be (NULL is OK) on the Options tab.  The same goes for the "last"/"newest"/"latest" row of the group - you choose what value you want in there (NULL is OK), and tell the component on the Options tab.

Now - your #1 problem is that the component is sending every row to the "Invalid Input" output.  This means that the component detected serious problems with every single row it saw coming in, and "rejected" them all.  There are several reasons that it can reject rows - but you should notice there is a new column on that output that should tell you why the Kimball SCD rejected the row.  It could be NULLs where there shouldn't be (like in your Surrogate Key field), or duplicate business keys for example.  Let me know what the message is, and that will allow us to move further ahead.
Apr 22, 2009 at 3:55 PM

Thanks Todd for your guidance. Below, in brackets and bold, are additional clarifications.

Q: If I understand correctly, you already have a dimension table in SQL Server - with (roughly) the columns you've listed above.  You've got a package that retrieves all the rows from that dimension table using an OLE DB Source component (or equivalent).  You have attached the output of that to the "Existing Dimension" input of the Kimball SCD component.  You have added another Source component to retrieve all of the rows from your other SQL table that contains the data from your OLTP system, and you have directed the output of that component to the "Source System" input of the Kimball SCD component.  You have configured the Kimball SCD component's first tab - the "Existing Dimension Input Column Usage" tab - to identify the Surrogate Key and other column types as you've indicated above.  Please let me know if any of that description doesn't sound right to you.
A: [Absolutely correct]

Q: First - I see a few things that may not be set correctly on your first tab.  Please review them, and let me know if you agree, or why you have identified them the way you have:
- [DimPostKey] is allowed to be NULL.  Typically, it's "very bad" for a Surrogate Key column to contain NULLs - I'd recommend you change your dimension table definition to require values in this column.
A: [I agree. Good catch, I've changed it.]

Q:- Most of your other columns allow NULLs.  I've allowed that behaviour in the component for rare cases - but if you review the Kimball methodology, allowing NULL values in dimension attributes is generally frowned upon, as business users don't fully comprehend what it means, or what effects it has on totals.  I would recommend altering your dimension table definition to make all columns not nullable.
A: [I've changed it. However, would there be any implication when populating my dimension if attributes in my source contain a mix of NULLs and valid data?]

Q:- I see an [IsCurrent] column - but it's marked as an SCD1.  This seems counter-intuitive to me.  Every row for the same business key in your dimension will have the same value for this column - so they'll all be "current" or none of them will be.  There is a special Usage Type that may be more appropriate - the SCD2 Current Record.
A: [I agree. My mistake. I've changed it.]

Q: Second - you say that your source system has the same structure as the dimension table.  That is not typical, and I suspect that's not really the case.  Your source system should have no concept of "effective" and "expiry" dates, and should therefore only have ONE record per Business Key.  The Business Key should be a unique constraint in your source system.  Your source system should also NOT have a surrogate key column, for the same reasons, nor an [IsCurrent] column.
A: [That's not the case; I expressed myself poorly. I simply meant that every dimension column maps to a source column except the three metadata attributes mentioned above. Indeed, the source system doesn't have [DateEffective], [DateExpires] or [IsCurrent].]

Q: Your description of your data having multiple rows with the same Business Key is completely valid - that's what a dimension table typically looks like.  The group of rows that have the same Business Key should all have mutually exclusive time periods identified in the [Date Effective] and [Date Expiry] fields - none of the time periods should overlap.  You had a question about what the "first"/"oldest"/"earliest" record should have for its [Date Effective] value - that's up to you.  Configure what you want that value to be (NULL is OK) on the Options tab.  The same goes for the "last"/"newest"/"latest" row of the group - you choose what value you want in there (NULL is OK), and tell the component on the Options tab.

Q:Now - your #1 problem is that the component is sending every row to the "Invalid Input" output.  This means that the component detected serious problems with every single row it saw coming in, and "rejected" them all.  There are several reasons that it can reject rows - but you should notice there is a new column on that output that should tell you why the Kimball SCD rejected the row.  It could be NULLs where there shouldn't be (like in your Surrogate Key field), or duplicate business keys for example.  Let me know what the message is, and that will allow us to move further ahead.
A: [After applying the changes following your advice, I still get only the invalid output, and the error description message is: "Duplicate Business Key In Source_System".]

Q: I'm a little confused - but let's see if we can work through this.  First, I want to understand what you're attempting to do here - the big picture.  I can't answer your questions in a straightforward manner - sorry - there's not enough as well as too much information here!
A: [I am trying to load my historical data in one pass, and I must apply SCD to do so. Instead of capturing the various outputs, I'm receiving only the invalid output because of the repeated business key. How do I deal with it? As you can see in the data reference previously submitted, I have, for instance, three rows with the same business key "1" but with different other attributes. I can't send only one row (say the most recent one) to the Kimball SCD, because I need the two other historical rows with the same business key.]

Please let me know if I'm still unclear.
Thx,
Cyr.
 

Apr 22, 2009 at 5:55 PM

Hi Todd,

Additional note for the last question, regarding the holistic view of my problem:

Q: I'm a little confused - but let's see if we can work through this. First, I want to understand what you're attempting to do here - the big picture. I can't answer your questions in a straightforward manner - sorry - there's not enough as well as too much information here!

A: I am trying to load my historical data in one pass, and I must apply SCD to do so. Instead of capturing the various outputs, I'm receiving only the invalid output because of the repeated business key. How do I deal with it? As you can see in the data reference previously submitted, I have, for instance, three rows with the same business key "1" but with different other attributes. I can't send only one row (say the most recent one) to the Kimball SCD, because I need the two other historical rows with the same business key. How do I create [DateEffective] and [DateExpires] for those rows based on the [PostChangeEffectiveDate] attribute of my source?

Your Kimball SCD 2008 sample generates constant dates (1753-01-01 and 9999-12-31).


Thanks,
CM.

"Pray in confidence and always look ahead,regret and sorrow only narrow your mind"
Cyrille L. Momo, 2005.
Coordinator
Apr 22, 2009 at 7:48 PM

OK - thanks for that info, Cyr - I think I understand the issue.

I think it boils down to this:
A) Your dimension table is empty.
B) Your "source" contains a "change log" of business entity "states".  Essentially, it contains multiple versions of each business entity from when your system started collecting data up until now.
C) The Kimball SCD component is expecting a "snapshot" of the Source System data for a specific date.

What I think you're trying to do is what I call "loading a change stream".  You basically want to start out the dimension table with nothing in it, and feed the component one row from your source system at a time.  Each row the component gets from the source system gets checked against what's in the dimension table (right now) and then the component decides what to do with it.  The Kimball SCD doesn't do that (although I believe the SCD Wizard included in SSIS does).  The Kimball SCD component wants you to give it a "snapshot" of the state of your Source System.  It will compare that state with the state of your dimension table.  Since it's expecting a "snapshot" of the source system table, it's also expecting that there's only one instance of each "business entity" (identified by the Business Key(s)).  If it sees two records from the Source System with the same business key, it has to assume that your source data has a referential integrity problem (since you ought to be able to put a unique index on a business key in your source system table).

The main problem that I have with a "change stream" type of SCD processing (and one of the problems with the Wizard) is that you can never "expire" a dimension member.  Because you never get a row in the change stream when a row is deleted from your source system - there's no "updated row" to pass on.

Anyway, back to your issue and how to solve it!  There is a way you could make this work.  I believe I suggested this to someone else over on the MSDN forums and it worked for them. 
1. Take your "change stream" source, and select the DISTINCT "effective dates" from it.  (Execute SQL Task -> object SSIS variable.)
2. Iterate over those dates, placing the date in a "loading date" SSIS variable. (Foreach Loop over the object SSIS variable which is actually an ADO recordset.)
3. Place a Data Flow inside the Foreach.
  3a. Place a source to read your dimension table.
  3b. Place a source to read your Source System table - filtered on the "loading" date.
  3c. Multicast your Dimension Source, and Merge Join it with your Source System (full join).  Pick ALL COLUMNS from BOTH inputs.  (You'll get 2x the columns.)
  3d. Use a Derived Column after the Merge Join to create a 3rd complete set of columns.  Use ISNULL on the Business Key column(s) from the Source System to determine if the Source System columns have data.  If they do - set ALL of the 3rd set of columns to be the Source System column values.  If the Source System BK is NULL, use the Dimension column values.
  3e. Put a Kimball SCD component in the flow, attaching the Derived Column output to the Source System input, and the Multicast output to the Existing Dimension input.

The above process will basically turn your "change stream" back into a "snapshot".  If you "get it", then great!  If it doesn't make sense or just sounds crazy, just let me know and I'll explain more.  (Why waste time explaining if you already "get it"? :) )
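
To make steps 1 and 3b a bit more concrete, here's a rough T-SQL sketch.  The table name SourcePost is just a placeholder, and I'm assuming (from your earlier posts) that PostChangeEffectiveDate is the date your change rows are keyed on:

    -- Step 1: query behind the Execute SQL Task (return the full result set into an
    -- Object variable that the Foreach Loop then iterates over)
    SELECT DISTINCT PostChangeEffectiveDate
    FROM dbo.SourcePost
    ORDER BY PostChangeEffectiveDate;

    -- Step 3b: query for the OLE DB Source feeding the Source System input inside the
    -- Data Flow, with the "loading date" variable mapped to the ? parameter
    SELECT *
    FROM dbo.SourcePost
    WHERE PostChangeEffectiveDate = ?;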

Apr 23, 2009 at 1:39 PM
Hi Todd, you have pinpointed my issue exactly. I will try your solution and report back the outcome.
Thx,
Cyr.
Apr 23, 2009 at 10:59 PM
Hi Todd,
I've followed your steps and am still getting invalid output. Please let me get clarification on this point:
3d. Use a Derived Column after the Merge Join to create a 3rd complete set of columns.  Use ISNULL on the Business Key column(s) from the Source System to determine if the Source System columns have data.  If they do - set ALL of the 3rd set of columns to be the Source System column values.  If the Source System BK is NULL, use the Dimension column values.
To translate it, for each derived column expression I came up with something like:

ISNULL(PostSerialNumber_source_system) ? PostSerialNumber_dimension : PostSerialNumber_source_system

PostSerialNumber being my BK

Once again thanks for your help.
Cyr.

Coordinator
Apr 24, 2009 at 6:09 AM
You are close.  That expression will probably work, but this one is better for your "PostSerialNumber" column:

ISNULL(DimPostKey_source_system) ? PostSerialNumber_dimension : PostSerialNumber_source_system

To do the equivalent thing for PostChangeEffectiveDate, this expression would be appropriate:

ISNULL(DimPostKey_source_system) ? PostChangeEffectiveDate_dimension : PostChangeEffectiveDate_source_system

Note that I use the same column in the ISNULL - the first column that will only ever be NULL if that row does not exist in the Source System.  If you'd used another column, you may get incorrect results because some of your columns may allow NULLs.  (Or maybe you changed that.  Regardless, this will work.)

If this kind of expression doesn't work - and you still get rows sent to the invalid input output, let me know what reason the component gives for rejecting them.
Apr 24, 2009 at 2:42 PM
Thank you so much , I will keep you posted
Cyr.
Apr 27, 2009 at 10:12 PM

Hi todd,
Still having an issue. All my dimension rows are NULL, and the reason for rejection is: "Invalid Business Key fields".
It seems weird, because at the source, when I read my source system table filtered on the "loading date" variable, I get a unique business key (PostSerialNumber). So I'm wondering why the business key could be invalid. I fixed the NULLs as well.

I suspect that I'm missing something related to the container. As a test, I removed the data flow task from the container and created a test source system table with a unique business key, and it worked fine. I then put the data flow task back into the container... and got no output except one invalid row, which I could see because I hooked up a data viewer - no "new", "unchanged", "expired", etc.

Thanks in advance for your help!
Cyr.

Coordinator
Apr 28, 2009 at 6:45 AM
Hi Cyr,

I'm afraid I don't understand what you're currently describing.

I think your root issue is you need to understand what the component is doing.  The best thing you can do to solve your problem is to:
1. Reduce the amount of data going in, so you have a firm grasp of what that data is like.  I mean reducing the data to something like ten rows per input or LESS.  As few rows as possible so that you can clearly see what is going on, without any doubt on your part that you understand what the component is doing with that data.  This will also help me to help you - because you can (relatively) easily write that information here, since it will only be a very few records.
2. Place data viewers on each input and each output of the component.  Again, this will help you understand where the data is going, and what it looks like.
Apr 28, 2009 at 8:31 PM
Thanks Todd,
I was already using a small pool of data, but I hadn't placed data viewers on all my outputs. The Foreach Loop container was running fast, and I couldn't see the row counts flashing on my different data flow paths. Also, my source was returning unexpected data... Now it works, but as follows:
1 - The Kimball component receives the data extracted during the first loop, processes it and outputs results (new rows). Fine so far...
2 - However, when the data extracted in the second loop is sent to the Kimball component... nothing happens.
3 - With the data extracted in the third loop, nothing happens again... and so on.
Any idea? :-)
Thanks Again,
Cyr.
Coordinator
Apr 29, 2009 at 5:47 PM
You may be experiencing the same kind of issue as Workitem 2812.  Please review that and use the described workaround (use an Execute Package Task) to see if that works for you.  The issue is on my to-do list to investigate, and I'll add this discussion as a reference to that.
Apr 30, 2009 at 7:24 PM
Hi Todd,
It worked for me!
Another question, regarding enhancements:
I assigned the SCD2 type to several columns, and there are changes. But my "Row Change Reason" only captures the change on one column, which is the "effective date". Is there any way to see the row change reason for the other SCD2 columns?
Coordinator
May 4, 2009 at 10:46 PM
At the moment, the component only shows the first column it detects that has an SCD2 change to it.  I've done this for a couple reasons:
1. The more information I put in that "row change reason" column, the more memory I consume for EVERY row that goes out that output.  Remember, string columns in SSIS are NOT varchar-like, they consume the defined amount of bytes regardless of whether they're space padded or not.
2. The component performs "lazy" evaluation on the row changes - why check every column when we know that (at least) one column is changed?  I don't know how much of a performance hit you'd take by checking every column - but I'm sure there would be a hit.
May 6, 2009 at 3:10 PM
Edited May 6, 2009 at 3:15 PM
I agree with your lazy evaluation approach, as it is the same one I have used within a Conditional Split component to perform this logic (using a conditional OR in which the first statement - column - that evaluates to true ends the processing).
This has me thinking more about what would be desirable for the row change reason from both an SCD1 and SCD2 perspective.
For both SCD1 and SCD2, I don't think using Row Change Reason to log the first evaluated changed column is useful.
In both cases most people, Cyr included, would expect a complete list of changed columns.

With SCD2, since you are not overwriting the previous version of the record, you can easily compare the previous record with the current record to see what has changed.
As such, I'd prefer a simple customizable message along with a separate column that shows the previous record's Audit ID.
This latter column would permit a parent-child view to be easily created to view changes.
In the past I've assigned a value of "Expired record replaced by record 123" to the previous record and a value of "Current updated record replacing record 101" to the new record.
Splitting this into row change reasons (customizable by output type) like "Expired Record" and "Current Updated Record", along with another column referencing the record that has been replaced, would be useful.
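
A rough sketch of the kind of view I mean - DimPost is a placeholder table name, and PreviousDimPostKey is the hypothetical column holding the replaced record's surrogate key:

    CREATE VIEW dbo.vw_DimPostChangeHistory AS
    SELECT  cur.DimPostKey                  AS CurrentKey,
            prv.DimPostKey                  AS ReplacedKey,
            cur.PostSerialNumber,
            prv.PostOrganizationalUnitCode  AS PreviousOrgUnit,
            cur.PostOrganizationalUnitCode  AS CurrentOrgUnit
    FROM    dbo.DimPost cur
    JOIN    dbo.DimPost prv ON prv.DimPostKey = cur.PreviousDimPostKey;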

With SCD1 you overwrite the previous version of the record and cannot compare the previous record with the current record to see what has changed.
In the past I've assigned a value of "Initial Record" to the original or first record and assigned a value of "Updated Initial Record" to the updated SCD1 record.
Perhaps an optional "Row Audit Output" could be added to enable the original row to be preserved to another Audit version of the Dimension.
This would enable a joined view of the current and older versions of the SCD1 records, with the MAJOR bonus of enabling a way to restore a data mart back to a previous point in time.
The restore would be done simply by (see the sketch after this list):
1. Removing all records having an effective date greater than the specified date.
2. Copying back the applicable SCD1 records from the Audit version of the Dimension effective at that point in time.
3. Changing the Current indicators for the last records to 1 (or Y or Yes).
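
In rough T-SQL terms the restore might look like the following - DimPost and DimPost_Audit are placeholder names, and the audit table layout (including an AuditDate column recording when a row was replaced) is hypothetical:

    DECLARE @RestoreDate datetime = '2008-06-30';   -- the point in time to roll back to

    -- 1. Remove dimension rows that became effective after the restore point
    DELETE FROM dbo.DimPost
    WHERE DateEffective > @RestoreDate;

    -- 2. Copy back the SCD1 values preserved in the audit copy
    --    (if a row was overwritten more than once after the restore point, the earliest
    --    preserved copy should win; that refinement is omitted here for brevity)
    UPDATE d
    SET    d.PostRemark = a.PostRemark              -- repeat for each SCD1 column
    FROM   dbo.DimPost d
    JOIN   dbo.DimPost_Audit a ON a.DimPostKey = d.DimPostKey
    WHERE  a.AuditDate > @RestoreDate;

    -- 3. Flag the latest remaining row per business key as current
    UPDATE d
    SET    d.IsCurrent = 'Y'
    FROM   dbo.DimPost d
    WHERE  d.DateEffective = (SELECT MAX(x.DateEffective)
                              FROM   dbo.DimPost x
                              WHERE  x.PostSerialNumber = d.PostSerialNumber);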

 
Coordinator
May 8, 2009 at 9:53 PM

I like where you're going with that - specifically the ability to "roll back" an entire SCD processing run "easily".  It's too much for me to think about at the moment - but I've added an Issue for it (2995).  Doubtful I'll do it in v1.5 - there's too much more important stuff in there for now, but I don't want to forget it.

Aug 24, 2009 at 2:36 PM
Edited Aug 24, 2009 at 2:39 PM

Hi there,

I'm having the same 'problem' in my project.

We have a 'Source-Environment' which holds the full history of all changed attribute rows in the source. It is a kind of copy of the source system over time, including a registration date and a historical date.
I have a function with a parameter (@Date) which shows me the state of the source system on a given day.
This function is the Source System input for the component (only one instance of each business key allowed). I'm looping through all the dates with a Foreach Loop container, getting the dates into an object variable. This works fine, and the inserts/updates work fine.
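
Roughly, the function looks something like this (the table and column names are simplified for illustration):

    -- Inline table-valued function returning the state of the source on a given day.
    -- SourceHistory, RegistrationDate and HistoricalDate are illustrative names.
    CREATE FUNCTION dbo.fn_SourceStateAsOf (@Date datetime)
    RETURNS TABLE
    AS RETURN
    (
        SELECT  h.BusinessKey,
                h.Attribute1,
                h.Attribute2
        FROM    dbo.SourceHistory h
        WHERE   h.RegistrationDate <= @Date
          AND  (h.HistoricalDate IS NULL OR h.HistoricalDate > @Date)
    );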
The problem is that this method is a little bit slow. I'm processing a small dimension with 138 'historical records' in 4 minutes - imagine a dimension of 1,000,000 records taking more than a day!

It would be nice to handle multiple rows with the same business key without a Foreach loop - or is there a better method?

Coordinator
Aug 26, 2009 at 10:06 PM

Sorry - there really isn't a nicer way to do what you want.

Issue #304 should describe what you want the component to do.  Unfortunately, modifying/adding this capability is pretty involved.  It's basically a different paradigm for receiving source row information.  Currently, this is one of the few major differences between this component and the SCD Wizard in SSIS with regards to how they approach processing SCDs.  The SCD Wizard does this "streaming" approach, where only changed records are introduced to the component, and multiple "versions" of the same business key can be supplied.

The thought process I had when designing this component (and not liking the SCD Wizard) was that this "streaming" type of process was (I thought) only a likely activity on the initial load of a dimension table from "historical" OLTP data.  In my case, even that wasn't true - I have "snapshots" of historical data, not a "changelog" - so designing the component the way I did made complete sense to me.  Add on to that the fact that using a "change stream" has no simple way of representing the fact that rows get deleted... and suddenly your Data Warehouse can't provide accurate business entity counts!  I didn't like any of that, so I designed this component to operate on "complete snapshots" rather than changestreams or something similar.

Is your scenario an "initial load" situation?  If so - my "best" response is to "suck it up" and take the performance hit.  Your only other alternatives are to use the stock SCD Wizard - which would likely hit you with worse performance (you'd have to try), or roll your own SCD process using Lookups and Conditional Splits...