Complete Delete before starting Inserts?

Topics: Possible Bugs/Missing Features?
Jan 5, 2011 at 12:37 PM
Edited Jan 5, 2011 at 12:40 PM

Hi guys, I ran into a problem today with my usage of v1.5 Kimball SCD.

I am updating my target table which has some additional unique constratints on it. For simplicity lets say it is a User table and there is a house key of Id, business key of SourceId and I have a username field which I also want to ensure is unique. Again this is a table where I am only using SCD type 1.

So someone deletes an entry in the source but creates a new entry with the same login. When the SCD component runs it will fail as it will try to add the new row before it has deleted the old one.

I know I have options to work around this but none that I can think of are very nice. If someone has a simple work around to this I'd be glad to hear it. The best I think I have is to write inserts to a raw file, then run that afterwards, but it doubles the number of dataflows.

My question is, wouldn't it be useful for there to be an option to hold back all inserts until all deletes had been completed. I know the component can only tell when it has sent all the data to the next component and maybe there isn't a good way to do this (could it poll/receive an event of a executable/dataflow item given in configuration)?

It seems something that could be quite a common requirement for me as I always like to have as many unique constraints as possible to protect against bad ETL procs. Even if I was doing a SCD type 2 on this I would want to have a unique constraint on Username and ExpiryDate and would experience the same problem there.

Feb 22, 2011 at 10:15 PM

It doesn't really sound like you're populating a Data Warehouse with this component.  A dimension table doesn't require any columns to be unique except for the surrogate key, and combination of the business keys and SCD2 dates.

There's no way for the component to "wait" before inserting anything - because it doesn't actually insert anything.  Subsequent components do.

If you're certain that the "net result" of all the operations will be correct, could you disable the constraints until the dataflow completes, then re-enable them?  (I can't say whether that would work - it's up to you to be comfortable with whether the net result will always be correct - according to your logic.)

Feb 23, 2011 at 9:54 AM

Thanks for the reply todd.

Yeah I understand that there isn't a simple way for this to happen. However what about my suggestion about you being able to say to kscd - "don't start sending inserts until pipeline component "DeleteRows" and "UpdateRows" have completed". You can pick the 2 pipeline components from a droplist.

I appeciate that dim's don't "require" uniqueness, however that doesn't mean that they wouldn't be useful or helpful in ensuring good data - e.g. including scd2 end date in the uniqueness against the fields that the source system would have stated should be unique will ensure that original data constraints are preserved through good and bad etl processes - would it not? I come from a philosophy that says keep your database tables as constrained as you can and everything else will be simpler from there (less testing failures, less fixing bad data on live systems etc). Is there some reason why this isn't a good idea in dimensions that I am not seeing?

Your final suggestion would require some kind of transaction and my luck with enabling DTC transactions so far through my ETL has not been great resulting in me having to turn most of them off due to conflicts of update/deletes occurring at the same time on different connections. My suggestion wouldn't require the additional overhead or complexity of transactions or DTC and so in many cases cleaner.

Apr 8, 2011 at 8:07 PM

Interesting.  I know that Jamie Thomson has asked the SSIS team to work on "flow synchronization" - like ensuring one component completes before another starts, etc. - but that's a very difficult issue.  Your suggestion of "holding up" output might work - but there's no guarantee.

For example, you might tell my component to hold up outputting data on output A until it's sent everything to output B.  Fine and dandy.  But if you happen to have other transforms on output B... the (delayed) rows pushed out output A may reach their final destination before the last rows pushed out output B get to their final destination.  I can't control that.

But you can.

If you send the rows you want to delay to a RAW file (or SQL table), then you can place another data flow (or Execute SQL Task) constrained to operate AFTER the data flow with the SCD in it.  That's what I do with UPDATES.  Push them to a RAW file or table, then do a batch update in a subsequent task.