Performance Improvement?

Coordinator
Nov 20, 2008 at 10:18 PM
Edited Nov 20, 2008 at 10:19 PM
Hi everyone,

As you could probably guess from my statement on the Home Wiki page, I've seen substantial performance improvements over the stock SCD component.  However, I want to know if my results are representative of what others are experiencing.  Please post a little information about what performance differences - good or bad - you've seen with this component versus the stock SCD.  Some of this is purely for interest's sake, but I'd also like insight into what kinds of loads I can/should optimize the component for further.  Please put up whatever information you can; here are some suggested pieces that I think would be informative:

How does the total processing time compare?
How does memory usage compare?
What SQL version, OS version, and hardware are you running on?
What does your load look like?  (As in:)
    How many rows are on each of your inputs?  
    How many columns total on your dimension table?
    How many business key columns?
    Are your BKs strings, dates, ints?
    On a typical run, how many rows get directed to which outputs?

I appreciate any information you're willing to share, and I'm sure that there are other "browsers" here who would appreciate the information as well.
(My test hardware is currently being rebuilt, I'll be posting my numbers ASAP.)
Coordinator
Jan 16, 2009 at 6:38 PM
The next release is coming, and I've unintentionally improved the performance of the component.  By unintentionally, I mean that I didn't intend to work on performance - the need grew organically from my DW project and my not wanting to wait around.

I will be including a "sample database and package" in the next release - hopefully with a large enough data set that I can do performance comparisons with it, and you can check them.  Regardless of whether or not that's possible, I have my new hardware, and will be benchmarking this component versus the stock SCD component and posting those results.

By the end of January!
Coordinator
Jan 21, 2009 at 5:45 PM
Edited Jan 31, 2009 at 9:07 PM
With my latest build (soon to be released as v1.4), here are some of my numbers.  I'm currently running Vista x64 on a Dell Precision M6400, T9400 (Core2 Duo 2.5GHz), with 8GB.  The data set I ran will be available with the release, but to be brief, I constructed it by joining AdventureWorks' SalesOrder and SalesOrderDetail tables to arrive at a 121,317-member dimension table (a 13MB text file).  (Never mind that the sales order data isn't what I'd call a dimension - it's just the largest data set I could easily build from publicly available data.)  The dimension table was loaded with that, and the same data was changed slightly (by hand) to run through the two SCD components (33 records had SCD1 changes, 42 had SCD2 changes).  The results:

SSIS Standard SCD Wizard:
1777 seconds (almost 30 minutes) - peak 40MB of RAM used

Kimball Method SCD:
15 seconds - peak 400MB of RAM used

In this instance - for 10x more memory usage, I get a 118x increase in performance.  BTW - the DB server hosting the dimension table is also running on my laptop - so there are no network latency issues that would further punish the SCD Wizard on top of its half-hour run...
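For the curious, the test set can be approximated with a join along these lines - a sketch only, assuming the AdventureWorks Sales.SalesOrderHeader and Sales.SalesOrderDetail tables (the column list here is illustrative; the actual query will be available with the release):

SELECT d.SalesOrderID,
       d.SalesOrderDetailID,   -- one row per order line: 121,317 rows in AdventureWorks
       h.OrderDate,
       h.CustomerID,
       d.ProductID,
       d.OrderQty,
       d.UnitPrice
FROM Sales.SalesOrderHeader h
JOIN Sales.SalesOrderDetail d
    ON d.SalesOrderID = h.SalesOrderID

The join fans each order header out to its detail lines, so the result has exactly as many rows as SalesOrderDetail.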

Update Feb 1, 2009: For "real world" numbers (with caveats), the component will push through a dimension like this:
  • 1,000,000 (one million) rows in the existing dimension flow
  • 100,000 (one hundred thousand) rows in the source system flow
  • rows 1005 bytes long (as reported by SQL Server for the existing dimension table)
  • 6 "housekeeping" columns (auditing, scd2 dates, surrogate key)
  • 2 business key columns (an int and a char(2))
  • 26 scd1 columns
  • 38 scd2 columns
...in 80 seconds
The caveats: my laptop was on battery power (CPU limited to 50%), and this was run inside BIDS.

Another example:
  • 1.9 million rows in the existing dimension flow
  • 80,000 rows in the source system flow
  • rows 62 bytes long
  • 6 "housekeeping" columns
  • 7 business key columns (all char)
  • 2 scd2 columns
... in 126 seconds
(same caveats as above)
Aug 25, 2009 at 2:05 PM

If you have a surrogate key without gaps (numbered 1, 2, 3, etc.), it's easy to get the maximum surrogate key, so the component can use it to assign the next key without scanning the whole table.

You can use this function in a SQL task and assign the output (result set) to a variable you can use in the component!

CREATE FUNCTION [dbo].[Get_TableRowcount] (
    @Object_Name nvarchar(257)
) RETURNS INT
AS
BEGIN
    DECLARE @NOROWS INT
    SET @NOROWS = 0

    -- Read the row count from the index metadata instead of scanning the table.
    -- indid 0 = heap, indid 1 = clustered index; a table has one or the other,
    -- and only those rows carry the table's row count.  (indid 255 is text/image
    -- data and is excluded, since it doesn't report the table's rows.)
    SELECT @NOROWS = t1.rows
    FROM sys.sysindexes t1
    WHERE t1.id = OBJECT_ID(@Object_Name)
      AND t1.indid IN (0, 1)

    RETURN @NOROWS
END

Now, create an Execute SQL Task, add this statement, and set the result set (1 row) to the variable:

SELECT  dbo.Get_TableRowcount('DimensionName') AS MaximaleSurrogatKey

The number of rows should be equal to the maximum surrogate key value in your dimension.

This is a nice performance improvement on large dimensions, instead of using the MAX(SurrogatKey) function :-)

Coordinator
Aug 26, 2009 at 10:32 PM
Edited Aug 26, 2009 at 10:41 PM

If you happen to have a table restricted like that (no gaps in the key), I would think a much better and simpler way to get the next surrogate key is:

SELECT COUNT(*) + 1 FROM table

Sep 8, 2009 at 2:56 PM

Hi Todd,

I understand that you can do this, but try it on a 1 million record dimension. You will see that it takes much longer than my method :-) You can say "whatever", but it's nice to get better performance with this little trick.

Coordinator
Sep 8, 2009 at 6:38 PM

Gotcha - yes that could take a while.

However, I restate that I don't like that method because it's so very easy to get gaps in your key.  I read a blog post by Lawrence Overbey on SQL Server Central that you should read too: Get Row Counts Fast!
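For reference, the gist of those "fast row count" techniques on SQL Server 2005 and later is to read the count from catalog metadata instead of scanning the table - a sketch, with a hypothetical dimension name (sys.partitions replaces the deprecated sysindexes view):

SELECT SUM(p.rows) AS TableRows
FROM sys.partitions p
WHERE p.object_id = OBJECT_ID('dbo.DimensionName')   -- hypothetical table name
  AND p.index_id IN (0, 1)   -- 0 = heap, 1 = clustered index; a table has one or the other

The count comes back near-instantly because no data pages are touched.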

Nov 18, 2009 at 10:43 AM

Hello,

My name is João Costa and I'm developing a Datamart for Project Management (based on PMBOK concepts). I'm using your useful and very promising component (to me it is still just "promising" because I can't get it to work with the desired performance).

I've followed your videos and got the component to work (last stable version).

The dimension has almost 9000 records (very small) and about 30 attributes, with only one being an SCD type 2. Currently it has no information, so it should just insert all 9000 as new rows.

The package runs, but when it gets to your component it just doesn't go beyond yellow, without any row being inserted (more than 30 minutes). In the database I only have one activity "Runnable", with the following SQL:

create table #tmpDBCCinputbuffer ([Event Type] nvarchar(512), [Parameters] int, [Event Info] nvarchar(512))
insert into #tmpDBCCinputbuffer exec ('DBCC INPUTBUFFER(56)')
select [Event Info] from #tmpDBCCinputbuffer

but that is not from the user of the connection, and the rest of the activities are "Sleeping - Waiting Command". If I stop the package, all of the "Sleeping - Waiting Command" sessions are killed.

I really think this is not an issue with the performance of your component, but if I run the "normal" SCD component from VS2005, it runs slowly - though not this slowly.

I'm not using any of the performance options you mention because, at this point, I prefer not to make it more complex - and because this is the first insert, it should be fast.

Best regards and thank you in advance for your help,

João Costa

Nov 30, 2009 at 4:08 AM

Hi Todd,

Thanks for your efforts in having a go at this component - it looks very interesting and I can't wait to give it a go - I have a suitable project for this coming up in January.

In the meantime, as a longtime defender/apologist for the "out of the box" SCD component, I would note that it always performs much better when there is an index (ideally unique) on the Business Key column(s) you have specified in the Wizard.  Such an index speeds up the SELECTs and UPDATEs that it issues row-by-row.
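For example, something like this (dimension and column names are hypothetical, purely for illustration):

-- For a Type 2 dimension the business key repeats across history rows,
-- so this index is not declared unique; on SQL Server 2008+ a filtered
-- unique index over only the "current" rows is another option.
CREATE NONCLUSTERED INDEX IX_DimCustomer_BK
    ON dbo.DimCustomer (CustomerAlternateKey);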

I'll get back to you in late January with my results.

Mike

Coordinator
Nov 30, 2009 at 2:03 PM

I look forward to your thoughts on how this component compares - especially against an "optimized" Wizard.  It will still suffer the same issues with the "update" as the SCD Wizard does, as updating the back-end database is usually performed the same way - using the slooow OLE DB Command component.  Although with this component, you can feel free to alter that design without fear of your change(s) getting whacked later by re-editing the SCD component :)
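If you do alter that design, a common set-based pattern is to bulk-load the changed rows into a staging table and issue a single UPDATE, instead of one statement per row through OLE DB Command - a rough sketch with hypothetical table and column names:

-- Assumes the SCD1-change output was bulk-loaded into stg.DimCustomerChanges.
UPDATE d
SET    d.EmailAddress = s.EmailAddress,
       d.Phone        = s.Phone
FROM   dbo.DimCustomer d
JOIN   stg.DimCustomerChanges s
       ON s.CustomerAlternateKey = d.CustomerAlternateKey;

One UPDATE over a joined staging table typically runs far faster than its row-by-row equivalent.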

Coordinator
Mar 25, 2010 at 8:43 AM

I stumbled upon this testimonial: SSIS - Kimball SCD Component vs. SCD Component.  Sounds like he's experiencing the kind of performance improvement I'd expect.

Jul 21, 2010 at 10:48 AM

Hi Todd,

First I would like to say; thank you for this component!

Second I would like to share with you some of the results.

We are loading fact tables with over 35 million rows using your component. At the moment your component is able to process 35 million rows in roughly 2 hours. Not bad, is it?

Still, do you know any way to improve the performance even further?

Coordinator
Apr 8, 2011 at 7:42 PM

EXCELLENT - thank you very much for the feedback on your experience.  Could you flesh out your numbers a little bit more?  Can you tell me how "wide" those rows are, and how many BK, SCD1, and SCD2 columns there are?  Also, (on average) how many SCD type 1 and type 2 changes occur per load, and how beefy is the server running this (RAM, CPUs)?  How long did it take with other method(s) - if you tried them?  I'm sure others will really appreciate that kind of info.

To answer your question - I think you should try v1.6.  I did some thread scheduling improvements that might help a bit more...  You may also want to consider using hashes on your BKs.
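On the hashing idea: the point is to compare one compact hash value instead of several wide business key columns. A rough T-SQL illustration with hypothetical names (HASHBYTES has been available since SQL Server 2005):

SELECT CustomerAlternateKey,
       CountryCode,
       -- one 16-byte MD5 value to compare instead of two wide key columns
       HASHBYTES('MD5', CAST(CustomerAlternateKey AS varchar(12)) + '|' + CountryCode) AS BKHash
FROM dbo.DimCustomer;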