buffer size recommendations?

Oct 16, 2009 at 9:32 PM

Do you have any buffer size recommendations (i.e. DefaultBufferMaxRows and DefaultBufferSize settings on the data flow) for packages which use Kimball SCD?

I have one package which is a million row dimension with a fair amount of string columns. It's optimized to provide the surrogate key to Kimball SCD, and it sorts the inputs. With the buffer size at about 30,000 rows, it used 0.75 GB memory max. When I doubled the buffer size, then the package used 1.5 GB max. That honestly surprised me since I figured the main consumer of memory would be the Kimball SCD component, not the pipeline buffers.

So what are the factors that play in this decision when the Kimball SCD component is used? What if the source system input is slow, but the existing dimension input is fast? How does Kimball SCD manage input buffers? Are there certain situations in which Kimball SCD might be less efficient with memory? How does parallelism work in Kimball SCD (i.e. is it one thread per buffer, thus smaller buffers means more parallelism, or something like that?)

I hope you see what I'm getting at. Just looking for advice, if you can give us any rules which might help steer us. (Otherwise, we'll just try several different buffer sizes and pick the one that's optimizes speed and memory usage.)

Oct 17, 2009 at 11:01 PM

That is an awesome question.  Short answer - no, I don't.

I can't say I'm surprised that your mem usage doubled when you doubled the buffers - although I would have expected a little less than double.  I'll describe what goes on inside the component, and that will help me answer your remaining questions.  First, I have very limited control over the buffers I receive.  The only control I have is to stop reading them in.  I have no choice or control about which input I want to receive buffers from.  If I do stop reading them in, SSIS will queue a few buffers up in it's pipeline, but will soon get the message that I'm snubbing it, and it will pause the other components upstream.

When I do get passed a buffer I'm told which input it came from.  The main thread of the component creates a hash key (based on the BKs) for each row and adds those rows to my internal cache for that input.  Another thread pool - the "match" pool - is (almost) constantly scanning the input caches, attempting to find hashkey matches between inputs - I call those "determined" keys.  As in "I've determined you definitely have a match, or you definitely don't."  Since your inputs are sorted, when the "match" threads do find a match, they are then also able to mark every prior row that came from the input as a "no match will ever come for you."  (Essentially dooming them to being categorized as "New" or "Deleted".)  Another special case for the "match" threads is when all but one of the inputs is known to have been completely received, AND there are no other input rows cached.  If that's the case, then it's guaranteed that if the rows coming in from the one input won't match anything.  The "match" threads add all those "determined" hashkeys to a queue for the "processing" thread pool to read from.  The processing thread pool reads hashkeys from the queue, pulls the rows out of the input caches corresponding to those keys, and does the SCD column-by-column comparison, figuring out what to do with the row(s).  Once the processing threads figure that out, they add a row to the appropriate output and set all the column values.

So - given the above behaviour, the best scenario my component could encounter to reduce memory usage would be:
* The smallest buffers possible
* Sorted inputs
* Buffers delivered to the component alternating from one input to the other such that matches happen almost instantly.  (This may mean a 2:1 ratio, depends on a lot.  Basically want the teeth in the gears to mesh, if you think of each input as a gear.)

The "worst case" I can think of is a combination of these (although some of them are mutually exclusive):
* Huge buffers (will likely cause more rows to get buffered with no hope of matching until the next buffer from the other input arrives)
* One input in "ascending" BK order, and the other in "descending" BK order (will cause about half of each input to be completely buffered before matches are made)
* One (unsorted) input is super fast until the last buffer, the other very slow - with the fast input having tons more records (will cause the fast and large input to be almost completely buffered)

Remember I said I have no control over what inputs I get buffers from?  If I did, I could have the component manage memory a lot better, because I could have it do the third point up there by itself.  (See Connect issue 391181: https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=391181)

In general (no empirical tests to back this up), I think that the component would work best if you can sort both inputs (outside of SSIS) and provide the buffers to it in a ratio correlated to the relative number of rows from each source.  So if the ED has 10M rows and the SS has 1M - feed the component with ED rows 10x faster than SS rows.  But you have no control over that.  Of course, all of the above is assuming that the BKs are "random".  I don't mean that rigidly - I just mean that I know I have dimensions where the OLTP system assigns BKs as identities (they're already SKs - just not Dimensional SKs).  In those "nonrandom" cases, I find that there are some "really old" ED records that are already "deleted" for good from the SS, and those should be pumped through fast first.  But again, you have no control over that.

If you have a case where one input is significantly smaller than the other - and the only scenario I can think of that matches this is a very rapidly changing "slowly changing dimension" (like a customer).  You may have 10K SS rows - the "active" customers, and 1M ED rows - the "history" of the customers.  (Customers averaging 100 versions of themselves.)  Then I'd try to arrange pushing in the SS input as fast as possible, and slow down the ED until the SS was done.  Why?  Because caching 10K rows isn't "that bad", and it's a known quantity.  If I fully cache the SS, then the component can use its shortcut logic as the ED rows come in.  The incoming ED rows will either match an SS row - causing my cache level to drop, or they won't find a match - but I know they'll _never_ find a match because I've seen all the SS keys.  Therefore, ED rows never stay in my cache (although they do hit it in transition).  My cache level will zoom up to the max required to hold the SS, then decrease as ED rows go through.  That ought to be preferable to caching a sizeable amount of ED rows, not knowing if I can process them because I haven't seen all the SS rows yet.

A whole bunch of "it depends"es...

That said - try out v1.5.  It won't be faster than v1.4... but it will show you cache levels in the Audit output - max and avg.  That will help you figure out how much memory it's consuming (not exactly, given .Net GC) versus the other SSIS pipeline and components...

Oct 19, 2009 at 4:33 PM

Great info, Todd. I suspect you're right about smaller buffers being better for Kimball SCD some of the time. (That's generally opposite from general SSIS performance tuning advice, I think. But it makes some sense for Kimball SCD.)

One other related question. Does buffer size have anything to do with parallelism inside Kimball SCD? How does your component decide how many threads to use and when to use them? This info might be helpful if I see several Kimball SCD packages running and pegging the CPU (or several running and not pegging the CPU) and want to tune to make better use of appropriate resources.

I know this is a side project for you, and we appreciate all your hard work on it! Looking forward to using 1.5 soon.

Oct 19, 2009 at 11:12 PM

I'd have to talk it over with the SSIS devs - but one reason I can think that the "larger buffers is better" is generally due to the (sane) recommendation to stay away from asynchronous components.  Synchronous components would be doing less "busywork" with larger buffers, and spend more time on actual data processing.  But asynchronous components hold up the data flow - and having larger chunks to hold up then get moving again doesn't seem to make sense for them.

Buffer size has nothing to do with parallelism inside the component.

In the options for the component (more sanely organized in v1.5) you're allowed to manually set the number of threads for each pool that I described.  If left up to the component (the default), it creates one thread per core per pool.  So for my dual-core machine, I get one main thread (for moving the input rows to my caches), two "matching" threads (to link up cached rows and queue them), and two "processing" threads (to take rows out of the queue and caches and output them).  You can reduce that to one thread per pool.  If you tweak the source, you could make the component single-threaded (that's an option I made only available to me, I think, for debugging).

Wish I had more time... I'm currently cursing at MS Access...

Aug 5, 2011 at 4:47 PM

Hi Todd,


I know this is the resurrection of a very old thread, but  I am having terrible issues with a relatively small dimension.

So, first things first, this is running on my x86 3GB ram dev laptop. I am processing approx 220k rows and when dumped to raw, fills a 180Mb file.

The issue I am having is OutOfMemory exceptions. I am finding it hard to get a complete run without an exception. I have tried large buffers and small buffers. I am using flow-sync from Cozyroc to try to keep the two source buffers in line. Both source and existing pipelines are sorted asc. The BK is a composite of DT_I2, DT_I4 and DT_WSTR,8. The surrogate key is handled by the component, but is initialised by me with don't-scan-input enabled. I don't know what else I can optimise.

When I set everything to SCD1 (with the exception of one column so I don't get SCD2 audit nags), everything runs fine. When I set the majority of columns to SCD2 as they need to be, I hit the memory issues. I have been running buffers from 10,000 to 100 (smaller seems better). I have replaced the Source System pipeline with a raw file. 

The ED is sorted by the DB, the SS is sorted by SSIS directly after the RAW output. I am about to pre-sort the raw and see if that helps.

I appreciate that my dev machine isn't great, but it is only 220k rows, 200Mb. It makes me think that there is a fundamental problem that I haven't worked out. I have used the component for years without issue, but this is probably the biggest SCD it has handled.