Performance topic for larger dimension tables: memory usage

Mar 3, 2010 at 12:55 PM

Hi,

We have enjoyed working with the Kimball Method SCD Component over the last few days.
It's a nice tool, and really fast compared to the SSIS solution.

I would like to kindly ask you one question (unfortunately I'm not a native C# programmer, so I don't dare contribute to your code):

The setting: a large dimension table with around 500,000 records, 88 columns, about 300 MB.

        No surrogate key generation within your component ("Surrogate Key assignment is managed somewhere else");
        the surrogate key already exists in the dimension table, and for new records it is assigned later by an
        autoincrement column in the dimension table.

    Solution Variant I):
        Sort the input and the current dimension table via the SSIS Sort transform, then feed them to your component. This works fine, but:
        i) it takes about 35 minutes, and
        ii) the sorts use ~2 GB of memory, while the Kimball component itself only uses a small amount of memory,
            i.e. it starts processing data with a small cache.
        
    Solution Variant II):
        Sort the input directly in the database (<5 minutes) and feed that to the component.

        This also works fine (about 7 minutes overall), but now the Kimball component uses about 2 GB of memory.
        That is OK for now, but I expect this dimension (the customer dimension) to grow considerably over the
        next two years, so memory will certainly become an issue.

        We tried to configure the output of the DB source by setting IsSorted=TRUE, but unfortunately that does not help.
       
Do you have any ideas on how to get your component to reduce the amount of memory it needs?
Or have I missed something in how the component should be used?

Kind regards
John

PS: Nice piece of work and well documented.

Coordinator
Mar 5, 2010 at 11:57 PM

Glad you're finding a use for it.

Have you suppressed the performance warnings the component raises about having its inputs sorted?  If you have, please disable that option (leave it unchecked).  The component will then tell you accurately whether it can use the sort order to process faster and with less memory.

To properly inform SSIS that your source is delivering data in sorted order, you need to set IsSorted=True on the source's output, but you must also go to the Output Columns node, expand it to see the columns being output by the source, and set the SortKeyPosition property on the columns that participate in the sort order.  Set the first column to "1", the second to "2", and so on.  Once you do that, the warning from the KSCD component should go away.
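If you ever want to script that change instead of using the Advanced Editor, here is a minimal sketch using the SSIS pipeline object model (the SSIS 2008 "100" interfaces). The package path, task name, source name and key column below are hypothetical placeholders; the Advanced Editor approach described above is usually all you need.

```csharp
// Minimal sketch: mark a source's output as sorted via the SSIS object model.
// All names and paths below are hypothetical examples.
using Microsoft.SqlServer.Dts.Runtime;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

class MarkSourceAsSorted
{
    static void Main()
    {
        Application app = new Application();
        Package pkg = app.LoadPackage(@"C:\ETL\LoadCustomerDim.dtsx", null);   // hypothetical package

        TaskHost dft = (TaskHost)pkg.Executables["Data Flow Task"];            // hypothetical task name
        MainPipe pipe = (MainPipe)dft.InnerObject;

        foreach (IDTSComponentMetaData100 comp in pipe.ComponentMetaDataCollection)
        {
            if (comp.Name != "Existing Dimension Source") continue;            // hypothetical source name

            IDTSOutput100 output = comp.OutputCollection[0];
            output.IsSorted = true;                                            // tell SSIS the flow is sorted

            // SortKeyPosition must mirror the ORDER BY of the source query:
            // first sort column = 1, second sort column = 2, and so on.
            foreach (IDTSOutputColumn100 col in output.OutputColumnCollection)
            {
                if (col.Name == "CustomerBK")                                  // hypothetical business key column
                    col.SortKeyPosition = 1;
            }
        }

        app.SaveToXml(@"C:\ETL\LoadCustomerDim.dtsx", pkg, null);
    }
}
```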

If you don't see an improvement in memory usage after that, please let me know.

Mar 9, 2010 at 10:13 AM
Edited Mar 9, 2010 at 3:53 PM

Hi Todd,

Thanks a lot.

That helps partly, i.e. the process now starts emitting rows rather early (not from the very beginning, but after ~100,000 rows).

It still uses 1.8 GB, so my "feeling" is that it still builds up a cache of all the records without doing any garbage collection in between.

Do you have any ideas about that? (The buffer size and the buffer row size have not been changed, i.e. 10 MB and 10,000 records.)

kr John


Coordinator
Mar 14, 2010 at 6:25 PM

In my experience, buffer sizing tends not to be that significant for performance issues, and this one in particular.  If the component is in fact caching a ton of rows, then the chunk size of the rows it's caching before it can start matching is really irrelevant.

I've been running some performance profiling on my data, and what seems to be the best indicator of performance (so far) is making sure that both inputs get fed proportionally.  That is to say, if your dimension has 500K records and your source has 100K (you didn't say), then the best results tend to happen when the existing dimension gets fed to the component five times faster than the source.  That assumes the sort order (assuming again that both inputs are sorted) results in an even distribution of "matches" between the source and the dimension along the entirety of the dimension table.  Please see my recent blog post about this topic, where I talk about a dimension table that isn't like that.
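To make that reasoning concrete, here is a small illustrative sketch.  It is NOT the component's actual code, just a toy model of a sort-merge style match over two sorted inputs covering the same business-key range (the 500K/100K row counts are taken from the example above); it estimates how many rows have to sit in a cache waiting for the other input to catch up, for a 1:1 feed versus a 5:1 feed.

```csharp
// Illustrative only -- NOT the KSCD component's internals. A toy model of two
// sorted inputs (500K-row dimension, 100K-row source over the same key range)
// showing how the feed ratio drives the peak number of cached rows.
using System;

class FeedRatioSketch
{
    const int DimRows = 500000;   // existing dimension: keys 1..500000, one row per key
    const int SrcRows = 100000;   // source extract: every 5th key, so same key range

    static void Main()
    {
        Simulate(10000, 10000);   // 1:1 feed -> source races ahead, big cache
        Simulate(10000, 2000);    // 5:1 feed -> both sides stay in step, tiny cache
    }

    static void Simulate(int dimPerRound, int srcPerRound)
    {
        int dimRead = 0, srcRead = 0, maxCached = 0;

        while (dimRead < DimRows || srcRead < SrcRows)
        {
            dimRead = Math.Min(DimRows, dimRead + dimPerRound);
            srcRead = Math.Min(SrcRows, srcRead + srcPerRound);

            int dimKey = dimRead;       // highest key seen on the dimension side
            int srcKey = srcRead * 5;   // highest key seen on the source side

            // Rows whose key lies beyond the other side's current key cannot be
            // matched yet, so they have to sit in the cache.
            int cachedSrc = srcRead - Math.Min(srcRead, dimKey / 5);
            int cachedDim = dimRead - Math.Min(dimRead, srcKey);
            maxCached = Math.Max(maxCached, cachedSrc + cachedDim);
        }

        Console.WriteLine("dim:src feed {0}:{1} -> peak cache ~{2:N0} rows",
                          dimPerRound, srcPerRound, maxCached);
    }
}
```

In the 1:1 case the source side runs ahead through the key range and roughly 80% of the source ends up cached at the peak; in the 5:1 case both sides advance through the key range together and the cache stays near zero.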

On a technical note - were you able to get your Existing Dimension source component to deliver the rows in sorted order, such that SSIS knows the flow is sorted?

Now - how do you get a 5:1 (five to one) delivery of rows?  You can't control that through any SSIS mechanism.  There is a component out there called FlowSync by Alberto Ferrari (please check the SSIS Community Tasks and Components site) that I haven't tried, but which claims to be able to synchronize separate flows.  That may be useful for you.

If you can't get a 5:1 delivery ratio, then it may be beneficial for you to guarantee a lopsided - but beneficially lopsided - delivery.  Can you arrange for ALL (or most) of the Source System rows to get to the component first?  If you can, then you limit the amount of cache the component will need to at most what is required to cache the Source System.  It won't need to cache any of the Existing Dimension, because as soon as it gets rows from the Existing Dimension, it can process them.
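As an illustration of that last point (again, just a sketch with made-up names, not the component's actual implementation): if the whole Source System arrives first, the component only ever needs a cache the size of the source, because every Existing Dimension row can be compared and released the moment it shows up.

```csharp
// Illustrative sketch only -- not the KSCD component's code. The source is held
// in a dictionary; the (much larger) existing dimension is streamed and handled
// row by row, so memory stays bounded by the size of the source extract.
// Handling of brand-new source rows is omitted for brevity; names are made up.
using System;
using System.Collections.Generic;

class SourceFirstSketch
{
    static void Main()
    {
        // Phase 1: cache the (smaller) source extract, keyed by business key.
        Dictionary<int, string> sourceByKey = new Dictionary<int, string>();
        foreach (KeyValuePair<int, string> row in ReadSourceRows())
            sourceByKey[row.Key] = row.Value;

        // Phase 2: stream the (larger) existing dimension; each row is compared
        // and released immediately, so it never needs to be cached.
        foreach (KeyValuePair<int, string> dimRow in ReadDimensionRows())
        {
            string sourceValue;
            if (sourceByKey.TryGetValue(dimRow.Key, out sourceValue))
                Console.WriteLine((sourceValue == dimRow.Value ? "Unchanged: " : "Changed:   ") + dimRow.Key);
            else
                Console.WriteLine("Not in source: " + dimRow.Key);
        }
    }

    // Stand-ins for the real data flows.
    static IEnumerable<KeyValuePair<int, string>> ReadSourceRows()
    {
        yield return new KeyValuePair<int, string>(1, "Alice");
        yield return new KeyValuePair<int, string>(2, "Bobby");   // changed attribute
    }

    static IEnumerable<KeyValuePair<int, string>> ReadDimensionRows()
    {
        yield return new KeyValuePair<int, string>(1, "Alice");
        yield return new KeyValuePair<int, string>(2, "Bob");
        yield return new KeyValuePair<int, string>(3, "Carol");   // only in the dimension
    }
}
```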