Arquivo

Archive for the ‘Concepts’ Category

Slowly Changing Dimensions Alternatives

Kimball Design Tip #53: Dimension Embellishments
By Bob Becker
When developing dimensional models, we strive to create robust dimension tables decorated with a
rich set of descriptive attributes. The more relevant attributes we pack into dimensions, the greater
the users’ ability to evaluate their business in new and creative ways. This is especially true when
building a customer-centric dimension.

We encourage you to embed intellectual capital in dimensional models. Rather than applying
business rules to the data at the analytical layer (often using Excel), derivations and groupings
required by the business should be captured in the data so they’re consistent and easily shared
across analysts regardless of their tools. Of course, this necessitates understanding what the
business is doing with data above and beyond what’s captured in the operational source. However,
it’s through this understanding and inclusion of derived attributes (and metrics) that the data
warehouse adds value.

As we deliver a wide variety of analytic goodies in the customer dimension, we sometimes become
victims of our own success. Inevitably, the business wants to track changes for all these interesting
attributes. Assuming we have a customer dimension with millions of rows, we need to use minidimensions
to track customer attribute changes. Our old friend, the type 2 slowly changing dimension
technique, isn’t effective due to the large number of additional rows required to support all the
change.
The mini-dimension technique uses a separate dimension(s) for the attributes that frequently change.
We might build a mini-dimension for customer demographic attributes, such as own/rent home,
presence of children, and income level. This dimension would contain a row for every unique
combination of these attributes observed in the data. The static and less frequently changing
attributes are kept in our large base customer dimension. The fact table captures the relationship of
the base customer dimension and demographic mini-dimension as the fact rows are loaded.
It is not unusual for organizations dealing with consumer-level data to create a series of related minidimensions.
A financial services organization might have mini-dimensions for customer scores,
delinquency statuses, behavior segmentations, and credit bureau attributes. The appropriate minidimensions
along with the base customer dimension are tied together via their foreign key
relationship in the fact table rows. The mini-dimensions effectively track changes and also provide
smaller points of entry into the fact tables. They are particularly useful when analysis does not
require consumer-specific detail.

Users often want to analyze customers without analyzing metrics in a fact table, especially when
comparing customer counts based on specific attribute criteria. It’s often advantageous to include the
currently-assigned surrogate keys for the customer mini-dimensions in the base customer dimension
to facilitate this analysis without requiring joins to the fact table. A simple database view or
materialized view provides a complete picture of the current view of the customer dimension. In this
case, be careful not to attempt to track the mini-dimension surrogate keys as type 2 slowly changing
dimension attributes. This will put you right back at the beginning with a large customer dimension
growing out of control with too frequent type 2 changes.

Another dimension embellishment is to add aggregated performance metrics to the customer
dimension, such as total net purchases last year. While we normally consider performance metrics to
be best handled as facts in fact tables (and they should certainly be there!), we are populating them
© Copyright Kimball Group, 2004. All rights reserved. 2 of 2

in the dimension to support constraining and labeling, not for use in numeric calculations. Business
users will appreciate the inclusion of these metrics for analyses. Of course, populating these
attributes in our dimension table places additional demands on the data staging system. We must
ensure these aggregated attributes are accurate and consistent.

An alternative and/or complementary approach to storing the actual aggregated performance metrics
is grouping the aggregated values into range buckets or segments, such as identifying a credit card
customer as a balance revolver or transactor. This is likely to be of greater analytic value than the
actual aggregated values and has the added benefit of assuring a consistent segment definition
across the organization. This approach works particular well in combination with the mini-dimension
technique.

Design Tip #127 Creating and Managing Mini-Dimensions

setembro 2, 2010 — Carlos Alberto Lorenzi Lima

I wrote a Design Tip last year on creating junk dimensions; I’ve decided to extend that article into a series on implementing common ETL design patterns. Warren Thornthwaite – Kimball Group.

This Design Tip describes how to create and manage mini-dimensions.  Recall that a mini-dimension is a subset of attributes from a large dimension that tend to change rapidly, causing the dimension to grow excessively if changes were tracked using the Type 2 technique.  By extracting unique combinations of these attribute values into a separate dimension, and joining this new mini-dimension directly to the fact table, the combination of attributes that were in place when the fact occurred are tied directly to the fact record. (For more information about mini-dimensions, see Slowly Changing Dimensions Are Not Always as Easy as 1, 2, 3 at Intelligent Enterprise and Design Tip # 53: Dimension Embellishments.)
 
Creating the Initial Mini-Dimension

Once you identify the attributes you want to remove from the base dimension, the initial mini-dimension build is easily done using the brute force method in the relational database. Simply create a new table with a surrogate key column, and populate the table using a SELECT DISTINCT of the columns from the base dimension along with an IDENTITY field or SEQUENCE to create the surrogate key.  For example, if you want to pull a set of demographic attributes out of the customer dimension, the following SQL will do the trick:
 
INSERT INTO Dim_Demographics
SELECT DISTINCT col 1, col2, …
FROM Stage_Customer                         
 
This may sound inefficient, but today’s database engines are pretty fast at this kind of query.  Selecting an eight column mini-dimension with over 36,000 rows from a 26 column customer dimension with 1.1 million rows and no indexes took 15 seconds on a virtual machine running on my four year old laptop. Once you have the Dim_Demographics table in place, you may want to add its surrogate key back into the customer dimension as a Type 1 attribute to allow users to count customers based on their current mini-dimension values and report historical facts based on the current values.  In this case, Dim_Demographics acts as an outrigger table on Dim_Customer.
 
Again, the brute force method is easiest. You can join the Stage_Customer table which still contains the source attributes to Dim_Demographics on all the attributes that make up Dim_Demographics. This multi-join is obviously inefficient, but again, not as bad as it seems.  Joining the same million plus row customer table to the 36 thousand row demographics table on all eight columns took 1 minute, 49 seconds on the virtual machine.  Once all the dimension work is done, you will need to add the mini-dimension key into the fact row key lookup process.  The easy way to do this during the daily incremental load is to return both the Dim_Customer surrogate key and the Dim_Demographic surrogate key as part of the customer business key lookup process.
 
Next, you will need to add the mini-dimension key into the fact row key lookup process.  The easy way to do this during the daily incremental load is to return both the Dim_Customer surrogate key and the Dim_Demographic surrogate key as part of the customer business key lookup process.
 
Ongoing Mini-Dimension Maintenance

Ongoing management of the dimension is a two-step process: first you have to add new rows to the Dim_Demographics table for any new values or combinations of values that show up in the incoming Stage_Customer table.  This could be done using the same basic logic as the original build SQL, except by using the MERGE command, you only insert rows when the combination of values is actually new. In the SQL Server MERGE syntax, you can leave out the WHEN MATCHED clause and it will ignore the combinations that already exist in the Dim_Demographics dimension.
 
MERGE INTO Dim_Demographics AS target
USING Stage_Customer AS source
ON source.Payment_Type = target.Payment_Type
{additional join columns}
WHEN NOT MATCHED THEN
            INSERT (Payment_Type, {list of additional columns})
            VALUES (source.Payment_Type, {additional source columns});
 
This should process very quickly because you are only joining to the incremental data set in the customer staging table. It took 11 seconds against the full source customer dimension in my sample data.
 
Next, once all the attribute combinations are in place, you can add their surrogate keys to the incoming incremental rows.  The same brute force, multi-column join method used to do the initial lookup will work here. Again, it should be faster because the incremental set is much smaller.
 
By moving the Type 2 historical tracking into the fact table, you only connect a customer to their historical attribute values through the fact table. This may not capture a complete history of changes if customer attributes can change without an associated fact event.  You may want to create a separate table to track these changes over time; this is essentially a factless fact table that would contain the customer, Dim_Demographics mini-dimension, change event date, and change expiration date keys. You can apply the same techniques we described in Design Tip #107 on using the SQL MERGE statement for slowly changing dimension processing to manage this table

 

What is a degenerate dimension – Fact Relationship type

Information Management Online, December 9, 2003

Joe Oates, Chuck Kelley, Clay Rehm

What is a degenerate dimension and how is it used?
Chuck Kelley’s Answer: A degenerate dimension is data that is dimensional in nature but stored in a fact table. For example, if you have a dimension that only has Order Number and Order Line Number, you would have a 1:1 relationship with the Fact table. Do you want to have two tables with a billion rows or one table with a billion rows. Therefore, this would be a degenerate dimensionand Order Number and Order Line Number would be stored in the Fact table.

Advertisement

http://ad.doubleclick.net/adi/information-management.com/;pg=data_modeling;sz=468×60;pos=3;tile=5;ord=14057271?

Joe Oates’ Answer: Many data warehouse transaction fact tables have a control number, such as an invoice number, purchase order number or policy number. Over the course of years reflected in the data warehouse, there could be hundreds of thousands or even millions of each particular control number such as invoice number. It isn’t practical to have a drop-down list of each of these numbers to find data typically found in the legacy system invoice header such as invoice number, invoice date, invoice type. Consequently, these data items are put into the line-item fact table to slice and dice as the users wish. These control numbers are called degenerate dimensions.

If you were to have a dimension table for invoice, you would have nearly as many entries in the dimension table as you have in the line-item fact table. The line-item fact table is generally the largest table by far in the data warehouse. So joining the multimillion or multibillion row fact table to a multimillion or multibillion row dimension table will cause your data warehouse to take up much more disk storage that it should as well as significantly degrading performance

Clay Rehm’s Answer: A degenerate dimension is a dimension that is stored in the fact table rather than the dimension table. It eliminates the need to join to a Dimension table.