Archive for the ‘Dimensions’ Category

Materialized Reference Dimension = Wrong Results


Reference dimensions let you create a relationship between a measure group and a dimension using an intermediate dimension to act as a bridge between them. In the Adventure Works demo, for example, they are used to link reseller sales and orders to the geography dimension.


Now, something that is less evident is how SSAS handles this specific kind of relationship. Let us look further into the definition of the relationship:


If the relationship is not materialized (something SSAS dislikes), it will be resolved at query time, with slower performance. Nevertheless, everything will work fine.

If, on the other hand, we follow SSAS's suggestion and decide to materialize it, we need to understand what is going on under the covers, to avoid a very subtle misfeature of SSAS.

When the relationship is materialized, SSAS resolves it at processing time, and it does so during partition processing (not dimension processing; I have written the right word: partition!). It issues a more complex query that JOINs the fact table to the intermediate dimension, in order to gather the key of the final dimension from the intermediate one. This leads to poorer processing performance, but that is something we already know. What is important to point out is that the relationship is resolved during partition processing, not dimension processing.

The real problem comes when we need to update the relationship. We might think (at least, this is what I thought) that issuing a ProcessUpdate on the intermediate dimension would refresh the relationship and perform all the necessary updates. This is not true. Since the materialized relationship is resolved during partition processing, no dimension processing will refresh it. You can try to ProcessUpdate the intermediate or the final dimension: no way, the relationship still uses the data that was there when the partition was first processed. If we want to refresh this relationship, the only way is to reprocess the partition; at that point the new data will be correctly available.
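If you need to script that partition reprocess, an XMLA Process command along these lines would do it. This is only a sketch: the object IDs below are placeholders, not the real Adventure Works IDs, and you would substitute your own database, cube, measure group, and partition identifiers.

```xml
<!-- Sketch only: all IDs are placeholders.
     Reprocessing the partition (not the dimension) is what refreshes
     a materialized reference relationship. -->
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>Adventure Works DW</DatabaseID>
    <CubeID>Adventure Works</CubeID>
    <MeasureGroupID>Reseller Sales</MeasureGroupID>
    <PartitionID>Reseller_Sales_2004</PartitionID>
  </Object>
  <Type>ProcessFull</Type>
</Process>
```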

This might be a very big problem if, for very large cubes, we decide to reprocess only the newest partitions, rebuilding only the needed aggregations for the older ones. If the older partitions are not reprocessed, they will keep using the older version of the relationship. We end up in an inconsistent situation where some partitions use one version of the relationship and others use a different one, depending on when they were last processed.

Clearly, if we rely on non-materialized relationships, everything works fine, since the relationship relies only on dimensional data and does not need to be resolved during partition processing. In fact, issuing a ProcessUpdate on the intermediate dimension will immediately refresh the non-materialized relationship, as it is supposed to do.

I think BOL should be much clearer on this: materializing a relationship is not just a matter of query speed, it has subtle consequences that, at their extreme, might lead to incorrect results from the cube.

Categories: Reference Dimension

Parent Child – Create Dimension


In this post I’d like to point out 3 things:

  1. The column contains the ParentID, not the ChildID
  2. MembersWithData property
  3. What happens to nodes without a parent

1. The column contains the ParentID, not the ChildID

When we create a Parent-Child dimension, we have an ID column and a name column, plus a third column that points back to the ID column. This third column should contain the parent ID, not the child ID:

In the table we specify the parent node rather than the child node because the hierarchy is in pyramid format, whereby a node can have only 1 parent, but a node can have several children. Hence we can't set the third column to be the ChildID, because a node can have several children.

Note that in the table:

a)    The top node has no parent, and

b)    The ChildID (NodeID) is the primary key of the table, meaning that we can't have duplicate node numbers, because that would mean a node has more than one parent.

When we create the SSAS dimension, we make the ChildID (NodeID) the key of the dimension. The "self-referencing column"* contains the ParentID, not the ChildID. Like this:

*a "self-referencing column" is a column that points to another column on the same table.

Let’s create the table first:

create table PC
( ID nvarchar(10) not null primary key,
Name nvarchar(10) null,
ParentID nvarchar(10) null
)

insert into PC values (1, 'Node1', NULL)
insert into PC values (2, 'Node2', 1)
insert into PC values (3, 'Node3', 2)
insert into PC values (4, 'Node4', 2)
select * from PC
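As a quick sanity check of the parent-child data, a self-join on the PC table created above lists each node next to its parent's name (the top node shows NULL):

```sql
-- Self-join: c is the child row, p is its parent row (NULL for top nodes).
select c.ID, c.Name, p.Name as ParentName
from PC c
left join PC p on c.ParentID = p.ID
```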

Then create the DSV table (drag the ParentID column and drop it onto the ID column):

On the DSV, the relationship is: Source = ParentID, Destination = ID:

Then create the dim (from existing table):

No need to define the "Usage" property of the ParentID attribute as "Parent"; BIDS automatically does this for us.

Process the dim, then browse:

2. MembersWithData property

In the above picture, the nodes are repeating because of a property called “MembersWithData”. This property is currently set to NonLeafDataVisible:

PS. This is a property of the ParentID attribute, not a property of the dimension.

If we set it to NonLeafDataHidden, process the dim, browse the dim, and reconnect, the repeating nodes disappear:

3. What happens to nodes without a parent (orphan nodes)

For this, let's add a row with ParentID = NULL to the table:

insert into PC values (5, 'Node5', NULL)
select * from PC

Process the dimension, browse:

We see that a node without a parent becomes a top-level node.

4. Unknown

I know I said 3 points, but oh well, 1 more:

In the picture above, the Unknown member is there because the UnknownMember property is set to Visible. Set it to Hidden and the Unknown member disappears:

(This is a property of the dimension, not of the attribute.)

Note that hiding the unknown member is not recommended, because fact rows mapped to the unknown member of this dim will "disappear from the cube", causing the total of the measure to be incorrect (less than it should be).

Yes, the usual counter-argument is: "but we have set the dim key to 0 in the ETL, so there would be no rows in the fact table with dim key = null".

Well, in Data Warehousing there's an old saying: "If it is not enforced, it will happen" :) Meaning: things like RI, dups, and nullability, if they are not physically enforced, will happen. Maybe not in the first month after going live, but eventually they will happen.

PS. RI is enforced using FKs, duplication is prevented using PKs, and nullability is enforced using the "Allow Nulls" column property.
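As a sketch, physically enforcing all three might look like this in T-SQL. The table and column names here are invented for illustration only:

```sql
-- Illustrative only: FactSales / DimCustomer names are made up.
ALTER TABLE FactSales
    ALTER COLUMN CustomerKey int NOT NULL;          -- nullability: "Allow Nulls" off

ALTER TABLE FactSales
    ADD CONSTRAINT PK_FactSales
    PRIMARY KEY (SalesID);                          -- duplication: PK

ALTER TABLE FactSales
    ADD CONSTRAINT FK_FactSales_DimCustomer
    FOREIGN KEY (CustomerKey)
    REFERENCES DimCustomer (CustomerKey);           -- RI: FK
```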

“But the FKs would slow down the ETL”. I know, I know, … classic.


Different granularity in a single dimension

Handling different granularity (for example actual and plan values) can get a
little bit complicated. Of course there are standard methods, like splitting up
the less granular data in order to meet the finer granularity. Or you could use
a parent-child structure as this allows you to store data at different levels in
the tree-like structure. Or you could supply ‘unknown’ elements to map the less
granular information.

For this post I want to show a different approach. Usually, for each dimension,
we link all fact tables that refer to this dimension to the same key in
the dimension table (the dimension's primary key). However, the data source view
also allows us to link facts to different key columns in the same dimension.

In my simple scenario I have a time dimension (called DimDate) and two fact
tables: Order and Order Plan. The orders are on a daily basis while the order
plan is on a monthly basis. We want to link both fact tables to the same time
dimension as shown below:


The link from the fact order table to the time dimension DimDate (marked as 1
in the sketch above) is the usual link from the fact table to the primary key of
the DimDate table. The time dimension is on daily granularity and so are the
order facts. But for the plan value fact table FactOrderPlan, the link to the
time dimension is realized by using two key columns: Year (as the year number,
e.g. 2008) and Month (as the month number, e.g. 11 for November), so the link in
the data source view looks like this:


For our cube it is important to specify the right granularity attribute.
While the order table is linked to the Day (granularity attribute), we link the
order plan fact table to the month attribute and define the proper key mappings
for that.


Now, the dimension usage looks like this:


We also defined proper attribute relationships for the time dimension
accordingly, to build up a year->quarter->month->day hierarchy.

So let’s take the first look at the cube created by this method:


This first look is as expected. As long as we are at a common granularity
level shared by both fact tables, we can see the values correctly. Also, the
aggregation of both fact sources works fine (although they are at different granularities).

Now, let's drill down to the day level, which is not present in our planning data:


The behavior here is exactly the same as for other unrelated dimensions! The
value of the nearest matching hierarchy level is taken for the levels below. Sometimes
this behavior of the cube confuses the users, but we can change this
behavior through the property IgnoreUnrelatedDimensions:


Now, the planning values below the month granularity level have disappeared:


However, if you want to create a calculated measure that is also based on the
planning values, you should be aware that the values simply no longer
exist at the day level. For example, let's define a calculated
measure PlanFulFillment using the following expression:

[Measures].[Amount] / [Measures].[Amount Plan]

At the day level, the measure Amount Plan does not exist, so this results in
computation errors:


You could still use the non-empty behavior of the calculated measure (set to
"Amount Plan") in order to have these computations disappear. However, if you
want to refer to the monthly value, you can simply use the ValidMeasure MDX
function, which is always helpful in conjunction with
IgnoreUnrelatedDimensions=false. So after defining our calculated measure as
[Measures].[Amount] / ValidMeasure([Measures].[Amount Plan])
the result looks like this (at the day level, the monthly values for the planning
data are taken):


So, surprisingly enough (at least for me), everything behaves exactly as we
wanted it to, which makes this approach an alternative in some scenarios.
Again, please check your attribute relationships carefully and also spend some
time testing the result, as the approach can get dangerous for more
complicated attribute structures.

I also checked this design with more attributes and parallel hierarchies in
the time dimension (for example, calendar week) and more fact tables (for example,
a production plan), and the aggregation was still correct. Having
IgnoreUnrelatedDimensions set to false is helpful here to see clearly which
fact is selected at the right granularity level.

Following is an example with three fact tables (one at day level, one at
month level and one at week level):


UDM, New Dimension Types and Analysis Services 2005

One of the key goals of Analysis Services 2005 is to “bring together the best of both relational and OLAP reporting into a single Unified Dimensional Model”.  This is a very ambitious goal and most would agree that it would be difficult to fully achieve.  However, Analysis Services 2005 certainly does take huge strides in this direction.


An example is that the UDM schema assumes many characteristics that were traditionally reserved solely for the relational world.


The most powerful of these has to be the attribute-based model where there is roughly a one-to-one relationship between columns in the relational source and dimension attributes.  As each attribute typically has its own little hierarchy, straight off the bat we can get 20 or 30 hierarchies per dimension.  This is very different to the behaviour in Analysis Services 2000 where multiple hierarchies per dimension are no more than a naming convention.  This really does support much richer reporting requirements.


Furthermore, UDM schemas are far more flexible.  In Analysis Services 2000, we were constrained to very rigid hierarchies defining the metadata of our cubes.  Whereas in 2005, we can model complex relationships between our database entities such as many-to-many relationships.  Detail-level reporting was also primarily in the realm of relational reporting.  It too is addressed by the UDM as discussed below.


The new dimension types in Analysis Services 2005 contribute greatly to these objectives.  Let’s look at each one.



Fact Dimensions

 Fact Dimensions are otherwise known as degenerate dimensions (a term coined by Ralph Kimball).  Let's say our fact table contains information that would be useful to query by or even aggregate by; for example, a courier tracking number.  In Analysis Services 2000 we would have to create a TrackingNumber dimension table and populate it as part of the ETL.  But this causes unnecessary duplication of data and adds extra complexity to the design.  Analysis Services 2005 overcomes this limitation because the fact table can double up as a dimension table.  With fact dimensions, we can drill down to the very lowest level of detail (where there is a member for each fact table row) natively in the UDM.  We use the fact table primary key as the key attribute of the fact dimension.


In Analysis Services 2000, we required techniques such as drill through to achieve this.  But drill through is an added layer of complexity and often doesn’t work as well as direct relational reporting for detail-level reports.  Another option is to revert to relational reporting altogether, but this introduces inconsistency because we no longer have a single meta-data model for data access; we have to use TSQL for some queries, MDX for others, potentially different reporting tools, some databases are optimised for relational reporting while others are geared to OLAP reporting …


Note that Autoexists does apply when querying across a fact dimension with another dimension.  For those who have not heard of Autoexists, it is a concept introduced with Analysis Services 2005.  The attribute-based model used by the UDM is vastly richer than the dimensional model used by Analysis Services 2000; however, this means the number of attribute permutations can become overwhelming.  Autoexists resolves the problem by not returning non-existent cells when performing cross-joins.  For example, the query “Select (Date.[Calendar Year].[2005] * Date.[Calendar Quarter].Members) …” will return only the quarters of 2005 cross-joined with 2005.  The query will not return nulls for the 2003 quarters cross-joined with 2005, the 2004 quarters cross-joined with 2005, and so on.  Note that Autoexists only affects data pertaining to dimensions; it has nothing to do with fact table data.  Furthermore, Autoexists only applies to the cross-join of attribute hierarchies within a single dimension, with the exception of fact dimensions.
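A fuller sketch of such a query against the Adventure Works cube follows; the exact hierarchy and member names may differ in your build of the sample database:

```mdx
// Autoexists: only the quarters that actually belong to CY 2005
// survive the cross-join; no null tuples appear for 2003/2004 quarters.
SELECT
    [Date].[Calendar Year].[CY 2005] *
    [Date].[Calendar Quarter].Members ON COLUMNS
FROM [Adventure Works]
```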


Another feature of Analysis Services 2005 that was not in 2000 is that we can have a MOLAP measure group related to a ROLAP dimension.  So we can browse MOLAP data as usual and then, once we decide to use a dimension that is stored as ROLAP, the query will switch to ROLAP mode.  The reason this is relevant to fact dimensions is that they are often good candidates for ROLAP storage.  We typically wouldn’t want to store millions of OrderDetail records in a MOLAP store because there would be no aggregation and it would not utilise the strengths of MOLAP.  When I tested the performance of ROLAP vs. MOLAP queries in this scenario, there was hardly any difference at all.  However, the cube-processing window can be reduced with ROLAP fact dimensions.


The following screenshot shows the dimension designer for a fact dimension.  The things to note are as follows.

  1. The dimension table is a fact table (FactInternetSales).
  2. The KeyColumns property of the key attribute (Internet Sales Order) is the primary key of the fact table (SalesOrderNumber and SalesOrderLineNumber).

 Fact dimension designer



The following screenshot shows how a fact dimension is set up in the Dimension Usage tab.  The dimension and measure group tables are one and the same.

 Fact dimension displayed in Dimension Usage tab

Role-Playing Dimensions

 A single dimension can be related to a set of facts through multiple roles.  For example, FactOrderDetails can have an order date, a ship date, and a receipt date.  In this case our time dimension fulfils 3 roles with respect to dimensioning orders.


The way this is represented in a Data Source View is with a foreign key relationship for each role.  As shown by the following screenshot, there are 3 foreign-key relationships between FactInternetSales and DimTime.  One is selected and the columns forming the relationship are displayed.  In this case the source column is ShipDateKey, but the other foreign keys use OrderDateKey and DueDateKey respectively.

 Role-playing dimension

The only way to do this in Analysis Services 2000 was to implement a separate dimension for each role, which results in data replication and management overhead such as extra processing time.  In fact, in Analysis Services 2005, each role actually appears as a separate dimension to the end user.  But in terms of storage, maintenance and design, there is significantly less overhead.



Reference Dimensions

 With a reference dimension, the dimension table is indirectly related to the fact table through another dimension table.  This is similar to virtual dimensions in Analysis Services 2000.


This is shown by the following example.  Customer is a regular dimension and Geography is a reference dimension.

 Reference dimension

One option would be to have a CustomerGeography dimension that covers both of the dimension tables in the snowflake schema.  CustomerGeography would contain both customer and geography attributes.


The case where we would want Geography as a separate reference dimension is if we want to reuse it against another dimension (or fact) such as Warehouse.  This would reduce storage and processing.


Note that we do not get Autoexists behaviour across the two dimensions and we cannot have user hierarchies that span both the reference and the referenced dimensions.


Reference dimensions in Analysis Services 2005 can be materialized whereas virtual dimensions in 2000 were calculated at runtime.



Many-to-Many Dimensions

 In Analysis Services 2005, we can have a single fact related to multiple dimension members.  This is represented by a many-to-many dimension.  Some use cases for many-to-many dimensions include the following.


  • A bank customer can have multiple accounts while an account can have joint owners.  When querying balance by customer, this has to be taken into consideration.  As a side note, this example would require that the balance measure be semi-additive.
  • A demographic can contain many customers and a customer can be in many demographics.

Both the above cases require a many-to-many mapping table between 2 dimension tables.  This is the “factless fact table”, a term coined by Ralph Kimball. The example in the AdventureWorks sample database is a bit simpler.  Goods are bought over the Internet and customers can select multiple reasons that prompted the purchase (e.g. advertising, promotion, word of mouth, etc.).  In this case, we only require a many-to-many mapping table between the sales fact table and the dimension table (instead of between 2 dimension tables).
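In relational terms, the bridge is just a mapping table keyed by the fact's primary key plus the reason key. A sketch along the lines of the AdventureWorks schema (column types are assumptions):

```sql
-- Sketch of the many-to-many bridge ("factless fact") table:
-- one row per (sales line, reason) combination, no measures.
CREATE TABLE FactInternetSalesReason (
    SalesOrderNumber     nvarchar(20) NOT NULL,
    SalesOrderLineNumber tinyint      NOT NULL,
    SalesReasonKey       int          NOT NULL,
    PRIMARY KEY (SalesOrderNumber, SalesOrderLineNumber, SalesReasonKey)
);
```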


This is illustrated by the following screenshot.  FactInternetSalesReason is considered a fact table by Analysis Services 2005, so it implies a measure group with no measures.  In relational database terminology, it is nothing more than a regular many-to-many mapping table.

 Many-to-many dimension relationship



The following screenshot shows how the dimension is set up in the Dimension Usage tab and where the intermediate measure group is specified.

 Many-to-many dimension in Dimension Usage tab



An important thing to bear in mind is that many-to-many dimensions do not necessarily calculate totals using simple summations.  As shown by the following screenshot, the sum of each Sales Reason Type does not equal the Grand Total.  This is because facts can be included in the aggregation none, one or many times.

 Many-to-many dimension aggregation



Note that many-to-many dimensions are not pre-aggregated across dimensions (analogous to materialised reference dimensions), and my understanding is that Microsoft currently has no plans to create a “materialised many-to-many dimension”.



 I must say that I am very impressed with Analysis Services 2005.


The IDE using the Visual Studio shell is really a first-class development environment.  The disconnected model where nodes in the Solution Explorer are in fact XML files (conforming to XMLA) is very well thought through.  Aside from enabling integration with SourceSafe and version management, this results in much better multi-tasking capabilities for BI developers and the IDE utilises good multi-threaded functionality.  How many times have you tried to do something in Analysis Manager 2000 only to discover that someone else is processing the cubes and you cannot do anything?  The disconnected model overcomes this.  When I demonstrated Analysis Services 2005 to one of our customers recently, I started deploying and processing the project and then casually moved onto other tasks in the IDE in parallel.  I found it amusing when a member of the audience exclaimed “wow, Analysis Services 2000 doesn’t even allow single users to multi-task, let alone multiple users!”.


I have really only touched the tip of the iceberg in terms of the work that has gone into Analysis Services 2005.  Other areas include massive MDX enhancements.  The “procedural” model with debugging/browsing capabilities and not worrying about pass, solve order and infinite recursion are great achievements.  I can see sophisticated MDX calculations that were pretty-much unworkable being straightforward with 2005.


We have already done much work with Analysis Services 2005 on the Ascend program and it appears that one of our big customers will participate in the Early Adopter Program in the Beta 3 timeframe, which is great news.


Slowly Changing Dimensions Alternatives

Kimball Design Tip #53: Dimension Embellishments
By Bob Becker
When developing dimensional models, we strive to create robust dimension tables decorated with a
rich set of descriptive attributes. The more relevant attributes we pack into dimensions, the greater
the users’ ability to evaluate their business in new and creative ways. This is especially true when
building a customer-centric dimension.

We encourage you to embed intellectual capital in dimensional models. Rather than applying
business rules to the data at the analytical layer (often using Excel), derivations and groupings
required by the business should be captured in the data so they’re consistent and easily shared
across analysts regardless of their tools. Of course, this necessitates understanding what the
business is doing with data above and beyond what’s captured in the operational source. However,
it’s through this understanding and inclusion of derived attributes (and metrics) that the data
warehouse adds value.

As we deliver a wide variety of analytic goodies in the customer dimension, we sometimes become
victims of our own success. Inevitably, the business wants to track changes for all these interesting
attributes. Assuming we have a customer dimension with millions of rows, we need to use
mini-dimensions to track customer attribute changes. Our old friend, the type 2 slowly changing
dimension technique, isn’t effective due to the large number of additional rows required to support
all the changes.

The mini-dimension technique uses a separate dimension (or dimensions) for the attributes that frequently change.
We might build a mini-dimension for customer demographic attributes, such as own/rent home,
presence of children, and income level. This dimension would contain a row for every unique
combination of these attributes observed in the data. The static and less frequently changing
attributes are kept in our large base customer dimension. The fact table captures the relationship of
the base customer dimension and demographic mini-dimension as the fact rows are loaded.
It is not unusual for organizations dealing with consumer-level data to create a series of related mini-dimensions.
A financial services organization might have mini-dimensions for customer scores,
delinquency statuses, behavior segmentations, and credit bureau attributes. The appropriate
mini-dimensions, along with the base customer dimension, are tied together via their foreign key
relationship in the fact table rows. The mini-dimensions effectively track changes and also provide
smaller points of entry into the fact tables. They are particularly useful when analysis does not
require consumer-specific detail.

Users often want to analyze customers without analyzing metrics in a fact table, especially when
comparing customer counts based on specific attribute criteria. It’s often advantageous to include the
currently-assigned surrogate keys for the customer mini-dimensions in the base customer dimension
to facilitate this analysis without requiring joins to the fact table. A simple database view or
materialized view provides a complete picture of the current view of the customer dimension. In this
case, be careful not to attempt to track the mini-dimension surrogate keys as type 2 slowly changing
dimension attributes. This will put you right back at the beginning with a large customer dimension
growing out of control with too frequent type 2 changes.

Another dimension embellishment is to add aggregated performance metrics to the customer
dimension, such as total net purchases last year. While we normally consider performance metrics to
be best handled as facts in fact tables (and they should certainly be there!), we are populating them
in the dimension to support constraining and labeling, not for use in numeric calculations. Business
users will appreciate the inclusion of these metrics for analyses. Of course, populating these
attributes in our dimension table places additional demands on the data staging system. We must
ensure these aggregated attributes are accurate and consistent.

An alternative and/or complementary approach to storing the actual aggregated performance metrics
is grouping the aggregated values into range buckets or segments, such as identifying a credit card
customer as a balance revolver or transactor. This is likely to be of greater analytic value than the
actual aggregated values and has the added benefit of assuring a consistent segment definition
across the organization. This approach works particularly well in combination with the mini-dimension technique.

Design Tip #127 Creating and Managing Mini-Dimensions

September 2, 2010 — Carlos Alberto Lorenzi Lima

I wrote a Design Tip last year on creating junk dimensions; I’ve decided to extend that article into a series on implementing common ETL design patterns. Warren Thornthwaite – Kimball Group.

This Design Tip describes how to create and manage mini-dimensions.  Recall that a mini-dimension is a subset of attributes from a large dimension that tend to change rapidly, causing the dimension to grow excessively if changes were tracked using the Type 2 technique.  By extracting unique combinations of these attribute values into a separate dimension, and joining this new mini-dimension directly to the fact table, the combination of attributes that were in place when the fact occurred are tied directly to the fact record. (For more information about mini-dimensions, see Slowly Changing Dimensions Are Not Always as Easy as 1, 2, 3 at Intelligent Enterprise and Design Tip # 53: Dimension Embellishments.)
Creating the Initial Mini-Dimension

Once you identify the attributes you want to remove from the base dimension, the initial mini-dimension build is easily done using the brute force method in the relational database. Simply create a new table with a surrogate key column, and populate the table using a SELECT DISTINCT of the columns from the base dimension along with an IDENTITY field or SEQUENCE to create the surrogate key.  For example, if you want to pull a set of demographic attributes out of the customer dimension, the following SQL will do the trick:
INSERT INTO Dim_Demographics
SELECT DISTINCT col1, col2, …
FROM Stage_Customer
This may sound inefficient, but today’s database engines are pretty fast at this kind of query.  Selecting an eight-column mini-dimension with over 36,000 rows from a 26-column customer dimension with 1.1 million rows and no indexes took 15 seconds on a virtual machine running on my four-year-old laptop. Once you have the Dim_Demographics table in place, you may want to add its surrogate key back into the customer dimension as a Type 1 attribute, to allow users to count customers based on their current mini-dimension values and report historical facts based on the current values.  In this case, Dim_Demographics acts as an outrigger table on Dim_Customer.
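Putting the pieces together, the initial build might look like this sketch. The demographic column names are illustrative only (borrowed from the own/rent home, presence of children, and income level examples above); the surrogate key comes from an IDENTITY column as the tip describes:

```sql
-- Sketch: surrogate key via IDENTITY, populated with SELECT DISTINCT.
-- Column names are invented for illustration.
CREATE TABLE Dim_Demographics (
    Demographics_Key     int IDENTITY(1,1) PRIMARY KEY,
    Own_Rent_Home        nvarchar(10),
    Presence_Of_Children nvarchar(10),
    Income_Level         nvarchar(20)
);

INSERT INTO Dim_Demographics (Own_Rent_Home, Presence_Of_Children, Income_Level)
SELECT DISTINCT Own_Rent_Home, Presence_Of_Children, Income_Level
FROM Stage_Customer;
```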
Again, the brute force method is easiest. You can join the Stage_Customer table, which still contains the source attributes, to Dim_Demographics on all the attributes that make up Dim_Demographics. This multi-join is obviously inefficient but, again, not as bad as it seems.  Joining the same million-plus-row customer table to the 36,000-row demographics table on all eight columns took 1 minute, 49 seconds on the virtual machine.  Once all the dimension work is done, you will need to add the mini-dimension key into the fact row key lookup process.  The easy way to do this during the daily incremental load is to return both the Dim_Customer surrogate key and the Dim_Demographics surrogate key as part of the customer business key lookup process.
Ongoing Mini-Dimension Maintenance

Ongoing management of the mini-dimension is a two-step process: first, you have to add new rows to the Dim_Demographics table for any new values or combinations of values that show up in the incoming Stage_Customer table.  This can be done using the same basic logic as the original build SQL, except that by using the MERGE command, you only insert rows when the combination of values is actually new. In the SQL Server MERGE syntax, you can leave out the WHEN MATCHED clause, and the statement will ignore the combinations that already exist in the Dim_Demographics dimension.
MERGE INTO Dim_Demographics AS target
USING Stage_Customer AS source
ON source.Payment_Type = target.Payment_Type
   {additional join columns}
WHEN NOT MATCHED THEN
    INSERT (Payment_Type, {list of additional columns})
    VALUES (source.Payment_Type, {additional source columns});
This should process very quickly because you are only joining to the incremental data set in the customer staging table. It took 11 seconds against the full source customer dimension in my sample data.
Next, once all the attribute combinations are in place, you can add their surrogate keys to the incoming incremental rows.  The same brute force, multi-column join method used for the initial lookup will work here. Again, it should be faster because the incremental set is much smaller.
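The multi-column lookup itself is a plain equi-join on every mini-dimension attribute; a sketch, reusing the same illustrative column names (Customer_BK stands in for whatever the customer business key is called):

```sql
-- Sketch: fetch the mini-dimension surrogate key for each incoming row
-- by joining on all the attributes that make up Dim_Demographics.
SELECT s.Customer_BK, d.Demographics_Key
FROM Stage_Customer s
JOIN Dim_Demographics d
  ON  s.Own_Rent_Home        = d.Own_Rent_Home
  AND s.Presence_Of_Children = d.Presence_Of_Children
  AND s.Income_Level         = d.Income_Level;
```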
By moving the Type 2 historical tracking into the fact table, you only connect a customer to their historical attribute values through the fact table. This may not capture a complete history of changes if customer attributes can change without an associated fact event.  You may want to create a separate table to track these changes over time; this is essentially a factless fact table that would contain the customer key, the Dim_Demographics mini-dimension key, a change event date, and a change expiration date. You can apply the same techniques we described in Design Tip #107 on using the SQL MERGE statement for slowly changing dimension processing to manage this table.
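A minimal sketch of that change-tracking factless fact table, again in Python with SQLite rather than T-SQL; the table and column names are hypothetical, and '9999-12-31' is the usual convention for the open-ended current row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE Fact_Customer_Demographics (
    Customer_Key     INTEGER,
    Demographics_Key INTEGER,
    Change_Date      TEXT,
    Expiration_Date  TEXT      -- '9999-12-31' marks the current row
)""")

def record_change(customer_key, new_demo_key, change_date):
    # Expire the customer's current row, then insert the new assignment.
    cur.execute("""UPDATE Fact_Customer_Demographics
                   SET Expiration_Date = ?
                   WHERE Customer_Key = ? AND Expiration_Date = '9999-12-31'""",
                (change_date, customer_key))
    cur.execute("""INSERT INTO Fact_Customer_Demographics
                   VALUES (?,?,?,'9999-12-31')""",
                (customer_key, new_demo_key, change_date))

record_change(1, 7, "2011-01-01")
record_change(1, 9, "2011-06-15")   # demographics changed mid-year
history = cur.execute("""SELECT Demographics_Key, Change_Date, Expiration_Date
                         FROM Fact_Customer_Demographics
                         ORDER BY Change_Date""").fetchall()
print(history)
```

The expire-then-insert pair is the same pattern the MERGE-based Design Tip #107 approach implements in a single statement.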


What is a degenerate dimension – Fact Relationship type

Information Management Online, December 9, 2003

Joe Oates, Chuck Kelley, Clay Rehm

What is a degenerate dimension and how is it used?
Chuck Kelley’s Answer: A degenerate dimension is data that is dimensional in nature but stored in a fact table. For example, if you have a dimension that only has Order Number and Order Line Number, you would have a 1:1 relationship with the fact table. Do you want to have two tables with a billion rows, or one table with a billion rows? Therefore, this would be a degenerate dimension, and Order Number and Order Line Number would be stored in the fact table.


Joe Oates’ Answer: Many data warehouse transaction fact tables have a control number, such as an invoice number, purchase order number or policy number. Over the course of the years reflected in the data warehouse, there could be hundreds of thousands or even millions of each particular control number, such as invoice numbers. It isn’t practical to have a drop-down list of each of these numbers just to find data typically found in the legacy system invoice header, such as the invoice number, invoice date and invoice type. Consequently, these data items are put into the line-item fact table so users can slice and dice as they wish. These control numbers are called degenerate dimensions.

If you were to have a dimension table for invoices, you would have nearly as many entries in the dimension table as in the line-item fact table. The line-item fact table is generally by far the largest table in the data warehouse. So joining the multimillion- or multibillion-row fact table to a multimillion- or multibillion-row dimension table will cause your data warehouse to take up much more disk storage than it should, as well as significantly degrading performance.

Clay Rehm’s Answer: A degenerate dimension is a dimension that is stored in the fact table rather than the dimension table. It eliminates the need to join to a Dimension table.
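The three answers above can be illustrated with a tiny sketch. The schema is hypothetical (Python with SQLite standing in for a real warehouse): the order number lives directly in the fact table, so slicing by it needs no join at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
CREATE TABLE Fact_Order_Line (
    Order_Number      TEXT,     -- degenerate dimension: no dimension table
    Order_Line_Number INTEGER,
    Product_Key       INTEGER,  -- normal dimension reference
    Amount            REAL
)""")
cur.executemany("INSERT INTO Fact_Order_Line VALUES (?,?,?,?)", [
    ("INV-1001", 1, 10, 25.0),
    ("INV-1001", 2, 11, 40.0),
    ("INV-1002", 1, 10, 25.0),
])

# Slicing by the control number is a plain filter, not a join.
total = cur.execute("""SELECT SUM(Amount) FROM Fact_Order_Line
                       WHERE Order_Number = 'INV-1001'""").fetchone()[0]
print(total)
```

A separate Dim_Order table here would have one row per fact row, which is exactly the 1:1 situation the answers warn against.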

“Multiple parent” (parent-child relationship)


While working on a Business Intelligence project where the data source was SAP R/3 and data was fetched directly from the SAP tables using SSIS 2008 with the Microsoft BizTalk SAP Connector, everything was smooth until one day I had to replicate a multiple-parent parent-child relationship (aka bill-of-materials) that SAP creates in the Finance module.

These relationships occur when a child node can have more than one parent. Even though these types of relationships are not supported by default in SQL Server Analysis Services 2008 R2, I kept trying to implement them until I found a solution that works! In this document I explain how I did it.

In all bill-of-materials scenarios we have one fact table, one dimension table and one bridge (or intermediate) table that connects the fact table with the dimension table. This document is structured in 4 sections: 3 covering the tables mentioned before, plus one for the required Analysis Services configuration.

Section 1 – Data Source Scenario

The data source tables are introduced and their structure explained in this section.  The fact table and its relationship with the dimension table are explained here.

This section will be quite familiar to those who have worked in a SAP environment, extracting data directly from SAP tables. If your scenario is not this specific SAP one, you can skip this section completely and go straight to section 2.

Section 2 – Building the Cost Element Dimension

A bill-of-materials dimension table has a slightly different structure than a normal dimension table in a non-BOM case. Section 2.3 covers this difference. The rest of the section details how to flatten the SAP hierarchy of nodes and leaves. If in your case you have a ready-to-go dimension table, I still strongly suggest reading section 2.3.

Section 3 – Relate the Fact table with the Dimension table

This section describes the importance of the bridge or intermediate table that will connect the fact table with the dimension table.

Section 4 – Configuring the cube

Finally, section 4 details the configuration needed in Analysis Services 2008 R2 to make all parts work together in a bill-of-materials scenario.

Section 1 – Data source scenario

The specific solution covered in this document retrieves data for the Finance department to report on the expenditure of each department. In SAP terms that means the Cost Centers and Cost Elements, which are stored in the following tables’ structure:


  • CSKS: Cost Center definition table
  • CSKA: Cost Element definition table

These two tables contain each individual Cost Center/Element, identified by the OBJNR field in the CSKS table and the KSTAR field in the CSKA table.

Both Cost Centers and Cost Elements are structured in a hierarchy defined in the SETHEADERT, SETNODE and SETLEAF tables:

  • SETHEADERT: Description (long text) for each set.
  • SETNODE: Declares for each node (SETNAME) which are its children (SUBSETNAME). This is a normal recursive parent-child table. When a node doesn’t have any child it is considered a SETLEAF.
  • SETLEAF: Contains only the last members of the SETNODE hierarchy, i.e. the leaf nodes. Each leaf node spans a range of Cost Centers/Cost Elements (OBJNR/KSTAR); this range is declared in the VALFROM and VALTO fields.

The big problem is that each OBJNR/KSTAR can be in more than one SETLEAF, and that one SETNODE can be in more than one SETNODE hierarchy. This is known as a multi-parent parent-child relationship, or bill-of-materials.
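The multi-parent situation is easy to see in code. A short Python sketch (the parent/child names below are illustrative, not real SAP data) detects children of SETNODE that appear under more than one parent:

```python
# Each SETNODE row declares (parent SETNAME, child SUBSETNAME).
setnode = [
    ("QA10_BJP", "109200"),   # child 109200 under parent QA10_BJP
    ("QA10_CC",  "109200"),   # same child, second parent
    ("QA10_CC",  "PURSERV"),
]

# Collect the set of parents for every child.
parents = {}
for parent, child in setnode:
    parents.setdefault(child, set()).add(parent)

# Any child with more than one parent makes this a BOM hierarchy.
multi_parent = {c: sorted(p) for c, p in parents.items() if len(p) > 1}
print(multi_parent)
```

If this dictionary is non-empty, a plain SSAS parent-child dimension cannot represent the data and the bridge-table approach below is needed.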

To complete the scenario and have an example let’s include our fact table.

The COEP table is a fact table that stores, for each expense in the company, the amount and the cost center and cost element that the expense is related to.

For example, you buy a computer whose expense goes to cost center X (OBJNR) and cost element Y (KSTAR).


  • COEP is related with CSKS (Cost Center) by OBJNR field.
  • COEP is related with CSKA (Cost Element) by KSTAR field.
  • The rest of the fields are fact data.

Let’s have a simple example:

COEP Fact table:

  • OBJNR: KS100000120000
  • KSTAR: 0000404050

To make things simpler, in this document we will follow the solution for the Cost Element object. To implement the Cost Center object, the same steps need to be followed.

The KSTAR 0000404050 appears in 2 leaf nodes (SETLEAF), PURSERV and MAINTREN. The PURSERV node (SETNODE) appears under at least 2 different parents, as the following screenshot shows:

So, the expense that we have here needs to be reported under 2 different hierarchies, but it must not be double-counted if there is a parent node common to the two hierarchy trees. This is a normal BOM scenario.

Section 2 – Building the Cost Element Dimension

2.1. Retrieve all SETNODE AND SETLEAF records

Using SSIS and the Microsoft BizTalk SAP Connector, fetch all rows and columns from the SAP tables described before and import them into a staging database in SQL Server.

The rows are filtered by the SETCLASS and SUBCLASS columns to retrieve only the ones I’m interested in. In my case, SETCLASS = 0101 and SUBCLASS = QA10. These values could, and most probably will, be different in your company.

2.2. Detect the hierarchies and build the staging relationship tables

Hierarchy table for nodes (SETNODE)

Using a recursive stored procedure with a cursor inside, I build a relationship table that contains all possible paths for the SETNODE hierarchies. The stored procedure “walks” the SETNODE table and inserts rows into the Z_SETNODE_HIERARCHY table with the appropriate information.

Notice how the node with SETNAME equal to “109200” appears in 2 different hierarchies: one under “QA10_BJP” and the other under “QA10_CC”. It’s important to understand that this table holds ALL possible hierarchy trees; that’s why node “109200” appears multiple times.
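What the recursive stored procedure does can be sketched in a few lines of Python (this is an illustration of the walk, not the author's actual T-SQL; the node names are sample data). Every path from the root to a node produces one row, so a node with two parents appears twice:

```python
# Parent -> children adjacency, as would be read from SETNODE.
children = {
    "ROOT":     ["QA10_BJP", "QA10_CC"],
    "QA10_BJP": ["109200"],
    "QA10_CC":  ["109200"],
}

def walk(node, path, out):
    # Emit one row per (node, full path to it), like Z_SETNODE_HIERARCHY.
    out.append((node, path))
    for child in children.get(node, []):
        walk(child, path + "/" + node, out)

hierarchy = []
walk("ROOT", "", hierarchy)
print(hierarchy)
```

Node "109200" comes out twice, once per hierarchy path, which is exactly the behavior described above.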

Complete LEAF information

A second table will store the complete SETLEAFs information crossed with the definition table CSKA. This table will be called Z_CSKA_SETLEAF.

To build this table I’m using another stored procedure, called p_Z_CSKA_SETLEAF.

Hierarchy table for LEAFs (SETLEAF)

A third table containing the relationships between SETLEAFs and SETNODEs is needed. In this table we have one entry for each KSTAR (SETLEAF) related to a SETNODE. 

If you are interested in the stored procedures that create the tables described above, please click the following link to a Word document containing the code for the stored procedures, in the order in which they should be executed: BOMStoredProcedures

Note: To execute the stored procedures I use the Execute SQL Task in SSIS.

The resulting tables from this process are:

2.3. Create the Dimension table DimCostElement

With the previous 2 tables in place, we can now load them into ONE single Dimension table, DimCostElement. First we load the nodes from the Z_SETNODE_HIERARCHY table and then the leaves from the Z_CSKA_SETLEAF_SETNODE table.

By joining these two tables we have “flattened” the hierarchy. Now as leaves and nodes are in the same table, they can be considered the same object type.

Remember that this Dimension table contains ALL the possible paths in the hierarchy, so a child will appear multiple times in the table. Also remember that a child can now be not only a leaf member but also a normal node member. There is no more distinction between leaf and node members.

The structure of this table is very important, so let’s have a deeper look at it:

DimCostElement table contains a CostElementID and a ParentID field. The CostElementID field acts as ID for the table, while the ParentID field just refers to the former to create a parent-child relationship.

The SETNAME field contains the SETNAME for the SETNODEs and the KSTAR for the LEAFs. In this way we have unified the SETNAME for the two different object types. Be aware that one SETNAME can have multiple CostElementIDs, because that SETNAME can appear in more than one hierarchy path.

The concept of decoupling the name of an object from its database ID, allowing the object name to have more than one ID, is what allows us to have multiple-parent parent-child relationships. This is the most important concept to learn in this exercise.
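The decoupling can be sketched as follows (Python illustration with sample names, not the actual load procedure): every occurrence of a SETNAME in the flattened hierarchy gets its OWN CostElementID, and ParentID points at the occurrence it hangs under.

```python
# (SETNAME, parent SETNAME) pairs as produced by the flattening step.
paths = [
    ("ROOT", None),
    ("QA10_BJP", "ROOT"),
    ("QA10_CC", "ROOT"),
    ("0000404050", "QA10_BJP"),   # the same leaf KSTAR...
    ("0000404050", "QA10_CC"),    # ...under two different parents
]

dim, id_by_name, next_id = [], {}, 1
for name, parent in paths:
    parent_id = id_by_name.get(parent)          # None for the root
    dim.append({"CostElementID": next_id, "ParentID": parent_id,
                "SETNAME": name})
    id_by_name[name] = next_id
    next_id += 1

# One SETNAME, two CostElementIDs: the key to the multi-parent hierarchy.
ids_for_leaf = [r["CostElementID"] for r in dim
                if r["SETNAME"] == "0000404050"]
print(ids_for_leaf)
```

Note the sketch assumes each parent name is unique in the sample; in the real table the parent lookup is by hierarchy path, not by name alone.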

The last two fields are description fields.  

Section 3 – Relate the Fact table with the Dimension table

The fact table, COEP, has for each entry one KSTAR field that relates the described expense with one Cost Element in the dimension table; but because one Cost Element can be in more than one hierarchy the relationship between the fact table and the dimension table cannot be direct. We need an intermediate table.

The intermediate table pairs each KSTAR from the fact table with all possible CostElementIDs in the DimCostElement table. Remember that in our case the KSTAR field from the fact table contains the same value (this is the relationship) as the SETNAME field in the DimCostElement table; but because of the multi-parent parent-child nature of the Cost Elements, one SETNAME can have multiple CostElementIDs. The intermediate table therefore stores, for each KSTAR in the fact table, all the CostElementIDs from the DimCostElement table (a many-to-many relationship).
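The bridge build is a simple match of fact KSTARs against dimension SETNAMEs; a Python sketch with sample data (the real version is a join query in a stored procedure):

```python
# Dimension rows: (CostElementID, SETNAME). One SETNAME, several IDs.
dim_cost_element = [
    (4, "0000404050"),
    (5, "0000404050"),
    (6, "0000500000"),
]
fact_kstars = {"0000404050"}   # distinct KSTAR values seen in COEP

# One bridge row per (KSTAR, CostElementID) pair whose SETNAME matches.
bridge = [(kstar, ce_id)
          for ce_id, setname in dim_cost_element
          for kstar in fact_kstars
          if setname == kstar]
print(bridge)
```

Each fact KSTAR fans out to every CostElementID that carries it, which is the many-to-many relationship the cube will ride on.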

You can create another stored procedure that fills the fact-less table with the result of a join query between the fact table and the dimension table.

In SQL Server 2008 this type of intermediate table is known as a Fact-less table. You can search Books Online for further information about them.

In this solution we break one of the golden rules of data warehousing: KSTAR values in the fact table should be replaced by their equivalent CostElementID from the DimCostElement table. But because one KSTAR value can have more than one CostElementID (remember that one KSTAR can be in more than one leaf member), keeping KSTAR in the fact table is the better solution and saves us from creating yet another intermediate table.

With this last table the database is ready for Analysis Services. In the next section I describe how to configure the cube to allow the BOM relationships and properly aggregate them.

Section 4 – Configuring the cube

4.1. Adding the fact and dimension tables

The first step is to add the 3 previous tables (fact, dimension and intermediate) to the data source view and create the appropriate relationships between the tables (in case you are like me and don’t create any relationships in the database itself).

Create an empty cube and add the tables manually, to fully control how Analysis Services understands them.

The 3 tables will be added and configured in the cube as follows:

  • The Fact COEP table will be added to the cube as a fact table with one measure group.
  • The intermediate Fact-less table will be added to the cube as both a dimension and a fact table.
  • The DimCostElement table will be added to the cube as a dimension table.

Ideally the Fact-less table would only be configured as a fact table, as it doesn’t contain any data, but SSAS needs 2 dimensions when configuring many-to-many relationships. The reason behind this is that SSAS treats a many-to-many relationship as 2 regular relationships put together.

You can add the Fact-less table first as a dimension and later as a fact table, or in the opposite order; it really doesn’t matter. To avoid confusing the final user, set the “Visible” attribute of the dimension to “False” and do the same for each measure of the fact. This will prevent the false dimension and the fact-less measures from appearing in any client software that queries the cube.

4.2. Configuring the Fact-less dimension


The Fact-less dimension will have 3 attributes:

  • Factless – Cost Element ID: This attribute is the CostElementID field from the Fact-less table
  • FactCOEPActual – KSTAR: This attribute is the KSTAR field from the Fact COEP table (main fact table)
  • DimCostElement – Cost Element ID: This attribute is the CostElementID field from the DimCostElement table

Note: I like to name the attributes in these special tables with the name of their source table preceding their name for clarity purposes.

These 3 attributes are needed to later build the many-to-many relationships between the tables, as it is explained in the next section of this document.

4.3. Configuring the Dimension Usage

The Dimension Usage tab allows us to define the relationship between the dimensions and the measure groups (facts) contained in a cube. It is here where we can tell Analysis Services that we have a multi-parent parent-child relationship between our 3 tables.

The screenshot below shows the final scenario for the Cost Center (Factless OBJNR Cost Center) and Cost Elements (Factless KSTAR Cost Element). As this document only covers the Cost Element case, let’s focus our attention on the Actual Values and Factless KSTAR Cost Element columns/measure groups and Dim Cost Element (Cost Element) and Factless KSTAR Cost Element rows/dimensions.

In this document I will not describe the different types of dimension relationships. I encourage you to read about this topic on Microsoft MSDN here:

First relationship

(Dimension) Factless KSTAR Cost Element with (Measure G)
Factless KSTAR Cost Element:

This is a Regular relationship, as the dimension and the measure group are joined directly by a key column.

Configure the relationship as in the following screenshot:

Second relationship

(Dimension) Dim Cost Element with (Measure G) Factless KSTAR
Cost Element:

This is a Regular relationship, as the dimension and the measure group are joined directly by a key column.

Configure the relationship as in the following screenshot:

Third relationship

(Dimension) Factless KSTAR Cost Element with (Measure G) Actual Values:

This is a Regular relationship, as the dimension and the measure group are joined directly by a key column.

Configure the relationship as in the following screenshot:

Fourth relationship

(Dimension) Dim Cost Element with (Measure G) Actual Values:

This is the most important relationship to define. The Cost Element dimension is related to the fact table in a many-to-many fashion. Analysis Services needs an intermediate (fact) table in order to build the relationship.

As our intermediate table has all possible paths for all the hierarchies and the dimension table has a parent-child relationship built-in, this combination will give us the bill-of-materials scenario.

Configure the relationship as in the following screenshot:

After all the configuration is done, process the cube (first the dimensions and then the measure groups) and start enjoying a multi-parent parent-child relationship (or Bill-of-Materials) in SQL Server Analysis Services 2008.