Archive for the ‘Data Mining’ Category

Correlation and causality – The bases of a good prediction


Categorias:Data Mining

SSAS: Explaining the State Transitions Viewer for Sequence Clustering

Pingback from: SSAS:

Original article (before edits) by Allan Mitchell, SQL Server MVP and principal consultant at Konesans Limited in the UK.  You can find Allan’s website .
NOTE: Alternatively, you can download a Word file of this whitepaper atTechNet Gallery . This is for several potential reasons: (1) To easily print it. (2) To see the original version. (3) To customize it for personal use. (4) To share an offline copy.


In SQL Server 2005 and SQL Server 2008, Microsoft has added some fantastic visualizations around data-mining algorithms. These visual aids allow us to see exactly what a particular algorithm is predicting or describing—making a difficult subject easier to understand.

In this article:


In this article, I look at the State Transitions viewer for the Microsoft Sequence Clustering algorithm and explain exactly what you are seeing and why. Although it is not necessary, this will be easier to understand if you have a statistics background or previous experience with data mining. You can find an introduction to the Microsoft Sequence Clustering Algorithm here: .

The image below is an example screen from the State Transitions viewer.

The screenshot is taken from the Sequence Clustering model that is deployed as part of the sample Analysis Services Project for SQL Server 2008. You can download the Analysis Services Project here:


The screenshot presents a very simple, clean view of a transition. When you look at your models, you will undoubtedly see a lot more nodes, lines, and numbers. However, take heart that everything we learn here is equally applicable in larger sequences. I just wanted a clean, uncomplicated view of a transition.

Back to top

The Viewer in General

This slider on the left side allows us to gradually filter in or out weaker or stronger links between items in a sequence. The links are determined by the transition probability. If the slider is at the top then, no matter how slight the probability, all the links will be displayed.  If the slider is at the bottom, only the strongest links will be displayed.

Across the top, there are many tabs and two dropdown combo-boxes. Once you click each of these tabs, you see another viewer for our Sequence Clustering model. The dropdown on the left allows us to choose our mining model, and the dropdown on the right allows us to choose the viewer. This is important; you might find that you use the viewer dropdown quite frequently.

Of the two available viewers, this article concentrates on the Microsoft Sequence Cluster Viewer. The Microsoft Sequence Cluster Viewer is the more graphical of the two, and I find it easier to understand. The second viewer, the Microsoft Generic Content Tree Viewer, is not graphical, but it contains more information, which is extremely useful when you want to dig deeper into the algorithm.

Back to top

The Viewer in Detail

Sequence Start and End

Let’s dive straight in and try to work out what the model viewer is telling us. In this article, I have chosen to look at Cluster 13 for this particular sequence clustering model. One of the first things I noticed and often get asked about is the “triangles with balls.” I know this is not a very technical name, but in the absence of anything else, I am using the term here.

Here is the ball on the flat edge of the triangle:

This describes that ML Mountain Tire is the first state in the sequence.

Here is the ball on the point of the triangle:

This describes that the ML Mountain Tire state is the last in the sequence and nothing comes after it.

Description of the numbers in the viewer

Looking at the original screenshot, we see that for this particular sequence we would start with the ML Mountain Tire state 61 percent of the time and the Mountain Tire Tube state 39 percent of the time. Where ML Mountain Tire is the first item in the sequence, we can expect it to be followed by the Mountain Tire Tube state in 32 percent of cases and in 68 percent of cases where it is the end of the sequence. When Mountain Tire Tube is the starting sequence, we can expect it to never be followed by anything—we know this because of the value 1.00 (see Fig 1) and the end-of-sequence ball on the point of a triangle. Each sequence has the unique color to mark its states (including start and end states), link, and probabilities.

Explanation of the source of the numbers

Now let’s query the metadata around the model using Data Mining eXpressions (DMX). Although a detailed look at DMX is out of scope for this article, I will explain my queries as we go along.

For this cluster, the first thing I want to know is the probabilities of each state being first in the sequence. Earlier we saw in the viewer that the probabilities are 61 and 39 percent respectively. The query below shows how we can retrieve the same information when using a DMX (Data Mining Expression) query.

Here is the query:

SELECT FLATTENED NODE_UNIQUE_NAME, (SELECT ATTRIBUTE_VALUE AS [Product 1], [Support] AS [Sequence Support], [Probability] AS [Sequence Probability] FROM NODE_DISTRIBUTION WHERE [Support] > 0 ) as t FROM [Sequence Clustering].CONTENT WHERE NODE_TYPE = 13 AND [PARENT_UNIQUE_NAME] = 13

Query 1

The outer part of Query 1 selects from the [Sequence Clustering] model and asks for a NODE_TYPE of 13. This NODE_TYPE is the type that holds the first states of possible sequences. Remember that a cluster can have multiple possible starting points for sequences/runs of states. Query 1 also asks for where PARENT_UNIQUE_NAME is 13. This means we want to look at Cluster 13.

What might be slightly confusing are the FLATTENED keyword and the nested table in Query 1. The following code reads from a nested table that is returned as part of the [Model].CONTENT request.

(SELECT ATTRIBUTE_VALUE AS [Product 1], [Support] AS [Sequence Support], [Probability] AS [Sequence Probability] FROM NODE_DISTRIBUTION WHERE [Support] > 0 ) as t

Query 2

Here I am asking for the Value of the first sequence state, the probability of the state, and the amount of cases that support that state where there is at least some support.

Here are the results:

NODE_UNIQUE_NAME t.Product 1 t.Sequence Support t.Sequence Probability
884722 ML Mountain Tire 168.4872 0.613636
884722 Mountain Tire Tube 106.0845 0.386364

Table 1

We can see that the numbers returned correlate nicely with what the viewer shows us.

Now let’s move a little further and see what the probabilities are of the next states in the sequence when ML Mountain Tire is the first state.  For this I am going to use a slight variation of Query 1:

SELECTFLATTENED NODE_UNIQUE_NAME, (SELECT ATTRIBUTE_VALUE AS [Product 1], [Support] AS [Sequence Support], [Probability] AS [Sequence Probability] FROM NODE_DISTRIBUTION) as t FROM [Sequence Clustering].CONTENT WHERE NODE_TYPE = 13 AND [PARENT_UNIQUE_NAME] = 13

Query 3

The only difference from Query 3 to Query 1 is that I have not restricted the nested table to only show items that have [Support] > 0.

Here are the results of Query 3:

Table 2

Query 3 shows us what the first states are for Cluster 13.  As we can see, there are only two possibilities.  We are concentrating on ML Mountain Tire here.  To find out what states follow ML Mountain Tire, we need to count down the rows in Table 2 until we reach ML Mountain Tire.  Counting should start at 0; Row 0 is always the Missing state.  If we count, we should get to 14.  We now need to go look at the transition states.

The following query will tell us where to find the node that holds the second states for sequence state 14.  We also restrict the PARENT_UNIQUE_NAME to the NODE_UNIQUE_NAME we retrieved by looking at first states of sequences in Query 1.

SELECT NODE_UNIQUE_NAME FROM [Sequence Clustering].CONTENT WHERE NODE_DESCRIPTION = ‘Transition row for sequence state 14’ AND [PARENT_UNIQUE_NAME] = ‘884722’

Query 4

We now need to take the result of Query 4, 884737, and use it to get the second state items.

SELECTFLATTENED (SELECT ATTRIBUTE_VALUE AS Product2, [Support] AS [P2 Support], [Probability] AS [P2 Probability] FROM NODE_DISTRIBUTION) AS t FROM [Sequence Clustering].CONTENT WHERE NODE_UNIQUE_NAME = ‘884737’

Query 5

As you can see, Query 5 is pretty much the same query as before, but NODE_UNIQUE_NAME has changed to the node returned in the previous query. Here are the results of Query 5:

Table 3

The probability column shows us that there is a 31 percent probability that Mountain Tire Tube will follow ML Mountain Tire and a 68 percent chance that nothing will follow.  This correlates nicely with what we see in the viewer.

Perform the same query for the Mountain Tire Tube state.  If we go back to the results in Table 2, we see that Mountain Tire Tube is the 17th transition State Row.  To find the second states, this changes our query to the following:

SELECT NODE_UNIQUE_NAME FROM [Sequence Clustering].CONTENT WHERE NODE_DESCRIPTION = ‘Transition row for sequence state 17’ AND [PARENT_UNIQUE_NAME] = ‘884722’

Query 6

As before, we take the results of Query 6, 884740, and use it to find the second states:

SELECTFLATTENED (SELECT ATTRIBUTE_VALUE AS Product2, [Support] AS [P2 Support], [Probability] AS [P2 Probability] FROM NODE_DISTRIBUTION) AS t FROM [Sequence Clustering].CONTENT WHERE NODE_UNIQUE_NAME = ‘884740’

Query 7

The results below correlate again very nicely with what we see in the viewer.  Here 0.99999999 is identical to probability 1 (100%).  It is what statisticians call “almost surely.”  We will never be able to run this sequence enough times to say that is is always true (surely), but we are very sure it will happen.

Table 4

Whenever Mountain Tire Tube is the State in a sequence for this cluster, we always end on that state.

Back to top


Microsoft is making great strides into providing a means for more and more people to use Data Mining without feeling that it is too complex.  The viewers provided with SQL Server 2005 and 2008 are excellent ways to visualize what the model has learned about your data.  This article has shown how to interpret one of those viewers.  Thank you for reading, and please send us your feedback to let us know how useful this was or if you have any other thoughts.

About the original author (before wiki edits): Allan Mitchell is a SQL Server MVP based in the UK.  He specializes in the Microsoft SQL Server BI stack with a passion for Data Mining and SQL Server Integration Services.You can find Allan’s website at .

Data Mining Startup Websites


Stanford Data Mining Video Lessons – Statistical Aspects of Data Mining


Creating your first Data Mining Structure and Model


Data mining is a great way to help your company make decisions and predict future values. The Data Mining Algorithms built into SQL Server Analysis Services gives you this power.
The Adventure works data base comes with views that are already set up to perform data mining. Let’s take a look at one of these views. Open the Target Mal view by right clicking on it in SSMS and select Design.
The design view shows you the query used to create the view. The dim Customer Table and another view are used to create the Target Mail view. From Dim Customer we get a lot of the customer information and the other view has the bike purchase information. This gives you a simple view of who has bought a bike and the attributes of those customers.
In the query you can see a Case Statement that just gives a one for bike buyers and a zero for non bike buyers.

Now that you understand the data you can go into BIDS and perform some data mining.
In BIDS start an Analysis Services project, and then create a Data Source and a Data Source View containing the Target Mail View.
The first step is to create a mining structure. You can think of the mining structure as the blue print for the data mining models that are going to be created on the mining structures.
1. Right Click on the Mining Structures folder in the Solution Explorer and select New Mining Structure. Next->

2. Click the radio button next to From Existing relational database or data warehouse. Next->

3. Select Microsoft Decision Trees. Next->

4 Select the Data source view that contains the Target Mail View. Next ->
5. Place a check next to the vTargetMail view under case. Next –>
Now you are on the screen asking what data you want to use to train your mining model. Here you will place a check next to the columns you want to use in determining which members will most likely be bike buyers based on your existing customers. You also need to select a Key and a Predict column. The Key will be the Customer Key and should already be checked. The Predict columns will be the BikeBuyer because that is what we are trying to predict.
6. Place a check next to Bike Buyer under Predict.
7 .Place a check under input next to the following columns: Age, Commute Distance, Gender, House OwnerFlag, Mairtal Status, NumberCarsOwned, NumberChildrenHome, Region, TotalChildren, and Yearly Income. Next->

8. Click the detect button. This is setting the data types for the mining model.  A description of each of these types can be found here.

9. Leave the Percentage for testing set to 30%. Next->

10. Name the Structure and Model and select Allow Drill Through. Finish->

You have now created the Mining Structure and a Mining model using the Decision Tree Algorithm. Now it is time to process and deploy the model. Click on the Mining Models Tab and view the Model. Notice the Bike Buyer is set to Predict Only and the Customer Key is the Key. The rest of the columns are set to input.

11. Click on the Mining Model Viewer, you will receive a popup asking to process and deploy, click yes. You might receive other popup warning about the number of rows, click yes.

12. Click Run in the process screen.

13. Click Close on the process screen once the process is complete.


14. Click Close the process screen too.
Now you will be in the Mining Model Viewer and be able to see the Decision Tree model Notice how age is in the first level of the tree. Now browse the model and you can see what traits of the customers are most likely to buy a bike. Change the background to 1 to see the most likely bike buyers. Slide the level over to the right to see all levels.
Click on a level and look at the properties on the right and see the probability of them being a bike buyer.

Congratulations you have created you first Data Mining Structure and Model.