Data-Driven Analytics in SAS Viya – Decision Tree Icicle Plot and Variable Importance
Getting Started
In today’s post, we'll finish looking at the results of a decision tree in SAS Viya by examining icicle plots and variable importance. In my previous post in this series, we examined autotuning and the treemap of a decision tree built in SAS Visual Statistics. I also discussed how to create predicted values and invoke the interactive mode. Moving forward, we will continue to focus on the part of the AI and Analytics lifecycle that involves developing and interpreting robust models. Specifically, we will finish examining the remaining pieces of output from the decision tree that was built using variable annuity (insurance product) data.
Insurance Data
Remember, the business challenge is to identify customers who are likely to respond to a variable annuity marketing campaign and make a purchase. The develop_final table introduced previously contains just over 32,000 banking customers. The input variables reflect demographic information as well as product usage captured over a three-month period. The target variable, Ins, is binary and indicates whether the customer purchased the annuity product.
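Before interpreting any model output, it's worth confirming the target's distribution for yourself. Here is a minimal sketch, assuming develop_final has already been loaded into a CAS library; the session and libref names are illustrative, so adjust them to your environment:

```sas
/* Start a CAS session and assign a libref to it.                */
/* Session name "mysess" and libref "mycas" are illustrative.    */
cas mysess;
libname mycas cas sessref=mysess;

/* Confirm the target distribution: roughly 65% non-purchasers   */
/* (Ins=0) versus 35% purchasers (Ins=1).                        */
proc freq data=mycas.develop_final;
   tables Ins;
run;
```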
Decision Tree Results
You may remember that my last post ended by covering the following results in some detail:
- The summary bar across the top of the page.
- The Decision Tree, which is an interactive, navigational treemap.
Today we will discuss the remaining output:
- The Icicle Plot, which reveals a detailed hierarchical breakdown of the tree data.
- The Variable Importance Plot, which displays the relative importance of each input variable.
- The Leaf Statistics, which reveal counts and percentages for each leaf node.
Icicle Plot
Introduced in the 1980s, icicle plots are space-filling visualizations, closely related to treemaps, designed to represent hierarchical structures like decision trees. Unlike traditional tree diagrams, which can spread outward and quickly become unwieldy, icicle plots provide a compact, stacked visualization in which each level of the hierarchy sits directly beneath its parent. Using color and bar width, icicle plots allow users to trace decision paths, identify influential features, and compare different trees efficiently.
The icicle plot for our insurance data captures the essence of what happens within the decision tree. Starting at the top, the longest blue bar is the root node. By highlighting it, we can see in the pop-up that it represents Node ID 0, which holds all 32,264 observations, the majority (65%) of them non-purchasers. Just like the original decision tree, this bar is labeled Saving Balance, the variable that determines the first split in the tree. The next row down contains two bars representing the result of that first split. Highlighting the blue BIN_DDABal bar reveals Node ID 1, with 24,770 observations, the majority (73%) again non-purchasers; the pop-up also shows that the Saving Balance split occurred at a value of approximately $1,550. The yellow Saving Balance bar indicates that a second split occurs on the saving balance. This bar is Node ID 2, holding the remaining 7,494 observations, and here the majority (61%) are purchasers.

The blue bars make it easy to identify the majority non-event nodes (non-purchasers), while the yellow bars mark the majority event nodes (purchasers). We can continue working our way down the icicle plot, validating the same information we found in the Decision Tree. In fact, the two objects are linked: highlight a node in the decision tree, and the same node is highlighted in the icicle plot. The icicle plot is an interesting visualization that lets us view the decision tree results from a slightly different angle.
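If you'd rather reproduce a comparable tree in code, PROC TREESPLIT is the procedural route to decision trees in SAS Viya. The sketch below is a minimal example, not the exact model behind the screenshots: only BIN_DDABal appears in the article, SavBal is my assumed column name for the Saving Balance input, and in practice you would list the full set of demographic and product-usage inputs.

```sas
/* Fit a classification tree comparable to the one built in      */
/* SAS Visual Statistics. SavBal is an assumed name for the      */
/* "Saving Balance" column; extend the MODEL statement with the  */
/* rest of your inputs.                                          */
proc treesplit data=mycas.develop_final;
   class Ins BIN_DDABal;          /* categorical target and binned input  */
   model Ins = SavBal BIN_DDABal; /* interval and categorical inputs      */
   grow igr;                      /* information gain ratio criterion     */
   prune costcomplexity;          /* cost-complexity pruning              */
run;
```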
Variable Importance
Variable importance and variable importance plots can be very useful when trying to identify which variables have the greatest impact on a model's predictions. With models in which an input can be used in multiple splits, such as decision trees, or with ensemble models such as random forests and gradient boosting machines, it can be difficult to decipher which inputs are most useful overall. Variable importance identifies the most influential features, and importance plots provide that ranking at a glance.
Examining the variable importance plot for our insurance data helps validate our understanding of the decision tree. Ranked at the top of the plot is Saving Balance. Since the first split in our tree is also based on Saving Balance, it is no surprise that it ranks at the top. Keep in mind that this is not always true; the most important variables are not necessarily the ones near the top of the tree. In SAS Visual Statistics, the measure is an RSS-based variable importance. In other words, each split contributes the change in the residual sum of squares it achieves (the parent node's RSS minus the combined RSS of its children), and those contributions are accumulated across every split that uses the variable. For our insurance data, we see a couple of "really important" inputs, followed by a few that are less important, followed by inputs that don't make it into the model at all. If you examine the details table, you will discover that inputs not included in the model have an importance value of 0.
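To work with the importance values as data rather than a plot, you can capture the procedure's importance table with ODS OUTPUT. A minimal sketch follows; note that the ODS table name VariableImportance is my assumption for PROC TREESPLIT, so run ods trace on; first to confirm the name in your release.

```sas
/* Capture the variable importance table as a SAS dataset.       */
/* "VariableImportance" is an assumed ODS table name -- confirm  */
/* it with "ods trace on;" before relying on it.                 */
ods output VariableImportance=work.varimp;

proc treesplit data=mycas.develop_final;
   class Ins BIN_DDABal;
   model Ins = SavBal BIN_DDABal;
run;

/* Inputs that never split a node should show an importance of 0 */
proc print data=work.varimp;
run;
```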
Leaf Statistics
The final pieces of output that we will examine for decision trees involve frequency counts and percentages for the leaf nodes. These statistics provide insight into how the data is distributed across the tree's terminal nodes.
Of course, frequency counts represent the number of observations that end up in a particular leaf. With classification trees, they show how many instances of each class fall within a leaf. With regression trees, they indicate the number of observations that contribute to the predicted value. When examining the count plot for our insurance data, it is clear that Node ID 9 contains the largest number of customers. The longer blue bar in that node indicates that the majority of those customers are non-purchasers (the non-event). Count values are available by mousing over the individual bars or by opening the details table.
Frequency count statistics in decision trees help assess the reliability of predictions: leaves with higher counts tend to produce more stable and generalizable outputs. The counts also assist in detecting issues like class imbalance. In a world where building fair and unbiased models is important, examining frequency counts helps ensure that decisions are not dominated by underrepresented classes. Finally, frequency counts can flag weak nodes or overfit trees, which can be addressed with pruning.
Percentages (also known as the proportion of samples in a leaf) represent the fraction of the total observations that fall into a given leaf node. With classification trees, they indicate the proportion of each class (event or non-event) in a leaf, which makes class dominance easy to identify. With regression trees, percentages help assess how much of the data influences a particular prediction. The percentage plot for our insurance data reveals that three nodes (6, 18, and 30) are the "purest" nodes in the table, with Node ID 6 having the highest percentage (73%) of purchasers. Percentage values are available by mousing over the individual bars or by opening the details table.
Percentage statistics in decision trees are useful for many of the same reasons as frequency counts: they help assess the confidence and reliability of predictions, they aid in detecting class imbalance, and they assist with pruning and model evaluation.
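Both the counts and the within-leaf percentages can be reproduced programmatically by scoring the data and tabulating leaf membership against the target. In this sketch, I'm assuming the scored output contains a leaf identifier column named _Leaf_ (as the related PROC HPSPLIT produces); check the scored table's columns if the name differs in your release.

```sas
/* Score the training data, keeping the target alongside the     */
/* predictions so leaf membership can be cross-tabulated.        */
proc treesplit data=mycas.develop_final;
   class Ins BIN_DDABal;
   model Ins = SavBal BIN_DDABal;
   output out=mycas.scored copyvars=(Ins);
run;

/* Cell counts give each leaf's frequency counts; row            */
/* percentages give the event/non-event proportions per leaf.    */
/* _Leaf_ is an assumed column name -- verify it in mycas.scored.*/
proc freq data=mycas.scored;
   tables _Leaf_ * Ins / nocol nopct;
run;
```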
Thanks for joining me in discussing the three remaining pieces of output that are available when building decision trees in SAS Visual Statistics. This completes our examination of building and interpreting a decision tree model in SAS Viya. In my next series of posts, we'll examine ensemble models that can be built with SAS Visual Statistics. We'll use the same annuity data to keep things consistent. If you are ready to learn more about decision trees, I can suggest the following two courses: SAS® Visual Statistics in SAS® Viya®: Interactive Model Building and Tree-Based Machine Learning Methods in SAS® Viya®. See you next time and never stop learning!
Find more articles from SAS Global Enablement and Learning here.