CHAID decision tree - Explained with algorithm

Data Science

Date : 03/18/2024

Learn more about leveraging CHAID decision trees to explain black-box models, enhance interpretability, and build trust in machine learning predictions.

Vishal Sachan Singh

Associate Manager, Data Science

Leveraging CHAID Decision Trees in Blackbox Models for Enhanced Explanations

Table of contents

Quick Intro Decision Trees
CHAID Trees
Why CHAID Trees?
Implementation
Pros
Cons
Conclusion

Quick Intro Decision Trees

Decision trees are popular algorithms used for both classification and regression tasks.

The idea of a decision tree is to split the dataset recursively into smaller subsets based on feature values, ideally until each subset contains data points that share a single label (or until a stopping rule is met). A tree is made up of the following components:

Root Node- The starting point, which holds the entire dataset and acts as the parent of all subsequent nodes in the tree.
Decision Nodes- Also called internal nodes, these are split into sub-nodes based on a splitting criterion.
Terminal Nodes- Also known as leaf nodes, these are the bottommost nodes, which are not split any further.
Splitting Criterion- The logic used to select the best feature to split the data at each node, e.g., Gini impurity or entropy.
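As a rough illustration of the splitting criteria mentioned above, the sketch below computes Gini impurity and entropy for a list of labels and compares two candidate splits. The function names and the toy labels are made up for this example; they are not taken from any particular library.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def weighted_impurity(groups, impurity):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * impurity(g) for g in groups)


# Toy labels (hypothetical): a parent node and two ways of splitting it.
parent = ["yes", "yes", "yes", "no", "no", "no"]
pure_split = [["yes", "yes", "yes"], ["no", "no", "no"]]
mixed_split = [["yes", "no", "yes"], ["no", "yes", "no"]]

print("parent gini:", gini(parent), "entropy:", entropy(parent))
print("pure split gini:", weighted_impurity(pure_split, gini))
print("mixed split gini:", weighted_impurity(mixed_split, gini))
```

A criterion-based tree prefers the split with the lowest weighted impurity, so here the pure split would be chosen over the mixed one.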

CHAID Trees

CHAID (Chi-squared Automatic Interaction Detection) is a decision tree technique that identifies relationships between variables. It's useful for segmentation, prediction, and enhancing explainability in AI models.

CHAID was proposed by Gordon Kass in 1980. As the name suggests, it chooses splits recursively using the Pearson chi-square statistic.

CHAID trees work like other decision trees, except that the splitting criterion is a chi-square test rather than Gini impurity or entropy (information gain). They also differ in that a node can be split multiway, into more than two child nodes, rather than only into two.

Algorithm-

The outcome variable can be continuous or categorical, but the predictor variables must be categorical. The resulting trees are non-binary, that is, a node can have more than two children. A workaround for continuous predictors is to bin them before feeding them into the algorithm.
At each step, CHAID tests the dependence between the outcome variable and each predictor. If the outcome variable is categorical, a chi-square test is used to determine the best next split at each node; if it is continuous, an F-test is used instead.
CHAID works recursively by dividing the dataset into subsets according to the categorical predictor that exhibits the most significant association with the response variable. At every iteration, CHAID computes the chi-squared test of independence between the response variable and each categorical predictor. The predictor with the strongest association (the smallest p-value) is selected as the split variable, and the data is partitioned into subsets based on its categories. This process repeats within each subset until the predefined stopping criteria are satisfied, for example when no remaining split is statistically significant.
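The steps above can be sketched in plain Python for a categorical outcome. This is a simplified illustration, not a full CHAID implementation: it omits the category-merging and p-value adjustment steps of the original algorithm, and the function names, the `alpha` and `max_depth` parameters, and the toy rows are all assumptions made for this example.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    # Base cases: df=1 uses erfc, df=2 is a plain exponential.
    if df % 2:
        k, q = 1, math.erfc(math.sqrt(x / 2))
    else:
        k, q = 2, math.exp(-x / 2)
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1), z = x/2.
    while k < df:
        a = k / 2
        q += (x / 2) ** a * math.exp(-x / 2) / math.gamma(a + 1)
        k += 2
    return q


def chi2_test(xs, ys):
    """Pearson chi-square test of independence between two categorical lists."""
    n = len(xs)
    row, col, cell = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    stat = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            stat += (cell[(r, c)] - expected) ** 2 / expected
    df = (len(row) - 1) * (len(col) - 1)
    return stat, (chi2_sf(stat, df) if df > 0 else 1.0)


def build_tree(rows, target, predictors, alpha=0.05, depth=0, max_depth=3):
    """Recursively split on the predictor with the smallest p-value."""
    ys = [r[target] for r in rows]
    node = {"n": len(rows), "counts": dict(Counter(ys))}
    if depth >= max_depth or not predictors:
        return node
    # Test every remaining predictor against the response.
    pvals = {p: chi2_test([r[p] for r in rows], ys)[1] for p in predictors}
    best = min(pvals, key=pvals.get)
    if pvals[best] > alpha:  # no significant split: this node becomes a leaf
        return node
    node["split"], node["p_value"] = best, pvals[best]
    rest = [p for p in predictors if p != best]
    # Multiway split: one child per category of the chosen predictor.
    node["children"] = {
        cat: build_tree([r for r in rows if r[best] == cat], target, rest,
                        alpha, depth + 1, max_depth)
        for cat in sorted({r[best] for r in rows})
    }
    return node


# Toy data (hypothetical counts): "pclass" is related to the outcome, "coin" is not.
rows = []
for pclass, survived, count in [("first", 1, 30), ("first", 0, 10),
                                ("second", 1, 20), ("second", 0, 20),
                                ("third", 1, 10), ("third", 0, 30)]:
    rows += [{"pclass": pclass, "coin": "H" if i % 2 else "T",
              "survived": survived} for i in range(count)]

tree = build_tree(rows, "survived", ["pclass", "coin"])
print(tree["split"], tree["p_value"], list(tree["children"]))
```

On this toy data the root splits three ways on "pclass", and each child stops because "coin" shows no significant association with the outcome.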

Why CHAID Trees?

The CHAID algorithm finds extensive application in fields like market research and social sciences, where the analysis of interactions is crucial. Its ability to handle categorical predictors and identify significant interactions makes it a valuable tool in these domains.

CHAID uses multiway splits by default, meaning that a node can be split into more than two child nodes. Other decision tree algorithms, such as CART, use binary splits by default, where each node is split into exactly two child nodes.
Trees built with CHAID tend to be less prone to overfitting, because a node is split only if the split meets a specified significance criterion. This keeps the tree from growing on associations that are likely due to chance.
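To make the multiway versus binary distinction concrete, here is a minimal sketch that partitions rows on one categorical feature in a single multiway split and counts how many binary "feature == value" tests would be needed to give each category its own branch. The helper names and the sample rows are made up for this illustration.

```python
from collections import defaultdict


def multiway_split(rows, feature):
    """One split, one child per category of the feature."""
    children = defaultdict(list)
    for row in rows:
        children[row[feature]].append(row)
    return dict(children)


def binary_tests(rows, feature):
    """Equality tests a binary tree needs to isolate every category.

    With k categories, k - 1 tests are enough: the last category is
    whatever remains after the others have been separated.
    """
    categories = sorted({row[feature] for row in rows})
    return [f"{feature} == {value!r}" for value in categories[:-1]]


# Sample rows (hypothetical) using the three Titanic embarkation towns.
towns = ["Southampton", "Cherbourg", "Queenstown", "Southampton", "Cherbourg"]
rows = [{"embark_town": town} for town in towns]

children = multiway_split(rows, "embark_town")
tests = binary_tests(rows, "embark_town")
print(len(children), "children from one multiway split")
print(len(tests), "binary tests needed:", tests)
```

The gap between the two grows with the number of categories, which is why binary trees on high-cardinality features tend to be deeper.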

Implementation

Let's read the Titanic dataset from Seaborn. To keep training fast, we'll keep only the categorical features when building the tree.

We will use the plot_tree function from sklearn to inspect how our decision tree is built.

In the resulting tree, the depth is more than 5, and the same features are reused across splits in sub-branches to divide the data on different values of a categorical feature. For instance, the "embark_town" variable is split multiple times, at both level 1 and level 2 of the tree. Because the tree is binary, its depth (complexity) grows as the cardinality of the features increases.

In certain instances, this may also lead the decision tree to overfit the data, and it makes it harder to see which features are most prominent and significant for business and marketing needs.

To better visualize the importance of features and overcome the above problems, the CHAID tree comes to our rescue. Let's look at how we can construct the tree.

We have used the "to_graphviz()" method to convert the tree into digraph format.

Text Representation of CHAID tree

Graphical Representation

Figure: Digraph with multiway splits on the root node
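For readers unfamiliar with the digraph format, the sketch below shows one way a multiway tree could be written out as Graphviz DOT text. It is an illustration only and does not reproduce the library method used above; the `to_dot` helper, the nested-dict tree layout, and the node labels and counts are all hypothetical.

```python
def to_dot(tree, name="chaid"):
    """Render a nested-dict tree as Graphviz DOT text."""
    lines = [f"digraph {name} {{", "  node [shape=box];"]
    counter = 0

    def visit(node):
        nonlocal counter
        node_id = f"n{counter}"
        counter += 1
        label = node["label"]
        lines.append(f'  {node_id} [label="{label}"];')
        # One outgoing edge per category, so a node may have many children.
        for edge_label, child in node.get("children", {}).items():
            child_id = visit(child)
            lines.append(f'  {node_id} -> {child_id} [label="{edge_label}"];')
        return node_id

    visit(tree)
    lines.append("}")
    return "\n".join(lines)


# Hypothetical tree: a root that splits three ways on one feature.
tree = {
    "label": "pclass (n=120)",
    "children": {
        "first": {"label": "n=40"},
        "second": {"label": "n=40"},
        "third": {"label": "n=40"},
    },
}

dot = to_dot(tree)
print(dot)
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` command-line tool.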

Comparing the CHAID tree with the decision tree built from the same data shows that the CHAID tree tends to be simpler thanks to its multiway splits, which makes it less susceptible to overfitting. This is particularly apparent when a node has more than two child nodes. Additionally, when a split is insignificant (its p-value is higher than the alpha value), the tree stops splitting at that node.
Furthermore, the features positioned near the top of the CHAID tree are the most important and can significantly help the decision-making process.

Pros

CHAID produces wider trees with multiple branches per node. In contrast, other decision trees are binary, so each split can show only two outcomes.
CHAID produces easily interpretable results, offers insight into the algorithm's decision-making process, and helps in understanding the predictive power of the features. It can therefore be used in market segmentation, brand tracking, etc.

Cons

It may not always yield satisfactory results; other algorithms may outperform it.

Conclusion

The CHAID tree offers a straightforward tool for analyzing features and identifying statistically significant predictors, and the way the tree is constructed is easy to follow.

However, careful consideration should be taken when selecting the algorithm for a specific use case based on the nuances of the dataset, as it may occasionally underperform in specific scenarios. This emphasizes the importance of thoughtful evaluation and comparison with alternative methods.
