Quick Intro Decision Trees
Decision trees are popular algorithms used for both classification and regression tasks.
The idea of the decision tree is to split the dataset into smaller subsets based on the feature until reaching a node where all data points share a single label. It consists of the following types of Nodes:
- Root Node- This represents the starting point with the entire dataset; the root node acts as the parent node to all subsequent nodes within the tree.
- Terminal Nodes- These represent the bottommost nodes, also known as leaf nodes.
- Splitting Criterion- This is the algorithm/logic used to select the best feature to split the data at each node. E.g., Gini, Entropy.
- Decision Nodes- These nodes are split into multiple sub-nodes based on some splitting criterion, also called internal nodes.
CHAID Trees
CHAID trees were proposed by Gordon Kass in around 1980. As the name suggests, the basic criterion for recursively splitting the independent features is based on Pearson Chi-square statistics.
CHAID Trees work like decision trees, except the splitting criterion is chi-square instead of entropy or information gain. Additionally, it differs in the way it performs a multiway split for splitting the features into multiple decision and leaf nodes.
Algorithm-
- The outcome variable can be continuous or categorical, but the predictor variables are categorical only, creating non-binary trees. That is, they can have more than two splits. A workaround for continuous predictor variables is to bin them before feeding the features into the algorithm.
- The CHAID algorithm involves testing the dependence of two variables at each step. If the dependent variable is categorical, a chi-square test is used to determine the best next split at each node. If the variable is continuous, an F-test is used to determine the next best split.
- CHAID works recursively by dividing the dataset into subsets according to the categorical predictor variable that exhibits the most significant correlation with the response variable. CHAID computes the chi-squared test of independence between the response variable and each categorical predictor at every iteration. The variable showing the strongest association is selected as the split variable, and the data is partitioned into subsets based on its categories. This iterative process continues until the predefined stopping criteria are satisfied.
Why CHAID Trees?
The CHAID algorithm finds extensive application in fields like market research and social sciences, where the analysis of interactions is crucial. Its ability to handle categorical predictors and identify significant interactions makes it a valuable tool in these domains.
- CHAID uses multiway splits by default (multiway splits mean that the current node is split into more than two nodes). On the other hand, the different decision tree does binary splits (each node is split into two child nodes) by default.
- Trees built using CHAID algorithms generally prevent overfitting. They ensure that a node is split only if it meets specific significance criteria, which keeps the model strong and reliable.
Implementation
- Let's read the Titanic dataset from Seaborn, and to speed up the training, we'll keep only categorical features to build the tree quickly.
- We will be using the plot_tree method from sklearn to analyze how our decision tree is built.
- Furthermore, we notice that the tree's depth is 5+, and features are repeated across splits in sub-branches to further divide the data based on different values of a categorical feature (such as the "embark_town" variable in our case). For instance, the "embark_town" variable is split multiple times, both at level 1 and level 2 of the tree. Since it's a binary tree, the depth (complexity) increases as the data's cardinality increases
In certain instances, this may also lead our decision tree to overfit the data, making it challenging to understand which features are more prominent and significant according to business and marketing needs.
To better visualize the importance of features and overcome the above problems, the CHAID tree comes to our rescue. Let's look at how we can construct the tree,
We have used “to_graphviz()“ method to convert the tree in diagraph format.
Text Representation of CHAID tree
Graphical Representation