How do CART models work

Apr 22, 2020 · Ruslan Klymentiev · 16 min read
```r
# r import
library(tidyverse)
library(knitr)
library(kableExtra)
library(Rgraphviz)
library(rpart)
library(rattle)
library(reticulate)
```
```python
# python import
import pandas as pd
from sklearn import tree
import matplotlib.pyplot as plt
```

What is CART?

Classification and Regression Trees (or CART for short) is a type of supervised machine learning algorithm [1] that is mostly used for classification problems [2], but can also be applied to regression problems [3]. It can handle both categorical and numerical input variables.

In this tutorial, we are going to build a tree step by step for a classification problem. A way to handle regression problems is discussed at the end.

Imagine that we have medical data about patients, and two drugs (Drug A and Drug B) that work differently for each patient depending on their health metrics. We have such data for 15 patients. Now the 16th patient comes in, and we want to assign one of the drugs to them based on this information. You can load the sample data set here.

```r
patients_df <- read_csv("DecisionTree_SampleData.csv")
kable(patients_df, format = "markdown", caption = "<b>Patients Data</b>")
```
| Patient ID | Gender | BMI        | BP         | Drug   |
|------------|--------|------------|------------|--------|
| p01        | M      | not normal | not normal | Drug A |
| p02        | M      | not normal | normal     | Drug B |
| p03        | M      | not normal | not normal | Drug A |
| p04        | M      | normal     | not normal | Drug A |
| p05        | M      | not normal | not normal | Drug A |
| p06        | F      | not normal | normal     | Drug B |
| p07        | F      | normal     | not normal | Drug A |
| p08        | F      | not normal | normal     | Drug B |
| p09        | M      | not normal | not normal | Drug A |
| p10        | M      | normal     | not normal | Drug A |
| p11        | F      | not normal | not normal | Drug B |
| p12        | F      | normal     | normal     | Drug B |
| p13        | F      | normal     | normal     | Drug B |
| p14        | M      | normal     | not normal | Drug B |
| p15        | F      | not normal | normal     | Drug B |
| p16        | M      | not normal | not normal | ???    |

Table 1: Patients Data

Our decision tree is going to look something like this:

In case you are wondering why it is called “a tree”, you have to look at it upside down:

Imagine that all our data is located at the root node of the tree. We want to split the data by the most significant features (creating new decision nodes) until we reach leaf nodes with the most homogeneous target possible (ideally, just one class present in each leaf node).

The question is: how do we split the data at each node?

Information Gain and Entropy

The main metrics used for building a tree in classification problems are Information Gain (IG) and Gini impurity. We are going to use IG for this example; Gini impurity is discussed at the end.

IG is based on the concepts of entropy [4] and information content from information theory [5]. First of all, let’s build an intuition behind entropy. Simply put, entropy is a measure of chaos or uncertainty in the data.

Imagine that I asked you to play the following simple game: I have two coins, one coin is unfair $\left( P(\text{Heads}) = \frac{4}{5} \right)$, the other one is fair $\left( P(\text{Heads}) = \frac{1}{2} \right)$. Your task is to pick a coin, make a prediction of an outcome (either Heads or Tails) and flip it. If your prediction is correct, I will give you $1. Which coin would you choose?

Well, I hope that you would choose the unfair coin. In this case, you are more certain about the outcome of the coin flip (on average it will come up Heads $80\%$ of the time). For the fair coin, you have the highest level of uncertainty: Heads and Tails occur equally likely. The probability mass functions of the two coins are:

$$p_{\text{unfair}} = \begin{cases} \frac{4}{5} & \text{Heads}\\ \frac{1}{5} & \text{Tails}\\ \end{cases}$$

$$p_{\text{fair}} = \begin{cases} \frac{1}{2} & \text{Heads}\\ \frac{1}{2} & \text{Tails}\\ \end{cases}$$

We can calculate the entropy using the formula:

$$H(X) = - \sum_{i=1}^{n} p(x_i) \cdot log_2 \big( p(x_i) \big)$$
  • $n$ is the number of all possible outcomes for random variable $X$;
  • $p(x_i)$ is the probability of the $i$-th outcome;
  • $log_2$ is the logarithm with base 2 ($log_2 2 = 1$, $log_2 4 = 2$, etc.).

Entropy for the unfair coin:

$$H(\text{unfair}) = - \left( \frac{4}{5} \cdot log_2 \frac{4}{5} + \frac{1}{5} \cdot log_2 \frac{1}{5} \right) \\ \approx 0.72$$

Entropy for the fair coin:

$$H(\text{fair}) = - \left( \frac{1}{2} \cdot log_2 \frac{1}{2} + \frac{1}{2} \cdot log_2 \frac{1}{2} \right) \\ = 1$$
```r
entropy <- function(p) {
  return(-sum(p * log(p, base = 2)))
}

p_unfair <- c(4/5, 1/5)
p_fair <- c(1/2, 1/2)

entropy(p_unfair)
## [1] 0.7219281
entropy(p_fair)
## [1] 1
```

As you can see, the entropy of the fair coin is higher, meaning that the level of uncertainty is also higher. Note that entropy cannot be negative, and its minimum value is $0$, which corresponds to the highest level of certainty. For example, imagine an unfair coin with $P(\text{Heads}) = 1$ and $P(\text{Tails}) = 0$. You are sure that the coin will come up Heads, so the entropy is equal to zero.
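We can check this with the entropy() function defined above. The zero-probability outcome is dropped from the sum, since $0 \cdot log_2 0$ is treated as $0$ by convention (passing an explicit zero to the naive formula would return NaN):

```r
# a coin that always comes up Heads: no uncertainty at all
entropy(1)
## [1] 0
```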

Coming back to our decision tree problem, we want to split the data into nodes that reduce the level of uncertainty. The basic algorithm looks as follows (a small code sketch is given right after the list):

  1. For each node:
  • Choose a feature from the data set.
  • Calculate the significance (e.g., IG) of that feature for splitting the data.
  • Repeat for each feature.
  2. Split the data by the feature that is the most significant splitter (e.g., highest IG).
  3. Repeat until there are no features left.
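To make these steps concrete (and to double-check the hand calculations later), here is a minimal R sketch of the information gain computation. It reuses the entropy() function defined above; the information_gain() helper is my own naming, not a function from any package:

```r
# information gain of splitting `data` by `feature` for a categorical `target` column
information_gain <- function(data, feature, target) {
  # entropy of the target before the split
  p_before <- table(data[[target]]) / nrow(data)
  h_before <- entropy(p_before)

  # weighted entropy after splitting by each label of the feature
  h_after <- 0
  for (label in unique(data[[feature]])) {
    subset_df <- data[data[[feature]] == label, ]
    p_label   <- table(subset_df[[target]]) / nrow(subset_df)
    weight    <- nrow(subset_df) / nrow(data)
    h_after   <- h_after + weight * entropy(p_label)
  }

  h_before - h_after
}
```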

Growing a Tree

We are going to split the data into training (first 15 patients) and test (16th patient) sets:

```r
train_df <- patients_df[-16, ]
test_df <- patients_df[16, ]
```

Root Node

The initial entropy of the distribution of the target variable (Drug) is:

$$H(X) = -\left( p(A) \cdot log_2 p(A) + p(B) \cdot log_2 p(B) \right)$$
  • $p(A)$ - probability of Drug A. Since there are 7 observations out of 15 with Drug A, $p(A) = \frac{7}{15} \approx 0.47$
  • $p(B)$ - probability of Drug B. Since there are 8 observations out of 15 with Drug B, $p(B) = \frac{8}{15} \approx 0.53$
$$H(X) = -\left(0.47 \cdot log_2 0.47 + 0.53 \cdot log_2 0.53 \right) \\ \approx 0.997$$
```r
temp_df <- train_df %>% 
  group_by(Drug) %>% 
  summarise(n = n(),
            p = round(n() / dim(train_df)[1], 2))

kable(temp_df, format = "markdown")
```

| Drug   | n | p    |
|--------|---|------|
| Drug A | 7 | 0.47 |
| Drug B | 8 | 0.53 |

```r
entropy(temp_df$p)
## [1] 0.9974016
```

For the next step, we are going to split the data by each feature (Gender, BMI, BP), calculate the entropy, and check which split variant gives the highest IG.

Gender

Let’s start with Gender. We have two labels in the data set: M and F. After splitting the data we end up with the following proportions:

| Gender Label | Drug A | Drug B |
|--------------|--------|--------|
| M            | 6      | 2      |
| F            | 1      | 6      |

That is, among the 15 patients, 6 men were assigned Drug A and 2 men were assigned Drug B, whereas 1 woman was assigned Drug A and 6 women were assigned Drug B.

Calculating the entropy for each gender label:

| Gender Label | Drug A | Drug B | P(A) | P(B) | Entropy |
|--------------|--------|--------|------|------|---------|
| M            | 6      | 2      | $\frac{6}{8}$ | $\frac{2}{8}$ | 0.811 |
| F            | 1      | 6      | $\frac{1}{7}$ | $\frac{6}{7}$ | 0.592 |

To measure the uncertainty reduction after the split we are going to use IG:

$$ \scriptsize \text{IG} = \text{Entropy before the split} - \text{Weighted entropy after the split}$$
  • Entropy before the split is the entropy at the root in this case
  • Weighted entropy is defined as the entropy multiplied by the label proportion. For example, the weighted entropy for male patients equals:

$$ \text{Weighted Entropy (M)} = \frac{\text{\# Male patients}}{\text{\# All patients}} \cdot H(M) $$

$$ \text{Weighted Entropy (M)} = \frac{8}{15} \cdot 0.811 = 0.433$$
| Gender Label | Drug A | Drug B | P(A) | P(B) | Entropy | Weight | Weighted Entropy |
|--------------|--------|--------|------|------|---------|--------|------------------|
| M            | 6      | 2      | $\frac{6}{8}$ | $\frac{2}{8}$ | 0.811 | $\frac{8}{15}$ | 0.433 |
| F            | 1      | 6      | $\frac{1}{7}$ | $\frac{6}{7}$ | 0.592 | $\frac{7}{15}$ | 0.276 |

Now we can calculate the IG:

$$\text{IG} = 0.997 - (0.433 + 0.276) = 0.288$$

BMI

Repeat the previous steps for the BMI feature. We have two labels: normal and not normal.

| BMI Label  | Drug A | Drug B | P(A) | P(B) | Entropy | Weight | Weighted Entropy |
|------------|--------|--------|------|------|---------|--------|------------------|
| normal     | 3      | 3      | $\frac{3}{6}$ | $\frac{3}{6}$ | 1     | $\frac{6}{15}$ | 0.4   |
| not normal | 4      | 5      | $\frac{4}{9}$ | $\frac{5}{9}$ | 0.991 | $\frac{9}{15}$ | 0.595 |
$$\text{IG} = 0.997 - (0.4 + 0.595) = 0.002$$

Blood Pressure

And the same for the BP column (the labels are the same: normal and not normal):

| BP Label   | Drug A | Drug B | P(A) | P(B) | Entropy | Weight | Weighted Entropy |
|------------|--------|--------|------|------|---------|--------|------------------|
| normal     | 0      | 6      | $\frac{0}{6}$ | $\frac{6}{6}$ | 0     | $\frac{6}{15}$ | 0     |
| not normal | 7      | 2      | $\frac{7}{9}$ | $\frac{2}{9}$ | 0.764 | $\frac{9}{15}$ | 0.459 |
$$\text{IG} = 0.997 - (0 + 0.459) = 0.538$$
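As a sanity check, the information_gain() helper sketched earlier reproduces these root-node numbers (up to rounding, since the hand calculations use rounded intermediate values):

```r
# information gain for each candidate feature at the root node
sapply(c("Gender", "BMI", "BP"),
       function(f) information_gain(train_df, f, "Drug"))
## roughly: Gender 0.288, BMI 0.002, BP 0.538
```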

Node Summary

As we can see, BP gives the highest IG. In other words, if we split the data by this column, we will be the most certain about the target variable distribution. So our first split will look like this:

BP:normal Node

We continue growing the tree down from the BP:normal node. We filter the observations that have normal blood pressure:

```r
train_df %>% 
  filter(BP == "normal") %>%
  select(-BP) %>% 
  kable(format = "markdown")
```
| Patient ID | Gender | BMI        | Drug   |
|------------|--------|------------|--------|
| p02        | M      | not normal | Drug B |
| p06        | F      | not normal | Drug B |
| p08        | F      | not normal | Drug B |
| p12        | F      | normal     | Drug B |
| p13        | F      | normal     | Drug B |
| p15        | F      | not normal | Drug B |

As we can see, there is no Drug A in this node and the entropy is already 0. This node is going to be the last one for this branch, with Drug B as the output with a probability of 1. We can also read this as: for all subjects with normal blood pressure, Drug B worked best.

BP:not normal Node

Filter all the observations that have not normal blood pressure:

```r
train_df %>% 
  filter(BP == "not normal") %>% 
  select(-BP) %>% 
  kable(format = "markdown")
```
| Patient ID | Gender | BMI        | Drug   |
|------------|--------|------------|--------|
| p01        | M      | not normal | Drug A |
| p03        | M      | not normal | Drug A |
| p04        | M      | normal     | Drug A |
| p05        | M      | not normal | Drug A |
| p07        | F      | normal     | Drug A |
| p09        | M      | not normal | Drug A |
| p10        | M      | normal     | Drug A |
| p11        | F      | not normal | Drug B |
| p14        | M      | normal     | Drug B |

Now we treat this subset as the new “initial set”. The entropy of the target distribution at this node is $0.764$ (as we have already calculated). At this node we have two features left: Gender and BMI. We perform the same procedure to find the best splitting feature.

Gender

| Gender Label | Drug A | Drug B | P(A) | P(B) | Entropy | Weight | Weighted Entropy |
|--------------|--------|--------|------|------|---------|--------|------------------|
| M            | 6      | 1      | $\frac{6}{7}$ | $\frac{1}{7}$ | 0.592 | $\frac{7}{9}$ | 0.46  |
| F            | 1      | 1      | $\frac{1}{2}$ | $\frac{1}{2}$ | 1     | $\frac{2}{9}$ | 0.222 |
$$\text{IG} = 0.764 - (0.46 + 0.222) = 0.082$$

BMI

| BMI Label  | Drug A | Drug B | P(A) | P(B) | Entropy | Weight | Weighted Entropy |
|------------|--------|--------|------|------|---------|--------|------------------|
| normal     | 4      | 1      | $\frac{4}{5}$ | $\frac{1}{5}$ | 0.722 | $\frac{5}{9}$ | 0.401 |
| not normal | 3      | 1      | $\frac{3}{4}$ | $\frac{1}{4}$ | 0.811 | $\frac{4}{9}$ | 0.361 |
$$\text{IG} = 0.764 - (0.401 + 0.361) = 0.002$$
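The same information_gain() helper sketched earlier can be reused on the BP:not normal subset to confirm these values:

```r
# information gain within the BP:"not normal" node
bp_not_normal <- train_df %>% filter(BP == "not normal")

sapply(c("Gender", "BMI"),
       function(f) information_gain(bp_not_normal, f, "Drug"))
## roughly: Gender 0.082, BMI 0.003 (the hand calculation above rounds to 0.002)
```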

Node Summary

Gender gives the highest IG after splitting the data. Now the tree will look like this:

Gender:M Node

Now we filter the set so that we only have observations with BP:not normal and Gender:M:

```r
train_df %>% 
  filter(BP == "not normal" & Gender == "M") %>% 
  select(-c(BP, Gender)) %>% 
  kable(format = "markdown")
```
| Patient ID | BMI        | Drug   |
|------------|------------|--------|
| p01        | not normal | Drug A |
| p03        | not normal | Drug A |
| p04        | normal     | Drug A |
| p05        | not normal | Drug A |
| p09        | not normal | Drug A |
| p10        | normal     | Drug A |
| p14        | normal     | Drug B |

There is just one feature left: BMI. Since we have no other features, we are going to “grow” the last nodes with the BMI:normal and BMI:not normal labels:

| BMI Label  | Drug A | Drug B | P(A) | P(B) |
|------------|--------|--------|------|------|
| normal     | 2      | 1      | $\frac{2}{3}$ | $\frac{1}{3}$ |
| not normal | 4      | 0      | $\frac{4}{4}$ | $\frac{0}{4}$ |

For the BMI:normal node the predicted class is going to be Drug A, since its probability is higher ($\frac{2}{3}$ vs $\frac{1}{3}$). The same applies to the BMI:not normal node, where Drug A has a probability of $1$.

Gender:F Node

Now we filter the set so that we only have observations with BP:not normal and Gender:F:

```r
train_df %>% 
  filter(BP == "not normal" & Gender == "F") %>% 
  select(-c(BP, Gender)) %>% 
  kable(format = "markdown") 
```
| Patient ID | BMI        | Drug   |
|------------|------------|--------|
| p07        | normal     | Drug A |
| p11        | not normal | Drug B |

We are left with just one observation in each node. The predicted class for the BMI:normal node is going to be Drug A ($P = 1$), and the predicted class for the BMI:not normal node is going to be Drug B ($P = 1$).

Final Tree

Given all the calculations, we can visualize the final tree:

Code

```r
node0 <- "Blood Pressure"
node1 <- "Drug B (P=1.)"
node2 <- "Gender"
node3 <- "BMI"
node4 <- "BMI "
node5 <- "Drug A (P=0.66)"
node6 <- "Drug A (P=1.)"
node7 <- "Drug A (P=1.) "
node8 <- "Drug B (P=1.) "
nodeNames <- c(node0, node1, node2, node3, 
               node4, node5, node6, node7, node8)
        
# create new graph object
rEG <- new("graphNEL", nodes=nodeNames, edgemode="directed")
rEG <- addEdge(node0, node1, rEG)
rEG <- addEdge(node0, node2, rEG)
rEG <- addEdge(node2, node3, rEG)
rEG <- addEdge(node2, node4, rEG)
rEG <- addEdge(node3, node5, rEG)
rEG <- addEdge(node3, node6, rEG)
rEG <- addEdge(node4, node7, rEG)
rEG <- addEdge(node4, node8, rEG)

# plotting attributes for the graph; the original `at` object is not shown
# in the post, so a minimal placeholder is used here
at <- list(node = list(shape = "box", fixedsize = FALSE))

plot(rEG, attrs = at)
text(150, 300, "normal", col="black")
text(260, 300, "not normal", col="black")
text(200, 190, "male", col="black")
text(310, 190, "female", col="black")
text(100, 80, "normal", col="black")
text(230, 80, "not normal", col="black")
text(290, 80, "normal", col="black")
text(420, 80, "not normal", col="black")
```

Prediction

Let’s look at the 16th patient:

```r
test_df %>% 
  kable(format = "markdown") 
```
| Patient ID | Gender | BMI        | BP         | Drug |
|------------|--------|------------|------------|------|
| p16        | M      | not normal | not normal | ???  |

To make a prediction, we just need to move from top to bottom along the branches of our tree.

  1. Check BP. Since BP:not normal, go to the right branch.
  2. Gender:M => left branch.
  3. BMI:not normal => right branch.

The predicted value is going to be Drug A with a probability of $1.0$.
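The finished tree is nothing more than a chain of if/else rules. Here is a minimal hand-coded sketch of it in R (the predict_drug() name is my own; this is not generated by any package):

```r
# hand-coded version of the final tree
predict_drug <- function(bp, gender, bmi) {
  if (bp == "normal") {
    "Drug B"   # P = 1
  } else if (gender == "M") {
    "Drug A"   # P = 0.66 if BMI is normal, P = 1 otherwise
  } else if (bmi == "normal") {
    "Drug A"   # P = 1
  } else {
    "Drug B"   # P = 1
  }
}

predict_drug(bp = "not normal", gender = "M", bmi = "not normal")
## [1] "Drug A"
```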

Let Computer Do the Math

R

We can build a decision tree using the rpart function (from the rpart package) and plot it with the help of the fancyRpartPlot function (from the rattle package). Note that R doesn’t require encoding of the categorical variables, since rpart can handle them by itself.

```r
model_dt <- rpart(
  formula = Drug ~ Gender + BMI + BP,
  data = train_df,
  method = "class",
  minsplit = 1,
  minbucket = 1,
  parms = list(
    split = "information" # for Gini impurity change to split = "gini"
  )
)

# plot decision tree model
fancyRpartPlot(model_dt, caption = NULL, type = 4)
```

Get a prediction for the 16th patient:

```r
# return the class
predict(object = model_dt, newdata = test_df, type = "class")
##      1 
## Drug A 
## Levels: Drug A Drug B

# return the probability of each class
predict(object = model_dt, newdata = test_df, type = "prob")
##      Drug A    Drug B
## 1 0.8571429 0.1428571
```

Note that since rpart didn’t split the Gender:M node by BMI, the probabilities are quite different from what we expected, but this doesn’t affect the predicted class.

Python

Unlike R, Python (scikit-learn) doesn’t accept categorical input variables directly, so we need to encode them:

```python
patients_df = pd.read_csv("DecisionTree_SampleData.csv")
# drop id column
patients_df.drop('Patient ID', axis=1, inplace=True)

# convert to dummy variables
patients_df_labeled = pd.get_dummies(patients_df)
columns_to_drop = ['Gender_F', 'BMI_not normal', 'BP_not normal', 'Drug_???', 'Drug_Drug B']
patients_df_labeled.drop(
    columns_to_drop, 
    axis=1, 
    inplace=True)
unknown_drug = patients_df_labeled.iloc[15, :]
patients_df_labeled.drop(15, axis=0, inplace=True)  # unknown drug row
patients_df_labeled.head().to_html(classes='table table-striped')
```
|   | Gender_M | BMI_normal | BP_normal | Drug_Drug A |
|---|----------|------------|-----------|-------------|
| 0 | 1        | 0          | 0         | 1           |
| 1 | 1        | 0          | 1         | 0           |
| 2 | 1        | 0          | 0         | 1           |
| 3 | 1        | 1          | 0         | 1           |
| 4 | 1        | 0          | 0         | 1           |

For example, Gender_M = 1 means that the subject is male and Gender_M = 0 means that the subject is female.

To build a decision tree model we can use the DecisionTreeClassifier class from the sklearn.tree module.

```python
# split the data 
X = patients_df_labeled.drop('Drug_Drug A', axis=1)
y = patients_df_labeled['Drug_Drug A']

# fit the decision tree model
model = tree.DecisionTreeClassifier(
    criterion='entropy'  # for Gini impurity change to criterion="gini"
)
model.fit(X, y)
```
```python
# plot the tree
plt.figure(figsize=(10, 10))
tree.plot_tree(
    model, filled=True, rounded=True,
    feature_names=X.columns.tolist(), proportion=False,
    class_names=['Drug B', 'Drug A']
)
plt.show()
```

Get a prediction for the 16th patient:

```python
unknown_drug = pd.DataFrame(unknown_drug).T
unknown_drug.drop(['Drug_Drug A'], axis=1, inplace=True)

# return the class
model.predict(unknown_drug)
## array([1], dtype=uint8)
```

```python
# return the probability of each class
model.predict_proba(unknown_drug)
## array([[0., 1.]])
```

Numerical Explanatory Variables

In our sample data set, we had only categorical variables (with 2 labels each). But what would the model do if we had a numerical input value?

Let’s look again at the Python decision tree and its decision node conditions, for example Gender_M <= 0.5. Where does 0.5 come from? The answer overlaps with how the model handles numerical input values.

Recall the Gender_M column we created for the Python training set. It consisted of two numerical values (0 and 1). What the model does is split the data at the average of these two numbers: $\frac{0+1}{2}=0.5$. Now the model has two sets of data: one that matches the condition Gender_M <= 0.5 (the gender is female) and one that doesn’t (the gender is male).

Imagine that we have added a third category Non-binary to the Gender feature. First, we would encode the values and create a new column Gender_encoded, which would have three numerical values (0, 1 and 2). Say, Non-binary = 0, Male = 1, Female = 2. Now the model would have two candidate conditions for filtering the data:

  • Gender_encoded <= 0.5 (Gender is non-binary)
  • Gender_encoded <= 1.5 (Gender is non-binary OR male)

More generally, if we have a numerical feature, the algorithm converts it into a “categorical” one by taking the averages of consecutive input values as candidate thresholds:

  • filter 1: feature <= threshold
  • filter 2: feature > threshold

For example:

| id  | blood_pressure | drug |
|-----|----------------|------|
| p01 | 100            | A    |
| p02 | 120            | B    |
| p03 | 150            | B    |
| p04 | 200            | B    |

Threshold values are: 110 $\left( \frac{100+120}{2} \right)$, 135 $\left( \frac{120+150}{2} \right)$, 175 $\left( \frac{150+200}{2} \right)$.
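A minimal sketch of how such candidate thresholds can be computed (this illustrates the idea, not the exact implementation inside rpart or scikit-learn):

```r
blood_pressure <- c(100, 120, 150, 200)

# candidate thresholds are the midpoints between consecutive sorted unique values
values <- sort(unique(blood_pressure))
thresholds <- (head(values, -1) + tail(values, -1)) / 2
thresholds
## [1] 110 135 175
```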

Other Splitting Metrics

Gini Impurity

As mentioned before, for classification problems one can also use Gini impurity to find the best splitting feature. The idea stays the same; the only difference is that instead of the entropy formula we use the Gini formula:

$$G(X) = 1 - \sum_{i=1}^{n} p(x_i)^2$$

After calculating the Gini impurity value for each feature, we would calculate the gain and choose the feature that provides the highest gain.

$$\scriptsize \text{Gain} = \text{Gini impurity before the split} - \text{Weighted Gini impurity after the split}$$
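For reference, a Gini impurity helper analogous to the entropy() function above (my own sketch, not a package function):

```r
gini <- function(p) {
  1 - sum(p^2)
}

# Gini impurity of the target variable at the root node (7 Drug A, 8 Drug B)
gini(c(7/15, 8/15))
## [1] 0.4977778
```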

Variance Reduction

If we are dealing with a regression problem (predicting a numerical variable rather than a class), we use variance reduction. Instead of calculating the probabilities of the target variable, we calculate its variance:

$$Var(x) = \frac{\sum_{i=1}^n(x_i- \bar x)^2}{n-1}$$

In the same way, we look for the variable that gives the highest variance reduction; in other words, the variance of the target variable becomes lower after the split.

$$\scriptsize \text{Variance Reduction} = \text{Variance before the split} - \text{Weighted variance after the split}$$
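A small sketch of the variance reduction computation for a hypothetical numeric target (the numbers are made up, just to illustrate the formula):

```r
# a numeric target split into left / right child nodes
y       <- c(10, 12, 30, 35, 40)
y_left  <- c(10, 12)
y_right <- c(30, 35, 40)

weighted_var_after <- length(y_left) / length(y) * var(y_left) +
                      length(y_right) / length(y) * var(y_right)

var(y) - weighted_var_after  # variance reduction after this split
## [1] 170
```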

Summary

We have looked at an example of a CART model for a binary classification problem. It is a relatively simple algorithm that doesn’t require much feature engineering. With a small data set and lots of free time, one can even build the model with “pen and paper”. Compared to other ML algorithms, a single CART model can be considered a “weak learner” due to its tendency to overfit. However, trees are often used as base estimators in more complex bagging and boosting models (for example, the Random Forest algorithm). Additionally, CART models can help reveal hidden patterns in the data, since they are easy to visualize.

References