Zhejiang Wanli University
This chapter introduces tree-based methods for regression and classification. These methods involve segmenting the predictor space into simpler, more manageable regions. Think of it like creating a flowchart to make decisions!
Note
Tree-based methods are simple and easy to understand (interpretable). However, they often aren’t as accurate as some other machine learning techniques. We’ll explore ways to make them better, like bagging, random forests, and boosting, which combine many trees to improve performance.
Let’s start with the foundational concepts. First up: Data Mining.
Data mining is the process of discovering patterns, anomalies, and insights from large datasets. It takes a multidisciplinary approach, blending techniques from statistics, machine learning, and database systems.
It’s like being a detective 🕵️♀️, sifting through raw data to find valuable knowledge and clues. The goal is to extract useful information, not just any information.
Next, we have Machine Learning.
Machine learning is a subfield of artificial intelligence (AI). It focuses on creating algorithms that allow computers to learn from data without being explicitly programmed. This means we don’t give the computer a set of rules; instead, we give it data and it figures out the rules itself!
Key aspects:
Finally, let’s define Statistical Learning.
Statistical learning refers to a set of tools for modeling and understanding complex datasets. It’s a relatively new field within statistics that intersects with advances in computer science, particularly machine learning. It emphasizes both building predictive models and gaining insights into the underlying data-generating process.
Key aspects:
The following diagram illustrates how these concepts relate to one another:
```mermaid
graph LR
    A[Data Mining] --> C(Common Ground)
    B[Machine Learning] --> C
    D[Statistical Learning] --> C
    C --> E[Insights & Predictions]
```
Tip
Data mining, machine learning, and statistical learning all overlap. They are all about extracting insights and making predictions from data, but with different emphases and approaches. Think of them as different tools in the same toolbox 🧰, each with its own strengths.
Decision trees are a fundamental tree-based method. They work by dividing the predictor space (the space of all possible input values) using a series of splitting rules. These rules are organized in a hierarchical tree structure, making them easy to visualize and understand. They recursively partition the data based on the values of the predictor variables.
Key Features:
Let’s illustrate with a regression tree example. We’ll use the `Hitters` dataset (commonly used in statistical learning examples) to predict a baseball player’s salary. This dataset contains information about Major League Baseball players.

- `Years`: Number of years the player has played in the major leagues.
- `Hits`: Number of hits the player made in the previous year.
- Log Salary: We use the logarithm of salary to make the distribution more bell-shaped (normal), which often improves model performance. This transformation helps to stabilize the variance and make the data more suitable for modeling.

This image shows a regression tree built from the Hitters data:
Regression tree for Hitters data
Note
This tree predicts the logarithm of a player’s salary based on their `Years` of experience and `Hits` in the previous year. The numbers at the bottom (in the “leaves”) are the average log(Salary) for players who fall into that specific category. These values represent the predicted log salary for players within each region.
Let’s break down how to interpret this tree. The most important split is at the very top:
- Root split: `Years < 4.5`. Players with fewer than 4.5 years of experience go to the left branch; everyone else goes to the right.
- Second split: `Hits < 117.5`. This split divides players with at least 4.5 years of experience based on their number of hits. Each internal node represents a decision point based on a predictor variable.
- Terminal nodes (leaves): These are the final boxes at the bottom of the tree. They represent the predicted values – in this case, the mean log(Salary) for players who fall into that specific region of the predictor space. These are the final prediction points of the tree.
Example: the leftmost leaf means that if a player has played fewer than 4.5 years, his predicted log salary is 5.11.
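To see how such a tree could be produced in practice, here is a minimal Python sketch using scikit-learn. The file name `Hitters.csv` and the column names `Years`, `Hits`, and `Salary` are assumptions about how the data is stored locally, not part of the original example.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_text

# Assumed file and column names; adjust to your copy of the Hitters data.
hitters = pd.read_csv("Hitters.csv").dropna(subset=["Salary"])

X = hitters[["Years", "Hits"]]
y = np.log(hitters["Salary"])          # model log(Salary), as in the example

# Limit the tree to three leaves so it resembles the small tree shown above.
tree = DecisionTreeRegressor(max_leaf_nodes=3, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["Years", "Hits"]))
```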
The decision tree divides the predictor space (the combination of `Years` and `Hits`) into rectangular regions. This image shows how the tree’s splits correspond to these regions:
Three-region partition
Note
The tree divides the space into three rectangular regions (R1, R2, R3). Each region corresponds to a leaf (terminal node) in the decision tree. Each rectangle represents a group of players with similar predicted salaries.
- R1: `Years < 4.5`. This region corresponds to the leftmost leaf in the tree.
- R2: `Years >= 4.5` and `Hits < 117.5`. This region corresponds to the middle leaf.
- R3: `Years >= 4.5` and `Hits >= 117.5`. This region corresponds to the rightmost leaf.

Here’s a summary of how to build a regression tree:
Divide Predictor Space: Split the predictor space into J distinct, non-overlapping regions (R1, R2, …, RJ). These regions should not overlap, and each observation should belong to exactly one region. The goal is to find regions that minimize the prediction error.
Prediction: For any observation that falls into region Rj, we predict the response value to be the mean of the response values of all the training observations that also fall into Rj. This is the average of the response values for all observations within that region.
The key question is: how do we decide where to make the splits and create these regions? The goal is to find the regions that minimize the Residual Sum of Squares (RSS). RSS is a measure of the overall prediction error of the tree.
The RSS formula is:
\[ \text{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2 \]
Where:

- \(y_i\) is the response value for the \(i\)-th training observation in region \(R_j\).
- \(\hat{y}_{R_j}\) is the mean response of the training observations in region \(R_j\).
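As a quick sanity check on this formula, the sketch below computes the RSS for an arbitrary assignment of training observations to regions. The toy response values and region labels are made up for illustration.

```python
import numpy as np

def partition_rss(y, region_ids):
    """RSS for a partition: sum over regions of squared deviations
    from each region's mean response."""
    y = np.asarray(y, dtype=float)
    region_ids = np.asarray(region_ids)
    rss = 0.0
    for r in np.unique(region_ids):
        y_r = y[region_ids == r]
        rss += np.sum((y_r - y_r.mean()) ** 2)
    return rss

# Toy example: 6 responses assigned to 2 regions.
y = [5.0, 5.2, 4.9, 6.8, 7.1, 6.9]
regions = [1, 1, 1, 2, 2, 2]
print(partition_rss(y, regions))
```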
Tip
Finding the absolute best set of regions to minimize RSS is computationally very expensive (infeasible in most cases). It would require checking every possible partition of the feature space. Therefore, we use a top-down, greedy approach called recursive binary splitting.
Recursive binary splitting is a clever way to build the tree efficiently. It’s a heuristic algorithm that finds a good solution, though not necessarily the optimal one. It has these key characteristics:

- Top-down: it begins at the top of the tree, where all observations belong to a single region, and successively splits the predictor space.
- Greedy: at each step it chooses the best split at that particular step, rather than looking ahead for a split that might lead to a better tree later on.
Here’s how recursive binary splitting works step-by-step:
Split: For each predictor \(X_j\) and each possible cutpoint \(s\), consider the pair of half-planes \(R_1(j,s) = \{X \mid X_j < s\}\) and \(R_2(j,s) = \{X \mid X_j \ge s\}\).

Minimize: We find the specific values of j (the predictor index) and s (the cutpoint) that minimize the following expression:
\[ \sum_{i: x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2 \]
This is just the sum of the RSS for the two new regions. The goal is to find the predictor and cutpoint that result in the lowest combined RSS for the two resulting regions.
Repeat: We take one of the newly created regions and repeat steps 1 and 2. We continue this process, splitting regions into smaller and smaller sub-regions, until a stopping criterion is met.
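The core of each iteration is the search for the single best (predictor, cutpoint) pair. The sketch below shows one simplified way that search could be written for a numeric feature matrix; it illustrates the idea and is not the implementation used by any particular library.

```python
import numpy as np

def best_split(X, y):
    """Greedy search over predictors j and cutpoints s for the split
    minimizing RSS(R1) + RSS(R2). Returns (j, s, combined_rss)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    best = (None, None, np.inf)
    for j in range(X.shape[1]):
        # Candidate cutpoints: midpoints between consecutive sorted values.
        values = np.unique(X[:, j])
        for s in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, j] < s], y[X[:, j] >= s]
            rss = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if rss < best[2]:
                best = (j, s, rss)
    return best

# Toy data: the second predictor carries the signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 1] > 0, 5.0, 2.0) + rng.normal(scale=0.1, size=50)
print(best_split(X, y))
```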
This figure illustrates the process of recursive binary splitting:
Recursive binary splitting
The recursive binary splitting process, if allowed to continue unchecked, will often lead to a very large and complex tree. This tree will likely overfit the training data. Overfitting means the tree is too closely tailored to the training data (including noise) and won’t generalize well to new, unseen data. It will have low bias but high variance.
A smaller tree with fewer splits can have:

- Lower variance, making it less prone to overfitting.
- Better interpretability, since there are fewer rules to follow.
- The trade-off is a small increase in bias.
Tip
Cost complexity pruning (also known as weakest link pruning) is a technique to find the best subtree of the original, large tree. It helps us find a simpler tree that balances accuracy and complexity. It balances the trade-off between model fit and model size.
Goal: Find a subtree T (which is a smaller version of the full tree, T0) that minimizes a penalized version of the RSS:
\[ \sum_{m=1}^{|T|} \sum_{x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha|T| \]
|T|: The number of terminal nodes (leaves) in the subtree T. This is a measure of the tree’s complexity.
α: A tuning parameter that controls the trade-off between the subtree’s complexity (number of leaves) and how well it fits the training data (RSS). A larger α leads to a larger penalty for complexity.
Note
As we increase the value of α, the penalty for having many terminal nodes increases. This leads to smaller and smaller subtrees. The tuning parameter α controls the balance between a complex, well-fitting tree and a simple, less-well-fitting tree.
Here’s the complete algorithm for building a regression tree, including pruning:
Grow a Large Tree: Use recursive binary splitting to grow a large initial tree (T0) on the training data. Stop splitting when each terminal node has fewer than some pre-specified minimum number of observations.
Cost Complexity Pruning: Apply cost complexity pruning to the large tree (T0) to obtain a sequence of best subtrees, as a function of the tuning parameter α. For each value of α, there is a corresponding subtree that minimizes the penalized RSS.
Choose α: Use K-fold cross-validation to choose the optimal value of α.
Return Subtree: Return the subtree from Step 2 that corresponds to the chosen α. This is your final, pruned tree. This subtree gives the best balance between fit and complexity, as determined by cross-validation.
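Here is a hedged sketch of this recipe in scikit-learn, where cost complexity pruning is controlled by the ccp_alpha parameter and K-fold cross-validation picks α. Synthetic data stands in for a real training set.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data as a stand-in for a real training set.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Step 1: grow a large tree and extract the sequence of candidate alphas.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

# Steps 2-3: for each alpha, estimate test error by 5-fold cross-validation.
cv_scores = [
    cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                    X, y, cv=5, scoring="neg_mean_squared_error").mean()
    for a in path.ccp_alphas
]
best_alpha = path.ccp_alphas[int(np.argmax(cv_scores))]

# Step 4: refit the pruned tree with the chosen alpha.
pruned = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
print(f"alpha = {best_alpha:.3f}, leaves = {pruned.get_n_leaves()}")
```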
Classification trees are very similar to regression trees, but they are used to predict a qualitative response (a category or class label) rather than a continuous value.
For classification, RSS cannot be used as the splitting criterion; instead we measure node purity using \(\hat{p}_{mk}\), the proportion of training observations in the \(m\)-th region that belong to the \(k\)-th class. The classification error rate is the simplest option:
\[ E = 1 - \max_k (\hat{p}_{mk}) \]
The Gini index is a more sensitive measure of node purity:
\[ G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}) \]
Entropy is another measure of node impurity:
\[ D = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk} \]
Tip
The Gini index and entropy are preferred over the classification error rate for growing the tree because they are more sensitive to node purity. For pruning the tree, any of the three criteria can be used, but the classification error rate is often preferred because it directly relates to the final prediction accuracy.
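To compare the three measures directly, the sketch below evaluates each one on a vector of class proportions \(\hat{p}_{mk}\) for a single node; the example proportions are made up.

```python
import numpy as np

def classification_error(p):
    return 1.0 - np.max(p)

def gini_index(p):
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p)))

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                     # avoid log(0); 0 * log 0 is treated as 0
    return float(-np.sum(p * np.log(p)))

# A fairly pure node vs. a completely mixed node (two classes).
for p in ([0.9, 0.1], [0.5, 0.5]):
    print(p, classification_error(p), round(gini_index(p), 3), round(entropy(p), 3))
```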
This figure shows an unpruned classification tree built on the Heart dataset:
Heart data unpruned tree
The tree predicts whether a patient has heart disease (Yes or No) based on 13 predictors.
Note
Notice that qualitative predictors (like `ChestPain`) can be handled directly by decision trees, without needing to create dummy variables. The tree automatically handles the different categories of qualitative predictors. For example, the `ChestPain` variable has four levels: “typical angina,” “atypical angina,” “non-anginal pain,” and “asymptomatic.”
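If you want to reproduce a tree like this in Python, note that scikit-learn (unlike some implementations) requires numeric inputs, so qualitative predictors such as `ChestPain` must be dummy-encoded first. The file name `Heart.csv` and the outcome column `AHD` are assumptions about how the data is stored.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Assumed file and column names for the Heart data (AHD is the Yes/No outcome).
heart = pd.read_csv("Heart.csv").dropna()
y = heart["AHD"]
X = pd.get_dummies(heart.drop(columns=["AHD"]))   # dummy-encode ChestPain, etc.

clf = DecisionTreeClassifier(criterion="gini", random_state=0)  # or "entropy"
clf.fit(X, y)
print(clf.get_n_leaves(), clf.score(X, y))   # unpruned tree; training accuracy
```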
Tip
The best model depends on the true underlying relationship between the predictors and the response. If the relationship is close to linear, linear regression will likely outperform a decision tree. If the relationship is highly non-linear and complex, decision trees may be better.
This figure compares decision boundaries of linear models and trees:
Trees vs. Linear Models
Tip
Aggregating multiple decision trees (using techniques like bagging, random forests, and boosting) can significantly improve predictive performance and robustness. This is the idea behind ensemble methods, which combine the predictions of multiple models to create a more accurate and stable model.
An ensemble method combines multiple “weak learner” models (like decision trees) to create a more powerful model. Think of it as the “wisdom of the crowd”: the combined judgment of a group is often better than the judgment of any individual member.
Bootstrap: Generate B different bootstrapped training datasets. Bootstrapping means taking repeated random samples with replacement from the original training data. Each bootstrapped dataset has the same size as the original dataset, but some observations will be repeated, and others will be left out.
Train: Train a decision tree on each of the B bootstrapped datasets. Typically, these trees are grown deep (large trees) and are not pruned. Growing deep trees allows them to capture complex patterns, and averaging reduces the risk of overfitting.
Average: For a given test observation (a sketch of the whole recipe follows this list):

- Regression: average the predictions made by the B trees.
- Classification: take a majority vote among the classes predicted by the B trees.
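Written out explicitly, the three steps look roughly like the sketch below (synthetic data stands in for a real training set); in practice a wrapper such as scikit-learn's BaggingRegressor does the same thing.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
rng = np.random.default_rng(0)
B = 100
trees = []

for _ in range(B):
    # Bootstrap: sample n observations with replacement.
    idx = rng.integers(0, len(y), size=len(y))
    # Train: grow a deep, unpruned tree on the bootstrapped sample.
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Average: the bagged prediction is the mean of the B trees' predictions.
x_new = X[:5]
bagged_pred = np.mean([t.predict(x_new) for t in trees], axis=0)
print(bagged_pred)
```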
OOB Observations: For each tree, the observations that were not included in the bootstrapped sample are called the “out-of-bag” (OOB) observations. On average, about one-third of the observations are OOB for each tree.
OOB Prediction: We can predict the response for each observation using only the trees in which that observation was OOB. This provides a way to estimate the prediction error without using a separate validation set.
OOB Error: The OOB error is a valid estimate of the test error of the bagged model. It’s a very convenient way to assess performance, as it doesn’t require setting aside a separate validation set.
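In scikit-learn the OOB estimate is available directly by passing oob_score=True; the sketch below shows this for a bagged regression model on synthetic data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# The default base learner for BaggingRegressor is a deep decision tree.
bagger = BaggingRegressor(n_estimators=200, oob_score=True, random_state=0)
bagger.fit(X, y)

# R^2 computed only from out-of-bag predictions: a built-in test-error estimate.
print(bagger.oob_score_)
```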
Initialize: Set the initial prediction \(\hat{f}(x)\) to 0 for all observations, and set the residuals \(r_i\) equal to the observed response values \(y_i\).
Iterate (for b = 1 to B):

- Fit a tree \(\hat{f}^b\) with \(d\) splits (\(d + 1\) terminal nodes) to the training data \((X, r)\), using the current residuals as the response.
- Update the model by adding a shrunken version of the new tree: \(\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)\).
- Update the residuals: \(r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)\).
Output: The boosted model is the sum of all the trees: \[\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)\]
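This algorithm maps almost line-for-line onto code. The sketch below implements the loop for regression on synthetic data, with the number of trees B, the shrinkage λ, and the tree size d as explicit knobs.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)
B, lam, d = 500, 0.01, 1                # number of trees, shrinkage, splits per tree

f_hat = np.zeros_like(y, dtype=float)   # initialize f(x) = 0
r = y.astype(float).copy()              # residuals start as the responses
trees = []

for b in range(B):
    # d splits = d + 1 terminal nodes, fit to the current residuals.
    tree = DecisionTreeRegressor(max_leaf_nodes=d + 1, random_state=b).fit(X, r)
    trees.append(tree)
    update = lam * tree.predict(X)
    f_hat += update                     # add the shrunken tree to the model
    r -= update                         # the next tree is fit to what is left over

print(np.mean((y - f_hat) ** 2))        # training MSE after B boosting rounds
```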
BART, similar to other ensemble methods, employs decision trees as its foundational components. However, BART distinguishes itself through several key characteristics:

- Rather than fitting a fresh tree at each step, each iteration perturbs the trees from the previous iteration, modifying their structure or the predictions in their terminal nodes.
- Perturbing existing trees helps the procedure avoid getting stuck in local optima.
- BART has a Bayesian interpretation: each perturbation can be viewed as drawing a new tree from a posterior distribution, which provides a natural way to quantify uncertainty.
This table summarizes the key characteristics of the ensemble methods we’ve discussed:
| Method | Tree Growth | Data Sampling | Key Idea |
|---|---|---|---|
| Bagging | Independent | Bootstrapped | Average many trees to reduce variance. |
| Random Forests | Independent | Bootstrapped + Random Subset | Decorrelate trees by limiting the predictors considered at each split. |
| Boosting | Sequential | None (Modified Data) | Learn slowly by fitting small trees to the residuals. |
| BART | Sequential, Perturbed | None | Perturb trees (modify existing trees) to avoid local optima and provide a Bayesian interpretation. |
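To illustrate the “Bootstrapped + Random Subset” row, the sketch below contrasts bagging with a random forest in scikit-learn; the only difference is max_features, the number of predictors eligible at each split (the value 4, roughly p/3, is just an illustrative choice).

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=12, noise=5.0, random_state=0)

# Bagging: every predictor is a candidate at every split.
bagging = RandomForestRegressor(n_estimators=200, max_features=None,
                                oob_score=True, random_state=0).fit(X, y)

# Random forest: only a random subset of predictors is considered at each
# split, which decorrelates the trees.
forest = RandomForestRegressor(n_estimators=200, max_features=4,
                               oob_score=True, random_state=0).fit(X, y)

print(bagging.oob_score_, forest.oob_score_)
```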
邱飞(peter) 💌 [email protected]