Zhejiang Wanli University
Statistical learning is a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. 📊 It’s like having a toolbox filled with different instruments to analyze and interpret the information hidden within datasets.
Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. We have a target variable we want to predict, like a teacher guiding the learning process. 🎯 Think of predicting house prices based on features like size, location, and number of bedrooms.
With unsupervised statistical learning, there are inputs but no supervising output; nevertheless, we can learn relationships and structure from such data. We’re exploring the data to find patterns, like a detective searching for clues without knowing exactly what they’re looking for. 🔍 An example is grouping customers into different segments based on their purchasing behavior.
This chapter introduces many of the key concepts of statistical learning, focusing on the fundamental ideas, which include:
Let’s clarify the relationship between these often-used terms. These are all related, but have slightly different focuses. 🧐 It’s like comparing different members of a family - they share common traits but also have unique characteristics.
Data Mining: The process of discovering patterns, anomalies, and insights from large datasets, often using computational techniques. It emphasizes finding any interesting pattern, even if it’s not directly related to a specific prediction task. Think of it as exploring a vast dataset to find hidden treasures. ⛏️💎 Imagine you’re sifting through a mountain of sand to find gold nuggets.
Machine Learning: A field of computer science focused on algorithms that can learn from data without explicit programming. It’s heavily focused on prediction – enabling computers to make accurate predictions on new, unseen data. Think of it as teaching a computer to learn from examples. 🤖📚 Imagine a robot learning to play a game by observing human players and trying different strategies.
Statistical Learning: A subfield of statistics that focuses on developing and applying statistical models and methods for prediction and inference. It emphasizes understanding the relationships between variables and making inferences about the underlying data-generating process. It combines the goals of understanding and prediction. 📈🔍 Think of a scientist using data to build a model of a physical phenomenon, both to predict its behavior and to understand the underlying mechanisms.
graph LR
    A[Data Mining] --> C(Common Ground: Extracting Information from Data)
    B[Machine Learning] --> C
    D[Statistical Learning] --> C
    C --> E[Insights & Predictions]
This diagram shows how data mining, machine learning, and statistical learning all share the common goal of extracting information from data, which leads to insights and predictions. It’s like a Venn diagram showing the overlap and distinct areas of each field.
To motivate our study, let’s consider a simple example. A company wants to understand how advertising spending affects product sales. 📈🛍️ Imagine you’re running a business and want to know how best to allocate your advertising budget.
Data: Sales of a product in 200 different markets, along with advertising budgets for TV, radio, and newspaper. This is like having a spreadsheet with sales figures and advertising spending for different regions.
Goal: Build a model to predict sales based on the advertising budgets in the three media. The company wants to find a formula that connects advertising spending to sales.
Let’s define the key components in our statistical learning framework.
In the advertising example:
Input variables (X): Advertising budgets (TV, radio, newspaper). These are also called predictors, independent variables, or features. We often denote them as X₁, X₂, X₃, …, Xₚ. These are the things we can control or observe. 🛠️ These are like the ingredients in a recipe.
Output variable (Y): Sales. This is also called the response or dependent variable. We’re trying to predict or understand Y. This is the outcome we’re interested in. 🎯 This is like the final dish in a recipe.
We will use all of these terms interchangeably.
Here, we have three input variables (p=3) representing the advertising budgets for different media, and one output variable (sales). It’s like having three dials (advertising budgets) that we can adjust to try to control the outcome (sales).
Let’s look at the relationship between advertising spending and sales.
More generally, we assume a relationship between the response Y and predictors X. This is the foundation of statistical learning.
\[ Y = f(X) + \epsilon \]
Goal of Statistical Learning: Estimate the unknown function f.
Our primary goal is to find the best possible estimate of the function f, which describes the relationship between our predictors and the response. It’s like trying to find the best possible approximation of the secret formula.
The error term, ε, is crucial. It represents the “noise” in our data.
The error term, ε, captures all the factors that affect Y but are not included in our predictors X. This could include:
Note
The error term is crucial. It acknowledges that our models are approximations of reality. Even the “best” model won’t be perfect. It’s a reminder that there’s always some uncertainty. It’s like acknowledging that our map of the world is not the world itself.
Let’s look at another example: predicting income based on education and seniority.
Income (in thousands of dollars) versus years of education and years of seniority for 30 individuals. Each red point represents a person. This allows us to visualize the relationship between income and two predictors simultaneously.
Income (in thousands of dollars) versus years of education for 30 individuals. Each red point represents a person. This is a 2D projection of the 3D data, showing only the relationship between income and education.
There are two main reasons to estimate f: Prediction and Inference. It’s like having two different goals when exploring a new city – you might want to find the fastest route to a specific destination (prediction), or you might want to understand the layout of the city and how different neighborhoods are connected (inference).
Prediction: We want to predict Y given a set of X values. We don’t necessarily care about the exact form of f, just that it gives accurate predictions (treat f as a “black box”). We want the best possible guess for Y. 🔮 This is like using a GPS to find the best route – you don’t need to know how the GPS works internally, just that it gives you accurate directions.
\[ \hat{Y} = \hat{f}(X) \]
Inference: We want to understand the relationship between Y and X. We do care about the form of f. We want to answer questions about how the predictors influence the response. 🕵️ This is like studying a map to understand how different roads are connected and how traffic flows in a city.
The accuracy of our prediction, Ŷ, depends on two types of error:
Reducible Error: Error due to our estimate of f (f̂) not being perfect. We can reduce this error by choosing better statistical learning techniques, improving our model. 💪 This is like improving your driving skills to get to your destination faster.
Irreducible Error: Error due to the random error term, ε. Even if we knew the true f, we cannot predict ε. This sets a limit on how accurate our predictions can be. This is the inherent randomness we can’t eliminate. 🤷 This is like encountering unexpected traffic – you can’t eliminate it, no matter how good a driver you are.
\[E(Y - \hat{Y})^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{Var(\epsilon)}_{\text{Irreducible}}\]
Note
Our goal is to minimize the reducible error.
We focus on reducing the reducible error because that’s the part we can control through better modeling. It’s like focusing on improving our driving skills, rather than worrying about unpredictable traffic.
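To make the two kinds of error concrete, here is a small simulation sketch. Everything in it is an assumption chosen for illustration: the "true" f is a made-up linear function, the noise has Var(ε) = 1, and the imperfect estimate `f_hat` is simply f shifted by 0.5. Predicting with the true f leaves only the irreducible error; predicting with `f_hat` adds a reducible part of about 0.5² = 0.25.

```python
import random

def f(x):
    # Hypothetical "true" relationship, chosen only for illustration.
    return 2.0 + 3.0 * x

def f_hat(x):
    # An imperfect estimate of f: off by a constant 0.5 everywhere.
    return f(x) + 0.5

rng = random.Random(0)
n = 100_000
sigma = 1.0  # standard deviation of the noise, so Var(eps) = 1.0

sq_err_true_f = 0.0
sq_err_f_hat = 0.0
for _ in range(n):
    x = rng.uniform(0.0, 1.0)
    y = f(x) + rng.gauss(0.0, sigma)       # Y = f(X) + eps
    sq_err_true_f += (y - f(x)) ** 2       # error when f is known exactly
    sq_err_f_hat += (y - f_hat(x)) ** 2    # error when using the estimate

mse_true_f = sq_err_true_f / n  # close to Var(eps): irreducible part only
mse_f_hat = sq_err_f_hat / n    # close to 0.5**2 + Var(eps)

print(f"MSE using the true f: {mse_true_f:.3f}")
print(f"MSE using f_hat:      {mse_f_hat:.3f}")
```

Even with the true f, the average squared error stays near Var(ε). A better `f_hat` can only remove the extra part on top of that.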
When our goal is inference, we want to understand how Y changes as a function of X₁, …, Xₚ.
We’re interested in questions like:
We care about interpretability, the form of f, and statistical significance.
Let’s see an example where prediction is the primary goal.
Scenario: A company wants to target a direct-marketing campaign to individuals likely to respond positively. They want to send their advertisements to the people most likely to buy their product.
Note
This is a classic prediction problem. The model is a “black box”.
We don’t necessarily care why certain people respond, just that they respond. It’s like knowing that a certain machine produces good results, without knowing exactly how it works internally.
Now, let’s see an example where inference is the primary goal.
Scenario: Analyze the Advertising data (Figure 2.1). We want to understand how different types of advertising affect sales.
Questions to answer:
We want to understand how each type of advertising is associated with sales. Whether those associations are causal is a further question that the data alone may not settle.
Note
This is an inference problem. We want to understand the relationships.
We care about the why, not just the prediction. It’s like trying to understand why a certain medicine works, not just that it works.
We use training data to “teach” our statistical learning method how to estimate f. It’s like learning from examples.
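Training data: A set of observed data points: {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}, where xᵢ = (xᵢ₁, xᵢ₂, …, xᵢₚ) contains the p predictor values for the i-th observation and yᵢ is the corresponding response.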
Goal: Find a function, f̂, such that Y ≈ f̂(X) for any observation (X, Y). We want our estimated function to be close to the true function for all possible data points. We want our model to generalize well to new data.
Two broad approaches: Parametric and non-parametric methods. These are like two different strategies for learning – one involves making assumptions, the other doesn’t.
A two-step, model-based approach. We make an assumption about the shape of f. It’s like assuming a specific recipe for a dish.
Assume a functional form for f. For example, assume f is linear:
\[ f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]
This reduces the problem to estimating the p + 1 coefficients (β₀, β₁, …, βₚ). We’ve simplified the problem to estimating a fixed number of parameters. It’s like assuming the dish can be made with a specific set of ingredients and specific proportions.
Use training data to fit or train the model. Find the values of the parameters (β₀, β₁, …, βₚ) that best fit the data. A common method is (ordinary) least squares. We use the data to find the best values for these parameters. It’s like adjusting the proportions of the ingredients to get the best taste, based on tasting (the training data).
Note
Parametric methods simplify the problem by assuming a specific form for f.
This simplification makes the problem much easier to solve. It’s like having a recipe – it makes cooking easier, but it limits the possible variations of the dish.
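The two steps can be sketched in a few lines of Python. This is a simplified illustration with a single predictor (p = 1), so the least squares estimates have a closed form. The helper `fit_simple_linear` and the education/income numbers are hypothetical, made up for this example and not taken from the Income data.

```python
def fit_simple_linear(xs, ys):
    """Least squares estimates (b0, b1) for the model y ~ b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical training data: years of education, income in thousands of dollars
education = [10, 12, 14, 16, 18, 20]
income = [26.0, 33.5, 41.0, 52.5, 60.0, 71.5]

# Step 1: assume f is linear. Step 2: estimate its two coefficients.
b0, b1 = fit_simple_linear(education, income)

def f_hat(x):
    return b0 + b1 * x

print(f"f_hat(x) = {b0:.2f} + {b1:.2f} * x")
print(f"Predicted income at 15 years of education: {f_hat(15):.1f}")
```

Once the linear form is assumed, the whole estimate of f is summarized by two numbers. With p predictors the same idea applies, but the coefficients are usually found with a linear algebra routine rather than by hand.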
Let’s see how a linear model fits the income data.
Linear model fit to the Income data (Figure 2.3): income ≈ β₀ + β₁ × education + β₂ × seniority.
Let’s weigh the pros and cons of parametric methods.
Advantage: Simplifies the problem of estimating f. It’s easier to estimate a few parameters than an entirely arbitrary function. Computationally efficient. It’s like having a recipe – it’s easier to follow than to invent a dish from scratch.
Disadvantage: The assumed form of f might be wrong. If the true f is very different from our assumed form, our estimate will be poor. The model might be too simple to capture the true relationship. It’s like trying to make a cake using a cookie recipe – it won’t work very well.
Overfitting: If we use a very complex (flexible) model, we might overfit the data. This means the model follows the noise (random error) too closely, resulting in poor predictions on new data. The model might be too complex and capture noise instead of the true signal. It’s like memorizing the training data instead of learning the underlying pattern.
Don’t make assumptions about the shape of f. Let the data speak for itself! It’s like cooking without a recipe – you rely on your senses and experience.
Let’s see a non-parametric method in action.
A thin-plate spline (yellow surface) fit to the Income data. A thin-plate spline is a flexible method that can fit a wide variety of shapes.
This is a non-parametric method. No pre-specified model is assumed. The shape of the surface is determined entirely by the data.
The fit is much closer to the true f (Figure 2.3) than the linear fit. It captures the non-linear relationship between income, education, and seniority more accurately.
This is a smooth fit. It captures the general trend without being too wiggly. It’s not overly sensitive to individual data points.
Let’s see what happens when we make the non-parametric model too flexible.
Same data, but a rougher thin-plate spline fit. This spline is more flexible than the previous one.
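This fit perfectly matches the training data (zero error on training data!). It goes through every single data point. This is overfitting: the surface follows the noise in the training data, so it is unlikely to predict new observations well.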
There’s a fundamental trade-off in statistical learning: accuracy vs. interpretability. It’s like choosing between a powerful but complex tool and a simple but easy-to-use tool.
| Flexibility | Interpretability |
|---|---|
| Low (e.g., Linear Regression) | High |
| High (e.g., Neural Networks) | Low |
Why might we choose a simpler model, even if it’s less flexible?
Even if we only care about prediction, a more restrictive model (like linear regression) can sometimes outperform a more flexible model!
Reasons:
Let’s revisit the distinction between supervised and unsupervised learning.
Note
The distinction between supervised and unsupervised learning isn’t always clear-cut.
Some methods can be used in both supervised and unsupervised settings. It’s like having a tool that can be used for different purposes.
Let’s look at an example of unsupervised learning: cluster analysis.
Left: 150 observations, two variables (X₁, X₂).
Three well-separated groups (clusters). Clustering should easily identify these. The groups are distinct and easy to separate. It’s like having three clearly separated piles of different objects.
Goal: Identify distinct groups without knowing the group labels beforehand. We’re trying to find hidden structure in the data. It’s like trying to sort objects into groups without knowing what the groups should be.
In the examples shown, there are only two variables, and we can check the scatterplots to identify clusters. But in practice, we often have many more variables, making visual inspection impossible. We need to use clustering and other unsupervised learning approaches.
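As one possible illustration, here is a minimal sketch of k-means, a common clustering method that the text above does not name. The data are simulated under assumed settings (three groups of 50 points around made-up centers), and the farthest-point initialization is a simple choice made here to keep the example deterministic. The algorithm never sees `true_groups`; it is kept only to check the result.

```python
import math
import random

def farthest_point_init(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def k_means(points, k, n_iter=20):
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center
        assignments = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members
        new_centers = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return assignments, centers

# Simulated data: 150 observations, two variables, three well-separated groups
rng = random.Random(4)
group_means = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points, true_groups = [], []
for g, (mx, my) in enumerate(group_means):
    for _ in range(50):
        points.append((rng.gauss(mx, 0.5), rng.gauss(my, 0.5)))
        true_groups.append(g)  # used only for checking, never by k_means

assignments, centers = k_means(points, k=3)
for j, center in enumerate(centers):
    print(f"cluster {j}: {assignments.count(j)} points, center = ({center[0]:.2f}, {center[1]:.2f})")
```

With groups this well separated, the recovered clusters line up with the simulated groups. With overlapping groups or many variables the task is much harder, which is the situation described above.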
Within supervised learning, we have two main types of problems: regression and classification. It’s like having two different types of questions – one asking “how much?” and the other asking “which one?”.
Note
The type of response variable (quantitative or qualitative) is the key distinction. It’s like the type of answer you’re looking for determines the type of question you ask.
How do we measure how well our model performs in a regression setting? How do we know if our predictions are good?
Goal: Quantify how well our predictions match the observed data. We want our predictions to be close to the true values. It’s like measuring how close our darts are to the bullseye.
Mean Squared Error (MSE): A common measure in regression:
\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 \]
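The formula translates directly into code. The sketch below uses three made-up observations and predictions so the arithmetic can be followed by hand; the function name `mse` is just a label chosen for this example.

```python
def mse(y_true, y_pred):
    """Mean squared error: the average of the squared prediction errors."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

y_obs = [3.0, 5.0, 7.0]   # observed responses y_i
y_fit = [2.0, 5.0, 9.0]   # predictions f_hat(x_i)

# Errors are 1, 0, -2; squared they are 1, 0, 4; their mean is 5/3.
print(mse(y_obs, y_fit))
```

Squaring means that large misses are penalized much more heavily than small ones: the single error of 2 contributes four times as much as the error of 1.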
We need to distinguish between how well our model fits the data it was trained on and how well it generalizes to new data.
We usually don’t care how well the model fits the training data. We care about how well it predicts new data (the test data). It’s like caring about how well a student performs on a real exam, not just on practice questions.
A model with low training MSE might have high test MSE (overfitting!). It might be memorizing the training data instead of learning the underlying pattern. It’s like a student who memorizes the answers to practice questions but doesn’t understand the concepts.
Training MSE: Calculated using the training data. How well the model fits the data it was trained on.
Test MSE: Calculated using new, unseen data (test data). This is what we really care about! This measures how well our model will perform in the real world. It’s like evaluating the model on data it has never seen before.
Ideally: We’d choose the model with the lowest test MSE.
Problem: We often don’t have test data when building the model.
Solution: Techniques like cross-validation (Chapter 5) can help us estimate the test MSE using the training data. It’s like simulating a real exam using the practice questions.
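The gap between training and test MSE can be seen in a small simulation. This sketch makes several assumptions for illustration: the true f is a made-up sine curve, the learning method is K-nearest-neighbors averaging (chosen here because its flexibility is easy to control, with smaller K meaning more flexible), and a separate test set is available only because the data are simulated.

```python
import math
import random

def true_f(x):
    # Hypothetical non-linear "true" f, used only to simulate data.
    return math.sin(2.0 * x)

def simulate(n, rng, sigma=0.3):
    xs = [rng.uniform(0.0, 3.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys

def knn_predict(x0, train_x, train_y, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
    return sum(train_y[i] for i in nearest) / k

def knn_mse(xs, ys, train_x, train_y, k):
    preds = [knn_predict(x, train_x, train_y, k) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

rng = random.Random(1)
train_x, train_y = simulate(100, rng)
test_x, test_y = simulate(1000, rng)

results = {}
for k in (1, 5, 25, 100):
    results[k] = (
        knn_mse(train_x, train_y, train_x, train_y, k),  # training MSE
        knn_mse(test_x, test_y, train_x, train_y, k),    # test MSE
    )
    print(f"K={k:3d}  train MSE={results[k][0]:.3f}  test MSE={results[k][1]:.3f}")
```

With K = 1 each training point is its own nearest neighbor, so the training MSE is exactly zero, yet the test MSE is not. With K = 100 the method predicts the overall average everywhere and does poorly on both. Picking a model by training MSE alone would always favor K = 1.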
Let’s see how training and test MSE change as we vary model flexibility.
The U-shape in the test MSE curve is due to two competing properties: bias and variance. It’s like a seesaw – as one goes up, the other goes down.
The expected test MSE at x₀ can be decomposed as:
\[E(y_0 - \hat{f}(x_0))^2 = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)\]
This equation shows that the expected test MSE is the sum of the variance of our estimate, the squared bias of our estimate, and the variance of the error term. It’s like saying the total error is the sum of errors due to inconsistency, errors due to oversimplification, and errors due to randomness.
To minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias.
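The decomposition can be checked numerically. In this sketch we repeatedly draw a fresh training set, fit a model, and record its prediction at a fixed point x₀. The setup is assumed for illustration only: a made-up true f(x) = x², noise with Var(ε) = 0.25, and K-nearest-neighbors averaging with K = 5 as the learning method.

```python
import random

def true_f(x):
    return x ** 2  # hypothetical true f

SIGMA = 0.5   # sd of the noise, so Var(eps) = 0.25
X0 = 0.5      # the point at which predictions are evaluated

def knn_fit_at_x0(rng, n=50, k=5):
    """Draw a fresh training set and return the KNN regression estimate at X0."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0.0, SIGMA) for x in xs]
    nearest = sorted(range(n), key=lambda i: abs(xs[i] - X0))[:k]
    return sum(ys[i] for i in nearest) / k

rng = random.Random(2)
reps = 4000
fits = []
test_sq_errors = []
for _ in range(reps):
    f_hat_x0 = knn_fit_at_x0(rng)
    y0 = true_f(X0) + rng.gauss(0.0, SIGMA)   # a new test response at X0
    fits.append(f_hat_x0)
    test_sq_errors.append((y0 - f_hat_x0) ** 2)

mean_fit = sum(fits) / reps
variance = sum((v - mean_fit) ** 2 for v in fits) / reps   # Var(f_hat(x0))
squared_bias = (mean_fit - true_f(X0)) ** 2                # [Bias(f_hat(x0))]^2
irreducible = SIGMA ** 2                                   # Var(eps)

expected_test_mse = sum(test_sq_errors) / reps
decomposition = variance + squared_bias + irreducible

print(f"variance={variance:.4f}  bias^2={squared_bias:.4f}  Var(eps)={irreducible:.4f}")
print(f"sum of the three terms: {decomposition:.4f}")
print(f"simulated test MSE:     {expected_test_mse:.4f}")
```

The two printed totals should agree up to simulation noise. Changing `k` in `knn_fit_at_x0` shifts error between the variance and bias terms, while Var(ε) stays fixed.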
Let’s visualize the bias-variance trade-off.
Squared bias (blue), variance (orange), irreducible error (dashed), and test MSE (red) for the three examples in Figure 2.9.
Left (non-linear f): As flexibility increases, bias decreases rapidly while variance increases only slowly.
Key takeaway: Good models need both low variance and low bias. This is a trade-off! We need to find the right balance between bias and variance. It’s like finding the sweet spot on a seesaw.
How do we measure model performance in a classification setting? How do we know if our classifier is making good predictions?
Error Rate: The proportion of mistakes made by the classifier. We count the number of times our model predicts the wrong class. It’s like counting the number of wrong answers on a test.
Training error rate:
\[ \frac{1}{n} \sum_{i=1}^{n} I(y_i \ne \hat{y}_i) \]
Test error rate: Ave(I(y₀ ≠ ŷ₀)). This is what we care about! The average error rate on new, unseen data. It’s like the average error rate on a real exam.
Goal: Choose the classifier with the lowest test error rate.
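The error rate is just the average of the indicator I(yᵢ ≠ ŷᵢ), which equals 1 for a mistake and 0 otherwise. A minimal sketch with made-up labels and predictions:

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Sum of the indicator I(y_i != y_hat_i) over all observations
    mistakes = sum(1 for y, p in zip(y_true, y_pred) if y != p)
    return mistakes / len(y_true)

labels = ["orange", "blue", "orange", "blue"]
predictions = ["orange", "orange", "orange", "blue"]

# One mistake (the second observation) out of four
print(error_rate(labels, predictions))
```

The same function gives the training error rate when applied to the training data and the test error rate when applied to new, unseen observations.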
The Bayes classifier is the theoretical “best” classifier. It’s like the ideal student who always gets the right answer.
The “ideal” classifier: Assigns each observation to the most likely class, given its predictor values. It makes the best possible prediction based on the true probabilities. It’s like knowing the exact probability of each answer being correct and always choosing the most probable one.
Conditional probability: Pr(Y = j | X = x₀) is the probability that Y = j (class j), given the predictor values x₀. It is the probability of belonging to a specific category, given the observed features.
Bayes Classifier: Assigns an observation to the class j for which Pr(Y = j | X = x₀) is largest. Choose the class with the highest probability.
Bayes Decision Boundary: The points where the conditional probabilities for different classes are equal. This is the boundary between where we would predict different classes. It’s like the dividing line between different territories.
Bayes Error Rate: The lowest possible test error rate achievable. Analogous to the irreducible error. This is the best we can possibly do, even with perfect knowledge. It’s like the minimum possible error rate, even for the ideal student.
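The Bayes classifier can only be computed when the true distribution is known, which happens in simulations. The sketch below assumes such a toy setup: one predictor, two equally likely classes, and normal class-conditional densities with made-up means 0 and 2 and standard deviation 1. All of these numbers are illustrative assumptions.

```python
import math

# Assumed setup: two equally likely classes with known normal densities
MEANS = {"blue": 0.0, "orange": 2.0}
PRIOR = {"blue": 0.5, "orange": 0.5}

def normal_pdf(x, mean, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def posterior(x0):
    """Pr(Y = j | X = x0) for each class j, via Bayes' theorem."""
    joint = {j: PRIOR[j] * normal_pdf(x0, MEANS[j]) for j in MEANS}
    total = sum(joint.values())
    return {j: joint[j] / total for j in joint}

def bayes_classify(x0):
    """Assign x0 to the class with the largest conditional probability."""
    probs = posterior(x0)
    return max(probs, key=probs.get)

# With equal priors and equal variances, the Bayes decision boundary is the
# midpoint of the two means, where both conditional probabilities equal 0.5.
boundary = (MEANS["blue"] + MEANS["orange"]) / 2.0

# Bayes error rate: the probability mass each class places on the wrong side.
bayes_error = (PRIOR["blue"] * (1.0 - normal_cdf(boundary - MEANS["blue"]))
               + PRIOR["orange"] * normal_cdf(boundary - MEANS["orange"]))

print(bayes_classify(0.3), bayes_classify(1.7))
print(f"boundary = {boundary}, Bayes error rate = {bayes_error:.4f}")
```

Because the two densities overlap, the Bayes error rate here is about 16%, not zero. No classifier can do better on this simulated problem.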
Let’s visualize the Bayes classifier.
Simulated data, two classes (orange, blue).
Purple dashed line: Bayes decision boundary. This is the line where the probability of belonging to the orange class is equal to the probability of belonging to the blue class.
Orange/blue shaded regions: Regions where the Bayes classifier would predict orange/blue.
The Bayes error rate is greater than zero because the classes overlap. Even the best classifier will make mistakes because the classes are not perfectly separable. It’s like having some questions on a test that are ambiguous, even for the best student.
KNN is a practical method that approximates the Bayes classifier. It’s like a practical student who tries to learn from their peers.
Problem: In reality, we don’t know the conditional distribution of Y given X. So, we can’t directly use the Bayes classifier. We don’t know the true probabilities. It’s like not knowing the exact probabilities of each answer being correct.
KNN: A non-parametric method that estimates the conditional distribution and then classifies based on the estimate. It uses the training data to approximate the probabilities. It’s like looking at similar past exam questions and their answers to guess the answer to a new question.
Let’s visualize how KNN works.
Left: Small training set (6 blue, 6 orange). Black cross is the test observation. We want to predict the class of the black cross.
Circle shows the 3 nearest neighbors (K=3): 2 blue, 1 orange. We find the three closest training observations to the black cross.
KNN predicts “blue”. Because the majority of the neighbors are blue.
Right: KNN decision boundary (K=3) for all possible values of X₁ and X₂. This shows how KNN would classify any point in the space. This is like drawing a map showing which class KNN would predict for any combination of X₁ and X₂.
KNN can produce a decision boundary and classifier that are close to those of the Bayes classifier.
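The walkthrough above can be sketched in code. The coordinates below are made up so that, as in the description, there are 6 blue and 6 orange training points and the 3 nearest neighbors of the test point are 2 blue and 1 orange. The function `knn_classify` is a minimal illustration, not a tuned implementation.

```python
import math
from collections import Counter

def knn_classify(x0, train_points, train_labels, k):
    """Return the majority class among the k nearest training points,
    together with the estimated class probabilities at x0."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(x0, train_points[i]))
    neighbor_labels = [train_labels[i] for i in order[:k]]
    votes = Counter(neighbor_labels)
    # Estimated Pr(Y = j | X = x0): fraction of the k neighbors in class j
    estimated_probs = {label: count / k for label, count in votes.items()}
    predicted = votes.most_common(1)[0][0]
    return predicted, estimated_probs

# Hypothetical training set: 6 blue and 6 orange observations
train_points = [
    (0.1, 0.0), (0.0, 0.2), (2.0, 2.0), (-2.0, 1.5), (1.5, -2.0), (-1.8, -1.9),
    (0.15, 0.15), (1.0, 1.0), (-1.0, -1.5), (2.5, 0.5), (-2.5, -0.5), (0.5, 2.5),
]
train_labels = ["blue"] * 6 + ["orange"] * 6

test_point = (0.0, 0.0)  # the "black cross"
predicted, probs = knn_classify(test_point, train_points, train_labels, k=3)
print(predicted, probs)
```

The fractions in `probs` are KNN's estimate of the conditional probabilities that the Bayes classifier would use if they were known.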
The choice of K (the number of neighbors) is crucial in KNN. It’s like choosing how many friends to ask for advice.
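The choice of K (the number of neighbors) controls the flexibility of the KNN classifier. A small K gives a very flexible decision boundary with low bias but high variance; a large K gives a smoother, less flexible boundary with lower variance but higher bias.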
Example: Figures 2.15 and 2.16 show KNN fits with different values of K.
Finding the best K: We want to choose K to minimize the test error rate. Techniques like cross-validation can help. We need to find the value of K that gives the best generalization performance. It’s like finding the optimal number of friends to ask for advice to get the most reliable answer.
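One way to pick K is sketched below, using 5-fold cross-validation on simulated data. The setup is an assumption for illustration: a single predictor, two overlapping classes drawn from normal distributions with made-up means, and a small hypothetical list of candidate K values (odd, to avoid tied votes).

```python
import random
from collections import Counter

def knn_classify(x0, train, k):
    """train is a list of (x, label) pairs with a single numeric predictor x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_error_rate(data, k, n_folds=5):
    """Average validation error rate of KNN over n_folds folds."""
    fold_errors = []
    for fold in range(n_folds):
        # Every n_folds-th observation is held out; the rest are used to train
        validation = data[fold::n_folds]
        training = [pair for i, pair in enumerate(data) if i % n_folds != fold]
        mistakes = sum(1 for x, label in validation
                       if knn_classify(x, training, k) != label)
        fold_errors.append(mistakes / len(validation))
    return sum(fold_errors) / n_folds

# Simulated, overlapping two-class data
rng = random.Random(3)
data = [(rng.gauss(0.0, 1.0), "blue") for _ in range(100)]
data += [(rng.gauss(2.0, 1.0), "orange") for _ in range(100)]
rng.shuffle(data)

candidate_ks = [1, 3, 5, 11, 21, 41]
cv_errors = {k: cv_error_rate(data, k) for k in candidate_ks}
best_k = min(cv_errors, key=cv_errors.get)

for k in candidate_ks:
    print(f"K={k:2d}  cross-validated error rate = {cv_errors[k]:.3f}")
print("chosen K:", best_k)
```

Each observation is used for validation exactly once, so the averaged error rate is an estimate of the test error rate built only from the training data.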
Let’s recap the key concepts we’ve covered.
Let’s think about some broader implications and questions.
Qiu Fei (Peter) 💌 [email protected]