Introduction to Statistical Learning

Peter (邱飞)

Zhejiang Wanli University

Introduction to Statistical Learning

Statistical learning is a vast set of tools for understanding data. These tools can be classified as supervised or unsupervised. 📊 It’s like having a toolbox filled with different instruments to analyze and interpret the information hidden within datasets.

Supervised vs. Unsupervised Learning

Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. We have a target variable we want to predict, like a teacher guiding the learning process. 🎯 Think of predicting house prices based on features like size, location, and number of bedrooms.
With unsupervised statistical learning, there are inputs but no supervising output; nevertheless, we can learn relationships and structure from such data. We’re exploring the data to find patterns, like a detective searching for clues without knowing exactly what they’re looking for. 🔍 An example is grouping customers into different segments based on their purchasing behavior.

Key Concepts

This chapter introduces many of the key concepts of statistical learning, focusing on the fundamental ideas, which include:

The relationship between data mining, machine learning, and statistical learning. ⛏️🤖📈
Estimating f: How do we find the best function to describe the relationship between our variables?
The trade-off between prediction accuracy and model interpretability: Can we have both a highly accurate and easily understandable model? 🤔
Supervised versus unsupervised learning: The difference between having a target variable and exploring the data without one.
Assessing model accuracy: How do we know if our model is good? We’ll look at concepts like mean squared error, the bias-variance trade-off, and the Bayes error rate.

Data Mining, Machine Learning, and Statistical Learning

Let’s clarify the relationship between these often-used terms. These are all related, but have slightly different focuses. 🧐 It’s like comparing different members of a family - they share common traits but also have unique characteristics.

Data Mining

Data Mining: The process of discovering patterns, anomalies, and insights from large datasets, often using computational techniques. It emphasizes finding any interesting pattern, even if it’s not directly related to a specific prediction task. Think of it as exploring a vast dataset to find hidden treasures. ⛏️💎 Imagine you’re sifting through a mountain of sand to find gold nuggets.
- Example: A supermarket analyzing purchase data to discover that people who buy diapers often also buy beer. This is a pattern, but not necessarily useful for prediction. This unexpected correlation could lead to strategic product placement in the store.

Machine Learning

Machine Learning: A field of computer science focused on algorithms that can learn from data without being explicitly programmed. It’s heavily focused on prediction: enabling computers to make accurate predictions on new, unseen data. Think of it as teaching a computer to learn from examples. 🤖📚 Imagine a robot learning to play a game by observing human players and trying different strategies.
- Example: A spam filter learning to identify spam emails based on the words used in the email body and subject line. The filter improves its accuracy over time as it “learns” from more examples of spam and non-spam emails.

Statistical Learning

Statistical Learning: A subfield of statistics that focuses on developing and applying statistical models and methods for prediction and inference. It emphasizes understanding the relationships between variables and making inferences about the underlying data-generating process. It combines the goals of understanding and prediction. 📈🔍 Think of a scientist using data to build a model of a physical phenomenon, both to predict its behavior and to understand the underlying mechanisms.
- Example: Building a model to predict a patient’s risk of heart disease based on their age, blood pressure, cholesterol levels, and other risk factors, and understanding how each of those factors contributes to the risk. This understanding can help doctors develop better prevention and treatment strategies.

Common Ground

All three fields focus on extracting information from data. They all aim to gain insights and/or make predictions based on data. It’s like different paths leading to the same destination: understanding and utilizing data.

Relationships Visualized

graph LR
    A[Data Mining] --> C(Common Ground: Extracting Information from Data)
    B[Machine Learning] --> C
    D[Statistical Learning] --> C
    C --> E[Insights & Predictions]

This diagram shows how data mining, machine learning, and statistical learning all share the common goal of extracting information from data, which leads to insights and predictions. The three fields overlap heavily but differ in emphasis.

The Advertising Example

To motivate our study, let’s consider a simple example. A company wants to understand how advertising spending affects product sales. 📈🛍️ Imagine you’re running a business and want to know how best to allocate your advertising budget.

Advertising Example: Data and Goal

Data: Sales of a product in 200 different markets, along with advertising budgets for TV, radio, and newspaper. This is like having a spreadsheet with sales figures and advertising spending for different regions.
Goal: Build a model to predict sales based on the advertising budgets in the three media. The company wants to find a formula that connects advertising spending to sales.

Advertising Example: Why?

Why? The company can’t directly control sales, but can control advertising spending. A good model helps them optimize their advertising budget to maximize sales. They want to know where to spend their money! 💰➡️📈 This is about getting the biggest bang for your buck – maximizing return on investment.

Input and Output Variables

Let’s define the key components in our statistical learning framework.

Input and Output Variables: Definitions

In the advertising example:

Input variables (X): Advertising budgets (TV, radio, newspaper). These are also called predictors, independent variables, or features. We often denote them as X₁, X₂, X₃, …, Xₚ. These are the things we can control or observe. 🛠️ These are like the ingredients in a recipe.
Output variable (Y): Sales. This is also called the response or dependent variable. We’re trying to predict or understand Y. This is the outcome we’re interested in. 🎯 This is like the final dish in a recipe.
We will use all these terms interchangeably.

Input and Output Variables: Example

Example:
- X₁ = TV budget
- X₂ = Radio budget
- X₃ = Newspaper budget
- Y = Sales

Here, we have three input variables (p=3) representing the advertising budgets for different media, and one output variable (sales). It’s like having three dials (advertising budgets) that we can adjust to try to control the outcome (sales).

Visualizing the Advertising Data

Let’s look at the relationship between advertising spending and sales.

Advertising Data: Visualization

Advertising Data: Sales vs. Advertising Budgets

Advertising Data: Interpretation (TV)

The leftmost plot shows sales versus TV advertising budget.
The blue line represents a simple linear model (least squares fit) to predict sales using TV advertising budget.
Observation: There’s a clear positive relationship. As TV advertising spending increases, sales tend to increase as well. This suggests that TV advertising is effective.

Advertising Data: Interpretation (Radio)

The center plot shows sales versus Radio advertising budget.
The blue line represents a simple linear model (least squares fit) to predict sales using Radio advertising budget.
Observation: There’s also a positive relationship, although perhaps slightly less strong than with TV advertising. Radio advertising also seems to be effective.

Advertising Data: Interpretation (Newspaper)

The rightmost plot shows sales versus Newspaper advertising budget.
The blue line represents a simple linear model (least squares fit) to predict sales using Newspaper advertising budget.
Observation: The relationship is less clear. It’s not obvious whether newspaper advertising has a strong positive or negative effect on sales. This might suggest that newspaper advertising is less effective, or that the relationship is more complex.

The General Model

More generally, we assume a relationship between the response Y and predictors X. This is the foundation of statistical learning.

The General Model: Equation

\[ Y = f(X) + \epsilon \]

Y: The quantitative response variable we want to predict. This is the outcome we’re interested in.
X: (X₁, X₂, …, Xₚ), a vector of p predictors. These are the factors we believe influence the outcome.
f(X): An unknown function representing the systematic relationship between X and Y. This is what we want to estimate! This is the underlying pattern we’re trying to uncover. It’s like the secret formula that connects our inputs to the output.
ε: A random error term, independent of X, with a mean of zero. It represents the variation in Y that cannot be explained by f(X). This acknowledges that our model won’t be perfect. It’s like the “noise” or randomness in the system.
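The model Y = f(X) + ε can be made concrete with a small simulation. This is only a sketch under stated assumptions: the "true" function `f`, the noise standard deviation of 1.5, and the range of X are all invented for illustration, since in practice f is unknown.

```python
import random
import statistics


def f(x):
    """Hypothetical 'true' systematic relationship (unknown in real problems)."""
    return 2.0 + 3.0 * x


rng = random.Random(0)  # fixed seed so the run is repeatable

# Draw predictor values, then add zero-mean noise that does not depend on x.
xs = [rng.uniform(0.0, 10.0) for _ in range(10_000)]
eps = [rng.gauss(0.0, 1.5) for _ in xs]      # epsilon: mean 0, sd 1.5 (assumed)
ys = [f(x) + e for x, e in zip(xs, eps)]     # Y = f(X) + epsilon

# The noise averages out to roughly zero, so the observations scatter around f.
mean_eps = statistics.fmean(eps)
sd_eps = statistics.pstdev(eps)
print(f"mean of epsilon: {mean_eps:.3f}, sd of epsilon: {sd_eps:.3f}")
```

Each simulated response deviates from f(x) by its own draw of ε, which is why two observations with the same x can have different y.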

The General Model: Goal

Goal of Statistical Learning: Estimate the unknown function f.

Our primary goal is to find the best possible estimate of the function f, which describes the relationship between our predictors and the response. It’s like trying to find the best possible approximation of the secret formula.

Understanding the Error Term (ε)

The error term, ε, is crucial. It represents the “noise” in our data.

The Error Term: Explanation

The error term, ε, captures all the factors that affect Y but are not included in our predictors X. This could include:

Unmeasured variables: Factors influencing Y that we didn’t or couldn’t measure. (e.g., competitor activity, overall economic conditions, customer mood).
Measurement error: Inaccuracies in how we measured X or Y. (e.g., a survey respondent misremembering their income, a faulty sensor recording temperature).
Randomness: Inherent variability in Y that can’t be perfectly predicted. (e.g., even with the same advertising spend, sales might fluctuate due to random chance, like a coin flip).

The Error Term: Importance

Note

The error term is crucial. It acknowledges that our models are approximations of reality. Even the “best” model won’t be perfect. It’s a reminder that there’s always some uncertainty. It’s like acknowledging that our map of the world is not the world itself.

Example: Income vs. Education

Let’s look at another example: predicting income based on education and seniority.

Income vs. Education: Visualization

Income vs. Years of Education and Seniority

Income vs. Education: Interpretation (Left)

Left: A 3D scatterplot of income (in thousands of dollars) versus years of education and years of seniority for 30 individuals. Each red point represents a person. This allows us to visualize the relationship between income and two predictors simultaneously.

Income vs. Education: Interpretation (Right)

Right: The true underlying relationship (blue surface), which is usually unknown (but known here because the data were simulated). The surface represents the average income for any given combination of education and seniority. This is like knowing the “true” formula for income.

Income vs. Education, Seniority: Visualization (2D Projection)

Income vs. Education: Interpretation (2D, Left)

Left: Observed income (in thousands of dollars) versus years of education for 30 individuals. Each red point represents a person. This is a 2D projection of the 3D data, showing only the relationship between income and education.

Income vs. Education: Interpretation (2D, Right)

Right: The true underlying relationship (blue curve), which is usually unknown (but known here because the data were simulated). The black line segments represent the error associated with each observation: the difference between the observed income and the true underlying relationship. This shows how individual incomes deviate from the average trend.

Income vs. Education: Overall Observation

Observation: More years of education and seniority generally lead to higher income, but there’s variation (the error). Not everyone with the same education and seniority level has the same income. This highlights the role of other factors and randomness in determining income.

Why Estimate f?

There are two main reasons to estimate f: Prediction and Inference. It’s like having two different goals when exploring a new city – you might want to find the fastest route to a specific destination (prediction), or you might want to understand the layout of the city and how different neighborhoods are connected (inference).

Why Estimate f: Prediction

Prediction: We want to predict Y given a set of X values. We don’t necessarily care about the exact form of f, just that it gives accurate predictions (treat f as a “black box”). We want the best possible guess for Y. 🔮 This is like using a GPS to find the best route – you don’t need to know how the GPS works internally, just that it gives you accurate directions.

\[ \hat{Y} = \hat{f}(X) \]
- Ŷ: The prediction of Y. Our best guess for the value of Y.
- f̂: Our estimate of f. The function we’ve learned from the data.

Why Estimate f: Inference

Inference: We want to understand the relationship between Y and X. We do care about the form of f. We want to answer questions about how the predictors influence the response. 🕵️‍♀️ This is like studying a map to understand how different roads are connected and how traffic flows in a city.
- Which predictors are associated with the response? Which factors are most important?
- Is the relationship positive or negative? Does increasing a predictor increase or decrease the response?
- Is the relationship linear or more complex? Is the relationship a straight line or a curve?

Prediction: Reducible and Irreducible Error

Prediction Error: Decomposition

The accuracy of our prediction, Ŷ, depends on two types of error:

Reducible Error: Error due to our estimate of f (f̂) not being perfect. We can reduce this error by choosing better statistical learning techniques, improving our model. 💪 This is like improving your driving skills to get to your destination faster.
Irreducible Error: Error due to the random error term, ε. Even if we knew the true f, we cannot predict ε. This sets a limit on how accurate our predictions can be. This is the inherent randomness we can’t eliminate. 🤷 This is like encountering unexpected traffic – you can’t eliminate it, no matter how good a driver you are.

\[E(Y - \hat{Y})^2 = \underbrace{[f(X) - \hat{f}(X)]^2}_{\text{Reducible}} + \underbrace{Var(\epsilon)}_{\text{Irreducible}}\]
- E(Y - Ŷ)²: The expected squared difference between the true value of Y and our prediction, treating both f̂ and X as fixed so that the only randomness comes from ε. This measures the average squared error.
- [f(X) - f̂(X)]²: The squared difference between the true function and our estimated function. This is the reducible error.
- Var(ε): The variance of the error term. This is the irreducible error.
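The decomposition can be checked numerically at a single input value. The sketch below assumes a made-up true function `f`, a made-up imperfect estimate `f_hat`, and a noise standard deviation of 1.5; none of these come from the examples in this chapter. It holds `f_hat` and `x0` fixed and averages the squared prediction error over many draws of ε.

```python
import random


def f(x):
    """Hypothetical true function."""
    return 2.0 + 3.0 * x


def f_hat(x):
    """Hypothetical imperfect estimate of f, held fixed throughout."""
    return 2.5 + 2.8 * x


x0 = 4.0       # the fixed input at which we predict
sigma = 1.5    # assumed standard deviation of epsilon
rng = random.Random(1)

# Monte Carlo estimate of E(Y - Y_hat)^2 at x0, averaging over epsilon only.
n = 200_000
total = 0.0
for _ in range(n):
    y = f(x0) + rng.gauss(0.0, sigma)
    total += (y - f_hat(x0)) ** 2
expected_sq_error = total / n

reducible = (f(x0) - f_hat(x0)) ** 2   # (14.0 - 13.7)^2, about 0.09
irreducible = sigma ** 2               # Var(epsilon) = 2.25

print(f"simulated: {expected_sq_error:.3f}")
print(f"reducible + irreducible: {reducible + irreducible:.3f}")
```

Improving `f_hat` shrinks only the first term. Even with `f_hat` equal to `f`, the average squared error would stay near Var(ε).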

Prediction Error: Goal

Note

Our goal is to minimize the reducible error.

We focus on reducing the reducible error because that’s the part we can control through better modeling. It’s like focusing on improving our driving skills, rather than worrying about unpredictable traffic.

Inference: Understanding the Relationship

When our goal is inference, we want to understand how Y changes as a function of X₁, …, Xₚ.

Inference: Key Questions

We’re interested in questions like:

Which predictors matter? Are all the Xᵢ related to Y, or only a subset? (e.g., Does newspaper advertising actually impact sales?) This is like figuring out which ingredients are essential for a recipe.
What’s the nature of the relationship? Is it positive, negative, linear, non-linear? (e.g., Does income increase linearly with education, or is there a diminishing return?) This is like understanding how the amount of each ingredient affects the taste of the dish.
Can we simplify the model? Can we get a good understanding with a simpler model (e.g., a linear model)? (e.g., Can we ignore some predictors without losing much accuracy?) This is like simplifying a recipe without sacrificing the flavor.

We care about interpretability, the form of f, and statistical significance.

Example: Modeling for Prediction

Let’s see an example where prediction is the primary goal.

Prediction Example: Direct Marketing

Scenario: A company wants to target a direct-marketing campaign to individuals likely to respond positively. They want to send their advertisements to the people most likely to buy their product.

Predictors (X): Demographic variables (age, income, location, etc.). These are like characteristics of potential customers.
Response (Y): Response to the campaign (positive or negative). Did the customer buy the product or not?
Goal: Accurately predict Y using X. The company doesn’t need to deeply understand why each predictor is related to the response, only that the prediction is accurate. They want to maximize the response rate to their campaign.

Prediction Example: Black Box

Note

This is a classic prediction problem. The model is a “black box”.

We don’t necessarily care why certain people respond, just that they respond. It’s like knowing that a certain machine produces good results, without knowing exactly how it works internally.

Example: Modeling for Inference

Now, let’s see an example where inference is the primary goal.

Inference Example: Advertising Data

Scenario: Analyze the Advertising data (Figure 2.1). We want to understand how different types of advertising affect sales.

Predictors (X): TV, radio, and newspaper advertising budgets. These are the different ways the company spends money on advertising.
Response (Y): Sales. The outcome the company wants to improve.
Goal: Understand how each advertising medium affects sales.

Inference Example: Questions

Questions to answer:

Which media are associated with sales? Which types of advertising are most effective?
Which media generate the biggest boost in sales? Where should the company invest most of its advertising budget?
How large is the effect of TV advertising on sales? How much more can we expect to sell for every dollar spent on TV advertising?

We want to understand how each advertising medium is associated with sales, not just predict sales.

Inference Example: Understanding

Note

This is an inference problem. We want to understand the relationships.

We care about the why, not just the prediction. It’s like trying to understand why a certain medicine works, not just that it works.

How Do We Estimate f?

We use training data to “teach” our statistical learning method how to estimate f. It’s like learning from examples.

Estimating f: Training Data

Training data: A set of observed data points: {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}, where:
- xᵢ = (xᵢ₁, xᵢ₂, …, xᵢₚ)ᵀ is the vector of predictor values for the ith observation. This is like a set of measurements for each individual or item in our dataset.
- yᵢ is the response value for the ith observation. This is the outcome we observe for each individual or item.
Goal: Find a function, f̂, such that Y ≈ f̂(X) for any observation (X, Y). We want our estimated function to be close to the true function for all possible data points. We want our model to generalize well to new data.
Two broad approaches: Parametric and non-parametric methods. These are like two different strategies for learning: one assumes a specific form for f, the other does not.

Parametric Methods

A two-step, model-based approach. We make an assumption about the shape of f. It’s like assuming a specific recipe for a dish.

Parametric Methods: Steps

Step 1: Assume a functional form for f. For example, assume f is linear:

\[ f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \]

This reduces the problem to estimating the p + 1 coefficients (β₀, β₁, …, βₚ) rather than an entirely arbitrary function. It’s like assuming the dish can be made with a specific set of ingredients and specific proportions.
- β₀ is the intercept (the expected value of Y when all X’s are zero).
- β₁, β₂, …, βₚ are the slopes (βⱼ is the average change in Y for a one-unit increase in Xⱼ, holding the other predictors fixed).
Step 2: Use the training data to fit, or train, the model: find the values of the parameters (β₀, β₁, …, βₚ) that best fit the data. A common method is (ordinary) least squares. It’s like adjusting the proportions of the ingredients to get the best taste, based on tasting (the training data).
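The two-step parametric approach can be sketched for the simplest case of a single predictor (p = 1), where ordinary least squares has a closed-form solution. The training values below are made up for illustration, and `fit_simple_least_squares` is a hypothetical helper name, not a library function.

```python
def fit_simple_least_squares(xs, ys):
    """Return (beta0_hat, beta1_hat) minimizing the sum of squared residuals
    for the assumed form f(X) = beta0 + beta1 * X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = sxy / sxx                 # slope
    beta0 = y_bar - beta1 * x_bar     # intercept
    return beta0, beta1


# Made-up training data that is roughly linear.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_simple_least_squares(xs, ys)
print(f"intercept: {beta0:.2f}, slope: {beta1:.2f}")


def f_hat(x):
    """The fitted model: our estimate of f."""
    return beta0 + beta1 * x
```

Once the form is assumed, the whole estimate of f is captured by two numbers. With p predictors the same idea applies, but the fit is usually computed with matrix methods rather than this closed form.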

Parametric Methods: Simplification

Note

Parametric methods simplify the problem by assuming a specific form for f.

This simplification makes the problem much easier to solve. It’s like having a recipe – it makes cooking easier, but it limits the possible variations of the dish.

Example: Linear Model Fit to Income Data

Let’s see how a linear model fits the income data.

Linear Model Fit: Visualization

Linear Model Fit: Explanation

A linear model (yellow plane) fit to the Income data (Figure 2.3).
Red dots are the observed data points.
The model assumes: income ≈ β₀ + β₁ × education + β₂ × seniority.
The yellow plane represents the prediction from the linear model. It’s the best-fitting plane we can find within the constraint of a linear relationship.

Linear Model Fit: Explanation (Cont’d)

A linear model is relatively inflexible because it can generate only linear functions. It can’t capture curves or more complex patterns.

Parametric Methods: Advantages and Disadvantages

Let’s weigh the pros and cons of parametric methods.

Parametric Methods: Pros and Cons

Advantage: Simplifies the problem of estimating f. It’s easier to estimate a few parameters than an entirely arbitrary function. Computationally efficient. It’s like having a recipe – it’s easier to follow than to invent a dish from scratch.
Disadvantage: The assumed form of f might be wrong. If the true f is very different from our assumed form, our estimate will be poor. The model might be too simple to capture the true relationship. It’s like trying to make a cake using a cookie recipe – it won’t work very well.
Overfitting: We can reduce the risk of a wrong form by choosing a more flexible model, but a more flexible model generally requires estimating more parameters and can overfit the data. This means the model follows the noise (random error) too closely, resulting in poor predictions on new data. It’s like memorizing the training data instead of learning the underlying pattern.

Non-parametric Methods

Don’t make assumptions about the shape of f. Let the data speak for itself! It’s like cooking without a recipe – you rely on your senses and experience.

Non-parametric Methods: Definition

Do not make explicit assumptions about the functional form of f. We don’t assume a specific equation for the relationship.
Seek an estimate of f that gets as close to the data points as possible, without being too rough or wiggly. Try to find a smooth curve that fits the data well. It’s like trying to draw a smooth curve through a set of points.
Advantage: Can accurately fit a wider range of possible shapes for f. Avoids the risk of making a wrong assumption about the form of f. More flexible and can capture more complex relationships. It’s like being able to cook any dish, not just those with a recipe.
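Disadvantage: Requires a very large number of observations (far more than a parametric approach typically needs) to get an accurate estimate of f, because the problem is not reduced to estimating a small number of parameters. Can be computationally expensive. It’s like needing a lot of experience to cook without a recipe.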
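To make the non-parametric idea concrete, here is a minimal sketch using k-nearest-neighbors regression. Note that this is a swapped-in technique chosen for brevity; it is not the thin-plate spline used in the figures. The data and the helper name `knn_predict` are made up for illustration.

```python
def knn_predict(train_x, train_y, x0, k):
    """Predict at x0 by averaging the responses of the k nearest training points.
    No equation for f is assumed; the data alone determine the prediction."""
    pairs = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x0))
    nearest = pairs[:k]
    return sum(y for _, y in nearest) / k


# Made-up training data following a curved pattern the method is never told about.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [1.0, 4.0, 9.0, 16.0, 25.0, 36.0]

# The two nearest points to 3.4 are x = 3 and x = 4, so the prediction averages 9 and 16.
pred = knn_predict(train_x, train_y, x0=3.4, k=2)
print(pred)
```

The choice of `k` controls smoothness: a small `k` follows the data closely (rougher fit), while a larger `k` averages over more points (smoother fit).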

Example: Thin-Plate Spline Fit to Income Data

Let’s see a non-parametric method in action.

Thin-Plate Spline: Visualization

Thin-Plate Spline: Explanation

A thin-plate spline (yellow surface) fit to the Income data. A thin-plate spline is a flexible method that can fit a wide variety of shapes.
This is a non-parametric method. No pre-specified model is assumed. The shape of the surface is determined entirely by the data.

Thin-Plate Spline: Explanation (Cont’d)

The fit is much closer to the true f (Figure 2.3) than the linear fit. It captures the non-linear relationship between income, education, and seniority more accurately.
This is a smooth fit. It captures the general trend without being too wiggly. It’s not overly sensitive to individual data points.

Example: Overfitting with a Rough Spline

Let’s see what happens when we make the non-parametric model too flexible.

Rough Spline: Visualization

Rough Spline: Explanation

Same data, but a rougher thin-plate spline fit. This spline is more flexible than the previous one.
This fit perfectly matches the training data (zero error on training data!). It goes through every single data point.

Rough Spline: Explanation (Cont’d)

BUT: This is an example of overfitting. The fit is too wiggly and will likely perform poorly on new data. It has captured the noise, not just the underlying pattern. It’s learned the training data too well. It’s like memorizing the answers to a specific set of questions, rather than understanding the concepts.

The Trade-Off Between Prediction Accuracy and Model Interpretability

There’s a fundamental trade-off in statistical learning: accuracy vs. interpretability. It’s like choosing between a powerful but complex tool and a simple but easy-to-use tool.

Flexibility and Interpretability

Flexibility: How many different shapes of functions can the method fit?
- Less flexible (restrictive): Linear regression (only linear functions). Simpler models. Like a simple tool that can only do one thing.
- More flexible: Thin-plate splines, neural networks. More complex models. Like a multi-purpose tool that can do many things.
Interpretability: How easy is it to understand the fitted model?
- More interpretable: Linear regression (easy to understand coefficients). We can easily see how each predictor affects the response. Like a simple tool with clear instructions.
- Less interpretable: Complex, non-linear models (hard to see how each predictor affects the response). “Black box” models. Like a complex machine with no explanation of how it works.
General rule: As flexibility increases, interpretability decreases.

The Trade-Off: Visualization

Flexibility	Interpretability
Low (e.g., Linear Regression)	High
High (e.g., Neural Networks)	Low

Important Trade-off: We often have to choose between more flexible models, which can be more accurate but are harder to interpret, and simpler, more interpretable models. We can’t always have both! It’s like choosing between a detailed but confusing map and a simplified but easy-to-read map.

Why Choose a More Restrictive Method?

Why might we choose a simpler model, even if it’s less flexible?

Restrictive Methods: Advantages

Even if we only care about prediction, a more restrictive model (like linear regression) can sometimes outperform a more flexible model!

Reasons:

Inference: If we’re interested in understanding the relationship, restrictive models are more interpretable. Easier to explain to stakeholders. It’s like using a simple model that everyone can understand.
Overfitting: Flexible models can overfit the training data, leading to poor predictions on new data. A simpler model might generalize better. Less likely to be fooled by noise. It’s like using a more robust model that’s less sensitive to quirks in the data.
Curse of Dimensionality: With many predictors, flexible models can be hard to fit well and require huge amounts of data. Simpler models are more robust when data is limited. It’s like using a simpler model when you don’t have a lot of information.
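A short calculation hints at why flexible, neighborhood-based methods need so much data when there are many predictors. It assumes, purely for illustration, that observations are spread uniformly over the unit cube [0, 1]^p, and it asks how wide a sub-cube must be to contain 10% of them. The function name `edge_length` is hypothetical.

```python
def edge_length(fraction, p):
    """Side length of a sub-cube that holds `fraction` of points spread
    uniformly over the unit cube in p dimensions (volume = side ** p)."""
    return fraction ** (1.0 / p)


for p in (1, 2, 10, 100):
    print(f"p = {p:3d}: side length needed = {edge_length(0.10, p):.3f}")
```

With one predictor, a "local" neighborhood holding 10% of the data spans 10% of the range. With ten predictors it must span roughly 79% of the range of every predictor, so it is no longer local, and local averaging loses its meaning unless far more data are available.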

Supervised vs. Unsupervised Learning

Let’s revisit the distinction between supervised and unsupervised learning.

Supervised Learning: Definition

Supervised Learning: We have both predictors (X) and a response (Y) for each observation. We want to learn the relationship between X and Y. We have a “teacher” (the response variable) guiding the learning process. It’s like learning with a teacher who provides feedback.
- Examples: Regression, classification. Predicting a numerical value or a category.
- Most of the methods in this book are supervised.

Unsupervised Learning: Definition

Unsupervised Learning: We only have predictors (X), no response (Y). We want to find patterns and structure in the data. We’re exploring the data without a specific target in mind. It’s like exploring a new city without a map or a destination.
- Example: Cluster analysis (grouping observations into clusters). Finding groups of similar observations.

Semi-supervised Learning

Semi-supervised Learning: A mix. We have (X, Y) for some observations, but only X for others. We have some labeled data and some unlabeled data. It’s like learning with a teacher who provides some feedback, but also encourages independent exploration.

Supervised vs. Unsupervised: Clear Distinction?

Note

The distinction between supervised and unsupervised learning isn’t always clear-cut.

Some methods can be used in both supervised and unsupervised settings. It’s like having a tool that can be used for different purposes.

Example: Cluster Analysis

Let’s look at an example of unsupervised learning: cluster analysis.

Cluster Analysis: Visualization

Cluster Analysis: Explanation (Left)

Left: 150 observations, two variables (X₁, X₂).
Three well-separated groups (clusters). Clustering should easily identify these. The groups are distinct and easy to separate. It’s like having three clearly separated piles of different objects.

Cluster Analysis: Explanation (Right)

Right: Overlapping groups. Clustering is much harder. The groups are mixed together, making it difficult to find clear boundaries. It’s like having piles of objects that are mixed together.

Cluster Analysis: Explanation (Goal)

Goal: Identify distinct groups without knowing the group labels beforehand. We’re trying to find hidden structure in the data. It’s like trying to sort objects into groups without knowing what the groups should be.
In the examples shown, there are only two variables, and we can check the scatterplots to identify clusters. But in practice, we often have many more variables, making visual inspection impossible. We need to use clustering and other unsupervised learning approaches.
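As a rough sketch of how a clustering method can recover groups without labels, the code below runs a plain k-means (Lloyd's algorithm) on simulated data. Everything here is an assumption made for illustration: the three group centers, the spread, the deterministic farthest-first starting rule, and the helper name `k_means`. It mirrors the well-separated case with 150 observations and two variables, but it is not the data from the figure.

```python
import math
import random


def k_means(points, k, n_iter=20):
    """Plain Lloyd's algorithm with a deterministic farthest-first start."""
    # Start from the first point, then repeatedly add the point farthest
    # from all centers chosen so far.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )

    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave a center in place if it has no members
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers


# Simulate 150 observations in three well-separated groups (assumed centers).
rng = random.Random(42)
true_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 10.0)]
points = [
    (cx + rng.gauss(0.0, 0.5), cy + rng.gauss(0.0, 0.5))
    for cx, cy in true_centers
    for _ in range(50)
]

labels, centers = k_means(points, k=3)
cluster_sizes = sorted(labels.count(j) for j in range(3))
print(cluster_sizes)
```

The algorithm never sees which group generated each point; it only uses distances between observations. With overlapping groups, the same procedure would still return three clusters, but they would match the true groups much less reliably.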

Regression vs. Classification Problems

Within supervised learning, we have two main types of problems: regression and classification. It’s like having two different types of questions – one asking “how much?” and the other asking “which one?”.

Regression: Definition

Regression: The response variable (Y) is quantitative (numerical).
- Example: Predicting income, house price, stock return, temperature, age. The response can take on a continuous range of values. It’s like asking “how much?” or “how many?”.

Classification: Definition

Classification: The response variable (Y) is qualitative (categorical).
- Example: Predicting whether someone will default on a loan (yes/no), which brand of product they’ll buy (A/B/C), a medical diagnosis (disease 1/disease 2/no disease), or whether an email is spam (spam/not spam). The response belongs to one of a set of categories. It’s like asking “which one?” or “what type?”.
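The difference from regression shows up in what a method returns: a category rather than a number. The sketch below uses a k-nearest-neighbors classifier with a majority vote, chosen only because it is short. The loan data, the single predictor, and the helper name `knn_classify` are invented for illustration.

```python
from collections import Counter


def knn_classify(train, x0, k):
    """Return the most common label among the k training points nearest to x0."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x0))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up training data: (loan balance in thousands, defaulted?)
train = [
    (0.2, "no"), (0.5, "no"), (0.9, "no"),
    (1.4, "yes"), (1.8, "yes"), (2.1, "yes"),
]

label_high = knn_classify(train, x0=1.6, k=3)   # neighbors: 1.4, 1.8, 2.1
label_low = knn_classify(train, x0=0.4, k=3)    # neighbors: 0.5, 0.2, 0.9
print(label_high, label_low)
```

Averaging the neighbors' responses, as in regression, would make no sense for labels like "yes" and "no"; voting is the qualitative counterpart.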

Regression vs. Classification: Overlap

Note

Some methods are better suited to regression, others to classification. But many methods can be used for both.
Whether the predictors are quantitative or qualitative is usually less important than the type of response.

The type of response variable (quantitative or qualitative) is the key distinction. The type of answer you’re looking for determines the type of question you ask.

Assessing Model Accuracy: Regression

How do we measure how well our model performs in a regression setting? How do we know if our predictions are good?

Model Accuracy: Regression - MSE

Goal: Quantify how well our predictions match the observed data. We want our predictions to be close to the true values. It’s like measuring how close our darts are to the bullseye.
Mean Squared Error (MSE): A common measure in regression:

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{f}(x_i))^2 \]
- yᵢ: The true response value for the ith observation. The actual value.
- f̂(xᵢ): The predicted response value for the ith observation. Our model’s prediction.
- Lower MSE is better (closer predictions). We want the average squared difference between our predictions and the true values to be small. It’s like wanting the average distance of our darts from the bullseye to be small.
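
As a minimal sketch, the MSE formula above can be computed directly in plain Python. The function name and the numbers are hypothetical and chosen only for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between observed and predicted responses."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one observation")
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observed responses y_i and model predictions f-hat(x_i).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]

mse = mean_squared_error(y_true, y_pred)
print(mse)  # squared errors are 1, 0, 1, 4, so the average is 1.5
```

Because the errors are squared, the one prediction that is off by 2 contributes as much as four predictions that are each off by 1.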

Training MSE vs. Test MSE

We need to distinguish between how well our model fits the data it was trained on and how well it generalizes to new data.

Training vs. Test MSE

We usually don’t care how well the model fits the training data. We care about how well it predicts new data (the test data). It’s like caring about how well a student performs on a real exam, not just on practice questions.
A model with low training MSE might have high test MSE (overfitting!). It might be memorizing the training data instead of learning the underlying pattern. It’s like a student who memorizes the answers to practice questions but doesn’t understand the concepts.
Training MSE: Calculated using the training data. How well the model fits the data it was trained on.
Test MSE: Calculated using new, unseen data (test data). This is what we really care about! This measures how well our model will perform in the real world. It’s like evaluating the model on data it has never seen before.
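
The gap between training and test MSE can be seen in a small simulation. This is only a sketch under stated assumptions: the data come from an assumed process y = 2x + noise, and the two models (a least-squares line and a "memorizer" that copies the response of the closest training point) are illustrative choices, not methods prescribed by these slides.

```python
import random


def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def fit_memorizer(xs, ys):
    """Predicts the response of the single closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate(n, rng):
    """Assumed data-generating process: y = 2x + Gaussian noise (sd 1)."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(0)
x_train, y_train = simulate(50, rng)
x_test, y_test = simulate(200, rng)

results = {}
for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(x_train, y_train)
    results[name] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"{name:10s} train MSE = {results[name][0]:.3f}  "
          f"test MSE = {results[name][1]:.3f}")
```

The memorizer reproduces every training response exactly, so its training MSE is zero, yet on new data it pays for having memorized the noise. The line has a nonzero training MSE but, in a setting like this one, will typically generalize better.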

Training vs. Test MSE: Ideal Scenario

Ideally: We’d choose the model with the lowest test MSE.
Problem: We often don’t have test data when building the model.
Solution: Techniques like cross-validation (Chapter 5) can help us estimate the test MSE using the training data. It’s like simulating a real exam using the practice questions.
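
The idea behind cross-validation can be sketched as follows: split the training data into k folds, hold each fold out in turn, fit on the rest, and average the held-out MSE. This is a simplified illustration, assuming a least-squares line as the model and simulated data; the helper names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Least-squares straight line; returns a prediction function."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: y_bar + slope * (x - x_bar)


def k_fold_mse(xs, ys, k=5, seed=0):
    """Estimate test MSE by averaging the held-out MSE over k folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint, roughly equal folds
    fold_mses = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - model(xs[i])) ** 2 for i in held_out]
        fold_mses.append(sum(errors) / len(errors))
    return sum(fold_mses) / k


# Assumed data: y = 2x + Gaussian noise with standard deviation 1.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
ys = [2 * x + rng.gauss(0, 1) for x in xs]

cv_estimate = k_fold_mse(xs, ys, k=5)
print(f"5-fold CV estimate of test MSE: {cv_estimate:.3f}")
```

Every observation is used for evaluation exactly once, and never by a model that saw it during fitting, which is what makes the average a usable stand-in for the test MSE.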

Example: Training and Test MSE vs. Flexibility

Let’s see how training and test MSE change as we vary model flexibility.

MSE vs. Flexibility: Visualization

MSE vs. Flexibility: Explanation (Left)

Left: Data simulated from a non-linear f (black curve). The true relationship is a curve.
- Three fits:
  - linear (orange, a straight line, less flexible),
  - smooth spline (blue, a moderately flexible curve),
  - wiggly spline (green, a very flexible curve).

MSE vs. Flexibility: Explanation (Right)

Right:
- Training MSE (grey curve): Decreases as flexibility increases. More flexible models fit the training data better. The green curve has the lowest training MSE.
- Test MSE (red curve): U-shaped. Decreases, then increases (overfitting). Too much flexibility leads to poor generalization. The blue curve has the lowest test MSE.
- Dashed line: Minimum possible test MSE (irreducible error). Even the best possible model can’t achieve zero test MSE.

MSE vs. Flexibility: Explanation (Observation)

Observation: The blue curve (moderate flexibility) has the lowest test MSE. This is the “sweet spot”. It’s like Goldilocks finding the porridge that’s “just right”.
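
A similar picture can be reproduced with a small simulation. The sketch below does not recreate the figure: it assumes a true function f(x) = sin(2πx) and uses a piecewise-constant fit whose flexibility is the number of bins, purely as an easy-to-code stand-in for the splines described above.

```python
import math
import random


def fit_bins(xs, ys, n_bins):
    """Piecewise-constant fit on [0, 1): predict the mean training response per bin."""
    overall = sum(ys) / len(ys)
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    # Bins with no training points fall back to the overall mean.
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate(n, rng):
    """Assumed truth: f(x) = sin(2*pi*x), plus Gaussian noise with sd 0.3."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(1)
x_train, y_train = simulate(100, rng)
x_test, y_test = simulate(1000, rng)

bin_counts = [1, 2, 4, 8, 16, 32, 64]  # more bins = more flexible
train_curve, test_curve = [], []
for n_bins in bin_counts:
    model = fit_bins(x_train, y_train, n_bins)
    train_curve.append(mse(model, x_train, y_train))
    test_curve.append(mse(model, x_test, y_test))
    print(f"bins = {n_bins:2d}  train MSE = {train_curve[-1]:.3f}  "
          f"test MSE = {test_curve[-1]:.3f}")
```

Because each bin count refines the previous one, the training MSE can only go down as bins are added. The test MSE has no such guarantee, and in runs like this it typically falls at first and then rises again once the bins start fitting noise.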

The Bias-Variance Trade-Off

The U-shape in the test MSE curve is due to two competing properties: bias and variance. It’s like a seesaw – as one goes up, the other goes down.

Bias and Variance: Definitions

The two properties are defined as follows:

Variance: How much would our estimate of f (f̂) change if we used a different training set?
- High variance: f̂ changes a lot with different training sets (typical of flexible models). The model is sensitive to the specific training data. It’s like a dart player who is inconsistent – their throws vary a lot.
- Low variance: f̂ is relatively stable (typical of less flexible models). The model is less sensitive to the specific training data. It’s like a dart player who is consistent – their throws are always close together.
Bias: The error introduced by approximating a complex real-world problem with a simpler model.
- High bias: The model makes strong (and possibly wrong) assumptions about f (typical of less flexible models). The model is too simple to capture the true relationship. It’s like trying to fit a square peg in a round hole.
- Low bias: The model makes fewer assumptions (typical of flexible models). The model is flexible enough to capture the true relationship. It’s like using a moldable material that can fit any shape.

Bias-Variance Tradeoff: Decomposition of Test MSE

The expected test MSE at x₀ can be decomposed as: \[ E\left(y_0 - \hat{f}(x_0)\right)^2 = \text{Var}(\hat{f}(x_0)) + [\text{Bias}(\hat{f}(x_0))]^2 + \text{Var}(\epsilon) \]
This equation shows that the expected test MSE is the sum of the variance of our estimate, the squared bias of our estimate, and the variance of the error term. It’s like saying the total error is the sum of errors due to inconsistency, errors due to oversimplification, and errors due to randomness.
To minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias.
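
The variance and squared-bias terms can be estimated by simulation: draw many training sets, refit, and look at how the prediction at a fixed point x₀ behaves. The sketch below assumes a true function f(x) = x², Gaussian noise with standard deviation 0.5, and two illustrative predictors (the overall mean and the closest training point). All of these are assumptions made for the example.

```python
import random


def f(x):
    """Assumed true regression function for this illustration."""
    return x * x


def draw_training_set(n, rng, noise_sd=0.5):
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys


def predict_mean(xs, ys, x0):
    """Inflexible: ignores x0 and predicts the average training response."""
    return sum(ys) / len(ys)


def predict_1nn(xs, ys, x0):
    """Flexible: predicts the response of the closest training point."""
    return min(zip(xs, ys), key=lambda p: abs(p[0] - x0))[1]


def bias_variance(predict, x0, n_sets=2000, n=30, seed=0):
    """Monte Carlo estimate of the squared bias and variance of f-hat(x0)."""
    rng = random.Random(seed)
    preds = [predict(*draw_training_set(n, rng), x0) for _ in range(n_sets)]
    avg = sum(preds) / n_sets
    variance = sum((p - avg) ** 2 for p in preds) / n_sets
    squared_bias = (avg - f(x0)) ** 2
    return squared_bias, variance


x0 = 0.9
noise_var = 0.5 ** 2  # Var(epsilon), the irreducible error
for name, predict in [("mean", predict_mean), ("1-NN", predict_1nn)]:
    b2, var = bias_variance(predict, x0)
    print(f"{name:5s} bias^2 = {b2:.3f}  variance = {var:.3f}  "
          f"expected test MSE ~ {b2 + var + noise_var:.3f}")
```

The mean predictor barely changes from one training set to the next (low variance) but is systematically off at x₀ = 0.9 (high bias). The nearest-point predictor is close to unbiased but inherits the full noise of a single observation (high variance).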

Bias-Variance Trade-Off: Illustration

Let’s visualize the bias-variance trade-off.

Bias-Variance Trade-Off: Visualization

Bias-Variance Trade-Off: Explanation (Left)

Squared bias (blue), variance (orange), irreducible error (dashed), and test MSE (red) for the three examples in Figure 2.9.
Left (non-linear f): As flexibility increases, bias decreases rapidly at first while variance increases only slowly, so the test MSE initially drops sharply.

Bias-Variance Trade-Off: Explanation (Center)

Center (nearly linear f): When the true relationship is close to linear, even simple models have low bias, so added flexibility reduces bias only slightly while variance increases quickly.

Bias-Variance Trade-Off: Explanation (Right)

Right (very non-linear f): Bias decreases dramatically as flexibility increases, while variance increases very little. When the true relationship is highly complex, flexible models are needed to reduce bias, and the increase in variance is less of a concern.

Bias-Variance Trade-Off: Key Takeaway

Key takeaway: Good models need both low variance and low bias, but reducing one usually increases the other. This is the trade-off! We need to find the right balance between bias and variance. It’s like finding the sweet spot on a seesaw.

Assessing Model Accuracy: Classification

How do we measure model performance in a classification setting? How do we know if our classifier is making good predictions?

Model Accuracy: Classification - Error Rate

Error Rate: The proportion of mistakes made by the classifier. We count the number of times our model predicts the wrong class. It’s like counting the number of wrong answers on a test.
- Training error rate:
  
  \[ \frac{1}{n} \sum_{i=1}^{n} I(y_i \ne \hat{y}_i) \]
  - yᵢ: True class label. The correct category.
  - ŷᵢ: Predicted class label. Our model’s prediction.
  - I(yᵢ ≠ ŷᵢ): Indicator variable (1 if mistake, 0 if correct). A function that is 1 if the prediction is wrong and 0 if it’s correct.
- Test error rate: Ave(I(y₀ ≠ ŷ₀)). This is what we care about! The average error rate on new, unseen data. It’s like the average error rate on a real exam.
Goal: Choose the classifier with the lowest test error rate.
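
The error rate is simply the average of the indicator I(yᵢ ≠ ŷᵢ). A minimal sketch, with hypothetical labels and predictions:

```python
def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one observation")
    mistakes = sum(1 for y, p in zip(y_true, y_pred) if y != p)
    return mistakes / len(y_true)


# Hypothetical true labels and classifier predictions.
y_true = ["spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "ham"]

rate = error_rate(y_true, y_pred)
print(rate)  # 2 mistakes out of 5 observations = 0.4
```

The same function gives the training error rate when applied to the training data and an estimate of the test error rate when applied to held-out data.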

The Bayes Classifier

The Bayes classifier is the theoretical “best” classifier. It’s like the ideal student who always gets the right answer.

Bayes Classifier: Definition

The “ideal” classifier: Assigns each observation to the most likely class, given its predictor values. It makes the best possible prediction based on the true probabilities. It’s like knowing the exact probability of each answer being correct and always choosing the most probable one.
Conditional probability: Pr(Y = j | X = x₀) - the probability that Y = j (class j), given the predictor values x₀. The probability of belonging to a specific category, given the observed features.
Bayes Classifier: Assigns an observation to the class j for which Pr(Y = j | X = x₀) is largest. Choose the class with the highest probability.
Bayes Decision Boundary: The points where the conditional probabilities for different classes are equal. This is the boundary between where we would predict different classes. It’s like the dividing line between different territories.
Bayes Error Rate: The lowest possible test error rate achievable. Analogous to the irreducible error. This is the best we can possibly do, even with perfect knowledge. It’s like the minimum possible error rate, even for the ideal student.
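
When the true conditional probabilities are known, both the Bayes classifier and the Bayes error rate can be computed exactly. The overall Bayes error rate is 1 − E(max_j Pr(Y = j | X)), where the expectation averages over the possible values of X. The toy distribution below is entirely hypothetical: X takes only three values so that the expectation becomes a short sum.

```python
# Hypothetical, fully known distribution.
p_x = {"low": 0.5, "medium": 0.3, "high": 0.2}  # Pr(X = x)
p_y_given_x = {                                  # Pr(Y = j | X = x)
    "low": {"orange": 0.9, "blue": 0.1},
    "medium": {"orange": 0.4, "blue": 0.6},
    "high": {"orange": 0.2, "blue": 0.8},
}


def bayes_classify(x):
    """Assign x to the class with the largest conditional probability."""
    probs = p_y_given_x[x]
    return max(probs, key=probs.get)


def bayes_error_rate():
    """1 - E[max_j Pr(Y = j | X)], averaging over the distribution of X."""
    return sum(p_x[x] * (1 - max(p_y_given_x[x].values())) for x in p_x)


for x in p_x:
    print(x, "->", bayes_classify(x))
print(f"Bayes error rate: {bayes_error_rate():.2f}")
# 0.5 * 0.1 + 0.3 * 0.4 + 0.2 * 0.2 = 0.21
```

Even with perfect knowledge of the probabilities, the classifier is wrong 21% of the time here, because no value of X determines the class with certainty.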

Example: Bayes Classifier

Let’s visualize the Bayes classifier.

Bayes Classifier: Visualization

Bayes Classifier: Explanation

Simulated data, two classes (orange, blue).
Purple dashed line: Bayes decision boundary. This is the line where the probability of belonging to the orange class is equal to the probability of belonging to the blue class.

Bayes Classifier: Explanation (Cont’d)

Orange/blue shaded regions: Regions where the Bayes classifier would predict orange/blue.
The Bayes error rate is greater than zero because the classes overlap. Even the best classifier will make mistakes because the classes are not perfectly separable. It’s like having some questions on a test that are ambiguous, even for the best student.

K-Nearest Neighbors (KNN)

KNN is a practical method that approximates the Bayes classifier. It’s like a practical student who tries to learn from their peers.

KNN: Motivation

Problem: In reality, we don’t know the conditional distribution of Y given X. So, we can’t directly use the Bayes classifier. We don’t know the true probabilities. It’s like not knowing the exact probabilities of each answer being correct.
KNN: A non-parametric method that estimates the conditional distribution and then classifies based on the estimate. It uses the training data to approximate the probabilities. It’s like looking at similar past exam questions and their answers to guess the answer to a new question.

KNN: Algorithm

How it works:
1. Given a test observation, x₀, find the K closest training observations (the “neighborhood”). Find the K most similar examples in the training data.
2. Estimate the conditional probability for class j as the fraction of neighbors in the neighborhood whose response value is j. Calculate the proportion of neighbors belonging to each class.
3. Classify x₀ to the class with the highest estimated probability. Choose the class that is most common among the neighbors.
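
The three steps above can be sketched in a few lines of Python. This is an illustrative implementation, not a reference one: it assumes Euclidean distance as the notion of "closest", and the small dataset is hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x0, k):
    """Return (predicted class, estimated class probabilities) for x0."""
    # Step 1: the k training observations closest to x0 (Euclidean distance).
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x0))
    neighbors = [train_y[i] for i in order[:k]]
    # Step 2: estimated Pr(Y = j | X = x0) = fraction of neighbors in class j.
    counts = Counter(neighbors)
    probs = {label: c / k for label, c in counts.items()}
    # Step 3: the class with the highest estimated probability.
    return max(probs, key=probs.get), probs


# Hypothetical training data with two predictors and two classes.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (5.0, 5.0), (5.5, 4.5), (2.5, 2.0)]
train_y = ["blue", "blue", "blue", "orange", "orange", "orange"]

label, probs = knn_predict(train_X, train_y, x0=(2.0, 1.8), k=3)
print(label, probs)  # the 3 nearest neighbors are 2 blue and 1 orange
```

With K = 3, two of the three nearest neighbors are blue, so the estimated probabilities are 2/3 for blue and 1/3 for orange, and the point is classified as blue. Ties in the vote are not handled specially in this sketch; an odd K avoids them when there are two classes.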

Example: KNN

Let’s visualize how KNN works.

KNN: Explanation (Left)

Left: Small training set (6 blue, 6 orange). Black cross is the test observation. We want to predict the class of the black cross.
Circle shows the 3 nearest neighbors (K=3): 2 blue, 1 orange. We find the three closest training observations to the black cross.
KNN predicts “blue”. Because the majority of the neighbors are blue.

KNN: Explanation (Right)

Right: KNN decision boundary (K=3) for all possible values of X₁ and X₂. This shows how KNN would classify any point in the space. This is like drawing a map showing which class KNN would predict for any combination of X₁ and X₂.
KNN can produce a decision boundary, and hence a classifier, that is close to the Bayes classifier.

KNN: The Choice of K

The choice of K (the number of neighbors) is crucial in KNN. It’s like choosing how many friends to ask for advice.

KNN: Choosing K

The choice of K (the number of neighbors) controls the flexibility of the KNN classifier.
- Small K: More flexible, lower bias, higher variance (risk of overfitting). The decision boundary is more jagged. It’s like asking only a few close friends: their advice fits your situation closely, but it can vary a lot depending on who you happen to ask.
- Large K: Less flexible, higher bias, lower variance. The decision boundary is smoother. It’s like asking many friends – their opinions will be more stable but might not capture specific nuances.
Example: Figures 2.15 and 2.16 show KNN fits with different values of K.
Finding the best K: We want to choose K to minimize the test error rate. Techniques like cross-validation can help. We need to find the value of K that gives the best generalization performance. It’s like finding the optimal number of friends to ask for advice to get the most reliable answer.
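
One way to choose K is to estimate the test error rate for several candidate values and keep the smallest. The sketch below uses leave-one-out cross-validation, which is just one form of cross-validation and is picked here for brevity. The simulated two-class data, the candidate list, and the helper names are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def knn_classify(train_X, train_y, x0, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x0))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_error(X, y, k):
    """Leave-one-out estimate of the test error rate for a given K."""
    mistakes = 0
    for i in range(len(X)):
        rest_X = X[:i] + X[i + 1:]  # every observation except the i-th
        rest_y = y[:i] + y[i + 1:]
        if knn_classify(rest_X, rest_y, X[i], k) != y[i]:
            mistakes += 1
    return mistakes / len(X)


# Hypothetical two-class data: two overlapping Gaussian clouds.
rng = random.Random(0)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
X += [(rng.gauss(1.5, 1), rng.gauss(1.5, 1)) for _ in range(60)]
y = ["blue"] * 60 + ["orange"] * 60

candidates = [1, 3, 5, 9, 15, 25, 51]  # odd values avoid tied votes
errors = {k: loocv_error(X, y, k) for k in candidates}
best_k = min(errors, key=errors.get)

for k in candidates:
    print(f"K = {k:2d}  estimated error rate = {errors[k]:.3f}")
print("chosen K:", best_k)
```

Each point is classified by a model that never saw it, so the estimated error rates reflect generalization rather than fit to the training data. With K = 1, by contrast, the training error rate would be zero regardless of how well the classifier generalizes.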

KNN: Choosing K, Example Visualization (K=1)

KNN: Choosing K, Example Visualization (K=100)

KNN: K=1 vs K=100

Figure 2.15 (K=1): The KNN decision boundary is overly flexible and follows the training data too closely.
Figure 2.16 (K=100): The KNN decision boundary is almost linear and too inflexible.

Summary

Let’s recap the key concepts we’ve covered.

Summary: Key Concepts

Statistical learning is about estimating relationships between variables, for prediction and/or inference. It’s like using data to understand the world and make predictions.
Parametric methods assume a specific functional form; non-parametric methods don’t. It’s like choosing between cooking with a recipe and cooking without one.
There’s a trade-off between model flexibility and interpretability. It’s like choosing between a powerful but complex tool and a simple but easy-to-use tool.
We need to assess model accuracy using test data (or estimates of test error). It’s like evaluating a student on a real exam, not just practice questions.
The bias-variance trade-off is fundamental: Good models need both low bias and low variance. It’s like finding the right balance on a seesaw.
In classification, the Bayes classifier is optimal, but we often have to approximate it (e.g., with KNN). It’s like having an ideal student as a benchmark, but using practical methods to approximate their performance.
Choosing the right level of flexibility is crucial. It’s like choosing the right tool for the job.

Thoughts and Discussion

Let’s think about some broader implications and questions.

Thoughts and Discussion: Questions

Think about real-world problems you’re interested in. Would you approach them with a focus on prediction, inference, or both? (e.g., predicting stock prices, understanding customer behavior, diagnosing diseases)
Can you think of examples where a simple, interpretable model might be preferable to a more complex, “black box” model, even for prediction? (e.g., credit scoring, medical diagnosis where explainability is important)
How might the “best” model (in terms of test error) depend on the amount of data available? (e.g., with limited data, simpler models might be better; with lots of data, more complex models might be better)
How does the concept of “overfitting” relate to the bias-variance trade-off? (Overfitting is a result of high variance – the model is too sensitive to the training data and doesn’t generalize well.)
Discuss the differences and similarities between supervised, unsupervised, and semi-supervised learning, and how they apply to real-world problems. (e.g., supervised: spam filtering; unsupervised: customer segmentation; semi-supervised: image classification with some labeled and some unlabeled images)