Zhejiang Wanli University
Data mining is the process of discovering patterns, insights, and knowledge from large datasets. It involves using various techniques from statistics, computer science, and database management to extract valuable information. It’s like being a detective, but instead of solving crimes, you’re uncovering hidden clues within data.
Think of it like sifting through a mountain of data to find hidden gems of information! 💎 It’s like panning for gold, but instead of gold, you’re finding valuable insights!
Machine learning is a field of artificial intelligence (AI) that focuses on enabling computers to learn from data without being explicitly programmed (i.e., without being given specific rules for every situation). It involves developing algorithms that can identify patterns, make predictions, and improve their performance over time as they are exposed to more data.
Essentially, machine learning allows computers to learn and adapt, much like humans do! The computer learns from the data, rather than having a programmer tell it exactly what to do.
Statistical learning is a framework for understanding data using statistical methods. It encompasses a vast set of tools for modeling and understanding complex datasets. It’s a recently developed area in statistics and blends with parallel developments in computer science, such as machine learning. It provides the theoretical underpinnings for many machine learning techniques.
It is the theoretical foundation that underpins many machine learning techniques. Think of it as the “science” behind the “magic” of machine learning. It provides the mathematical and statistical rigor.
```mermaid
graph LR
    A["Data Mining"] --> C("Common Ground")
    B["Machine Learning"] --> C
    D["Statistical Learning"] --> C
    C --> E["Insights & Predictions"]
```
Data mining, machine learning, and statistical learning overlap significantly. They all involve analyzing data to extract insights and make predictions. Statistical learning provides a theoretical foundation, while machine learning focuses on algorithms and prediction. Data mining applies these techniques to large, often unstructured datasets, to uncover hidden patterns.
Linear regression is a fundamental approach in supervised learning, a type of machine learning where we have labeled data (input features and corresponding output values). It’s used primarily for predicting a quantitative response – a numerical value (e.g., sales, price, height).
It’s like trying to find the best-fitting straight line through a set of data points. 📏 Think of plotting points on a graph and then drawing the line that best captures the overall trend.
Relationship between advertising spending and sales
This plot shows the relationship between advertising spending (on TV, radio, and newspaper) and sales. The x-axis represents the advertising budget (in thousands of dollars) for each medium, and the y-axis represents sales (in thousands of units). Linear regression helps us understand and quantify these relationships – how much does sales increase for each dollar spent on advertising?
We’ll explore the `Advertising` data to address several key questions:

- Is there a relationship between advertising budget and sales?
- How strong is the relationship?
- Which media contribute to sales, and how large is each contribution?
- How accurately can we predict future sales?
- Is the relationship linear?
- Is there synergy among the advertising media?
Addressing these questions helps determine if a relationship exists, its strength, individual media contributions, prediction accuracy, the linearity of the connection, and potential synergy effects. By answering these, we can make informed decisions about advertising strategies.
Simple linear regression predicts a quantitative response, \(Y\), using a single predictor variable, \(X\), assuming a linear relationship:
\[ Y \approx \beta_0 + \beta_1X \]
The equation \(Y \approx \beta_0 + \beta_1X\) represents a straight line:
```mermaid
graph LR
    subgraph "Linear Equation"
        A["Y"] --> B("β₀ + β₁X")
    end
    B --> C["β₀: Intercept"]
    B --> D["β₁: Slope"]
    C --> E["Value of Y when X = 0"]
    D --> F["Change in Y for a one-unit increase in X"]
```
This equation defines a straight line where the intercept (\(\beta_0\)) is the point where the line crosses the Y-axis (when X is zero), and the slope (\(\beta_1\)) determines the steepness and direction of the line. If β₁ is positive, Y increases as X increases. If β₁ is negative, Y decreases as X increases.
For instance, let’s regress `sales` onto `TV` advertising:
\[ \text{sales} \approx \beta_0 + \beta_1 \times \text{TV} \]
Once we estimate the coefficients \(\beta_0\) and \(\beta_1\) (denoted as \(\hat{\beta_0}\) and \(\hat{\beta_1}\)), we can predict sales for a given TV advertising budget (x):
\[ \hat{y} = \hat{\beta_0} + \hat{\beta_1}x \]
The “hat” symbol (^) indicates an estimated value. \(\hat{y}\) is the predicted value of sales, based on our estimated linear model. It’s our best guess for the sales, given the TV advertising budget.
Our goal is to find \(\hat{\beta_0}\) and \(\hat{\beta_1}\) that best fit the data. “Best fit” means the line should be as close as possible to the data points \((x_i, y_i)\). We use the least squares method. Intuitively, we want to find the line that minimizes the overall “error” between the observed data and the predicted values.
The least squares method minimizes the residual sum of squares (RSS):
\[ \text{RSS} = \sum_{i=1}^{n} (y_i - \hat{y_i})^2 = \sum_{i=1}^{n} (y_i - \hat{\beta_0} - \hat{\beta_1}x_i)^2 \]
```mermaid
graph LR
    subgraph "Residual Calculation"
        A["Observed Data (yᵢ)"] --> C{"Residual (eᵢ)"}
        B["Predicted Data (ŷᵢ)"] --> C
        C --> D["eᵢ = yᵢ - ŷᵢ"]
    end
```
The residual (\(e_i\)) represents the error in our prediction for each data point. It’s the vertical distance between the actual data point and the point on the regression line. Least squares aims to minimize the sum of the squared residuals. We square the residuals to ensure they are all positive and to penalize larger errors more heavily.
Least squares finds the line that minimizes the sum of the squared vertical distances between the data points and the line. We’re essentially finding the line that makes the overall prediction error as small as possible. By squaring the distances, we ensure that positive and negative errors don’t cancel each other out, and we give more weight to larger errors.
Using calculus (specifically, taking partial derivatives and setting them to zero), we find the values of \(\hat{\beta_0}\) and \(\hat{\beta_1}\) that minimize RSS:
\[ \hat{\beta_1} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \]
\[ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x} \]
The formula for the slope, \(\hat{\beta_1}\), can be interpreted as:
\[ \hat{\beta_1} = \frac{\text{Covariance}(X, Y)}{\text{Variance}(X)} \]
The numerator measures the covariance between X and Y, which indicates how much X and Y vary together. The denominator measures the variance of X, which indicates how much X varies on its own. The slope is essentially the covariance scaled by the variance of X.
The formula for the intercept, \(\hat{\beta_0}\), ensures that the regression line passes through the point (\(\bar{x}\), \(\bar{y}\)).
\[ \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x} \]
This means that when X is equal to its average value (\(\bar{x}\)), the predicted value of Y (\(\hat{y}\)) will be equal to the average value of Y (\(\bar{y}\)).
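A minimal NumPy sketch of these closed-form estimates, using a handful of made-up example values purely for illustration:

```python
import numpy as np

# Made-up example values for illustration only
tv = np.array([230.1, 44.5, 17.2, 151.5, 180.8])   # advertising budgets (thousands of $)
sales = np.array([22.1, 10.4, 9.3, 18.5, 12.9])    # sales (thousands of units)

x_bar, y_bar = tv.mean(), sales.mean()

# Slope: covariance of X and Y divided by the variance of X
beta1_hat = np.sum((tv - x_bar) * (sales - y_bar)) / np.sum((tv - x_bar) ** 2)

# Intercept: forces the fitted line through (x_bar, y_bar)
beta0_hat = y_bar - beta1_hat * x_bar

# Prediction for a new TV budget x
y_hat = beta0_hat + beta1_hat * 100.0
print(beta0_hat, beta1_hat, y_hat)
```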
Least Squares Fit
This figure shows the least squares regression line for `sales` versus `TV` advertising. Each grey line segment represents a residual – the difference between the actual sales value and the sales value predicted by the line. The least squares method finds the line that minimizes the sum of the squares of these residuals, resulting in the “best-fitting” line.
Population vs. Sample Regression Lines
The left panel shows the true population regression line (red) and the estimated least squares line (blue) from a single sample. The blue line is our best estimate of the red line, based on one particular set of data.
Population vs. Sample Regression Lines
The right panel shows ten different least squares lines, each estimated from a different sample drawn from the same population. The least squares lines vary, but they cluster around the true population line (red). This illustrates the concept of sampling variability – our estimates will vary depending on the specific sample we draw.
Unbiasedness means that our estimation method doesn’t systematically over- or underestimate the true values. If we took many samples and calculated the estimates each time, the average of those estimates would equal the true values. This is a desirable property of an estimator – it means that, in the long run, our estimation method will give us the right answer.
Formulas for standard errors:
\[ \text{SE}(\hat{\beta_0})^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \right] \]
\[ \text{SE}(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \]

- \(\sigma^2\): Variance of the error term \(\epsilon\) (usually unknown, estimated by the residual standard error, RSE). This represents the variability of the data around the true regression line.
\[ \text{SE}(\hat{\beta_0})^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \right] \]
The standard error of the intercept depends on the sample size (n), the variance of the error term (σ²), the average value of X (x̄), and the spread of the X values. A larger sample size (n) leads to a smaller standard error.
\[ \text{SE}(\hat{\beta_1})^2 = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \]
Smaller standard errors indicate more precise estimates. Notice that SE(\(\hat{\beta_1}\)) is smaller when the \(x_i\) values are more spread out – having a wider range of X values (larger denominator) gives us more information about the slope, leading to a more precise estimate. Also, a smaller error variance (σ²) leads to a smaller standard error.
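A small sketch of how these standard errors might be computed in practice, estimating \(\sigma\) by the residual standard error (a minimal illustration, assuming `x` and `y` are NumPy arrays and the coefficient estimates are already available):

```python
import numpy as np

def standard_errors(x, y, beta0_hat, beta1_hat):
    """Plug the RSE (estimate of sigma) into the SE formulas above."""
    n = len(x)
    residuals = y - (beta0_hat + beta1_hat * x)
    rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))   # estimate of sigma
    sxx = np.sum((x - x.mean()) ** 2)                 # spread of the x values
    se_beta1 = np.sqrt(rse ** 2 / sxx)
    se_beta0 = np.sqrt(rse ** 2 * (1.0 / n + x.mean() ** 2 / sxx))
    return se_beta0, se_beta1
```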
An approximate 95% confidence interval for \(\beta_1\) takes the form:

\[ \hat{\beta_1} \pm 2 \cdot \text{SE}(\hat{\beta_1}) \]
This means that if we were to repeatedly sample from the population and construct 95% confidence intervals, approximately 95% of those intervals would contain the true value of \(\beta_1\). It gives us a range within which we are reasonably confident the true parameter lies. The interval is centered around our estimate (\(\hat{\beta_1}\)) and its width depends on the standard error.
To test the null hypothesis \(H_0: \beta_1 = 0\) (no relationship between X and Y) against the alternative \(H_a: \beta_1 \neq 0\), we compute a t-statistic:

\[ t = \frac{\hat{\beta_1} - 0}{\text{SE}(\hat{\beta_1})} \]
A small p-value (typically < 0.05) provides evidence against the null hypothesis, suggesting a statistically significant relationship between X and Y. A large t-statistic (in absolute value) corresponds to a small p-value. If the p-value is small, it means it’s unlikely to observe such a large t-statistic if there were truly no relationship between X and Y.
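A minimal sketch of this test using `scipy.stats`, assuming the slope estimate and its standard error have already been computed (the ±2·SE interval is the rough 95% interval from above):

```python
from scipy import stats

def slope_inference(beta1_hat, se_beta1, n):
    """t-statistic, two-sided p-value, and rough 95% interval for beta1."""
    t_stat = (beta1_hat - 0.0) / se_beta1
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)   # two-sided p-value
    ci = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)
    return t_stat, p_value, ci
```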
| Predictor | Coefficient | Std. error | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 7.0325 | 0.4578 | 15.36 | < 0.0001 |
| TV | 0.0475 | 0.0027 | 17.67 | < 0.0001 |
The table shows the results of regressing `sales` on `TV` advertising. The very small p-value for `TV` provides strong evidence that \(\beta_1 \neq 0\), meaning there is a statistically significant relationship between TV advertising and sales. The t-statistic for TV (17.67) is very large, indicating that the estimated coefficient for TV is many standard errors away from zero.
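A table like this can be reproduced with statsmodels; the sketch below assumes the Advertising data has been saved locally as `Advertising.csv` (a hypothetical file name):

```python
import pandas as pd
import statsmodels.formula.api as smf

ad = pd.read_csv("Advertising.csv")        # hypothetical path to the Advertising data

fit = smf.ols("sales ~ TV", data=ad).fit()
print(fit.summary())                       # coefficients, std. errors, t-statistics, p-values
```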
\[ \text{RSE} = \sqrt{\frac{1}{n-2}\text{RSS}} = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y_i})^2} \]
Lower RSE values indicate a better fit, meaning the model’s predictions are closer to the actual values. The RSE is measured in the units of Y. The (n-2) in the denominator represents the degrees of freedom, accounting for the fact that we’ve estimated two parameters (β₀ and β₁).
\[ R^2 = \frac{\text{TSS} - \text{RSS}}{\text{TSS}} = 1 - \frac{\text{RSS}}{\text{TSS}} \]
R² closer to 1 indicates that a large proportion of the variability in the response is explained by the regression. An R² of 0 means the model explains none of the variability (the model is no better than just predicting the average value of Y). In simple linear regression, R² is the square of the correlation between X and Y. It can be interpreted as the percentage of the variation in Y that can be attributed to X.
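A minimal sketch computing both fit measures from the fitted values (assuming `y` and `y_hat` are NumPy arrays):

```python
import numpy as np

def fit_quality(y, y_hat):
    """RSE (in the units of y) and R-squared for a simple linear regression."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    tss = np.sum((y - np.mean(y)) ** 2)
    rse = np.sqrt(rss / (n - 2))           # two estimated parameters: beta0 and beta1
    r2 = 1 - rss / tss
    return rse, r2
```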
| Quantity | Value |
|---|---|
| Residual standard error | 3.26 |
| R² | 0.612 |
| F-statistic | 312.1 |
For the regression of `sales` on `TV`, the RSE is 3.26 (thousands of units). This means that, on average, the actual sales values deviate from the true regression line by about 3,260 units. The R² is 0.612, meaning that 61.2% of the variability in sales is explained by TV advertising. The F-statistic is a measure of overall model significance (relevant for multiple regression, discussed later).
Extends simple linear regression to handle multiple predictors:
\[ Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \dots + \beta_pX_p + \epsilon \]
Each predictor now has its own slope coefficient, representing its unique contribution to the response, while controlling for the other predictors. This allows us to isolate the effect of each predictor.
\[ \text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \epsilon \]
Here, we’re trying to predict sales using TV, radio, and newspaper advertising budgets.
| Predictor | Coefficient | Std. error | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 2.939 | 0.3119 | 9.42 | < 0.0001 |
| TV | 0.046 | 0.0014 | 32.81 | < 0.0001 |
| radio | 0.189 | 0.0086 | 21.89 | < 0.0001 |
| newspaper | -0.001 | 0.0059 | -0.18 | 0.8599 |
Holding TV and newspaper advertising fixed, spending an additional $1,000 on radio advertising is associated with an increase in sales of approximately 189 units. The newspaper coefficient is not statistically significant (p-value > 0.05), suggesting that after accounting for TV and radio advertising, newspaper advertising does not have a significant impact on sales.
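A hedged sketch of this multiple regression with statsmodels, again assuming a local `Advertising.csv` file (hypothetical path):

```python
import pandas as pd
import statsmodels.formula.api as smf

ad = pd.read_csv("Advertising.csv")        # hypothetical path

multi = smf.ols("sales ~ TV + radio + newspaper", data=ad).fit()
print(multi.params)                        # one slope per medium, holding the others fixed
print(multi.pvalues)                       # newspaper's p-value is expected to be large
```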
| | TV | radio | newspaper | sales |
|---|---|---|---|---|
| TV | 1.0000 | 0.0548 | 0.0567 | 0.7822 |
| radio | 0.0548 | 1.0000 | 0.3541 | 0.5762 |
| newspaper | 0.0567 | 0.3541 | 1.0000 | 0.2283 |
| sales | 0.7822 | 0.5762 | 0.2283 | 1.0000 |
This correlation matrix shows the pairwise correlations between the variables in the `Advertising` data. Notice the moderate correlation (0.35) between `radio` and `newspaper`. This correlation can affect the coefficients in the multiple regression model, explaining why the `newspaper` coefficient is not significant in the multiple regression, even though it might be significant in a simple linear regression with only `newspaper`. The correlation between radio and newspaper “confounds” the effect of newspaper on sales.
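With pandas, a matrix like this can be obtained directly (assuming the same hypothetical `Advertising.csv` file):

```python
import pandas as pd

ad = pd.read_csv("Advertising.csv")        # hypothetical path
print(ad[["TV", "radio", "newspaper", "sales"]].corr().round(4))
```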
To test whether at least one predictor is useful, we test \(H_0: \beta_1 = \beta_2 = \dots = \beta_p = 0\) against \(H_a\): at least one \(\beta_j\) is non-zero, using the F-statistic:

\[ F = \frac{(\text{TSS} - \text{RSS})/p}{\text{RSS}/(n - p - 1)} \]
If there’s no relationship between the response and predictors, the F-statistic will be close to 1. If Hₐ is true, F will be greater than 1. The larger the F-statistic, the stronger the evidence against the null hypothesis. The numerator represents the variance explained by the model, and the denominator represents the unexplained variance. The values p and (n - p - 1) are the degrees of freedom for the numerator and denominator, respectively.
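A short sketch of the F-statistic computed from TSS and RSS (assuming `y` and `y_hat` are NumPy arrays and `p` is the number of predictors):

```python
import numpy as np

def f_statistic(y, y_hat, p):
    """F-statistic for H0: beta_1 = ... = beta_p = 0."""
    n = len(y)
    tss = np.sum((y - np.mean(y)) ** 2)    # total variability in the response
    rss = np.sum((y - y_hat) ** 2)         # variability left unexplained by the model
    return ((tss - rss) / p) / (rss / (n - p - 1))
```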
| Quantity | Value |
|---|---|
| Residual standard error | 1.69 |
| R² | 0.897 |
| F-statistic | 570 |
The F-statistic for the multiple regression of `sales` on `TV`, `radio`, and `newspaper` is 570. This is much larger than 1, providing strong evidence against the null hypothesis. The associated p-value is essentially zero, indicating that at least one advertising medium is significantly related to sales. The high R² (0.897) also indicates a good model fit – the predictors explain a large proportion of the variance in sales.
Goal: Identify the subset of predictors that are most strongly related to the response. We want to find the most important predictors and exclude those that don’t contribute meaningfully to the model. We aim for a parsimonious model – one that is as simple as possible while still explaining the data well.
Methods:

- Forward selection
- Backward selection
- Mixed selection
We typically can’t try all possible subsets of predictors (there are \(2^p\) of them!), so we use these more efficient, step-wise methods to find a good model. They provide a computationally feasible way to search for a good subset of predictors.
Adding more variables may not improve predictions on new data (test data). We need to be careful not to overfit the training data by including too many predictors. Techniques like cross-validation can help assess model performance on unseen data and prevent overfitting. Adjusted R² is another metric that penalizes the addition of unnecessary variables.
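As one possible illustration of the step-wise idea, here is a minimal forward-selection sketch that uses adjusted R² as the stopping criterion; the helper function and file name are assumptions for demonstration, not a prescribed procedure:

```python
import statsmodels.formula.api as smf

def forward_selection(data, response, candidates):
    """Greedily add the predictor that most improves adjusted R-squared."""
    selected, remaining = [], list(candidates)
    best_so_far = -float("inf")
    while remaining:
        scores = []
        for cand in remaining:
            formula = f"{response} ~ {' + '.join(selected + [cand])}"
            scores.append((smf.ols(formula, data=data).fit().rsquared_adj, cand))
        score, cand = max(scores)
        if score <= best_so_far:           # no improvement: stop
            break
        selected.append(cand)
        remaining.remove(cand)
        best_so_far = score
    return selected

# Example (hypothetical path):
# ad = pd.read_csv("Advertising.csv")
# print(forward_selection(ad, "sales", ["TV", "radio", "newspaper"]))
```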
Three sources of uncertainty in predictions:

- The coefficient estimates \(\hat{\beta_0}, \hat{\beta_1}, \dots, \hat{\beta_p}\) are only estimates of the true coefficients (reducible error).
- The linear model itself is only an approximation to reality (model bias).
- Even with the true coefficients, individual responses vary around the regression plane because of the random error \(\epsilon\) (irreducible error).
Prediction intervals are always wider than confidence intervals because they account for both the uncertainty in estimating the population regression plane and the inherent variability of individual data points around that plane (the irreducible error). Confidence intervals only account for the uncertainty in estimating the average response.
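A sketch of how both interval types can be obtained from a fitted statsmodels model (hypothetical `Advertising.csv` path and budget values):

```python
import pandas as pd
import statsmodels.formula.api as smf

ad = pd.read_csv("Advertising.csv")                    # hypothetical path
fit = smf.ols("sales ~ TV + radio + newspaper", data=ad).fit()

new_budgets = pd.DataFrame({"TV": [100.0], "radio": [20.0], "newspaper": [10.0]})
pred = fit.get_prediction(new_budgets).summary_frame(alpha=0.05)

# mean_ci_*: confidence interval for the average response
# obs_ci_*:  prediction interval for an individual response (always wider)
print(pred[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])
```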
- For a predictor with two levels: Create *one* dummy variable.
- For a predictor with more than two levels: Create *one fewer* dummy variable than the number of levels.
- One level will serve as a *baseline* (reference) level.
Each dummy variable is coded as 0 or 1, indicating the absence or presence of a particular level. The baseline level is implicitly represented when all dummy variables are 0.
We want to predict `balance` using the `own` variable (whether someone owns a house).
\[ x_i = \begin{cases} 1 & \text{if person } i \text{ owns a house} \\ 0 & \text{if person } i \text{ does not own a house} \end{cases} \]
\[ y_i = \beta_0 + \beta_1x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if person } i \text{ owns a house} \\ \beta_0 + \epsilon_i & \text{if person } i \text{ does not own a house} \end{cases} \]
\(\beta_0\) represents the average credit card balance for non-owners (the baseline group). \(\beta_0 + \beta_1\) represents the average balance for owners. \(\beta_1\) is the average difference in balance between owners and non-owners. The coefficient of the dummy variable represents the difference in the mean response between the level represented by the dummy variable and the baseline level.
Suppose we have a qualitative predictor, `region`, with three levels: North, South, and West. We create two dummy variables:
\[ x_{i1} = \begin{cases} 1 & \text{if person } i \text{ is from the North} \\ 0 & \text{otherwise} \end{cases} \]
\[ x_{i2} = \begin{cases} 1 & \text{if person } i \text{ is from the South} \\ 0 & \text{otherwise} \end{cases} \]
The regression model would be:
\[ y_i = \beta_0 + \beta_1x_{i1} + \beta_2x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if North} \\ \beta_0 + \beta_2 + \epsilon_i & \text{if South} \\ \beta_0 + \epsilon_i & \text{if West (baseline)} \end{cases} \]
\(\beta_0\) represents the average balance for people from the West (the baseline). \(\beta_1\) represents the average difference in balance between people from the North and the West. \(\beta_2\) represents the average difference in balance between people from the South and the West.
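A minimal pandas sketch of this coding scheme; the toy `region` values are made up, and the category order is set so that West is the baseline, matching the model above:

```python
import pandas as pd

# Made-up toy data with a three-level qualitative predictor
df = pd.DataFrame({"region": ["North", "South", "West", "North", "West"]})

# Put "West" first so it becomes the dropped (baseline) level
df["region"] = pd.Categorical(df["region"], categories=["West", "North", "South"])

# drop_first=True creates one fewer dummy variable than the number of levels
dummies = pd.get_dummies(df["region"], drop_first=True)
print(dummies)   # columns North and South; a row of all zeros means West
```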
Interactions allow the relationship between a predictor and the response to vary depending on the values of other predictors. They allow for a more flexible and realistic model, capturing situations where the effect of one predictor is amplified or diminished by another.
\[ \text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times (\text{TV} \times \text{radio}) + \epsilon \]
This model includes an interaction term between TV and radio advertising.
The model with the interaction term can be rewritten as:
\[ \text{sales} = \beta_0 + (\beta_1 + \beta_3 \times \text{radio}) \times \text{TV} + \beta_2 \times \text{radio} + \epsilon \]
Now, the slope for `TV` (\(\beta_1 + \beta_3 \times \text{radio}\)) depends on the value of `radio`. The interaction term (\(\beta_3\)) allows for synergy between the advertising media. If \(\beta_3\) is positive, the effect of TV advertising increases as radio advertising increases. If \(\beta_3\) is negative, the effect of TV advertising decreases as radio advertising increases.
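A hedged statsmodels sketch of this interaction model (hypothetical `Advertising.csv` path); in the formula syntax, `TV * radio` expands to the three terms above:

```python
import pandas as pd
import statsmodels.formula.api as smf

ad = pd.read_csv("Advertising.csv")               # hypothetical path

# "TV * radio" expands to TV + radio + TV:radio (the interaction term)
inter = smf.ols("sales ~ TV * radio", data=ad).fit()
print(inter.params)                               # Intercept, TV, radio, and TV:radio (beta3)
```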
We can also include interactions between qualitative and quantitative predictors. For example, we could model the relationship between `balance`, `income`, and `student` (a qualitative variable indicating whether someone is a student) as:
\[ \text{balance} = \beta_0 + \beta_1 \times \text{income} + \beta_2 \times \text{student} + \beta_3 \times (\text{income} \times \text{student}) + \epsilon \]
where `student` is a dummy variable (1 if student, 0 otherwise).
This model allows for different slopes for students and non-students. The coefficient β₃ represents the difference in the slope of the income-balance relationship between students and non-students. This allows the effect of income on balance to be different for students compared to non-students.
\[ \text{mpg} = \beta_0 + \beta_1 \times \text{horsepower} + \beta_2 \times \text{horsepower}^2 + \epsilon \]
This is still a linear regression model (linear in the coefficients), but it models a non-linear (quadratic) relationship between `mpg` and `horsepower`. We’re fitting a curve rather than a straight line to the data.
## Polynomial Regression: degree = 1
Here is the linear fit (dashed); it does not fit the data well.
Polynomial Regression
The plot shows a linear fit (dashed) and a quadratic fit (degree = 2, solid) to the relationship between `horsepower` and `mpg`. The quadratic fit captures the non-linear relationship better. We can see that as horsepower increases, mpg initially decreases, but then the rate of decrease slows down, and eventually, mpg might even start to increase slightly at very high horsepower levels.
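A minimal statsmodels sketch of the quadratic fit, assuming the Auto data is available locally as `Auto.csv` (hypothetical file name):

```python
import pandas as pd
import statsmodels.formula.api as smf

auto = pd.read_csv("Auto.csv")                    # hypothetical path to the Auto data

# I() protects the arithmetic inside the formula; the model stays linear in the coefficients
quad = smf.ols("mpg ~ horsepower + I(horsepower ** 2)", data=auto).fit()
print(quad.params)
```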
Potential problems when fitting a linear regression model include non-linearity of the response–predictor relationship, correlation of the error terms, non-constant variance of the error terms, outliers, high-leverage points, and collinearity. These problems can affect the accuracy and interpretability of the regression model. They can lead to biased coefficient estimates, incorrect standard errors, and misleading conclusions. It’s important to diagnose and address these issues to ensure the model is reliable and valid.
Qiu Fei (Peter) 💌 [email protected]