01 Foundations of Machine Learning

The Core Question: Why Should Economists Learn Machine Learning?

Traditional econometrics and machine learning represent two different ‘cultures’ for solving problems.

  • Econometrics: The core is Causal Inference.
    • Goal: To understand ‘why’ the world works the way it does.
    • Concern: Unbiasedness and consistency of parameter estimates.
    • Example: By how much did the minimum wage increase change the unemployment rate?
  • Machine Learning: The core is Prediction.
    • Goal: To predict ‘what’ will happen in the world.
    • Concern: The model’s generalization ability on unseen data.
    • Example: Based on current macroeconomic indicators, predict next quarter’s GDP growth rate.

A Visual Contrast of the Two Cultures

The two methodologies have fundamental differences in model selection and objectives.

[Figure: Econometrics vs. Machine Learning. Left panel, Econometrics (goal: causal inference): treatment X affects outcome Y through β; 'We want to understand β'; prefers simple, interpretable models (e.g., OLS, IV). Right panel, Machine Learning (goal: accurate prediction): input X passes through f(x) to give prediction Ŷ; 'We only care that Ŷ ≈ Y'; embraces complex, 'black-box' models (e.g., Random Forest, Neural Networks).]

Why Prediction Matters to Economists

In a data-driven era, predictive power is itself a potent economic tool.

  • Financial Markets: Predicting asset prices, volatility, and credit risk.
  • Macroeconomics: Forecasting inflation, GDP growth, and unemployment to inform policy.
  • Business Decisions: Forecasting product sales, customer churn, and supply chain demand.
  • Policy Evaluation: Predicting the likely economic impact of a policy (e.g., a tax cut).

Causal inference explains the past; accurate prediction provides insight into the future. Combined, their power is multiplied.

Today’s Goal: Build a Complete Mental Framework for Machine Learning

After this chapter, you will be able to systematically understand any machine learning project from the perspective of ‘The Four Pillars’.

[Figure: The Four Pillars of Machine Learning]

  1. Frame the Problem (Framing): Regression vs. Classification? Supervised vs. Unsupervised?
  2. Define the Model (Modeling): Linear vs. Non-linear? Simple vs. Complex?
  3. Define ‘Good’ (Evaluation): MSE vs. F1-Score? Is it overfitting?
  4. Define ‘Learning’ (Optimization): Which optimizer to use? What learning rate to set?

Pillar I: Frame the Problem (Framing)

This is the starting point for all work.

Before any technical details, you must clearly define the business problem and translate it into a specific machine learning task.

  • What are you trying to predict?
    • A continuous value (e.g., tomorrow’s stock price) \(\rightarrow\) Regression
    • A discrete category (e.g., whether a customer will default) \(\rightarrow\) Classification
  • What data do you have?
    • Does the data come with the ‘answer’ you want to predict (i.e., a label y)?
      • Yes \(\rightarrow\) Supervised Learning
      • No \(\rightarrow\) Unsupervised Learning

Pillar II: Define the Model (Modeling)

A model is essentially a mathematical function \(f(x, \theta)\) that tries to capture the relationship between input features \(x\) and the output \(y\).

  • \(x\): The input feature vector (e.g., house area, location).
  • \(\theta\): The model’s parameters. These are the values that need to be determined through ‘learning’ (e.g., coefficients in a linear regression).
  • \(f\): The form of the function. This is what we, as modelers, choose.

The range of models is vast:

  • Simple Models: Linear Regression, Logistic Regression (highly interpretable).
  • Complex Models: Random Forest, Gradient Boosting Trees, Neural Networks (powerful prediction).
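
The idea of a model as a parameterized function \(f(x, \theta)\) can be sketched in a few lines. This is only an illustration: the function name `linear_model`, the house features, and the parameter values are all made up for the example, and a linear form is just one possible choice of \(f\).

```python
import numpy as np

def linear_model(x, theta):
    """f(x, theta) = theta[0] + theta[1:] . x, where theta[0] is the intercept."""
    x = np.asarray(x, dtype=float)
    theta = np.asarray(theta, dtype=float)
    return theta[0] + x @ theta[1:]

# Hypothetical house: 120 m^2, 3 bedrooms
x = [120.0, 3.0]
# Hypothetical parameters: intercept, value per m^2, value per bedroom
theta = [50.0, 3.0, 20.0]

prediction = linear_model(x, theta)
print(prediction)  # 50 + 3*120 + 20*3 = 470.0
```

Choosing \(f\) is the modeler's job; finding good values for `theta` is what the learning step does.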

Pillar III: Define ‘Good’ (Evaluation)

How do we objectively measure how good a model is? We need an evaluation metric.

  • This metric must reflect the business objective.
  • It must be calculated on data the model has never seen before (the test set).

Common Evaluation Metrics:

  • Regression Tasks:
    • Mean Squared Error (MSE)
    • R-squared (R²)
  • Classification Tasks:
    • Accuracy
    • Precision, Recall, F1-Score

Pillar IV: Define ‘Learning’ (Optimization)

The process of ‘learning’ is the process of automatically finding the best parameters \(\theta\).

  1. We first define a Loss Function \(J(\theta)\), which measures how bad the model’s predictions are with the current parameters \(\theta\). The smaller the loss, the better the model.

  2. Then, we use an Optimizer, such as Gradient Descent, to systematically and iteratively adjust the parameters \(\theta\) to find the set of values \(\theta^*\) that minimizes the loss function \(J(\theta)\).

\[ \large \theta^* = \arg\min_{\theta} J(\theta) \]

What is Machine Learning? Learning Functions from Data

The essence of Machine Learning (ML) is to have a computer learn a function from data automatically, rather than having it explicitly programmed, so that the function can make predictions on data it has not seen.

[Figure: Fundamental ML workflow. A learning algorithm (A) takes training data (D) and a performance metric (P) and produces a predictive model f(x), which maps new data (x) to a prediction (ŷ).]

Data Representation in ML: Everything is a Vector

In machine learning, we need to convert real-world objects into a language computers understand—numbers.

  • Dataset: A collection of N samples \(X = \{x_1, x_2, \dots, x_N\}\).
  • Sample: A single data point (a house, a customer).
  • Feature: A dimension describing a sample (area, number of bedrooms).
  • Feature Vector: A vector of all features for one sample \(x = (x_{\text{area}}, x_{\text{bedrooms}}, \dots)^T\).
  • Label: The target value we want to predict, y (house price).

From the Real World to Mathematical Objects

This transformation process is at the heart of data preprocessing.

[Figure: Data vectorization. A raw data table of N samples with columns Area (m²), Bedrooms, Location, and Price ($10k), e.g., (120, 3, Zone A, 500), (85, 2, Zone B, 320), ..., (200, 4, Zone A, 850), is converted into an N × d feature matrix X and an N × 1 label vector y.]
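
A minimal sketch of this vectorization step, using only the three rows visible in the example table. The one-hot encoding of the Location column and the helper name `to_feature_vector` are assumptions made for illustration; other encodings are possible.

```python
import numpy as np

# Rows from the example table: (area in m^2, bedrooms, location, price in $10k)
rows = [
    (120, 3, "Zone A", 500),
    (85, 2, "Zone B", 320),
    (200, 4, "Zone A", 850),
]

# One-hot encode the categorical Location column (an assumed encoding choice)
zones = sorted({r[2] for r in rows})  # ['Zone A', 'Zone B']

def to_feature_vector(area, bedrooms, location):
    one_hot = [1.0 if location == z else 0.0 for z in zones]
    return [float(area), float(bedrooms)] + one_hot

X = np.array([to_feature_vector(a, b, loc) for a, b, loc, _ in rows])  # N x d
y = np.array([float(price) for *_, price in rows])                     # N labels

print(X.shape, y.shape)  # (3, 4) (3,)
```

Each row of `X` is now a point in a 4-dimensional feature space, and `y` holds the matching labels.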

A Sample = A Point in d-Dimensional Space

Once we represent samples as feature vectors, each sample can be viewed as a point in a d-dimensional feature space.

This provides a geometric foundation for understanding machine learning algorithms.

[Figure: Data samples in a 2D feature space. Two samples (Sample 1, Sample 2) plotted as points against Feature 1 (Area) and Feature 2 (Bedrooms).]

Case Study: Understanding Feature Vectors with Economic Data

Suppose we want to predict U.S. Personal Consumption Expenditures (PCE). We can get data from FRED.

| Date | Disposable Personal Income (DPI) | Consumer Confidence Index (CONF) | Personal Consumption Expenditures (PCE) (Label y) |
|---|---|---|---|
| 2023-01-01 | 19,800 | 102.9 | 18,000 |
| 2023-02-01 | 19,850 | 103.4 | 18,050 |

  • A sample is one row of data, representing one month.
  • The feature vector is \(x_t = (\text{DPI}_t, \text{CONF}_t)^T\). This is a point in a 2D space.
  • The label is \(y_t = \text{PCE}_t\).

The Three Main Categories of Machine Learning

Based on the data we have (especially whether we have the label y), machine learning tasks can be divided into three main categories.

  1. Supervised Learning
  2. Unsupervised Learning
  3. Reinforcement Learning

We will introduce them one by one.

Category 1: Supervised Learning

Used when the data comes with clear ‘answers’ or ‘labels’.

[Figure: Supervised learning workflow. (1) Labeled training data D = {(x₁, y₁), (x₂, y₂), ...}, e.g., (image of a cat, 'Cat'), (image of a dog, 'Dog'). (2) Train a model f that learns the mapping f: X → Y. (3) Predict on new data: ŷ = f(x_new).]

Category 2: Unsupervised Learning

Used when data has no ‘answers’, and we want to discover its internal structure.

[Figure: Unsupervised learning (clustering). Input: unlabeled data → clustering algorithm → output: discovered structure (clusters).]

Category 3: Reinforcement Learning

Used when we need to learn an optimal strategy through ‘trial and error’ with an environment.

[Figure: Reinforcement learning loop. The agent takes an action Aₜ; the environment returns a new state Sₜ₊₁ and a reward Rₜ₊₁.]

  • Data: No fixed dataset; generated dynamically through interaction.
  • Goal: Learn a policy that maximizes long-term cumulative reward.
  • Examples: Training game AI, robotics control, automated trading.

Focusing on Supervised Learning: Regression vs. Classification

Most prediction tasks in economics and business fall under supervised learning, which can be further divided into two major tasks based on the type of the label y.

  • Regression:
    • Goal: Predict a continuous numerical value.
    • Output: \(y \in \mathbb{R}\)
    • Examples: Predicting GDP growth rate, company sales figures.
  • Classification:
    • Goal: Predict a discrete category.
    • Output: \(y \in \{C_1, C_2, \dots, C_K\}\)
    • Examples: Determining if a customer will churn, if a transaction is fraudulent.

Geometric Intuition of Regression and Classification

[Figure: Regression vs. classification. Left panel (Regression, predicting a continuous value): a best-fit line through data points, feature against continuous target. Right panel (Classification, predicting a discrete class): a decision boundary separating classes in the Feature 1 / Feature 2 plane.]

Practical Context: The Keynesian Consumption Function

Before diving into code, let’s revisit a classic economic theory: the Keynesian Consumption Function.

\[ C = a + b Y_d \]

  • \(C\): Total consumption
  • \(Y_d\): Disposable income
  • \(a\): Autonomous consumption (consumption even with zero income)
  • \(b\): Marginal Propensity to Consume (MPC, how much of an extra unit of income is spent)

This is a classic linear relationship. We can use the simplest machine learning model—linear regression—to estimate this relationship from data.

Practical Example: A Simple Regression Analysis in Python

Task: Use U.S. Disposable Personal Income (DPI) to predict Personal Consumption Expenditures (PCE).

This is a typical regression problem. We will use the statsmodels library and, to keep the example self-contained, simulated data in place of the actual FRED series.

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# --- Create mock data ---
# Assume the true relationship is PCE = 500 + 0.9 * DPI + noise
np.random.seed(42)
dpi_mock = np.linspace(10000, 20000, 150)
noise = np.random.normal(0, 300, 150)
pce_mock = 500 + 0.9 * dpi_mock + noise
df = pd.DataFrame({'DPI': dpi_mock, 'PCE': pce_mock})

# Define independent (X) and dependent (y) variables
X = df['DPI']
y = df['PCE']
X = sm.add_constant(X) # Add an intercept term

# Fit the OLS model
model = sm.OLS(y, X).fit()

# Visualization
fig, ax = plt.subplots(figsize=(8, 5))

# Plot scatter plot and the fitted line
ax.scatter(df['DPI'], df['PCE'], alpha=0.6, label='Simulated Monthly Data', color='gray')
ax.plot(df['DPI'], model.predict(X), color='crimson', linewidth=2.5, label='OLS Fit Line')

ax.set_title('Keynesian Consumption Function: Consumption Determined by Income', fontsize=16, fontweight='bold')
ax.set_xlabel('Disposable Personal Income (Billions of Dollars)', fontsize=12)
ax.set_ylabel('Personal Consumption Expenditures (Billions of Dollars)', fontsize=12)

# Set legend
ax.legend(fontsize=11, loc='upper left')

# Set tick labels
ax.tick_params(axis='both', which='major', labelsize=11)

# Add grid for readability
ax.grid(True, alpha=0.3)

fig.tight_layout()
plt.show()

# Print partial model summary
print(f'R-squared: {model.rsquared:.4f}')
print(f"DPI Coefficient (estimated MPC): {model.params['DPI']:.4f}")
Figure 1: Simulated Personal Consumption Expenditures (PCE) vs. Disposable Personal Income (DPI)
R-squared: 0.9886
DPI Coefficient (estimated MPC): 0.9040

Pillar III: How to Evaluate if a Model is Good?

How do we know if the model we trained, \(f(x, \theta)\), is a ‘good’ model?

Core Principle: The model’s performance on data it has never seen before (the test set) is the only true measure of its ability to generalize.

This leads to the most important practice in machine learning: the Train-Test Split.

The Golden Rule: Train-Test Split

We must divide our data into at least two parts.

[Figure: Train-test split workflow. The full dataset is split into a training set (80%), used to train the model f(x, θ), and a test set (20%), the 'mock exam' used to evaluate performance.]

Why Must We Split? The Ghost of Overfitting

Overfitting occurs when a model learns the training data “too well,” to the point that it memorizes the noise and random fluctuations in the data as if they were general patterns.

  • Symptom: Performs extremely well on the training set, but very poorly on the test set.
  • Cause: The model is too complex, with too much freedom relative to the amount of data.

The test set is like a ‘mock exam’—it fairly tests whether the model has truly learned the subject or just memorized the answers.
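
The symptom of overfitting can be seen in a small experiment. The sketch below is an illustration under assumed settings: the data are simulated from a linear signal plus noise, the split is 16 training points and 4 test points, and polynomial degrees 1 and 10 stand in for a 'simple' and a 'too complex' model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a linear signal plus noise
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, size=x.size)

# 80/20 train-test split on shuffled indices
idx = rng.permutation(x.size)
train_idx, test_idx = idx[:16], idx[16:]

def mse(y_true, y_hat):
    return float(np.mean((y_true - y_hat) ** 2))

def fit_and_score(degree):
    """Fit a polynomial on the training set; return (train MSE, test MSE)."""
    coefs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
    train_mse = mse(y[train_idx], np.polyval(coefs, x[train_idx]))
    test_mse = mse(y[test_idx], np.polyval(coefs, x[test_idx]))
    return train_mse, test_mse

for degree in (1, 10):
    train_mse, test_mse = fit_and_score(degree)
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

The degree-10 fit can never have a higher training error than the straight line, because the line is a special case of it. Whether that extra flexibility helps or hurts only shows up in the test error, which is why the comparison must be made there.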

Cornerstone of Classification Evaluation: The Confusion Matrix

For binary classification problems (e.g., predicting customer default), all evaluation metrics derive from a simple table: the Confusion Matrix.

| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |

  • Positive: The event we care about, like ‘default’ or ‘fraud’.
  • Negative: The other case, like ‘no default’.
  • True/False: Refers to whether the prediction was correct.

Understanding the Four Quadrants of the Confusion Matrix

  • TP (True Positive): Correct prediction, the customer did default. (Hit)
  • TN (True Negative): Correct prediction, the customer did not default. (Correct Rejection)
  • FP (False Positive): Incorrect prediction, predicted default, but they didn’t. (False Alarm, Type I Error)
  • FN (False Negative): Incorrect prediction, predicted no default, but they did. (Miss, Type II Error)

In fields like financial risk management, the cost of an FN (missing a bad customer) is often far greater than the cost of an FP (misjudging a good customer).
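
The four quadrants can be tallied directly from paired labels. This is a minimal sketch: the helper name `confusion_counts` and the eight example labels are hypothetical, with 1 standing for 'default' and 0 for 'no default'.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1   # hit
        elif predicted == positive:
            fp += 1   # false alarm
        elif actual == positive:
            fn += 1   # miss
        else:
            tn += 1   # correct rejection
    return tp, fp, fn, tn

# Hypothetical labels: 1 = default, 0 = no default
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
print(tp, fp, fn, tn)  # 2 1 1 4
```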

Classification Metric (1): Accuracy

Accuracy measures the proportion of total samples that the model predicted correctly.

\[ \large \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Samples}} = \frac{TP + TN}{TP + TN + FP + FN} \]

Advantage: Very intuitive and easy to understand.

Disadvantage: Highly misleading on imbalanced datasets.

The Accuracy Trap: An Example

Imagine a credit card fraud detection scenario:

  • Total transactions: 10,000
  • Normal transactions: 9,990 (99.9%)
  • Fraudulent transactions: 10 (0.1%)

How does a ‘lazy’ model that predicts all transactions as ‘normal’ perform?

  • TP = 0, TN = 9990
  • FP = 0, FN = 10
  • Accuracy = (0 + 9990) / 10000 = 99.9%

This model has extremely high accuracy but is completely useless, as it fails to identify a single case of fraud.
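
The arithmetic of the trap is easy to check with the counts from this example. The variable names below are chosen for illustration only.

```python
# Counts from the fraud example: the 'lazy' model predicts every transaction as normal
tp, tn, fp, fn = 0, 9990, 0, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
# Share of actual fraud cases caught (guarding against division by zero)
fraud_caught = tp / (tp + fn) if (tp + fn) else 0.0

print(f"Accuracy: {accuracy:.1%}")                   # 99.9%
print(f"Share of fraud caught: {fraud_caught:.1%}")  # 0.0%
```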

Classification Metric (2): Precision

Precision measures the proportion of all samples predicted as positive that are actually positive.

\[ \large \text{Precision} = \frac{TP}{TP + FP} \]

  • Business Meaning: ‘Of all the fraud alerts I raised, how many were real?’
  • Focus: The purity of the predictions. High precision means fewer false alarms (FP).
[Figure: Precision illustrated. Venn diagram of the predicted-positive set (TP + FP) and the actual-positive set (TP + FN), which overlap in TP; precision is the share of the predicted-positive set that falls in the overlap.]

Classification Metric (3): Recall

Recall measures the proportion of all actual positive samples that we successfully identified.

\[ \large \text{Recall} = \frac{TP}{TP + FN} \]

  • Business Meaning: ‘Of all the actual fraud cases that occurred, how many did my model catch?’
  • Focus: How complete the search is. High recall means fewer misses (FN).
[Figure: Recall illustrated. Venn diagram of the predicted-positive set (TP + FP) and the actual-positive set (TP + FN), which overlap in TP; recall is the share of the actual-positive set that falls in the overlap.]

Precision vs. Recall: An Eternal Trade-off

In practice, precision and recall usually pull in opposite directions: for a given model, raising one tends to lower the other.

  • Want to increase Recall? Lower the model’s ‘alarm’ threshold, accepting that many innocent cases will be flagged so that few guilty ones slip through. This increases false alarms (FP), thereby decreasing precision.
  • Want to increase Precision? Raise the model’s ‘alarm’ threshold. Only sound the alarm for cases with overwhelming evidence. This increases misses (FN), thereby decreasing recall.

Business Decision: We need to decide which balance point to choose in this trade-off based on the different business costs of FPs and FNs.
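
The effect of moving the threshold can be sketched on a handful of cases. The scores and labels below are invented for illustration, and `precision_recall` is a hypothetical helper, not a library function.

```python
def precision_recall(y_true, scores, threshold):
    """Flag a case as positive when its score is at or above the threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical model scores (predicted probability of fraud) and true labels
scores = [0.95, 0.80, 0.65, 0.55, 0.40, 0.35, 0.20, 0.10]
y_true = [1,    1,    0,    1,    0,    1,    0,    0]

for threshold in (0.3, 0.7):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.3: precision=0.67, recall=1.00
# threshold=0.7: precision=1.00, recall=0.50
```

On this toy data, the low threshold catches every positive at the price of two false alarms, while the high threshold raises no false alarms but misses half of the positives.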

Visualizing the Precision-Recall Trade-off

[Figure: Precision-recall trade-off curve, with recall and precision on the axes. Point A: high threshold (high precision, low recall). Point B: low threshold (low precision, high recall).]

Classification Metric (4): F1-Score

To balance precision and recall, we use the F1-Score, which is their harmonic mean.

\[ \large F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

  • The F1-Score will only be high if precision and recall are both relatively high.
  • If one of the metrics is low, the F1-Score will also be pulled down.
  • It is a more robust single evaluation metric than accuracy on imbalanced datasets.
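
Why a harmonic mean rather than a simple average? A short comparison makes the point. The helper `f1_from` and the precision and recall values are hypothetical, picked to show an extreme case.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical model: every alert is correct, but it catches only 10% of positives
precision, recall = 1.0, 0.1
arithmetic_mean = (precision + recall) / 2
f1 = f1_from(precision, recall)

print(f"Arithmetic mean: {arithmetic_mean:.3f}")  # 0.550
print(f"F1 (harmonic):   {f1:.3f}")               # 0.182
```

The simple average still looks respectable, while the F1-Score stays close to the weaker of the two metrics.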

Practical Example: Calculating Classification Metrics with scikit-learn

Let’s use a hypothetical credit default prediction example to demonstrate how to calculate these metrics.

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# --- Create mock data (imbalanced) ---
# Assume 1000 customers, 50 of whom actually default (5%)
y_true = np.array([0]*950 + [1]*50)
# Assume the model predicts 40 defaults, 30 of which are correct (TP=30) and 10 are wrong (FP=10)
# This means the model missed 20 of the 50 actual defaults (FN=20)
y_pred = np.array([0]*940 + [1]*10 + [0]*20 + [1]*30)
# Create a random permutation to shuffle the data
p = np.random.permutation(len(y_true))
y_true, y_pred = y_true[p], y_pred[p]

# 1. Calculate the metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f'Accuracy: {accuracy:.3f}')
print(f'Precision: {precision:.3f}')
print(f'Recall: {recall:.3f}')
print(f'F1-Score: {f1:.3f}')

# 2. Calculate and visualize the confusion matrix
cm = confusion_matrix(y_true, y_pred)
print('\nConfusion Matrix:\n', cm)

fig, ax = plt.subplots(figsize=(6, 4.5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Predicted No Default', 'Predicted Default'], 
            yticklabels=['Actual No Default', 'Actual Default'], ax=ax, annot_kws={'size': 14})
ax.set_ylabel('Actual Label', fontsize=12)
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_title('Confusion Matrix', fontsize=14)
plt.show()
Accuracy: 0.970
Precision: 0.750
Recall: 0.600
F1-Score: 0.667

Confusion Matrix:
 [[940  10]
 [ 20  30]]

Regression Metric: Mean Squared Error (MSE)

For regression tasks (predicting continuous values), the most common evaluation metric is the Mean Squared Error (MSE).

\[ \large \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \]

  • \(y_i\) is the true value for sample \(i\).
  • \(\hat{y}_i\) is the model’s prediction for sample \(i\).
  • \((y_i - \hat{y}_i)\) is the residual.

MSE calculates the average of the squared residuals. Because it uses squares, it penalizes large errors more heavily.
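
A small hand calculation shows the effect of squaring. In this sketch, two hypothetical models make the same total absolute error (4 units), but one spreads it evenly and the other concentrates it in a single large miss. The function name `mse` and the numbers are illustrative.

```python
def mse(y_true, y_pred):
    """Average of the squared residuals (y_i - y_hat_i)^2."""
    residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    return sum(r ** 2 for r in residuals) / len(residuals)

y_true = [10.0, 10.0, 10.0, 10.0]
spread_out = [11.0, 11.0, 11.0, 11.0]    # four errors of size 1
one_big_miss = [10.0, 10.0, 10.0, 14.0]  # one error of size 4

print(mse(y_true, spread_out))    # 1.0
print(mse(y_true, one_big_miss))  # 4.0
```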

Visualizing Mean Squared Error (MSE)

[Figure: Mean squared error. A regression line (model prediction ŷ) with actual data points (y); each residual (y − ŷ) is drawn as a dashed line and each squared residual (y − ŷ)² as a square, so larger residuals contribute disproportionately to the total error.]

The Essence of Learning: Optimization

We’ve defined the model and evaluation criteria, but how does a machine actually ‘learn’? The essence of learning is an Optimization process.

  1. We define a Loss Function \(J(\theta)\), which measures how bad the model’s predictions are on the training set with the current parameters \(\theta\). The lower the loss function’s value, the better the model performs.

  2. The goal of ‘learning’ is to find a set of parameters \(\theta^*\) that minimizes the loss function \(J(\theta)\).

For regression problems, MSE is the most commonly used loss function.

Loss Function: The ‘Navigation Map’ for Optimization

The loss function \(J(\theta)\) describes a ‘topographical map’, where the altitude is the loss value. Our goal is to start from a random point and walk to the lowest valley in this terrain.

Figure 2: The landscape of a loss function: Our goal is to find the global minimum.

Loss Function for Classification: Cross-Entropy

For classification problems, we commonly use Cross-Entropy Loss.

  • Intuitive Understanding: It measures the ‘distance’ between the probability distribution predicted by the model and the true probability distribution.
  • For binary classification: \[ \large J(\theta) = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] \] where \(y_i \in \{0, 1\}\) is the true label, and \(\hat{y}_i \in [0, 1]\) is the model’s predicted probability of the class being 1.

This function has a nice property: when the model makes a very confident and wrong prediction, the loss becomes very large, giving the model a strong ‘penalty’ signal.
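
The penalty for confident mistakes can be checked numerically. The sketch below implements the binary formula above in plain Python; the function name and the small clipping constant `eps` (which keeps the logarithm away from zero) are assumptions added for the example.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; eps keeps log() away from 0 (an assumed safeguard)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# A single positive example (y = 1) under three hypothetical predicted probabilities
for p in (0.99, 0.5, 0.01):
    print(f"predicted p={p}: loss={binary_cross_entropy([1], [p]):.3f}")
# predicted p=0.99: loss=0.010
# predicted p=0.5: loss=0.693
# predicted p=0.01: loss=4.605
```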

Pillar IV: Define ‘Learning’ (Optimization)

We have the map (the loss function), but how do we find the way down the mountain?

The most classic and important method is Gradient Descent.

Core Idea: Imagine you are on a foggy mountainside, and you can only see the small patch of ground at your feet. To get down the fastest, you should take a step in the direction of the steepest descent from your current position.

Mathematically, the negative of the gradient of a function at a point is the direction in which the function’s value decreases most rapidly.

The Mathematical Principle of Gradient Descent

Gradient descent is an iterative algorithm. At each step t, it updates the parameters \(\theta\) according to the following rule:

\[ \large \theta_{t+1} = \theta_t - \eta \nabla J(\theta_t) \]

  • \(\theta_t\): The value of the parameters at step t.
  • \(\nabla J(\theta_t)\): The gradient of the loss function \(J\) at \(\theta_t\). It is a vector pointing in the direction of the fastest increase in the function’s value.
  • \(\eta\): The Learning Rate, a hyperparameter that controls how far we step each time.
  • \(-\eta \nabla J(\theta_t)\): We take a small step in the direction opposite to the gradient.

We repeat this process until the parameters converge.
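
The update rule can be traced on a toy problem. This sketch assumes a one-parameter loss \(J(\theta) = (\theta - 3)^2\), chosen only because its gradient \(2(\theta - 3)\) and its minimum at \(\theta = 3\) are known in advance; the function name, starting point, and learning rate are illustrative.

```python
def gradient_descent(grad, theta0, eta, n_steps):
    """Repeat theta <- theta - eta * grad(theta) and return the visited points."""
    path = [theta0]
    theta = theta0
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)
        path.append(theta)
    return path

# Toy loss J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3); minimum at theta = 3
def grad_J(theta):
    return 2.0 * (theta - 3.0)

path = gradient_descent(grad_J, theta0=0.0, eta=0.1, n_steps=50)
print(path[:4])                                  # roughly 0, 0.6, 1.08, 1.464
print(f"theta after 50 steps: {path[-1]:.4f}")   # close to 3
```

Each step here removes 20% of the remaining distance to the minimum, so the steps shrink as the gradient flattens out.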

Visualizing Gradient Descent

[Figure: Gradient descent optimization path. Contour plot of the loss; starting from θ₀, a first step −η∇J(θ₀) leads to θ₁, then θ₂, and onward toward the minimum θ*.]

The Learning Rate (η): Determining Optimization Speed and Success

Figure 3: The impact of the learning rate (η) on the gradient descent process
  • \(\eta\) Too Small: Crawls down the mountain like a snail, converging very slowly.
  • \(\eta\) Too Large: Descends like a drunkard, potentially ‘overshooting’ and oscillating around the minimum, or even ‘jumping’ to the other side of the mountain, causing divergence.
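
Both failure modes can be reproduced on a toy loss. The sketch assumes \(J(\theta) = (\theta - 3)^2\) and three hand-picked learning rates; the function name and all values are illustrative.

```python
def final_distance(eta, n_steps=50, theta0=0.0, target=3.0):
    """Run gradient descent on J(theta) = (theta - target)^2; return |theta - target|."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - eta * 2.0 * (theta - target)
    return abs(theta - target)

for eta in (0.01, 0.4, 1.1):
    print(f"eta={eta}: distance from minimum after 50 steps = {final_distance(eta):.3g}")
```

For this particular loss, each step multiplies the distance to the minimum by \(|1 - 2\eta|\): 0.98 for the smallest rate (still about 1.1 away after 50 steps), 0.2 for the middle one (essentially converged), and 1.2 for the largest, where every step overshoots further and the iterates diverge.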

Variants of Gradient Descent: Handling Large-Scale Data

When our training set is very large, computing the gradient over the entire dataset becomes very time-consuming. For this reason, variants of gradient descent have been developed.

[Figure: Convergence paths of different gradient descent algorithms from Start to End: BGD (smooth), SGD (noisy), mini-batch (a compromise).]

Comparison of Gradient Descent Variants

| Type | Gradient Calculation Method | Advantages | Disadvantages |
|---|---|---|---|
| Batch GD (BGD) | Uses all training samples | Accurate gradient, smooth convergence | High computational cost, slow |
| Stochastic GD (SGD) | Uses one randomly picked sample | Fast, can escape local minima | High variance in updates, noisy path |
| Mini-batch GD (MBGD) | Uses a small batch of samples (e.g., 32) | Combines benefits of BGD and SGD, the default choice | Requires tuning batch size |
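
The three variants differ only in how many samples feed each gradient estimate, which a single function with a `batch_size` argument can show. This is a sketch under assumed settings: simulated data from \(y = 2 + 3x\) plus noise, a one-feature linear model, and hand-picked values for the learning rate and number of epochs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: y = 2 + 3x + noise
N = 200
x = rng.uniform(-1.0, 1.0, size=N)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.1, size=N)

def minibatch_gd(x, y, batch_size, eta=0.1, n_epochs=200):
    """Fit y ~ b + w*x by gradient descent on the MSE of each batch.

    batch_size = len(x) gives batch GD, 1 gives SGD, anything between is mini-batch.
    """
    b, w = 0.0, 0.0
    n = len(x)
    for _ in range(n_epochs):
        order = rng.permutation(n)                       # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            error = (b + w * x[batch]) - y[batch]        # prediction minus truth
            b -= eta * 2.0 * error.mean()                # dMSE/db on this batch
            w -= eta * 2.0 * (error * x[batch]).mean()   # dMSE/dw on this batch
    return b, w

b, w = minibatch_gd(x, y, batch_size=32)
print(f"intercept ~ {b:.2f}, slope ~ {w:.2f}")  # should land near 2 and 3
```

Calling `minibatch_gd(x, y, batch_size=N)` runs the same loop as batch gradient descent, and `batch_size=1` turns it into SGD.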

Advanced Optimizers: Making the Descent Smarter

Basic gradient descent can struggle in complex loss landscapes. In modern deep learning, we use more advanced optimizers.

  • Momentum
    • Idea: Simulates momentum from physics. The update considers not only the current gradient but also the previous update direction, like a ball rolling down a hill.
    • Effect: Helps the algorithm “power through” flat regions and local minima, accelerating convergence.
  • Adam (Adaptive Moment Estimation)
    • Idea: Combines momentum with adaptive learning rates (adjusting the learning rate independently for each parameter).
    • Effect: Performs well across a wide range of tasks and is often the go-to default optimizer.

For beginners: Adam with its default settings is usually a reasonable starting point.
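
The momentum idea can be sketched in its classical 'heavy-ball' form, where a velocity term carries over a fraction of the previous update. The toy loss \(J(\theta) = (\theta - 3)^2\), the function name, and the values of `eta` and `beta` are assumptions for illustration; Adam adds per-parameter step sizes on top of this and is not shown here.

```python
def momentum_descent(grad, theta0, eta=0.1, beta=0.9, n_steps=500):
    """Gradient descent with momentum: the velocity accumulates past gradients."""
    theta, velocity = theta0, 0.0
    for _ in range(n_steps):
        velocity = beta * velocity - eta * grad(theta)  # keep part of the last update
        theta = theta + velocity
    return theta

# Toy loss J(theta) = (theta - 3)^2 with gradient 2 * (theta - 3); minimum at theta = 3
def grad_J(theta):
    return 2.0 * (theta - 3.0)

theta_final = momentum_descent(grad_J, theta0=0.0)
print(f"theta after momentum descent: {theta_final:.4f}")  # 3.0000
```

Setting `beta=0.0` removes the velocity memory and recovers plain gradient descent.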

Comprehensive Case Study: Predicting House Prices with Linear Regression

Let’s tie everything we’ve learned today together to complete a full machine learning project using scikit-learn.

  • Task: Predict the median house value in California districts (a regression problem).
  • Data: Scikit-learn’s built-in California Housing dataset.
  • Model: Linear Regression.
  • Evaluation: Mean Squared Error (MSE).
  • Optimization: Scikit-learn’s LinearRegression solves the least-squares problem directly with a linear algebra solver rather than by iterative gradient descent, so there is no learning rate to tune.

Case Study (1): Loading and Splitting the Data

The first step is always to prepare the data: load it and split it into training and test sets.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='MedHouseVal')

# 2. Split the data into training and testing sets (80% train, 20% test)
# random_state ensures the split is the same every time for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print('Training set size:', X_train.shape)
print('Test set size:', X_test.shape)
print('\nPreview of some feature data:')
print(X_train.head())
Training set size: (16512, 8)
Test set size: (4128, 8)

Preview of some feature data:
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
14196  3.2596      33.0  5.017657   1.006421      2300.0  3.691814     32.71   
8267   3.8125      49.0  4.473545   1.041005      1314.0  1.738095     33.77   
17445  4.1563       4.0  5.645833   0.985119       915.0  2.723214     34.66   
14265  1.9425      36.0  4.002817   1.033803      1418.0  3.994366     32.69   
2271   3.5542      43.0  6.268421   1.134211       874.0  2.300000     36.78   

       Longitude  
14196    -117.03  
8267     -118.16  
17445    -120.48  
14265    -117.11  
2271     -119.80  

Case Study (2): Training the Model and Making Predictions

We use the training set to ‘teach’ our model, and then test its learning on the test set.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
# (Continues from the previous code block with variables X_train, y_train, X_test, y_test)

# 1. Initialize a Linear Regression model
model = LinearRegression()

# 2. Train the model on the training set (i.e., find the best parameters θ)
# The .fit() method executes the learning/optimization process
model.fit(X_train, y_train)

# 3. Make predictions on the test set
# The .predict() method uses the learned model to make predictions
y_pred = model.predict(X_test)

# Print some of the prediction results
print('Actual house prices (first 5):', np.round(y_test.head().values, 2))
print('Predicted house prices (first 5):', np.round(y_pred[:5], 2))
Actual house prices (first 5): [0.48 0.46 5.   2.19 2.78]
Predicted house prices (first 5): [0.72 1.76 2.71 2.84 2.6 ]

Case Study (3): Evaluating the Model and Interpreting Results

Finally, we calculate our evaluation metric and see what the model has learned.

import pandas as pd
from sklearn.metrics import mean_squared_error
# (Continues from the previous code block with variables y_test, y_pred, model, X)

# 4. Evaluate the model's performance on the test set
mse = mean_squared_error(y_test, y_pred)
print(f"\nThe model's Mean Squared Error (MSE) on the test set is: {mse:.4f}")

# 5. Examine the learned parameters (coefficients)
# model.coef_ corresponds to the slope coefficients in the regression equation
coef_df = pd.DataFrame(model.coef_, index=X.columns, columns=['Coefficient'])
print('\nLearned Coefficients (θ):')
print(coef_df.head())
# model.intercept_ corresponds to the intercept term
print(f'\nLearned Intercept: {model.intercept_:.4f}')

The model's Mean Squared Error (MSE) on the test set is: 0.5559

Learned Coefficients (θ):
            Coefficient
MedInc         0.448675
HouseAge       0.009724
AveRooms      -0.123323
AveBedrms      0.783145
Population    -0.000002

Learned Intercept: -37.0233

Case Study (4): Visualizing the Prediction Results

Comparing the model’s predictions to the actual values is the most intuitive way to check its performance.

Figure 4: Model Predicted House Value vs. Actual House Value

If the predictions were perfect, all the points would lie on the red dashed line.

Chapter Summary: The Four Pillars of Machine Learning

Today, we built a complete framework for thinking about machine learning problems.

1. Frame the Problem

  • Task: Regression vs. Classification
  • Learning Type: Supervised vs. Unsupervised

2. Define the Model

  • Choose a function \(f(x, \theta)\)
  • e.g., Linear Regression

3. Define ‘Good’ (Evaluation)

  • Core: Evaluate on the test set
  • Regression: MSE, R²
  • Classification: F1-Score, Recall

4. Define ‘Learning’ (Optimization)

  • Minimize a Loss Function \(J(\theta)\)
  • Gradient Descent is the core algorithm

Thank you!

Q & A