01 Foundations of Machine Learning

The Core Question: Why Should Economists Learn Machine Learning?

Traditional econometrics and machine learning represent two different ‘cultures’ for solving problems.

Econometrics: The core is Causal Inference.
- Goal: To understand ‘why’ the world works the way it does.
- Concern: Unbiasedness and consistency of parameter estimates.
- Example: By how much did the minimum wage increase change the unemployment rate?
Machine Learning: The core is Prediction.
- Goal: To predict ‘what’ will happen in the world.
- Concern: The model’s generalization ability on unseen data.
- Example: Based on current macroeconomic indicators, predict next quarter’s GDP growth rate.

A Visual Contrast of the Two Cultures

The two methodologies have fundamental differences in model selection and objectives.

Why Prediction Matters to Economists

In a data-driven era, predictive power is itself a potent economic tool.

Financial Markets: Predicting asset prices, volatility, and credit risk.
Macroeconomics: Forecasting inflation, GDP growth, and unemployment to inform policy.
Business Decisions: Forecasting product sales, customer churn, and supply chain demand.
Policy Evaluation: Predicting the likely economic impact of a policy (e.g., a tax cut).

Causal inference explains the past; accurate prediction provides insight into the future. Combined, their power is multiplied.

Today’s Goal: Build a Complete Mental Framework for Machine Learning

After this chapter, you will be able to systematically understand any machine learning project from the perspective of ‘The Four Pillars’.

Pillar I: Frame the Problem (Framing)

This is the starting point for all work.

Before any technical details, you must clearly define the business problem and translate it into a specific machine learning task.

What are you trying to predict?
- A continuous value (e.g., tomorrow’s stock price) \(\rightarrow\) Regression
- A discrete category (e.g., whether a customer will default) \(\rightarrow\) Classification
What data do you have?
- Does the data come with the ‘answer’ you want to predict (i.e., a label y)?
  - Yes \(\rightarrow\) Supervised Learning
  - No \(\rightarrow\) Unsupervised Learning

Pillar II: Define the Model (Modeling)

A model is essentially a mathematical function \(f(x, \theta)\) that tries to capture the relationship between input features \(x\) and the output \(y\).

\(x\): The input feature vector (e.g., house area, location).
\(\theta\): The model’s parameters. These are the values that need to be determined through ‘learning’ (e.g., coefficients in a linear regression).
\(f\): The form of the function. This is what we, as modelers, choose.
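
As a minimal sketch of this notation, a linear model can be written as an ordinary Python function of the inputs and the parameters. The function name `f`, the feature values, and the parameter values below are made up for illustration; nothing here has been learned from data.

```python
import numpy as np

def f(x, theta):
    """Linear model: theta[0] is the intercept, theta[1:] are the feature weights."""
    return theta[0] + np.dot(theta[1:], x)

# Hypothetical house: area of 120 (square meters) and 3 bedrooms
x = np.array([120.0, 3.0])
# Hypothetical parameters: intercept, weight on area, weight on bedrooms
theta = np.array([50.0, 2.0, 10.0])

prediction = f(x, theta)
print(prediction)  # 50 + 2*120 + 10*3 = 320.0
```

Choosing a different form for `f` (a tree ensemble, a neural network) changes what the function body looks like, but the roles of `x` and `theta` stay the same.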

The range of models is vast:

Simple Models: Linear Regression, Logistic Regression (highly interpretable).
Complex Models: Random Forest, Gradient Boosting Trees, Neural Networks (powerful prediction).

Pillar III: Define ‘Good’ (Evaluation)

How do we objectively measure how good a model is? We need an evaluation metric.

This metric must reflect the business objective.
It must be calculated on data the model has never seen before (the test set).

Common Evaluation Metrics:

Regression Tasks:
- Mean Squared Error (MSE)
- R-squared (R²)
Classification Tasks:
- Accuracy
- Precision, Recall, F1-Score

Pillar IV: Define ‘Learning’ (Optimization)

The process of ‘learning’ is the process of automatically finding the best parameters \(\theta\).

We first define a Loss Function \(J(\theta)\), which measures how bad the model’s predictions are with the current parameters \(\theta\). The smaller the loss, the better the model.
Then, we use an Optimizer, such as Gradient Descent, to systematically and iteratively adjust the parameters \(\theta\) to find the set of values \(\theta^*\) that minimizes the loss function \(J(\theta)\).

\[ \large \theta^* = \arg\min_{\theta} J(\theta) \]

What is Machine Learning? Learning Functions from Data

The essence of Machine Learning (ML) is to have a computer learn a function from data automatically, rather than having a programmer write its rules explicitly. The learned function can then make predictions on data it has not seen before.

Data Representation in ML: Everything is a Vector

In machine learning, we need to convert real-world objects into a language computers understand—numbers.

Dataset: A collection of N samples \(X = \{x_1, x_2, \dots, x_N\}\).
Sample: A single data point (a house, a customer).
Feature: A dimension describing a sample (area, number of bedrooms).
Feature Vector: A vector of all features for one sample \(x = (x_{\text{area}}, x_{\text{bedrooms}}, \dots)^T\).
Label: The target value we want to predict, y (house price).

From the Real World to Mathematical Objects

This transformation process is at the heart of data preprocessing.

A Sample = A Point in d-Dimensional Space

Once we represent samples as feature vectors, each sample can be viewed as a point in a d-dimensional feature space.

This provides a geometric foundation for understanding machine learning algorithms.

Case Study: Understanding Feature Vectors with Economic Data

Suppose we want to predict U.S. Personal Consumption Expenditures (PCE). We can get data from FRED.

Date	Disposable Personal Income (DPI)	Consumer Confidence Index (CONF)	Personal Consumption Expenditures (PCE) (Label y)
2023-01-01	19,800	102.9	18,000
2023-02-01	19,850	103.4	18,050
…	…	…	…

A sample is one row of data, representing one month.
The feature vector is \(x_t = (\text{DPI}_t, \text{CONF}_t)^T\). This is a point in a 2D space.
The label is \(y_t = \text{PCE}_t\).
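
To make the representation concrete, the sketch below stores the two table rows shown above as a feature matrix and a label vector with NumPy. The variable names are illustrative choices, and only the two printed rows are used.

```python
import numpy as np

# The two rows shown in the table above (values copied from the table)
dpi = [19800.0, 19850.0]
conf = [102.9, 103.4]
pce = [18000.0, 18050.0]

# Stack the features column-wise: each row of X is one feature vector x_t
X = np.column_stack([dpi, conf])
y = np.array(pce)

print(X.shape)  # (2, 2): N = 2 samples, d = 2 features
print(X[0])     # feature vector for 2023-01-01
print(y[0])     # its label
```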

The Three Main Categories of Machine Learning

Based on the data we have (especially whether we have the label y), machine learning tasks can be divided into three main categories.

Supervised Learning
Unsupervised Learning
Reinforcement Learning

We will introduce them one by one.

Category 1: Supervised Learning

Used when the data comes with clear ‘answers’ or ‘labels’.

Category 2: Unsupervised Learning

Used when data has no ‘answers’, and we want to discover its internal structure.

Category 3: Reinforcement Learning

Used when we need to learn an optimal strategy through ‘trial and error’ with an environment.

Focusing on Supervised Learning: Regression vs. Classification

Many prediction tasks in economics and business fall under supervised learning. It can be further divided into two major tasks based on the type of the label y.

Regression:
- Goal: Predict a continuous numerical value.
- Output: \(y \in \mathbb{R}\)
- Examples: Predicting GDP growth rate, company sales figures.
Classification:
- Goal: Predict a discrete category.
- Output: \(y \in \{C_1, C_2, \dots, C_K\}\)
- Examples: Determining if a customer will churn, if a transaction is fraudulent.

Geometric Intuition of Regression and Classification

Practical Context: The Keynesian Consumption Function

Before diving into code, let’s revisit a classic economic theory: the Keynesian Consumption Function.

\[ C = a + b Y_d \]

\(C\): Total consumption
\(Y_d\): Disposable income
\(a\): Autonomous consumption (consumption even with zero income)
\(b\): Marginal Propensity to Consume (MPC, how much of an extra unit of income is spent)

This is a classic linear relationship. We can use the simplest machine learning model—linear regression—to estimate this relationship from data.

Practical Example: A Simple Regression Analysis in Python

Task: Use U.S. Disposable Personal Income (DPI) to predict Personal Consumption Expenditures (PCE).

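This is a typical regression problem. We will use the statsmodels library. To keep the example self-contained, the code simulates the data instead of downloading the actual series.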

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# --- Create mock data ---
# Assume the true relationship is PCE = 500 + 0.9 * DPI + noise
np.random.seed(42)
dpi_mock = np.linspace(10000, 20000, 150)
noise = np.random.normal(0, 300, 150)
pce_mock = 500 + 0.9 * dpi_mock + noise
df = pd.DataFrame({'DPI': dpi_mock, 'PCE': pce_mock})

# Define independent (X) and dependent (y) variables
X = df['DPI']
y = df['PCE']
X = sm.add_constant(X) # Add an intercept term

# Fit the OLS model
model = sm.OLS(y, X).fit()

# Visualization
fig, ax = plt.subplots(figsize=(8, 5))

# Plot scatter plot and the fitted line
ax.scatter(df['DPI'], df['PCE'], alpha=0.6, label='Simulated Monthly Data', color='gray')
ax.plot(df['DPI'], model.predict(X), color='crimson', linewidth=2.5, label='OLS Fit Line')

ax.set_title('Keynesian Consumption Function: Consumption Determined by Income', fontsize=16, fontweight='bold')
ax.set_xlabel('Disposable Personal Income (Billions of Dollars)', fontsize=12)
ax.set_ylabel('Personal Consumption Expenditures (Billions of Dollars)', fontsize=12)

# Set legend
ax.legend(fontsize=11, loc='upper left')

# Set tick labels
ax.tick_params(axis='both', which='major', labelsize=11)

# Add grid for readability
ax.grid(True, alpha=0.3)

fig.tight_layout()
plt.show()

# Print partial model summary
print(f'R-squared: {model.rsquared:.4f}')
print(f"DPI Coefficient (estimated MPC): {model.params['DPI']:.4f}")

Figure 1: Simulated Personal Consumption Expenditures (PCE) vs. Disposable Personal Income (DPI)

R-squared: 0.9886
DPI Coefficient (estimated MPC): 0.9040

Pillar III: How to Evaluate if a Model is Good?

How do we know if the model we trained, \(f(x, \theta)\), is a ‘good’ model?

Core Principle: The model’s performance on data it has never seen before (the test set) is the only true measure of its ability to generalize.

This leads to the most important practice in machine learning: the Train-Test Split.

The Golden Rule: Train-Test Split

We must divide our data into at least two parts.

Why Must We Split? The Ghost of Overfitting

Overfitting occurs when a model learns the training data “too well,” to the point that it memorizes the noise and random fluctuations in the data as if they were general patterns.

Symptom: Performs extremely well on the training set, but very poorly on the test set.
Cause: The model is too complex, with too much freedom relative to the amount of data.

The test set is like a ‘mock exam’—it fairly tests whether the model has truly learned the subject or just memorized the answers.
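
The symptom can be reproduced in a few lines. The sketch below is an illustration under assumed settings (a truly linear relationship, 20 training points, polynomial degrees 1 and 10, a fixed random seed); it is not tied to any dataset in this chapter. A more flexible polynomial always fits the training points at least as well as the straight line, and on a typical run its error on fresh data is noticeably worse.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples from a truly linear relationship y = 1 + 2x + noise."""
    x = rng.uniform(-1, 1, n)
    y = 1 + 2 * x + rng.normal(0, 0.5, n)
    return x, y

x_train, y_train = make_data(20)    # small training set
x_test, y_test = make_data(200)     # fresh data the model never sees while fitting

def mse(model, x, y):
    return float(np.mean((y - model(x)) ** 2))

results = {}
for degree in (1, 10):
    model = Polynomial.fit(x_train, y_train, degree)  # least-squares polynomial fit
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print(f"degree {degree:2d}: train MSE = {results[degree][0]:.3f}, "
          f"test MSE = {results[degree][1]:.3f}")
```

Comparing the two printed rows shows the gap between training and test error that the train-test split is designed to expose.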

Cornerstone of Classification Evaluation: The Confusion Matrix

For binary classification problems (e.g., predicting customer default), all evaluation metrics derive from a simple table: the Confusion Matrix.

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

Positive: The event we care about, like ‘default’ or ‘fraud’.
Negative: The other case, like ‘no default’.
True/False: Refers to whether the prediction was correct.

Understanding the Four Quadrants of the Confusion Matrix

TP (True Positive): Correct prediction, the customer did default. (Hit)
TN (True Negative): Correct prediction, the customer did not default. (Correct Rejection)
FP (False Positive): Incorrect prediction, predicted default, but they didn’t. (False Alarm, Type I Error)
FN (False Negative): Incorrect prediction, predicted no default, but they did. (Miss, Type II Error)

In fields like financial risk management, the cost of an FN (missing a bad customer) is often far greater than the cost of an FP (misjudging a good customer).

Classification Metric (1): Accuracy

Accuracy measures the proportion of total samples that the model predicted correctly.

\[ \large \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Samples}} = \frac{TP + TN}{TP + TN + FP + FN} \]

Advantage: Very intuitive and easy to understand.

Disadvantage: Highly misleading on imbalanced datasets.

The Accuracy Trap: An Example

Imagine a credit card fraud detection scenario:

Total transactions: 10,000
Normal transactions: 9,990 (99.9%)
Fraudulent transactions: 10 (0.1%)

How does a ‘lazy’ model that predicts all transactions as ‘normal’ perform?

TP = 0, TN = 9990
FP = 0, FN = 10
Accuracy = (0 + 9990) / 10000 = 99.9%

This model has extremely high accuracy but is completely useless, as it fails to identify a single case of fraud.
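
The arithmetic of the example can be checked directly. The sketch below uses only the counts stated above; the second ratio, the share of actual fraud cases that were caught, is the quantity introduced as recall later in this chapter.

```python
# Counts from the fraud example above
tp, tn, fp, fn = 0, 9990, 0, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
fraud_caught = tp / (tp + fn)  # share of actual fraud cases the model identified

print(f"Accuracy:     {accuracy:.1%}")      # 99.9%
print(f"Fraud caught: {fraud_caught:.1%}")  # 0.0%
```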

Classification Metric (2): Precision

Precision measures the proportion of all samples predicted as positive that are actually positive.

\[ \large \text{Precision} = \frac{TP}{TP + FP} \]

Business Meaning: ‘Of all the fraud alerts I raised, how many were real?’
Focus: The purity of the predictions. High precision means fewer false alarms (FP).

Classification Metric (3): Recall

Recall measures the proportion of all actual positive samples that we successfully identified.

\[ \large \text{Recall} = \frac{TP}{TP + FN} \]

Business Meaning: ‘Of all the actual fraud cases that occurred, how many did my model catch?’
Focus: How complete the search is. High recall means fewer misses (FN).

Precision vs. Recall: An Eternal Trade-off

In practice, precision and recall usually pull in opposite directions: for a given model, improving one tends to cost the other.

Want to increase Recall? Lower the model’s ‘alarm’ threshold, accepting that more innocent cases will be flagged so that fewer guilty ones slip through. This increases false alarms (FP), typically decreasing precision.
Want to increase Precision? Raise the model’s ‘alarm’ threshold. Only sound the alarm for cases with overwhelming evidence. This increases misses (FN), thereby decreasing recall.

Business Decision: We need to decide which balance point to choose in this trade-off based on the different business costs of FPs and FNs.
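
The effect of moving the threshold can be sketched with simulated model scores. Everything below is assumed for illustration: the class sizes, the score distributions, and the three thresholds. Recall can only fall (or stay flat) as the threshold rises; precision usually rises but is not guaranteed to do so at every step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model scores: actual positives tend to score higher than negatives
y_true = np.array([0] * 900 + [1] * 100)
scores = np.concatenate([rng.normal(0.3, 0.15, 900), rng.normal(0.6, 0.15, 100)])

def precision_recall(threshold):
    """Flag every sample whose score reaches the threshold, then count outcomes."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn)
    return float(precision), float(recall)

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(threshold)
    print(f"threshold = {threshold:.1f}: precision = {p:.2f}, recall = {r:.2f}")
```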

Visualizing the Precision-Recall Trade-off

Classification Metric (4): F1-Score

To balance precision and recall, we use the F1-Score, which is their harmonic mean.

\[ \large F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]

The F1-Score will only be high if precision and recall are both relatively high.
If one of the metrics is low, the F1-Score will also be pulled down.
It is a more robust single evaluation metric than accuracy on imbalanced datasets.
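
A short comparison shows why the harmonic mean is used instead of a simple average. The two models below are hypothetical; the helper name `f1_from` is an illustrative choice, not a library function.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (defined as 0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical model A is balanced; model B has perfect precision but almost no recall
for name, p, r in [("A", 0.75, 0.60), ("B", 1.00, 0.02)]:
    arithmetic = (p + r) / 2
    print(f"Model {name}: arithmetic mean = {arithmetic:.3f}, F1 = {f1_from(p, r):.3f}")
```

The simple average still gives model B a score above 0.5, while its F1-Score stays below 0.04, reflecting that it misses almost every positive case.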

Practical Example: Calculating Classification Metrics with scikit-learn

Let’s use a hypothetical credit default prediction example to demonstrate how to calculate these metrics.

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# --- Create mock data (imbalanced) ---
# Assume 1000 customers, 50 of whom actually default (5%)
y_true = np.array([0]*950 + [1]*50)
# Assume the model predicts 40 defaults, 30 of which are correct (TP=30) and 10 are wrong (FP=10)
# This means the model missed 20 of the 50 actual defaults (FN=20)
y_pred = np.array([0]*940 + [1]*10 + [0]*20 + [1]*30)
# Create a random permutation to shuffle the data
p = np.random.permutation(len(y_true))
y_true, y_pred = y_true[p], y_pred[p]

# 1. Calculate the metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f'Accuracy: {accuracy:.3f}')
print(f'Precision: {precision:.3f}')
print(f'Recall: {recall:.3f}')
print(f'F1-Score: {f1:.3f}')

# 2. Calculate and visualize the confusion matrix
cm = confusion_matrix(y_true, y_pred)
print('\nConfusion Matrix:\n', cm)

fig, ax = plt.subplots(figsize=(6, 4.5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Predicted No Default', 'Predicted Default'], 
            yticklabels=['Actual No Default', 'Actual Default'], ax=ax, annot_kws={'size': 14})
ax.set_ylabel('Actual Label', fontsize=12)
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_title('Confusion Matrix', fontsize=14)
plt.show()

Accuracy: 0.970
Precision: 0.750
Recall: 0.600
F1-Score: 0.667

Confusion Matrix:
 [[940  10]
 [ 20  30]]

Regression Metric: Mean Squared Error (MSE)

For regression tasks (predicting continuous values), the most common evaluation metric is the Mean Squared Error (MSE).

\[ \large \text{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \]

\(y_i\) is the true value for sample \(i\).
\(\hat{y}_i\) is the model’s prediction for sample \(i\).
\((y_i - \hat{y}_i)\) is the residual.

MSE calculates the average of the squared residuals. Because it uses squares, it penalizes large errors more heavily.
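
A tiny worked example, with three made-up observations, shows how a single large residual dominates the average:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 10.0])

residuals = y_true - y_pred      # 1, 0 and -3
mse = np.mean(residuals ** 2)    # (1 + 0 + 9) / 3

print(residuals)
print(f"MSE: {mse:.3f}")  # 3.333
```

The residual of -3 contributes 9 of the 10 units of squared error, even though it is only three times the size of the residual of 1.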

Visualizing Mean Squared Error (MSE)

The Essence of Learning: Optimization

We’ve defined the model and evaluation criteria, but how does a machine actually ‘learn’? The essence of learning is an Optimization process.

Loss Function: The ‘Navigation Map’ for Optimization

We define a Loss Function \(J(\theta)\), which measures how bad the model’s predictions are on the training set with the current parameters \(\theta\). The lower the loss function’s value, the better the model performs.
The goal of ‘learning’ is to find a set of parameters \(\theta^*\) that minimizes the loss function \(J(\theta)\).

For regression problems, MSE is the most commonly used loss function.

Loss Function for Classification: Cross-Entropy

For classification problems, we commonly use Cross-Entropy Loss.

Intuitive Understanding: It measures the ‘distance’ between the probability distribution predicted by the model and the true probability distribution.
For binary classification:

\[ \large J(\theta) = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] \]

where \(y_i \in \{0, 1\}\) is the true label, and \(\hat{y}_i \in [0, 1]\) is the model’s predicted probability that sample \(i\) belongs to class 1.

This function has a nice property: when the model makes a very confident and wrong prediction, the loss becomes very large, giving the model a strong ‘penalty’ signal.
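
This penalty behaviour is easy to see numerically. The sketch below implements the binary formula with NumPy for a single sample whose true label is 1; the function name and the three predicted probabilities are illustrative.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob):
    """Average binary cross-entropy; y_prob must lie strictly between 0 and 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(-np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob)))

# One sample with true label 1, under three hypothetical predicted probabilities
for p in (0.99, 0.50, 0.01):
    print(f"predicted P(y=1) = {p:.2f} -> loss = {binary_cross_entropy([1], [p]):.3f}")
```

The loss rises from about 0.010 for the confident correct prediction to about 4.605 for the confident wrong one.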

Pillar IV: Define ‘Learning’ (Optimization)

We have the map (the loss function), but how do we find the way down the mountain?

The most classic and important method is Gradient Descent.

Core Idea: Imagine you are on a foggy mountainside, and you can only see the small patch of ground at your feet. To get down the fastest, you should take a step in the direction of the steepest descent from your current position.

Mathematically, the negative of the gradient of a function at a point is the direction in which the function’s value decreases most rapidly.

The Mathematical Principle of Gradient Descent

Gradient descent is an iterative algorithm. At each step t, it updates the parameters \(\theta\) according to the following rule:

\[ \large \theta_{t+1} = \theta_t - \eta \nabla J(\theta_t) \]

\(\theta_t\): The value of the parameters at step t.
\(\nabla J(\theta_t)\): The gradient of the loss function \(J\) at \(\theta_t\). It is a vector pointing in the direction of the fastest increase in the function’s value.
\(\eta\): The Learning Rate, a hyperparameter that controls how far we step each time.
\(-\eta \nabla J(\theta_t)\): We take a small step in the direction opposite to the gradient.

We repeat this process until the parameters converge.
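
The update rule can be sketched for a linear regression with an MSE loss. The data, the true parameters (2 and 3), the learning rate, and the number of iterations are all assumptions chosen for illustration; the closed-form least-squares solution is computed only as a reference to compare against.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 2 + 3x + noise (hypothetical true parameters)
x = np.linspace(0, 1, 50)
y = 2 + 3 * x + rng.normal(0, 0.1, 50)
X = np.column_stack([np.ones_like(x), x])  # column of ones for the intercept

def gradient(theta):
    """Gradient of J(theta) = mean((X @ theta - y)^2) with respect to theta."""
    return 2 / len(y) * X.T @ (X @ theta - y)

theta = np.zeros(2)  # starting point theta_0
eta = 0.1            # learning rate
for t in range(5000):
    theta = theta - eta * gradient(theta)  # theta_{t+1} = theta_t - eta * grad J(theta_t)

theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # reference solution
print("Gradient descent:", np.round(theta, 4))
print("Closed-form OLS: ", np.round(theta_ols, 4))
```

With these settings the two parameter vectors should agree to several decimal places, which is what convergence means in this simple convex problem.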

Visualizing Gradient Descent

The Learning Rate (η): Determining Optimization Speed and Success

Figure 3: The impact of the learning rate (η) on the gradient descent process

\(\eta\) Too Small: Crawls down the mountain like a snail, converging very slowly.
\(\eta\) Too Large: Descends like a drunkard, potentially ‘overshooting’ and oscillating around the minimum, or even ‘jumping’ to the other side of the mountain, causing divergence.
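
Both failure modes show up even on the simplest possible loss, \(J(\theta) = \theta^2\), whose gradient is \(2\theta\). The three learning rates, the starting point, and the step count below are arbitrary illustrative choices.

```python
def run_gradient_descent(eta, steps=50, theta0=1.0):
    """Minimize J(theta) = theta^2, whose gradient is 2 * theta."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * 2 * theta
    return theta

for eta in (0.001, 0.4, 1.1):
    print(f"eta = {eta}: theta after 50 steps = {run_gradient_descent(eta):.4g}")
```

Each step multiplies \(\theta\) by \(1 - 2\eta\). With \(\eta = 0.001\) that factor is 0.998, so \(\theta\) has barely moved after 50 steps; with \(\eta = 0.4\) it is 0.2 and \(\theta\) collapses to essentially zero; with \(\eta = 1.1\) it is \(-1.2\), so \(\theta\) flips sign and grows at every step.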

Variants of Gradient Descent: Handling Large-Scale Data

When our training set is very large, computing the gradient over the entire dataset becomes very time-consuming. For this reason, variants of gradient descent have been developed.

Comparison of Gradient Descent Variants

Type	Gradient Calculation Method	Advantages	Disadvantages
Batch GD (BGD)	Uses all training samples	Accurate gradient, smooth convergence	High computational cost, slow
Stochastic GD (SGD)	Uses one randomly picked sample	Fast updates; the noise can help escape shallow local minima	High variance in updates, noisy path
Mini-batch GD (MBGD)	Uses a small batch of samples (e.g., 32)	Combines benefits of BGD and SGD, the default choice	Requires tuning batch size
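
The three rows of the table differ only in how many samples feed each gradient estimate, so one function with a `batch_size` argument can cover all of them. This is a sketch on assumed synthetic data with assumed settings (64 samples, learning rate 0.1, 2000 epochs); the function name `minibatch_gd` is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 2 + 3x + noise (hypothetical true parameters)
N = 64
x = rng.uniform(0, 1, N)
y = 2 + 3 * x + rng.normal(0, 0.1, N)
X = np.column_stack([np.ones(N), x])

def minibatch_gd(batch_size, eta=0.1, epochs=2000):
    theta = np.zeros(2)
    for _ in range(epochs):
        order = rng.permutation(N)  # reshuffle the samples every epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2 / len(idx) * Xb.T @ (Xb @ theta - yb)  # gradient on this batch only
            theta = theta - eta * grad
    return theta

theta_batch = minibatch_gd(batch_size=N)  # batch GD: all samples per update
theta_mini = minibatch_gd(batch_size=16)  # mini-batch GD
theta_sgd = minibatch_gd(batch_size=1)    # stochastic GD: one sample per update

theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # reference solution
for name, th in [("batch", theta_batch), ("mini-batch", theta_mini),
                 ("SGD", theta_sgd), ("OLS", theta_ols)]:
    print(f"{name:>10}: {np.round(th, 3)}")
```

With a constant learning rate, the batch version settles on the least-squares solution, while the mini-batch and stochastic versions keep jittering in a small neighbourhood around it; the smaller the batch, the larger the jitter.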

Advanced Optimizers: Making the Descent Smarter

Basic gradient descent can struggle in complex loss landscapes. In modern deep learning, we use more advanced optimizers.

Momentum
- Idea: Simulates momentum from physics. The update considers not only the current gradient but also the previous update direction, like a ball rolling down a hill.
- Effect: Helps the algorithm “power through” flat regions and local minima, accelerating convergence.
Adam (Adaptive Moment Estimation)
- Idea: Combines momentum with adaptive learning rates (adjusting the learning rate independently for each parameter).
- Effect: Performs well across a wide range of tasks and is often the go-to default optimizer.

For beginners: Using the Adam optimizer directly will often yield excellent results.

Comprehensive Case Study: Predicting House Prices with Linear Regression

Let’s tie everything we’ve learned today together to complete a full machine learning project using scikit-learn.

Task: Predict the median house value in California districts (a regression problem).
Data: Scikit-learn’s built-in California Housing dataset (the target is expressed in units of $100,000).
Model: Linear Regression.
Evaluation: Mean Squared Error (MSE).
Optimization: Scikit-learn’s LinearRegression solves the least-squares problem directly with a linear-algebra solver, so no gradient descent loop or learning rate is needed here.

Case Study (1): Loading and Splitting the Data

The first step is always to prepare the data: load it and split it into training and test sets.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='MedHouseVal')

# 2. Split the data into training and testing sets (80% train, 20% test)
# random_state ensures the split is the same every time for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print('Training set size:', X_train.shape)
print('Test set size:', X_test.shape)
print('\nPreview of some feature data:')
print(X_train.head())

Training set size: (16512, 8)
Test set size: (4128, 8)

Preview of some feature data:
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
14196  3.2596      33.0  5.017657   1.006421      2300.0  3.691814     32.71   
8267   3.8125      49.0  4.473545   1.041005      1314.0  1.738095     33.77   
17445  4.1563       4.0  5.645833   0.985119       915.0  2.723214     34.66   
14265  1.9425      36.0  4.002817   1.033803      1418.0  3.994366     32.69   
2271   3.5542      43.0  6.268421   1.134211       874.0  2.300000     36.78   

       Longitude  
14196    -117.03  
8267     -118.16  
17445    -120.48  
14265    -117.11  
2271     -119.80

Case Study (2): Training the Model and Making Predictions

We use the training set to ‘teach’ our model, and then test its learning on the test set.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np
# (Continues from the previous code block with variables X_train, y_train, X_test, y_test)

# 1. Initialize a Linear Regression model
model = LinearRegression()

# 2. Train the model on the training set (i.e., find the best parameters θ)
# The .fit() method executes the learning/optimization process
model.fit(X_train, y_train)

# 3. Make predictions on the test set
# The .predict() method uses the learned model to make predictions
y_pred = model.predict(X_test)

# Print some of the prediction results
print('Actual house prices (first 5):', np.round(y_test.head().values, 2))
print('Predicted house prices (first 5):', np.round(y_pred[:5], 2))

Actual house prices (first 5): [0.48 0.46 5.   2.19 2.78]
Predicted house prices (first 5): [0.72 1.76 2.71 2.84 2.6 ]

Case Study (3): Evaluating the Model and Interpreting Results

Finally, we calculate our evaluation metric and see what the model has learned.

import pandas as pd
from sklearn.metrics import mean_squared_error
# (Continues from the previous code block with variables y_test, y_pred, model, X)

# 4. Evaluate the model's performance on the test set
mse = mean_squared_error(y_test, y_pred)
print(f"\nThe model's Mean Squared Error (MSE) on the test set is: {mse:.4f}")

# 5. Examine the learned parameters (coefficients)
# model.coef_ corresponds to the slope coefficients in the regression equation
coef_df = pd.DataFrame(model.coef_, index=X.columns, columns=['Coefficient'])
print('\nLearned Coefficients (θ):')
print(coef_df.head())
# model.intercept_ corresponds to the intercept term
print(f'\nLearned Intercept: {model.intercept_:.4f}')


The model's Mean Squared Error (MSE) on the test set is: 0.5559

Learned Coefficients (θ):
            Coefficient
MedInc         0.448675
HouseAge       0.009724
AveRooms      -0.123323
AveBedrms      0.783145
Population    -0.000002

Learned Intercept: -37.0233

Case Study (4): Visualizing the Prediction Results

Comparing the model’s predictions to the actual values is the most intuitive way to check its performance.

Figure 4: Model Predicted House Value vs. Actual House Value

If the predictions were perfect, all the points would lie on the red dashed line.

Chapter Summary: The Four Pillars of Machine Learning

Today, we built a complete framework for thinking about machine learning problems.

1. Frame the Problem

Task: Regression vs. Classification
Learning Type: Supervised vs. Unsupervised

2. Define the Model

Choose a function \(f(x, \theta)\)
e.g., Linear Regression

3. Define ‘Good’ (Evaluation)

Core: Evaluate on the test set
Regression: MSE, R²
Classification: F1-Score, Recall

4. Define ‘Learning’ (Optimization)

Minimize a Loss Function \(J(\theta)\)
Gradient Descent is the core algorithm

Thank you!

Q & A
