The Core Question: Why Should Economists Learn Machine Learning?
Traditional econometrics and machine learning represent two different ‘cultures’ for solving problems.
Econometrics: The core is Causal Inference.
Goal: To understand ‘why’ the world works the way it does.
Concern: Unbiasedness and consistency of parameter estimates.
Example: By how much did the minimum wage increase cause a change in the unemployment rate?
Machine Learning: The core is Prediction.
Goal: To predict ‘what’ will happen in the world.
Concern: The model’s generalization ability on unseen data.
Example: Based on current macroeconomic indicators, predict next quarter’s GDP growth rate.
A Visual Contrast of the Two Cultures
The two methodologies have fundamental differences in model selection and objectives.
Why Prediction Matters to Economists
In a data-driven era, predictive power is itself a potent economic tool.
Financial Markets: Predicting asset prices, volatility, and credit risk.
Macroeconomics: Forecasting inflation, GDP growth, and unemployment to inform policy.
Business Decisions: Forecasting product sales, customer churn, and supply chain demand.
Policy Evaluation: Predicting the likely economic impact of a policy (e.g., a tax cut).
Causal inference explains the past; accurate prediction provides insight into the future. Combined, their power is multiplied.
Today’s Goal: Build a Complete Mental Framework for Machine Learning
After this chapter, you will be able to systematically understand any machine learning project from the perspective of ‘The Four Pillars’.
Pillar I: Frame the Problem (Framing)
This is the starting point for all work.
Before any technical details, you must clearly define the business problem and translate it into a specific machine learning task.
What are you trying to predict?
A continuous value (e.g., tomorrow’s stock price) \(\rightarrow\) Regression
A discrete category (e.g., whether a customer will default) \(\rightarrow\) Classification
What data do you have?
Does the data come with the ‘answer’ you want to predict (i.e., a label y)?
Yes \(\rightarrow\) Supervised Learning
No \(\rightarrow\) Unsupervised Learning
Pillar II: Define the Model (Modeling)
A model is essentially a mathematical function \(f(x, \theta)\) that tries to capture the relationship between input features \(x\) and the output \(y\).
\(x\): The input feature vector (e.g., house area, location).
\(\theta\): The model’s parameters. These are the values that need to be determined through ‘learning’ (e.g., coefficients in a linear regression).
\(f\): The form of the function. This is what we, as modelers, choose.
The range of models is vast:
Simple Models: Linear Regression, Logistic Regression (highly interpretable).
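As a minimal sketch of the idea that a model is a function \(f(x, \theta)\), the snippet below writes a linear model by hand. The function name `linear_model`, the house features, and the parameter values are all hypothetical and chosen only for illustration; in practice \(\theta\) would be learned from data.

```python
def linear_model(x, theta):
    """f(x, theta) = theta[0] + theta[1]*x[0] + ... + theta[d]*x[d-1]."""
    intercept, weights = theta[0], theta[1:]
    return intercept + sum(w * xi for w, xi in zip(weights, x))

# Hypothetical input features: area (square metres) and number of bedrooms.
x = [120.0, 3.0]
# Hypothetical parameters: base price, price per square metre, price per bedroom.
theta = [50.0, 2.0, 10.0]

prediction = linear_model(x, theta)
print(prediction)  # 50 + 2*120 + 10*3 = 320.0
```

Choosing the form of `linear_model` is the modeler's decision; finding good values for `theta` is what learning does.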
Pillar III: Define the Evaluation (Evaluation)
How do we objectively measure how good a model is? We need an evaluation metric.
This metric must reflect the business objective.
It must be calculated on data the model has never seen before (the test set).
Common Evaluation Metrics:
Regression Tasks:
Mean Squared Error (MSE)
R-squared (R²)
Classification Tasks:
Accuracy
Precision, Recall, F1-Score
Pillar IV: Define ‘Learning’ (Optimization)
The process of ‘learning’ is the process of automatically finding the best parameters \(\theta\).
We first define a Loss Function \(J(\theta)\), which measures how bad the model’s predictions are with the current parameters \(\theta\). The smaller the loss, the better the model.
Then, we use an Optimizer, such as Gradient Descent, to systematically and iteratively adjust the parameters \(\theta\) to find the set of values \(\theta^*\) that minimizes the loss function \(J(\theta)\).
What is Machine Learning? Learning Functions from Data
The essence of Machine Learning (ML) is to have a computer learn a function from data automatically, rather than having a programmer specify it through explicit rules. The learned function can then make predictions on data it has not seen.
Data Representation in ML: Everything is a Vector
In machine learning, we need to convert real-world objects into a language computers understand—numbers.
Dataset: A collection of N samples \(X = \{x_1, x_2, \dots, x_N\}\).
Sample: A single data point (a house, a customer).
Feature: A dimension describing a sample (area, number of bedrooms).
Feature Vector: A vector of all features for one sample \(x = (x_{\text{area}}, x_{\text{bedrooms}}, \dots)^T\).
Label: The target value we want to predict, y (house price).
From the Real World to Mathematical Objects
This transformation process is at the heart of data preprocessing.
A Sample = A Point in d-Dimensional Space
Once we represent samples as feature vectors, each sample can be viewed as a point in a d-dimensional feature space.
This provides a geometric foundation for understanding machine learning algorithms.
Case Study: Understanding Feature Vectors with Economic Data
Suppose we want to predict U.S. Personal Consumption Expenditures (PCE). We can get data from FRED.
| Date | Disposable Personal Income (DPI) | Consumer Confidence Index (CONF) | Personal Consumption Expenditures (PCE) (Label y) |
| --- | --- | --- | --- |
| 2023-01-01 | 19,800 | 102.9 | 18,000 |
| 2023-02-01 | 19,850 | 103.4 | 18,050 |
| … | … | … | … |
A sample is one row of data, representing one month.
The feature vector is \(x_t = (\text{DPI}_t, \text{CONF}_t)^T\). This is a point in a 2D space.
The label is \(y_t = \text{PCE}_t\).
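The mapping from table rows to feature vectors and labels can be sketched with NumPy. Only the two rows shown in the table are used; the variable names (`dpi`, `conf`, `pce`, `X`, `y`) are illustrative choices, not part of any dataset API.

```python
import numpy as np

# The two rows shown in the table (values copied from it).
dpi = [19800.0, 19850.0]
conf = [102.9, 103.4]
pce = [18000.0, 18050.0]

# Feature matrix X: one row per sample (month), one column per feature.
X = np.column_stack([dpi, conf])
# Label vector y: one entry per sample.
y = np.array(pce)

print(X.shape)  # (N, d) = (2, 2): two samples, two features
print(X[0])     # feature vector x_t for 2023-01-01
print(y[0])     # label y_t for 2023-01-01
```

Each row of `X` is one point in the 2D feature space, and `y` holds the value to be predicted for that point.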
The Three Main Categories of Machine Learning
Based on the data we have (especially whether we have the label y), machine learning tasks can be divided into three main categories.
Supervised Learning
Unsupervised Learning
Reinforcement Learning
We will introduce them one by one.
Category 1: Supervised Learning
Used when the data comes with clear ‘answers’ or ‘labels’.
Category 2: Unsupervised Learning
Used when data has no ‘answers’, and we want to discover its internal structure.
Category 3: Reinforcement Learning
Used when we need to learn an optimal strategy through ‘trial and error’ with an environment.
Focusing on Supervised Learning: Regression vs. Classification
The vast majority of tasks in economics and business fall under supervised learning. It can be further divided into two major tasks based on the type of the label y.
Regression:
Goal: Predict a continuous numerical value.
Output: \(y \in \mathbb{R}\)
Examples: Predicting GDP growth rate, company sales figures.
Classification:
Goal: Predict a discrete category.
Output: \(y \in \{C_1, C_2, \dots, C_K\}\)
Examples: Determining if a customer will churn, if a transaction is fraudulent.
Geometric Intuition of Regression and Classification
Practical Context: The Keynesian Consumption Function
Before diving into code, let’s revisit a classic economic theory: the Keynesian Consumption Function.
\[ C = a + b Y_d \]
\(C\): Total consumption
\(Y_d\): Disposable income
\(a\): Autonomous consumption (consumption even with zero income)
\(b\): Marginal Propensity to Consume (MPC, how much of an extra unit of income is spent)
This is a classic linear relationship. We can use the simplest machine learning model—linear regression—to estimate this relationship from data.
Practical Example: A Simple Regression Analysis in Python
Task: Use U.S. Disposable Personal Income (DPI) to predict Personal Consumption Expenditures (PCE).
This is a typical regression problem. We will use the statsmodels library.
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# --- Create mock data ---
# Assume the true relationship is PCE = 500 + 0.9 * DPI + noise
np.random.seed(42)
dpi_mock = np.linspace(10000, 20000, 150)
noise = np.random.normal(0, 300, 150)
pce_mock = 500 + 0.9 * dpi_mock + noise
df = pd.DataFrame({'DPI': dpi_mock, 'PCE': pce_mock})

# Define independent (X) and dependent (y) variables
X = df['DPI']
y = df['PCE']
X = sm.add_constant(X)  # Add an intercept term

# Fit the OLS model
model = sm.OLS(y, X).fit()

# Visualization
fig, ax = plt.subplots(figsize=(8, 5))

# Plot scatter plot and the fitted line
ax.scatter(df['DPI'], df['PCE'], alpha=0.6,
           label='Simulated Monthly Data', color='gray')
ax.plot(df['DPI'], model.predict(X), color='crimson',
        linewidth=2.5, label='OLS Fit Line')

ax.set_title('Keynesian Consumption Function: Consumption Determined by Income',
             fontsize=16, fontweight='bold')
ax.set_xlabel('Real Disposable Personal Income (Billions of Dollars)', fontsize=12)
ax.set_ylabel('Personal Consumption Expenditures (Billions of Dollars)', fontsize=12)

# Set legend
ax.legend(fontsize=11, loc='upper left')

# Set tick labels
ax.tick_params(axis='both', which='major', labelsize=11)

# Add grid for readability
ax.grid(True, alpha=0.3)

fig.tight_layout()
plt.show()

# Print partial model summary
print(f'R-squared: {model.rsquared:.4f}')
print(f"DPI Coefficient (estimated MPC): {model.params['DPI']:.4f}")
Figure 1: U.S. Personal Consumption Expenditures (PCE) vs. Disposable Personal Income (DPI)
How do we know if the model we trained, \(f(x; \theta)\), is a ‘good’ model?
Core Principle: The model’s performance on data it has never seen before (the test set) is the only true measure of its ability to generalize.
This leads to the most important practice in machine learning: the Train-Test Split.
The Golden Rule: Train-Test Split
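We must divide our data into at least two parts: a training set, used to fit the model’s parameters, and a test set, held out and used only to evaluate the fitted model.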
Why Must We Split? The Ghost of Overfitting
Overfitting occurs when a model learns the training data “too well,” to the point that it memorizes the noise and random fluctuations in the data as if they were general patterns.
Symptom: Performs extremely well on the training set, but very poorly on the test set.
Cause: The model is too complex, with too much freedom relative to the amount of data.
The test set is like a ‘mock exam’—it fairly tests whether the model has truly learned the subject or just memorized the answers.
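A small simulation can make the overfitting symptom concrete. The sketch below is an illustration under stated assumptions: the data are synthetic (a linear signal plus noise), the polynomial degrees 1 and 9 are arbitrary choices, and `make_data` and `mse` are hypothetical helper names.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Synthetic data: a linear signal plus Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.3, n)
    return x, y

x_train, y_train = make_data(15)   # small training set
x_test, y_test = make_data(200)    # held-out test set

def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

results = {}
for degree in (1, 9):
    # Least-squares polynomial fit using the training data only.
    poly = Polynomial.fit(x_train, y_train, deg=degree)
    results[degree] = (mse(y_train, poly(x_train)), mse(y_test, poly(x_test)))
    print(f'degree {degree}: train MSE = {results[degree][0]:.4f}, '
          f'test MSE = {results[degree][1]:.4f}')
```

The more flexible degree-9 fit can never have a higher training error than the straight line, since a line is a special case of it. Whether it generalizes is what the test MSE reveals, and a large gap between the two numbers is the signature of overfitting.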
Cornerstone of Classification Evaluation: The Confusion Matrix
For binary classification problems (e.g., predicting customer default), all evaluation metrics derive from a simple table: the Confusion Matrix.
| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Positive: The event we care about, like ‘default’ or ‘fraud’.
Negative: The other case, like ‘no default’.
True/False: Refers to whether the prediction was correct.
Understanding the Four Quadrants of the Confusion Matrix
TP (True Positive): Correct prediction, the customer did default. (Hit)
TN (True Negative): Correct prediction, the customer did not default. (Correct Rejection)
FP (False Positive): Incorrect prediction, predicted default, but they didn’t. (False Alarm, Type I Error)
FN (False Negative): Incorrect prediction, predicted no default, but they did. (Miss, Type II Error)
In fields like financial risk management, the cost of an FN (missing a bad customer) is often far greater than the cost of an FP (misjudging a good customer).
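The four quadrants can be counted directly from paired labels and predictions. This is a minimal sketch; `confusion_counts` is a hypothetical helper and the eight-customer example is made up, with 1 meaning ‘default’ and 0 meaning ‘no default’.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # hit
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # miss (Type II error)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm (Type I error)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correct rejection
    return tp, fn, fp, tn

# Made-up example: 3 actual defaults among 8 customers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
print(f'TP={tp}, FN={fn}, FP={fp}, TN={tn}')  # TP=2, FN=1, FP=1, TN=4
```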
Classification Metric (1): Accuracy
Accuracy measures the proportion of total samples that the model predicted correctly.
\[ \large \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Samples}} = \frac{TP + TN}{TP + TN + FP + FN} \]
Advantage: Very intuitive and easy to understand.
Disadvantage: Highly misleading on imbalanced datasets.
The Accuracy Trap: An Example
Imagine a credit card fraud detection scenario:
Total transactions: 10,000
Normal transactions: 9,990 (99.9%)
Fraudulent transactions: 10 (0.1%)
How does a ‘lazy’ model that predicts all transactions as ‘normal’ perform?
TP = 0, TN = 9990
FP = 0, FN = 10
Accuracy = (0 + 9990) / 10000 = 99.9%
This model has extremely high accuracy but is completely useless, as it fails to identify a single case of fraud.
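The arithmetic of the trap is easy to verify. The sketch below reuses the counts from the example; `fraud_caught_share` is an illustrative name for the fraction of actual fraud cases the model flags.

```python
# Counts for the 'lazy' model that labels every transaction as normal.
tp, tn, fp, fn = 0, 9990, 0, 10

accuracy = (tp + tn) / (tp + tn + fp + fn)
fraud_caught_share = tp / (tp + fn)  # share of actual fraud cases flagged

print(f'Accuracy: {accuracy:.1%}')                   # 99.9%
print(f'Share of fraud caught: {fraud_caught_share:.1%}')  # 0.0%
```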
Classification Metric (2): Precision
Precision measures the proportion of all samples predicted as positive that are actually positive.
\[ \large \text{Precision} = \frac{TP}{TP + FP} \]
Business Meaning: ‘Of all the fraud alerts I raised, how many were real?’
Focus: The purity of the predictions. High precision means fewer false alarms (FP).
Classification Metric (3): Recall
Recall measures the proportion of all actual positive samples that we successfully identified.
\[ \large \text{Recall} = \frac{TP}{TP + FN} \]
Business Meaning: ‘Of all the actual fraud cases that occurred, how many did my model catch?’
Focus: How complete the search is. High recall means fewer misses (FN).
Precision vs. Recall: An Eternal Trade-off
In the real world, precision and recall are often negatively correlated.
Want to increase Recall? Lower the model’s ‘alarm’ threshold. It’s better to catch a thousand innocent people than to let one guilty person escape. This increases false alarms (FP), thereby decreasing precision.
Want to increase Precision? Raise the model’s ‘alarm’ threshold. Only sound the alarm for cases with overwhelming evidence. This increases misses (FN), thereby decreasing recall.
Business Decision: We need to decide which balance point to choose in this trade-off based on the different business costs of FPs and FNs.
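The effect of moving the alarm threshold can be sketched on a handful of scores. Everything here is illustrative: the ten labels and model scores are invented, and `precision_recall` is a hypothetical helper that flags a case as positive when its score is at or above the threshold.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Invented data: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.2, 0.1, 0.05]

for threshold in (0.75, 0.35):
    p, r = precision_recall(y_true, scores, threshold)
    print(f'threshold={threshold}: precision={p:.3f}, recall={r:.3f}')
# threshold=0.75: precision=1.000, recall=0.500
# threshold=0.35: precision=0.667, recall=1.000
```

In this toy example, lowering the threshold from 0.75 to 0.35 catches every positive but adds two false alarms, so recall rises while precision falls.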
Visualizing the Precision-Recall Trade-off
Classification Metric (4): F1-Score
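To balance precision and recall, we use the F1-Score, which is their harmonic mean.
\[ \large F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]
The F1-Score will only be high if both precision and recall are relatively high.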
If one of the metrics is low, the F1-Score will also be pulled down.
It is a more robust single evaluation metric than accuracy on imbalanced datasets.
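To see why a harmonic mean behaves this way, the sketch below compares it with a plain arithmetic mean. The helper name `f1_from` and the two precision/recall pairs are illustrative assumptions.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative pairs: one balanced, one lopsided.
for precision, recall in [(0.75, 0.60), (1.00, 0.01)]:
    arithmetic = (precision + recall) / 2
    f1 = f1_from(precision, recall)
    print(f'P={precision:.2f}, R={recall:.2f}: '
          f'arithmetic mean={arithmetic:.3f}, F1={f1:.3f}')
# P=0.75, R=0.60: arithmetic mean=0.675, F1=0.667
# P=1.00, R=0.01: arithmetic mean=0.505, F1=0.020
```

The lopsided pair still averages about 0.5 arithmetically, but its F1 collapses toward the weaker of the two metrics.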
Practical Example: Calculating Classification Metrics with scikit-learn
Let’s use a hypothetical credit default prediction example to demonstrate how to calculate these metrics.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# --- Create mock data (imbalanced) ---
# Assume 1000 customers, 50 of whom actually default (5%)
y_true = np.array([0]*950 + [1]*50)

# Assume the model predicts 40 defaults, 30 of which are correct (TP=30) and 10 are wrong (FP=10)
# This means the model missed 20 of the 50 actual defaults (FN=20)
y_pred = np.array([0]*940 + [1]*10 + [0]*20 + [1]*30)

# Create a random permutation to shuffle the data
p = np.random.permutation(len(y_true))
y_true, y_pred = y_true[p], y_pred[p]

# 1. Calculate the metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, zero_division=0)
recall = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)

print(f'Accuracy: {accuracy:.3f}')
print(f'Precision: {precision:.3f}')
print(f'Recall: {recall:.3f}')
print(f'F1-Score: {f1:.3f}')

# 2. Calculate and visualize the confusion matrix
cm = confusion_matrix(y_true, y_pred)
print('\nConfusion Matrix:\n', cm)

fig, ax = plt.subplots(figsize=(6, 4.5))
sns.heatmap(
    cm, annot=True, fmt='d', cmap='Blues',
    xticklabels=['Predicted No Default', 'Predicted Default'],
    yticklabels=['Actual No Default', 'Actual Default'],
    ax=ax, annot_kws={'size': 14},
)
ax.set_ylabel('Actual Label', fontsize=12)
ax.set_xlabel('Predicted Label', fontsize=12)
ax.set_title('Confusion Matrix', fontsize=14)
plt.show()
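For regression tasks, a common evaluation metric is the Mean Squared Error (MSE):
\[ \large \text{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2 \]
\(y_i\) is the true value for sample \(i\).
\(\hat{y}_i\) is the model’s prediction for sample \(i\).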
\((y_i - \hat{y}_i)\) is the residual.
MSE calculates the average of the squared residuals. Because it uses squares, it penalizes large errors more heavily.
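A short hand computation shows how squaring penalizes large errors. In this sketch the helper `mse` and the three-sample data are made up; both sets of predictions have the same total absolute error (1.5), but one concentrates it in a single sample.

```python
def mse(y_true, y_pred):
    """Average of the squared residuals (y_i - y_hat_i)."""
    residuals = [y - y_hat for y, y_hat in zip(y_true, y_pred)]
    return sum(r ** 2 for r in residuals) / len(residuals)

y_true = [3.0, 5.0, 2.0]
spread_out = [2.5, 5.5, 2.5]   # three errors of size 0.5
one_big = [3.0, 5.0, 3.5]      # a single error of size 1.5

print(mse(y_true, spread_out))  # 0.25
print(mse(y_true, one_big))     # 0.75
```

The single large miss yields three times the MSE of the three small ones, even though the total absolute error is identical.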
Visualizing Mean Squared Error (MSE)
The Essence of Learning: Optimization
We’ve defined the model and evaluation criteria, but how does a machine actually ‘learn’? The essence of learning is an Optimization process.
We define a Loss Function \(J(\theta)\), which measures how bad the model’s predictions are on the training set with the current parameters \(\theta\). The lower the loss function’s value, the better the model performs.
The goal of ‘learning’ is to find a set of parameters \(\theta^*\) that minimizes the loss function \(J(\theta)\).
For regression problems, MSE is the most commonly used loss function.
Loss Function: The ‘Navigation Map’ for Optimization
The loss function \(J(\theta)\) describes a ‘topographical map’, where the altitude is the loss value. Our goal is to start from a random point and walk to the lowest valley in this terrain.
Figure 2: The landscape of a loss function: Our goal is to find the global minimum.
Loss Function for Classification: Cross-Entropy
For classification problems, we commonly use Cross-Entropy Loss.
Intuitive Understanding: It measures the ‘distance’ between the probability distribution predicted by the model and the true probability distribution.
For binary classification:
\[ \large J(\theta) = - \frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right] \]
where \(y_i \in \{0, 1\}\) is the true label, and \(\hat{y}_i \in [0, 1]\) is the model’s predicted probability of the class being 1.
This function has a nice property: when the model makes a very confident and wrong prediction, the loss becomes very large, giving the model a strong ‘penalty’ signal.
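The penalty for confident mistakes can be checked numerically. The sketch below implements the binary cross-entropy formula with the standard library; the function name `binary_cross_entropy`, the clipping constant `eps` (added here only to avoid taking the log of zero), and the example probabilities are assumptions for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; y_prob holds predicted P(class = 1)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# One sample whose true label is 1, scored with three different predictions.
for p in (0.99, 0.60, 0.01):
    loss = binary_cross_entropy([1], [p])
    print(f'predicted P(y=1) = {p:.2f} -> loss = {loss:.3f}')
# predicted P(y=1) = 0.99 -> loss = 0.010
# predicted P(y=1) = 0.60 -> loss = 0.511
# predicted P(y=1) = 0.01 -> loss = 4.605
```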
Pillar IV: Define ‘Learning’ (Optimization)
We have the map (the loss function), but how do we find the way down the mountain?
The most classic and important method is Gradient Descent.
Core Idea: Imagine you are on a foggy mountainside, and you can only see the small patch of ground at your feet. To get down the fastest, you should take a step in the direction of the steepest descent from your current position.
Mathematically, the negative of the gradient of a function at a point is the direction in which the function’s value decreases most rapidly.
The Mathematical Principle of Gradient Descent
Gradient descent is an iterative algorithm. At each step t, it updates the parameters \(\theta\) according to the following rule:
\[ \large \theta_{t+1} = \theta_t - \eta \nabla J(\theta_t) \]
\(\theta_t\): The value of the parameters at step t.
\(\nabla J(\theta_t)\): The gradient of the loss function \(J\) at \(\theta_t\). It is a vector pointing in the direction of the fastest increase in the function’s value.
\(\eta\): The Learning Rate, a hyperparameter that controls how far we step each time.
\(-\eta \nabla J(\theta_t)\): We take a small step in the direction opposite to the gradient.
We repeat this process until the parameters converge.
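The update rule, a step of size \(\eta\) against the gradient, fits in a few lines. This is a minimal sketch on a toy one-parameter loss \(J(\theta) = (\theta - 3)^2\), chosen here because its minimum at \(\theta^* = 3\) is known; the function names and settings are illustrative.

```python
def gradient_descent(grad, theta0, eta, n_steps):
    """Repeat theta <- theta - eta * grad(theta); return every visited value."""
    theta = theta0
    path = [theta]
    for _ in range(n_steps):
        theta = theta - eta * grad(theta)
        path.append(theta)
    return path

def grad_J(theta):
    """Gradient of the toy loss J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

path = gradient_descent(grad_J, theta0=0.0, eta=0.1, n_steps=100)
print(f'after 5 steps:   theta = {path[5]:.4f}')
print(f'after 100 steps: theta = {path[-1]:.6f}')  # very close to 3
```

For this loss each step shrinks the distance to the minimum by a constant factor of \(1 - 2\eta = 0.8\), which is why the iterates settle at 3.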
Visualizing Gradient Descent
The Learning Rate (η): Determining Optimization Speed and Success
Figure 3: The impact of the learning rate (η) on the gradient descent process
\(\eta\) Too Small: Crawls down the mountain like a snail, converging very slowly.
\(\eta\) Too Large: Descends like a drunkard, potentially ‘overshooting’ and oscillating around the minimum, or even ‘jumping’ to the other side of the mountain, causing divergence.
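Both failure modes can be reproduced on a toy loss. The sketch below runs 50 gradient descent steps on \(J(\theta) = (\theta - 3)^2\) with three learning rates; the loss and the specific values of \(\eta\) are assumptions chosen to make the contrast visible.

```python
def run_gd(eta, theta0=0.0, n_steps=50):
    """Gradient descent on J(theta) = (theta - 3)^2; returns the final theta."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - eta * 2.0 * (theta - 3.0)
    return theta

errors = {}
for eta in (0.001, 0.1, 1.1):
    errors[eta] = abs(run_gd(eta) - 3.0)
    print(f'eta = {eta}: distance from minimum after 50 steps = {errors[eta]:.3g}')
```

With \(\eta = 0.001\) the iterate has barely moved from its starting point, with \(\eta = 0.1\) it has effectively converged, and with \(\eta = 1.1\) every step overshoots by more than the previous error, so the distance grows without bound.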
Variants of Gradient Descent: Handling Large-Scale Data
When our training set is very large, computing the gradient over the entire dataset becomes very time-consuming. For this reason, variants of gradient descent have been developed.
Comparison of Gradient Descent Variants
| Type | Gradient Calculation Method | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Batch GD (BGD) | Uses all training samples | Accurate gradient, smooth convergence | High computational cost, slow |
| Stochastic GD (SGD) | Uses one randomly picked sample | Fast, can escape local minima | High variance in updates, noisy path |
| Mini-batch GD (MBGD) | Uses a small batch of samples (e.g., 32) | Combines benefits of BGD and SGD, the default choice | Requires tuning batch size |
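The three variants differ only in how many samples feed each gradient estimate, which a single function with a `batch_size` argument can show. The sketch below fits a one-feature linear model by minimizing MSE on synthetic data; the data-generating values (intercept 1, slope 2), the function name `minibatch_gd`, and the hyperparameters are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 1 + 2x + noise.
N = 512
x = rng.normal(size=N)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=N)

def minibatch_gd(x, y, eta=0.05, batch_size=32, n_epochs=50):
    """Fit y ~ b + w*x by gradient descent on the MSE, one mini-batch per step."""
    b, w = 0.0, 0.0
    n = len(x)
    for _ in range(n_epochs):
        order = rng.permutation(n)            # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = (b + w * x[idx]) - y[idx]
            grad_b = 2.0 * error.mean()               # dMSE/db on this batch
            grad_w = 2.0 * (error * x[idx]).mean()    # dMSE/dw on this batch
            b -= eta * grad_b
            w -= eta * grad_w
    return b, w

b, w = minibatch_gd(x, y)
print(f'intercept = {b:.3f}, slope = {w:.3f}')  # close to 1 and 2
```

Setting `batch_size=len(x)` turns this into batch gradient descent, and `batch_size=1` into stochastic gradient descent.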
Advanced Optimizers: Making the Descent Smarter
Basic gradient descent can struggle in complex loss landscapes. In modern deep learning, we use more advanced optimizers.
Momentum
Idea: Simulates momentum from physics. The update considers not only the current gradient but also the previous update direction, like a ball rolling down a hill.
Effect: Helps the algorithm “power through” flat regions and local minima, accelerating convergence.
Adam (Adaptive Moment Estimation)
Idea: Combines momentum with adaptive learning rates (adjusting the learning rate independently for each parameter).
Effect: Performs well across a wide range of tasks and is often the go-to default optimizer.
For beginners: Using the Adam optimizer directly will often yield excellent results.
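The momentum idea can be sketched in its classical ‘heavy-ball’ form, where a velocity term carries over a fraction \(\beta\) of the previous update. This is an illustrative sketch on the toy loss \(J(\theta) = (\theta - 3)^2\), not the implementation used by any particular library; the names and the values of \(\eta\) and \(\beta\) are assumptions. Adam adds per-parameter step-size adaptation on top of this and is not shown here.

```python
def momentum_descent(grad, theta0, eta=0.1, beta=0.9, n_steps=300):
    """Gradient descent with momentum: the update blends the previous
    update direction (velocity) with the current negative gradient."""
    theta, velocity = theta0, 0.0
    for _ in range(n_steps):
        velocity = beta * velocity - eta * grad(theta)
        theta = theta + velocity
    return theta

def grad_J(theta):
    """Gradient of the toy loss J(theta) = (theta - 3)^2."""
    return 2.0 * (theta - 3.0)

theta_final = momentum_descent(grad_J, theta0=0.0)
print(f'theta after 300 steps: {theta_final:.6f}')  # close to the minimum at 3
```

With `beta=0.0` the velocity term vanishes and the function reduces to plain gradient descent.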
Comprehensive Case Study: Predicting House Prices with Linear Regression
Let’s tie everything we’ve learned today together to complete a full machine learning project using scikit-learn.
Task: Predict the median house value in California districts (a regression problem).
Data: Scikit-learn’s built-in California Housing dataset.
Model: Linear Regression.
Evaluation: Mean Squared Error (MSE).
Optimization: Scikit-learn’s LinearRegression fits the coefficients by ordinary least squares, solving the problem directly with a linear-algebra routine rather than by iterative gradient descent.
Case Study (1): Loading and Splitting the Data
The first step is always to prepare the data: load it and split it into training and test sets.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

# 1. Load the California Housing dataset
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = pd.Series(housing.target, name='MedHouseVal')

# 2. Split the data into training and testing sets (80% train, 20% test)
# random_state ensures the split is the same every time for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print('Training set size:', X_train.shape)
print('Test set size:', X_test.shape)
print('\nPreview of some feature data:')
print(X_train.head())
Training set size: (16512, 8)
Test set size: (4128, 8)
Preview of some feature data:
       MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
14196  3.2596      33.0  5.017657   1.006421      2300.0  3.691814     32.71    -117.03
8267   3.8125      49.0  4.473545   1.041005      1314.0  1.738095     33.77    -118.16
17445  4.1563       4.0  5.645833   0.985119       915.0  2.723214     34.66    -120.48
14265  1.9425      36.0  4.002817   1.033803      1418.0  3.994366     32.69    -117.11
2271   3.5542      43.0  6.268421   1.134211       874.0  2.300000     36.78    -119.80
Case Study (2): Training the Model and Making Predictions
We use the training set to ‘teach’ our model, and then test its learning on the test set.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# (Continues from the previous code block with variables X_train, y_train, X_test, y_test)

# 1. Initialize a Linear Regression model
model = LinearRegression()

# 2. Train the model on the training set (i.e., find the best parameters θ)
# The .fit() method executes the learning/optimization process
model.fit(X_train, y_train)

# 3. Make predictions on the test set
# The .predict() method uses the learned model to make predictions
y_pred = model.predict(X_test)

# Print some of the prediction results
print('Actual house prices (first 5):', np.round(y_test.head().values, 2))
print('Predicted house prices (first 5):', np.round(y_pred[:5], 2))
Actual house prices (first 5): [0.48 0.46 5. 2.19 2.78]
Predicted house prices (first 5): [0.72 1.76 2.71 2.84 2.6 ]
Case Study (3): Evaluating the Model and Interpreting Results
Finally, we calculate our evaluation metric and see what the model has learned.
import pandas as pd
from sklearn.metrics import mean_squared_error

# (Continues from the previous code block with variables y_test, y_pred, model, X)

# 4. Evaluate the model's performance on the test set
mse = mean_squared_error(y_test, y_pred)
print(f"\nThe model's Mean Squared Error (MSE) on the test set is: {mse:.4f}")

# 5. Examine the learned parameters (coefficients)
# model.coef_ corresponds to the slope coefficients in the regression equation
coef_df = pd.DataFrame(model.coef_, index=X.columns, columns=['Coefficient'])
print('\nLearned Coefficients (θ):')
print(coef_df.head())

# model.intercept_ corresponds to the intercept term
print(f'\nLearned Intercept: {model.intercept_:.4f}')
The model's Mean Squared Error (MSE) on the test set is: 0.5559
Learned Coefficients (θ):
Coefficient
MedInc 0.448675
HouseAge 0.009724
AveRooms -0.123323
AveBedrms 0.783145
Population -0.000002
Learned Intercept: -37.0233
Case Study (4): Visualizing the Prediction Results
Comparing the model’s predictions to the actual values is the most intuitive way to check its performance.
Figure 4: Model Predicted House Value vs. Actual House Value
If the predictions were perfect, all the points would lie on the red dashed line.
Chapter Summary: The Four Pillars of Machine Learning
Today, we built a complete framework for thinking about machine learning problems.