10. Artificial Neural Networks: An Economic Perspective


A journey from the limits of linear models to the power of deep learning for economic analysis.

Today’s Agenda: Moving Beyond Linearity

  1. Review & Reflection: The limitations of linear models.
  2. Core Idea: Using biology to inspire mathematical models.
  3. Basic Building Block: Starting with a single ‘neuron’.
  4. Key Innovation: How ‘activation functions’ introduce non-linearity.
  5. Building a Network: From the Perceptron to the Multi-Layer Perceptron (MLP).
  6. Model Learning: Gradient descent and backpropagation.
  7. Economics in Practice: Predicting U.S. economic recessions.
  8. A Glimpse of the Frontier: Introduction to Convolutional Neural Networks (CNNs).

The Core Question: What If the World Isn’t Linear?

As economics students, our most familiar tool is Ordinary Least Squares (OLS).

\[ \large{Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \epsilon} \]

It is powerful and highly interpretable, but it rests on a critical assumption: a linear relationship between variables.

The Beauty and Burden of the Linear Assumption

A linear relationship means that for every one-unit increase in an independent variable \(X\), the change in the dependent variable \(Y\) is constant (\(\beta\)).

Figure: A scatter plot and regression line illustrating a linear relationship with a constant slope. X = independent variable (e.g., years of education), Y = dependent variable (e.g., income); for every ΔX = 1, the change is ΔY = β.

But is the real world always this simple?

The Real World: A Web of Complex, Non-Linear Relationships

Many economic phenomena cannot be perfectly described by a straight line.

  • Diminishing Marginal Utility: The happiness gained from an increase in income diminishes at higher income levels.
  • The Laffer Curve: The relationship between tax rates and tax revenue is an ‘inverted U-shape’.
  • ‘Fear’ and ‘Greed’ in Financial Markets: Asset prices react non-linearly to news, exhibiting thresholds and sharp fluctuations.

When faced with these complex non-linear relationships, traditional econometric models may fall short.

Example 1: Diminishing Marginal Utility

The higher the income, the smaller the increase in happiness from the same amount of money.

Figure: Diminishing marginal utility. As income rises, two identical income increases (ΔIncome) produce ever smaller utility gains: ΔUtility₁ > ΔUtility₂. Axes: income (horizontal), utility/happiness (vertical).

Example 2: The Laffer Curve

Higher tax rates are not always better. Excessively high rates can stifle economic activity, leading to a decrease in tax revenue.

Figure: The Laffer curve. An inverted U-shaped relationship between the tax rate (0% to 100%) and tax revenue, with revenue peaking at an optimal rate T*; rates below T* lie in the 'normal zone', rates above it in the 'prohibitive zone'.

This Chapter’s Goal: Introduce a Powerful Non-Linear Tool

In this chapter, we will learn a new modeling paradigm inspired by the workings of the human brain:

Artificial Neural Networks (ANNs)

Our objectives are to:

  1. Understand the basic building block of a neural network: the neuron.
  2. Grasp how networks introduce non-linearity through activation functions.
  3. Learn how to build and train a feedforward neural network.
  4. Explore its potential applications in economics and finance.

Inspiration: The Human Brain’s Neuron

Before diving into the math, let’s look at the source of inspiration. A biological neuron consists of three main parts:

  • Dendrites: Receive signals from other neurons.
  • Soma (Cell Body): Processes the received signals.
  • Axon: Transmits the processed signal outwards.

Signals are passed between neurons across a Synapse.

Figure: A simplified biological neuron, showing the signal flow from the dendrites (receive signals), through the soma/cell body (process signals), to the axon (transmit signals).

The Mathematical Abstraction: The McCulloch-Pitts Neuron

In 1943, Warren McCulloch and Walter Pitts proposed the first mathematical model of a neuron, known as the ‘M-P model’.

It simulates two key processes of a biological neuron:

  1. Signal Aggregation: It receives input signals from multiple upstream neurons and calculates their weighted sum.
  2. Activation Decision: It compares this weighted sum to a threshold. If the sum exceeds the threshold, the neuron ‘fires’ and outputs a signal; otherwise, it remains ‘inhibited’ and outputs nothing.

M-P Model Step 1: Signal Aggregation

Assume a neuron receives p input signals \(x_1, x_2, \dots, x_p\) from other neurons.

First, a linear transformation (weighted sum) is performed: \[ \large{u = \sum_{i=1}^{p} w_i x_i} \] Here, \(w_i\) represents the ‘weight’ of the \(i\)-th connection, simulating the strength of a synapse. A higher weight means the corresponding input signal is more important.

M-P Model Step 2: Activation Decision

Next, the weighted sum \(u\) is compared with a threshold \(\theta\):

\[ \large{y = \begin{cases} 1, & \text{if } u \ge \theta \quad \text{(Fires)} \\ 0, & \text{if } u < \theta \quad \text{(Inhibited)} \end{cases}} \]

This is an ‘all-or-nothing’ response pattern, like a switch that is either on (1) or off (0).
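
To make the two steps concrete, here is a minimal NumPy sketch of an M-P neuron; the weights, threshold, and inputs are illustrative values chosen for this example, not part of the original model.

import numpy as np

def mp_neuron(x, w, theta):
    """McCulloch-Pitts neuron: weighted sum followed by a hard threshold."""
    u = np.dot(w, x)                 # Step 1: signal aggregation, u = sum_i w_i * x_i
    return 1 if u >= theta else 0    # Step 2: fire (1) if u >= theta, else inhibited (0)

# Illustrative example: two inputs with equal weights and a threshold of 1.5
w = np.array([1.0, 1.0])
print(mp_neuron(np.array([1, 1]), w, theta=1.5))   # 2.0 >= 1.5 -> 1 (fires)
print(mp_neuron(np.array([1, 0]), w, theta=1.5))   # 1.0 <  1.5 -> 0 (inhibited)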

Graphical Representation of the M-P Model

We can represent the M-P model with a simple computation graph.

Figure: The M-P neuron as a computation graph. Inputs x₁, …, xₚ are multiplied by weights w₁, …, wₚ, summed to u = Σwᵢxᵢ, and passed through the threshold test u ≥ θ to produce the output y.

A More Convenient Formulation: Introducing the Bias Term

Working with a threshold \(\theta\) is algebraically inconvenient. We can perform a simple transformation.

Let \(b = -\theta\). This \(b\) is called the bias.

Then, the condition \(u \ge \theta\) is equivalent to \(u - \theta \ge 0\), which is \(u + b \ge 0\).

This allows us to treat the bias \(b\) as a special weight whose corresponding input is always 1. \[ \large{z = \left(\sum_{i=1}^{p} w_i x_i\right) + b} \] The activation process then becomes checking if \(z\) is greater than or equal to 0.

The Modern Neuron: From Threshold to Smooth Activation

The M-P model’s ‘all-or-nothing’ activation (a step function) is too crude: it is not differentiable at the threshold, and its derivative is zero everywhere else.

This is a fatal flaw for using gradient-based optimization algorithms to ‘learn’ the optimal weights and biases.

Therefore, modern artificial neural networks replace the simple threshold with a smooth, differentiable Activation Function \(f(\cdot)\).

\[ \large{z = \mathbf{w}^T \mathbf{x} + b} \] \[ \large{y = f(z) = f(\mathbf{w}^T \mathbf{x} + b)} \] Here, \(y\) is no longer just 0 or 1, but can be a continuous value.

The Soul of the Network: The Activation Function

The activation function is the soul of a neural network. It is responsible for introducing non-linearity into the model.

Key Insight: If there were no activation function (or if it were linear, \(f(x)=x\)), then no matter how many layers you stack, the entire network would be equivalent to a single, simple linear model.

Saturated

The function’s curve flattens out at both ends.

  • Sigmoid
  • Tanh

Non-Saturated (ReLU-based)

The derivative is constant in the positive region.

  • ReLU
  • Leaky ReLU

Saturated Activation 1: The Sigmoid Function

The Sigmoid function, also known as the Logistic function, was one of the most common activation functions in early neural networks.

\[ \large{\sigma(z) = \frac{1}{1 + e^{-z}}} \]

  • Role: Squeezes any real-valued input into the range \((0, 1)\).
  • Economic Connection: Its form is identical to the Logit model, making it perfectly suited for outputting probabilities.

Pros and Cons of the Sigmoid Function

Advantages

  • Output is bounded, allowing for a probabilistic interpretation.
  • Smooth and differentiable everywhere.

Disadvantages

  • Vanishing Gradients: The derivative is close to 0 in the saturated regions, making deep networks hard to train.
  • Not Zero-Centered: The output mean is around 0.5, not 0, which can slow down the convergence of gradient descent.

Visualizing the Sigmoid Function and Its Derivative

Derivative: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\)

Figure: The S-shaped sigmoid (logistic) function σ(z) = 1 / (1 + e⁻ᶻ), with output range (0, 1), and its bell-shaped derivative σ'(z), which peaks at z = 0.
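
As a quick numerical check of the derivative identity above, here is a minimal NumPy sketch that compares the analytical derivative σ(z)(1 − σ(z)) with a central finite-difference approximation; the grid of z values and the step size h are arbitrary choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4, 4, 9)
analytic = sigmoid(z) * (1 - sigmoid(z))               # sigma'(z) = sigma(z)(1 - sigma(z))
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central finite difference
print(np.max(np.abs(analytic - numeric)))              # discrepancy close to 0
print(analytic.max())                                  # peak value 0.25, reached at z = 0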

An Intuition for Derivatives

Question: Why is the function σ(z) rising for z > 0, while its derivative σ'(z) is falling?

This is an excellent question that gets to the heart of what a derivative represents. Let’s use an analogy: driving a car.

  • Function σ(z): The distance traveled by the car.
  • Derivative σ'(z): The car’s instantaneous speed.

Now let’s describe the journey:

  1. z < 0 (Starting and accelerating): The car is moving forward (distance increases), and you are pressing the gas (speed increases). The σ(z) curve gets steeper.
  2. z = 0 (Peak speed): The car is still moving forward (distance increases), but you have reached your maximum speed. This is the steepest point on the σ(z) curve.
  3. z > 0 (Approaching the destination, easing off the gas): The car is still moving forward (distance increases), but you are easing off the gas, so your speed is decreasing. The σ(z) curve is still rising, but it becomes less steep.

The derivative’s value tells you the function’s direction. The derivative’s trend tells you about the function’s curvature.

Saturated Activation 2: The Tanh Function

The hyperbolic tangent (Tanh) function is a variant of the Sigmoid.

\[ \large{\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}} = 2\sigma(2z) - 1} \]

  • Role: Squeezes input into the range \((-1, 1)\).
  • Core Advantage: It is zero-centered. The mean of its output is close to 0, which generally leads to faster convergence than Sigmoid.

Pros and Cons of the Tanh Function

Advantages

  • Zero-centered, leading to faster convergence.
  • Output is bounded.
  • Smooth and differentiable.

Disadvantages

  • The vanishing gradient problem still exists, although it’s slightly less severe than with Sigmoid.

Visualizing the Tanh Function and Its Derivative

Derivative: \(\tanh'(z) = 1 - \tanh^2(z)\)

Figure: The S-shaped Tanh function, with zero-centered output range (−1, 1), and its bell-shaped derivative tanh'(z) = 1 − tanh²(z).

The Modern Default: ReLU (Rectified Linear Unit)

The Rectified Linear Unit (ReLU) is currently the most popular activation function, especially in deep learning.

\[ \large{\text{ReLU}(z) = \max(0, z) = \begin{cases} z, & \text{if } z > 0 \\ 0, & \text{if } z \le 0 \end{cases}} \]

It acts like a gatekeeper: negative values are blocked (set to zero), while positive values pass through unchanged.

Pros and Cons of the ReLU Function

Advantages

  • Extremely simple to compute (just a max operation).
  • The derivative is a constant 1 for positive inputs, which alleviates the vanishing gradient problem.
  • Promotes sparsity in the network (some neurons output 0), reducing the risk of overfitting.

Disadvantages

  • Not zero-centered.
  • The Dying ReLU Problem: If a neuron’s input is consistently negative, its gradient will always be 0, and the neuron effectively ‘dies’.

Visualizing the ReLU Function and Its Derivative

Derivative: \(\text{ReLU}'(z) = \begin{cases} 1, & \text{if } z > 0 \\ 0, & \text{if } z \le 0 \end{cases}\)

Figure: The ReLU function f(z) = max(0, z), zero for negative inputs and linear for positive inputs, together with its step-function derivative (1 for z > 0, 0 for z < 0; not differentiable at z = 0).

A ReLU Variant: Leaky ReLU

To solve the ‘Dying ReLU’ problem, researchers proposed Leaky ReLU.

\[ \large{\text{LeakyReLU}(z) = \max(\alpha z, z) = \begin{cases} z, & \text{if } z > 0 \\ \alpha z, & \text{if } z \le 0 \end{cases}} \] where \(\alpha\) is a small positive constant, such as 0.01.

Core Idea: When the input is negative, it has a small, non-zero gradient of \(\alpha\). This ensures that the neuron’s gradient never becomes completely zero, preventing it from ‘dying’.

Visualizing the Leaky ReLU Function and Its Derivative

Derivative: \(\text{LeakyReLU}'(z) = \begin{cases} 1, & \text{if } z > 0 \\ \alpha, & \text{if } z \le 0 \end{cases}\)

Figure: The Leaky ReLU function, which keeps a small positive slope (α = 0.1 in this plot) for negative inputs, and its derivative (1 for z > 0, α for z ≤ 0; not differentiable at z = 0).
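
A minimal NumPy sketch of ReLU and Leaky ReLU (with an assumed α = 0.01); note how negative inputs are zeroed out by ReLU but only shrunk by Leaky ReLU.

import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))         # [0.  0.  0.  0.5 2. ]
print(leaky_relu(z))   # [-0.02  -0.005  0.     0.5    2.   ]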

Activation Function Choice Strategy

  • Hidden layers (general): ReLU. Rationale: fast computation, good performance, the default choice.
  • Hidden layers (if ReLU fails): Leaky ReLU / ELU. Rationale: solves the ‘Dying ReLU’ problem.
  • Output layer, binary classification: Sigmoid. Rationale: outputs a probability in the (0, 1) range.
  • Output layer, multiclass classification: Softmax. Rationale: outputs a probability distribution over all classes, summing to 1.
  • Output layer, regression: None (linear). Rationale: outputs a continuous value in any range.

Rule of Thumb: Never start with Sigmoid as a hidden layer activation. Default to ReLU, and try others only if performance is poor.

From a Single Neuron to a Network: The Perceptron

In 1957, Frank Rosenblatt introduced the Perceptron, which can be considered the first complete, learnable neural network model.

  • Structure: A single M-P model neuron.
  • Activation Function: The sign function, which outputs -1 or 1. \[ \large{\hat{y} = \text{sign}(\mathbf{w}^T \mathbf{x} + b)} \]
  • Capability: The Perceptron is a linear classifier. It can find a line (or hyperplane) in the feature space to separate data points into two classes.

The Perceptron Learning Algorithm: Error-Driven

The Perceptron’s learning rule is very intuitive: ‘Correct mistakes as you see them’.

  1. Initialize weights \(\mathbf{w}\) and bias \(b\).
  2. For each training example \((\mathbf{x}, y)\):
    1. Make a prediction \(\hat{y}\) using the current parameters.
    2. If the prediction is wrong (\(y \neq \hat{y}\)), update the parameters: \[ \large{\mathbf{w} \leftarrow \mathbf{w} + \eta y \mathbf{x}} \] \[ \large{b \leftarrow b + \eta y} \] where \(\eta\) is the learning rate.
    3. If the prediction is correct, do nothing.
  3. Repeat step 2 until all examples are classified correctly (this is guaranteed to terminate only if the data are linearly separable).
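
Below is a minimal NumPy sketch of this error-driven rule; the toy dataset, learning rate, and epoch limit are illustrative assumptions.

import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Perceptron learning: update w and b only on misclassified examples."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):                  # labels yi are -1 or +1
            y_hat = 1 if (w @ xi + b) >= 0 else -1
            if y_hat != yi:                       # wrong prediction: error-driven update
                w += eta * yi * xi
                b += eta * yi
                errors += 1
        if errors == 0:                           # all examples classified correctly
            break
    return w, b

# Toy linearly separable data: label +1 when x1 + x2 is large, -1 otherwise
X = np.array([[0.0, 0.0], [0.0, 1.5], [1.5, 0.0], [1.5, 1.5]])
y = np.array([-1, 1, 1, 1])
w, b = train_perceptron(X, y)
print(w, b)   # weights and bias of a separating line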

The Perceptron’s Achilles’ Heel: The XOR Problem

As a linear classifier, the Perceptron has a famous limitation—it cannot solve the Exclusive OR (XOR) problem.

The XOR logic is as follows:

\(x_1\) \(x_2\) \(y\)
0 0 0
0 1 1
1 0 1
1 1 0

Visualizing the XOR Problem: Linearly Inseparable

It is impossible to draw a single straight line to separate the blue squares (y=0) from the orange triangles (y=1).

Figure: The four XOR points in the (x₁, x₂) plane. The points with y = 0 and the points with y = 1 cannot be perfectly separated by a single straight line.

The Solution: Stacking Neurons to Form a Network

The solution to the XOR problem is to combine multiple neurons into a network. By introducing one or more ‘Hidden Layers’, we can build a Multi-Layer Perceptron (MLP), also known as a Feedforward Neural Network (FNN).

Figure: An MLP with an input layer (p = 2), one hidden layer (3 neurons), and an output layer (1 neuron).

How MLPs Solve the XOR Problem

An MLP with a hidden layer can perform a non-linear transformation on the original input space, mapping it to a new feature space. In this new space, data that was previously linearly inseparable can become linearly separable.

Figure: The XOR points are linearly inseparable in the original input space (x₁, x₂), but after the hidden layer's non-linear transformation they become linearly separable in the new feature space (h₁, h₂).
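
As a quick sanity check of this claim, here is a small sketch using scikit-learn's MLPClassifier (the same class used later in this chapter); the hidden-layer size and other hyperparameters are arbitrary illustrative choices.

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])   # XOR labels

# One hidden layer is enough to make XOR separable in the learned feature space
mlp = MLPClassifier(hidden_layer_sizes=(8,), activation='tanh',
                    solver='lbfgs', max_iter=2000, random_state=0)
mlp.fit(X, y)
print(mlp.predict(X))   # expected to recover the XOR pattern: [0 1 1 0]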

Mathematical Representation of an MLP: Layer by Layer

Consider an L-layer MLP. For the \(l\)-th layer (where \(l=1, \dots, L\)):

  • Linear Transformation: \[ \large{\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{y}^{(l-1)} + \mathbf{b}^{(l)}} \]
  • Non-linear Activation: \[ \large{\mathbf{y}^{(l)} = f^{(l)}(\mathbf{z}^{(l)})} \]

Where:

  • \(\mathbf{y}^{(l-1)}\) is the output of the \((l-1)\)-th layer (or the original input \(\mathbf{x}\) when \(l=1\)).
  • \(\mathbf{W}^{(l)}\) and \(\mathbf{b}^{(l)}\) are the weight matrix and bias vector for the \(l\)-th layer.
  • \(f^{(l)}\) is the activation function for the \(l\)-th layer.
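
A minimal NumPy sketch of this layer-by-layer computation for a 2-3-1 network with a ReLU hidden layer and a sigmoid output; the weights are random placeholders rather than trained values.

import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# W^(l) has shape (units in layer l, units in layer l-1)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)   # hidden layer: 2 inputs -> 3 neurons
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # output layer: 3 -> 1

x = np.array([0.5, -1.0])    # y^(0) = x
z1 = W1 @ x + b1             # layer 1: linear transformation
y1 = relu(z1)                # layer 1: non-linear activation
z2 = W2 @ y1 + b2            # layer 2: linear transformation
y2 = sigmoid(z2)             # layer 2: output in (0, 1)
print(y2)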

Network Architecture: Depth vs. Width

Width

  • The number of neurons in a hidden layer.
  • Wider networks can learn more complex features at a given layer.
  • Risk: Prone to overfitting.

Depth

  • The number of hidden layers.
  • Deeper networks can learn a hierarchy of features (from simple to complex).
  • Universal Approximation Theorem: a network with a single, sufficiently wide hidden layer can approximate any continuous function on a bounded domain to arbitrary accuracy. In practice, however, deep networks are often far more parameter-efficient than shallow, wide ones.

How to Train an MLP: The Core Idea

We have the network structure, but how do we find good values for its thousands (or millions) of parameters (all the W’s and b’s)?

  1. Define a Loss Function: First, we need a function to measure how ‘bad’ the model’s predictions are.
    • Regression: Mean Squared Error (MSE)
    • Classification: Cross-Entropy
  2. Objective: Find the set of parameters \((\mathbf{W}, \mathbf{b})\) that minimizes the total loss over the entire training set.
  3. Method: Use the Gradient Descent algorithm.

An Intuitive Understanding of Gradient Descent

Imagine you are on a dark mountain and your goal is to walk to the lowest point in the valley.

  1. You feel around with your foot to find the direction of the steepest slope (this is the gradient).
  2. You take a small step in the direction of the steepest descent.
  3. You repeat this process, step by step, making your way down to the valley floor.

Figure: Gradient descent in a non-convex loss landscape. Starting from an initial point, the algorithm iterates along the negative gradient, but it may become trapped in a local minimum instead of reaching the global minimum in the deepest valley.

The Mathematics of Gradient Descent

The parameter update rule is: \[ \large{\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla_{\theta} J(\theta)} \]

  • \(\theta\): Represents all model parameters (W, b).
  • \(J(\theta)\): The loss function.
  • \(\nabla_{\theta} J(\theta)\): The gradient of the loss function with respect to the parameters. It points in the direction of the steepest ascent.
  • \(-\nabla_{\theta} J(\theta)\): Points in the direction of the steepest descent.
  • \(\eta\): The learning rate, which determines the size of each step.
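
A minimal sketch of this update rule on a one-parameter loss J(θ) = (θ − 3)², whose gradient is 2(θ − 3); the starting point, learning rate, and iteration count are arbitrary.

def grad_J(theta):
    return 2.0 * (theta - 3.0)             # gradient of J(theta) = (theta - 3)^2

theta = 0.0                                # initial guess
eta = 0.1                                  # learning rate
for _ in range(100):
    theta = theta - eta * grad_J(theta)    # step in the direction of steepest descent
print(theta)                               # converges to the minimizer theta = 3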

The Biggest Challenge: How to Compute the Gradient?

For a deep network, the loss function is an extremely complex composite function of thousands or millions of parameters.

\[ \large{L = f_L(f_{L-1}(\dots f_1(\mathbf{x}; \mathbf{W}^{(1)}, \mathbf{b}^{(1)}); \dots); \mathbf{W}^{(L)}, \mathbf{b}^{(L)})} \]

Differentiating this expression directly, parameter by parameter, is hopelessly inefficient. We need an algorithm that computes the entire gradient efficiently.

The Solution: The Backpropagation Algorithm

The Backpropagation (BP) algorithm is the cornerstone of training neural networks. It is essentially an efficient application of the Chain Rule from calculus to a neural network.

It involves two phases:

  1. Forward Pass: From input to output, compute the prediction and the loss.
  2. Backward Pass: From output to input, compute the gradient of the loss with respect to the parameters of each layer.

Figure: Forward and backward propagation through a network x → h₁ → h₂ → ŷ → L. The forward pass computes h₁ = σ(W₁x), h₂ = σ(W₂h₁), ŷ = σ(W₃h₂), and the loss L = Cost(ŷ, y); the backward pass propagates ∂L/∂ŷ, ∂L/∂h₂, ∂L/∂h₁ back through the layers to obtain the gradients ∇W₃, ∇W₂, ∇W₁.

The Core of Backpropagation: The Chain Rule

If we have \(y = f(u)\) and \(u = g(x)\), then the derivative of \(y\) with respect to \(x\) is: \[ \large{\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \cdot \frac{\partial u}{\partial x}} \] In a neural network, the loss \(L\) is a function of the final layer’s output \(\mathbf{y}^{(L)}\), which is a function of \(\mathbf{z}^{(L)}\), which in turn is a function of the previous layer’s output \(\mathbf{y}^{(L-1)}\) and parameters \(\mathbf{W}^{(L)}, \mathbf{b}^{(L)}\), and so on.

Backpropagation uses the chain rule to efficiently pass the ‘gradient signal’ from the last layer all the way back to the first.
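
A tiny worked example of this idea, assuming a single sigmoid neuron with a squared-error loss: the gradient with respect to the weight is the product of the local derivatives collected while moving backwards through the computation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: x -> z = w*x + b -> y_hat = sigmoid(z) -> L = (y_hat - y)^2
x, y, w, b = 2.0, 1.0, 0.5, 0.0
z = w * x + b
y_hat = sigmoid(z)
L = (y_hat - y) ** 2

# Backward pass (chain rule): dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
dL_dyhat = 2 * (y_hat - y)
dyhat_dz = y_hat * (1 - y_hat)
dz_dw = x
dL_dw = dL_dyhat * dyhat_dz * dz_dw
print(dL_dw)

# Check against a central finite-difference approximation
h = 1e-6
L_plus = (sigmoid((w + h) * x + b) - y) ** 2
L_minus = (sigmoid((w - h) * x + b) - y) ** 2
print((L_plus - L_minus) / (2 * h))   # should match dL_dw closely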

In Practice: Predicting U.S. Recessions with Python

Enough theory. Let’s look at a practical economic application. We will build an MLP using scikit-learn to predict whether the U.S. economy is in a recession.

  • Target Variable: The NBER recession indicator USREC (1=Recession, 0=Expansion).
  • Feature Variables: We will select some common macroeconomic indicators.
  • Task: This is a binary classification problem.

Feature Selection

We will use three classic leading indicators for economic recessions:

  1. Term Spread: The difference between the 10-year and 3-month Treasury yields (T10Y3M). An inverted yield curve (spread < 0) is a strong recession signal.
  2. Unemployment Rate: (UNRATE). Recessions are typically accompanied by a rise in unemployment.
  3. Consumer Sentiment: (UMCSENT). A decline in consumer confidence suggests that future consumer spending may decrease, dragging down the economy.

Step 1: Acquiring and Preparing the Data

We will use the fredapi package to fetch data from the St. Louis Fed’s FRED database. For a real project, you would need to request your own free API key.

To ensure the code is runnable without an API key, we will generate a mock dataset here that has similar statistical properties to the real data.

import pandas as pd
import numpy as np

# --- MOCK DATA GENERATION ---
# In a real scenario, you would use fredapi to fetch data.
# For reproducibility, we create a mock dataset here.
def create_mock_fred_data(start_date='1970-01-01', end_date='2023-12-31'):
    dates = pd.date_range(start=start_date, end=end_date, freq='MS')
    n = len(dates)
    
    # Simulate term spread (can be negative, cyclical)
    term_spread = 1.5 + np.sin(np.linspace(0, 10 * np.pi, n)) * 2 + np.random.randn(n) * 0.5
    
    # Simulate unemployment (positive, negatively correlated with spread)
    unemployment = 6 - 0.8 * term_spread + np.random.randn(n) * 0.5
    unemployment = np.clip(unemployment, 3, 10) # Keep in a reasonable range
    
    # Simulate consumer sentiment (positive, negatively correlated with unemployment)
    consumer_sentiment = 100 - 3 * unemployment + np.random.randn(n) * 5
    consumer_sentiment = np.clip(consumer_sentiment, 50, 110)
    
    # Simulate recession (probability increases when spread is low and unemployment is high)
    recession_prob = 1 / (1 + np.exp(-( -2 * term_spread + 1.5 * (unemployment - 6) - 5)))
    recession = (np.random.rand(n) < recession_prob).astype(int)
    
    df = pd.DataFrame({
        'recession': recession,
        'term_spread': term_spread,
        'unemployment': unemployment,
        'consumer_sentiment': consumer_sentiment
    }, index=dates)
    
    return df

df = create_mock_fred_data()
# --- END MOCK DATA ---

print('Data Preview:')
print(df.head())
print('\nDescriptive Statistics:')
print(df.describe().round(2))
Data Preview:
            recession  term_spread  unemployment  consumer_sentiment
1970-01-01          0     1.686008      4.876468           85.663417
1970-02-01          0     1.466348      4.953032           78.748148
1970-03-01          0     1.368672      5.258407           85.458114
1970-04-01          0     2.143784      3.673283           87.587612
1970-05-01          0     1.770527      4.834233           78.320318

Descriptive Statistics:
       recession  term_spread  unemployment  consumer_sentiment
count     648.00       648.00        648.00              648.00
mean        0.02         1.51          4.85               85.45
std         0.13         1.51          1.27                6.32
min         0.00        -2.12          3.00               67.16
25%         0.00         0.21          3.79               81.01
50%         0.00         1.52          4.78               85.47
75%         0.00         2.84          5.91               89.84
max         1.00         4.74          8.33              104.32

Step 2: Define Features and Target, and Split the Dataset

  • Features (X): term_spread, unemployment, consumer_sentiment
  • Target (y): recession
  • We will split the data into a training set (80%) and a test set (20%).
  • stratify=y ensures that the proportion of recession and expansion periods is the same in both the training and test sets, which is crucial for imbalanced datasets.
from sklearn.model_selection import train_test_split

X = df[['term_spread', 'unemployment', 'consumer_sentiment']]
y = df['recession']

# Split dataset (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f'Training set size: {X_train.shape} samples')
print(f'Test set size: {X_test.shape} samples')
print(f'Recession proportion in training set: {y_train.mean():.2%}')
print(f'Recession proportion in test set: {y_test.mean():.2%}')
Training set size: (518, 3) samples
Test set size: (130, 3) samples
Recession proportion in training set: 1.93%
Recession proportion in test set: 1.54%

Step 3: Feature Scaling

Neural networks are very sensitive to the scale of input features. If different features have vastly different numerical ranges, the training process can become unstable. Standardization, which rescales every feature to have a mean of 0 and a standard deviation of 1, is therefore a crucial preprocessing step.

Note: we fit_transform only on the training set. The test set must be transformed using the same scaling rules learned from the training set, to avoid data leakage.

from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same transformation to the test data
X_test_scaled = scaler.transform(X_test)

print('Pre-scaling train set means:', np.mean(X_train, axis=0).values.round(2))
print('Post-scaling train set means:', np.mean(X_train_scaled, axis=0).round(2))
print('Post-scaling train set std devs:', np.std(X_train_scaled, axis=0).round(2))
Pre-scaling train set means: [ 1.49  4.86 85.55]
Post-scaling train set means: [0. 0. 0.]
Post-scaling train set std devs: [1. 1. 1.]

Step 4: Building and Training the MLP Model

We use sklearn.neural_network.MLPClassifier to build the model.

  • hidden_layer_sizes=(50, 50): Defines a network with two hidden layers, each with 50 neurons.
  • activation='relu': Use the ReLU activation function for the hidden layers.
  • solver='adam': Adam is an efficient gradient-descent-based optimization algorithm.
  • max_iter=500: The maximum number of training epochs.

from sklearn.neural_network import MLPClassifier

# Build the MLP model
mlp = MLPClassifier(
    hidden_layer_sizes=(50, 50), 
    activation='relu',
    solver='adam',
    max_iter=500,
    random_state=42
)

# Train the model
print('Starting model training...')
mlp.fit(X_train_scaled, y_train)
print('Model training complete!')
Starting model training...
Model training complete!

Step 5: Evaluating Model Performance

After training, we evaluate the model’s performance on the test set to check its generalization ability. We will use a classification report, which includes several key metrics:

  • Precision: Of all instances predicted as ‘Recession’, how many were actually recessions? (TP / (TP + FP))
  • Recall: Of all actual ‘Recession’ instances, how many did the model successfully identify? (TP / (TP + FN))
  • F1-score: The harmonic mean of precision and recall.
from sklearn.metrics import classification_report

# Make predictions on the test set
y_pred = mlp.predict(X_test_scaled)

print('Classification Report (Test Set):')
# target_names provides labels for classes 0 and 1
print(classification_report(y_test, y_pred, target_names=['Expansion', 'Recession']))
Classification Report (Test Set):
              precision    recall  f1-score   support

   Expansion       0.98      1.00      0.99       128
   Recession       0.00      0.00      0.00         2

    accuracy                           0.98       130
   macro avg       0.49      0.50      0.50       130
weighted avg       0.97      0.98      0.98       130

Visualizing the Confusion Matrix

A confusion matrix provides a clear visual breakdown of the model’s performance across different classes.

  • Top-Left (TN): True Negative (Correctly predicted ‘Expansion’)
  • Bottom-Right (TP): True Positive (Correctly predicted ‘Recession’)
  • Top-Right (FP): False Positive (Predicted ‘Recession’ but was ‘Expansion’ - Type I Error)
  • Bottom-Left (FN): False Negative (Predicted ‘Expansion’ but was ‘Recession’ - Type II Error, often more costly)
Figure 1: Confusion Matrix for the Recession Prediction Model
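
The figure can be reproduced with scikit-learn's ConfusionMatrixDisplay; this is a sketch that assumes matplotlib is installed, and the exact styling of Figure 1 may differ.

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Plot the confusion matrix for the test-set predictions
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred, display_labels=['Expansion', 'Recession'], cmap='Blues'
)
plt.title('Confusion Matrix for the Recession Prediction Model')
plt.show()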

Introduction: Convolutional Neural Networks (CNNs)

The MLPs we’ve discussed are fully-connected, meaning every neuron in a layer is connected to every neuron in the previous layer.

When processing data with spatial or temporal structure, like images or time series, the number of parameters in a fully-connected network explodes, and it fails to leverage the local structure of the data.

A Convolutional Neural Network (CNN) is a special type of feedforward network that addresses these issues through local connectivity and weight sharing.

The Core Idea of CNNs: Analyzing Data Like a Visual System

CNNs are inspired by the biological visual cortex.

  1. Receptive Field: Each neuron focuses only on a small region of the input (local connectivity).
  2. Feature Map: A ‘filter’ or ‘kernel’ slides across the entire input, searching for a specific pattern (like an edge or corner) and generating a feature map (weight sharing).

This is like how we look at a photo: we don’t process every pixel at once, but rather identify local lines and shapes first, then combine them into more complex objects.

The Key Layer of a CNN: The Convolutional Layer

The convolutional layer is the core of a CNN. It slides a small kernel over the input data; at each position it computes the sum of the element-wise products between the kernel and the corresponding input patch, plus a bias. This process extracts local features from the input.

Figure: A convolution operation. A 3×3 kernel slides over a 5×5 input; at each position the element-wise products between the kernel and the overlapping 3×3 input patch are summed (here giving 4), producing one entry of the 3×3 feature map.
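
A minimal NumPy sketch of this operation (strictly speaking a 'valid' cross-correlation, which is what deep learning libraries actually compute); the 5×5 input and 3×3 kernel below are illustrative values, and the top-left output entry works out to 4, matching the figure.

import numpy as np

def conv2d(X, K):
    """'Valid' 2D cross-correlation: slide K over X and sum the element-wise products."""
    h, w = K.shape
    out_h, out_w = X.shape[0] - h + 1, X.shape[1] - w + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(X[i:i+h, j:j+w] * K)   # one entry of the feature map
    return out

X = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0]])
K = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 0, 1]])
print(conv2d(X, K))   # a 3x3 feature map; the top-left entry is 4.0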

The Key Layer of a CNN: The Pooling Layer

A pooling layer (or downsampling layer) typically follows a convolutional layer.

Purpose:

  1. Dimensionality Reduction: Reduces the size of the feature map, thereby decreasing computational load and the number of parameters.
  2. Invariance: Makes the model less sensitive to small translations or rotations in the input.

Common Methods:

  • Max Pooling: Takes the maximum value from a region.
  • Average Pooling: Calculates the average value of a region.

Visualizing Max Pooling

The diagram below shows a 2x2 max pooling operation on a 4x4 feature map.

Figure: 2×2 max pooling. A 4×4 feature map is downsampled to a 2×2 output by taking the maximum value within each 2×2 block.
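
A minimal NumPy sketch of non-overlapping 2×2 max pooling; the 4×4 input values are illustrative.

import numpy as np

def max_pool2d(X, size=2, stride=2):
    """Non-overlapping max pooling: keep the maximum of each size x size block."""
    out_h, out_w = X.shape[0] // stride, X.shape[1] // stride
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            block = X[i*stride:i*stride + size, j*stride:j*stride + size]
            out[i, j] = block.max()
    return out

X = np.array([[3, 8, 2, 4],
              [5, 1, 9, 3],
              [1, 3, 6, 7],
              [2, 3, 4, 5]])
print(max_pool2d(X))   # [[8. 9.] [3. 7.]]: each entry is the max of one 2x2 block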

CNN Applications in Economics?

Although CNNs were born from image recognition, their core idea of recognizing local patterns can be applied to economics:

  • Time Series Analysis: A financial time series (e.g., stock prices) can be treated as a 1D ‘image’. CNNs can be used to identify technical analysis patterns like ‘head and shoulders’ or ‘double bottoms’.
  • Textual Analysis: A matrix of word vectors from a sentence can be treated as a 2D image. CNNs can extract local semantic features for analyzing the sentiment or topics of financial reports and news articles.
  • Satellite Imagery Analysis: Using satellite data like nighttime lights or ships in ports to predict regional economic activity.

A Brief History of Neural Networks: A Tour of Famous Models

From LeNet in 1998 to the deep learning boom that began in 2012, the field has produced a series of landmark CNN architectures. Understanding them helps us appreciate how networks have become progressively deeper and more powerful.

  • LeNet-5 (1998): The ancestor of modern CNNs.
  • AlexNet (2012): Ignited the deep learning revolution; first to use ReLU and Dropout.
  • VGGNet (2014): Demonstrated the importance of network depth.
  • GoogLeNet (2014): Introduced the ‘Inception module’, improving network width and efficiency.
  • ResNet (2015): Introduced ‘residual connections’, solving the training problem for extremely deep networks.

LeNet-5 (1998): The Founder of a Classic Architecture

Proposed by Yann LeCun for recognizing handwritten digits on checks. Its classic architecture [CONV -> POOL -> CONV -> POOL -> FC -> OUTPUT] is still influential today.

Figure: Simplified LeNet-5 architecture. Input (32×32) → C1 convolution → S2 pooling → C3 convolution → S4 pooling → fully connected layers → output.

AlexNet (2012): The ‘Big Bang’ of Deep Learning

AlexNet won the 2012 ImageNet competition by a massive margin, heralding the dawn of the deep learning era.

Key Contributions:

  • The first successful large-scale training of a deep CNN (on the ImageNet dataset).
  • Widespread use of the ReLU activation function, which significantly sped up training.
  • Use of Dropout to prevent overfitting.
  • Use of GPUs for parallel computation, making it possible to train large models.

VGGNet (2014): Depth is Power

The VGG team explored a simple but profound question: does making the network deeper improve performance?

Core Idea:

  • Minimalism: Used only small 3x3 convolution kernels and 2x2 pooling layers.
  • Stacking: By repeatedly stacking these simple blocks, they built very deep networks (e.g., VGG16, VGG19).

VGG proved that, to a certain extent, increasing network depth can significantly boost performance.

Figure: Simplified VGGNet structure. Depth is built by repeatedly stacking simple [Conv × N → Pool] blocks.

GoogLeNet (2014): Wider and More Efficient Networks

Google’s GoogLeNet (aka Inception-v1) defeated VGGNet in the same year’s ImageNet competition.

Core Idea: The Inception Module

  • It uses different-sized convolution kernels (1x1, 3x3, 5x5) and a pooling operation in parallel, then concatenates their results. This allows the network to learn features at different scales within the same layer.
  • It extensively uses 1x1 convolutions for dimensionality reduction, which drastically reduces the number of parameters.

Figure: The GoogLeNet Inception module. The input is processed by four parallel branches (a 1×1 convolution; a 1×1 then 3×3 convolution; a 1×1 then 5×5 convolution; a 3×3 pooling then 1×1 convolution), and their outputs are concatenated, giving multi-scale feature extraction within a single layer.

ResNet (2015): Bridging the Depth Gap

As networks get extremely deep, a ‘degradation’ problem emerges: the training error of a deeper network is higher than that of its shallower counterpart. ResNet (Residual Network), proposed by Kaiming He et al. at Microsoft Research Asia, elegantly solved this problem.

Core Idea: The Shortcut / Skip Connection

  • It allows information to ‘skip’ one or more layers. The network no longer needs to learn an identity mapping from scratch; it only needs to learn the ‘residual’ between the input and the output. \[ \large{H(x) = F(x) + x} \]

ResNet’s innovation made it possible to train ultra-deep networks of hundreds or even thousands of layers.

Figure: A ResNet residual block. The input x passes through the weight layers, producing F(x); an identity shortcut connection adds x back, so the block's output is H(x) = F(x) + x.

Chapter Summary

  • Why do we need neural networks? The economic world is full of non-linearity, and traditional linear models have their limits. Neural networks are powerful tools for capturing complex patterns.
  • What is the basic principle? Inspired by biological neurons, they process information through weighted sums and non-linear activations. Activation functions (especially ReLU) are key to introducing non-linearity.
  • How do we go from simple to complex? A single neuron (Perceptron) has limited power. By stacking them into layers (MLP), we can solve complex non-linear problems (like XOR).
  • How do networks ‘learn’? By defining a loss function, and then using gradient descent and the backpropagation algorithm to systematically adjust network parameters to minimize prediction error.
  • Are there more specialized networks? Yes. For example, CNNs, which use convolution and pooling, are especially good at handling data with spatial/temporal structure (like time series or images).

Conclusion: A New Paradigm for Economic Modeling

  • Capturing Non-linearity: The core strength of neural networks is their powerful ability to fit non-linear relationships, helping us understand complex economic phenomena that linear models cannot explain.
  • Data-Driven: They are highly data-driven models capable of automatically learning features and patterns from large-scale datasets.
  • A Powerful Toolkit: From simple MLPs to complex CNNs, we have a rich set of tools suitable for various data types and tasks.

Future Outlook and Caveats

  • Explainability (XAI): Neural networks are often called ‘black box’ models because their decision-making processes are not transparent. This is a major hurdle for their application in high-stakes areas like policy advice and credit scoring, and it is a hot research topic.
  • Causal Inference: Neural networks excel at prediction (finding correlations) but cannot be directly used for causal inference. Combining neural networks with causal inference frameworks (like Diff-in-Diff or Instrumental Variables) is a frontier research area.
  • More Models: We only introduced feedforward networks today. For time series data, Recurrent Neural Networks (RNNs) and their variants (like LSTM, GRU) are a more natural choice.

Thank You!

Q & A