It is powerful and highly interpretable, but it rests on a critical assumption: a linear relationship between variables.
The Beauty and Burden of the Linear Assumption
A linear relationship means that for every one-unit increase in an independent variable \(X\), the change in the dependent variable \(Y\) is constant (\(\beta\)).
But is the real world always this simple?
The Real World: A Web of Complex, Non-Linear Relationships
Many economic phenomena cannot be perfectly described by a straight line.
Diminishing Marginal Utility: The happiness gained from an increase in income diminishes at higher income levels.
The Laffer Curve: The relationship between tax rates and tax revenue is an ‘inverted U-shape’.
‘Fear’ and ‘Greed’ in Financial Markets: Asset prices react non-linearly to news, exhibiting thresholds and sharp fluctuations.
When faced with these complex non-linear relationships, traditional econometric models may fall short.
Example 1: Diminishing Marginal Utility
The higher the income, the smaller the increase in happiness from the same amount of money.
Example 2: The Laffer Curve
Higher tax rates are not always better. Excessively high rates can stifle economic activity, leading to a decrease in tax revenue.
This Chapter’s Goal: Introduce a Powerful Non-Linear Tool
In this chapter, we will learn a new modeling paradigm inspired by the workings of the human brain:
Artificial Neural Networks (ANNs)
Our objectives are to:
1. Understand the basic building block of a neural network—the neuron.
2. Grasp how networks introduce non-linearity through activation functions.
3. Learn how to build and train a feedforward neural network.
4. Explore its potential applications in economics and finance.
Inspiration: The Human Brain’s Neuron
Before diving into the math, let’s look at the source of inspiration. A biological neuron consists of three main parts:
Dendrites: Receive signals from other neurons.
Soma (Cell Body): Processes the received signals.
Axon: Transmits the processed signal outwards.
Signals are passed between neurons across a Synapse.
The Mathematical Abstraction: The McCulloch-Pitts Neuron
In 1943, Warren McCulloch and Walter Pitts proposed the first mathematical model of a neuron, known as the ‘M-P model’.
It simulates two key processes of a biological neuron:
Signal Aggregation: It receives input signals from multiple upstream neurons and calculates their weighted sum.
Activation Decision: It compares this weighted sum to a threshold. If the sum exceeds the threshold, the neuron ‘fires’ and outputs a signal; otherwise, it remains ‘inhibited’ and outputs nothing.
M-P Model Step 1: Signal Aggregation
Assume a neuron receives p input signals \(x_1, x_2, \dots, x_p\) from other neurons.
First, a linear transformation (weighted sum) is performed: \[ \large{u = \sum_{i=1}^{p} w_i x_i} \] Here, \(w_i\) represents the ‘weight’ of the \(i\)-th connection, simulating the strength of a synapse. A higher weight means the corresponding input signal is more important.
M-P Model Step 2: Activation Decision
Next, the weighted sum \(u\) is compared with a threshold \(\theta\): \[ \large{y = \begin{cases} 1, & \text{if } u \ge \theta \\ 0, & \text{if } u < \theta \end{cases}} \]
This is an ‘all-or-nothing’ response pattern, like a switch that is either on (1) or off (0).
Graphical Representation of the M-P Model
We can represent the M-P model with a simple computation graph.
A More Convenient Formulation: Introducing the Bias Term
Working with a threshold \(\theta\) is algebraically inconvenient. We can perform a simple transformation.
Let \(b = -\theta\). This \(b\) is called the bias.
Then, the condition \(u \ge \theta\) is equivalent to \(u - \theta \ge 0\), which is \(u + b \ge 0\).
This allows us to treat the bias \(b\) as a special weight whose corresponding input is always 1. \[ \large{z = \left(\sum_{i=1}^{p} w_i x_i\right) + b} \] The activation process then becomes checking if \(z\) is greater than or equal to 0.
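As a quick illustration (with made-up numbers), a single M-P style neuron with a bias term is just a dot product followed by a threshold check:

```python
import numpy as np

# A minimal sketch of an M-P style neuron with a bias (illustrative values)
x = np.array([0.5, -1.0, 2.0])   # input signals x_1..x_p
w = np.array([0.8, 0.2, 0.4])    # connection weights w_1..w_p
b = -0.5                         # bias b = -theta

z = w @ x + b                    # weighted sum plus bias
output = 1 if z >= 0 else 0      # 'fire' if z >= 0, otherwise stay inhibited

print(f'z = {z:.2f}, output = {output}')
```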
The Modern Neuron: From Threshold to Smooth Activation
The M-P model’s ‘all-or-nothing’ activation (a step function) is too crude because it is not differentiable at the threshold.
This is a fatal flaw for using gradient-based optimization algorithms to ‘learn’ the optimal weights and biases.
Therefore, modern artificial neural networks replace the simple threshold with a smooth, differentiable Activation Function \(f(\cdot)\).
\[ \large{z = \mathbf{w}^T \mathbf{x} + b} \]\[ \large{y = f(z) = f(\mathbf{w}^T \mathbf{x} + b)} \] Here, \(y\) is no longer just 0 or 1, but can be a continuous value.
The Soul of the Network: The Activation Function
The activation function is the soul of a neural network. It is responsible for introducing non-linearity into the model.
Key Insight: If there were no activation function (or if it were linear, \(f(x)=x\)), then no matter how many layers you stack, the entire network would be equivalent to a single, simple linear model.
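A quick NumPy check of this insight (random shapes and values, purely illustrative): composing two linear layers is algebraically identical to a single linear layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two 'layers' with no (i.e. linear) activation: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layer = W2 @ (W1 @ x + b1) + b2

# The same map collapses into a single linear model: y = W x + b
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True: stacking alone adds no expressive power
```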
Activation functions fall into two broad families:

Saturated: the function's curve flattens out at both ends. Examples: Sigmoid, Tanh.

Non-Saturated (ReLU-based): the derivative is constant in the positive region. Examples: ReLU, Leaky ReLU.
Saturated Activation 1: The Sigmoid Function
The Sigmoid function, also known as the Logistic function, was one of the most common activation functions in early neural networks.
\[ \large{\sigma(z) = \frac{1}{1 + e^{-z}}} \]
Role: Squeezes any real-valued input into the range \((0, 1)\).
Economic Connection: Its form is identical to the Logit model, making it perfectly suited for outputting probabilities.
Pros and Cons of the Sigmoid Function
Advantages
Output is bounded, allowing for a probabilistic interpretation.
Smooth and differentiable everywhere.
Disadvantages
Vanishing Gradients: The derivative is close to 0 in the saturated regions, making deep networks hard to train.
Not Zero-Centered: The output mean is around 0.5, not 0, which can slow down the convergence of gradient descent.
Visualizing the Sigmoid Function and Its Derivative
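For reference, the derivative plotted here can be written in terms of the function itself (a standard identity):

\[
\large{\sigma'(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)}
\]

It peaks at \(\sigma'(0) = 0.25\) and decays toward 0 in both tails, which is exactly the vanishing-gradient behavior described above.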
Question: Why is the function σ(z) rising for z > 0, while its derivative σ'(z) is falling?
This is an excellent question that gets to the heart of what a derivative represents. Let’s use an analogy: driving a car.
Function σ(z): The distance traveled by the car.
Derivative σ'(z): The car’s instantaneous speed.
Now let's describe the journey:
1. z < 0 (Starting and accelerating): The car is moving forward (distance increases), and you are pressing the gas (speed increases). The σ(z) curve gets steeper.
2. z = 0 (Peak speed): The car is still moving forward (distance increases), but you have reached your maximum speed. This is the steepest point on the σ(z) curve.
3. z > 0 (Approaching destination, easing off the gas): The car is still moving forward (distance increases), but you are easing off the gas, so your speed is decreasing. The σ(z) curve is still rising, but it's becoming less steep.
The derivative’s value tells you the function’s direction. The derivative’s trend tells you about the function’s curvature.
Saturated Activation 2: The Tanh Function
The hyperbolic tangent (Tanh) function is a variant of the Sigmoid: \[ \large{\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\sigma(2z) - 1} \] It squeezes any real-valued input into the range \((-1, 1)\). Unlike the Sigmoid, its output is zero-centered, but it still saturates at both ends.
Non-Saturated Activation: The ReLU Function
The Rectified Linear Unit (ReLU) is the workhorse of modern networks: \[ \large{\text{ReLU}(z) = \max(0, z)} \]
It acts like a gatekeeper: negative values are blocked (set to zero), while positive values pass through unchanged.
Pros and Cons of the ReLU Function
Advantages
Extremely simple to compute (just a max operation).
The derivative is a constant 1 for positive inputs, which alleviates the vanishing gradient problem.
Promotes sparsity in the network (some neurons output 0), reducing the risk of overfitting.
Disadvantages
Not zero-centered.
The Dying ReLU Problem: If a neuron’s input is consistently negative, its gradient will always be 0, and the neuron effectively ‘dies’.
Visualizing the ReLU Function and Its Derivative
Derivative: \(\text{ReLU}'(z) = \begin{cases} 1, & \text{if } z > 0 \\ 0, & \text{if } z \le 0 \end{cases}\)
A ReLU Variant: Leaky ReLU
To solve the ‘Dying ReLU’ problem, researchers proposed Leaky ReLU.
\[ \large{\text{LeakyReLU}(z) = \max(\alpha z, z) = \begin{cases} z, & \text{if } z > 0 \\ \alpha z, & \text{if } z \le 0 \end{cases}} \] where \(\alpha\) is a small positive constant, such as 0.01.
Core Idea: When the input is negative, it has a small, non-zero gradient of \(\alpha\). This ensures that the neuron’s gradient never becomes completely zero, preventing it from ‘dying’.
Visualizing the Leaky ReLU Function and Its Derivative
Derivative: \(\text{LeakyReLU}'(z) = \begin{cases} 1, & \text{if } z > 0 \\ \alpha, & \text{if } z \le 0 \end{cases}\)
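Both functions are one-liners in NumPy; the sketch below (with illustrative input values) implements the two definitions given above:

```python
import numpy as np

def relu(z):
    # ReLU(z) = max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # LeakyReLU(z) = z for z > 0, alpha * z otherwise
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print('ReLU:      ', relu(z))
print('Leaky ReLU:', leaky_relu(z))
```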
Activation Function Choice Strategy
| Layer | Task Type | Recommended Activation | Rationale |
|---|---|---|---|
| Hidden Layers | (General) | ReLU | Fast computation, good performance, the default choice. |
| Hidden Layers | (If ReLU fails) | Leaky ReLU / ELU | Solves the ‘Dying ReLU’ problem. |
| Output Layer | Binary Classification | Sigmoid | Outputs a probability in the (0, 1) range. |
| Output Layer | Multiclass Classification | Softmax | Outputs a probability distribution over all classes, summing to 1. |
| Output Layer | Regression | None (Linear) | Outputs a continuous value in any range. |
Rule of Thumb: Never start with Sigmoid as a hidden layer activation. Default to ReLU, and try others only if performance is poor.
From a Single Neuron to a Network: The Perceptron
In 1957, Frank Rosenblatt introduced the Perceptron, which can be considered the first complete, learnable neural network model.
Structure: A single M-P model neuron.
Activation Function: The sign function, which outputs -1 or 1. \[ \large{\hat{y} = \text{sign}(\mathbf{w}^T \mathbf{x} + b)} \]
Capability: The Perceptron is a linear classifier. It can find a line (or hyperplane) in the feature space to separate data points into two classes.
The Perceptron Learning Algorithm: Error-Driven
The Perceptron’s learning rule is very intuitive: ‘Correct mistakes as you see them’. The steps are listed below, followed by a minimal code sketch.
Initialize weights \(\mathbf{w}\) and bias \(b\).
For each training example \((\mathbf{x}, y)\):
Make a prediction \(\hat{y}\) using the current parameters.
If the prediction is wrong (\(y \neq \hat{y}\)), update the parameters: \[ \large{\mathbf{w} \leftarrow \mathbf{w} + \eta y \mathbf{x}} \]\[ \large{b \leftarrow b + \eta y} \] where \(\eta\) is the learning rate.
If the prediction is correct, do nothing.
Repeat step 2 until all examples are classified correctly.
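Here is a minimal NumPy sketch of this error-driven loop on a small, made-up linearly separable dataset (labels in \(\{-1, +1\}\), matching the sign activation):

```python
import numpy as np

# Toy linearly separable data: labels in {-1, +1} (illustrative values)
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])

w = np.zeros(2)   # weights
b = 0.0           # bias
eta = 0.1         # learning rate

for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        y_hat = 1 if (w @ xi + b) >= 0 else -1   # sign activation
        if y_hat != yi:                          # error-driven update
            w += eta * yi * xi
            b += eta * yi
            errors += 1
    if errors == 0:                              # converged: every example classified correctly
        break

print('learned w =', w, 'b =', b, 'epochs used =', epoch + 1)
```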
The Perceptron’s Achilles’ Heel: The XOR Problem
As a linear classifier, the Perceptron has a famous limitation—it cannot solve the Exclusive OR (XOR) problem.
The XOR logic is as follows:
| \(x_1\) | \(x_2\) | \(y\) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
Visualizing the XOR Problem: Linearly Inseparable
It is impossible to draw a single straight line to separate the blue squares (y=0) from the orange triangles (y=1).
The Solution: Stacking Neurons to Form a Network
The solution to the XOR problem is to combine multiple neurons into a network. By introducing one or more ‘Hidden Layers’, we can build a Multi-Layer Perceptron (MLP), also known as a Feedforward Neural Network (FNN).
How MLPs Solve the XOR Problem
An MLP with a hidden layer can perform a non-linear transformation on the original input space, mapping it to a new feature space. In this new space, data that was previously linearly inseparable can become linearly separable.
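As a quick check of this claim, a tiny scikit-learn MLPClassifier can fit the four XOR points that a single Perceptron cannot (the hyperparameters below are illustrative; a different random_state may be needed for convergence):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# The four XOR points and their labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# One small hidden layer is enough to make XOR separable
mlp_xor = MLPClassifier(hidden_layer_sizes=(8,), activation='relu',
                        solver='lbfgs', max_iter=2000, random_state=0)
mlp_xor.fit(X, y)

print('Predictions:', mlp_xor.predict(X))   # ideally [0 1 1 0]
```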
Mathematical Representation of an MLP: Layer by Layer
Consider an L-layer MLP. For the \(l\)-th layer (where \(l=1, \dots, L\)):
Linear Transformation: \[ \large{\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{y}^{(l-1)} + \mathbf{b}^{(l)}} \]
Non-Linear Activation: \[ \large{\mathbf{y}^{(l)} = f^{(l)}\!\left(\mathbf{z}^{(l)}\right)} \]
Where:
* \(\mathbf{y}^{(l-1)}\) is the output of the \((l-1)\)-th layer (or the original input \(\mathbf{x}\) when \(l=1\)).
* \(\mathbf{W}^{(l)}\) and \(\mathbf{b}^{(l)}\) are the weight matrix and bias vector for the \(l\)-th layer.
* \(f^{(l)}\) is the activation function for the \(l\)-th layer.
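A minimal NumPy sketch of this layer-by-layer forward pass, using made-up layer sizes and random weights:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(x, weights, biases, activations):
    # One pass through an L-layer MLP: z^(l) = W^(l) y^(l-1) + b^(l), y^(l) = f^(l)(z^(l))
    y = x
    for W, b, f in zip(weights, biases, activations):
        z = W @ y + b
        y = f(z)
    return y

rng = np.random.default_rng(42)
# A toy 3 -> 5 -> 1 network (hypothetical sizes, for illustration only)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(1, 5))]
biases = [np.zeros(5), np.zeros(1)]
activations = [relu, lambda z: 1 / (1 + np.exp(-z))]  # ReLU hidden layer, sigmoid output

x = rng.normal(size=3)
print('network output:', forward(x, weights, biases, activations))
```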
Network Architecture: Depth vs. Width
Width
The number of neurons in a hidden layer.
Wider networks can learn more complex features at a given layer.
Risk: Prone to overfitting.
Depth
The number of hidden layers.
Deeper networks can learn a hierarchy of features (from simple to complex).
Universal Approximation Theorem: A single hidden layer network with enough width can approximate any continuous function. However, in practice, deep networks are often more efficient than shallow, wide ones.
How to Train an MLP: The Core Idea
We have the network structure, but how do we find the optimal values for the thousands (or millions) of parameters (all the W’s and b’s)?
Define a Loss Function: First, we need a function to measure how ‘bad’ the model’s predictions are (standard forms of both losses are written out after this list).
Regression: Mean Squared Error (MSE)
Classification: Cross-Entropy
Objective: Find the set of parameters \((\mathbf{W}, \mathbf{b})\) that minimizes the total loss over the entire training set.
Method: Use the Gradient Descent algorithm.
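For reference, the two losses mentioned above in their standard forms, for \(n\) examples with targets \(y_i\) and predictions \(\hat{y}_i\):

\[
\large{\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
\]
\[
\large{\text{Cross-Entropy (binary)} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\!\left(1 - \hat{y}_i\right) \right]}
\]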
An Intuitive Understanding of Gradient Descent
Imagine you are on a dark mountain and your goal is to walk to the lowest point in the valley.
You feel around with your foot to find the direction of the steepest slope (this is the gradient).
You take a small step in the direction of the steepest descent.
You repeat this process, step by step, making your way down to the valley floor.
The Mathematics of Gradient Descent
The parameter update rule is: \[
\large{\theta_{\text{new}} = \theta_{\text{old}} - \eta \nabla_{\theta} J(\theta)}
\]
\(\theta\): Represents all model parameters (W, b).
\(J(\theta)\): The loss function.
\(\nabla_{\theta} J(\theta)\): The gradient of the loss function with respect to the parameters. It points in the direction of the steepest ascent.
\(-\nabla_{\theta} J(\theta)\): Points in the direction of the steepest descent.
\(\eta\): The learning rate, which determines the size of each step.
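A minimal sketch of this update rule on a toy one-parameter loss \(J(\theta) = (\theta - 3)^2\) (a made-up example, just to show the iteration):

```python
# Gradient descent on J(theta) = (theta - 3)^2, whose minimum is at theta = 3
def grad(theta):
    # Gradient of J: dJ/dtheta = 2 * (theta - 3)
    return 2 * (theta - 3)

theta = 0.0   # initial parameter value
eta = 0.1     # learning rate

for step in range(100):
    theta = theta - eta * grad(theta)   # theta_new = theta_old - eta * gradient

print(f'theta after gradient descent: {theta:.4f}  (minimum is at 3)')
```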
The Biggest Challenge: How to Compute the Gradient?
For a deep network, the loss function is an extremely complex composite function of thousands or millions of parameters.
Computing these derivatives naively, one parameter at a time, is prohibitively expensive. We need an efficient algorithm to compute this gradient.
The Solution: The Backpropagation Algorithm
The Backpropagation (BP) algorithm is the cornerstone of training neural networks. It is essentially an efficient application of the Chain Rule from calculus to a neural network.
It involves two phases:
Forward Pass: From input to output, compute the prediction and the loss.
Backward Pass: From output to input, compute the gradient of the loss with respect to the parameters of each layer.
The Core of Backpropagation: The Chain Rule
If we have \(y = f(u)\) and \(u = g(x)\), then the derivative of \(y\) with respect to \(x\) is: \[ \large{\frac{\partial y}{\partial x} = \frac{\partial y}{\partial u} \cdot \frac{\partial u}{\partial x}} \] In a neural network, the loss \(L\) is a function of the final layer’s output \(\mathbf{y}^{(L)}\), which is a function of \(\mathbf{z}^{(L)}\), which in turn is a function of the previous layer’s output \(\mathbf{y}^{(L-1)}\) and parameters \(\mathbf{W}^{(L)}, \mathbf{b}^{(L)}\), and so on.
Backpropagation uses the chain rule to efficiently pass the ‘gradient signal’ from the last layer all the way back to the first.
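To make the chain rule concrete, here is a hand-computed forward and backward pass for a single sigmoid neuron with a squared-error loss (made-up numbers, not the network from the text):

```python
import numpy as np

# Forward and backward pass for one sigmoid neuron, L = 0.5 * (y - y_true)^2
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.3])
b = 0.1
y_true = 1.0

# Forward pass
z = w @ x + b                      # linear part
y = 1 / (1 + np.exp(-z))           # sigmoid activation
L = 0.5 * (y - y_true) ** 2        # loss

# Backward pass: chain rule dL/dw = dL/dy * dy/dz * dz/dw
dL_dy = y - y_true
dy_dz = y * (1 - y)                # sigmoid derivative
dz_dw = x
dz_db = 1.0

grad_w = dL_dy * dy_dz * dz_dw
grad_b = dL_dy * dy_dz * dz_db

print('loss:', round(L, 4), 'grad_w:', grad_w.round(4), 'grad_b:', round(grad_b, 4))
```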
In Practice: Predicting U.S. Recessions with Python
Enough theory. Let’s look at a practical economic application. We will build an MLP using scikit-learn to predict whether the U.S. economy is in a recession.
Target Variable: The NBER recession indicator USREC (1=Recession, 0=Expansion).
Feature Variables: We will select some common macroeconomic indicators.
Task: This is a binary classification problem.
Feature Selection
We will use three classic leading indicators for economic recessions:
Term Spread: The difference between the 10-year and 3-month Treasury yields (T10Y3M). An inverted yield curve (spread < 0) is a strong recession signal.
Unemployment Rate: (UNRATE). Recessions are typically accompanied by a rise in unemployment.
Consumer Sentiment: (UMCSENT). A decline in consumer confidence suggests that future consumer spending may decrease, dragging down the economy.
We will use the fredapi package to fetch data from the St. Louis Fed’s FRED database. For a real project, you would need to request your own free API key.
To ensure the code is runnable without an API key, we will generate a mock dataset here that has similar statistical properties to the real data.
```python
import pandas as pd
import numpy as np

# --- MOCK DATA GENERATION ---
# In a real scenario, you would use fredapi to fetch data.
# For reproducibility, we create a mock dataset here.
def create_mock_fred_data(start_date='1970-01-01', end_date='2023-12-31'):
    dates = pd.date_range(start=start_date, end=end_date, freq='MS')
    n = len(dates)
    # Simulate term spread (can be negative, cyclical)
    term_spread = 1.5 + np.sin(np.linspace(0, 10 * np.pi, n)) * 2 + np.random.randn(n) * 0.5
    # Simulate unemployment (positive, negatively correlated with spread)
    unemployment = 6 - 0.8 * term_spread + np.random.randn(n) * 0.5
    unemployment = np.clip(unemployment, 3, 10)  # Keep in a reasonable range
    # Simulate consumer sentiment (positive, negatively correlated with unemployment)
    consumer_sentiment = 100 - 3 * unemployment + np.random.randn(n) * 5
    consumer_sentiment = np.clip(consumer_sentiment, 50, 110)
    # Simulate recession (probability increases when spread is low and unemployment is high)
    recession_prob = 1 / (1 + np.exp(-(-2 * term_spread + 1.5 * (unemployment - 6) - 5)))
    recession = (np.random.rand(n) < recession_prob).astype(int)
    df = pd.DataFrame({
        'recession': recession,
        'term_spread': term_spread,
        'unemployment': unemployment,
        'consumer_sentiment': consumer_sentiment
    }, index=dates)
    return df

df = create_mock_fred_data()
# --- END MOCK DATA ---

print('Data Preview:')
print(df.head())
print('\nDescriptive Statistics:')
print(df.describe().round(2))
```
Features (X): term_spread, unemployment, consumer_sentiment
Target (y): recession
We will split the data into a training set (80%) and a test set (20%).
stratify=y ensures that the proportion of recession and expansion periods is the same in both the training and test sets, which is crucial for imbalanced datasets.
```python
from sklearn.model_selection import train_test_split

X = df[['term_spread', 'unemployment', 'consumer_sentiment']]
y = df['recession']

# Split dataset (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f'Training set size: {X_train.shape} samples')
print(f'Test set size: {X_test.shape} samples')
print(f'Recession proportion in training set: {y_train.mean():.2%}')
print(f'Recession proportion in test set: {y_test.mean():.2%}')
```
Training set size: (518, 3) samples
Test set size: (130, 3) samples
Recession proportion in training set: 1.93%
Recession proportion in test set: 1.54%
Neural networks are very sensitive to the scale of input features. If different features have vastly different numerical ranges, the training process can become unstable. Standardization, which scales all features to have a mean of 0 and a standard deviation of 1, is a crucial preprocessing step. Note: We fit_transform only on the training set. The test set must be transformed using the same scaling rules learned from the training set to avoid data leakage.
```python
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Fit on the training data and transform it
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same transformation to the test data
X_test_scaled = scaler.transform(X_test)

print('Pre-scaling train set means:', np.mean(X_train, axis=0).values.round(2))
print('Post-scaling train set means:', np.mean(X_train_scaled, axis=0).round(2))
print('Post-scaling train set std devs:', np.std(X_train_scaled, axis=0).round(2))
```
Pre-scaling train set means: [ 1.49 4.86 85.55]
Post-scaling train set means: [0. 0. 0.]
Post-scaling train set std devs: [1. 1. 1.]
We use sklearn.neural_network.MLPClassifier to build the model.
* hidden_layer_sizes=(50, 50): Defines a network with two hidden layers, each with 50 neurons.
* activation='relu': Use the ReLU activation function for the hidden layers.
* solver='adam': Adam is an efficient gradient descent optimization algorithm.
* max_iter=500: The maximum number of training epochs.
```python
from sklearn.neural_network import MLPClassifier

# Build the MLP model
mlp = MLPClassifier(
    hidden_layer_sizes=(50, 50),
    activation='relu',
    solver='adam',
    max_iter=500,
    random_state=42
)

# Train the model
print('Starting model training...')
mlp.fit(X_train_scaled, y_train)
print('Model training complete!')
```
Starting model training...
Model training complete!
After training, we evaluate the model’s performance on the test set to check its generalization ability. We will use a classification report, which includes several key metrics:
Precision: Of all instances predicted as ‘Recession’, how many were actually recessions? (TP / (TP + FP))
Recall: Of all actual ‘Recession’ instances, how many did the model successfully identify? (TP / (TP + FN))
F1-score: The harmonic mean of precision and recall.
```python
from sklearn.metrics import classification_report

# Make predictions on the test set
y_pred = mlp.predict(X_test_scaled)

print('Classification Report (Test Set):')
# target_names provides labels for classes 0 and 1
print(classification_report(y_test, y_pred, target_names=['Expansion', 'Recession']))
```
The MLPs we’ve discussed are fully-connected, meaning every neuron in a layer is connected to every neuron in the previous layer.
When processing data with spatial or temporal structure, like images or time series, the number of parameters in a fully-connected network explodes, and it fails to leverage the local structure of the data.
A Convolutional Neural Network (CNN) is a special type of feedforward network that addresses these issues through local connectivity and weight sharing.
The Core Idea of CNNs: Analyzing Data Like a Visual System
CNNs are inspired by the biological visual cortex.
Receptive Field: Each neuron focuses only on a small region of the input (local connectivity).
Feature Map: A ‘filter’ or ‘kernel’ slides across the entire input, searching for a specific pattern (like an edge or corner) and generating a feature map (weight sharing).
This is like how we look at a photo: we don’t process every pixel at once, but rather identify local lines and shapes first, then combine them into more complex objects.
The Key Layer of a CNN: The Convolutional Layer
The convolutional layer is the core of a CNN. It slides a small kernel over the input data, computing the element-wise product sum between the kernel and the corresponding input region at each position, plus a bias. This process extracts local features from the input.
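A minimal NumPy sketch of a 1-D convolution (technically a cross-correlation, as implemented in most deep learning libraries), using an illustrative length-3 kernel:

```python
import numpy as np

# Slide a length-3 kernel over a 1-D signal; at each position take the
# element-wise product sum with the input window, plus a bias
signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 2.0])
kernel = np.array([-1.0, 0.0, 1.0])   # a simple 'difference' filter
bias = 0.0

out_len = len(signal) - len(kernel) + 1
feature_map = np.array([
    signal[i:i + len(kernel)] @ kernel + bias
    for i in range(out_len)
])

print('feature map:', feature_map)   # [ 2.  2.  2.  0. -3.]
```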
The Key Layer of a CNN: The Pooling Layer
A pooling layer (or downsampling layer) typically follows a convolutional layer.
Purpose:
Dimensionality Reduction: Reduces the size of the feature map, thereby decreasing computational load and the number of parameters.
Invariance: Makes the model less sensitive to small translations or rotations in the input.
Common Methods:
Max Pooling: Takes the maximum value from a region.
Average Pooling: Calculates the average value of a region.
Visualizing Max Pooling
The diagram below shows a 2x2 max pooling operation on a 4x4 feature map.
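The original diagram is not reproduced here, but a NumPy sketch with made-up values performs the same 2x2 max-pooling operation:

```python
import numpy as np

# 2x2 max pooling with stride 2 over a 4x4 feature map (illustrative values)
feature_map = np.array([
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 6, 8],
])

# Split into 2x2 blocks and take the maximum of each block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6 4]
#  [7 9]]
```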
CNN Applications in Economics?
Although CNNs were born from image recognition, their core idea of recognizing local patterns can be applied to economics:
Time Series Analysis: A financial time series (e.g., stock prices) can be treated as a 1D ‘image’. CNNs can be used to identify technical analysis patterns like ‘head and shoulders’ or ‘double bottoms’.
Textual Analysis: A matrix of word vectors from a sentence can be treated as a 2D image. CNNs can extract local semantic features for analyzing the sentiment or topics of financial reports and news articles.
Satellite Imagery Analysis: Using satellite data like nighttime lights or ships in ports to predict regional economic activity.
A Brief History of Neural Networks: A Tour of Famous Models
From LeNet in the 1990s to the deep learning boom after 2012, the field has produced a series of landmark CNN architectures. Understanding them helps us appreciate how networks have become progressively deeper and more powerful.
LeNet-5 (1998): The ancestor of modern CNNs.
AlexNet (2012): Ignited the deep learning revolution; first to use ReLU and Dropout.
VGGNet (2014): Demonstrated the importance of network depth.
GoogLeNet (2014): Introduced the ‘Inception module’, improving network width and efficiency.
ResNet (2015): Introduced ‘residual connections’, solving the training problem for extremely deep networks.
LeNet-5 (1998): The Founder of a Classic Architecture
Proposed by Yann LeCun for recognizing handwritten digits on checks. Its classic architecture [CONV -> POOL -> CONV -> POOL -> FC -> OUTPUT] is still influential today.
AlexNet (2012): The ‘Big Bang’ of Deep Learning
AlexNet won the 2012 ImageNet competition by a massive margin, heralding the dawn of the deep learning era.
Key Contributions:
The first successful training of a deep CNN.
Widespread use of the ReLU activation function, which significantly sped up training.
Use of Dropout to prevent overfitting.
Use of GPUs for parallel computation, making it possible to train large models.
VGGNet (2014): Depth is Power
The VGG team explored a simple but profound question: does making the network deeper improve performance?
Core Idea:
Minimalism: Used only small 3x3 convolution kernels and 2x2 pooling layers.
Stacking: By repeatedly stacking these simple blocks, they built very deep networks (e.g., VGG16, VGG19).
VGG proved that, to a certain extent, increasing network depth can significantly boost performance.
GoogLeNet (2014): Wider and More Efficient Networks
Google’s GoogLeNet (aka Inception-v1) defeated VGGNet in the same year’s ImageNet competition.
Core Idea: The Inception Module
It uses different-sized convolution kernels (1x1, 3x3, 5x5) and a pooling operation in parallel, then concatenates their results. This allows the network to learn features at different scales within the same layer.
It extensively uses 1x1 convolutions for dimensionality reduction, which drastically reduces the number of parameters.
ResNet (2015): Bridging the Depth Gap
As networks get extremely deep, a ‘degradation’ problem emerges: the training error of a deeper network is higher than that of its shallower counterpart. ResNet (Residual Network), proposed by Kaiming He et al. at Microsoft Research Asia, elegantly solved this problem.
Core Idea: The Shortcut/Skip Connection
It allows information to ‘skip’ one or more layers. The network no longer needs to learn an identity mapping from scratch; it only needs to learn the ‘residual’ between the input and the output. \[ \large{H(x) = F(x) + x} \]
ResNet’s innovation made it possible to train ultra-deep networks of hundreds or even thousands of layers.
Chapter Summary
Why do we need neural networks? The economic world is full of non-linearity, and traditional linear models have their limits. Neural networks are powerful tools for capturing complex patterns.
What is the basic principle? Inspired by biological neurons, they process information through weighted sums and non-linear activations. Activation functions (especially ReLU) are key to introducing non-linearity.
How do we go from simple to complex? A single neuron (Perceptron) has limited power. By stacking them into layers (MLP), we can solve complex non-linear problems (like XOR).
How do networks ‘learn’? By defining a loss function, and then using gradient descent and the backpropagation algorithm to systematically adjust network parameters to minimize prediction error.
Are there more specialized networks? Yes. For example, CNNs, which use convolution and pooling, are especially good at handling data with spatial/temporal structure (like time series or images).
Conclusion: A New Paradigm for Economic Modeling
Capturing Non-linearity: The core strength of neural networks is their powerful ability to fit non-linear relationships, helping us understand complex economic phenomena that linear models cannot explain.
Data-Driven: They are highly data-driven models capable of automatically learning features and patterns from large-scale datasets.
A Powerful Toolkit: From simple MLPs to complex CNNs, we have a rich set of tools suitable for various data types and tasks.
Future Outlook and Caveats
Explainability (XAI): Neural networks are often called ‘black box’ models because their decision-making processes are not transparent. This is a major hurdle for their application in high-stakes areas like policy advice and credit scoring, and it is a hot research topic.
Causal Inference: Neural networks excel at prediction (finding correlations) but cannot be directly used for causal inference. Combining neural networks with causal inference frameworks (like Diff-in-Diff or Instrumental Variables) is a frontier research area.
More Models: We only introduced feedforward networks today. For time series data, Recurrent Neural Networks (RNNs) and their variants (like LSTM, GRU) are a more natural choice.