Zhejiang Wanli University
Very Active Research Area: 🚀 Deep learning is at the forefront of artificial intelligence research, with new discoveries and advancements happening constantly.
Subfield of Machine Learning: It’s a specialized branch of machine learning, focusing on algorithms inspired by the structure and function of the human brain.
Neural Networks are Key: The core building block of deep learning is the neural network, a complex structure of interconnected nodes (called “neurons”) that can learn intricate patterns from data.
Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. - Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. - Tom M. Mitchell, 1997
Note
Machine learning focuses on the development of algorithms that can learn from and make predictions on data. These algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions.
Statistical learning is a framework for understanding data based on statistics. It involves building a statistical model for inference or prediction.
The goal is often to estimate a function \(f\) such that \(Y = f(X) + \epsilon\), where \(X\) represents the input variables, \(Y\) represents the output variable, and \(\epsilon\) is the random error term.
Emphasis: Statistical learning emphasizes models and their interpretability and precision, offering tools to assess the model (such as confidence intervals).
Note
Statistical learning focuses on drawing inferences from data using statistical models. Unlike machine learning, which often prioritizes predictive accuracy, statistical learning also focuses on the interpretability of models and understanding the underlying data generation process.
A simple neural network.
Note
This image visually represents a simple neural network. The circles represent neurons, and the lines connecting them represent the connections between neurons. The input layer receives data, the hidden layers process it, and the output layer produces the result.
Let’s clarify these related terms, which are often used interchangeably, but have important distinctions:
Data Mining: 🔍 The process of discovering patterns, anomalies, and insights from large datasets. It’s like “unearthing” valuable information hidden within a mountain of data.
Machine Learning: 🤖 Algorithms that can learn from data and improve their performance without being explicitly programmed. Focuses on making accurate predictions or decisions.
Statistical Learning: 📊 A subfield of statistics focused on statistical models for understanding data. Bridges statistics and machine learning, often emphasizing interpretability.
```mermaid
graph LR
  A[Data Mining] --> C(Common Ground)
  B[Machine Learning] --> C
  D[Statistical Learning] --> C
  C --> E[Insights & Predictions]
```
Note
This diagram illustrates the overlapping nature of these fields. They all share the common goal of extracting insights and making predictions from data, but they approach this goal with different tools and emphases. Data mining is often applied to very large datasets, machine learning focuses on prediction, and statistical learning emphasizes inference and model understanding.
What fueled this resurgence? Three key factors:
Brain Inspiration: 🧠 Neural networks are inspired by the human brain, mimicking how neurons connect and process information.
Interconnected Nodes: Composed of interconnected nodes (“neurons”), organized in layers.
Learning Complex Relationships: Can learn intricate, non-linear relationships between inputs (data) and outputs (predictions).
Foundation of Deep Learning: Form the foundation for most deep learning models.
Input: It starts with an input vector \(\mathbf{X} = (X_1, X_2, \dots, X_p)\), representing your data (e.g., pixel values of an image).
Goal: Build a non-linear function \(f(\mathbf{X})\) to predict a response \(Y\) (e.g., the category of an image).
Input Layer: Receives the input features \(X_1, \dots, X_p\).
Hidden Layer: Computes activations \(A_k = h_k(\mathbf{X})\). These are non-linear transformations of linear combinations of the inputs. Think of them as “hidden features.”
Output Layer: A linear model that uses the activations from the hidden layer, producing the final output \(f(\mathbf{X})\).
A single-layer neural network.
Note
This diagram shows a single-layer neural network. The input layer (left) receives the data. The hidden layer (middle) performs the non-linear transformation. The output layer (right) produces the prediction. The arrows represent the weights that connect the neurons.
The neural network model can be expressed mathematically as:
\[ f(\mathbf{X}) = \beta_0 + \sum_{k=1}^K \beta_k h_k(\mathbf{X}) \]
where:
\[ A_k = h_k(\mathbf{X}) = g\left(w_{k0} + \sum_{j=1}^p w_{kj}X_j\right) \]
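As a concrete illustration, here is a minimal NumPy sketch of this single-layer forward pass, assuming a ReLU activation for \(g\) (discussed below) and randomly chosen weights; all names and sizes here are illustrative, not taken from the slides:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def single_layer_forward(x, W, b, beta, beta0):
    """f(X) = beta0 + sum_k beta_k * g(w_k0 + sum_j w_kj X_j)."""
    A = relu(W @ x + b)          # hidden activations A_k, shape (K,)
    return beta0 + beta @ A      # scalar output f(X)

p, K = 4, 3                      # p input features, K hidden units
rng = np.random.default_rng(0)
W = rng.normal(size=(K, p))      # weights w_kj
b = rng.normal(size=K)           # bias terms w_k0
beta = rng.normal(size=K)        # output-layer weights beta_k
beta0 = 0.5
x = rng.normal(size=p)
print(single_layer_forward(x, W, b, beta, beta0))
```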
Why Non-Linearity? Without it, the network would collapse into a simple linear model. Activation functions allow learning complex, non-linear relationships.
Common Choices:
Sigmoid: \(g(z) = \frac{1}{1 + e^{-z}}\) (Squashes values between 0 and 1. A “smooth” on/off switch.)
ReLU (Rectified Linear Unit): \(g(z) = \max(0, z)\) (Computationally efficient. Zero for negative inputs, the input itself for positive inputs.)
Sigmoid (left) and ReLU (right) activation functions.
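A short NumPy sketch of the two activation functions, with a few sample values to show their behavior:

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into (0, 1): a smooth on/off switch."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Zero for negative inputs, the input itself for positive inputs."""
    return np.maximum(0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # approximately [0.119 0.5 0.881]
print(relu(z))      # [0. 0. 2.]
```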
Real-World Complexity: Relationships in the real world are rarely simple straight lines.
Interactions: Non-linearities enable the model to capture interactions between input variables. The effect of one variable might depend on another.
Consider: a network with two inputs \(X_1, X_2\), two hidden units, and the activation function \(g(z) = z^2\).
With appropriate weights, this can model the interaction \(X_1X_2\):
\[ f(\mathbf{X}) = X_1X_2 \]
Note
By choosing the weights \(w_{10} = 0, w_{11} = 1, w_{12} = 1, w_{20} = 0, w_{21} = 1, w_{22} = -1\) and activation function \(g(z)=z^2\), the activations can be formed as: \(A_1 = (X_1 + X_2)^2, A_2 = (X_1 - X_2)^2\). If we further set \(\beta_0 = 0, \beta_1=0.25, \beta_2 = -0.25\), the output becomes: \(f(\mathbf{X}) = 0.25(X_1 + X_2)^2 - 0.25(X_1 - X_2)^2 = X_1X_2\). This illustrates how non-linear activation functions, combined with appropriate weights, allow neural networks to capture interaction effects.
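The construction in the note can be checked numerically. A minimal sketch using exactly those weights and \(g(z) = z^2\):

```python
import numpy as np

def f(x1, x2):
    """Network from the note: g(z) = z**2, weights chosen to recover X1*X2."""
    A1 = (0 + 1 * x1 + 1 * x2) ** 2      # A1 = (X1 + X2)^2
    A2 = (0 + 1 * x1 - 1 * x2) ** 2      # A2 = (X1 - X2)^2
    return 0 + 0.25 * A1 - 0.25 * A2     # beta0 + beta1*A1 + beta2*A2

for x1, x2 in [(1.0, 2.0), (-3.0, 0.5), (2.0, 2.0)]:
    assert np.isclose(f(x1, x2), x1 * x2)
print("f(X) reproduces X1 * X2 exactly for these weights")
```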
Multiple Hidden Layers: Modern neural networks often have multiple hidden layers, stacked one after another (“deep” learning).
Hierarchical Representations: Each layer builds upon the previous, creating increasingly complex and abstract representations.
MNIST: A classic dataset: images of handwritten digits (0-9).
Image Format: Each image is 28×28 pixels with grayscale values (0-255), giving 784 (28 × 28) input features.
A multilayer neural network for MNIST digit classification.
Note
This diagram shows a multilayer neural network for classifying handwritten digits from the MNIST dataset. The input layer is large (784 pixels). Subsequent layers progressively reduce the number of units. The output layer has 10 units (one for each digit).
Goal: Classify each image into its correct digit category (0-9).
One-Hot Encoding: The output is a vector \(\mathbf{Y} = (Y_0, Y_1, \dots, Y_9)\), where \(Y_i = 1\) if the image represents digit \(i\) and 0 otherwise.
Dataset Size: 60,000 training images and 10,000 test images.
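A hedged sketch of loading and preparing MNIST with Keras (assumes TensorFlow/Keras is installed; the scaling and reshaping choices are illustrative):

```python
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape, x_test.shape)        # (60000, 28, 28) (10000, 28, 28)

# Flatten each 28x28 image into a 784-dimensional vector, scaled to [0, 1].
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# One-hot encode the labels: digit 3 -> (0,0,0,1,0,0,0,0,0,0).
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
```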
Examples of handwritten digits from the MNIST dataset.
Note
This figure shows examples of handwritten digits from the MNIST dataset. The goal is to train a neural network to correctly classify these images. Notice the variations in handwriting style, making this a non-trivial task.
Multiple Hidden Layers: Layers like \(L_1\) (e.g., 256 units), \(L_2\) (e.g., 128 units). Each layer learns increasingly complex features.
Multiple Outputs: Output layer can have multiple units (e.g., 10 for MNIST).
Multitask Learning: A single network can predict multiple responses simultaneously.
Loss Function: Choice depends on the task. For multi-class classification (like MNIST), use cross-entropy loss.
First Hidden Layer: Similar to the single-layer, but with a superscript (1) for the layer:
\[ A_k^{(1)} = h_k^{(1)}(\mathbf{X}) = g\left(w_{k0}^{(1)} + \sum_{j=1}^p w_{kj}^{(1)} X_j\right) \]
Second Hidden Layer: Takes activations from the first hidden layer as input:
\[ A_l^{(2)} = h_l^{(2)}(\mathbf{X}) = g\left(w_{l0}^{(2)} + \sum_{k=1}^{K_1} w_{lk}^{(2)} A_k^{(1)}\right) \]
Output Layer (multi-class classification): Use softmax for probabilities:
\[ f_m(\mathbf{X}) = \Pr(Y = m | \mathbf{X}) = \frac{e^{Z_m}}{\sum_{m'=0}^9 e^{Z_{m'}}} \]
where \(Z_m = \beta_{m0} + \sum_{l=1}^{K_2} \beta_{ml} A_l^{(2)}\).
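A small NumPy sketch of the softmax step, using illustrative scores \(Z_m\) for three classes:

```python
import numpy as np

def softmax(z):
    """Convert scores Z_m into probabilities that sum to 1."""
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

Z = np.array([2.0, 1.0, 0.1])       # example scores for three classes
p = softmax(Z)
print(p, p.sum())                   # approximately [0.659 0.242 0.099] 1.0
```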
Cross-Entropy Loss (multi-class classification):
\[ -\sum_{i=1}^n \sum_{m=0}^9 y_{im} \log(f_m(\mathbf{x}_i)) \]
Goal: Find weights (all \(w\) and \(\beta\) values) that minimize this loss.
Optimization: Use gradient descent (and variants).
Weights: refers to all trainable parameters, including the coefficients and bias terms.
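Putting these pieces together, here is a hedged Keras sketch of a network like the one described: two hidden layers of 256 and 128 units, a 10-unit softmax output, cross-entropy loss, and a gradient-descent-based optimizer. Layer sizes, the choice of ReLU, and the training settings are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(784,)),                 # 28*28 pixel inputs
    layers.Dense(256, activation="relu"),       # first hidden layer, K1 = 256
    layers.Dense(128, activation="relu"),       # second hidden layer, K2 = 128
    layers.Dense(10, activation="softmax"),     # one output unit per digit
])

model.compile(
    optimizer="sgd",                            # gradient descent variant
    loss="categorical_crossentropy",            # cross-entropy for one-hot labels
    metrics=["accuracy"],
)

# model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.1)
```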
| Method | Test Error |
|---|---|
| Neural Network + Ridge | 2.3% |
| Neural Network + Dropout | 1.8% |
| Multinomial Logistic Regression | 7.2% |
| Linear Discriminant Analysis | 12.7% |
Note
This table shows that neural networks significantly outperform traditional linear methods (like multinomial logistic regression and linear discriminant analysis) on the MNIST digit classification task. This highlights the power of deep learning for complex pattern recognition. Ridge and Dropout are regularization techniques.
Image Classification (and more): CNNs are designed for images (and data with spatial structure, like audio).
Human Vision Inspiration: Inspired by how humans process visual information, focusing on local patterns.
Key Idea: Learn local features (edges, corners) and combine them for global patterns (objects).
Convolutional Layers: apply learned filters to detect local features (edges, corners).
Pooling Layers: reduce the spatial dimensions of the feature maps.
Flatten Layer: converts the multi-dimensional feature maps into a vector.
Fully Connected Layers: Standard neural network layers for classification.
Convolution Filter: A small matrix (e.g., 3x3) of numbers (a kernel).
Sliding the Filter: “Slid” across the image, one pixel at a time (or larger stride).
Element-wise Multiplication and Summation: At each position, element-wise multiplication between filter and image, then sum.
Convolved Image: Produces a single value in the convolved image (feature map).
Different Filters, Different Features: Different filters detect different features.
Note
This figure illustrates the convolution operation. The filter (small square) is moved across the input image (large square). At each position, the element-wise product of the filter and the corresponding part of the image is computed, and the results are summed to produce a single value in the convolved image.
Input Image: \[ \text{Original Image} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \\ j & k & l \end{bmatrix} \]
Filter: \[ \text{Convolution Filter} = \begin{bmatrix} \alpha & \beta \\ \gamma & \delta \end{bmatrix} \]
Convolved Image: \[ \text{Convolved Image} = \begin{bmatrix} a\alpha + b\beta + d\gamma + e\delta & b\alpha + c\beta + e\gamma + f\delta \\ d\alpha + e\beta + g\gamma + h\delta & e\alpha + f\beta + h\gamma + i\delta \\ g\alpha + h\beta + j\gamma + k\delta & h\alpha + i\beta + k\gamma + l\delta \end{bmatrix} \]
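A minimal NumPy sketch of this sliding-window operation (no padding, stride 1), with a small numeric image standing in for the symbols \(a \dots l\) and \(\alpha \dots \delta\):

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image; at each position take the
    element-wise product with the patch underneath and sum."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(12, dtype=float).reshape(4, 3)   # stands in for a..l
kernel = np.array([[1.0, 2.0],                     # stands in for alpha..delta
                   [3.0, 4.0]])
print(convolve2d_valid(image, kernel))             # a 3x2 feature map
```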
Multiple Channels: Color images: 3 channels (Red, Green, Blue). Filters match input channels.
Multiple Filters: A layer uses many filters (e.g., 32, 64) for various features.
ReLU Activation: Usually, ReLU after convolution.
Detector Layer: Convolution + ReLU is a detector layer.
Purpose: Reduce spatial dimensions (width, height).
Max Pooling: Takes the maximum in a non-overlapping region (e.g., 2x2).
Benefits: fewer values for later layers to process, and some robustness to small shifts in the input.
Input:
\[ \begin{bmatrix} 1 & 2 & 5 & 3 \\ 3 & 0 & 1 & 2 \\ 2 & 1 & 3 & 4 \\ 1 & 1 & 2 & 0 \\ \end{bmatrix} \]
Output (2x2 max pooling):
\[ \begin{bmatrix} 3 & 5 \\ 2 & 4 \end{bmatrix} \]
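A short NumPy sketch reproducing this 2×2 max pooling example via a reshape trick:

```python
import numpy as np

x = np.array([[1, 2, 5, 3],
              [3, 0, 1, 2],
              [2, 1, 3, 4],
              [1, 1, 2, 0]])

# 2x2 non-overlapping max pooling: split into 2x2 blocks, take each block's max.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)        # [[3 5]
                     #  [2 4]]
```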
Purpose: Artificially increase training set size.
Method: Random transformations such as rotation, shifting, and zooming.
Benefits: The model sees more varied examples, generalizes better to unseen data, and overfits less.
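One way to do this in Keras is with `ImageDataGenerator`, which applies random transformations on the fly during training; the parameter values below are illustrative, and newer Keras versions also offer equivalent preprocessing layers:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotations, shifts, and zooms applied to each batch during training.
datagen = ImageDataGenerator(
    rotation_range=15,        # rotate up to +/- 15 degrees
    width_shift_range=0.1,    # shift horizontally up to 10% of the width
    height_shift_range=0.1,   # shift vertically up to 10% of the height
    zoom_range=0.1,           # zoom in or out up to 10%
)

# x_train must have shape (n, height, width, channels), e.g. (60000, 28, 28, 1).
# model.fit(datagen.flow(x_train, y_train, batch_size=128), epochs=10)
```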
Note
This figure shows examples of data augmentation. A single image of a digit is transformed in various ways (rotated, shifted, zoomed) to create multiple training examples. This helps the model generalize better to unseen data and reduces overfitting.
Leverage Pre-trained Models: Use models trained on massive datasets (e.g., ImageNet).
Example: ResNet50: Popular pre-trained CNN.
Weight Freezing: Use convolutional layers as feature extractors. Freeze weights, train only final layers. Effective for small datasets.
Note
This figure shows how a pretrained classifier (ResNet50) can be used. The convolutional layers (pretrained on ImageNet) are frozen, and only the final layers are trained on a new dataset. This leverages the knowledge learned from a large dataset, even if your own dataset is small.
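A hedged Keras sketch of this workflow: load ResNet50 pretrained on ImageNet, freeze its convolutional layers, and train only a small new head. The input size, pooling choice, and the 5-class output are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# ResNet50 pretrained on ImageNet, without its original classification layer.
base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                      # freeze the convolutional layers

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),        # collapse feature maps to a vector
    layers.Dense(5, activation="softmax"),  # new head for, say, 5 classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```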
| Image | True Label | Prediction 1 | Prob 1 | Prediction 2 | Prob 2 | Prediction 3 | Prob 3 |
|---|---|---|---|---|---|---|---|
| (Flamingo) | Flamingo | Flamingo | 0.83 | Spoonbill | 0.17 | White stork | 0.00 |
| (Cooper’s Hawk) | Cooper’s Hawk | Kite | 0.60 | Great grey owl | 0.09 | Nail | 0.12 |
| (Cooper’s Hawk) | Cooper’s Hawk | Fountain | 0.35 | nail | 0.12 | hook | 0.07 |
| (Lhasa Apso) | Lhasa Apso | Tibetan terrier | 0.56 | Lhasa | 0.32 | cocker spaniel | 0.03 |
| (Cat) | Cat | Old English sheepdog | 0.82 | Shih-Tzu | 0.04 | Persian cat | 0.04 |
| (Cape weaver) | Cape weaver | jacamar | 0.28 | macaw | 0.12 | robin | 0.12 |
Correct Predictions: In several cases (Flamingo, Lhasa Apso), the model correctly predicts the true label with high probability.
Reasonable Alternatives: Even when incorrect, the model often provides reasonable alternative predictions (e.g., Spoonbill and White stork for Flamingo).
Uncertainty: The probabilities show the model’s uncertainty. Lower probabilities indicate less confidence.
Limitation: Pretrained models are limited by the data they were trained on. They might misclassify objects not well-represented in the original training data.
Another Key Application: Deep learning is effective for document classification.
Goal: Predict document attributes:
Featurization: Converting text to numerical representation.
Example: Sentiment analysis of IMDb movie reviews.
Simplest Approach: The bag-of-words model is surprisingly effective.
How it Works: Build a vocabulary (dictionary) of words, then represent each document by a vector recording which words it contains (and possibly how often).
Ignores Word Order: “Bag” of words – order lost.
Sparse Representation: Vectors are mostly zeros.
Bag-of-n-grams: Consider sequences of n words (captures some context).
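A short scikit-learn sketch of the bag-of-words / bag-of-n-grams representation (the tiny document list is illustrative, and `get_feature_names_out` assumes a reasonably recent scikit-learn):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a great movie", "not a great movie", "a terrible movie"]

# Unigrams and bigrams: bag-of-words plus bag-of-2-grams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # mostly zeros: a sparse representation
```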
Sequential Data: RNNs are designed for sequences, such as documents (sequences of words) and time series.
Key Idea: Process input one element at a time, maintaining a “hidden state” (memory) of previous elements.
Unrolling the RNN: Visualizing it “unrolled” in time.
Note
This figure shows the architecture of a recurrent neural network (RNN). The input sequence (X) is processed one element at a time. The hidden state (A) is updated at each step, carrying information from previous steps. The output (O) can be produced at each step or only at the final step. Crucially, the same weights (W, U, B) are used at each step (weight sharing). This allows the RNN to handle sequences of varying lengths.
Hidden State Update:
\[ \mathbf{A}_{lk} = g\left(w_{k0} + \sum_{j=1}^p w_{kj} \mathbf{X}_{lj} + \sum_{s=1}^K u_{ks} \mathbf{A}_{l-1,s}\right) \]
Output:
\[ \mathbf{O}_l = \beta_0 + \sum_{k=1}^K \beta_k \mathbf{A}_{lk} \]
\(g(\cdot)\): activation function (e.g., ReLU).
\(\mathbf{W, U, B}\): shared weight matrices.
Hidden State Update: The new state \(\mathbf{A}_{lk}\) depends on the current input \(\mathbf{X}_{l}\) and the previous hidden state \(\mathbf{A}_{l-1}\).
Output: Output (\(\mathbf{O}_l\)) is a linear combination of the hidden state.
Weight Sharing: Same weight matrices (\(\mathbf{W, U, B}\)) at every time step.
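A minimal NumPy sketch of these update equations, processing a sequence one element at a time with the same weights at every step (ReLU for \(g\), random weights, all sizes illustrative):

```python
import numpy as np

def rnn_forward(X, W, U, w0, beta, beta0):
    """Process a sequence X (L x p) one step at a time.
    A_l = g(w0 + W @ X_l + U @ A_{l-1});  O_l = beta0 + beta @ A_l."""
    K = W.shape[0]
    A = np.zeros(K)                               # initial hidden state
    outputs = []
    for x_l in X:                                 # same W, U, w0, beta every step
        A = np.maximum(0, w0 + W @ x_l + U @ A)   # ReLU activation
        outputs.append(beta0 + beta @ A)
    return np.array(outputs)

rng = np.random.default_rng(1)
L, p, K = 5, 3, 4                                 # sequence length, inputs, hidden units
X = rng.normal(size=(L, p))
out = rnn_forward(X, rng.normal(size=(K, p)), rng.normal(size=(K, K)),
                  rng.normal(size=K), rng.normal(size=K), 0.0)
print(out)                                        # one output per time step
```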
Beyond Bag-of-Words: Use the sequence of words.
Word Embeddings: Represent each word as a dense vector (word embedding). Captures semantic relationships (e.g., “king,” “queen”). word2vec, GloVe.
Embedding Layer: Maps each word (one-hot encoded) to its embedding.
Processing the Sequence: RNN processes embeddings, updates state, produces output (e.g., sentiment).
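A hedged Keras sketch of this pipeline for sentiment classification: an embedding layer maps word indices to dense vectors, a simple RNN processes the sequence, and a sigmoid output gives the sentiment. The vocabulary size, embedding dimension, and unit counts are illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim = 10_000, 32                 # illustrative sizes

model = keras.Sequential([
    layers.Embedding(vocab_size, embed_dim),       # word index -> dense vector
    layers.SimpleRNN(32),                          # hidden state carried across the sequence
    layers.Dense(1, activation="sigmoid"),         # e.g. positive/negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```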
Note
Time Series Prediction: RNNs are a natural fit for predicting future values of a time series.
Example: Stock market trading volume.
Input: Past values (volume, return, volatility).
Output: Predicted volume for next day.
Autocorrelation: Values correlated with past (today’s volume similar to yesterday’s). RNNs capture this.
Note
This figure shows an example of time series data: daily trading statistics from the New York Stock Exchange (NYSE). The data includes log trading volume (top), the Dow Jones Industrial Average (DJIA) return (middle), and log volatility (bottom). Notice the patterns and fluctuations over time. Predicting future values of such series is a common task in finance.
Measures Correlation with Past: The ACF measures correlation with lagged values (past).
Interpreting the ACF: Shows how strongly related values are at different lags. High ACF at lag 1: today’s value correlated with yesterday’s.
Input Sequence: \(L\) past observations (e.g., \(L=5\) days): \[ \mathbf{X}_1 = \begin{pmatrix} v_{t-L} \\ r_{t-L} \\ z_{t-L} \end{pmatrix}, \mathbf{X}_2 = \begin{pmatrix} v_{t-L+1} \\ r_{t-L+1} \\ z_{t-L+1} \end{pmatrix}, \dots, \mathbf{X}_L = \begin{pmatrix} v_{t-1} \\ r_{t-1} \\ z_{t-1} \end{pmatrix} \]
\(v_t\): log trading volume, \(r_t\): DJIA return, \(z_t\): log volatility on day \(t\).
Output: \(Y = v_t\) (trading volume on day \(t\), to predict).
Creating Training Examples: Many \((\mathbf{X}, Y)\) pairs from historical data. Sequence of past, corresponding future value.
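A minimal NumPy sketch of turning a historical series into \((\mathbf{X}, Y)\) pairs: each input is the previous \(L\) days of (volume, return, volatility), and the target is today's volume. The random placeholder series stands in for the real NYSE data:

```python
import numpy as np

def make_sequences(series, L=5):
    """series: (T x 3) array of daily (volume, return, volatility).
    Returns X of shape (n, L, 3) and Y of shape (n,)."""
    X, Y = [], []
    for t in range(L, len(series)):
        X.append(series[t - L:t])        # the L previous days
        Y.append(series[t, 0])           # column 0 holds trading volume
    return np.array(X), np.array(Y)

rng = np.random.default_rng(2)
series = rng.normal(size=(1000, 3))      # placeholder for the real NYSE data
X, Y = make_sequences(series, L=5)
print(X.shape, Y.shape)                  # (995, 5, 3) (995,)
```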
Performance: RNN achieves \(R^2\) of 0.42 on test data. Explains 42% of variance in volume.
Comparison to Baseline: Outperforms a baseline using yesterday’s volume.
Traditional Time Series Model: Autoregression (AR) is classic.
Linear Prediction: Predicts based on linear combination of past.
AR(L) Model: Uses \(L\) previous values:
\[ \hat{v}_t = \beta_0 + \beta_1 v_{t-1} + \beta_2 v_{t-2} + \dots + \beta_L v_{t-L} \]
Including Other Variables: Can include lagged values of other variables (return, volatility).
RNN as a Non-Linear Extension: An RNN can be viewed as a non-linear generalization of the AR model; the AR model captures only linear relationships, while the RNN can capture complex, non-linear ones.
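A minimal sketch of fitting an AR(L) model by ordinary least squares with NumPy; the toy autocorrelated series and the choice L = 5 are illustrative:

```python
import numpy as np

def fit_ar(v, L=5):
    """OLS fit of v_t = beta0 + beta1*v_{t-1} + ... + betaL*v_{t-L}."""
    rows = [v[t - L:t][::-1] for t in range(L, len(v))]   # lagged predictors
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    y = v[L:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta                                           # (beta0, beta1, ..., betaL)

rng = np.random.default_rng(3)
v = rng.normal(size=200).cumsum()        # a toy autocorrelated series
print(fit_ar(v, L=5))
```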
Addressing Vanishing Gradients: LSTMs address “vanishing gradient” problem (in standard RNNs with long sequences).
Two Hidden States: An LSTM carries both a short-term hidden state and a longer-term cell state, which helps it retain information over many time steps.
Improved Performance: LSTMs often better, especially for long sequences.
Complex Optimization: Fitting (finding optimal weights) is non-convex.
Local Minima: Loss function has many local minima.
Key Techniques:
Note
Gradient descent is like rolling a ball down a hill. The ball will eventually settle at the lowest point (the minimum). The learning rate controls how big of a step the ball takes at each iteration. A large learning rate might cause the ball to overshoot, while a small learning rate might make the descent slow.
Iterative Algorithm: Gradient descent is iterative.
Goal: Find parameters (\(\theta\)) that minimize loss \(R(\theta)\).
Steps:
Initialize \(\theta\) (often randomly).
Repeatedly update \(\theta\) opposite the gradient:
\[ \theta^{(m+1)} \leftarrow \theta^{(m)} - \rho \nabla R(\theta^{(m)}) \]
\(\rho\): learning rate (step size). Smaller: smaller steps (slower). Larger: larger steps (might overshoot).
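A minimal NumPy sketch of this update rule, minimizing a simple quadratic loss whose gradient is known in closed form (the target point and learning rate are illustrative):

```python
import numpy as np

def gradient_descent(grad, theta0, rho=0.1, n_steps=100):
    """Repeated update: theta^(m+1) <- theta^(m) - rho * grad(theta^(m))."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - rho * grad(theta)
    return theta

# Minimize R(theta) = (theta1 - 3)^2 + (theta2 + 1)^2; its gradient is 2*(theta - [3, -1]).
grad = lambda theta: 2 * (theta - np.array([3.0, -1.0]))
print(gradient_descent(grad, theta0=[0.0, 0.0]))   # converges toward [3, -1]
```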
Efficient Gradient Calculation: Backpropagation is efficient.
Chain Rule: Uses chain rule to “propagate” error backward.
Essential for Training: Without backpropagation, computing the gradients needed to train large networks would be far too slow to be practical.
Purpose: Prevent overfitting (model learns training data too well, performs poorly on unseen data).
Methods: weight penalties (e.g., ridge), dropout, and early stopping.
Minibatches: Use small batches of data (not entire dataset) for gradient.
Faster Updates: Much faster weight updates.
Randomness: Randomness helps escape local minima.
Standard Approach: SGD (and variants) is standard.
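A minimal NumPy sketch of minibatch SGD on a linear regression loss: each update uses a small random batch rather than the full dataset. The data, batch size, and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

theta, rho, batch_size = np.zeros(3), 0.05, 32
for epoch in range(20):
    idx = rng.permutation(len(X))                    # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]            # one minibatch
        resid = X[b] @ theta - y[b]
        grad = 2 * X[b].T @ resid / len(b)           # gradient of the squared error
        theta -= rho * grad                          # noisy but cheap update
print(theta)                                         # close to [1, -2, 0.5]
```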
Note
It’s essential to monitor both training and validation error during training. The training error typically decreases as the model learns. The validation error initially decreases, but if it starts to increase, it indicates overfitting. Early stopping is a form of regularization that stops the training process before overfitting becomes significant.
Note
Dropout randomly removes units (neurons) during training. This forces the network to learn more robust features and prevents it from relying too heavily on any single neuron. It’s like training multiple different networks and averaging their predictions.
Architecture and Hyperparameters: Choosing architecture (layers, units, activations) and hyperparameters (learning rate, batch size, regularization) is crucial.
Key Considerations:
Trial and Error: Finding good settings usually involves trial and error, which is time-consuming. More systematic options are grid search and random search; Bayesian optimization is a more advanced alternative.
Interpolation: Model perfectly fits training data (zero training error).
Double Descent: Test error decreases again after increasing (as complexity increases beyond interpolation).
Not a Contradiction: Does not contradict bias-variance tradeoff.
Large Datasets: Deep learning shines with lots of data. Small datasets: simpler models might be better (interpretable).
Complex Relationships: Highly non-linear, complex: deep learning effective.
Difficult Feature Engineering: Deep learning automates feature learning.
Interpretability is Less Important: Deep learning: “black boxes.” Interpretability crucial: simpler models.
Computational Resources: Training: computationally expensive (GPUs).
Ethical Implications: Ethics of deep learning (facial recognition, loans, hiring)? Fairness, avoid bias?
Interpretability: How to make models more interpretable? Why is it important?
Limitations: Limitations of deep learning? When are other methods (decision trees, SVMs) better?
Future of Deep Learning: Future evolution? New architectures, applications?
Model Complexity: How do we decide on model complexity? What happens if the model is too simple or too complex?
Data Quality: What actions can we take to ensure data quality and remove bias from datasets?
邱飞(peter) 💌 [email protected]