Zhejiang Wanli University
Very Active Research Area: 🚀 Deep learning is at the forefront of artificial intelligence research, with new discoveries and advancements happening constantly.
Subfield of Machine Learning: It’s a specialized branch of machine learning, focusing on algorithms inspired by the structure and function of the human brain.
Neural Networks are Key: The core building block of deep learning is the neural network, a complex structure of interconnected nodes (called “neurons”) that can learn intricate patterns from data.
Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. - Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E. - Tom M. Mitchell, 1997
Note
Machine learning focuses on the development of algorithms that can learn from and make predictions on data. These algorithms operate by building a model from example inputs in order to make data-driven predictions or decisions, rather than following strictly static program instructions.
Statistical learning is a framework for understanding data based on statistics. It involves building a statistical model for inference or prediction.
The goal is often to estimate a function \(f\) such that \(Y = f(X) + \epsilon\), where \(X\) represents the input variables, \(Y\) represents the output variable, and \(\epsilon\) is the random error term.
Emphasis: Statistical learning emphasizes models and their interpretability and precision, offering tools to assess the model (such as confidence intervals).
Note
Statistical learning focuses on drawing inferences from data using statistical models. Unlike machine learning, which often prioritizes predictive accuracy, statistical learning also focuses on the interpretability of models and understanding the underlying data generation process.
A simple neural network.
Note
This image visually represents a simple neural network. The circles represent neurons, and the lines connecting them represent the connections between neurons. The input layer receives data, the hidden layers process it, and the output layer produces the result.
Let’s clarify these related terms, which are often used interchangeably, but have important distinctions:
Data Mining: 🔍 The process of discovering patterns, anomalies, and insights from large datasets. It’s like “unearthing” valuable information hidden within a mountain of data.
Machine Learning: 🤖 Algorithms that can learn from data and improve their performance without being explicitly programmed. Focuses on making accurate predictions or decisions.
Statistical Learning: 📊 A subfield of statistics focused on statistical models for understanding data. Bridges statistics and machine learning, often emphasizing interpretability.
```mermaid
graph LR
  A[Data Mining] --> C(Common Ground)
  B[Machine Learning] --> C
  D[Statistical Learning] --> C
  C --> E[Insights & Predictions]
```
Note
This diagram illustrates the overlapping nature of these fields. They all share the common goal of extracting insights and making predictions from data, but they approach this goal with different tools and emphases. Data mining is often applied to very large datasets, machine learning focuses on prediction, and statistical learning emphasizes inference and model understanding.
What fueled this resurgence? Three key factors:
Brain Inspiration: 🧠 Neural networks are inspired by the human brain, mimicking how neurons connect and process information.
Interconnected Nodes: Composed of interconnected nodes (“neurons”), organized in layers.
Learning Complex Relationships: Can learn intricate, non-linear relationships between inputs (data) and outputs (predictions).
Foundation of Deep Learning: Form the foundation for most deep learning models.
Input: It starts with an input vector \(\mathbf{X} = (X_1, X_2, \dots, X_p)\), representing your data (e.g., pixel values of an image).
Goal: Build a non-linear function \(f(\mathbf{X})\) to predict a response \(Y\) (e.g., the category of an image).
Input Layer: Receives the input features \(X_1, \dots, X_p\).
Hidden Layer: Computes activations \(A_k = h_k(\mathbf{X})\). These are non-linear transformations of linear combinations of the inputs. Think of them as “hidden features.”
Output Layer: A linear model that uses the activations from the hidden layer, producing the final output \(f(\mathbf{X})\).
A single-layer neural network.
Note
This diagram shows a single-layer neural network. The input layer (left) receives the data. The hidden layer (middle) performs the non-linear transformation. The output layer (right) produces the prediction. The arrows represent the weights that connect the neurons.
The neural network model can be expressed mathematically as:
\[ f(\mathbf{X}) = \beta_0 + \sum_{k=1}^K \beta_k h_k(\mathbf{X}) \]
where:
\[ A_k = h_k(\mathbf{X}) = g\left(w_{k0} + \sum_{j=1}^p w_{kj}X_j\right) \]
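As a concrete illustration, here is a minimal NumPy sketch of this single-layer forward pass, assuming a ReLU activation for \(g\) (discussed below) and randomly chosen weights; all names and sizes here are illustrative, not taken from the slides:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def single_layer_forward(x, W, b, beta, beta0):
    """f(X) = beta0 + sum_k beta_k * g(w_k0 + sum_j w_kj X_j)."""
    A = relu(W @ x + b)          # hidden activations A_k, shape (K,)
    return beta0 + beta @ A      # scalar output f(X)

p, K = 4, 3                      # p input features, K hidden units
rng = np.random.default_rng(0)
W = rng.normal(size=(K, p))      # weights w_kj
b = rng.normal(size=K)           # bias terms w_k0
beta = rng.normal(size=K)        # output-layer weights beta_k
beta0 = 0.5
x = rng.normal(size=p)
print(single_layer_forward(x, W, b, beta, beta0))
```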
Why Non-Linearity? Without it, the network would collapse into a simple linear model. Activation functions allow learning complex, non-linear relationships.
Common Choices:
Sigmoid: \(g(z) = \frac{1}{1 + e^{-z}}\) (Squashes values between 0 and 1. A “smooth” on/off switch.)
ReLU (Rectified Linear Unit): \(g(z) = \max(0, z)\) (Computationally efficient. Zero for negative inputs, the input itself for positive inputs.)
Sigmoid (left) and ReLU (right) activation functions.
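A short NumPy sketch of the two activation functions, with a few sample values to show their behavior:

```python
import numpy as np

def sigmoid(z):
    """Squashes any real value into (0, 1): a smooth on/off switch."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Zero for negative inputs, the input itself for positive inputs."""
    return np.maximum(0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))   # approximately [0.119 0.5 0.881]
print(relu(z))      # [0. 0. 2.]
```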
Real-World Complexity: Relationships in the real world are rarely simple straight lines.
Interactions: Non-linearities enable the model to capture interactions between input variables. The effect of one variable might depend on another.
Consider: a network with two inputs \(X_1, X_2\), two hidden units, and the activation function \(g(z) = z^2\).
With appropriate weights, this can model the interaction \(X_1X_2\):
\[ f(\mathbf{X}) = X_1X_2 \]
Note
By choosing the weights \(w_{10} = 0, w_{11} = 1, w_{12} = 1, w_{20} = 0, w_{21} = 1, w_{22} = -1\) and activation function \(g(z)=z^2\), the activations can be formed as: \(A_1 = (X_1 + X_2)^2, A_2 = (X_1 - X_2)^2\). If we further set \(\beta_0 = 0, \beta_1=0.25, \beta_2 = -0.25\), the output becomes: \(f(\mathbf{X}) = 0.25(X_1 + X_2)^2 - 0.25(X_1 - X_2)^2 = X_1X_2\). This illustrates how non-linear activation functions, combined with appropriate weights, allow neural networks to capture interaction effects.
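The construction in the note can be checked numerically. A minimal sketch using exactly those weights and \(g(z) = z^2\):

```python
import numpy as np

def f(x1, x2):
    """Network from the note: g(z) = z**2, weights chosen to recover X1*X2."""
    A1 = (0 + 1 * x1 + 1 * x2) ** 2      # A1 = (X1 + X2)^2
    A2 = (0 + 1 * x1 - 1 * x2) ** 2      # A2 = (X1 - X2)^2
    return 0 + 0.25 * A1 - 0.25 * A2     # beta0 + beta1*A1 + beta2*A2

for x1, x2 in [(1.0, 2.0), (-3.0, 0.5), (2.0, 2.0)]:
    assert np.isclose(f(x1, x2), x1 * x2)
print("f(X) reproduces X1 * X2 exactly for these weights")
```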
Multiple Hidden Layers: Modern neural networks often have multiple hidden layers, stacked one after another (“deep” learning).
Hierarchical Representations: Each layer builds upon the previous, creating increasingly complex and abstract representations.
MNIST: A classic dataset: images of handwritten digits (0-9).
Image Format: Each image is 28×28 pixels with grayscale values (0-255), giving 784 (28 × 28) input features.
A multilayer neural network for MNIST digit classification.
Note
This diagram shows a multilayer neural network for classifying handwritten digits from the MNIST dataset. The input layer is large (784 pixels). Subsequent layers progressively reduce the number of units. The output layer has 10 units (one for each digit).
Goal: Classify each image into its correct digit category (0-9).
One-Hot Encoding: The output is a vector \(\mathbf{Y} = (Y_0, Y_1, \dots, Y_9)\), where \(Y_i = 1\) if the image represents digit \(i\) and 0 otherwise.
Dataset Size: 60,000 training images and 10,000 test images.
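A hedged sketch of loading and preparing MNIST with Keras (assumes TensorFlow/Keras is installed; the scaling and reshaping choices are illustrative):

```python
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape, x_test.shape)        # (60000, 28, 28) (10000, 28, 28)

# Flatten each 28x28 image into a 784-dimensional vector, scaled to [0, 1].
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

# One-hot encode the labels: digit 3 -> (0,0,0,1,0,0,0,0,0,0).
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
```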
Examples of handwritten digits from the MNIST dataset.
Note
This figure shows examples of handwritten digits from the MNIST dataset. The goal is to train a neural network to correctly classify these images. Notice the variations in handwriting style, making this a non-trivial task.
Multiple Hidden Layers: Layers like \(L_1\) (e.g., 256 units), \(L_2\) (e.g., 128 units). Each layer learns increasingly complex features.
Multiple Outputs: Output layer can have multiple units (e.g., 10 for MNIST).
Multitask Learning: A single network can predict multiple responses simultaneously.
Loss Function: Choice depends on the task. For multi-class classification (like MNIST), use cross-entropy loss.
First Hidden Layer: Similar to the single-layer, but with a superscript (1) for the layer:
\[ A_k^{(1)} = h_k^{(1)}(\mathbf{X}) = g\left(w_{k0}^{(1)} + \sum_{j=1}^p w_{kj}^{(1)} X_j\right) \]
Second Hidden Layer: Takes activations from the first hidden layer as input:
\[ A_l^{(2)} = h_l^{(2)}(\mathbf{X}) = g\left(w_{l0}^{(2)} + \sum_{k=1}^{K_1} w_{lk}^{(2)} A_k^{(1)}\right) \]
Output Layer (multi-class classification): Use softmax for probabilities:
\[ f_m(\mathbf{X}) = \Pr(Y = m | \mathbf{X}) = \frac{e^{Z_m}}{\sum_{m'=0}^9 e^{Z_{m'}}} \]
where \(Z_m = \beta_{m0} + \sum_{l=1}^{K_2} \beta_{ml} A_l^{(2)}\).
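A small NumPy sketch of the softmax step, using illustrative scores \(Z_m\) for three classes:

```python
import numpy as np

def softmax(z):
    """Convert scores Z_m into probabilities that sum to 1."""
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

Z = np.array([2.0, 1.0, 0.1])       # example scores for three classes
p = softmax(Z)
print(p, p.sum())                   # approximately [0.659 0.242 0.099] 1.0
```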
Cross-Entropy Loss (multi-class classification):
\[ -\sum_{i=1}^n \sum_{m=0}^9 y_{im} \log(f_m(\mathbf{x}_i)) \]
Goal: Find weights (all \(w\) and \(\beta\) values) that minimize this loss.
Optimization: Use gradient descent (and variants).
Weights: refers to all trainable parameters, including the coefficients and bias terms.
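Putting these pieces together, here is a hedged Keras sketch of a network like the one described: two hidden layers of 256 and 128 units, a 10-unit softmax output, cross-entropy loss, and a gradient-descent-based optimizer. Layer sizes, the choice of ReLU, and the training settings are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(784,)),                 # 28*28 pixel inputs
    layers.Dense(256, activation="relu"),       # first hidden layer, K1 = 256
    layers.Dense(128, activation="relu"),       # second hidden layer, K2 = 128
    layers.Dense(10, activation="softmax"),     # one output unit per digit
])

model.compile(
    optimizer="sgd",                            # gradient descent variant
    loss="categorical_crossentropy",            # cross-entropy for one-hot labels
    metrics=["accuracy"],
)

# model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.1)
```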
| Method | Test Error |
|---|---|
| Neural Network + Ridge | 2.3% |
| Neural Network + Dropout | 1.8% |
| Multinomial Logistic Regression | 7.2% |
| Linear Discriminant Analysis | 12.7% |
Note
This table shows that neural networks significantly outperform traditional linear methods (like multinomial logistic regression and linear discriminant analysis) on the MNIST digit classification task. This highlights the power of deep learning for complex pattern recognition. Ridge and Dropout are regularization techniques.
Image Classification (and more): CNNs are designed for images (and data with spatial structure, like audio).
Human Vision Inspiration: Inspired by how humans process visual information, focusing on local patterns.
Key Idea: Learn local features (edges, corners) and combine them for global patterns (objects).
Convolutional Layers: apply learned filters to detect local features (edges, corners).
Pooling Layers: reduce the spatial dimensions of the feature maps.
Flatten Layer: converts the multi-dimensional feature maps into a vector.
Fully Connected Layers: Standard neural network layers for classification.
Convolution Filter: A small matrix (e.g., 3x3) of numbers (a kernel).
Sliding the Filter: “Slid” across the image, one pixel at a time (or larger stride).
Element-wise Multiplication and Summation: At each position, element-wise multiplication between filter and image, then sum.
Convolved Image: Produces a single value in the convolved image (feature map).
Different Filters, Different Features: Different filters detect different features.
Note
This figure illustrates the convolution operation. The filter (small square) is moved across the input image (large square). At each position, the element-wise product of the filter and the corresponding part of the image is computed, and the results are summed to produce a single value in the convolved image.
Input Image: \[ \text{Original Image} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \\ j & k & l \end{bmatrix} \]
Filter: \[ \text{Convolution Filter} = \begin{bmatrix} \alpha & \beta \\ \gamma & \delta \end{bmatrix} \]
Convolved Image: \[ \text{Convolved Image} = \begin{bmatrix} a\alpha + b\beta + d\gamma + e\delta & b\alpha + c\beta + e\gamma + f\delta \\ d\alpha + e\beta + g\gamma + h\delta & e\alpha + f\beta + h\gamma + i\delta \\ g\alpha + h\beta + j\gamma + k\delta & h\alpha + i\beta + k\gamma + l\delta \end{bmatrix} \]
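A minimal NumPy sketch of this sliding-window operation (no padding, stride 1), with a small numeric image standing in for the symbols \(a \dots l\) and \(\alpha \dots \delta\):

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide the kernel over the image; at each position take the
    element-wise product with the patch underneath and sum."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

image = np.arange(12, dtype=float).reshape(4, 3)   # stands in for a..l
kernel = np.array([[1.0, 2.0],                     # stands in for alpha..delta
                   [3.0, 4.0]])
print(convolve2d_valid(image, kernel))             # a 3x2 feature map
```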
Multiple Channels: Color images: 3 channels (Red, Green, Blue). Filters match input channels.
Multiple Filters: A layer uses many filters (e.g., 32, 64) for various features.
ReLU Activation: Usually, ReLU after convolution.
Detector Layer: Convolution + ReLU is a detector layer.
Purpose: Reduce spatial dimensions (width, height).
Max Pooling: Takes the maximum in a non-overlapping region (e.g., 2x2).
Benefits: fewer values for later layers to process, and some robustness to small shifts in the input.
Input:
\[ \begin{bmatrix} 1 & 2 & 5 & 3 \\ 3 & 0 & 1 & 2 \\ 2 & 1 & 3 & 4 \\ 1 & 1 & 2 & 0 \\ \end{bmatrix} \]
Output (2x2 max pooling):
\[ \begin{bmatrix} 3 & 5 \\ 2 & 4 \end{bmatrix} \]
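A short NumPy sketch reproducing this 2×2 max pooling example via a reshape trick:

```python
import numpy as np

x = np.array([[1, 2, 5, 3],
              [3, 0, 1, 2],
              [2, 1, 3, 4],
              [1, 1, 2, 0]])

# 2x2 non-overlapping max pooling: split into 2x2 blocks, take each block's max.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)        # [[3 5]
                     #  [2 4]]
```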
Purpose: Artificially increase training set size.
Method: Random transformations such as rotation, shifting, and zooming.
Benefits: The model sees more varied examples, generalizes better to unseen data, and overfits less.
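One way to do this in Keras is with `ImageDataGenerator`, which applies random transformations on the fly during training; the parameter values below are illustrative, and newer Keras versions also offer equivalent preprocessing layers:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random rotations, shifts, and zooms applied to each batch during training.
datagen = ImageDataGenerator(
    rotation_range=15,        # rotate up to +/- 15 degrees
    width_shift_range=0.1,    # shift horizontally up to 10% of the width
    height_shift_range=0.1,   # shift vertically up to 10% of the height
    zoom_range=0.1,           # zoom in or out up to 10%
)

# x_train must have shape (n, height, width, channels), e.g. (60000, 28, 28, 1).
# model.fit(datagen.flow(x_train, y_train, batch_size=128), epochs=10)
```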
Note
This figure shows examples of data augmentation. A single image of a digit is transformed in various ways (rotated, shifted, zoomed) to create multiple training examples. This helps the model generalize better to unseen data and reduces overfitting.
Leverage Pre-trained Models: Use models trained on massive datasets (e.g., ImageNet).
Example: ResNet50: Popular pre-trained CNN.
Weight Freezing: Use convolutional layers as feature extractors. Freeze weights, train only final layers. Effective for small datasets.
Note
This figure shows how a pretrained classifier (ResNet50) can be used. The convolutional layers (pretrained on ImageNet) are frozen, and only the final layers are trained on a new dataset. This leverages the knowledge learned from a large dataset, even if your own dataset is small.
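A hedged Keras sketch of this workflow: load ResNet50 pretrained on ImageNet, freeze its convolutional layers, and train only a small new head. The input size, pooling choice, and the 5-class output are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# ResNet50 pretrained on ImageNet, without its original classification layer.
base = keras.applications.ResNet50(weights="imagenet", include_top=False,
                                   input_shape=(224, 224, 3))
base.trainable = False                      # freeze the convolutional layers

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),        # collapse feature maps to a vector
    layers.Dense(5, activation="softmax"),  # new head for, say, 5 classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```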
| Image | True Label | Prediction 1 | Prob 1 | Prediction 2 | Prob 2 | Prediction 3 | Prob 3 |
|---|---|---|---|---|---|---|---|
| (Flamingo) | Flamingo | Flamingo | 0.83 | Spoonbill | 0.17 | White stork | 0.00 |
| (Cooper’s Hawk) | Cooper’s Hawk | Kite | 0.60 | Great grey owl | 0.09 | Nail | 0.12 |
| (Cooper’s Hawk) | Cooper’s Hawk | Fountain | 0.35 | nail | 0.12 | hook | 0.07 |
| (Lhasa Apso) | Lhasa Apso | Tibetan terrier | 0.56 | Lhasa | 0.32 | cocker spaniel | 0.03 |
| (Cat) | Cat | Old English sheepdog | 0.82 | Shih-Tzu | 0.04 | Persian cat | 0.04 |
| (Cape weaver) | Cape weaver | jacamar | 0.28 | macaw | 0.12 | robin | 0.12 |
Correct Predictions: In several cases (Flamingo, Lhasa Apso), the model correctly predicts the true label with high probability.
Reasonable Alternatives: Even when incorrect, the model often provides reasonable alternative predictions (e.g., Spoonbill and White stork for Flamingo).
Uncertainty: The probabilities show the model’s uncertainty. Lower probabilities indicate less confidence.
Limitation: Pretrained models are limited by the data they were trained on. They might misclassify objects not well-represented in the original training data.
Another Key Application: Deep learning is effective for document classification.
Goal: Predict document attributes:
Featurization: Converting text to numerical representation.
Example: Sentiment analysis of IMDb movie reviews.
Simplest Approach: The bag-of-words model is surprisingly effective.
How it Works: Build a vocabulary (dictionary) of words, then represent each document by a vector recording which words it contains (and possibly how often).
Ignores Word Order: “Bag” of words – order lost.
Sparse Representation: Vectors are mostly zeros.
Bag-of-n-grams: Consider sequences of n words (captures some context).
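A short scikit-learn sketch of the bag-of-words / bag-of-n-grams representation (the tiny document list is illustrative, and `get_feature_names_out` assumes a reasonably recent scikit-learn):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["a great movie", "not a great movie", "a terrible movie"]

# Unigrams and bigrams: bag-of-words plus bag-of-2-grams.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)          # sparse document-term matrix

print(vectorizer.get_feature_names_out())   # the learned vocabulary
print(X.toarray())                          # mostly zeros: a sparse representation
```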
Sequential Data: RNNs are designed for sequences, such as documents (sequences of words) and time series.
Key Idea: Process input one element at a time, maintaining a “hidden state” (memory) of previous elements.
Unrolling the RNN: Visualizing it “unrolled” in time.
Note
This figure shows the architecture of a recurrent neural network (RNN). The input sequence (X) is processed one element at a time. The hidden state (A) is updated at each step, carrying information from previous steps. The output (O) can be produced at each step or only at the final step. Crucially, the same weights (W, U, B) are used at each step (weight sharing). This allows the RNN to handle sequences of varying lengths.
Hidden State Update:
\[ \mathbf{A}_{lk} = g\left(w_{k0} + \sum_{j=1}^p w_{kj} \mathbf{X}_{lj} + \sum_{s=1}^K u_{ks} \mathbf{A}_{l-1,s}\right) \]
Output:
\[ \mathbf{O}_l = \beta_0 + \sum_{k=1}^K \beta_k \mathbf{A}_{lk} \]
\(g(\cdot)\): activation function (e.g., ReLU).
\(\mathbf{W, U, B}\): shared weight matrices.
Hidden State Update: The new state \(\mathbf{A}_{lk}\) depends on the current input \(\mathbf{X}_{l}\) and the previous hidden state \(\mathbf{A}_{l-1}\).
Output: Output (\(\mathbf{O}_l\)) is a linear combination of the hidden state.
Weight Sharing: Same weight matrices (\(\mathbf{W, U, B}\)) at every time step.
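A minimal NumPy sketch of these update equations, processing a sequence one element at a time with the same weights at every step (ReLU for \(g\), random weights, all sizes illustrative):

```python
import numpy as np

def rnn_forward(X, W, U, w0, beta, beta0):
    """Process a sequence X (L x p) one step at a time.
    A_l = g(w0 + W @ X_l + U @ A_{l-1});  O_l = beta0 + beta @ A_l."""
    K = W.shape[0]
    A = np.zeros(K)                               # initial hidden state
    outputs = []
    for x_l in X:                                 # same W, U, w0, beta every step
        A = np.maximum(0, w0 + W @ x_l + U @ A)   # ReLU activation
        outputs.append(beta0 + beta @ A)
    return np.array(outputs)

rng = np.random.default_rng(1)
L, p, K = 5, 3, 4                                 # sequence length, inputs, hidden units
X = rng.normal(size=(L, p))
out = rnn_forward(X, rng.normal(size=(K, p)), rng.normal(size=(K, K)),
                  rng.normal(size=K), rng.normal(size=K), 0.0)
print(out)                                        # one output per time step
```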
Beyond Bag-of-Words: Use the sequence of words.
Word Embeddings: Represent each word as a dense vector (word embedding). Captures semantic relationships (e.g., “king,” “queen”). word2vec, GloVe.
Embedding Layer: Maps each word (one-hot encoded) to its embedding.
Processing the Sequence: RNN processes embeddings, updates state, produces output (e.g., sentiment).
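A hedged Keras sketch of this pipeline for sentiment classification: an embedding layer maps word indices to dense vectors, a simple RNN processes the sequence, and a sigmoid output gives the sentiment. The vocabulary size, embedding dimension, and unit counts are illustrative:

```python
from tensorflow import keras
from tensorflow.keras import layers

vocab_size, embed_dim = 10_000, 32                 # illustrative sizes

model = keras.Sequential([
    layers.Embedding(vocab_size, embed_dim),       # word index -> dense vector
    layers.SimpleRNN(32),                          # hidden state carried across the sequence
    layers.Dense(1, activation="sigmoid"),         # e.g. positive/negative sentiment
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```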
Note
Time Series Prediction: RNNs are a natural fit for predicting future values of a time series.
Example: Stock market trading volume.
Input: Past values (volume, return, volatility).
Output: Predicted volume for next day.
Autocorrelation: Values correlated with past (today’s volume similar to yesterday’s). RNNs capture this.
Note
This figure shows an example of time series data: daily trading statistics from the New York Stock Exchange (NYSE). The data includes log trading volume (top), the Dow Jones Industrial Average (DJIA) return (middle), and log volatility (bottom). Notice the patterns and fluctuations over time. Predicting future values of such series is a common task in finance.
Measures Correlation with Past: The ACF measures correlation with lagged values (past).
Interpreting the ACF: Shows how strongly related values are at different lags. High ACF at lag 1: today’s value correlated with yesterday’s.
Input Sequence: \(L\) past observations (e.g., \(L=5\) days): \[ \mathbf{X}_1 = \begin{pmatrix} v_{t-L} \\ r_{t-L} \\ z_{t-L} \end{pmatrix}, \mathbf{X}_2 = \begin{pmatrix} v_{t-L+1} \\ r_{t-L+1} \\ z_{t-L+1} \end{pmatrix}, \dots, \mathbf{X}_L = \begin{pmatrix} v_{t-1} \\ r_{t-1} \\ z_{t-1} \end{pmatrix} \]
\(v_t\): log trading volume, \(r_t\): DJIA return, \(z_t\): log volatility on day \(t\).
Output: \(Y = v_t\) (trading volume on day \(t\), to predict).
Creating Training Examples: Many \((\mathbf{X}, Y)\) pairs from historical data. Sequence of past, corresponding future value.
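A minimal NumPy sketch of turning a historical series into \((\mathbf{X}, Y)\) pairs: each input is the previous \(L\) days of (volume, return, volatility), and the target is today's volume. The random placeholder series stands in for the real NYSE data:

```python
import numpy as np

def make_sequences(series, L=5):
    """series: (T x 3) array of daily (volume, return, volatility).
    Returns X of shape (n, L, 3) and Y of shape (n,)."""
    X, Y = [], []
    for t in range(L, len(series)):
        X.append(series[t - L:t])        # the L previous days
        Y.append(series[t, 0])           # column 0 holds trading volume
    return np.array(X), np.array(Y)

rng = np.random.default_rng(2)
series = rng.normal(size=(1000, 3))      # placeholder for the real NYSE data
X, Y = make_sequences(series, L=5)
print(X.shape, Y.shape)                  # (995, 5, 3) (995,)
```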
Performance: RNN achieves \(R^2\) of 0.42 on test data. Explains 42% of variance in volume.
Comparison to Baseline: Outperforms a baseline using yesterday’s volume.
Traditional Time Series Model: Autoregression (AR) is classic.
Linear Prediction: Predicts based on linear combination of past.
AR(L) Model: Uses \(L\) previous values:
\[ \hat{v}_t = \beta_0 + \beta_1 v_{t-1} + \beta_2 v_{t-2} + \dots + \beta_L v_{t-L} \]
Including Other Variables: Can include lagged values of other variables (return, volatility).
RNN as a Non-Linear Extension: An RNN can be viewed as a non-linear generalization of the AR model; the AR model captures only linear relationships, while the RNN can capture complex, non-linear ones.
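A minimal sketch of fitting an AR(L) model by ordinary least squares with NumPy; the toy autocorrelated series and the choice L = 5 are illustrative:

```python
import numpy as np

def fit_ar(v, L=5):
    """OLS fit of v_t = beta0 + beta1*v_{t-1} + ... + betaL*v_{t-L}."""
    rows = [v[t - L:t][::-1] for t in range(L, len(v))]   # lagged predictors
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    y = v[L:]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta                                           # (beta0, beta1, ..., betaL)

rng = np.random.default_rng(3)
v = rng.normal(size=200).cumsum()        # a toy autocorrelated series
print(fit_ar(v, L=5))
```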
Addressing Vanishing Gradients: LSTMs address “vanishing gradient” problem (in standard RNNs with long sequences).
Two Hidden States: An LSTM carries both a short-term hidden state and a longer-term cell state, which helps it retain information over many time steps.
Improved Performance: LSTMs often better, especially for long sequences.
Complex Optimization: Fitting (finding optimal weights) is non-convex.
Local Minima: Loss function has many local minima.
Key Techniques:
Note
Gradient descent is like rolling a ball down a hill. The ball will eventually settle at the lowest point (the minimum). The learning rate controls how big of a step the ball takes at each iteration. A large learning rate might cause the ball to overshoot, while a small learning rate might make the descent slow.
Iterative Algorithm: Gradient descent is iterative.
Goal: Find parameters (\(\theta\)) that minimize loss \(R(\theta)\).
Steps:
Initialize \(\theta\) (often randomly).
Repeatedly update \(\theta\) opposite the gradient:
\[ \theta^{(m+1)} \leftarrow \theta^{(m)} - \rho \nabla R(\theta^{(m)}) \]
\(\rho\): learning rate (step size). Smaller: smaller steps (slower). Larger: larger steps (might overshoot).
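A minimal NumPy sketch of this update rule, minimizing a simple quadratic loss whose gradient is known in closed form (the target point and learning rate are illustrative):

```python
import numpy as np

def gradient_descent(grad, theta0, rho=0.1, n_steps=100):
    """Repeated update: theta^(m+1) <- theta^(m) - rho * grad(theta^(m))."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        theta = theta - rho * grad(theta)
    return theta

# Minimize R(theta) = (theta1 - 3)^2 + (theta2 + 1)^2; its gradient is 2*(theta - [3, -1]).
grad = lambda theta: 2 * (theta - np.array([3.0, -1.0]))
print(gradient_descent(grad, theta0=[0.0, 0.0]))   # converges toward [3, -1]
```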
Efficient Gradient Calculation: Backpropagation is efficient.
Chain Rule: Uses chain rule to “propagate” error backward.
Essential for Training: Without backpropagation, computing the gradients needed to train large networks would be far too slow to be practical.
Purpose: Prevent overfitting (model learns training data too well, performs poorly on unseen data).
Methods: weight penalties (e.g., ridge), dropout, and early stopping.
Minibatches: Use small batches of data (not entire dataset) for gradient.
Faster Updates: Much faster weight updates.
Randomness: Randomness helps escape local minima.
Standard Approach: SGD (and variants) is standard.
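A minimal NumPy sketch of minibatch SGD on a linear regression loss: each update uses a small random batch rather than the full dataset. The data, batch size, and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

theta, rho, batch_size = np.zeros(3), 0.05, 32
for epoch in range(20):
    idx = rng.permutation(len(X))                    # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]            # one minibatch
        resid = X[b] @ theta - y[b]
        grad = 2 * X[b].T @ resid / len(b)           # gradient of the squared error
        theta -= rho * grad                          # noisy but cheap update
print(theta)                                         # close to [1, -2, 0.5]
```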
Note
It’s essential to monitor both training and validation error during training. The training error typically decreases as the model learns. The validation error initially decreases, but if it starts to increase, it indicates overfitting. Early stopping is a form of regularization that stops the training process before overfitting becomes significant.
Note
Dropout randomly removes units (neurons) during training. This forces the network to learn more robust features and prevents it from relying too heavily on any single neuron. It’s like training multiple different networks and averaging their predictions.
Architecture and Hyperparameters: Choosing architecture (layers, units, activations) and hyperparameters (learning rate, batch size, regularization) is crucial.
Key Considerations:
Trial and Error: Finding good settings usually involves trial and error, which is time-consuming. More systematic options are grid search and random search; Bayesian optimization is a more advanced alternative.
Interpolation: Model perfectly fits training data (zero training error).
Double Descent: Test error decreases again after increasing (as complexity increases beyond interpolation).
Not a Contradiction: Does not contradict bias-variance tradeoff.
Large Datasets: Deep learning shines with lots of data. Small datasets: simpler models might be better (interpretable).
Complex Relationships: Highly non-linear, complex: deep learning effective.
Difficult Feature Engineering: Deep learning automates feature learning.
Interpretability is Less Important: Deep learning: “black boxes.” Interpretability crucial: simpler models.
Computational Resources: Training: computationally expensive (GPUs).
Ethical Implications: Ethics of deep learning (facial recognition, loans, hiring)? Fairness, avoid bias?
Interpretability: How to make models more interpretable? Why is it important?
Limitations: Limitations of deep learning? When are other methods (decision trees, SVMs) better?
Future of Deep Learning: Future evolution? New architectures, applications?
Model Complexity: How do we decide on model complexity? What happens if the model is too simple or too complex?
Data Quality: What actions can we take to ensure data quality and remove bias from datasets?
邱飞(peter) 💌 [email protected]