Objective: Build a model \(f(X)\) that takes applicant information \(X\) as input and outputs the optimal decision \(Y\).
Why Choose the Bayesian Approach?
Bayesian classifiers provide a powerful decision-making framework based on probability theory.
Our Learning Roadmap for This Chapter
We will build a complete Bayesian decision-making system step by step.
Part 1: Foundations of Probability
The Language of Uncertainty
Core Concept 1: Prior Probability
The Prior Probability \(P(y=c)\) is our inherent belief or the historical frequency of a class c occurring, before observing any new data.
In the credit approval example, the prior probability refers to:
Without looking at any specific information about an applicant, what is the probability that a customer will default, based on the bank’s historical data?
It is the baseline or starting point for our decision.
Prior Probability: A Concrete Example
Suppose the bank has processed 10,000 loan applications in the past, of which 500 ultimately defaulted.
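From these counts, the priors are simply the historical frequencies:
\[
P(y=\text{Default}) = \frac{500}{10000} = 0.05, \qquad P(y=\text{No Default}) = \frac{9500}{10000} = 0.95
\]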
Core Concept 2: Probability Density Function (PDF)
For continuous variables (like income, age), we use a Probability Density Function (p(x)) to describe their distribution.
The PDF itself is not a probability; p(x) can be greater than 1.
The probability that a variable falls into an interval [a, b] is the integral of the PDF over that interval: \(P(a \le x \le b) = \int_a^b p(x) dx\).
The total area under the curve is 1.
PDF (Continuous) vs. PMF (Discrete)
An important distinction: a PMF assigns an actual probability to each value of a discrete variable (e.g., P(y = Default) = 0.05), whereas a PDF only describes relative density for a continuous variable; probabilities come from integrating it over an interval.
Visualization: The Normal (Gaussian) Distribution
The Normal distribution is the most common PDF in finance and business, defined by its mean μ (center) and variance σ² (spread).
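As a quick numerical check of the last two points (the bell-curve shape and the fact that a density can exceed 1), here is a minimal sketch that evaluates the Normal PDF directly; the values are chosen purely for illustration.

```python
# A minimal sketch: evaluate the Normal PDF and note that its peak can exceed 1
# when the spread is small (a density is not a probability).
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print(normal_pdf(0.0, mu=0.0, sigma=1.0))   # ~0.399: peak of a standard Normal
print(normal_pdf(0.0, mu=0.0, sigma=0.1))   # ~3.989: greater than 1, yet a valid density
```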
Core Concept 3: Class-Conditional Probability
The Class-Conditional Probability \(p(x | y=c)\) answers: “If we already know a sample belongs to class c, what is the probability density of observing data x?”
In our credit approval example:
p(income | y = Default): Given a customer is a defaulter, what does their income distribution look like?
p(income | y = No Default): Given a customer is a good client, what does their income distribution look like?
This is the bridge connecting our observation (data x) and the unknown state (class y). We call this the Likelihood.
Visualizing Income Distributions for Different Customer Classes
Suppose historical data shows that non-defaulting customers (blue) generally have higher incomes than defaulting customers (red).
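A minimal sketch of what such a figure might look like, assuming (purely for illustration) Gaussian income densities with a mean of $60k for defaulters and $100k for non-defaulters:

```python
# Illustrative class-conditional income densities p(income | y).
# The means and standard deviations below are assumptions, not real bank data.
import numpy as np
import matplotlib.pyplot as plt

income = np.linspace(0, 200, 500)  # income in $k

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

plt.plot(income, normal_pdf(income, 60, 20), 'r', label='p(income | y = Default)')
plt.plot(income, normal_pdf(income, 100, 25), 'b', label='p(income | y = No Default)')
plt.xlabel('Income ($k)')
plt.ylabel('Density')
plt.legend()
plt.show()
```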
Core Concept 4: Posterior Probability
The Posterior Probability \(P(y=c | x)\) is the core of Bayesian decision-making. It is our updated belief that a sample belongs to class c after we have observed the data x.
It answers our most pressing question:
“Given that this applicant’s income is $120k, what is the probability that they are a default risk?”
This is the direct basis for our decision!
Bayes’ Theorem: The Magic Formula from Prior to Posterior
Bayes’ Theorem is the mathematical cornerstone that connects these four core concepts.
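In its standard form, it reads:
\[
\large{P(y=c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y=c)\, P(y=c)}{p(\mathbf{x})}}
\]
Its four ingredients are: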
Posterior: Our updated belief after seeing the evidence.
Likelihood: The probability of seeing this evidence, given the class.
Prior: Our initial belief.
Evidence: The overall probability of seeing this evidence, used for normalization.
Visualizing the Components of Bayes’ Theorem
Calculating the ‘Evidence’ p(x): The Law of Total Probability
The denominator p(x) is the total probability of observing the data x, regardless of its class. We calculate it using the Law of Total Probability.
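For our two-class credit example:
\[
\large{p(\mathbf{x}) = \sum_{c} p(\mathbf{x} \mid y=c)\, P(y=c) = p(\mathbf{x} \mid y=\text{Default})\,P(y=\text{Default}) + p(\mathbf{x} \mid y=\text{No Default})\,P(y=\text{No Default})}
\]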
Part 2: Bayesian Decision Theory
From Probabilities to Profits: The Art of Optimal Decision
With Posterior Probabilities, How Do We Decide?
We can now calculate P(y=Default | x) and P(y=No Default | x).
A seemingly simple and intuitive rule is: Choose the class with the highest posterior probability.
\[
\large{f(\mathbf{x}) = \underset{c}{\arg\max} \ P(y=c \mid \mathbf{x})}
\] This is known as the Maximum a Posteriori (MAP) decision rule.
The Hidden Assumption of the MAP Rule
The MAP rule implies a very strong assumption: the cost of all types of errors is the same.
In the real world, this assumption is often false.
The Costs of Misclassification Are Different: Introducing the Loss Function
In credit approval, there are two types of errors:
False Negative (a Type II error, treating ‘Default’ as the positive class): Misclassifying a defaulting customer as ‘No Default’ (approving a bad loan).
Consequence: The bank loses the principal, a huge cost.
False Positive (a Type I error): Misclassifying a non-defaulting customer as ‘Default’ (rejecting a good customer).
Consequence: The bank loses potential interest income, a smaller cost.
Quantifying Costs: The Loss Matrix L(i, j)
We use a Loss Matrix L(i, j) to quantify the cost of classifying a sample whose true class is j as class i.
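For concreteness, one loss matrix (in $k) consistent with the worked example below would be: correct decisions cost 0, rejecting a good customer costs the lost interest, \(L(\text{Reject}, \text{No Default}) = 1\), and approving a defaulter costs the lost principal, \(L(\text{Approve}, \text{Default}) = 420\). These precise figures are illustrative.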
Expected Loss: Measuring the Average Cost of a Decision
For a given observation x, if we classify it as class i, the Expected Loss (or Conditional Risk) \(R(i \mid \mathbf{x})\) is the weighted average of the losses over all possible true classes j, where the weights are the posterior probabilities.
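In symbols:
\[
\large{R(i \mid \mathbf{x}) = \sum_{j=1}^{C} L(i, j)\, P(y=j \mid \mathbf{x})}
\]
In the example below, the posteriors are \(P(y=\text{No Default} \mid \mathbf{x}) = 0.8\) and \(P(y=\text{Default} \mid \mathbf{x}) = 0.2\).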
\[
\large{R(\text{Reject} \mid \mathbf{x}) = (1) \times 0.8 + (0) \times 0.2 = 0.8}
\] The expected loss for choosing ‘Reject’ is $0.8k.
The Bayes Decision Rule: Minimize Expected Loss
The optimal decision \(f^*(\mathbf{x})\) is to choose, for each \(\mathbf{x}\), the class \(i\) that minimizes the expected loss \(R(i|x)\).
\[
\large{f^*(\mathbf{x}) = \underset{i}{\arg\min} \ R(i \mid \mathbf{x})}
\] In our example: R(Approve|x) = 84 vs. R(Reject|x) = 0.8. Since 0.8 < 84, the optimal decision is to ‘Reject’.
Even though the applicant has an 80% chance of being a good customer! The 20% risk of default, combined with the high cost, makes rejection the more rational choice.
A Special Case: The 0-1 Loss Function
If the cost of all misclassifications is equal (e.g., 1) and the cost of correct classification is 0, this is the 0-1 loss function.
Under 0-1 loss, the expected loss R(i|x) simplifies to: \[
\large{R(i \mid \mathbf{x}) = \sum_{j=1}^{C} L(i, j) P(y=j \mid \mathbf{x}) = \sum_{j \neq i} P(y=j \mid \mathbf{x})}
\]\[
\large{= 1 - P(y=i \mid \mathbf{x})}
\] Minimizing R(i|x) is equivalent to minimizing 1 - P(y=i|x), which means maximizing the posterior probability P(y=i|x).
Therefore, the MAP decision rule is a special case of minimizing expected loss under a 0-1 loss function.
Visualizing the Decision Boundary
The decision rule divides the feature space into regions, each corresponding to a decision class. The border between these regions is called the decision boundary.
For the MAP rule, the decision boundary is where the posterior probabilities are equal, e.g., P(y=1|x) = P(y=2|x).
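For example, if the two classes have equal priors and Gaussian likelihoods with the same variance \(\sigma^2\) but different means \(\mu_1, \mu_2\), the boundary condition \(P(y=1|x) = P(y=2|x)\) reduces to \(p(x|y=1) = p(x|y=2)\), which is solved by the midpoint:
\[
\large{x = \frac{\mu_1 + \mu_2}{2}}
\]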
Part 3: Parameter Estimation
Learning from Data: The Estimation Challenge
The Real-World Challenge: We Don’t Know the True Probability Distributions
So far, we have assumed that P(y=c) and p(x|y=c) are known.
In reality, we don’t know them. All we have is a set of historical data (the training set).
Core Task: How can we estimate the parameters of these probability distributions from the data?
The Parameter Estimation Approach
Choose a Model: We first assume the data follows a specific form of probability distribution, such as a Gaussian distribution \(\mathcal{N}(\mu, \sigma^2)\).
Estimate Parameters: Our task then becomes estimating the model’s unknown parameters, \(\theta = (\mu, \sigma^2)\), from the data.
Method 1: Maximum Likelihood Estimation (MLE)
The core idea of Maximum Likelihood Estimation (MLE) is:
Choose the set of parameters \(θ\) that maximizes the joint probability of observing the data we have, \(X = \{x_1, \ldots, x_N\}\).
In other words, which set of parameters provides the “best explanation” for the data we see?
The Likelihood Function L(θ|X)
We define the likelihood function L(θ|X), which represents the probability of observing data X given parameters θ.
Assuming the samples are independent and identically distributed (i.i.d.), the joint probability is the product of individual sample probabilities: \[
\large{L(\theta \mid X) = p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)}
\]
Our goal is to find the θ* that maximizes L(θ|X). \[
\large{\theta^*_{MLE} = \underset{\theta}{\arg\max} \ L(\theta \mid X)}
\]
MLE Trick: Use the Log-Likelihood Function
Products are difficult to differentiate. We typically maximize the log-likelihood function LL(θ|X) instead.
Since the logarithm is a monotonically increasing function, maximizing L is the same as maximizing log(L). \[
\large{LL(\theta \mid X) = \log p(X \mid \theta) = \sum_{n=1}^{N} \log p(x_n \mid \theta)}
\] This turns the product into a sum, greatly simplifying the calculation.
MLE Example: Estimating the Mean of a Normal Distribution
Suppose our data \(x_1, ..., x_N\) comes from a Normal distribution with a known variance \(σ²\) but an unknown mean \(μ\).
The log-likelihood function is: \[
\large{LL(\mu) = \sum_{n=1}^{N} \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_n - \mu)^2}{2\sigma^2}} \right)}
\]
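Differentiating with respect to \(μ\) gives:
\[
\large{\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2} \sum_{n=1}^{N} (x_n - \mu)}
\]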
Setting this derivative to zero and solving for \(μ\), we find: \[
\large{\hat{\mu}_{MLE} = \frac{1}{N} \sum_{n=1}^{N} x_n}
\]
The result is very intuitive: the sample mean is the maximum likelihood estimate of the population mean.
The Limitation of MLE: Overfitting
MLE completely “trusts” the data. If the dataset is small or biased, the MLE result can be poor.
Classic example: Tossing a coin.
You flip a coin 3 times and get Heads, Heads, Heads.
MLE will estimate the probability of Heads as \(p(H) = 3/3 = 1\).
This implies you will never predict Tails. This is obviously unreasonable.
We need a way to incorporate our prior knowledge about the world (e.g., that coins are usually fair).
Method 2: Maximum a Posteriori (MAP) Estimation
MAP (Maximum a Posteriori) estimation builds on MLE by introducing a prior distribution p(θ) over the parameters θ themselves.
It no longer maximizes the likelihood p(X|θ), but rather the posterior probability of the parameters p(θ|X). \[
\large{p(\theta \mid X) \propto p(X \mid \theta) p(\theta)}
\] The goal of MAP is: \[
\large{\theta^*_{MAP} = \underset{\theta}{\arg\max} \ [p(X \mid \theta) p(\theta)]}
\]
The final MAP estimate is a balance between the evidence from the data and our initial beliefs.
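For instance, if we place a Beta(2, 2) prior on the coin’s heads probability (a mild belief that the coin is roughly fair), the MAP estimate after observing 3 heads in 3 flips is
\[
\large{\hat{p}(H)_{MAP} = \frac{3 + 2 - 1}{3 + 2 + 2 - 2} = \frac{4}{5} = 0.8}
\]
instead of the MLE’s 1: the prior pulls the estimate back toward fairness.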
Part 4: Handling Complex Distributions
When Reality is Messy: Mixture Models
The Limitation of a Single Model
So far, we have assumed that data from a single class can be described by one simple distribution (like a single Gaussian). But what if the data distribution is more complex?
A Solution for Complexity: Gaussian Mixture Models (GMM)
What if our data distribution is a mix of several “clusters”?
For example, customer spending habits might be divided into ‘high-spending’, ‘medium-spending’, and ‘low-spending’ groups, with each group being approximately normally distributed.
The Gaussian Mixture Model (GMM) is designed precisely for this situation.
GMM Definition: A Weighted Sum of Multiple Gaussians
A GMM models a complex probability distribution as a weighted sum of K Gaussian components.
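Formally:
\[
\large{p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
\]
where: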
\(K\): The number of mixture components (a hyperparameter).
\(π_k\): The mixing coefficient (weight) of the \(k\)-th component, with \(\sum_{k=1}^{K} \pi_k = 1\).
\(μ_k, Σ_k\): The mean and covariance matrix of the \(k\)-th Gaussian component.
Visualization: Fitting Complex Data with a GMM
Let’s generate some data composed of 3 clusters using Python and fit it with a GMM.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# 1. Generate simulated data
X, y_true = make_blobs(n_samples=400, centers=3, cluster_std=0.9, random_state=42)

# 2. Initialize and fit the GMM model
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
labels = gmm.predict(X)

# 3. Visualize the results
fig, ax = plt.subplots(figsize=(8, 4.5))
scatter = ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
ax.set_title('Clusters Identified by GMM', fontsize=16)
ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
# legend_elements() returns (handles, labels); pass only the handles
ax.legend(handles=scatter.legend_elements()[0], labels=['Cluster 0', 'Cluster 1', 'Cluster 2'], title="Clusters")
plt.show()
```
Figure 1: Fitting data generated from a mixture of three Gaussians using a GMM
The GMM Challenge: A Latent Variable Problem
Estimating the parameters of a GMM (\(π_k, μ_k, Σ_k\)) is much more complex than for a single Gaussian.
This is because we don’t know which Gaussian component each data point \(x_n\) actually belongs to. This “membership” information is a latent variable.
This is a “chicken-and-egg” problem:
If we knew which cluster each point belonged to, we could easily estimate each cluster’s parameters.
If we knew each cluster’s parameters, we could calculate the probability of each point belonging to each cluster.
The Solution: The Expectation-Maximization (EM) Algorithm
The Expectation-Maximization (EM) algorithm is an iterative algorithm designed specifically to solve parameter estimation problems with latent variables.
It elegantly solves the “chicken-and-egg” dilemma by alternating between two steps until convergence.
The EM Algorithm: E-Step (Expectation)
E-Step (Expectation): Based on the current model parameters, compute the posterior probability (also called the “responsibility” \(r_{nk}\)) that each data point \(x_n\) was generated by each Gaussian component \(k\).
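Concretely, with the current parameter estimates:
\[
\large{r_{nk} = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}
\]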
In simple terms, we make a “soft assignment” for each data point, guessing how likely it is to have come from each cluster.
The EM Algorithm: M-Step (Maximization)
M-Step (Maximization): Based on the “responsibilities” \(r_{nk}\) calculated in the E-step, update the model parameters \(π_k, μ_k, Σ_k\) to maximize the expected log-likelihood.
This is equivalent to performing a weighted MLE estimation for each cluster, where the weights are the responsibilities \(r_{nk}\).
For example, the new mean is a weighted average of all data points: \[
\large{\mu_k^{new} = \frac{\sum_{n=1}^N r_{nk} x_n}{\sum_{n=1}^N r_{nk}}}
\]
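Similarly, the new mixing coefficient is the average responsibility of component \(k\):
\[
\large{\pi_k^{new} = \frac{1}{N} \sum_{n=1}^{N} r_{nk}}
\]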
The EM Algorithm Flow
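To make the alternation concrete, here is a minimal, illustrative EM loop for a one-dimensional GMM. The function and variable names are our own; in practice you would use sklearn.mixture.GaussianMixture, as in the earlier example.

```python
# A minimal EM sketch for a 1-D, K-component GMM (illustrative only).
import numpy as np

def em_gmm_1d(x, K=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(x)
    # Initialization
    pi = np.full(K, 1.0 / K)                    # mixing coefficients
    mu = rng.choice(x, size=K, replace=False)   # means (random data points)
    var = np.full(K, np.var(x))                 # variances

    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] for each point and component
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)

        # M-step: weighted MLE updates using the responsibilities as weights
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

# Example: data drawn from two well-separated Gaussians
x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_gmm_1d(x, K=2))
```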
Part 5: The Naive Bayes Classifier
Putting it all Together for a Practical Solution
The Challenge: The Curse of Dimensionality
When our data x has many features, directly estimating the d-dimensional joint probability distribution p(x|y=c) is extremely difficult and requires vast amounts of data. This is known as the “curse of dimensionality”.
The ‘Naive’ Assumption: Conditional Independence of Features
The Naïve Bayes classifier makes a very bold (but often effective) simplifying assumption:
Given the class y, all features \(x_i\) are mutually independent.
This means: once we know a customer is in the ‘Default’ class, their ‘income’ level and their ‘age’ are considered independent pieces of information.
The Power of the Independence Assumption
This “naive” assumption makes calculating the joint probability incredibly simple:
The originally complex joint probability: \[ \large{p(\mathbf{x}|y=c) = p(x_1, x_2, \ldots, x_d | y=c)} \]
Under the independence assumption, it decomposes into the product of the individual class-conditional probabilities for each feature: \[ \large{p(\mathbf{x}|y=c) = \prod_{i=1}^{d} p(x_i | y=c)} \]
Now we only need to estimate a one-dimensional \(p(x_i|y=c)\) for each feature separately, which is much easier than estimating a high-dimensional joint distribution!
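Combining this with Bayes’ theorem (and dropping the constant denominator \(p(\mathbf{x})\)), the Naive Bayes decision rule becomes:
\[
\large{f(\mathbf{x}) = \underset{c}{\arg\max}\ P(y=c) \prod_{i=1}^{d} p(x_i \mid y=c)}
\]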
The Graphical Model of Naive Bayes
This assumption can be represented by a simple graphical model.
Case Study: Iris Flower Classification with Naive Bayes
Let’s apply Naive Bayes to a classic dataset: the Iris dataset.
Goal: Classify an iris flower into one of three species based on four features (sepal length/width, petal length/width).
Model: Gaussian Naive Bayes, which assumes that \(p(x_i|y=c)\) for each feature follows a Gaussian distribution.
Data Loading and Exploration
First, we use scikit-learn to load the data and split it into training and testing sets.
```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the data: X contains the features, y contains the labels (species)
X, y = load_iris(return_X_y=True)

# Split the data into a training set (70%) and a testing set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f'Training set size: {X_train.shape}')
print(f'Testing set size: {X_test.shape}')
```
Training set size: (105, 4)
Testing set size: (45, 4)
Model Training and Prediction
Training a Naive Bayes model is very fast. It only needs to calculate the mean and variance for each feature within each class, without complex iterative optimization.
```python
# Initialize the Gaussian Naive Bayes model
gnb = GaussianNB()

# Use the training data to 'learn' the model parameters
# This step calculates the mean and variance for each feature per class
gnb.fit(X_train, y_train)

# Use the trained model to make predictions on the test set
y_pred = gnb.predict(X_test)

# Print some predictions to compare with the true labels
print(f'True labels (first 10): {y_test[:10]}')
print(f'Predicted labels (first 10): {y_pred[:10]}')
```
We use Accuracy to evaluate the model’s performance, which is the proportion of correctly classified samples.
```python
# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of the Gaussian Naive Bayes classifier: {accuracy:.4f}')
```
Accuracy of the Gaussian Naive Bayes classifier: 0.9778
Despite the “naive” independence assumption being almost certainly false in reality (petal length and width are likely correlated), the classifier’s performance is still excellent!
Why Does Naive Bayes Work So Well?
Lower Bar for Success: It doesn’t need to produce accurate probability estimates, only to ensure that the posterior probability for the correct class is higher than for any other class.
Resists Overfitting: Its simple assumptions lead to a low-complexity model with few parameters, making it more robust than complex models, especially with limited data.
Efficiency: Training and prediction are extremely fast, making it ideal for large datasets and real-time applications.
Part 6: Real-World Issues and Solutions
From Theory to Practice: Final Touches
Another Practical Problem: The Zero-Probability Issue
Consider a text classification task (e.g., spam detection), where the features are discrete words.
We use MLE to estimate the class-conditional probability p(word="offer" | class="spam") by calculating the frequency of “offer” in spam emails from the training set.
Problem: What if the word “stock” never appeared in any spam emails in our training data? Then: \[
\large{p(\text{word}="\text{stock}" \mid y=\text{spam}) = 0}
\]
The Catastrophic Consequence of Zero Probability
If p("stock" | spam) = 0, then for any new email containing the word “stock”, the posterior probability of it being spam will be zero!
A single word unseen in the training set completely rules out a class. This makes the model fragile and unreasonable. It has “seen” too little and is too certain.
The Solution: Laplace Smoothing
Laplace Smoothing, also known as Additive Smoothing, is a simple and effective method for handling the zero-probability problem.
The core idea is: When calculating probabilities, artificially add a small constant λ (usually 1) to the count of every possible event.
This is like giving every possible outcome a “head start,” ensuring that even if an event was never seen in the sample, its estimated probability will not be zero.
The Formula for Laplace Smoothing
For a discrete feature, the unsmoothed MLE estimate is: \[
\large{P(x_i = v \mid y=c) = \frac{N_{cv}}{N_c}}
\]
\(N_{cv}\): Number of samples in class c where the feature takes value v
\(N_c\): Total number of samples in class c
With Laplace smoothing (λ > 0): \[
\large{P_{\lambda}(x_i = v \mid y=c) = \frac{N_{cv} + \lambda}{N_c + \lambda S_i}}
\]
\(S_i\): Total number of possible values for feature \(i\) (e.g., vocabulary size)
\(λ\): The smoothing parameter (when \(λ=1\), it’s called “add-one smoothing”)
The Effect of Smoothing: An Example
Assume a feature has two values {Yes, No}, and we have 10 samples in class c.
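For instance, suppose ‘Yes’ never appears among those 10 samples. The unsmoothed MLE estimate is \(P(\text{Yes} \mid c) = 0/10 = 0\). With add-one smoothing (\(λ = 1\), \(S_i = 2\)):
\[
\large{P_{1}(\text{Yes} \mid c) = \frac{0 + 1}{10 + 1 \times 2} = \frac{1}{12}, \qquad P_{1}(\text{No} \mid c) = \frac{10 + 1}{10 + 2} = \frac{11}{12}}
\]
The zero disappears, and the two probabilities still sum to 1.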
Chapter Summary: The Core Ideas of Bayesian Classifiers
Preview of Next Lecture
Today, we explored a classic Generative Model—the Bayesian classifier.
We learned how to model the class-conditional density p(x|y) and the prior P(y).
Next, we will study another major class of models: Discriminative Models, such as Logistic Regression. These models bypass modeling p(x|y) and model the posterior probability P(y|x) directly.