Objective: Build a model \(f(X)\) that takes applicant information \(X\) as input and outputs the optimal decision \(Y\).
Why Choose the Bayesian Approach?
Bayesian classifiers provide a powerful decision-making framework based on probability theory.
Our Learning Roadmap for This Chapter
We will build a complete Bayesian decision-making system step by step.
Part 1: Foundations of Probability
The Language of Uncertainty
Core Concept 1: Prior Probability
The Prior Probability \(P(y=c)\) is our inherent belief or the historical frequency of a class c occurring, before observing any new data.
In the credit approval example, the prior probability refers to:
Without looking at any specific information about an applicant, what is the probability that a customer will default, based on the bank’s historical data?
It is the baseline or starting point for our decision.
Prior Probability: A Concrete Example
Suppose the bank has processed 10,000 loan applications in the past, of which 500 ultimately defaulted.
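From these counts, the priors are simply the historical frequencies:
\[
P(y=\text{Default}) = \frac{500}{10000} = 0.05, \qquad P(y=\text{No Default}) = \frac{9500}{10000} = 0.95
\]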
Core Concept 2: Probability Density Function (PDF)
For continuous variables (like income, age), we use a Probability Density Function (p(x)) to describe their distribution.
The PDF itself is not a probability; p(x) can be greater than 1.
The probability that a variable falls into an interval [a, b] is the integral of the PDF over that interval: \(P(a \le x \le b) = \int_a^b p(x) dx\).
The total area under the curve is 1.
PDF (Continuous) vs. PMF (Discrete)
An important distinction: a PMF assigns an actual probability to each value of a discrete variable (e.g., P(y = Default) = 0.05), whereas a PDF only describes relative density for a continuous variable; probabilities come from integrating it over an interval.
Visualization: The Normal (Gaussian) Distribution
The Normal distribution is the most common PDF in finance and business, defined by its mean μ (center) and variance σ² (spread).
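As a quick numerical check of the last two points (the bell-curve shape and the fact that a density can exceed 1), here is a minimal sketch that evaluates the Normal PDF directly; the values are chosen purely for illustration.

```python
# A minimal sketch: evaluate the Normal PDF and note that its peak can exceed 1
# when the spread is small (a density is not a probability).
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print(normal_pdf(0.0, mu=0.0, sigma=1.0))   # ~0.399: peak of a standard Normal
print(normal_pdf(0.0, mu=0.0, sigma=0.1))   # ~3.989: greater than 1, yet a valid density
```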
Core Concept 3: Class-Conditional Probability
The Class-Conditional Probability \(p(x | y=c)\) answers: “If we already know a sample belongs to class c, what is the probability density of observing data x?”
In our credit approval example:
p(income | y = Default): Given a customer is a defaulter, what does their income distribution look like?
p(income | y = No Default): Given a customer is a good client, what does their income distribution look like?
This is the bridge connecting our observation (data x) and the unknown state (class y). We call this the Likelihood.
Visualizing Income Distributions for Different Customer Classes
Suppose historical data shows that non-defaulting customers (blue) generally have higher incomes than defaulting customers (red).
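A minimal sketch of what such a figure might look like, assuming (purely for illustration) Gaussian income densities with a mean of $60k for defaulters and $100k for non-defaulters:

```python
# Illustrative class-conditional income densities p(income | y).
# The means and standard deviations below are assumptions, not real bank data.
import numpy as np
import matplotlib.pyplot as plt

income = np.linspace(0, 200, 500)  # income in $k

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

plt.plot(income, normal_pdf(income, 60, 20), 'r', label='p(income | y = Default)')
plt.plot(income, normal_pdf(income, 100, 25), 'b', label='p(income | y = No Default)')
plt.xlabel('Income ($k)')
plt.ylabel('Density')
plt.legend()
plt.show()
```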
Core Concept 4: Posterior Probability
The Posterior Probability \(P(y=c | x)\) is the core of Bayesian decision-making. It is our updated belief that a sample belongs to class c after we have observed the data x.
It answers our most pressing question:
“Given that this applicant’s income is $120k, what is the probability that they are a default risk?”
This is the direct basis for our decision!
Bayes’ Theorem: The Magic Formula from Prior to Posterior
Bayes’ Theorem is the mathematical cornerstone that connects these four core concepts.
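In its standard form, it reads:
\[
\large{P(y=c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y=c)\, P(y=c)}{p(\mathbf{x})}}
\]
Its four ingredients are: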
Posterior: Our updated belief after seeing the evidence.
Likelihood: The probability of seeing this evidence, given the class.
Prior: Our initial belief.
Evidence: The overall probability of seeing this evidence, used for normalization.
Visualizing the Components of Bayes’ Theorem
Calculating the ‘Evidence’ p(x): The Law of Total Probability
The denominator p(x) is the total probability of observing the data x, regardless of its class. We calculate it using the Law of Total Probability.
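For our two-class credit example:
\[
\large{p(\mathbf{x}) = \sum_{c} p(\mathbf{x} \mid y=c)\, P(y=c) = p(\mathbf{x} \mid y=\text{Default})\,P(y=\text{Default}) + p(\mathbf{x} \mid y=\text{No Default})\,P(y=\text{No Default})}
\]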
Part 2: Bayesian Decision Theory
From Probabilities to Profits: The Art of Optimal Decision
With Posterior Probabilities, How Do We Decide?
We can now calculate P(y=Default | x) and P(y=No Default | x).
A seemingly simple and intuitive rule is: Choose the class with the highest posterior probability.
\[
\large{f(\mathbf{x}) = \underset{c}{\arg\max} \ P(y=c \mid \mathbf{x})}
\] This is known as the Maximum a Posteriori (MAP) decision rule.
The Hidden Assumption of the MAP Rule
The MAP rule implies a very strong assumption: the cost of all types of errors is the same.
In the real world, this assumption is often false.
The Costs of Misclassification Are Different: Introducing the Loss Function
In credit approval, there are two types of errors:
False Negative (a Type II error, treating ‘Default’ as the positive class): Misclassifying a defaulting customer as ‘No Default’ (approving a bad loan).
Consequence: The bank loses the principal, a huge cost.
False Positive (a Type I error): Misclassifying a non-defaulting customer as ‘Default’ (rejecting a good customer).
Consequence: The bank loses potential interest income, a smaller cost.
Quantifying Costs: The Loss Matrix L(i, j)
We use a Loss Matrix L(i, j) to quantify the cost of classifying a sample whose true class is j as class i.
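For concreteness, one loss matrix (in $k) consistent with the worked example below would be: correct decisions cost 0, rejecting a good customer costs the lost interest, \(L(\text{Reject}, \text{No Default}) = 1\), and approving a defaulter costs the lost principal, \(L(\text{Approve}, \text{Default}) = 420\). These precise figures are illustrative.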
Expected Loss: Measuring the Average Cost of a Decision
For a given observation x, if we classify it as class i, the Expected Loss (or Conditional Risk) \(R(i \mid \mathbf{x})\) is the weighted average of the losses over all possible true classes j, where the weights are the posterior probabilities.
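In symbols:
\[
\large{R(i \mid \mathbf{x}) = \sum_{j=1}^{C} L(i, j)\, P(y=j \mid \mathbf{x})}
\]
In the example below, the posteriors are \(P(y=\text{No Default} \mid \mathbf{x}) = 0.8\) and \(P(y=\text{Default} \mid \mathbf{x}) = 0.2\).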
\[
\large{R(\text{Reject} \mid \mathbf{x}) = (1) \times 0.8 + (0) \times 0.2 = 0.8}
\] The expected loss for choosing ‘Reject’ is $0.8k.
The Bayes Decision Rule: Minimize Expected Loss
The optimal decision \(f^*(\mathbf{x})\) is to choose, for each \(\mathbf{x}\), the class \(i\) that minimizes the expected loss \(R(i|x)\).
\[
\large{f^*(\mathbf{x}) = \underset{i}{\arg\min} \ R(i \mid \mathbf{x})}
\] In our example: R(Approve|x) = 84 vs. R(Reject|x) = 0.8. Since 0.8 < 84, the optimal decision is to ‘Reject’.
Even though the applicant has an 80% chance of being a good customer! The 20% risk of default, combined with the high cost, makes rejection the more rational choice.
A Special Case: The 0-1 Loss Function
If the cost of all misclassifications is equal (e.g., 1) and the cost of correct classification is 0, this is the 0-1 loss function.
Under 0-1 loss, the expected loss R(i|x) simplifies to: \[
\large{R(i \mid \mathbf{x}) = \sum_{j=1}^{C} L(i, j) P(y=j \mid \mathbf{x}) = \sum_{j \neq i} P(y=j \mid \mathbf{x})}
\]\[
\large{= 1 - P(y=i \mid \mathbf{x})}
\] Minimizing R(i|x) is equivalent to minimizing 1 - P(y=i|x), which means maximizing the posterior probability P(y=i|x).
Therefore, the MAP decision rule is a special case of minimizing expected loss under a 0-1 loss function.
Visualizing the Decision Boundary
The decision rule divides the feature space into regions, each corresponding to a decision class. The border between these regions is called the decision boundary.
For the MAP rule, the decision boundary is where the posterior probabilities are equal, e.g., P(y=1|x) = P(y=2|x).
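For example, if the two classes have equal priors and Gaussian likelihoods with the same variance \(\sigma^2\) but different means \(\mu_1, \mu_2\), the boundary condition \(P(y=1|x) = P(y=2|x)\) reduces to \(p(x|y=1) = p(x|y=2)\), which is solved by the midpoint:
\[
\large{x = \frac{\mu_1 + \mu_2}{2}}
\]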
Part 3: Parameter Estimation
Learning from Data: The Estimation Challenge
The Real-World Challenge: We Don’t Know the True Probability Distributions
So far, we have assumed that P(y=c) and p(x|y=c) are known.
In reality, we don’t know them. All we have is a set of historical data (the training set).
Core Task: How can we estimate the parameters of these probability distributions from the data?
The Parameter Estimation Approach
Choose a Model: We first assume the data follows a specific form of probability distribution, such as a Gaussian distribution \(\mathcal{N}(\mu, \sigma^2)\).
Estimate Parameters: Our task then becomes estimating the model’s unknown parameters, \(\theta = (\mu, \sigma^2)\), from the data.
Method 1: Maximum Likelihood Estimation (MLE)
The core idea of Maximum Likelihood Estimation (MLE) is:
Choose the set of parameters \(θ\) that maximizes the joint probability of observing the data we have, \(X = \{x_1, \ldots, x_N\}\).
In other words, which set of parameters provides the “best explanation” for the data we see?
The Likelihood Function L(θ|X)
We define the likelihood function L(θ|X), which represents the probability of observing data X given parameters θ.
Assuming the samples are independent and identically distributed (i.i.d.), the joint probability is the product of individual sample probabilities: \[
\large{L(\theta \mid X) = p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)}
\]
Our goal is to find the θ* that maximizes L(θ|X). \[
\large{\theta^*_{MLE} = \underset{\theta}{\arg\max} \ L(\theta \mid X)}
\]
MLE Trick: Use the Log-Likelihood Function
Products are difficult to differentiate. We typically maximize the log-likelihood function LL(θ|X) instead.
Since the logarithm is a monotonically increasing function, maximizing L is the same as maximizing log(L). \[
\large{LL(\theta \mid X) = \log p(X \mid \theta) = \sum_{n=1}^{N} \log p(x_n \mid \theta)}
\] This turns the product into a sum, greatly simplifying the calculation.
MLE Example: Estimating the Mean of a Normal Distribution
Suppose our data \(x_1, ..., x_N\) comes from a Normal distribution with a known variance \(σ²\) but an unknown mean \(μ\).
The log-likelihood function is: \[
\large{LL(\mu) = \sum_{n=1}^{N} \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_n - \mu)^2}{2\sigma^2}} \right)}
\]
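Differentiating with respect to \(μ\) gives:
\[
\large{\frac{\partial LL}{\partial \mu} = \frac{1}{\sigma^2} \sum_{n=1}^{N} (x_n - \mu)}
\]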
Setting this derivative to zero and solving for \(μ\), we find: \[
\large{\hat{\mu}_{MLE} = \frac{1}{N} \sum_{n=1}^{N} x_n}
\]
The result is very intuitive: the sample mean is the maximum likelihood estimate of the population mean.
The Limitation of MLE: Overfitting
MLE completely “trusts” the data. If the dataset is small or biased, the MLE result can be poor.
Classic example: Tossing a coin.
You flip a coin 3 times and get Heads, Heads, Heads.
MLE will estimate the probability of Heads as \(p(H) = 3/3 = 1\).
This implies you will never predict Tails. This is obviously unreasonable.
We need a way to incorporate our prior knowledge about the world (e.g., that coins are usually fair).
Method 2: Maximum a Posteriori (MAP) Estimation
MAP (Maximum a Posteriori) estimation builds on MLE by introducing a prior distribution p(θ) over the parameters θ themselves.
It no longer maximizes the likelihood p(X|θ), but rather the posterior probability of the parameters p(θ|X). \[
\large{p(\theta \mid X) \propto p(X \mid \theta) p(\theta)}
\] The goal of MAP is: \[
\large{\theta^*_{MAP} = \underset{\theta}{\arg\max} \ [p(X \mid \theta) p(\theta)]}
\]
The final MAP estimate is a balance between the evidence from the data and our initial beliefs.
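For instance, if we place a Beta(2, 2) prior on the coin’s heads probability (a mild belief that the coin is roughly fair), the MAP estimate after observing 3 heads in 3 flips is
\[
\large{\hat{p}(H)_{MAP} = \frac{3 + 2 - 1}{3 + 2 + 2 - 2} = \frac{4}{5} = 0.8}
\]
instead of the MLE’s 1: the prior pulls the estimate back toward fairness.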
Part 4: Handling Complex Distributions
When Reality is Messy: Mixture Models
The Limitation of a Single Model
So far, we have assumed that data from a single class can be described by one simple distribution (like a single Gaussian). But what if the data distribution is more complex?
A Solution for Complexity: Gaussian Mixture Models (GMM)
What if our data distribution is a mix of several “clusters”?
For example, customer spending habits might be divided into ‘high-spending’, ‘medium-spending’, and ‘low-spending’ groups, with each group being approximately normally distributed.
The Gaussian Mixture Model (GMM) is designed precisely for this situation.
GMM Definition: A Weighted Sum of Multiple Gaussians
A GMM models a complex probability distribution as a weighted sum of K Gaussian components.
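Formally:
\[
\large{p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
\]
where: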
\(K\): The number of mixture components (a hyperparameter).
\(π_k\): The mixing coefficient (weight) of the \(k\)-th component, with \(\sum_{k=1}^{K} \pi_k = 1\).
\(μ_k, Σ_k\): The mean and covariance matrix of the \(k\)-th Gaussian component.
Visualization: Fitting Complex Data with a GMM
Let’s generate some data composed of 3 clusters using Python and fit it with a GMM.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# 1. Generate simulated data
X, y_true = make_blobs(n_samples=400, centers=3, cluster_std=0.9, random_state=42)

# 2. Initialize and fit the GMM model
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
labels = gmm.predict(X)

# 3. Visualize the results
fig, ax = plt.subplots(figsize=(8, 4.5))
scatter = ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
ax.set_title('Clusters Identified by GMM', fontsize=16)
ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
# legend_elements() returns (handles, labels); pass only the handles
ax.legend(handles=scatter.legend_elements()[0], labels=['Cluster 0', 'Cluster 1', 'Cluster 2'], title="Clusters")
plt.show()
```
Figure 1: Fitting data generated from a mixture of three Gaussians using a GMM
The GMM Challenge: A Latent Variable Problem
Estimating the parameters of a GMM (\(π_k, μ_k, Σ_k\)) is much more complex than for a single Gaussian.
This is because we don’t know which Gaussian component each data point \(x_n\) actually belongs to. This “membership” information is a latent variable.
This is a “chicken-and-egg” problem:
If we knew which cluster each point belonged to, we could easily estimate each cluster’s parameters.
If we knew each cluster’s parameters, we could calculate the probability of each point belonging to each cluster.
The Solution: The Expectation-Maximization (EM) Algorithm
The Expectation-Maximization (EM) algorithm is an iterative algorithm designed specifically to solve parameter estimation problems with latent variables.
It elegantly solves the “chicken-and-egg” dilemma by alternating between two steps until convergence.
The EM Algorithm: E-Step (Expectation)
E-Step (Expectation): Based on the current model parameters, compute the posterior probability (also called the “responsibility” \(r_{nk}\)) that each data point \(x_n\) was generated by each Gaussian component \(k\).
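Concretely, with the current parameter estimates:
\[
\large{r_{nk} = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}}
\]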
In simple terms, we make a “soft assignment” for each data point, guessing how likely it is to have come from each cluster.
The EM Algorithm: M-Step (Maximization)
M-Step (Maximization): Based on the “responsibilities” \(r_{nk}\) calculated in the E-step, update the model parameters \(π_k, μ_k, Σ_k\) to maximize the expected log-likelihood.
This is equivalent to performing a weighted MLE estimation for each cluster, where the weights are the responsibilities \(r_{nk}\).
For example, the new mean is a weighted average of all data points: \[
\large{\mu_k^{new} = \frac{\sum_{n=1}^N r_{nk} x_n}{\sum_{n=1}^N r_{nk}}}
\]
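Similarly, the new mixing coefficient is the average responsibility of component \(k\):
\[
\large{\pi_k^{new} = \frac{1}{N} \sum_{n=1}^{N} r_{nk}}
\]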
The EM Algorithm Flow
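To make the alternation concrete, here is a minimal, illustrative EM loop for a one-dimensional GMM. The function and variable names are our own; in practice you would use sklearn.mixture.GaussianMixture, as in the earlier example.

```python
# A minimal EM sketch for a 1-D, K-component GMM (illustrative only).
import numpy as np

def em_gmm_1d(x, K=2, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(x)
    # Initialization
    pi = np.full(K, 1.0 / K)                    # mixing coefficients
    mu = rng.choice(x, size=K, replace=False)   # means (random data points)
    var = np.full(K, np.var(x))                 # variances

    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] for each point and component
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)

        # M-step: weighted MLE updates using the responsibilities as weights
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    return pi, mu, var

# Example: data drawn from two well-separated Gaussians
x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_gmm_1d(x, K=2))
```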
Part 5: The Naive Bayes Classifier
Putting it all Together for a Practical Solution
The Challenge: The Curse of Dimensionality
When our data x has many features, directly estimating the d-dimensional joint probability distribution p(x|y=c) is extremely difficult and requires vast amounts of data. This is known as the “curse of dimensionality”.
The ‘Naive’ Assumption: Conditional Independence of Features
The Naïve Bayes classifier makes a very bold (but often effective) simplifying assumption:
Given the class y, all features \(x_i\) are mutually independent.
This means: once we know a customer is in the ‘Default’ class, their ‘income’ level and their ‘age’ are considered independent pieces of information.
The Power of the Independence Assumption
This “naive” assumption makes calculating the joint probability incredibly simple:
The originally complex joint probability: \[ \large{p(\mathbf{x}|y=c) = p(x_1, x_2, \ldots, x_d | y=c)} \]
Under the independence assumption, it decomposes into the product of the individual class-conditional probabilities for each feature: \[ \large{p(\mathbf{x}|y=c) = \prod_{i=1}^{d} p(x_i | y=c)} \]
Now we only need to estimate a one-dimensional \(p(x_i|y=c)\) for each feature separately, which is much easier than estimating a high-dimensional joint distribution!
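Combining this with Bayes’ theorem (and dropping the constant denominator \(p(\mathbf{x})\)), the Naive Bayes decision rule becomes:
\[
\large{f(\mathbf{x}) = \underset{c}{\arg\max}\ P(y=c) \prod_{i=1}^{d} p(x_i \mid y=c)}
\]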
The Graphical Model of Naive Bayes
This assumption can be represented by a simple graphical model.
Case Study: Iris Flower Classification with Naive Bayes
Let’s apply Naive Bayes to a classic dataset: the Iris dataset.
Goal: Classify an iris flower into one of three species based on four features (sepal length/width, petal length/width).
Model: Gaussian Naive Bayes, which assumes that \(p(x_i|y=c)\) for each feature follows a Gaussian distribution.
Data Loading and Exploration
First, we use scikit-learn to load the data and split it into training and testing sets.
```python
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the data: X contains the features, y contains the labels (species)
X, y = load_iris(return_X_y=True)

# Split the data into a training set (70%) and a testing set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f'Training set size: {X_train.shape}')
print(f'Testing set size: {X_test.shape}')
```
Training set size: (105, 4)
Testing set size: (45, 4)
Model Training and Prediction
Training a Naive Bayes model is very fast. It only needs to calculate the mean and variance for each feature within each class, without complex iterative optimization.
```python
# Initialize the Gaussian Naive Bayes model
gnb = GaussianNB()

# Use the training data to 'learn' the model parameters
# This step calculates the mean and variance for each feature per class
gnb.fit(X_train, y_train)

# Use the trained model to make predictions on the test set
y_pred = gnb.predict(X_test)

# Print some predictions to compare with the true labels
print(f'True labels (first 10): {y_test[:10]}')
print(f'Predicted labels (first 10): {y_pred[:10]}')
```
We use Accuracy to evaluate the model’s performance, which is the proportion of correctly classified samples.
```python
# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy of the Gaussian Naive Bayes classifier: {accuracy:.4f}')
```
Accuracy of the Gaussian Naive Bayes classifier: 0.9778
Despite the “naive” independence assumption being almost certainly false in reality (petal length and width are likely correlated), the classifier’s performance is still excellent!
Why Does Naive Bayes Work So Well?
Lower Bar for Success: It doesn’t need to produce accurate probability estimates, only to ensure that the posterior probability for the correct class is higher than for any other class.
Resists Overfitting: Its simple assumptions lead to a low-complexity model with few parameters, making it more robust than complex models, especially with limited data.
Efficiency: Training and prediction are extremely fast, making it ideal for large datasets and real-time applications.
Part 6: Real-World Issues and Solutions
From Theory to Practice: Final Touches
Another Practical Problem: The Zero-Probability Issue
Consider a text classification task (e.g., spam detection), where the features are discrete words.
We use MLE to estimate the class-conditional probability p(word="offer" | class="spam") by calculating the frequency of “offer” in spam emails from the training set.
Problem: What if the word “stock” never appeared in any spam emails in our training data? Then: \[
\large{p(\text{word}="\text{stock}" \mid y=\text{spam}) = 0}
\]
The Catastrophic Consequence of Zero Probability
If p("stock" | spam) = 0, then for any new email containing the word “stock”, the posterior probability of it being spam will be zero!
A single word unseen in the training set completely rules out a class. This makes the model fragile and unreasonable. It has “seen” too little and is too certain.
The Solution: Laplace Smoothing
Laplace Smoothing, also known as Additive Smoothing, is a simple and effective method for handling the zero-probability problem.
The core idea is: When calculating probabilities, artificially add a small constant λ (usually 1) to the count of every possible event.
This is like giving every possible outcome a “head start,” ensuring that even if an event was never seen in the sample, its estimated probability will not be zero.
The Formula for Laplace Smoothing
For a discrete feature, the unsmoothed MLE estimate is: \[
\large{P(x_i = v \mid y=c) = \frac{N_{cv}}{N_c}}
\]
\(N_{cv}\): Number of samples in class c where the feature takes value v
\(N_c\): Total number of samples in class c
With Laplace smoothing (λ > 0): \[
\large{P_{\lambda}(x_i = v \mid y=c) = \frac{N_{cv} + \lambda}{N_c + \lambda S_i}}
\]
\(S_i\): Total number of possible values for feature \(i\) (e.g., vocabulary size)
\(λ\): The smoothing parameter (when \(λ=1\), it’s called “add-one smoothing”)
The Effect of Smoothing: An Example
Assume a feature has two values {Yes, No}, and we have 10 samples in class c.
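For instance, suppose ‘Yes’ never appears among those 10 samples. The unsmoothed MLE estimate is \(P(\text{Yes} \mid c) = 0/10 = 0\). With add-one smoothing (\(λ = 1\), \(S_i = 2\)):
\[
\large{P_{1}(\text{Yes} \mid c) = \frac{0 + 1}{10 + 1 \times 2} = \frac{1}{12}, \qquad P_{1}(\text{No} \mid c) = \frac{10 + 1}{10 + 2} = \frac{11}{12}}
\]
The zero disappears, and the two probabilities still sum to 1.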
Chapter Summary: The Core Ideas of Bayesian Classifiers
Preview of Next Lecture
Today, we explored a classic Generative Model—the Bayesian classifier.
We learned how to model the class-conditional density p(x|y) and the prior P(y).
Next, we will study another major class of models: Discriminative Models, such as Logistic Regression. These models bypass modeling p(x|y) and model the posterior probability P(y|x) directly.