03 Bayesian Classifiers

Welcome to Chapter 3: Bayesian Classifiers

Figure: Bayesian Classifier Workflow. A three-stage diagram: 1. Data Input (X), 2. Probabilistic Model P(Y|X), 3. Decision Output (Y), connected by Learn and Predict steps.

The Core Question: How Do We Make Optimal Classifications Under Uncertainty?

Imagine you are a credit manager at a bank, facing a critical decision every day:

Should this loan application be approved?

This is fundamentally a Classification problem.

  • Input Data (X): Applicant’s features, such as annual income, credit score, debt level, etc.
  • Decision Classes (Y): ‘Approve’ (predict ‘No Default’) or ‘Reject’ (predict ‘Default’).

Case Study: Credit Approval Decisions

Let’s look at some simplified applicant data.

Applicant ID Annual Income ($10k) Credit Score History of Default Our Decision
001 15 720 No ?
002 6 580 Yes ?
003 9 650 No ?

Objective: Build a model \(f(X)\) that takes applicant information \(X\) as input and outputs the optimal decision \(Y\).

Why Choose the Bayesian Approach?

Bayesian classifiers provide a powerful decision-making framework based on probability theory.

Figure: The Four Advantages of Bayesian Methods. A mind map branching from 'Bayesian Methods' to:

  • Handles uncertainty: outputs probabilities, not absolutes.
  • Incorporates priors: combines data with expert knowledge.
  • Highly interpretable: the decision process is transparent.
  • Model elegance: built on a solid mathematical foundation.

Our Learning Roadmap for This Chapter

We will build a complete Bayesian decision-making system step by step.

Figure: Bayesian Classifier Learning Roadmap. A horizontal roadmap with six steps: 1. Probability Review (the foundation), 2. Bayes Decision Rule (the rule), 3. Parameter Estimation (the "how"), 4. Gaussian Mixture Models (the complexity), 5. Naive Bayes (the application), 6. Real-World Issues (the fine-tuning).

Part 1: Foundations of Probability

The Language of Uncertainty

Core Concept 1: Prior Probability

The Prior Probability \(P(y=c)\) is our inherent belief or the historical frequency of a class c occurring, before observing any new data.

In the credit approval example, the prior probability refers to:

Without looking at any specific information about an applicant, what is the probability that a customer will default, based on the bank’s historical data?

It is the baseline or starting point for our decision.

Prior Probability: A Concrete Example

Suppose the bank has processed 10,000 loan applications in the past, of which 500 ultimately defaulted.

Figure: Prior Probability, Loan Default Example. A donut chart of the 10,000 total applications: P(y = No Default) = 9,500 / 10,000 = 95% and P(y = Default) = 500 / 10,000 = 5%. This is our 'initial belief' or 'base rate' before seeing new data.

Core Concept 2: Probability Density Function (PDF)

For continuous variables (like income, age), we use a Probability Density Function (p(x)) to describe their distribution.

  • The PDF itself is not a probability; p(x) can be greater than 1.
  • The probability that a variable falls into an interval [a, b] is the integral of the PDF over that interval: \(P(a \le x \le b) = \int_a^b p(x) dx\).
  • The total area under the curve is 1.
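These properties are easy to verify numerically. Below is a minimal sketch using SciPy's standard normal distribution (an illustrative choice, not tied to the credit example): the integral over [a, b] matches the CDF difference, the total area is 1, and a narrow density can exceed 1 at its peak.

# Numerical check of the PDF properties above, using the standard normal N(0, 1)
from scipy.stats import norm
from scipy.integrate import quad

a, b = -1.0, 2.0

# Probability of falling in [a, b] is the area under the PDF over that interval
area, _ = quad(norm.pdf, a, b)
print(f'Integral of p(x) over [{a}, {b}]: {area:.4f}')        # ~0.8186
print(f'CDF difference F(b) - F(a):      {norm.cdf(b) - norm.cdf(a):.4f}')

# Total area under the curve is 1
total, _ = quad(norm.pdf, -10, 10)
print(f'Total area under the PDF:        {total:.4f}')        # ~1.0

# The density itself can exceed 1 (e.g., a narrow normal with sigma = 0.1)
print(f'p(0) for N(0, 0.1^2): {norm(loc=0, scale=0.1).pdf(0):.2f}')   # ~3.99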

PDF (Continuous) vs. PMF (Discrete)

An important distinction:

Figure: PDF (Probability Density Function) vs. PMF (Probability Mass Function). Left panel: for continuous variables (e.g., height), probability is the area under the PDF over an interval [a, b]. Right panel: for discrete variables (e.g., a dice roll), probability is the height of the PMF bar at each point k.

Visualization: The Normal (Gaussian) Distribution

The Normal distribution is the most common PDF in finance and business, defined by its mean μ (center) and variance σ² (spread).

Figure: Normal Distribution PDF. Two curves on the same axes: N(μ=0, σ²=1) and N(μ=2, σ²=0.36); the second curve is taller and narrower, reflecting its smaller variance.

Core Concept 3: Class-Conditional Probability

The Class-Conditional Probability \(p(x | y=c)\) answers: “If we already know a sample belongs to class c, what is the probability density of observing data x?”

In our credit approval example:

  • p(income | y = Default): Given a customer is a defaulter, what does their income distribution look like?
  • p(income | y = No Default): Given a customer is a good client, what does their income distribution look like?

This is the bridge connecting our observation (data x) and the unknown state (class y). We call this the Likelihood.

Visualizing Income Distributions for Different Customer Classes

Suppose historical data shows that non-defaulting customers (blue) generally have higher incomes than defaulting customers (red).

Figure: Income Distributions of Different Customer Groups. Two overlapping density curves: p(Income | y=Default) peaks at a lower income, while p(Income | y=No Default) peaks at a higher income. A vertical line at the new applicant's income of $120k marks Likelihood(Default) ≈ 0.18 and Likelihood(No Default) ≈ 0.08.

Core Concept 4: Posterior Probability

The Posterior Probability \(P(y=c | x)\) is the core of Bayesian decision-making. It is our updated belief that a sample belongs to class c after we have observed the data x.

It answers our most pressing question:

“Given that this applicant’s income is $120k, what is the probability that they are a default risk?”

This is the direct basis for our decision!

Bayes’ Theorem: The Magic Formula from Prior to Posterior

Bayes’ Theorem is the mathematical cornerstone that connects these four core concepts.

\[ \large{P(y=c \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid y=c) P(y=c)}{p(\mathbf{x})}} \]

This formula tells us how to update our beliefs using data.

An Intuitive Reading of Bayes’ Theorem

\[ \large{\underbrace{P(y=c \mid \mathbf{x})}_{\text{Posterior}} = \frac{\overbrace{p(\mathbf{x} \mid y=c)}^{\text{Likelihood}} \times \overbrace{P(y=c)}^{\text{Prior}}}{\underbrace{p(\mathbf{x})}_{\text{Evidence}}}} \]

  • Posterior: Our updated belief after seeing the evidence.
  • Likelihood: The probability of seeing this evidence, given the class.
  • Prior: Our initial belief.
  • Evidence: The overall probability of seeing this evidence, used for normalization.

Visualizing the Components of Bayes’ Theorem

Figure: Visual Breakdown of Bayes' Theorem. The Posterior P(y|x) (the final answer) equals the Likelihood p(x|y) (data evidence) times the Prior P(y) (initial belief), divided by the Evidence p(x) (normalization).

Calculating the ‘Evidence’ p(x): The Law of Total Probability

The denominator p(x) is the total probability of observing the data x, regardless of its class. We calculate it using the Law of Total Probability.

Figure: Law of Total Probability. The sample space is partitioned into C₁ (No Default) and C₂ (Default); an event x overlaps both, so p(x) = p(x|C₁)P(C₁) + p(x|C₂)P(C₂).
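Putting the pieces together, here is a minimal sketch of this computation in Python, using the chapter's illustrative numbers: the 5%/95% priors from the bank's history and the likelihoods read off the income figure (0.18 and 0.08 at an income of $120k).

# Bayes' theorem for the credit example, with the evidence computed
# via the law of total probability
prior = {'Default': 0.05, 'No Default': 0.95}
likelihood = {'Default': 0.18, 'No Default': 0.08}   # p(income = $120k | y)

# Evidence: p(x) = p(x|C1)P(C1) + p(x|C2)P(C2)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior: P(c|x) = p(x|c) P(c) / p(x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
print(posterior)   # {'Default': ~0.106, 'No Default': ~0.894}

Notice how the 5% prior pulls the answer down: even though the likelihood favors 'Default' (0.18 vs. 0.08), the posterior probability of default is only about 11%.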

Part 2: Bayesian Decision Theory

From Probabilities to Profits: The Art of Optimal Decision

With Posterior Probabilities, How Do We Decide?

We can now calculate P(y=Default | x) and P(y=No Default | x).

A simple and intuitive rule is: choose the class with the highest posterior probability.

\[ \large{f(\mathbf{x}) = \underset{c}{\arg\max} \ P(y=c \mid \mathbf{x})} \] This is known as the Maximum a Posteriori (MAP) decision rule.

The Hidden Assumption of the MAP Rule

The MAP rule implies a very strong assumption: the cost of all types of errors is the same.

In the real world, this assumption is often false.

The Costs of Misclassification Are Different: Introducing the Loss Function

In credit approval, there are two types of errors:

  1. False Negative: Misclassifying a defaulting customer as ‘No Default’ (approving a bad loan).
    • Consequence: The bank loses the principal, a huge cost.
  2. False Positive: Misclassifying a non-defaulting customer as ‘Default’ (rejecting a good customer).
    • Consequence: The bank loses potential interest income, a smaller cost.

Quantifying Costs: The Loss Matrix L(i, j)

We use a Loss Matrix L(i, j) to quantify the cost of classifying a sample whose true class is j as class i.

Figure: Loss Matrix for Credit Approval (the cost of each decision given the applicant's true class).

                     Reality: No Default        Reality: Default
Decision: Approve    -$20k (gain interest)      $500k (loss of principal)
Decision: Reject     $1k (opportunity cost)     $0 (loss avoided)

Expected Loss: Measuring the Average Cost of a Decision

For a given observation x, if we classify it as class i, the Expected Loss or Conditional Risk \(R(i|x)\) is the weighted average of losses for all possible true classes j, where the weights are the posterior probabilities.

\[ \large{R(i \mid \mathbf{x}) = \sum_{j=1}^{C} L(i, j) P(y=j \mid \mathbf{x})} \]

This formula calculates: “If I make decision i, what is the average loss I will incur?”

Calculating Expected Loss: A Step-by-Step Example

Suppose for a certain applicant x, we have calculated the posterior probabilities:

  • P(y=No Default | x) = 0.8
  • P(y=Default | x) = 0.2

And recall our loss matrix (simplified to numbers):

  • L(Approve, No Default) = -20 (gain 20k)
  • L(Approve, Default) = 500 (lose 500k)
  • L(Reject, No Default) = 1 (opportunity cost 1k)
  • L(Reject, Default) = 0

Calculating R(Approve | x)

If we choose to ‘Approve’ this loan:

R(Approve | x) = L(Approve, No Default) * P(No Default|x) + L(Approve, Default) * P(Default|x)

\[ \large{R(\text{Approve} \mid \mathbf{x}) = (-20) \times 0.8 + (500) \times 0.2 = -16 + 100 = 84} \] The expected loss for choosing ‘Approve’ is $84k.

Calculating R(Reject | x)

If we choose to ‘Reject’ this loan:

R(Reject | x) = L(Reject, No Default) * P(No Default|x) + L(Reject, Default) * P(Default|x)

\[ \large{R(\text{Reject} \mid \mathbf{x}) = (1) \times 0.8 + (0) \times 0.2 = 0.8} \] The expected loss for choosing ‘Reject’ is $0.8k.

The Bayes Decision Rule: Minimize Expected Loss

The optimal decision \(f^*(\mathbf{x})\) is to choose, for each \(\mathbf{x}\), the class \(i\) that minimizes the expected loss \(R(i \mid \mathbf{x})\).

\[ \large{f^*(\mathbf{x}) = \underset{i}{\arg\min} \ R(i \mid \mathbf{x})} \] In our example: R(Approve|x) = 84 vs. R(Reject|x) = 0.8. Since 0.8 < 84, the optimal decision is to ‘Reject’.

Even though the applicant has an 80% chance of being a good customer! The 20% risk of default, combined with the high cost, makes rejection the more rational choice.
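The same calculation can be written as a short sketch; it reproduces R(Approve|x) = 84 and R(Reject|x) = 0.8 and picks the action with the smaller expected loss.

# Bayes decision rule: choose the action with minimum expected loss
# Loss matrix loss[decision][true class], in units of $1k (from the slides above)
loss = {
    'Approve': {'No Default': -20, 'Default': 500},
    'Reject':  {'No Default': 1,   'Default': 0},
}
posterior = {'No Default': 0.8, 'Default': 0.2}

# Conditional risk R(i|x) = sum_j L(i, j) P(y=j | x)
risk = {d: sum(loss[d][c] * posterior[c] for c in posterior) for d in loss}
decision = min(risk, key=risk.get)

print(risk)       # {'Approve': 84.0, 'Reject': 0.8}
print(decision)   # Reject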

A Special Case: The 0-1 Loss Function

If the cost of all misclassifications is equal (e.g., 1) and the cost of correct classification is 0, this is the 0-1 loss function.

\[ \large{L(i, j) = \begin{cases} 0, & \text{if } i = j \\ 1, & \text{if } i \neq j \end{cases}} \]

Decision Rule Under 0-1 Loss

Under 0-1 loss, the expected loss R(i|x) simplifies to: \[ \large{R(i \mid \mathbf{x}) = \sum_{j=1}^{C} L(i, j) P(y=j \mid \mathbf{x}) = \sum_{j \neq i} P(y=j \mid \mathbf{x})} \] \[ \large{= 1 - P(y=i \mid \mathbf{x})} \] Minimizing R(i|x) is equivalent to minimizing 1 - P(y=i|x), which means maximizing the posterior probability P(y=i|x).

Therefore, the MAP decision rule is a special case of minimizing expected loss under a 0-1 loss function.

Visualizing the Decision Boundary

The decision rule divides the feature space into regions, each corresponding to a decision class. The border between these regions is called the decision boundary.

For the MAP rule, the decision boundary is where the posterior probabilities are equal, e.g., P(y=1|x) = P(y=2|x).

Figure: Decision Boundary Illustration. Two curves, p(x | y=1)P(y=1) and p(x | y=2)P(y=2); their intersection marks the decision boundary, with Class 1 decided on one side and Class 2 on the other.
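For a one-dimensional case like the figure, the boundary can also be located numerically as the point where the two weighted densities cross. The class means, variances, and priors below are made-up illustrative values, not taken from the credit example.

# Locate the MAP decision boundary between two 1-D Gaussian classes:
# decide Class 1 where p(x|y=1)P(y=1) > p(x|y=2)P(y=2), and Class 2 otherwise
from scipy.stats import norm
from scipy.optimize import brentq

mu1, sigma1, prior1 = 0.0, 1.0, 0.5
mu2, sigma2, prior2 = 3.0, 1.0, 0.5

def score_diff(x):
    # Difference of the two weighted class densities at x
    return norm.pdf(x, mu1, sigma1) * prior1 - norm.pdf(x, mu2, sigma2) * prior2

# The crossing point lies between the two class means
boundary = brentq(score_diff, mu1, mu2)
print(f'Decision boundary at x = {boundary:.2f}')   # 1.50 for these symmetric settings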

Part 3: Parameter Estimation

Learning from Data: The Estimation Challenge

The Real-World Challenge: We Don’t Know the True Probability Distributions

So far, we have assumed that P(y=c) and p(x|y=c) are known.

In reality, we don’t know them. All we have is a set of historical data (the training set).

Core Task: How can we estimate the parameters of these probability distributions from the data?

The Parameter Estimation Approach

  1. Choose a Model: We first assume the data follows a specific form of probability distribution, such as a Gaussian distribution \(\mathcal{N}(\mu, \sigma^2)\).
  2. Estimate Parameters: Our task then becomes estimating the model’s unknown parameters, \(\theta = (\mu, \sigma^2)\), from the data.

Figure: Parameter Estimation Flow. 1. Training Data feeds into 2. a Parameter Estimator (θ̂ = ?), which outputs 3. a Probability Model p(x | θ̂).

Method 1: Maximum Likelihood Estimation (MLE)

The core idea of Maximum Likelihood Estimation (MLE) is:

Choose the set of parameters \(\theta\) that maximizes the joint probability of observing the data we have, \(X = \{x_1, \ldots, x_N\}\).

In other words, which set of parameters provides the “best explanation” for the data we see?

The Likelihood Function L(θ|X)

We define the likelihood function L(θ|X), which represents the probability of observing data X given parameters θ.

Assuming the samples are independent and identically distributed (i.i.d.), the joint probability is the product of individual sample probabilities: \[ \large{L(\theta \mid X) = p(X \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)} \]

Our goal is to find the θ* that maximizes L(θ|X). \[ \large{\theta^*_{MLE} = \underset{\theta}{\arg\max} \ L(\theta \mid X)} \]

MLE Trick: Use the Log-Likelihood Function

Products of many probability terms are awkward to differentiate and numerically unstable, so we typically maximize the log-likelihood function LL(θ|X) instead.

Since the logarithm is a monotonically increasing function, maximizing L is the same as maximizing log(L). \[ \large{LL(\theta \mid X) = \log p(X \mid \theta) = \sum_{n=1}^{N} \log p(x_n \mid \theta)} \] This turns the product into a sum, greatly simplifying the calculation.

MLE Example: Estimating the Mean of a Normal Distribution

Suppose our data \(x_1, ..., x_N\) comes from a Normal distribution with a known variance \(σ²\) but an unknown mean \(μ\).

The log-likelihood function is: \[ \large{LL(\mu) = \sum_{n=1}^{N} \log \left( \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_n - \mu)^2}{2\sigma^2}} \right)} \]

By taking the derivative with respect to \(μ\), setting it to 0, and solving, we find: \[ \large{\hat{\mu}_{MLE} = \frac{1}{N} \sum_{n=1}^{N} x_n} \]

The result is very intuitive: the sample mean is the maximum likelihood estimate of the population mean.
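As a quick sanity check, the sketch below simulates Gaussian data with a known σ and confirms that no nearby value of μ beats the sample mean in log-likelihood (the true μ = 5 and σ = 2 used here are arbitrary illustrative choices).

# Numerical check: the sample mean maximizes the Gaussian log-likelihood
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
sigma = 2.0
data = rng.normal(loc=5.0, scale=sigma, size=200)

def log_likelihood(mu):
    # LL(mu) = sum_n log p(x_n | mu, sigma^2)
    return np.sum(norm.logpdf(data, loc=mu, scale=sigma))

mu_hat = data.mean()
candidates = np.linspace(mu_hat - 1, mu_hat + 1, 201)
best = candidates[np.argmax([log_likelihood(m) for m in candidates])]

print(f'Sample mean (MLE):     {mu_hat:.3f}')
print(f'Grid-search maximizer: {best:.3f}')   # matches the sample mean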

The Limitation of MLE: Overfitting

MLE completely “trusts” the data. If the dataset is small or biased, the MLE result can be poor.

Classic example: Tossing a coin.

  • You flip a coin 3 times and get Heads, Heads, Heads.
  • MLE will estimate the probability of Heads as \(p(H) = 3/3 = 1\).
  • This implies you will never predict Tails. This is obviously unreasonable.

We need a way to incorporate our prior knowledge about the world (e.g., that coins are usually fair).

Method 2: Maximum a Posteriori (MAP) Estimation

MAP (Maximum a Posteriori) estimation builds on MLE by introducing a prior distribution p(θ) over the parameters θ themselves.

It no longer maximizes the likelihood p(X|θ), but rather the posterior probability of the parameters p(θ|X). \[ \large{p(\theta \mid X) \propto p(X \mid \theta) p(\theta)} \] The goal of MAP is: \[ \large{\theta^*_{MAP} = \underset{\theta}{\arg\max} \ [p(X \mid \theta) p(\theta)]} \]

The Log-Form of MAP

Again, using logarithms simplifies the calculation: \[ \large{\theta^*_{MAP} = \underset{\theta}{\arg\max} \ [\log p(X \mid \theta) + \log p(\theta)]} \] \[ \large{= \underset{\theta}{\arg\max} \left[ \sum_{n=1}^{N} \log p(x_n \mid \theta) + \log p(\theta) \right]} \]

MAP: A Balance Between Data Evidence and Prior Belief

\[ \large{\underbrace{\log p(\theta|X)}_{\text{Posterior}} \propto \underbrace{\sum \log p(x_n|\theta)}_{\text{Data Likelihood}} + \underbrace{\log p(\theta)}_{\text{Prior Belief}}} \]

Figure: MAP Estimation as a Balance. A balance scale with the data likelihood log p(X|θ) on one pan and the prior belief log p(θ) on the other; the pivot represents the final MAP estimate.

The final MAP estimate is a balance between the evidence from the data and our initial beliefs.
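Returning to the coin example, here is a minimal sketch of MAP estimation with a Beta(α, β) prior on the heads probability; the prior strength α = β = 2, encoding a mild belief that the coin is roughly fair, is an illustrative assumption. The posterior is Beta(α + heads, β + tails), and its mode gives the MAP estimate.

# MLE vs. MAP for the coin-flip example: 3 flips, all heads
heads, tails = 3, 0
alpha, beta = 2, 2   # illustrative Beta prior, slightly favoring a fair coin

p_mle = heads / (heads + tails)
# Mode of the Beta(alpha + heads, beta + tails) posterior
p_map = (heads + alpha - 1) / (heads + tails + alpha + beta - 2)

print(f'MLE estimate of p(Heads): {p_mle:.2f}')   # 1.00, never predicts Tails
print(f'MAP estimate of p(Heads): {p_map:.2f}')   # 0.80, pulled toward 0.5 by the prior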

Part 4: Handling Complex Distributions

When Reality is Messy: Mixture Models

The Limitation of a Single Model

So far, we have assumed that data from a single class can be described by one simple distribution (like a single Gaussian). But what if the data distribution is more complex?

Figure: Unimodal vs. Multimodal Data Distribution. Left: a simple, single-peaked distribution is well fit by a single Gaussian. Right: a complex, double-peaked distribution is poorly fit by a single Gaussian.

A Solution for Complexity: Gaussian Mixture Models (GMM)

What if our data distribution is a mix of several “clusters”?

For example, customer spending habits might be divided into ‘high-spending’, ‘medium-spending’, and ‘low-spending’ groups, with each group being approximately normally distributed.

The Gaussian Mixture Model (GMM) is designed precisely for this situation.

GMM Definition: A Weighted Sum of Multiple Gaussians

A GMM models a complex probability distribution as a weighted sum of K Gaussian components.

\[ \large{p(\mathbf{x}) = \sum_{k=1}^{K} \pi_k \mathcal{N}(\mathbf{x} \mid \mu_k, \Sigma_k)} \]

Where:

  • \(K\): The number of mixture components (a hyperparameter).
  • \(π_k\): The mixing coefficient (weight) of the \(k\)-th component, with \(\sum_{k=1}^{K} \pi_k = 1\).
  • \(μ_k, Σ_k\): The mean and covariance matrix of the \(k\)-th Gaussian component.
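The definition translates directly into code. Below is a minimal sketch that evaluates a one-dimensional GMM density as a weighted sum of Gaussian PDFs; the weights, means, and standard deviations are illustrative values.

# Evaluate a 1-D GMM density p(x) = sum_k pi_k * N(x | mu_k, sigma_k^2), with K = 3
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

pis    = np.array([0.5, 0.3, 0.2])    # mixing coefficients, sum to 1
mus    = np.array([0.0, 4.0, 8.0])    # component means
sigmas = np.array([1.0, 0.8, 1.5])    # component standard deviations

def gmm_pdf(x):
    # Weighted sum of the K component densities at x
    return sum(pi * norm.pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas))

print(f'p(x = 3.5) = {gmm_pdf(3.5):.4f}')
total, _ = quad(gmm_pdf, -20, 30)
print(f'Total area under p(x): {total:.4f}')   # ~1, so it is a valid density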

Visualization: Fitting Complex Data with a GMM

Let’s generate some data composed of 3 clusters using Python and fit it with a GMM.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture
from sklearn.datasets import make_blobs

# 1. Generate simulated data
X, y_true = make_blobs(n_samples=400, centers=3, cluster_std=0.9, random_state=42)

# 2. Initialize and fit the GMM model
gmm = GaussianMixture(n_components=3, random_state=42).fit(X)
labels = gmm.predict(X)

# 3. Visualize the results
fig, ax = plt.subplots(figsize=(8, 4.5))
scatter = ax.scatter(X[:, 0], X[:, 1], c=labels, s=40, cmap='viridis', zorder=2)
ax.set_title('Clusters Identified by GMM', fontsize=16)
ax.set_xlabel('Feature 1', fontsize=12)
ax.set_ylabel('Feature 2', fontsize=12)
# legend_elements() returns a (handles, labels) tuple; unpack the handles
handles, _ = scatter.legend_elements()
ax.legend(handles, ['Cluster 0', 'Cluster 1', 'Cluster 2'], title="Clusters")
plt.show()
Figure 1: Fitting data generated from a mixture of three Gaussians using a GMM

The GMM Challenge: A Latent Variable Problem

Estimating the parameters of a GMM (\(π_k, μ_k, Σ_k\)) is much more complex than for a single Gaussian.

This is because we don’t know which Gaussian component each data point \(x_n\) actually belongs to. This “membership” information is a latent variable.

This is a “chicken-and-egg” problem:

  • If we knew which cluster each point belonged to, we could easily estimate each cluster’s parameters.
  • If we knew each cluster’s parameters, we could calculate the probability of each point belonging to each cluster.

The Solution: The Expectation-Maximization (EM) Algorithm

The Expectation-Maximization (EM) algorithm is an iterative algorithm designed specifically to solve parameter estimation problems with latent variables.

It elegantly solves the “chicken-and-egg” dilemma by alternating between two steps until convergence.

The EM Algorithm: E-Step (Expectation)

E-Step (Expectation): Based on the current model parameters, compute the posterior probability (also called the “responsibility” \(r_{nk}\)) that each data point \(x_n\) was generated by each Gaussian component \(k\).

\[ \large{r_{nk} = P(z_n=k \mid x_n; \theta_{old})} \]

In simple terms, we make a “soft assignment” for each data point, guessing how likely it is to have come from each cluster.

The EM Algorithm: M-Step (Maximization)

M-Step (Maximization): Based on the “responsibilities” \(r_{nk}\) calculated in the E-step, update the model parameters \(π_k, μ_k, Σ_k\) to maximize the expected log-likelihood.

This is equivalent to performing a weighted MLE estimation for each cluster, where the weights are the responsibilities \(r_{nk}\).

For example, the new mean is a weighted average of all data points: \[ \large{\mu_k^{new} = \frac{\sum_{n=1}^N r_{nk} x_n}{\sum_{n=1}^N r_{nk}}} \]

The EM Algorithm Flow

Figure: Expectation-Maximization (EM) Algorithm Flow. 1. Initialize parameters (θ₀); then loop over the E-step (calculate responsibilities) and the M-step (update parameters to maximize the expected log-likelihood); iterate until the parameters converge.
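The two steps above can be written in a few lines of NumPy. The sketch below runs EM for a one-dimensional, two-component GMM on simulated data; a fixed iteration count stands in for a proper convergence check, and the simulated cluster locations are arbitrary.

# A bare-bones EM loop for a 1-D GMM with K = 2 components
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(0, 1.0, 300), rng.normal(5, 1.5, 200)])
N, K = len(data), 2

# 1. Initialize parameters
pis = np.full(K, 1 / K)
mus = rng.choice(data, size=K, replace=False)
sigmas = np.full(K, data.std())

for _ in range(50):
    # E-step: responsibilities r[n, k] = P(z_n = k | x_n; current parameters)
    weighted = np.stack([pis[k] * norm.pdf(data, mus[k], sigmas[k]) for k in range(K)], axis=1)
    r = weighted / weighted.sum(axis=1, keepdims=True)

    # M-step: weighted MLE updates, with the responsibilities as weights
    Nk = r.sum(axis=0)
    pis = Nk / N
    mus = (r * data[:, None]).sum(axis=0) / Nk
    sigmas = np.sqrt((r * (data[:, None] - mus) ** 2).sum(axis=0) / Nk)

print('Mixing weights:', np.round(pis, 2))    # roughly [0.6, 0.4]
print('Means:         ', np.round(mus, 2))    # roughly [0, 5] (component order may vary)
print('Std devs:      ', np.round(sigmas, 2))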

Part 5: The Naive Bayes Classifier

Putting it all Together for a Practical Solution

The Challenge: The Curse of Dimensionality

When our data x has many features, directly estimating the d-dimensional joint probability distribution p(x|y=c) is extremely difficult and requires vast amounts of data. This is known as the “curse of dimensionality”.

Figure: Curse of Dimensionality. Ten points cover a 1-D line well; covering a 2-D square at the same density needs 10² = 100 points, and a 3-D cube needs 10³ = 1,000. A fixed amount of data becomes increasingly sparse as dimensionality grows.

The ‘Naive’ Assumption: Conditional Independence of Features

The Naive Bayes classifier makes a very bold (but often effective) simplifying assumption:

Given the class y, all features \(x_i\) are mutually independent.

This means: once we know a customer is in the ‘Default’ class, their ‘income’ level and their ‘age’ are considered independent pieces of information.

The Power of the Independence Assumption

This “naive” assumption makes calculating the joint probability incredibly simple:

The originally complex joint probability: \[ \large{p(\mathbf{x}|y=c) = p(x_1, x_2, \ldots, x_d | y=c)} \]

Under the independence assumption, it decomposes into the product of the individual class-conditional probabilities for each feature: \[ \large{p(\mathbf{x}|y=c) = \prod_{i=1}^{d} p(x_i | y=c)} \]

Now we only need to estimate a one-dimensional \(p(x_i|y=c)\) for each feature separately, which is much easier than estimating a high-dimensional joint distribution!
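In code, the decomposition amounts to adding one-dimensional log-densities. The sketch below scores the two credit classes with hand-picked (illustrative) per-class feature means, standard deviations, and priors; it is not a trained model, just the formula made concrete.

# Naive Bayes scoring: log P(y=c | x) is proportional to
# log P(y=c) + sum_i log p(x_i | y=c), with each p(x_i | y=c) a 1-D Gaussian
import numpy as np
from scipy.stats import norm

classes = {
    'Default':    {'prior': 0.05, 'means': np.array([6.0, 590.0]),  'stds': np.array([2.0, 50.0])},
    'No Default': {'prior': 0.95, 'means': np.array([12.0, 700.0]), 'stds': np.array([3.0, 40.0])},
}
x = np.array([9.0, 650.0])   # features: [annual income in $10k, credit score]

def log_score(c):
    p = classes[c]
    # The 'naive' factorization: a sum of one-dimensional log-densities
    return np.log(p['prior']) + norm.logpdf(x, p['means'], p['stds']).sum()

scores = {c: log_score(c) for c in classes}
print(scores)
print('Predicted class:', max(scores, key=scores.get))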

The Graphical Model of Naive Bayes

This assumption can be represented by a simple graphical model.

Figure: Graphical Model Comparison, Naive Bayes vs. a General Bayesian Network. Left: in Naive Bayes, the class node Y is the sole parent of every feature node X₁, X₂, ..., Xₙ, so the features are conditionally independent given Y. Right: in a general Bayesian network, dependencies may also exist between feature nodes.

Case Study: Iris Flower Classification with Naive Bayes

Let’s apply Naive Bayes to a classic dataset: the Iris dataset.

  • Goal: Classify an iris flower into one of three species based on four features (sepal length/width, petal length/width).
  • Model: Gaussian Naive Bayes, which assumes that \(p(x_i|y=c)\) for each feature follows a Gaussian distribution.

Data Loading and Exploration

First, we use scikit-learn to load the data and split it into training and testing sets.

# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load the data: X contains the features, y contains the labels (species)
X, y = load_iris(return_X_y=True)

# Split the data into a training set (70%) and a testing set (30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(f'Training set size: {X_train.shape}')
print(f'Testing set size: {X_test.shape}')
Training set size: (105, 4)
Testing set size: (45, 4)

Model Training and Prediction

Training a Naive Bayes model is very fast. It only needs to calculate the mean and variance for each feature within each class, without complex iterative optimization.

# Initialize the Gaussian Naive Bayes model
gnb = GaussianNB()

# Use the training data to 'learn' the model parameters
# This step calculates the mean and variance for each feature per class
gnb.fit(X_train, y_train)

# Use the trained model to make predictions on the test set
y_pred = gnb.predict(X_test)

# Print some predictions to compare with the true labels
print(f'True labels (first 10): {y_test[:10]}')
print(f'Predicted labels (first 10): {y_pred[:10]}')
True labels (first 10): [1 0 2 1 1 0 1 2 1 1]
Predicted labels (first 10): [1 0 2 1 1 0 1 2 1 1]

Evaluating Model Performance

We use Accuracy to evaluate the model’s performance, which is the proportion of correctly classified samples.

# Calculate the accuracy score
accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy of the Gaussian Naive Bayes classifier: {accuracy:.4f}')
Accuracy of the Gaussian Naive Bayes classifier: 0.9778

Despite the “naive” independence assumption being almost certainly false in reality (petal length and width are likely correlated), the classifier’s performance is still excellent!

Why Does Naive Bayes Work So Well?

  1. Lower Bar for Success: It doesn’t need to produce accurate probability estimates, only to ensure that the posterior probability for the correct class is higher than for any other class.
  2. Resists Overfitting: Its simple assumptions lead to a low-complexity model with few parameters, making it more robust than complex models, especially with limited data.
  3. Efficiency: Training and prediction are extremely fast, making it ideal for large datasets and real-time applications.

Part 6: Real-World Issues and Solutions

From Theory to Practice: Final Touches

Another Practical Problem: The Zero-Probability Issue

Consider a text classification task (e.g., spam detection), where the features are discrete words.

We use MLE to estimate the class-conditional probability p(word="offer" | class="spam") by calculating the frequency of “offer” in spam emails from the training set.

Problem: What if the word “stock” never appeared in any spam emails in our training data? Then: \[ \large{p(\text{word}="\text{stock}" \mid y=\text{spam}) = 0} \]

The Catastrophic Consequence of Zero Probability

If p("stock" | spam) = 0, then for any new email containing the word “stock”, the posterior probability of it being spam will be zero!

\[ \large{P(\text{spam} \mid \text{email}) \propto \ldots \times \overbrace{p(\text{"stock"} \mid \text{spam})}^{=0} \times \ldots = 0} \]

A single word unseen in the training set completely rules out a class. This makes the model fragile and unreasonable. It has “seen” too little and is too certain.

The Solution: Laplace Smoothing

Laplace Smoothing, also known as Additive Smoothing, is a simple and effective method for handling the zero-probability problem.

The core idea is: When calculating probabilities, artificially add a small constant λ (usually 1) to the count of every possible event.

This is like giving every possible outcome a “head start,” ensuring that even if an event was never seen in the sample, its estimated probability will not be zero.

The Formula for Laplace Smoothing

For a discrete feature, the unsmoothed MLE estimate is: \[ \large{P(x_i = v \mid y=c) = \frac{N_{cv}}{N_c}} \]

  • \(N_{cv}\): the number of samples in class c where feature \(i\) takes the value v
  • \(N_c\): the total number of samples in class c

With Laplace smoothing (λ > 0): \[ \large{P_{\lambda}(x_i = v \mid y=c) = \frac{N_{cv} + \lambda}{N_c + \lambda S_i}} \]

  • \(S_i\): the total number of possible values for feature \(i\) (e.g., the vocabulary size)
  • \(λ\): the smoothing parameter (when \(λ=1\), it’s called “add-one smoothing”)

The Effect of Smoothing: An Example

Assume a feature has two values {Yes, No}, and we have 10 samples in class c.

  • Observations: 10 ‘Yes’, 0 ‘No’.
  • \(S_i = 2\) (two possible values)

MLE Estimate:

  • P('Yes'|c) = 10/10 = 1
  • P('No'|c) = 0/10 = 0 (Dangerous!)

Laplace Smoothing (λ=1):

  • P('Yes'|c) = (10+1)/(10+1*2) = 11/12 ≈ 0.917
  • P('No'|c) = (0+1)/(10+1*2) = 1/12 ≈ 0.083 (Problem solved!)
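A small helper makes the formula concrete and reproduces these numbers.

# Laplace (additive) smoothing for a discrete feature
def smoothed_prob(count_value, count_class, num_values, lam=1.0):
    # P_lambda(x_i = v | y = c) = (N_cv + lambda) / (N_c + lambda * S_i)
    return (count_value + lam) / (count_class + lam * num_values)

# Worked example: 10 'Yes' and 0 'No' observations in class c, S_i = 2, lambda = 1
print(f"P('Yes'|c) = {smoothed_prob(10, 10, 2):.3f}")   # 11/12 ≈ 0.917
print(f"P('No'|c)  = {smoothed_prob(0, 10, 2):.3f}")    # 1/12  ≈ 0.083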

Chapter Summary: The Core Ideas of Bayesian Classifiers

Figure: Summary of Core Ideas in Bayesian Classifiers. Five key concepts, summarized below.

  • Decision framework: Bayes' theorem is the bridge connecting prior, likelihood, and posterior.
  • Optimal decision: minimize expected loss, not just maximize probability.
  • Model learning: estimate model parameters from data using MLE or MAP.
  • Practicality: Naive Bayes provides an efficient tool through simplifying assumptions.
  • Robustness: techniques like smoothing help deal with data sparsity.

Preview of Next Lecture

Today, we explored a classic Generative Model—the Bayesian classifier.

We learned how to model the class-conditional density p(x|y) and the prior P(y).

Next, we will study another major class of models: Discriminative Models, such as Logistic Regression. These models bypass modeling p(x|y) and model the posterior probability P(y|x) directly.

Thank You!

Q & A