Pro: Completely general; it makes no assumptions and is exact.
Con: Doesn’t simplify anything. The last term \(P(x_d|x_1, \dots, x_{d-1})\) has as many parameters as the original joint probability, still facing the curse of dimensionality.
Method 3: The PGM Middle Ground
PGMs strike a balance between the two extremes: simpler than the full chain rule, yet far more expressive than assuming complete independence.
The Essence of PGM: A ‘Smarter’ Chain Rule
This is the secret to how probabilistic graphical models reduce computational complexity.
Bayesian Networks (BNs), also known as Probabilistic Directed Graphical Models (PDGMs), use a Directed Acyclic Graph (DAG) to encode dependencies among variables.
It consists of two parts:
Graph Structure (Qualitative): A DAG that describes the conditional independence relationships among variables.
Parameters (Quantitative): A set of local Conditional Probability Distributions (CPDs) that quantify the strength of these dependencies.
Core Structure: Directed Acyclic Graph (DAG)
Directed: Edges have arrows, indicating the direction of influence, which can often be interpreted as a ‘causal’ relationship. For example, \(A \rightarrow B\) means A is a ‘cause’ or ‘parent’ of B.
Acyclic: Starting from any node and following the direction of the arrows, you can never return to the starting point. This ensures the model avoids logical paradoxes like ‘a variable causing itself’.
DAG Example: Legal vs. Illegal
Case Introduction: A Simplified Causal Model
Let’s use a graph structure to build a simplified model of some process.
Interpreting the Graph’s Language: Family Relations
In a directed graph, we use family relationships to describe the connections between nodes:
Parents: The parents of a node are the set of all nodes that point directly to it.
parents(X3) is {X1, X2}.
parents(X6) is {X4, X5}.
Children: The children of a node are the set of all nodes that it points directly to.
children(X1) is {X3, X5}.
A node with no parents (like X1, X2) represents a ‘root cause’ or an ‘exogenous variable’ in the system.
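To make this concrete, here is a minimal sketch of the example DAG in networkx. The edges follow what the text specifies (parents(X3), parents(X6), children(X1)); treating X4 as a root node is our assumption, since the text leaves its parents open.

```python
# A minimal sketch of the example DAG in networkx.
# Edges from the text: X1 -> X3, X2 -> X3, X1 -> X5, X4 -> X6, X5 -> X6.
# (Assumption: X4 is a root node; the text does not specify its parents.)
import networkx as nx

dag = nx.DiGraph([('X1', 'X3'), ('X2', 'X3'), ('X1', 'X5'),
                  ('X4', 'X6'), ('X5', 'X6')])

print(nx.is_directed_acyclic_graph(dag))  # True: following arrows never returns to the start
print(sorted(dag.predecessors('X3')))     # parents(X3) = ['X1', 'X2']
print(sorted(dag.successors('X1')))       # children(X1) = ['X3', 'X5']
```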
The Cornerstone of Bayesian Networks: The Local Markov Property
The structure of the DAG directly defines a core conditional independence assumption: The Local Markov Property.
A node is conditionally independent of its non-descendants, given its parents.
Case Analysis: Reading Conditional Independence from the Graph
Factorization of Joint Probability: The Magic of BNs
Based on the local Markov property, any joint probability distribution consistent with the DAG factorizes into the product of each node's conditional probability given its parents:

\[P(x_1, \dots, x_d) = \prod_{i=1}^{d} P(x_i \mid \mathrm{parents}(x_i)) \tag{1}\]

This formula is the heart of Bayesian Networks. It decomposes a vast, complex joint probability distribution into a series of small, manageable local probability models (also known as Conditional Probability Distributions, CPDs).
Case Calculation: Factorizing the Joint Probability of the Graph
Comparison: If we used the standard chain rule, the last term would be \(P(X_6|X_1, X_2, X_3, X_4, X_5)\), requiring a huge parameter space.
Advantage: This factorization significantly reduces the number of parameters we need to estimate, making model learning feasible.
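As a back-of-the-envelope check, here is a sketch that counts free parameters for the example graph, assuming every variable is binary and keeping the hypothetical structure from the networkx sketch above (X4 as a root).

```python
# Count free parameters: full joint vs. BN factorization (all variables binary).
# Structure as in the earlier sketch; X4 being a root is an assumption.
parents = {'X1': [], 'X2': [], 'X3': ['X1', 'X2'],
           'X4': [], 'X5': ['X1'], 'X6': ['X4', 'X5']}

# Full joint over d binary variables: 2^d - 1 free parameters.
full_joint = 2 ** len(parents) - 1

# Factorized: each node's CPT needs (2 - 1) * 2^(number of parents) free parameters.
factorized = sum(2 ** len(p) for p in parents.values())

print(full_joint, factorized)  # 63 vs. 13
```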
Parameterization: Populating the Model with CPTs
Now that we have the structure, we need to provide specific parameters for each local probability model. For discrete variables, this is typically done using Conditional Probability Tables (CPTs).
A CPT lists the probability of a node taking on different values for every possible combination of its parents’ values.
For a root node (like \(X_1\)), the CPT is a simple prior probability distribution \(P(X_1)\).
For a node with parents (like \(X_3\)), the CPT is \(P(X_3 | X_1, X_2)\).
9.2.2 D-Separation
Why Do We Need a General Rule?
The local Markov property tells us about the independence of a node from its ‘non-descendants’. But we often need to answer more general questions:
Are any two arbitrary sets of nodes A and B independent, given a third set E?
D-Separation (Directed Separation) is a complete, general set of rules that allows us to ‘read’ any conditional independence directly from the graph structure.
The Core Idea of D-Separation
The core idea of D-separation is to check if all paths between two sets of nodes are ‘blocked’.
If all paths from A to B are blocked by the set of observed variables E, then A and B are conditionally independent given E.
Conversely, if even one path is unblocked, the graph does not imply that A and B are conditionally independent.
The Three Basic Structures of D-Separation
Any path in a complex DAG can be broken down into three fundamental connection structures. Understanding these three is the key to mastering D-separation.
Structure 1: Tail-to-Tail (Common Cause)
b <- a -> c
Path: b and c are connected through their common parent a.
Rule: When a is observed (given), the path from b to c is blocked.
Explanation: If we already know it’s hot today, observing high ice cream sales tells us nothing new about the number of heatstroke cases. The weather information ‘screens off’ the correlation. Conversely, if the weather is unknown, high ice cream sales would indeed increase our belief in more heatstroke cases.
Structure 2: Head-to-Tail (Chain)
b -> a -> c
Path: b influences c through the intermediate node a.
Rule: When a is observed (given), the path from b to c is blocked.
Explanation: Once we directly observe the market interest rates, the specific Fed policy that led to this rate is no longer important for predicting real estate prices. The interest rate itself contains all the necessary information.
Structure 3: Head-to-Head (Common Effect / Collider)
b -> a <- c
Path: b and c are both causes of the common effect a.
Rule: The opposite of the first two structures. The path is blocked by default; it becomes unblocked when a (or any of its descendants) is observed.
Explanation: This phenomenon is known as ‘explaining away’. Knowing a professor is talented tells us nothing about their political skills (they are independent beforehand). However, if we know they got tenure and then find out they aren’t very talented, it strongly suggests they must be very politically skilled.
D-Separation Summary: Is the Path Blocked?
A path from a set of nodes A to B is blocked by a set of observed nodes E if there is a node v on the path such that either:
v is a tail-to-tail or head-to-tail node, and v is in the evidence set E.
v is a head-to-head (collider) node, and neither v nor any of its descendants is in the evidence set E.
If all paths from A to B are blocked by E, we say that A and B are D-separated given E, which implies conditional independence: \(A \perp B \mid E\).
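The collider rule is easy to verify numerically. Below is a minimal sketch of the tenure example with hypothetical CPT numbers: talent and political skill are marginally independent, but become dependent once tenure is observed.

```python
# Numerical check of 'explaining away': Talent -> Tenure <- Politics.
# All probabilities are hypothetical, chosen for illustration only.
import itertools

p_talent = {0: 0.7, 1: 0.3}
p_politics = {0: 0.8, 1: 0.2}
# P(Tenure=1 | talent, politics): either skill helps, both help most.
p_tenure1 = {(0, 0): 0.05, (0, 1): 0.50, (1, 0): 0.60, (1, 1): 0.95}

def joint(talent, politics, tenure):
    pt = p_tenure1[(talent, politics)]
    return p_talent[talent] * p_politics[politics] * (pt if tenure == 1 else 1 - pt)

# Marginally (tenure unobserved): independent, the product rule holds exactly.
p_tp = sum(joint(1, 1, t) for t in (0, 1))
print(p_tp, p_talent[1] * p_politics[1])  # 0.06 == 0.06

# Conditioned on Tenure=1: the path is unblocked, so they become dependent.
z = sum(joint(a, b, 1) for a, b in itertools.product((0, 1), repeat=2))
p_t1 = sum(joint(1, b, 1) for b in (0, 1)) / z
p_p1 = sum(joint(a, 1, 1) for a in (0, 1)) / z
p_both = joint(1, 1, 1) / z
print(p_both, p_t1 * p_p1)  # ~0.19 vs. ~0.29: no longer equal -> dependent
```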
9.2.3 Case Study: A More Complete Student Model
Now, let’s apply these concepts to the classic ‘Student’ Bayesian Network. This network models the various factors that influence whether a student gets a good recommendation letter.
Variables:
D (Difficulty): Course difficulty (0=easy, 1=hard)
I (Intelligence): Student’s intelligence (0=low, 1=high)
G (Grade): Student’s grade (0=A, 1=B, 2=C)
S (SAT): SAT score (0=low, 1=high)
L (Letter): Quality of recommendation letter (0=weak, 1=strong)
The Student Model Graph Structure
The structure of this graph encodes our prior beliefs about student performance: Difficulty and Intelligence both influence the Grade (D → G ← I), Intelligence also influences the SAT score (I → S), and the Grade alone determines the Letter (G → L).
The Student Model Factorization
Based on the graph structure and Equation 1, we can immediately write down the factorized form of the joint probability:

\[P(D, I, G, S, L) = P(D)\,P(I)\,P(G \mid D, I)\,P(S \mid I)\,P(L \mid G)\]
With a fully specified Bayesian Network, we can perform Inference. This means that after observing some variables (Evidence), we can compute the posterior probability distribution of other unobserved variables.
The Core Problem: Compute \(P(\text{Query Variables} | \text{Evidence Variables})\)
Example Inference Questions
A student received a strong letter (L=1). What’s the probability they are highly intelligent?
\(P(I=1 | L=1) = ?\)
A highly intelligent student (I=1) is in a hard course (D=1). What is their probability of getting an A?
\(P(G=A | I=1, D=1) = ?\) (This can be read directly from the CPT: 0.50.)
A student has a high SAT score (S=1) but a poor grade (G=C). What’s the probability the course was hard?
\(P(D=1 | S=1, G=C) = ?\)
Hands-on: Building and Querying the Student Model with pgmpy
Enough theory; let’s make this real with code. We will use pgmpy, a Python library for probabilistic graphical models, to implement the student model.
Steps:
Define Model Structure (add nodes and edges)
Define CPTs
Associate CPTs with the structure to form a complete model
Create an inference engine
Perform queries
pgmpy Hands-on: Building and Validating the Model
First, we’ll use a complete code block to build the student model.
```python
# Import the necessary libraries
from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.factors.discrete import TabularCPD

# 1. Define the model structure (edges)
student_model = DiscreteBayesianNetwork([('D', 'G'), ('I', 'G'), ('I', 'S'), ('G', 'L')])

# 2. Define the Conditional Probability Distributions (CPDs)
# Root nodes
cpd_d = TabularCPD('D', 2, [[0.6], [0.4]])
cpd_i = TabularCPD('I', 2, [[0.7], [0.3]])

# Conditional nodes
cpd_s = TabularCPD('S', 2, values=[[0.95, 0.2], [0.05, 0.8]],
                   evidence=['I'], evidence_card=[2])

# Note: pgmpy orders the columns with the first evidence variable changing slowest,
# so with evidence=['I', 'D'] the columns are (I=0,D=0), (I=0,D=1), (I=1,D=0), (I=1,D=1).
# pgmpy assumes variable states start from 0: G=A (g_0), G=B (g_1), G=C (g_2).
cpd_g = TabularCPD('G', 3,
                   values=[[0.30, 0.05, 0.90, 0.50],   # G=A (g_0)
                           [0.40, 0.25, 0.08, 0.30],   # G=B (g_1)
                           [0.30, 0.70, 0.02, 0.20]],  # G=C (g_2)
                   evidence=['I', 'D'], evidence_card=[2, 2])

cpd_l = TabularCPD('L', 2,
                   values=[[0.1, 0.4, 0.99],    # L=0 (weak letter)
                           [0.9, 0.6, 0.01]],   # L=1 (strong letter)
                   evidence=['G'], evidence_card=[3])

# 3. Add the CPDs to the model
student_model.add_cpds(cpd_d, cpd_i, cpd_s, cpd_g, cpd_l)

# 4. Check if the model is valid
assert student_model.check_model()
print('Model built and validated successfully!')
```
Model built and validated successfully!
pgmpy Hands-on: Creating an Inference Engine and Querying
Now for the exciting part! We create an inference engine (here, using Variable Elimination) and ask our question.
Question: Suppose we observe a student with high intelligence (I=1) who received a strong recommendation letter (L=1). What is the probability that the course was hard (D=1)? i.e., compute \(P(D=1 | I=1, L=1)\).
```python
# This is a complete, runnable code block (continues from the previous slide)
from pgmpy.inference import VariableElimination

# (Model construction code was run on the previous slide, not repeated here)

# 5. Create an inference object
inference = VariableElimination(student_model)

# 6. Perform the query
query_result = inference.query(variables=['D'], evidence={'I': 1, 'L': 1})
print(query_result)
```
The query result shows \(P(D=1 | I=1, L=1) \approx 0.33\).
Prior Probability: With no information, the probability of a hard course is \(P(D=1)=0.4\).
Posterior Probability: After learning that this intelligent student got a strong letter, our belief that the course was hard decreased, from 40% to about 33%.
Intuition: Why? An intelligent student can get a good grade (and thus a strong letter) even in a hard course. However, it’s even easier for them to get a good grade in an easy course. So the evidence of a ‘strong letter’ increases our belief in an ‘easy course’. This is the power of Bayesian inference: it quantitatively updates our beliefs in a logically consistent way.
9.2.4 Dynamic Bayesian Networks: When BNs Meet Time
If we ‘unroll’ a Bayesian Network over time, we get a Dynamic Bayesian Network (DBN).
The simplest and most famous DBN is the Hidden Markov Model (HMM).
HMM Intuition: Weather Forecasting from a Cave
Imagine you are trapped in a cave and can’t see the weather outside. But every day, some water seeps into the cave.
Your Observations (Visible): Is the cave ‘dry’, ‘damp’, or ‘dripping’ today?
True State (Hidden): Is the weather outside ‘sunny’, ‘cloudy’, or ‘rainy’?
Your task is to infer the most likely sequence of weather patterns outside based on the sequence of water seepage you observe.
Core Application of HMMs
Modeling time-series data where the true state of the system is unobservable (hidden), and we can only infer it through some visible observations.
Application in Economics:
True State (Hidden): The economy is in an ‘Expansion’ or ‘Recession’ phase.
Observations (Visible): Quarterly GDP growth rates, unemployment data.
The Two Core Assumptions of HMMs
HMMs are built on two key simplifying assumptions, which make the model tractable.
The Markov Assumption (State Transition)
The Observation Independence Assumption (Emission Probability)
Assumption 1: The Markov Assumption (State Transition)
The current true state \(z_t\) depends only on the true state at the previous time step, \(z_{t-1}\).
Economic Interpretation: Whether the economy is in recession or expansion primarily depends on whether it was in recession or expansion last quarter, not on its history long before that (in a first-order model).
Assumption 2: The Observation Independence Assumption
The current observation \(x_t\) depends only on the current true state \(z_t\).
Economic Interpretation: This quarter’s GDP growth rate depends only on whether the economy is currently in recession or expansion, and not on previous states or observations.
The Graphical Structure of an HMM
These two assumptions together define a very simple chain-like graphical structure.
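Written out, these two assumptions give the joint distribution over hidden states \(z_{1:T}\) and observations \(x_{1:T}\) its characteristic chain factorization:

\[P(x_{1:T}, z_{1:T}) = P(z_1)\,P(x_1 \mid z_1)\prod_{t=2}^{T} P(z_t \mid z_{t-1})\,P(x_t \mid z_t)\]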
The Three Elements of an HMM
To fully define an HMM, we need three sets of parameters, usually denoted as \(\lambda = (A, B, \pi)\):
State transition matrix \(A\): \(a_{ij} = P(z_t = j \mid z_{t-1} = i)\), the probability of moving from hidden state \(i\) to state \(j\).
Emission matrix \(B\): \(b_{jk} = P(x_t = k \mid z_t = j)\), the probability of producing observation \(k\) from hidden state \(j\).
Initial state distribution \(\pi\): \(\pi_i = P(z_1 = i)\), the probability of starting in state \(i\).
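To see the three elements in action, here is a minimal forward-algorithm sketch for the cave example. All parameter values are hypothetical; the forward pass computes the likelihood of an observation sequence under \(\lambda = (A, B, \pi)\).

```python
# Forward algorithm for the cave HMM; all parameter values are hypothetical.
# Hidden states: 0=sunny, 1=cloudy, 2=rainy. Observations: 0=dry, 1=damp, 2=dripping.
import numpy as np

pi = np.array([0.5, 0.3, 0.2])        # initial state distribution
A = np.array([[0.7, 0.2, 0.1],        # A[i, j] = P(z_t = j | z_{t-1} = i)
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
B = np.array([[0.8, 0.15, 0.05],      # B[i, k] = P(x_t = k | z_t = i)
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.6]])

obs = [0, 1, 2]  # we observed: dry, then damp, then dripping

# alpha[i] = P(x_1..x_t, z_t = i), updated one time step at a time
alpha = pi * B[:, obs[0]]
for x in obs[1:]:
    alpha = (alpha @ A) * B[:, x]

print("P(observations) =", alpha.sum())
print("P(z_T | observations) =", alpha / alpha.sum())  # filtered state belief
```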
From ‘Causal’ to ‘Correlational’: Why We Need Undirected Graphs
The directed edges in Bayesian Networks are excellent for representing causal relationships (\(A \rightarrow B\)). However, in many scenarios, the relationships between variables are symmetric and mutual, with no clear causal direction.
Social Networks: If I am your friend, you are my friend.
Image Pixels: The color of a pixel is highly correlated with the colors of its neighbors, but none is the ‘cause’ of the others.
Spatial Economics: Housing prices in one region mutually influence prices in neighboring regions.
What is a Markov Random Field (MRF)?
For these types of problems, Markov Random Fields (MRFs), also known as Probabilistic Undirected Graphical Models (PUGMs), are a more natural choice.
An MRF is defined by two parts:
An undirected graph \(G=(\mathcal{V}, \mathcal{E})\), where nodes represent random variables and edges represent direct dependencies.
A set of Potential Functions \(\psi_C(\mathbf{x}_C)\) defined on the ‘cliques’ of the graph, used to quantify the ‘compatibility’ between variables.
Conditional Independence in Undirected Graphs: Much Simpler!
Unlike the complex rules of D-separation, conditional independence in undirected graphs is very intuitive:
If all paths from a set of nodes A to a set of nodes B must pass through a set of nodes C, then A and B are conditionally independent given C.
In other words, C separates A and B.
Case: Reading Independence from an Undirected Graph
MRF Conditional Independence: Three Equivalent Markov Properties
Global Markov Property:
\(A \perp B | C\) if C separates A and B. (As just discussed)
Local Markov Property:
A node is conditionally independent of all other nodes given its neighbors.
The neighbors (or Markov Blanket) are the set of nodes directly connected to it by an edge.
Pairwise Markov Property:
Any two non-adjacent nodes are conditionally independent given all other nodes.
These three properties are equivalent.
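Since separation is a purely graph-theoretic test, it is easy to automate. Below is a minimal sketch using networkx; for concreteness it uses the undirected graph whose factorization appears in the factor-graph section later (\(\psi_1(A,B,E)\,\psi_2(B,C)\,\psi_3(E,D)\)): remove the conditioning set and check whether the two sets are still connected.

```python
# Global Markov property as a graph check: does C separate A and B?
import networkx as nx

# Undirected graph matching the factorization psi1(A,B,E) psi2(B,C) psi3(E,D)
G = nx.Graph([('A', 'B'), ('B', 'E'), ('A', 'E'), ('B', 'C'), ('E', 'D')])

def separates(graph, set_a, set_b, set_c):
    """True if every path between set_a and set_b passes through set_c."""
    H = graph.copy()
    H.remove_nodes_from(set_c)
    return not any(nx.has_path(H, a, b)
                   for a in set_a for b in set_b
                   if a in H and b in H)

print(separates(G, {'C'}, {'D'}, {'B', 'E'}))  # True: C and D are separated by {B, E}
print(separates(G, {'A'}, {'D'}, {'B'}))       # False: the path A-E-D avoids B
```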
Factorization of Joint Probability: Cliques and Potential Functions
How does the joint probability of an undirected graph factorize? The answer is through Cliques.
Clique: A subset of nodes in a graph where every two distinct nodes in the subset are adjacent. They are the ‘fully connected’ subgraphs.
Maximal Clique: A clique that cannot be extended by adding any other adjacent vertex.
The joint probability distribution of an MRF can be factorized into a product of potential functions defined on its maximal cliques.
Case: Identifying Maximal Cliques
Potential Functions
For each maximal clique \(C\) in the graph, we define a potential function \(\psi_C(\mathbf{x}_C)\).
This function takes a specific configuration of values \(\mathbf{x}_C\) for the variables in that clique and outputs a non-negative real number.
Intuitive Meaning: This number represents the ‘compatibility’ or ‘preference’ for the variables within that clique to be in that specific state. A higher value means the configuration is more ‘harmonious’ and thus more likely.
Important: It is not a probability! It’s just a score.
The Hammersley-Clifford Theorem
This important theorem provides the bridge between the joint probability distribution of an MRF and its potential functions:
For a strictly positive probability distribution \(P(\mathbf{x}) > 0\) that satisfies the conditional independence properties of an undirected graph \(G\), the joint probability can be factorized as a product of potential functions \(\psi_C\) over the maximal cliques \(C\) of \(G\):

\[P(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C), \qquad Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C)\]

Here \(Z\) is the normalization constant, known as the partition function. Its Role: to ensure that the probabilities of all possible states sum to 1. Since the product of potential functions is not guaranteed to do so, we divide by the sum of this product over all possible configurations.
The Challenge: Computing \(Z\) is often the hardest part of working with MRFs. It requires summing over all possible configurations of all variables, which is an exponential computation. This is the main bottleneck for inference and learning in MRFs.
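The following sketch makes both the factorization and the cost of \(Z\) concrete: it enumerates the maximal cliques of the same example graph with networkx and computes the partition function by brute force over all binary configurations. The potential function is hypothetical (it simply rewards cliques whose variables agree).

```python
# Brute-force partition function Z for a tiny binary MRF.
import itertools
import networkx as nx

G = nx.Graph([('A', 'B'), ('B', 'E'), ('A', 'E'), ('B', 'C'), ('E', 'D')])
cliques = list(nx.find_cliques(G))  # maximal cliques: {A,B,E}, {B,C}, {D,E}

def psi(values):
    """Hypothetical potential: configurations where all clique members agree score higher."""
    return 2.0 if len(set(values)) == 1 else 1.0

nodes = sorted(G.nodes())
Z = 0.0
for assignment in itertools.product([0, 1], repeat=len(nodes)):
    x = dict(zip(nodes, assignment))
    score = 1.0
    for C in cliques:
        score *= psi([x[v] for v in C])
    Z += score

print("Z =", Z)  # the sum ranges over 2^5 = 32 configurations; for n variables, 2^n
```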
BN vs. MRF Summary
| Feature | Bayesian Network (Directed) | Markov Network (Undirected) |
|---|---|---|
| Graph Structure | Directed Acyclic Graph (DAG) | Undirected Graph |
| Core Idea | Factorization of conditional probabilities | Factorization of potential/energy functions |
| Parameterization | Conditional Probability Distributions (CPDs) | Potential functions on maximal cliques |
| Normalization | Automatically satisfied (local normalization) | Requires a global partition function \(Z\) |
| Use Cases | Causal relationships, generative models | Symmetric relationships, discriminative models |
| Interpretability | Strong, easy to understand | Weaker, potential functions are less intuitive |
9.3.3 Conditional Random Fields (CRF)
Generative vs. Discriminative Models
Generative Models: Model the joint probability \(P(X, Z)\). They learn how the data is ‘generated’.
Examples: HMM, Bayesian Networks.
Capability: Can be used to generate new sample data.
Discriminative Models: Directly model the conditional probability \(P(Z|X)\) that we often care more about. They learn how to ‘discriminate’ between different Z.
Examples: Logistic Regression, CRF.
Capability: Often perform better on classification and prediction tasks.
The Core Advantage of CRFs
Conditional Random Fields (CRFs) are a special type of MRF that is a discriminative model.
Comparison to HMMs: CRFs can relax the strict observation independence assumption of HMMs. In a CRF, the probability of the current state \(z_t\) can depend on features from the entire observation sequence \(X\), making the model much more powerful and flexible.
Linear-Chain CRF
The most common type of CRF is the linear-chain CRF, which is specialized for sequence data.
Its conditional probability \(P(Z|X)\) takes the form:

\[P(Z \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(z_{t-1}, z_t, X, t) \right)\]
\(f_k\) are called feature functions. They can be any function of the states and observations (e.g., ‘the current word is “bank” and the previous word was “national”’).
\(\lambda_k\) are the weights for each feature, which are learned from data.
\(Z(X)\) is the partition function that depends on the observation sequence \(X\).
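Here is a toy sketch of this formula, with two hand-written feature functions and hypothetical weights; a real CRF would learn the \(\lambda_k\) from data and use many thousands of features.

```python
# Toy linear-chain CRF scoring; feature functions and weights are hypothetical.
import itertools
import math

X = ['national', 'bank']   # observation sequence
states = ['O', 'ORG']      # possible labels for each position

def f1(z_prev, z, X, t):   # fires when 'bank' is labeled ORG
    return 1.0 if X[t] == 'bank' and z == 'ORG' else 0.0

def f2(z_prev, z, X, t):   # fires on the label transition O -> ORG
    return 1.0 if z_prev == 'O' and z == 'ORG' else 0.0

weights = [1.5, 0.8]       # the lambda_k

def log_score(Z):          # sum_t sum_k lambda_k f_k(z_{t-1}, z_t, X, t)
    total = 0.0
    for t in range(len(X)):
        z_prev = Z[t - 1] if t > 0 else 'START'
        total += sum(w * f(z_prev, Z[t], X, t) for w, f in zip(weights, (f1, f2)))
    return total

# Partition function Z(X): normalizes over all possible label sequences for this X.
Zx = sum(math.exp(log_score(Z)) for Z in itertools.product(states, repeat=len(X)))
best = max(itertools.product(states, repeat=len(X)), key=log_score)
print(best, math.exp(log_score(best)) / Zx)  # most probable labeling and its P(Z|X)
```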
Is there a more general representation that can unify both models and provide a basis for universal inference algorithms?
The answer is the Factor Graph.
Factor Graphs: A More Refined Representation
A factor graph is a bipartite graph that explicitly represents how a global function (like a joint probability) is factorized into a product of local functions (factors).
Two Types of Nodes:
Variable Nodes: Usually drawn as circles, representing random variables.
Factor Nodes: Usually drawn as squares, representing a factor in the factorization of the global function.
Edges: An edge only exists between a variable node and a factor node. A variable node is connected to a factor node if and only if that variable is an argument of that factor.
Case: Converting a Bayesian Network to a Factor Graph
Case: Converting a Markov Network to a Factor Graph
Recall the MRF factorization: \(P(\mathbf{x}) \propto \psi_1(A,B,E) \psi_2(B,C) \psi_3(E,D)\)
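A quick sketch of what this conversion produces: the factor graph is just a bipartite graph, which we can build directly in networkx (the factor names follow the \(\psi_1, \psi_2, \psi_3\) in the factorization above).

```python
# The factor graph of psi1(A,B,E) psi2(B,C) psi3(E,D) as a bipartite graph.
import networkx as nx

FG = nx.Graph()
variables = ['A', 'B', 'C', 'D', 'E']                      # circle nodes
factors = {'psi1': ['A', 'B', 'E'],                        # square nodes
           'psi2': ['B', 'C'],
           'psi3': ['E', 'D']}

FG.add_nodes_from(variables, kind='variable')
FG.add_nodes_from(factors, kind='factor')
for f, args in factors.items():
    # An edge exists iff the variable is an argument of the factor.
    FG.add_edges_from((f, v) for v in args)

print(nx.is_bipartite(FG))  # True: edges only connect variables to factors
```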
Advantages of Factor Graphs
Eliminates Ambiguity: The same graph (directed or undirected) can correspond to different factorizations, but a factor graph makes the factorization explicit and unambiguous.
Universal Algorithms: Many important inference algorithms, such as Belief Propagation, are most clearly and generally defined on factor graphs.
Universal Inference Algorithm: Belief Propagation
Belief Propagation (BP), also known as the Sum-Product Algorithm, is a general algorithm for performing efficient, exact inference on graphical models.
Goal: To compute the Marginal Probability \(P(x_i)\) for each individual variable.
Core Idea: The algorithm works by passing ‘messages’ along the edges of the factor graph. This can be seen as nodes ‘telling’ each other their ‘beliefs’ about the states of variables.
Convergence: On graphs without cycles (i.e., trees), the algorithm converges after a finite number of message passes and yields exact marginal probabilities. On graphs with cycles (Loopy BP), the algorithm is not guaranteed to converge or give exact solutions, but it often works very well in practice.
Intuition of the Sum-Product Algorithm: Applying the Distributive Law
Why is this algorithm efficient? Because it cleverly uses the distributive law to avoid redundant computations.
Consider calculating the marginal probability \(P(x_1) = \sum_{x_2, x_3, x_4} P(x_1,x_2,x_3,x_4)\). Assume \(P(\mathbf{x}) \propto f_1(x_1,x_2) f_2(x_2,x_3) f_3(x_3,x_4)\).
Brute-force calculation: For each value of \(x_1\), iterate through all combinations of \(x_2, x_3, x_4\); with \(K\) values per variable, that is on the order of \(K^3\) terms per value of \(x_1\).
Using the distributive law, we push the summations as far inside as possible:

\[P(x_1) \propto \sum_{x_2} f_1(x_1, x_2) \sum_{x_3} f_2(x_2, x_3) \sum_{x_4} f_3(x_3, x_4)\]

We sum out the innermost variables first and pass the result outward as a ‘message’. This is the mathematical essence of the sum-product message-passing algorithm.
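The saving is easy to demonstrate numerically. Below, a sketch with three random factors over binary variables compares the brute-force sum with the message-passing order; both give the same (unnormalized) marginal, but the second touches far fewer terms.

```python
# Distributive-law speedup on the chain f1(x1,x2) f2(x2,x3) f3(x3,x4).
import numpy as np

rng = np.random.default_rng(0)
f1, f2, f3 = rng.random((2, 2)), rng.random((2, 2)), rng.random((2, 2))

# Brute force: for each x1, sum the full product over (x2, x3, x4).
brute = np.zeros(2)
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            for x4 in range(2):
                brute[x1] += f1[x1, x2] * f2[x2, x3] * f3[x3, x4]

# Message passing: sum out the innermost variable first, pass the result outward.
m3 = f3.sum(axis=1)  # message from f3 to x3: sum over x4
m2 = f2 @ m3         # message from f2 to x2: sum over x3
msg = f1 @ m2        # message from f1 to x1: sum over x2

print(np.allclose(brute, msg))  # True: same answer
print(brute / brute.sum())      # the normalized marginal P(x1)
```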
Course Summary
The curse of dimensionality motivates us to find simplified ways to represent joint probability distributions.
Probabilistic Graphical Models use conditional independence to decompose complex distributions into a product of local functions, using a graph to represent the independence assumptions.
Bayesian Networks (Directed) use conditional probabilities, are suitable for modeling causal relationships, and their joint probability is given by a simplified form of the chain rule.
Markov Networks (Undirected) use potential functions, are suitable for modeling symmetric relationships, and their joint probability is given by a product of potentials on maximal cliques (Hammersley-Clifford theorem).
Factor Graphs provide a unifying, more refined representation for both models and are the foundation for general inference algorithms.
Belief Propagation (Sum-Product Algorithm) is the core algorithm for efficient inference on graphical models, whose essence is using the distributive law to avoid redundant calculations.
Key Takeaways
Structure is an Assumption: Every node and every edge (or lack thereof) in a graph is an assumption about how the world works.
Decomposition is Key: The goal is to break down a large, intractable problem into many small, manageable ones.
Inference is the Goal: Once a model is built, its value lies in its ability to answer ‘what if’ questions through probabilistic inference.