Pro: Completely general; it makes no assumptions and is exact.
Con: Doesn’t simplify anything. The last term \(P(x_d|x_1, \dots, x_{d-1})\) has as many parameters as the original joint probability, still facing the curse of dimensionality.
Method 3: The PGM Middle Ground
PGMs strike a balance between the two extremes: simpler than the full chain rule, yet far more expressive than assuming complete independence.
The Essence of PGM: A ‘Smarter’ Chain Rule
This is the secret to how probabilistic graphical models reduce computational complexity.
Bayesian Networks (BNs), also known as Probabilistic Directed Graphical Models (PDGMs), use a Directed Acyclic Graph (DAG) to encode dependencies among variables.
It consists of two parts:
Graph Structure (Qualitative): A DAG that describes the conditional independence relationships among variables.
Parameters (Quantitative): A set of local Conditional Probability Distributions (CPDs) that quantify the strength of these dependencies.
Core Structure: Directed Acyclic Graph (DAG)
Directed: Edges have arrows, indicating the direction of influence, which can often be interpreted as a ‘causal’ relationship. For example, \(A \rightarrow B\) means A is a ‘cause’ or ‘parent’ of B.
Acyclic: Starting from any node and following the direction of the arrows, you can never return to the starting point. This ensures the model avoids logical paradoxes like ‘a variable causing itself’.
DAG Example: Legal vs. Illegal
Case Introduction: A Simplified Causal Model
Let’s use a graph structure to build a simplified model of some process.
Interpreting the Graph’s Language: Family Relations
In a directed graph, we use family relationships to describe the connections between nodes:
Parents: The parents of a node are the set of all nodes that point directly to it.
parents(X3) is {X1, X2}.
parents(X6) is {X4, X5}.
Children: The children of a node are the set of all nodes that it points directly to.
children(X1) is {X3, X5}.
A node with no parents (like X1, X2) represents a ‘root cause’ or an ‘exogenous variable’ in the system.
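To make this concrete, here is a minimal sketch of the example DAG in networkx. The edges follow what the text specifies (parents(X3), parents(X6), children(X1)); treating X4 as a root node is our assumption, since the text leaves its parents open.

```python
# A minimal sketch of the example DAG in networkx.
# Edges from the text: X1 -> X3, X2 -> X3, X1 -> X5, X4 -> X6, X5 -> X6.
# (Assumption: X4 is a root node; the text does not specify its parents.)
import networkx as nx

dag = nx.DiGraph([('X1', 'X3'), ('X2', 'X3'), ('X1', 'X5'),
                  ('X4', 'X6'), ('X5', 'X6')])

print(nx.is_directed_acyclic_graph(dag))  # True: following arrows never returns to the start
print(sorted(dag.predecessors('X3')))     # parents(X3) = ['X1', 'X2']
print(sorted(dag.successors('X1')))       # children(X1) = ['X3', 'X5']
```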
The Cornerstone of Bayesian Networks: The Local Markov Property
The structure of the DAG directly defines a core conditional independence assumption: The Local Markov Property.
A node is conditionally independent of its non-descendants, given its parents.
Case Analysis: Reading Conditional Independence from the Graph
Factorization of Joint Probability: The Magic of BNs
Based on the local Markov property, any joint probability distribution consistent with the DAG factorizes into the product of each node's conditional probability given its parents:

\[P(x_1, \dots, x_d) = \prod_{i=1}^{d} P(x_i \mid \mathrm{parents}(x_i)) \tag{1}\]

This formula is the heart of Bayesian Networks. It decomposes a vast, complex joint probability distribution into a series of small, manageable local probability models (also known as Conditional Probability Distributions, CPDs).
Case Calculation: Factorizing the Joint Probability of the Graph
Comparison: If we used the standard chain rule, the last term would be \(P(X_6|X_1, X_2, X_3, X_4, X_5)\), requiring a huge parameter space.
Advantage: This factorization significantly reduces the number of parameters we need to estimate, making model learning feasible.
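As a back-of-the-envelope check, here is a sketch that counts free parameters for the example graph, assuming every variable is binary and keeping the hypothetical structure from the networkx sketch above (X4 as a root).

```python
# Count free parameters: full joint vs. BN factorization (all variables binary).
# Structure as in the earlier sketch; X4 being a root is an assumption.
parents = {'X1': [], 'X2': [], 'X3': ['X1', 'X2'],
           'X4': [], 'X5': ['X1'], 'X6': ['X4', 'X5']}

# Full joint over d binary variables: 2^d - 1 free parameters.
full_joint = 2 ** len(parents) - 1

# Factorized: each node's CPT needs (2 - 1) * 2^(number of parents) free parameters.
factorized = sum(2 ** len(p) for p in parents.values())

print(full_joint, factorized)  # 63 vs. 13
```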
Parameterization: Populating the Model with CPTs
Now that we have the structure, we need to provide specific parameters for each local probability model. For discrete variables, this is typically done using Conditional Probability Tables (CPTs).
A CPT lists the probability of a node taking on different values for every possible combination of its parents’ values.
For a root node (like \(X_1\)), the CPT is a simple prior probability distribution \(P(X_1)\).
For a node with parents (like \(X_3\)), the CPT is \(P(X_3 | X_1, X_2)\).
9.2.2 D-Separation
Why Do We Need a General Rule?
The local Markov property tells us about the independence of a node from its ‘non-descendants’. But we often need to answer more general questions:
Are any two arbitrary sets of nodes A and B independent, given a third set E?
D-Separation (Directed Separation) is a complete, general set of rules that allows us to ‘read’ any conditional independence directly from the graph structure.
The Core Idea of D-Separation
The core idea of D-separation is to check if all paths between two sets of nodes are ‘blocked’.
If all paths from A to B are blocked by the set of observed variables E, then A and B are conditionally independent given E.
Conversely, if even one path is unblocked, the graph does not imply that A and B are conditionally independent.
The Three Basic Structures of D-Separation
Any path in a complex DAG can be broken down into three fundamental connection structures. Understanding these three is the key to mastering D-separation.
Structure 1: Tail-to-Tail (Common Cause)
b <- a -> c
Path: b and c are connected through their common parent a.
Rule: When a is observed (given), the path from b to c is blocked.
Explanation: If we already know it’s hot today, observing high ice cream sales tells us nothing new about the number of heatstroke cases. The weather information ‘screens off’ the correlation. Conversely, if the weather is unknown, high ice cream sales would indeed increase our belief in more heatstroke cases.
Structure 2: Head-to-Tail (Chain)
b -> a -> c
Path: b influences c through the intermediate node a.
Rule: When a is observed (given), the path from b to c is blocked.
Explanation: Once we directly observe the market interest rates, the specific Fed policy that led to this rate is no longer important for predicting real estate prices. The interest rate itself contains all the necessary information.
Structure 3: Head-to-Head (Common Effect / Collider)
b -> a <- c
Path: b and c are both causes of the common effect a.
Rule: The opposite of the first two structures. The path is blocked by default; it becomes unblocked when a (or any of its descendants) is observed.
Explanation: This phenomenon is known as ‘explaining away’. Knowing a professor is talented tells us nothing about their political skills (they are independent beforehand). However, if we know they got tenure and then find out they aren’t very talented, it strongly suggests they must be very politically skilled.
D-Separation Summary: Is the Path Blocked?
A path from a set of nodes A to B is blocked by a set of observed nodes E if there is a node v on the path such that either:
v is a tail-to-tail or head-to-tail node, and v is in the evidence set E.
v is a head-to-head (collider) node, and neither v nor any of its descendants is in the evidence set E.
If all paths from A to B are blocked by E, we say that A and B are D-separated given E, which implies conditional independence: \(A \perp B \mid E\).
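The collider rule is easy to verify numerically. Below is a minimal sketch of the tenure example with hypothetical CPT numbers: talent and political skill are marginally independent, but become dependent once tenure is observed.

```python
# Numerical check of 'explaining away': Talent -> Tenure <- Politics.
# All probabilities are hypothetical, chosen for illustration only.
import itertools

p_talent = {0: 0.7, 1: 0.3}
p_politics = {0: 0.8, 1: 0.2}
# P(Tenure=1 | talent, politics): either skill helps, both help most.
p_tenure1 = {(0, 0): 0.05, (0, 1): 0.50, (1, 0): 0.60, (1, 1): 0.95}

def joint(talent, politics, tenure):
    pt = p_tenure1[(talent, politics)]
    return p_talent[talent] * p_politics[politics] * (pt if tenure == 1 else 1 - pt)

# Marginally (tenure unobserved): independent, the product rule holds exactly.
p_tp = sum(joint(1, 1, t) for t in (0, 1))
print(p_tp, p_talent[1] * p_politics[1])  # 0.06 == 0.06

# Conditioned on Tenure=1: the path is unblocked, so they become dependent.
z = sum(joint(a, b, 1) for a, b in itertools.product((0, 1), repeat=2))
p_t1 = sum(joint(1, b, 1) for b in (0, 1)) / z
p_p1 = sum(joint(a, 1, 1) for a in (0, 1)) / z
p_both = joint(1, 1, 1) / z
print(p_both, p_t1 * p_p1)  # ~0.19 vs. ~0.29: no longer equal -> dependent
```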
9.2.3 Case Study: A More Complete Student Model
Now, let’s apply these concepts to the classic ‘Student’ Bayesian Network. This network models the various factors that influence whether a student gets a good recommendation letter.
Variables:
D (Difficulty): Course difficulty (0=easy, 1=hard)
I (Intelligence): Student’s intelligence (0=low, 1=high)
G (Grade): Student’s grade (0=A, 1=B, 2=C)
S (SAT): SAT score (0=low, 1=high)
L (Letter): Quality of recommendation letter (0=weak, 1=strong)
The Student Model Graph Structure
The structure of this graph encodes our prior beliefs about student performance: Difficulty and Intelligence both influence the Grade (D → G ← I), Intelligence also influences the SAT score (I → S), and the Grade alone determines the Letter (G → L).
The Student Model Factorization
Based on the graph structure and Equation 1, we can immediately write down the factorized form of the joint probability:

\[P(D, I, G, S, L) = P(D)\,P(I)\,P(G \mid D, I)\,P(S \mid I)\,P(L \mid G)\]
With a fully specified Bayesian Network, we can perform Inference. This means that after observing some variables (Evidence), we can compute the posterior probability distribution of other unobserved variables.
The Core Problem: Compute \(P(\text{Query Variables} | \text{Evidence Variables})\)
Example Inference Questions
A student received a strong letter (L=1). What’s the probability they are highly intelligent?
\(P(I=1 | L=1) = ?\)
A highly intelligent student (I=1) is in a hard course (D=1). What is their probability of getting an A?
\(P(G=A | I=1, D=1) = ?\) (This can be read directly from the CPT: 0.50.)
A student has a high SAT score (S=1) but a poor grade (G=C). What’s the probability the course was hard?
\(P(D=1 | S=1, G=C) = ?\)
Hands-on: Building and Querying the Student Model with pgmpy
Enough theory; let’s make this real with code. We will use pgmpy, a Python library for probabilistic graphical models, to implement the student model.
Steps:
Define Model Structure (add nodes and edges)
Define CPTs
Associate CPTs with the structure to form a complete model
Create an inference engine
Perform queries
pgmpy Hands-on: Building and Validating the Model
First, we’ll use a complete code block to build the student model.
```python
# Import the necessary libraries
from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.factors.discrete import TabularCPD

# 1. Define the model structure (edges)
student_model = DiscreteBayesianNetwork([('D', 'G'), ('I', 'G'), ('I', 'S'), ('G', 'L')])

# 2. Define the Conditional Probability Distributions (CPDs)
# Root nodes
cpd_d = TabularCPD('D', 2, [[0.6], [0.4]])
cpd_i = TabularCPD('I', 2, [[0.7], [0.3]])

# Conditional nodes
cpd_s = TabularCPD('S', 2, values=[[0.95, 0.2], [0.05, 0.8]],
                   evidence=['I'], evidence_card=[2])

# Note: pgmpy orders the columns with the first evidence variable changing slowest,
# so with evidence=['I', 'D'] the columns are (I=0,D=0), (I=0,D=1), (I=1,D=0), (I=1,D=1).
# pgmpy assumes variable states start from 0: G=A (g_0), G=B (g_1), G=C (g_2).
cpd_g = TabularCPD('G', 3,
                   values=[[0.30, 0.05, 0.90, 0.50],   # G=A (g_0)
                           [0.40, 0.25, 0.08, 0.30],   # G=B (g_1)
                           [0.30, 0.70, 0.02, 0.20]],  # G=C (g_2)
                   evidence=['I', 'D'], evidence_card=[2, 2])

cpd_l = TabularCPD('L', 2,
                   values=[[0.1, 0.4, 0.99],    # L=0 (weak letter)
                           [0.9, 0.6, 0.01]],   # L=1 (strong letter)
                   evidence=['G'], evidence_card=[3])

# 3. Add the CPDs to the model
student_model.add_cpds(cpd_d, cpd_i, cpd_s, cpd_g, cpd_l)

# 4. Check if the model is valid
assert student_model.check_model()
print('Model built and validated successfully!')
```
Model built and validated successfully!
pgmpy Hands-on: Creating an Inference Engine and Querying
Now for the exciting part! We create an inference engine (here, using Variable Elimination) and ask our question.
Question: Suppose we observe a student with high intelligence (I=1) who received a strong recommendation letter (L=1). What is the probability that the course was hard (D=1)? i.e., compute \(P(D=1 | I=1, L=1)\).
```python
# This is a complete, runnable code block (continues from the previous slide)
from pgmpy.inference import VariableElimination

# (Model construction code was run on the previous slide, not repeated here)

# 5. Create an inference object
inference = VariableElimination(student_model)

# 6. Perform the query
query_result = inference.query(variables=['D'], evidence={'I': 1, 'L': 1})
print(query_result)
```
The query result shows \(P(D=1 | I=1, L=1) \approx 0.33\).
Prior Probability: With no information, the probability of a hard course is \(P(D=1)=0.4\).
Posterior Probability: After learning that this intelligent student got a strong letter, our belief that the course was hard decreased, from 40% to about 33%.
Intuition: Why? An intelligent student can get a good grade (and thus a strong letter) even in a hard course. However, it’s even easier for them to get a good grade in an easy course. So the evidence of a ‘strong letter’ increases our belief in an ‘easy course’. This is the power of Bayesian inference: it quantitatively updates our beliefs in a logically consistent way.
9.2.4 Dynamic Bayesian Networks: When BNs Meet Time
If we ‘unroll’ a Bayesian Network over time, we get a Dynamic Bayesian Network (DBN).
The simplest and most famous DBN is the Hidden Markov Model (HMM).
HMM Intuition: Weather Forecasting from a Cave
Imagine you are trapped in a cave and can’t see the weather outside. But every day, some water seeps into the cave.
Your Observations (Visible): Is the cave ‘dry’, ‘damp’, or ‘dripping’ today?
True State (Hidden): Is the weather outside ‘sunny’, ‘cloudy’, or ‘rainy’?
Your task is to infer the most likely sequence of weather patterns outside based on the sequence of water seepage you observe.
Core Application of HMMs
Modeling time-series data where the true state of the system is unobservable (hidden), and we can only infer it through some visible observations.
Application in Economics:
True State (Hidden): The economy is in an ‘Expansion’ or ‘Recession’ phase.
Observations (Visible): Quarterly GDP growth rates, unemployment data.
The Two Core Assumptions of HMMs
HMMs are built on two key simplifying assumptions, which make the model tractable.
The Markov Assumption (State Transition)
The Observation Independence Assumption (Emission Probability)
Assumption 1: The Markov Assumption (State Transition)
The current true state \(z_t\) depends only on the true state at the previous time step, \(z_{t-1}\).
Economic Interpretation: Whether the economy is in recession or expansion primarily depends on whether it was in recession or expansion last quarter, not on its history long before that (in a first-order model).
Assumption 2: The Observation Independence Assumption
The current observation \(x_t\) depends only on the current true state \(z_t\).
Economic Interpretation: This quarter’s GDP growth rate depends only on whether the economy is currently in recession or expansion, and not on previous states or observations.
The Graphical Structure of an HMM
These two assumptions together define a very simple chain-like graphical structure.
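Written out, these two assumptions give the joint distribution over hidden states \(z_{1:T}\) and observations \(x_{1:T}\) its characteristic chain factorization:

\[P(x_{1:T}, z_{1:T}) = P(z_1)\,P(x_1 \mid z_1)\prod_{t=2}^{T} P(z_t \mid z_{t-1})\,P(x_t \mid z_t)\]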
The Three Elements of an HMM
To fully define an HMM, we need three sets of parameters, usually denoted as \(\lambda = (A, B, \pi)\):
State transition matrix \(A\): \(a_{ij} = P(z_t = j \mid z_{t-1} = i)\), the probability of moving from hidden state \(i\) to state \(j\).
Emission matrix \(B\): \(b_{jk} = P(x_t = k \mid z_t = j)\), the probability of producing observation \(k\) from hidden state \(j\).
Initial state distribution \(\pi\): \(\pi_i = P(z_1 = i)\), the probability of starting in state \(i\).
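To see the three elements in action, here is a minimal forward-algorithm sketch for the cave example. All parameter values are hypothetical; the forward pass computes the likelihood of an observation sequence under \(\lambda = (A, B, \pi)\).

```python
# Forward algorithm for the cave HMM; all parameter values are hypothetical.
# Hidden states: 0=sunny, 1=cloudy, 2=rainy. Observations: 0=dry, 1=damp, 2=dripping.
import numpy as np

pi = np.array([0.5, 0.3, 0.2])        # initial state distribution
A = np.array([[0.7, 0.2, 0.1],        # A[i, j] = P(z_t = j | z_{t-1} = i)
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
B = np.array([[0.8, 0.15, 0.05],      # B[i, k] = P(x_t = k | z_t = i)
              [0.3, 0.5, 0.2],
              [0.1, 0.3, 0.6]])

obs = [0, 1, 2]  # we observed: dry, then damp, then dripping

# alpha[i] = P(x_1..x_t, z_t = i), updated one time step at a time
alpha = pi * B[:, obs[0]]
for x in obs[1:]:
    alpha = (alpha @ A) * B[:, x]

print("P(observations) =", alpha.sum())
print("P(z_T | observations) =", alpha / alpha.sum())  # filtered state belief
```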
From ‘Causal’ to ‘Correlational’: Why We Need Undirected Graphs
The directed edges in Bayesian Networks are excellent for representing causal relationships (\(A \rightarrow B\)). However, in many scenarios, the relationships between variables are symmetric and mutual, with no clear causal direction.
Social Networks: If I am your friend, you are my friend.
Image Pixels: The color of a pixel is highly correlated with the colors of its neighbors, but none is the ‘cause’ of the others.
Spatial Economics: Housing prices in one region mutually influence prices in neighboring regions.
What is a Markov Random Field (MRF)?
For these types of problems, Markov Random Fields (MRFs), also known as Probabilistic Undirected Graphical Models (PUGMs), are a more natural choice.
An MRF is defined by two parts:
An undirected graph \(G=(\mathcal{V}, \mathcal{E})\), where nodes represent random variables and edges represent direct dependencies.
A set of Potential Functions \(\psi_C(\mathbf{x}_C)\) defined on the ‘cliques’ of the graph, used to quantify the ‘compatibility’ between variables.
Conditional Independence in Undirected Graphs: Much Simpler!
Unlike the complex rules of D-separation, conditional independence in undirected graphs is very intuitive:
If all paths from a set of nodes A to a set of nodes B must pass through a set of nodes C, then A and B are conditionally independent given C.
In other words, C separates A and B.
Case: Reading Independence from an Undirected Graph
MRF Conditional Independence: Three Equivalent Markov Properties
Global Markov Property:
\(A \perp B | C\) if C separates A and B. (As just discussed)
Local Markov Property:
A node is conditionally independent of all other nodes given its neighbors.
The neighbors (or Markov Blanket) are the set of nodes directly connected to it by an edge.
Pairwise Markov Property:
Any two non-adjacent nodes are conditionally independent given all other nodes.
These three properties are equivalent.
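Since separation is a purely graph-theoretic test, it is easy to automate. Below is a minimal sketch using networkx; for concreteness it uses the undirected graph whose factorization appears in the factor-graph section later (\(\psi_1(A,B,E)\,\psi_2(B,C)\,\psi_3(E,D)\)): remove the conditioning set and check whether the two sets are still connected.

```python
# Global Markov property as a graph check: does C separate A and B?
import networkx as nx

# Undirected graph matching the factorization psi1(A,B,E) psi2(B,C) psi3(E,D)
G = nx.Graph([('A', 'B'), ('B', 'E'), ('A', 'E'), ('B', 'C'), ('E', 'D')])

def separates(graph, set_a, set_b, set_c):
    """True if every path between set_a and set_b passes through set_c."""
    H = graph.copy()
    H.remove_nodes_from(set_c)
    return not any(nx.has_path(H, a, b)
                   for a in set_a for b in set_b
                   if a in H and b in H)

print(separates(G, {'C'}, {'D'}, {'B', 'E'}))  # True: C and D are separated by {B, E}
print(separates(G, {'A'}, {'D'}, {'B'}))       # False: the path A-E-D avoids B
```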
Factorization of Joint Probability: Cliques and Potential Functions
How does the joint probability of an undirected graph factorize? The answer is through Cliques.
Clique: A subset of nodes in a graph where every two distinct nodes in the subset are adjacent. They are the ‘fully connected’ subgraphs.
Maximal Clique: A clique that cannot be extended by adding any other adjacent vertex.
The joint probability distribution of an MRF can be factorized into a product of potential functions defined on its maximal cliques.
Case: Identifying Maximal Cliques
Potential Functions
For each maximal clique \(C\) in the graph, we define a potential function \(\psi_C(\mathbf{x}_C)\).
This function takes a specific configuration of values \(\mathbf{x}_C\) for the variables in that clique and outputs a non-negative real number.
Intuitive Meaning: This number represents the ‘compatibility’ or ‘preference’ for the variables within that clique to be in that specific state. A higher value means the configuration is more ‘harmonious’ and thus more likely.
Important: It is not a probability! It’s just a score.
The Hammersley-Clifford Theorem
This important theorem provides the bridge between the joint probability distribution of an MRF and its potential functions:
For a strictly positive probability distribution \(P(\mathbf{x}) > 0\) that satisfies the conditional independence properties of an undirected graph \(G\), the joint probability can be factorized as a product of potential functions \(\psi_C\) over the maximal cliques \(C\) of \(G\):

\[P(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C), \qquad Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C)\]

Here \(Z\) is the normalization constant, known as the partition function. Its Role: to ensure that the probabilities of all possible states sum to 1. Since the product of potential functions is not guaranteed to do so, we divide by the sum of this product over all possible configurations.
The Challenge: Computing \(Z\) is often the hardest part of working with MRFs. It requires summing over all possible configurations of all variables, which is an exponential computation. This is the main bottleneck for inference and learning in MRFs.
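The following sketch makes both the factorization and the cost of \(Z\) concrete: it enumerates the maximal cliques of the same example graph with networkx and computes the partition function by brute force over all binary configurations. The potential function is hypothetical (it simply rewards cliques whose variables agree).

```python
# Brute-force partition function Z for a tiny binary MRF.
import itertools
import networkx as nx

G = nx.Graph([('A', 'B'), ('B', 'E'), ('A', 'E'), ('B', 'C'), ('E', 'D')])
cliques = list(nx.find_cliques(G))  # maximal cliques: {A,B,E}, {B,C}, {D,E}

def psi(values):
    """Hypothetical potential: configurations where all clique members agree score higher."""
    return 2.0 if len(set(values)) == 1 else 1.0

nodes = sorted(G.nodes())
Z = 0.0
for assignment in itertools.product([0, 1], repeat=len(nodes)):
    x = dict(zip(nodes, assignment))
    score = 1.0
    for C in cliques:
        score *= psi([x[v] for v in C])
    Z += score

print("Z =", Z)  # the sum ranges over 2^5 = 32 configurations; for n variables, 2^n
```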
BN vs. MRF Summary
| Feature | Bayesian Network (Directed) | Markov Network (Undirected) |
|---|---|---|
| Graph Structure | Directed Acyclic Graph (DAG) | Undirected Graph |
| Core Idea | Factorization of conditional probabilities | Factorization of potential/energy functions |
| Parameterization | Conditional Probability Distributions (CPDs) | Potential functions on maximal cliques |
| Normalization | Automatically satisfied (local normalization) | Requires a global partition function \(Z\) |
| Use Cases | Causal relationships, generative models | Symmetric relationships, discriminative models |
| Interpretability | Strong, easy to understand | Weaker, potential functions are less intuitive |
9.3.3 Conditional Random Fields (CRF)
Generative vs. Discriminative Models
Generative Models: Model the joint probability \(P(X, Z)\). They learn how the data is ‘generated’.
Examples: HMM, Bayesian Networks.
Capability: Can be used to generate new sample data.
Discriminative Models: Directly model the conditional probability \(P(Z|X)\) that we often care more about. They learn how to ‘discriminate’ between different Z.
Examples: Logistic Regression, CRF.
Capability: Often perform better on classification and prediction tasks.
The Core Advantage of CRFs
Conditional Random Fields (CRFs) are a special type of MRF that is a discriminative model.
Comparison to HMMs: CRFs can relax the strict observation independence assumption of HMMs. In a CRF, the probability of the current state \(z_t\) can depend on features from the entire observation sequence \(X\), making the model much more powerful and flexible.
Linear-Chain CRF
The most common type of CRF is the linear-chain CRF, which is specialized for sequence data.
Its conditional probability \(P(Z|X)\) takes the form:

\[P(Z \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k f_k(z_{t-1}, z_t, X, t) \right)\]
\(f_k\) are called feature functions. They can be any function of the states and observations (e.g., ‘the current word is “bank” and the previous word was “national”’).
\(\lambda_k\) are the weights for each feature, which are learned from data.
\(Z(X)\) is the partition function that depends on the observation sequence \(X\).
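Here is a toy sketch of this formula, with two hand-written feature functions and hypothetical weights; a real CRF would learn the \(\lambda_k\) from data and use many thousands of features.

```python
# Toy linear-chain CRF scoring; feature functions and weights are hypothetical.
import itertools
import math

X = ['national', 'bank']   # observation sequence
states = ['O', 'ORG']      # possible labels for each position

def f1(z_prev, z, X, t):   # fires when 'bank' is labeled ORG
    return 1.0 if X[t] == 'bank' and z == 'ORG' else 0.0

def f2(z_prev, z, X, t):   # fires on the label transition O -> ORG
    return 1.0 if z_prev == 'O' and z == 'ORG' else 0.0

weights = [1.5, 0.8]       # the lambda_k

def log_score(Z):          # sum_t sum_k lambda_k f_k(z_{t-1}, z_t, X, t)
    total = 0.0
    for t in range(len(X)):
        z_prev = Z[t - 1] if t > 0 else 'START'
        total += sum(w * f(z_prev, Z[t], X, t) for w, f in zip(weights, (f1, f2)))
    return total

# Partition function Z(X): normalizes over all possible label sequences for this X.
Zx = sum(math.exp(log_score(Z)) for Z in itertools.product(states, repeat=len(X)))
best = max(itertools.product(states, repeat=len(X)), key=log_score)
print(best, math.exp(log_score(best)) / Zx)  # most probable labeling and its P(Z|X)
```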
Is there a more general representation that can unify both models and provide a basis for universal inference algorithms?
The answer is the Factor Graph.
Factor Graphs: A More Refined Representation
A factor graph is a bipartite graph that explicitly represents how a global function (like a joint probability) is factorized into a product of local functions (factors).
Two Types of Nodes:
Variable Nodes: Usually drawn as circles, representing random variables.
Factor Nodes: Usually drawn as squares, representing a factor in the factorization of the global function.
Edges: An edge only exists between a variable node and a factor node. A variable node is connected to a factor node if and only if that variable is an argument of that factor.
Case: Converting a Bayesian Network to a Factor Graph
Case: Converting a Markov Network to a Factor Graph
Recall the MRF factorization: \(P(\mathbf{x}) \propto \psi_1(A,B,E) \psi_2(B,C) \psi_3(E,D)\)
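A quick sketch of what this conversion produces: the factor graph is just a bipartite graph, which we can build directly in networkx (the factor names follow the \(\psi_1, \psi_2, \psi_3\) in the factorization above).

```python
# The factor graph of psi1(A,B,E) psi2(B,C) psi3(E,D) as a bipartite graph.
import networkx as nx

FG = nx.Graph()
variables = ['A', 'B', 'C', 'D', 'E']                      # circle nodes
factors = {'psi1': ['A', 'B', 'E'],                        # square nodes
           'psi2': ['B', 'C'],
           'psi3': ['E', 'D']}

FG.add_nodes_from(variables, kind='variable')
FG.add_nodes_from(factors, kind='factor')
for f, args in factors.items():
    # An edge exists iff the variable is an argument of the factor.
    FG.add_edges_from((f, v) for v in args)

print(nx.is_bipartite(FG))  # True: edges only connect variables to factors
```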
Advantages of Factor Graphs
Eliminates Ambiguity: The same graph (directed or undirected) can correspond to different factorizations, but a factor graph makes the factorization explicit and unambiguous.
Universal Algorithms: Many important inference algorithms, such as Belief Propagation, are most clearly and generally defined on factor graphs.
Universal Inference Algorithm: Belief Propagation
Belief Propagation (BP), also known as the Sum-Product Algorithm, is a general algorithm for performing efficient, exact inference on graphical models.
Goal: To compute the Marginal Probability \(P(x_i)\) for each individual variable.
Core Idea: The algorithm works by passing ‘messages’ along the edges of the factor graph. This can be seen as nodes ‘telling’ each other their ‘beliefs’ about the states of variables.
Convergence: On graphs without cycles (i.e., trees), the algorithm converges after a finite number of message passes and yields exact marginal probabilities. On graphs with cycles (Loopy BP), the algorithm is not guaranteed to converge or give exact solutions, but it often works very well in practice.
Intuition of the Sum-Product Algorithm: Applying the Distributive Law
Why is this algorithm efficient? Because it cleverly uses the distributive law to avoid redundant computations.
Consider calculating the marginal probability \(P(x_1) = \sum_{x_2, x_3, x_4} P(x_1,x_2,x_3,x_4)\). Assume \(P(\mathbf{x}) \propto f_1(x_1,x_2) f_2(x_2,x_3) f_3(x_3,x_4)\).
Brute-force calculation: For each value of \(x_1\), iterate through all combinations of \(x_2, x_3, x_4\); with \(K\) values per variable, that is on the order of \(K^3\) terms per value of \(x_1\).
Using the distributive law, we push the summations as far inside as possible:

\[P(x_1) \propto \sum_{x_2} f_1(x_1, x_2) \sum_{x_3} f_2(x_2, x_3) \sum_{x_4} f_3(x_3, x_4)\]

We sum out the innermost variables first and pass the result outward as a ‘message’. This is the mathematical essence of the sum-product message-passing algorithm.
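The saving is easy to demonstrate numerically. Below, a sketch with three random factors over binary variables compares the brute-force sum with the message-passing order; both give the same (unnormalized) marginal, but the second touches far fewer terms.

```python
# Distributive-law speedup on the chain f1(x1,x2) f2(x2,x3) f3(x3,x4).
import numpy as np

rng = np.random.default_rng(0)
f1, f2, f3 = rng.random((2, 2)), rng.random((2, 2)), rng.random((2, 2))

# Brute force: for each x1, sum the full product over (x2, x3, x4).
brute = np.zeros(2)
for x1 in range(2):
    for x2 in range(2):
        for x3 in range(2):
            for x4 in range(2):
                brute[x1] += f1[x1, x2] * f2[x2, x3] * f3[x3, x4]

# Message passing: sum out the innermost variable first, pass the result outward.
m3 = f3.sum(axis=1)  # message from f3 to x3: sum over x4
m2 = f2 @ m3         # message from f2 to x2: sum over x3
msg = f1 @ m2        # message from f1 to x1: sum over x2

print(np.allclose(brute, msg))  # True: same answer
print(brute / brute.sum())      # the normalized marginal P(x1)
```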
Course Summary
The curse of dimensionality motivates us to find simplified ways to represent joint probability distributions.
Probabilistic Graphical Models use conditional independence to decompose complex distributions into a product of local functions, using a graph to represent the independence assumptions.
Bayesian Networks (Directed) use conditional probabilities, are suitable for modeling causal relationships, and their joint probability is given by a simplified form of the chain rule.
Markov Networks (Undirected) use potential functions, are suitable for modeling symmetric relationships, and their joint probability is given by a product of potentials on maximal cliques (Hammersley-Clifford theorem).
Factor Graphs provide a unifying, more refined representation for both models and are the foundation for general inference algorithms.
Belief Propagation (Sum-Product Algorithm) is the core algorithm for efficient inference on graphical models, whose essence is using the distributive law to avoid redundant calculations.
Key Takeaways
Structure is an Assumption: Every node and every edge (or lack thereof) in a graph is an assumption about how the world works.
Decomposition is Key: The goal is to break down a large, intractable problem into many small, manageable ones.
Inference is the Goal: Once a model is built, its value lies in its ability to answer ‘what if’ questions through probabilistic inference.