Zhejiang Wanli University
Statistical learning is a set of tools for modeling and understanding complex datasets. It is a vital area with applications in many fields. Hypothesis testing, a core concept in statistical inference, allows us to use data to answer “yes-or-no” questions, moving beyond simply estimating values or making predictions.
In today’s world of “big data,” we’re often faced with testing hundreds or even thousands of hypotheses at once. This is called multiple testing. The main challenge is to avoid “false positives” — incorrectly rejecting a true null hypothesis. We want to make sure our “discoveries” are real, not just random noise! 🔍
Hypothesis testing gives us a formal way to use data to answer specific questions. It provides a structured framework to evaluate evidence and make informed decisions.
Goal: Decide if there’s enough evidence to reject a pre-defined assumption (the null hypothesis).
Examples: Do two groups (say, a treatment group and a control group) have the same mean? Is a particular coefficient in a regression model equal to zero?
We generally follow these steps:
Define the hypotheses: State the null hypothesis H₀ and the alternative hypothesis Hₐ.
Construct the test statistic: Compute a quantity that summarizes how strongly the data disagree with H₀.
Compute the p-value: Find the probability of seeing a test statistic as extreme as (or more extreme than) the one we observed, assuming H₀ is true.
Decide: Based on the p-value (and a pre-set threshold), decide whether to reject H₀.
Null Hypothesis (H₀): The baseline assumption. We look for evidence against this. Think of it like the “status quo” or “innocent until proven guilty.”
Alternative Hypothesis (Hₐ): What we think might be true if H₀ is false. It’s often the opposite of H₀. For instance, if H₀ is “A=B,” then Hₐ is “A≠B.”
The test statistic measures how far the data is from what we’d expect if H₀ were true.
The exact formula depends on the specific hypothesis.
Example: If we’re comparing the average values (means) of two groups (like treatment and control), we often use a two-sample t-statistic.
The two-sample t-statistic is calculated like this:
\[ T = \frac{\hat{\mu}_t - \hat{\mu}_c}{s \sqrt{\frac{1}{n_t} + \frac{1}{n_c}}} \]
where \(\hat{\mu}_t\) and \(\hat{\mu}_c\) are the sample means of the treatment and control groups, and \(n_t\) and \(n_c\) are the number of observations in each group.
The pooled standard deviation, s, combines the variability within each group:
\[ s = \sqrt{\frac{(n_t - 1)s_t^2 + (n_c - 1)s_c^2}{n_t + n_c - 2}} \]
\(s_t^2\) and \(s_c^2\) are the sample variances of the treatment and control groups. They measure how spread out the data is within each group.
A large absolute value of T (|T|) suggests that the group means are significantly different, providing evidence against the null hypothesis.
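As a concrete illustration, here is a minimal Python sketch (assuming NumPy and SciPy are available; the simulated treatment and control data are made up for the example) that computes the pooled two-sample t-statistic defined above and checks it against SciPy's built-in pooled-variance t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: the treatment group has a slightly higher mean than control.
treatment = rng.normal(loc=0.5, scale=1.0, size=30)
control = rng.normal(loc=0.0, scale=1.0, size=25)

n_t, n_c = len(treatment), len(control)
mu_t, mu_c = treatment.mean(), control.mean()
# Sample variances (ddof=1 gives the unbiased estimates s_t^2 and s_c^2).
var_t, var_c = treatment.var(ddof=1), control.var(ddof=1)

# Pooled standard deviation s.
s = np.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
# Two-sample t-statistic.
T = (mu_t - mu_c) / (s * np.sqrt(1 / n_t + 1 / n_c))

# Two-sided p-value from the t-distribution with n_t + n_c - 2 degrees of freedom.
p_value = 2 * stats.t.sf(abs(T), df=n_t + n_c - 2)

print(f"T = {T:.3f}, p-value = {p_value:.4f}")
# Should agree with SciPy's pooled-variance two-sample t-test:
print(stats.ttest_ind(treatment, control, equal_var=True))
```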
p-value: The probability of observing a test statistic as extreme as (or more extreme than) the one we calculated, if the null hypothesis were actually true.
A small p-value means the data is unlikely under H₀, providing strong evidence against it.
The p-value is not the probability that H₀ is true. It’s the probability of seeing the observed data (or more extreme data), given that H₀ is true. This is a very important distinction! 🧐
Density function for N(0,1)
This figure shows the probability density function of a standard normal distribution, often called a “bell curve.” The vertical line marks a value of 2.33; only about 1% of the area under the curve lies to the right of it, so observing a test statistic this large would be unlikely if H₀ were true.
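To connect the figure to p-values, here is a one-line check (assuming SciPy) of the area to the right of 2.33 under N(0,1):

```python
from scipy.stats import norm

# P(Z > 2.33) under the standard normal: approximately 0.0099, i.e. about 1%.
print(1 - norm.cdf(2.33))
```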
The null distribution is the probability distribution of the test statistic assuming the null hypothesis is true. It tells us what values of the test statistic are likely (or unlikely) if there’s no real effect.
Common examples: the standard normal distribution, the t-distribution, the χ² distribution, and the F-distribution.
We need to know the null distribution to calculate p-values.
We reject H₀ if the p-value is smaller than a pre-defined “significance level” (α).
Significance level (α): The cutoff for rejecting H₀. A common choice is α = 0.05 (5%).
Decision Rule: If the p-value is less than α, reject H₀; otherwise, fail to reject H₀.
We never “accept” H₀. We only “reject” or “fail to reject” it. Failing to reject doesn’t mean H₀ is definitely true, just that we don’t have enough evidence to say it’s false.
|  | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | Type I Error | Correct Decision |
| Fail to Reject H₀ | Correct Decision | Type II Error |
Power: The probability of correctly rejecting H₀ when it’s false (1 - Probability of Type II error). We want high power! 💪
Our goal is to control the chance of Type I errors (α) while maximizing power (minimizing Type II errors).
When testing a single hypothesis, it’s relatively easy to control the Type I error rate (α).
Problem: When we test many hypotheses at the same time, the chance of making at least one Type I error goes way up, even if each individual test has a low α.
Analogy: Imagine flipping a fair coin many times. Each flip has a 50/50 chance, but if you flip enough times, you’re bound to see some “unlikely” streaks (like all heads) just by random chance.
Similarly, when testing many hypotheses, some will look “significant” (low p-value) just by chance, even if all the null hypotheses are true.
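A small simulation sketch (Python with NumPy/SciPy; the number of tests, sample sizes, and α = 0.05 are arbitrary illustrative choices) makes this concrete: even when every null hypothesis is true, roughly 5% of the p-values fall below 0.05 by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m = 1000        # number of hypotheses tested
n = 20          # observations per group

false_positives = 0
for _ in range(m):
    # Both groups are drawn from the same distribution, so every H0 is true.
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    _, p = stats.ttest_ind(x, y)
    if p < 0.05:
        false_positives += 1

# Roughly 5% of the 1000 true nulls are (wrongly) rejected at alpha = 0.05.
print(f"False positives: {false_positives} out of {m}")
```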
A stockbroker claims they can predict whether Apple’s stock will go up or down for the next 10 days.
They email 1024 potential clients, each getting a different sequence of predictions (2¹⁰ = 1024 possible sequences).
Just by chance, one client will get a perfect prediction! They might think the stockbroker is amazing, but it’s just random luck.
This shows how making many “guesses” (tests) can lead to seemingly significant results purely by chance.
FWER: The probability of making at least one Type I error (false positive) among all the hypothesis tests we do.
If all m tests are independent and all null hypotheses are true, then:
\[ \text{FWER}(\alpha) = 1 - (1 - \alpha)^m \]
FWER vs. Number of Hypotheses
This graph shows how the FWER changes as we test more hypotheses (the x-axis is on a log scale), for different values of α.
The dashed line is at 0.05.
To keep the FWER at 0.05 when testing 50 null hypotheses, we need to control the Type I error for each individual hypothesis at a much lower level (around α = 0.001).
As we test more hypotheses, we need to be more and more strict with our individual significance levels to maintain the same overall FWER.
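The formula above is easy to evaluate directly. The sketch below (plain Python/NumPy; α = 0.05 chosen to match the dashed line in the figure) also solves for the per-test level needed to hold the FWER at 0.05 when m = 50.

```python
import numpy as np

alpha = 0.05
for m in [1, 10, 50, 100, 500]:
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:4d}: FWER = {fwer:.3f}")
# m =    1: FWER = 0.050
# m =   10: FWER = 0.401
# m =   50: FWER = 0.923
# ...

# Per-test level alpha such that 1 - (1 - alpha)^50 = 0.05:
alpha_per_test = 1 - (1 - 0.05) ** (1 / 50)
print(f"alpha per test for m = 50: {alpha_per_test:.5f}")  # about 0.001
```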
To control the FWER at a specific level (α), we need stricter rules for rejecting each hypothesis.
Common methods:
Rule: Reject H₀j if its p-value is less than α/m. Divide the desired overall significance level (α) by the number of tests (m).
Guarantee: FWER(α/m) ≤ m × α/m = α. The Bonferroni correction ensures the FWER is no larger than α.
Advantages: Very simple to apply and easy to understand; it makes no assumptions about the dependence between the p-values.
Disadvantage: It can be quite conservative, especially when m is large, so it may miss real effects (low power).
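A minimal sketch of the Bonferroni rule (Python/NumPy; the p-values are made up for illustration):

```python
import numpy as np

p_values = np.array([0.0002, 0.009, 0.012, 0.04, 0.2, 0.7])
m = len(p_values)
alpha = 0.05

# Bonferroni: reject H0_j whenever p_j < alpha / m.
reject = p_values < alpha / m
print(f"Threshold: {alpha / m:.4f}")
print(f"Rejected hypotheses: {np.where(reject)[0]}")  # only the smallest p-value survives
```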
Holm’s method is a step-down procedure, which is less conservative than Bonferroni:
Sort the p-values from smallest to largest: \(p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}\).
Find the smallest index k such that \(p_{(k)} > \frac{\alpha}{m + 1 - k}\).
Reject the null hypotheses corresponding to \(p_{(1)}, \dots, p_{(k-1)}\); if no such k exists, reject them all.
Holm’s Method and Bonferroni
Each panel shows sorted p-values from simulations with 10 hypotheses. Black dots are true null hypotheses (2 out of 10), and red dots are false null hypotheses.
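For comparison, here is a sketch of Holm’s step-down rule on the same made-up p-values (written out by hand rather than via a library, to mirror the steps above):

```python
import numpy as np

p_values = np.array([0.0002, 0.009, 0.012, 0.04, 0.2, 0.7])
m = len(p_values)
alpha = 0.05

order = np.argsort(p_values)          # indices that sort the p-values
sorted_p = p_values[order]
# Step-down thresholds: alpha / m, alpha / (m - 1), ..., alpha / 1.
thresholds = alpha / (m - np.arange(m))

# Find the first sorted p-value that exceeds its threshold; reject everything before it.
exceeds = sorted_p > thresholds
k = np.argmax(exceeds) if exceeds.any() else m
reject = np.zeros(m, dtype=bool)
reject[order[:k]] = True

# Holm rejects three hypotheses here, whereas Bonferroni rejected only one.
print(f"Rejected hypotheses: {np.where(reject)[0]}")
```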
Tukey’s Method: Specifically for comparing all possible pairs of group means in ANOVA (Analysis of Variance).
Scheffé’s Method: Allows testing any linear combination of group means, even after seeing the data. Very flexible, but also very conservative.
These are more powerful than Bonferroni/Holm in their specific situations, but less general.
FWER and Power
This plot shows how FWER and power are related in a simulation where 90% of the null hypotheses are true.
As m (number of hypotheses) increases, power goes down for a fixed FWER.
Controlling the FWER very strictly (low α) means lower power (more Type II errors). We’re less likely to find true effects.
The dashed vertical line is at FWER = 0.05.
This shows the key trade-off: being more conservative (controlling FWER) reduces our ability to find real effects.
\[ \text{FDR} = E\left(\frac{\text{Number of Type I Errors}}{\text{Number of Rejected Hypotheses}}\right) = E\left(\frac{V}{R}\right) \]
Where V is the number of Type I Errors and R is the number of Rejected Hypotheses.
Controlling the FDR is less strict than controlling the FWER. It allows for some false positives, but aims to keep their proportion low.
Motivation: In exploratory studies with many hypotheses (m), we might be okay with some false positives to discover more true effects. We’re prioritizing discovery over absolute certainty.
FDR control is good when we’re more interested in finding potential leads than in making rock-solid claims.
The Benjamini-Hochberg (BH) procedure is a popular way to control the FDR at a desired level, q:
Sort the p-values from smallest to largest: \(p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}\).
Define \(L = \max\{j : p_{(j)} < qj/m\}\).
Reject all null hypotheses whose p-values are at most \(p_{(L)}\); if no such j exists, reject nothing.
Guarantee: The BH procedure guarantees that FDR ≤ q if the p-values are independent (or have some types of mild dependence).
Key Idea: The cutoff for rejecting a hypothesis depends on all the p-values (through L), not just its own. This adaptive cutoff makes it less conservative than Bonferroni or Holm.
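A minimal sketch of the BH procedure on the same made-up p-values (Python/NumPy; q = 0.05 is an arbitrary choice):

```python
import numpy as np

p_values = np.array([0.0002, 0.009, 0.012, 0.04, 0.2, 0.7])
m = len(p_values)
q = 0.05

sorted_p = np.sort(p_values)
# BH thresholds: q * j / m for j = 1, ..., m.
thresholds = q * np.arange(1, m + 1) / m

# L is the largest j with p_(j) < q * j / m; reject all p-values <= p_(L).
below = np.where(sorted_p < thresholds)[0]
if below.size > 0:
    cutoff = sorted_p[below[-1]]
    reject = p_values <= cutoff
else:
    reject = np.zeros(m, dtype=bool)

# In this toy example BH rejects the three smallest p-values.
print(f"Rejected hypotheses: {np.where(reject)[0]}")
```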
BH and Bonferroni
This figure shows the same set of 2000 ordered p-values, comparing Bonferroni (FWER control) and BH (FDR control) thresholds.
So far, we’ve assumed we know the theoretical null distribution of the test statistic (like the t-distribution or F-distribution).
Problem: Sometimes, the theoretical null distribution is unknown, or the assumptions needed for it to be valid aren’t met (e.g., data isn’t normally distributed, small sample size).
Solution: Use re-sampling methods (like permutation tests) to estimate the null distribution from the data itself.
Suppose we want to compare the average values of two groups (X and Y).
Null Hypothesis (H₀): The distributions of X and Y are the same. This means the group labels (“X” or “Y”) are essentially random.
Idea: If H₀ is true, then randomly swapping observations between the groups shouldn’t systematically change the test statistic.
We can permute (shuffle) the data many times, calculate the test statistic each time, and build an empirical null distribution.
Calculate the test statistic: Compute the test statistic (e.g., the t-statistic) on the original data.
Repeat many times: Randomly shuffle (permute) the group labels, then re-compute the test statistic on the permuted data. The collection of permuted statistics forms the empirical null distribution.
Calculate the p-value: The p-value is the proportion of shuffled test statistics that are as extreme as (or more extreme than) the original test statistic.
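Here is a sketch of a two-sample permutation test (Python/NumPy; the simulated data, the difference in means as the test statistic, and 10,000 permutations are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.4, size=20)   # hypothetical group X
y = rng.normal(loc=0.0, size=25)   # hypothetical group Y

def statistic(a, b):
    # Difference in group means; a t-statistic would work just as well.
    return a.mean() - b.mean()

observed = statistic(x, y)
pooled = np.concatenate([x, y])
n_x = len(x)

n_perm = 10_000
perm_stats = np.empty(n_perm)
for b in range(n_perm):
    shuffled = rng.permutation(pooled)          # shuffle the group labels
    perm_stats[b] = statistic(shuffled[:n_x], shuffled[n_x:])

# Two-sided p-value: fraction of permuted statistics at least as extreme as the observed one.
p_value = np.mean(np.abs(perm_stats) >= abs(observed))
print(f"Observed statistic = {observed:.3f}, permutation p-value = {p_value:.4f}")
```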
We can also use re-sampling to estimate the FDR directly, without calculating p-values first.
The idea is to approximate:
\[ \text{FDR} = E\left(\frac{V}{R}\right) \approx \frac{E(V)}{R} \]
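One way to make this concrete: for a rejection cutoff c, R is the number of observed statistics with |T| ≥ c, and E(V) can be approximated by the average number of permuted (null) statistics exceeding c. The sketch below (Python/NumPy; the simulated data, cutoff, and number of permutations are illustrative) puts this plug-in estimate together.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n_x, n_y = 100, 15, 15
# Hypothetical data matrix: m features measured on two groups; the first 10 have a real effect.
X = rng.normal(size=(m, n_x))
Y = rng.normal(size=(m, n_y))
X[:10] += 1.0

def stats_all(a, b):
    # Per-feature difference in group means.
    return a.mean(axis=1) - b.mean(axis=1)

observed = np.abs(stats_all(X, Y))
data = np.concatenate([X, Y], axis=1)

n_perm = 200
null_exceed = np.zeros(n_perm)
cutoff = 0.7                                   # arbitrary rejection threshold on |statistic|
for b in range(n_perm):
    cols = rng.permutation(data.shape[1])      # shuffle the group labels (columns)
    perm = data[:, cols]
    perm_stat = np.abs(stats_all(perm[:, :n_x], perm[:, n_x:]))
    null_exceed[b] = np.sum(perm_stat >= cutoff)

R = np.sum(observed >= cutoff)                 # number of rejections at this cutoff
EV = null_exceed.mean()                        # estimated expected number of false positives
print(f"R = {R}, estimated E(V) = {EV:.1f}, estimated FDR = {EV / max(R, 1):.3f}")
```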
Theoretical and resampling null distribution comparison
This shows the theoretical and re-sampling null distributions for the 11th gene in the Khan dataset. They’re very similar. The theoretical p-value is 0.041, and the re-sampling p-value is 0.042.
Theoretical and resampling null distribution comparison
This shows the theoretical and re-sampling null distributions for the 877th gene in the Khan dataset. They’re quite different. The theoretical p-value is 0.571, and the re-sampling p-value is 0.673. This shows how re-sampling can be more accurate when the theoretical distribution isn’t a good fit.
Multiple testing: Testing many hypotheses at once increases the risk of false positives (Type I errors).
FWER control: Strict methods like Bonferroni and Holm control the chance of any false positives, but can have low power.
FDR control: Less strict, allowing some false positives to increase discovery. The Benjamini-Hochberg (BH) procedure is a common method.
Re-sampling: A flexible way to estimate null distributions and do inference when theoretical assumptions are violated or the null distribution is unknown.
How do you choose between controlling the FWER and the FDR in practice? What factors affect your decision?
What are the trade-offs in choosing a stricter (FWER) versus a more lenient (FDR) approach?
Can you think of real-world situations where multiple testing is important (e.g., genomics, A/B testing, finance)?
When would you prefer to use a re-sampling approach (like a permutation test) instead of a theoretical null distribution?
Qiu Fei (Peter) 💌 [email protected]