Zhejiang Wanli University
Statistical learning is a set of tools for modeling and understanding complex datasets. It is a vital area with applications in many fields. Hypothesis testing, a core concept in statistical inference, allows us to use data to answer “yes-or-no” questions, moving beyond simply estimating values or making predictions.
In today’s world of “big data,” we’re often faced with testing hundreds or even thousands of hypotheses at once. This is called multiple testing. The main challenge is to avoid “false positives” — incorrectly rejecting a true null hypothesis. We want to make sure our “discoveries” are real, not just random noise! 🔍
Hypothesis testing gives us a formal way to use data to answer specific questions. It provides a structured framework to evaluate evidence and make informed decisions.
Goal: Decide if there’s enough evidence to reject a pre-defined assumption (the null hypothesis).
Examples: Do two groups (say, a treatment group and a control group) have the same mean? Is a particular coefficient in a regression model equal to zero?
We generally follow these steps:
Define the hypotheses: State the null hypothesis H₀ and the alternative hypothesis Hₐ.
Construct the test statistic: Compute a quantity that summarizes how strongly the data disagree with H₀.
Compute the p-value: Find the probability of seeing a test statistic as extreme as (or more extreme than) the one we observed, assuming H₀ is true.
Decide: Based on the p-value (and a pre-set threshold), decide whether to reject H₀.
Null Hypothesis (H₀): The baseline assumption. We look for evidence against this. Think of it like the “status quo” or “innocent until proven guilty.”
Alternative Hypothesis (Hₐ): What we think might be true if H₀ is false. It’s often the opposite of H₀. For instance, if H₀ is “A=B,” then Hₐ is “A≠B.”
The test statistic measures how far the data is from what we’d expect if H₀ were true.
The exact formula depends on the specific hypothesis.
Example: If we’re comparing the average values (means) of two groups (like treatment and control), we often use a two-sample t-statistic.
The two-sample t-statistic is calculated like this:
\[ T = \frac{\hat{\mu}_t - \hat{\mu}_c}{s \sqrt{\frac{1}{n_t} + \frac{1}{n_c}}} \]
where \(\hat{\mu}_t\) and \(\hat{\mu}_c\) are the sample means of the treatment and control groups, and \(n_t\) and \(n_c\) are the number of observations in each group.
The pooled standard deviation, s, combines the variability within each group:
\[ s = \sqrt{\frac{(n_t - 1)s_t^2 + (n_c - 1)s_c^2}{n_t + n_c - 2}} \]
\(s_t^2\) and \(s_c^2\) are the sample variances of the treatment and control groups. They measure how spread out the data is within each group.
A large absolute value of T (|T|) suggests that the group means are significantly different, providing evidence against the null hypothesis.
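As a concrete illustration, here is a minimal Python sketch (assuming NumPy and SciPy are available; the simulated treatment and control data are made up for the example) that computes the pooled two-sample t-statistic defined above and checks it against SciPy's built-in pooled-variance t-test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: the treatment group has a slightly higher mean than control.
treatment = rng.normal(loc=0.5, scale=1.0, size=30)
control = rng.normal(loc=0.0, scale=1.0, size=25)

n_t, n_c = len(treatment), len(control)
mu_t, mu_c = treatment.mean(), control.mean()
# Sample variances (ddof=1 gives the unbiased estimates s_t^2 and s_c^2).
var_t, var_c = treatment.var(ddof=1), control.var(ddof=1)

# Pooled standard deviation s.
s = np.sqrt(((n_t - 1) * var_t + (n_c - 1) * var_c) / (n_t + n_c - 2))
# Two-sample t-statistic.
T = (mu_t - mu_c) / (s * np.sqrt(1 / n_t + 1 / n_c))

# Two-sided p-value from the t-distribution with n_t + n_c - 2 degrees of freedom.
p_value = 2 * stats.t.sf(abs(T), df=n_t + n_c - 2)

print(f"T = {T:.3f}, p-value = {p_value:.4f}")
# Should agree with SciPy's pooled-variance two-sample t-test:
print(stats.ttest_ind(treatment, control, equal_var=True))
```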
p-value: The probability of observing a test statistic as extreme as (or more extreme than) the one we calculated, if the null hypothesis were actually true.
A small p-value means the data is unlikely under H₀, providing strong evidence against it.
The p-value is not the probability that H₀ is true. It’s the probability of seeing the observed data (or more extreme data), given that H₀ is true. This is a very important distinction! 🧐
Density function for N(0,1)
This figure shows the probability density function of a standard normal distribution, often called a “bell curve.” The vertical line marks a value of 2.33; only about 1% of the area under the curve lies to the right of it, so observing a test statistic this large would be unlikely if H₀ were true.
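To connect the figure to p-values, here is a one-line check (assuming SciPy) of the area to the right of 2.33 under N(0,1):

```python
from scipy.stats import norm

# P(Z > 2.33) under the standard normal: approximately 0.0099, i.e. about 1%.
print(1 - norm.cdf(2.33))
```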
The null distribution is the probability distribution of the test statistic assuming the null hypothesis is true. It tells us what values of the test statistic are likely (or unlikely) if there’s no real effect.
Common examples: the standard normal distribution, the t-distribution, the χ² distribution, and the F-distribution.
We need to know the null distribution to calculate p-values.
We reject H₀ if the p-value is smaller than a pre-defined “significance level” (α).
Significance level (α): The cutoff for rejecting H₀. A common choice is α = 0.05 (5%).
Decision Rule: If the p-value is less than α, reject H₀; otherwise, fail to reject H₀.
We never “accept” H₀. We only “reject” or “fail to reject” it. Failing to reject doesn’t mean H₀ is definitely true, just that we don’t have enough evidence to say it’s false.
|  | H₀ True | H₀ False |
|---|---|---|
| Reject H₀ | Type I Error | Correct Decision |
| Fail to Reject H₀ | Correct Decision | Type II Error |
Power: The probability of correctly rejecting H₀ when it’s false (1 - Probability of Type II error). We want high power! 💪
Our goal is to control the chance of Type I errors (α) while maximizing power (minimizing Type II errors).
When testing a single hypothesis, it’s relatively easy to control the Type I error rate (α).
Problem: When we test many hypotheses at the same time, the chance of making at least one Type I error goes way up, even if each individual test has a low α.
Analogy: Imagine flipping a fair coin many times. Each flip has a 50/50 chance, but if you flip enough times, you’re bound to see some “unlikely” streaks (like all heads) just by random chance.
Similarly, when testing many hypotheses, some will look “significant” (low p-value) just by chance, even if all the null hypotheses are true.
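A small simulation sketch (Python with NumPy/SciPy; the number of tests, sample sizes, and α = 0.05 are arbitrary illustrative choices) makes this concrete: even when every null hypothesis is true, roughly 5% of the p-values fall below 0.05 by chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m = 1000        # number of hypotheses tested
n = 20          # observations per group

false_positives = 0
for _ in range(m):
    # Both groups are drawn from the same distribution, so every H0 is true.
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    _, p = stats.ttest_ind(x, y)
    if p < 0.05:
        false_positives += 1

# Roughly 5% of the 1000 true nulls are (wrongly) rejected at alpha = 0.05.
print(f"False positives: {false_positives} out of {m}")
```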
A stockbroker claims they can predict whether Apple’s stock will go up or down for the next 10 days.
They email 1024 potential clients, each getting a different sequence of predictions (2¹⁰ = 1024 possible sequences).
Just by chance, one client will get a perfect prediction! They might think the stockbroker is amazing, but it’s just random luck.
This shows how making many “guesses” (tests) can lead to seemingly significant results purely by chance.
FWER: The probability of making at least one Type I error (false positive) among all the hypothesis tests we do.
If all m tests are independent and all null hypotheses are true, then:
\[ \text{FWER}(\alpha) = 1 - (1 - \alpha)^m \]
FWER vs. Number of Hypotheses
This graph shows how the FWER changes as we test more hypotheses (the x-axis is on a log scale), for different values of α.
The dashed line is at 0.05.
To keep the FWER at 0.05 when testing 50 null hypotheses, we need to control the Type I error for each individual hypothesis at a much lower level (around α = 0.001).
As we test more hypotheses, we need to be more and more strict with our individual significance levels to maintain the same overall FWER.
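The formula above is easy to evaluate directly. The sketch below (plain Python/NumPy; α = 0.05 chosen to match the dashed line in the figure) also solves for the per-test level needed to hold the FWER at 0.05 when m = 50.

```python
import numpy as np

alpha = 0.05
for m in [1, 10, 50, 100, 500]:
    fwer = 1 - (1 - alpha) ** m
    print(f"m = {m:4d}: FWER = {fwer:.3f}")
# m =    1: FWER = 0.050
# m =   10: FWER = 0.401
# m =   50: FWER = 0.923
# ...

# Per-test level alpha such that 1 - (1 - alpha)^50 = 0.05:
alpha_per_test = 1 - (1 - 0.05) ** (1 / 50)
print(f"alpha per test for m = 50: {alpha_per_test:.5f}")  # about 0.001
```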
To control the FWER at a specific level (α), we need stricter rules for rejecting each hypothesis.
Common methods:
Rule: Reject H₀j if its p-value is less than α/m. Divide the desired overall significance level (α) by the number of tests (m).
Guarantee: FWER(α/m) ≤ m × α/m = α. The Bonferroni correction ensures the FWER is no larger than α.
Advantages: Very simple to apply and easy to understand; it makes no assumptions about the dependence between the p-values.
Disadvantage: It can be quite conservative, especially when m is large, so it may miss real effects (low power).
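A minimal sketch of the Bonferroni rule (Python/NumPy; the p-values are made up for illustration):

```python
import numpy as np

p_values = np.array([0.0002, 0.009, 0.012, 0.04, 0.2, 0.7])
m = len(p_values)
alpha = 0.05

# Bonferroni: reject H0_j whenever p_j < alpha / m.
reject = p_values < alpha / m
print(f"Threshold: {alpha / m:.4f}")
print(f"Rejected hypotheses: {np.where(reject)[0]}")  # only the smallest p-value survives
```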
Holm’s method is a step-down procedure, which is less conservative than Bonferroni:
Sort the p-values from smallest to largest: \(p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}\).
Find the smallest index k such that \(p_{(k)} > \frac{\alpha}{m + 1 - k}\).
Reject the null hypotheses corresponding to \(p_{(1)}, \dots, p_{(k-1)}\); if no such k exists, reject them all.
Holm’s Method and Bonferroni
Each panel shows sorted p-values from simulations with 10 hypotheses. Black dots are true null hypotheses (2 out of 10), and red dots are false null hypotheses.
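For comparison, here is a sketch of Holm’s step-down rule on the same made-up p-values (written out by hand rather than via a library, to mirror the steps above):

```python
import numpy as np

p_values = np.array([0.0002, 0.009, 0.012, 0.04, 0.2, 0.7])
m = len(p_values)
alpha = 0.05

order = np.argsort(p_values)          # indices that sort the p-values
sorted_p = p_values[order]
# Step-down thresholds: alpha / m, alpha / (m - 1), ..., alpha / 1.
thresholds = alpha / (m - np.arange(m))

# Find the first sorted p-value that exceeds its threshold; reject everything before it.
exceeds = sorted_p > thresholds
k = np.argmax(exceeds) if exceeds.any() else m
reject = np.zeros(m, dtype=bool)
reject[order[:k]] = True

# Holm rejects three hypotheses here, whereas Bonferroni rejected only one.
print(f"Rejected hypotheses: {np.where(reject)[0]}")
```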
Tukey’s Method: Specifically for comparing all possible pairs of group means in ANOVA (Analysis of Variance).
Scheffé’s Method: Allows testing any linear combination of group means, even after seeing the data. Very flexible, but also very conservative.
These are more powerful than Bonferroni/Holm in their specific situations, but less general.
FWER and Power
This plot shows how FWER and power are related in a simulation where 90% of the null hypotheses are true.
As m (number of hypotheses) increases, power goes down for a fixed FWER.
Controlling the FWER very strictly (low α) means lower power (more Type II errors). We’re less likely to find true effects.
The dashed vertical line is at FWER = 0.05.
This shows the key trade-off: being more conservative (controlling FWER) reduces our ability to find real effects.
\[ \text{FDR} = E\left(\frac{\text{Number of Type I Errors}}{\text{Number of Rejected Hypotheses}}\right) = E\left(\frac{V}{R}\right) \]
Where V is the number of Type I Errors and R is the number of Rejected Hypotheses.
Controlling the FDR is less strict than controlling the FWER. It allows for some false positives, but aims to keep their proportion low.
Motivation: In exploratory studies with many hypotheses (m), we might be okay with some false positives to discover more true effects. We’re prioritizing discovery over absolute certainty.
FDR control is good when we’re more interested in finding potential leads than in making rock-solid claims.
The Benjamini-Hochberg (BH) procedure is a popular way to control the FDR at a desired level, q:
Sort the p-values from smallest to largest: \(p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}\).
Define \(L = \max\{j : p_{(j)} < qj/m\}\).
Reject all null hypotheses whose p-values are at most \(p_{(L)}\); if no such j exists, reject nothing.
Guarantee: The BH procedure guarantees that FDR ≤ q if the p-values are independent (or have some types of mild dependence).
Key Idea: The cutoff for rejecting a hypothesis depends on all the p-values (through L), not just its own. This adaptive cutoff makes it less conservative than Bonferroni or Holm.
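A minimal sketch of the BH procedure on the same made-up p-values (Python/NumPy; q = 0.05 is an arbitrary choice):

```python
import numpy as np

p_values = np.array([0.0002, 0.009, 0.012, 0.04, 0.2, 0.7])
m = len(p_values)
q = 0.05

sorted_p = np.sort(p_values)
# BH thresholds: q * j / m for j = 1, ..., m.
thresholds = q * np.arange(1, m + 1) / m

# L is the largest j with p_(j) < q * j / m; reject all p-values <= p_(L).
below = np.where(sorted_p < thresholds)[0]
if below.size > 0:
    cutoff = sorted_p[below[-1]]
    reject = p_values <= cutoff
else:
    reject = np.zeros(m, dtype=bool)

# In this toy example BH rejects the three smallest p-values.
print(f"Rejected hypotheses: {np.where(reject)[0]}")
```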
BH and Bonferroni
This figure shows the same set of 2000 ordered p-values, comparing Bonferroni (FWER control) and BH (FDR control) thresholds.
So far, we’ve assumed we know the theoretical null distribution of the test statistic (like the t-distribution or F-distribution).
Problem: Sometimes, the theoretical null distribution is unknown, or the assumptions needed for it to be valid aren’t met (e.g., data isn’t normally distributed, small sample size).
Solution: Use re-sampling methods (like permutation tests) to estimate the null distribution from the data itself.
Suppose we want to compare the average values of two groups (X and Y).
Null Hypothesis (H₀): The distributions of X and Y are the same. This means the group labels (“X” or “Y”) are essentially random.
Idea: If H₀ is true, then randomly swapping observations between the groups shouldn’t systematically change the test statistic.
We can permute (shuffle) the data many times, calculate the test statistic each time, and build an empirical null distribution.
Calculate the test statistic: Compute the test statistic (e.g., the t-statistic) on the original data.
Repeat many times: Randomly shuffle (permute) the group labels, then re-compute the test statistic on the permuted data. The collection of permuted statistics forms the empirical null distribution.
Calculate the p-value: The p-value is the proportion of shuffled test statistics that are as extreme as (or more extreme than) the original test statistic.
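Here is a sketch of a two-sample permutation test (Python/NumPy; the simulated data, the difference in means as the test statistic, and 10,000 permutations are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=0.4, size=20)   # hypothetical group X
y = rng.normal(loc=0.0, size=25)   # hypothetical group Y

def statistic(a, b):
    # Difference in group means; a t-statistic would work just as well.
    return a.mean() - b.mean()

observed = statistic(x, y)
pooled = np.concatenate([x, y])
n_x = len(x)

n_perm = 10_000
perm_stats = np.empty(n_perm)
for b in range(n_perm):
    shuffled = rng.permutation(pooled)          # shuffle the group labels
    perm_stats[b] = statistic(shuffled[:n_x], shuffled[n_x:])

# Two-sided p-value: fraction of permuted statistics at least as extreme as the observed one.
p_value = np.mean(np.abs(perm_stats) >= abs(observed))
print(f"Observed statistic = {observed:.3f}, permutation p-value = {p_value:.4f}")
```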
We can also use re-sampling to estimate the FDR directly, without calculating p-values first.
The idea is to approximate:
\[ \text{FDR} = E\left(\frac{V}{R}\right) \approx \frac{E(V)}{R} \]
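One way to make this concrete: for a rejection cutoff c, R is the number of observed statistics with |T| ≥ c, and E(V) can be approximated by the average number of permuted (null) statistics exceeding c. The sketch below (Python/NumPy; the simulated data, cutoff, and number of permutations are illustrative) puts this plug-in estimate together.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n_x, n_y = 100, 15, 15
# Hypothetical data matrix: m features measured on two groups; the first 10 have a real effect.
X = rng.normal(size=(m, n_x))
Y = rng.normal(size=(m, n_y))
X[:10] += 1.0

def stats_all(a, b):
    # Per-feature difference in group means.
    return a.mean(axis=1) - b.mean(axis=1)

observed = np.abs(stats_all(X, Y))
data = np.concatenate([X, Y], axis=1)

n_perm = 200
null_exceed = np.zeros(n_perm)
cutoff = 0.7                                   # arbitrary rejection threshold on |statistic|
for b in range(n_perm):
    cols = rng.permutation(data.shape[1])      # shuffle the group labels (columns)
    perm = data[:, cols]
    perm_stat = np.abs(stats_all(perm[:, :n_x], perm[:, n_x:]))
    null_exceed[b] = np.sum(perm_stat >= cutoff)

R = np.sum(observed >= cutoff)                 # number of rejections at this cutoff
EV = null_exceed.mean()                        # estimated expected number of false positives
print(f"R = {R}, estimated E(V) = {EV:.1f}, estimated FDR = {EV / max(R, 1):.3f}")
```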
Theoretical and resampling null distribution comparison
This shows the theoretical and re-sampling null distributions for the 11th gene in the Khan dataset. They’re very similar. The theoretical p-value is 0.041, and the re-sampling p-value is 0.042.
Theoretical and resampling null distribution comparison
This shows the theoretical and re-sampling null distributions for the 877th gene in the Khan dataset. They’re quite different. The theoretical p-value is 0.571, and the re-sampling p-value is 0.673. This shows how re-sampling can be more accurate when the theoretical distribution isn’t a good fit.
Multiple testing: Testing many hypotheses at once increases the risk of false positives (Type I errors).
FWER control: Strict methods like Bonferroni and Holm control the chance of any false positives, but can have low power.
FDR control: Less strict, allowing some false positives to increase discovery. The Benjamini-Hochberg (BH) procedure is a common method.
Re-sampling: A flexible way to estimate null distributions and do inference when theoretical assumptions are violated or the null distribution is unknown.
How do you choose between controlling the FWER and the FDR in practice? What factors affect your decision?
What are the trade-offs in choosing a stricter (FWER) versus a more lenient (FDR) approach?
Can you think of real-world situations where multiple testing is important (e.g., genomics, A/B testing, finance)?
When would you prefer to use a re-sampling approach (like a permutation test) instead of a theoretical null distribution?
Qiu Fei (Peter) 💌 [email protected]