Zhejiang Wanli University
Welcome to the fascinating world of survival analysis! 🕰️
In this chapter, we delve into analyzing a unique type of data: the time until an event occurs.
This is different from typical regression problems because we are not just predicting a value, but the time it takes for something to happen.
The key challenge in survival analysis is censoring.
Censoring means we don’t always know the exact time the event occurs.
Think of a medical study tracking patient survival after cancer treatment. 🏥
Some patients might still be alive at the study’s end. 🧑⚕️
We know they survived at least that long, but not their exact survival time. This is censored data. 🤔
Censored Data Example
This image illustrates different types of censoring. We’ll focus on right-censoring, where we only know the event happened after a certain time.
Let’s define some essential terms.
Survival Analysis: Statistical methods for analyzing time-to-event data.
Survival analysis focuses on the distribution of time until an event. It’s not just about whether an event happens, but when. It’s like being a detective, but instead of solving a crime, you’re solving for time! 🕵️♀️⏰
Censored Data: Observations where the event of interest has not occurred for all subjects by the end of the observation period.
Censoring means we have incomplete information about the event time. We only know a lower bound (right-censoring), upper bound (left-censoring), or an interval (interval-censoring). Think of it like reading a book with missing pages – you know something happened, but not the full story. 📖✂️
Event: The outcome of interest (e.g., death, recovery, machine failure, customer churn).
The “event” can be anything we’re interested in tracking the time to. It doesn’t have to be something negative! It could be a machine breaking down 🛠️, a customer leaving 🏃, or even a plant flowering! 🌸
Survival Time: The time until the event occurs.
This is the variable we’re ultimately trying to understand and predict. We often denote it with the letter T. It’s the unknown we’re trying to uncover! 🔍
We will explore how to deal with censoring and effectively extract information using tools like survival analysis. This is like learning a new language to decipher incomplete data! 🗣️
Survival analysis isn’t limited to medical studies. It’s a versatile tool with broad applications! 🌍
Survival analysis helps doctors estimate prognosis and evaluate treatment effectiveness. It allows them to make informed decisions and provide better care. It’s like giving doctors a crystal ball, but based on data! 🔮🩺
Understanding churn allows businesses to take proactive steps to retain customers. Knowing when a customer might leave is as valuable as knowing why. It helps businesses keep their customers happy! 😊💼
Reliability analysis helps engineers design more durable and dependable products. It’s about building things that last! Think of it as making sure your bridge doesn’t collapse! 🌉💪
Survival models can predict the likelihood of loan defaults, informing lending decisions. This helps banks and lenders make smarter choices. It’s like having a financial advisor who can see into the future! 💰🔮
The concept of censoring extends to situations with measurement limitations. If your scale only goes up to 300 lbs, and someone weighs more, you only know their weight is at least 300 lbs. It’s like measuring with a ruler that’s too short! 📏🤔
Key Insight: Survival analysis techniques allow us to work with incomplete information.
We don’t need to observe the event for every individual to gain valuable insights. This is like solving a puzzle with missing pieces! 🧩
Let’s define some core concepts mathematically.
\(T\), the true survival time: this is the underlying quantity we’re interested in, even if we don’t always observe it directly. It’s the “true” time, even if we don’t know it yet.
\(C\), the censoring time: this represents the point at which we lose track of the individual. It’s like the end of a movie, but you don’t know what happens next! 🎬❓
\(Y = \min(T, C)\), the observed time: this is the data we have to work with – the minimum of the true survival time and the censoring time. It’s what we actually observe.
\[ \delta = \begin{cases} 1 & \text{if } T \leq C \text{ (event observed)} \\ 0 & \text{if } T > C \text{ (censored)} \end{cases} \]
This indicator variable tells us whether Y represents a true survival time or a censoring time. It’s like a flag that tells us if we have complete information or not. 🚩
These variables (T, C, Y, and δ) form the basis of survival analysis. They are the building blocks of our analysis! 🧱
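As a minimal illustration (simulated data, not from any of the datasets discussed here), this is how the observed pair \((Y, \delta)\) is constructed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true survival times T and censoring times C (both unobserved in practice)
T = rng.exponential(scale=24.0, size=10)   # true time to event, in months
C = rng.exponential(scale=30.0, size=10)   # time at which follow-up would end

# What we actually get to see:
Y = np.minimum(T, C)          # observed time Y = min(T, C)
delta = (T <= C).astype(int)  # status indicator: 1 = event observed, 0 = censored

for y, d in zip(Y, delta):
    print(f"observed time = {y:5.1f} months, event observed = {bool(d)}")
```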
Let’s consider a simple example to illustrate these concepts.
Visualizing Censored Data
These are uncensored observations. The event happened during the observation period. We have complete information for these patients.
This is a right-censored observation. We know the patient survived at least until the end of the study, but we don’t know their exact survival time.
This is another example of right-censoring. We lost track of the patient before the event occurred, so we don’t know their exact survival time.
Important: Censored observations still provide valuable information! They tell us the event didn’t happen before a certain time. This is crucial for accurate analysis!
Censoring isn’t always straightforward. The reason for censoring matters.
We typically assume independent censoring: conditional on the features, the censoring time is unrelated to the survival time. In that case, the fact that someone is censored doesn’t provide extra information about their likely survival time, beyond what we already know from their features. Think of it like a coin flip – the reason you stop flipping (censoring) doesn’t affect the probability of heads or tails (survival). 🪙
By contrast, suppose patients drop out of a study because they are very sick. This is informative censoring: the censoring is related to the survival time, which violates the independence assumption. It’s like stacking the deck in a card game – the outcome is no longer random! 🃏❌
Types of Censoring
The image shows an example of right censoring.
We will focus mainly on right censoring, the most prevalent type in practice. It’s the most common scenario we encounter.
The survival curve, denoted by S(t), is a fundamental concept. It gives the probability of surviving past time t:
\[S(t) = Pr(T > t)\]
This is the cornerstone of survival analysis! It tells us the probability of “surviving” (not experiencing the event) beyond a certain time.
The larger the value of S(t), the more likely it is that the event occurs after time t.
A higher S(t) means a higher chance of not experiencing the event before time t. It’s like the probability of staying dry in the rain – the longer you stay inside, the higher the probability of staying dry! ☔
How do we estimate S(t) from data with censoring? The Kaplan-Meier estimator is a powerful tool. It’s like a statistical superhero for survival data! 🦸♀️
Let’s consider the `BrainCancer` dataset. We want to estimate S(20): the probability of surviving at least 20 months. Naive approaches fail:
Using Y directly treats censored observations as if the event occurred at the censoring time, leading to an underestimate. It’s like saying everyone who left the party early went home, even if they might have gone somewhere else! 🎉🏠❓
Discarding censored data reduces the sample size and biases the results. It’s like throwing away puzzle pieces – you can’t see the full picture! 🧩🗑️
The Kaplan-Meier estimator elegantly handles censoring to provide a more accurate estimate. It’s the clever way to deal with incomplete data!💡
The Kaplan-Meier estimator works sequentially, considering events as they unfold in time. It’s like watching a movie frame by frame! 🎞️
Let \(d_1 < d_2 < \dots < d_K\) denote the \(K\) unique death times among the non-censored patients, let \(q_k\) denote the number of patients who died at time \(d_k\), and let \(r_k\) denote the number of patients at risk (alive and still in the study) just before \(d_k\). These are the ingredients we need for our Kaplan-Meier recipe! 🍲
The Kaplan-Meier estimator formula is:
\[ \hat{S}(d_k) = \prod_{j=1}^{k} \left( \frac{r_j - q_j}{r_j} \right) \]
This looks complicated, but we’ll break it down step-by-step!
For times between death times, \(\hat{S}(t)\) remains constant, creating a step-like curve. It’s like a staircase, not a smooth slope! 🪜
The formula is derived from the law of total probability:
\[ Pr(T > d_k) = Pr(T > d_k | T > d_{k-1})Pr(T > d_{k-1}) + Pr(T>d_k|T \leq d_{k-1})Pr(T\leq d_{k-1}) \]
This is a fundamental probability rule, breaking down a complex probability into simpler parts.
Since \(d_{k-1} < d_k\), we have \(Pr(T>d_k \mid T \leq d_{k-1}) = 0\), so the formula simplifies to:
\[ S(d_k) = Pr(T > d_k) = Pr(T > d_k | T > d_{k-1})Pr(T > d_{k-1}) \]
Because a patient who has already experienced the event by time \(d_{k-1}\) cannot possibly survive past the later time \(d_k\)!
Plugging in \(S(t)\) and applying the same decomposition repeatedly: \[ S(d_k) = Pr(T>d_k|T>d_{k-1}) \times \dots \times Pr(T>d_2|T>d_1)Pr(T>d_1) \]
We’re expressing the overall survival probability as a product of conditional probabilities.
We estimate each term on the right-hand side using the fraction of the risk set at time \(d_j\) who survived past time \(d_j\):
\[\widehat{Pr}(T > d_j | T > d_{j-1}) = (r_j - q_j) / r_j\]
This is the heart of the Kaplan-Meier estimator! We’re using the observed data to estimate these conditional probabilities.
Finally, we arrive at the Kaplan-Meier estimator:
\[ \hat{S}(d_k) = \prod_{j=1}^{k} \left( \frac{r_j - q_j}{r_j} \right) \]
This is the formula we use to calculate the estimated survival probabilities!
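To make the product concrete, here is a minimal from-scratch sketch in Python (NumPy only; the function and variable names are ours, not from any particular library) that computes \(\hat{S}(d_k)\) from observed times and status indicators:

```python
import numpy as np

def kaplan_meier(y, delta):
    """Return the unique death times d_k and the Kaplan-Meier estimates S_hat(d_k).

    y     : array of observed times (min of survival and censoring time)
    delta : array of status indicators (1 = event observed, 0 = censored)
    """
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)

    death_times = np.sort(np.unique(y[delta == 1]))     # d_1 < d_2 < ... < d_K
    s_hat = []
    surv = 1.0
    for d in death_times:
        r = np.sum(y >= d)                    # r_j: number at risk just before d_j
        q = np.sum((y == d) & (delta == 1))   # q_j: number of events at d_j
        surv *= (r - q) / r                   # multiply in the conditional survival
        s_hat.append(surv)
    return death_times, np.array(s_hat)

# Tiny example: times 3 and 7 are events; 5 and 9 are censored
times, s = kaplan_meier([3, 5, 7, 9], [1, 0, 1, 0])
print(dict(zip(times, np.round(s, 3))))   # {3.0: 0.75, 7.0: 0.375}
```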
Here’s the Kaplan-Meier curve for the `BrainCancer` data:
Kaplan-Meier Curve for BrainCancer Data
We can read the estimated survival probabilities directly from the curve!
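As a sketch of how such a curve can be produced in practice with the `lifelines` library – assuming the `BrainCancer` data is available through the ISLP package’s `load_data` helper, with columns named `time` and `status` – we can also read off \(\hat{S}(20)\) directly:

```python
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
from ISLP import load_data   # assumption: ISLP provides the BrainCancer data

BrainCancer = load_data('BrainCancer')

km = KaplanMeierFitter()
km.fit(durations=BrainCancer['time'], event_observed=BrainCancer['status'])

km.plot_survival_function()   # step-shaped Kaplan-Meier curve
plt.xlabel('Months')
plt.ylabel('Estimated probability of survival')

# Estimated probability of surviving at least 20 months, S_hat(20)
print(km.predict(20))
plt.show()
```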
Often, we want to compare survival curves between groups (e.g., males vs. females). Is there a significant difference in survival between groups? 🤔
Log-Rank Test Example
The log-rank test is a statistical test for comparing survival curves. It accounts for censoring. It’s like a statistical judge deciding if there’s a real difference! ⚖️
The log-rank test examines events sequentially, like the Kaplan-Meier estimator. It’s a step-by-step comparison!
|  | Group 1 | Group 2 | Total |
|---|---|---|---|
| Died | \(q_{1k}\) | \(q_{2k}\) | \(q_k\) |
| Survived | \(r_{1k}-q_{1k}\) | \(r_{2k}-q_{2k}\) | \(r_k-q_k\) |
| Total | \(r_{1k}\) | \(r_{2k}\) | \(r_k\) |
This table summarizes the observed events and the number of individuals at risk in each group at each death time.
The log-rank test statistic (W) is calculated based on the observed and expected number of deaths in group 1:
\[ W = \frac{\sum_{k=1}^{K}(q_{1k} - \mu_k)}{\sqrt{\sum_{k=1}^{K}\operatorname{Var}(q_{1k})}} \] where \(\mu_k = \frac{r_{1k}}{r_k}q_k\) is the expected number of deaths in group 1 at time \(d_k\) under the null hypothesis, and \(\operatorname{Var}(q_{1k}) = \frac{q_k(r_{1k}/r_k)(1-r_{1k}/r_k)(r_k - q_k)}{r_k - 1}\).
This formula measures the difference between what we observed and what we would expect if there were no difference between the groups.
Under the null hypothesis (no difference in survival), W approximately follows a standard normal distribution. This allows us to calculate a p-value!
Comparing survival times of males and females in the `BrainCancer` data:
In this case, the data doesn’t provide strong evidence of a difference in survival between males and females.
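A sketch of this comparison using `lifelines.statistics.logrank_test`, under the same assumptions about the ISLP `BrainCancer` data (and assuming `sex` is coded as 'Male'/'Female'):

```python
from lifelines.statistics import logrank_test
from ISLP import load_data   # assumption: data loaded as before

BrainCancer = load_data('BrainCancer')
male = BrainCancer['sex'] == 'Male'

result = logrank_test(
    durations_A=BrainCancer.loc[male, 'time'],
    durations_B=BrainCancer.loc[~male, 'time'],
    event_observed_A=BrainCancer.loc[male, 'status'],
    event_observed_B=BrainCancer.loc[~male, 'status'],
)
# A large p-value means no strong evidence of a difference between the groups
print(result.test_statistic, result.p_value)
```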
So far, we’ve looked at describing survival curves and comparing them between groups. Now, we want to predict survival time based on covariates (features). It’s like predicting the future, but with data! 🔮📊
The hazard function, h(t), is also known as the hazard rate or force of mortality. It represents the instantaneous risk of the event occurring at time t, given survival up to time t:
\[h(t) = \lim_{\Delta t \to 0} \frac{Pr(t < T \leq t + \Delta t | T > t)}{\Delta t}\]
This is a key concept for modeling survival data!
\[ \begin{aligned} h(t) &= \lim_{\Delta t \to 0} \frac{Pr\big((t<T\le t+\Delta t)\cap (T>t)\big)}{\Delta t \, Pr(T>t)} \\ &= \lim_{\Delta t \to 0} \frac{Pr(t<T\le t+\Delta t)/\Delta t}{Pr(T>t)} \\ &= \frac{f(t)}{S(t)} \end{aligned} \]
where
\[ f(t) = \lim_{\Delta t\to 0} {Pr(t<T\le t+\Delta t)\over \Delta t} \]
We can derive the relationship between the hazard function, probability density function and survival function.
\(f(t)\) is the probability density function of \(T\): it measures the instantaneous rate at which the event occurs at time \(t\). It is different from \(S(t)\), which gives the probability of surviving past time \(t\).
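A few standard identities follow directly from these definitions (they are not derived above, but are often useful):

\[ f(t) = -\frac{d}{dt}S(t), \qquad h(t) = -\frac{d}{dt}\log S(t), \qquad S(t) = \exp\left(-\int_0^t h(u)\,du\right) \]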
The likelihood associated with the i-th observation is:
\[ L_i = \begin{cases} f(y_i) & \text{if the } i\text{-th observation is not censored} \\ S(y_i) & \text{if the } i\text{-th observation is censored} \end{cases} = f(y_i)^{\delta_i}S(y_i)^{1-\delta_i} \]
This is how we quantify the contribution of each observation to the overall likelihood.
If \(Y=y_i\) and the i-th observation is not censored, then the likelihood is the probability of dying in a tiny interval around time \(y_i\).
If the i-th observation is censored, then the likelihood is the probability of surviving at least until time \(y_i\).
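Assuming the observations are independent, the full likelihood is the product of these individual contributions, which can then be maximized over the parameters of any assumed model for \(f\) and \(S\):

\[ L = \prod_{i=1}^{n} L_i = \prod_{i=1}^{n} f(y_i)^{\delta_i} S(y_i)^{1-\delta_i} \]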
The Cox proportional hazards model is a powerful and flexible approach to model the relationship between covariates and the hazard function. It’s the workhorse of survival regression! 🐎
The Proportional Hazards Assumption:
\[h(t|x_i) = h_0(t) \exp(\sum_{j=1}^{p} x_{ij}\beta_j)\]
This is the core assumption of the Cox model!
The model breaks the hazard into a baseline hazard \(h_0(t)\) – the hazard for an individual whose features are all zero – and a factor \(\exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)\) that depends on the features. No assumption is made about the form of \(h_0(t)\).
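To see where the name “proportional hazards” comes from, take the ratio of the hazards for two individuals \(i\) and \(i'\): the baseline hazard cancels, so the ratio does not depend on \(t\).

\[ \frac{h(t \mid x_i)}{h(t \mid x_{i'})} = \frac{h_0(t)\exp\left(\sum_{j=1}^{p} x_{ij}\beta_j\right)}{h_0(t)\exp\left(\sum_{j=1}^{p} x_{i'j}\beta_j\right)} = \exp\left(\sum_{j=1}^{p}(x_{ij}-x_{i'j})\beta_j\right) \]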
Proportional Hazards Illustration
This visual check helps us assess whether the proportional hazards assumption is reasonable.
How do we estimate the coefficients, \(\beta\), in the Cox model without knowing \(h_0(t)\)? We use the partial likelihood. It’s a clever trick to get around the unknown baseline hazard!
For each observed failure time \(y_i\) (with \(\delta_i = 1\)), we consider the probability that it is precisely the \(i\)-th observation that fails, out of everyone in the risk set – those with \(y_{i'} \ge y_i\) who are still at risk. The unknown baseline hazard cancels out of this ratio:
\[ \frac{h_0(y_i) \exp(\sum_{j=1}^p x_{ij}\beta_j)}{\sum_{i':y_{i'}\ge y_i}h_0(y_i)\exp(\sum_{j=1}^{p}x_{i'j}\beta_j)} = \frac{\exp(\sum_{j=1}^p x_{ij}\beta_j)}{\sum_{i':y_{i'}\ge y_i}\exp(\sum_{j=1}^{p}x_{i'j}\beta_j)} \]
\[PL(\beta) = \prod_{i:\delta_i = 1} \frac{\exp(\sum_{j=1}^p x_{ij}\beta_j)}{\sum_{i':y_{i'}\ge y_i}\exp(\sum_{j=1}^{p}x_{i'j}\beta_j)}\]
We multiply these conditional probabilities together over all of the observed failures, assuming independence between observations. The estimate \(\hat{\beta}\) is obtained by maximizing the partial likelihood numerically; conveniently, the baseline hazard \(h_0(t)\) never needs to be estimated.
Let’s apply the Cox model to the `BrainCancer` data:
| Variable | Coefficient | Std. error | z-statistic | p-value |
|---|---|---|---|---|
| sex[Male] | 0.18 | 0.36 | 0.51 | 0.61 |
| diagnosis[LG Glioma] | 0.92 | 0.64 | 1.43 | 0.15 |
| diagnosis[HG Glioma] | 2.15 | 0.45 | 4.78 | 0.00 |
| diagnosis[Other] | 0.89 | 0.66 | 1.35 | 0.18 |
| loc[Supratentorial] | 0.44 | 0.70 | 0.63 | 0.53 |
| ki | -0.05 | 0.02 | -3.00 | <0.01 |
| gtv | 0.03 | 0.02 | 1.54 | 0.12 |
| stereo[SRT] | 0.18 | 0.60 | 0.30 | 0.77 |
Each coefficient tells us how the hazard changes with a one-unit increase in the corresponding feature, holding all other features constant.
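A sketch of fitting this model with lifelines’ `CoxPHFitter`, again assuming the ISLP `BrainCancer` data; categorical features are one-hot encoded first, so the reference categories (and hence the exact coefficients) may differ slightly from the table above:

```python
import pandas as pd
from lifelines import CoxPHFitter
from ISLP import load_data   # assumption: ISLP provides the BrainCancer data

BrainCancer = load_data('BrainCancer').dropna()

# One-hot encode the categorical features; keep time, status, ki, gtv as-is
X = pd.get_dummies(BrainCancer, columns=['sex', 'diagnosis', 'loc', 'stereo'],
                   drop_first=True, dtype=float)

cph = CoxPHFitter()
cph.fit(X, duration_col='time', event_col='status')
cph.print_summary()          # coefficients, standard errors, z-statistics, p-values

# exp(coef) is the hazard ratio for a one-unit increase in that feature
print(cph.hazard_ratios_)
```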
Next, we will introduce the `Publication` dataset, involving the time until publication of journal papers reporting the results of clinical trials funded by the National Heart, Lung, and Blood Institute.
For 244 trials, the time in months until publication is recorded.
Kaplan-Meier Curves for the Publication Data
The log-rank test suggests a weak, non-significant difference.
Now, let’s fit Cox’s proportional hazards model using all available features:
| Variable | Coefficient | Std. error | z-statistic | p-value |
|---|---|---|---|---|
| posres[Yes] | 0.55 | 0.18 | 3.02 | 0.00 |
| multi[Yes] | 0.15 | 0.31 | 0.47 | 0.64 |
| clinend[Yes] | 0.51 | 0.27 | 1.89 | 0.06 |
| mech[K01] | 1.05 | 1.06 | 1.00 | 0.32 |
| mech[K23] | -0.48 | 1.05 | -0.45 | 0.65 |
| mech[P01] | -0.31 | 0.78 | -0.40 | 0.69 |
| mech[P50] | 0.60 | 1.06 | 0.57 | 0.57 |
| mech[R01] | 0.10 | 0.32 | 0.30 | 0.76 |
| mech[R18] | 1.05 | 1.05 | 0.99 | 0.32 |
| mech[R21] | -0.05 | 1.06 | -0.04 | 0.97 |
| mech[R24, K24] | 0.81 | 1.05 | 0.77 | 0.44 |
| mech[R42] | -14.78 | 3414.38 | -0.00 | 1.00 |
| mech[R44] | -0.57 | 0.77 | -0.73 | 0.46 |
| mech[RC2] | -14.92 | 2243.60 | -0.01 | 0.99 |
| mech[U01] | -0.22 | 0.32 | -0.70 | 0.48 |
| mech[U54] | 0.47 | 1.07 | 0.44 | 0.66 |
| sampsize | 0.00 | 0.00 | 0.19 | 0.85 |
| budget | 0.00 | 0.00 | 1.67 | 0.09 |
| impact | 0.06 | 0.01 | 8.23 | 0.00 |
Adjusted Survival Curves for Publication Data
We can apply shrinkage methods (like ridge and lasso) to the Cox model. This is useful for high-dimensional data (many features) and can improve prediction accuracy.
Idea: Minimize a penalized version of the negative log partial likelihood:
\[-\log\left(\prod_{i:\delta_i=1} \frac{\exp(\sum_{j=1}^p x_{ij}\beta_j)}{\sum_{i':y_{i'}\ge y_i}\exp(\sum_{j=1}^{p}x_{i'j}\beta_j)}\right) + \lambda P(\beta)\]
This adds a penalty to the model’s complexity, encouraging simpler models with fewer features.
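One way to fit such a penalized Cox model is through the `penalizer` and `l1_ratio` arguments of lifelines’ `CoxPHFitter`. A sketch, illustrated on the BrainCancer design matrix from the earlier example (the same call applies to a design matrix built from the Publication data; the penalty value 0.1 is arbitrary and would normally be chosen by cross-validation, as discussed next):

```python
import pandas as pd
from lifelines import CoxPHFitter
from ISLP import load_data   # assumption: ISLP provides the data

# Reuse the one-hot-encoded BrainCancer design matrix from the earlier sketch
X = pd.get_dummies(load_data('BrainCancer').dropna(),
                   columns=['sex', 'diagnosis', 'loc', 'stereo'],
                   drop_first=True, dtype=float)

# l1_ratio=1.0 gives a lasso-style penalty; l1_ratio=0.0 gives a ridge-style penalty
cph_lasso = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
cph_lasso.fit(X, duration_col='time', event_col='status')

# With a strong enough penalty, many coefficients are shrunk to (essentially) zero
print(cph_lasso.params_.round(3))
```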
We apply the lasso-penalized Cox model to the `Publication` data. This helps us select the most important features for predicting publication time: only two features, `budget` and `impact`, have non-zero estimated coefficients.
Cross-Validated Partial Likelihood Deviance
Cross-validation helps us choose the optimal level of shrinkage (the value of λ).
We can use risk score to categorize the observations based on their “risk”.
For example, we use the risk score: \[ budget_i \cdot \hat{\beta}_{budget} + impact_i \cdot \hat{\beta}_{impact} \] where \(\hat{\beta}_{budget}\) and \(\hat{\beta}_{impact}\) are the coefficients estimates for these two features from the training set.
Survival Curves for the Three Risk Strata (Publication Data)
The figure shows that there is clear separation between the three strata, and that the strata are correctly ordered in terms of low, medium, and high risk of publication.
We can evaluate how well the model separates observations into different risk groups on a test set.
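Here is a sketch of those mechanics – computing a risk score and splitting a test set into terciles – using synthetic data and illustrative coefficient values (not the estimates from the Publication analysis above):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(1)

# Illustrative coefficient estimates, as if taken from a model fit on a training set
beta_budget, beta_impact = 0.004, 0.06

# Hypothetical test set with the same kinds of columns as the Publication data
test = pd.DataFrame({
    'budget': rng.uniform(0, 100, 200),    # illustrative funding amounts
    'impact': rng.uniform(0, 100, 200),    # illustrative impact measure
    'time':   rng.exponential(30, 200),    # months to publication or censoring
    'status': rng.integers(0, 2, 200),     # 1 = published, 0 = censored
})

# Risk score from the two retained features
test['risk_score'] = test['budget'] * beta_budget + test['impact'] * beta_impact

# Split into low / medium / high risk by terciles of the risk score
test['stratum'] = pd.qcut(test['risk_score'], q=3, labels=['low', 'medium', 'high'])

# One Kaplan-Meier curve per stratum to check separation and ordering
for label, group in test.groupby('stratum', observed=True):
    KaplanMeierFitter().fit(group['time'], event_observed=group['status'],
                            label=str(label)).plot_survival_function()
plt.xlabel('Months')
plt.show()
```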
邱飞(peter) 💌 [email protected]