Part 1: Introduction to Survival Analysis

survival analysis · statistics

Author: Madhav Prashanth Ramachandran

Published: April 4, 2026

Note

A note on platform and process. Part 0 of this series was published on Medium. Starting from Part 1, I have moved to Quarto on GitHub Pages for one reason: proper LaTeX rendering. A series this mathematical deserves a math-native home.

This article is not AI generated. I used Claude for proofreading, LaTeX syntax, and occasional structural feedback. Every derivation, example, and word is mine.

Time = 0 has passed. There’s a machine fresh off the assembly line, there’s a patient that got a second shot at life after a surgery, there’s a customer who just signed up with your services. The question is not whether — machines fail, patients die, customers leave. The question is when, and what we can say about that when, given everything we know. The branch of statistics that deals with answering this and other related questions is called survival analysis.

If you are trained in basic statistics and machine learning methods, your first instinct would probably be to reach for regression. Time is a continuous variable and (linear) regression — the workhorse of statistics — is the best fit for dealing with continuous variables. Right? No. Wrong. Here are the reasons regression breaks.

1. Regression places no constraints on the range of the response variable, whereas time to failure is always non-negative.

2. In the linear regression setup, what the model estimates is the conditional expectation of the response given the covariates, i.e. \(E[Y \mid X]\). For survival analysis, the conditional expectation is often not the right quantity to estimate. In a clinical setting, you might care about “What fraction of patients survive beyond 10 years post surgery?” In a predictive maintenance setting, an engineer might ask “At what rate are machines failing after 10,000 hours of operation?” These are fundamentally different questions from “What is the average time to failure given these covariates?”, and regression, by construction, can only answer the latter.

3. And here comes the most important part. Imagine you are studying the time-to-failure of 10 machines. You run the study for 10,000 hours and stop. At the end of your study, 6 machines have failed and you know their exact failure times. The 4 machines that survived are, well, still running and all you know is that their survival time is greater than 10,000. If you naively regress on the six machines that failed and discard the ones that did not, your estimate of the time to failure is biased downward — because the machines you discarded are precisely the most durable ones in your sample. There is a name for this situation which we will get to very shortly.
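The downward bias is easy to demonstrate in a quick simulation. Here is a minimal sketch, assuming (purely for illustration) exponentially distributed lifetimes with a true mean of 8,000 hours and a study cutoff of 10,000 hours:

```python
import numpy as np

rng = np.random.default_rng(42)

true_mean = 8_000.0                            # illustrative true mean lifetime (hours)
T = rng.exponential(true_mean, size=100_000)   # latent lifetimes
C = 10_000.0                                   # study ends at 10,000 hours

failed = T <= C                     # these machines failed before the study ended
naive_mean = T[failed].mean()       # average over failures only -> biased downward

print(f"true mean lifetime:         {true_mean:.0f} h")
print(f"naive mean (failures only): {naive_mean:.0f} h")
```

With these numbers the failures-only average lands around 4,000 hours, roughly half the true mean, because the cutoff systematically removes the longest-lived machines from the sample.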

The Setup

From here on, let’s fix our language. We’ll talk about the lifetime \(T\) of a system — a machine, a patient, a customer — and the results will be general enough to apply to all of them.

Let \(T\) be the lifetime of a system — a non-negative continuous random variable with Probability Density Function (PDF) \(f\) and Cumulative Distribution Function (CDF) \(F\), which is the probability that the system’s lifetime is at most \(t\) units. \[ F(t) = P(T \leq t) \]

The Survival Function of \(T\), denoted \(S(t)\), is the probability that the system survives beyond \(t\) units of time:

\[ S(t) = P(T > t) \]

The CDF and the Survival Function are related by the following simple relationship:

\[ S(t) + F(t) = 1 \]
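As a quick numerical sanity check, here is this identity evaluated for one concrete choice of distribution (an exponential with mean 8,000 hours; the choice is illustrative only). SciPy’s `sf` method is exactly the survival function:

```python
import numpy as np
from scipy.stats import expon

dist = expon(scale=8_000)        # illustrative lifetime distribution

t = np.array([1_000, 5_000, 10_000])
F = dist.cdf(t)                  # P(T <= t)
S = dist.sf(t)                   # P(T > t), the survival function

print(S + F)                     # each entry is 1.0 up to floating point
```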

Censoring

Censoring is the reason survival analysis exists as a separate field of statistics. It is a situation where the exact time-to-event is unknown for the subject. In the most common case, all we know is that the failure had not yet occurred when we stopped observing, either because the study ended or the subject was lost to follow-up. Going back to our machine example, the four machines that are still running after 10,000 hours are censored machines.

Censoring has three types.

  1. Right censoring — the most common. The study ends before the event of interest occurs. The four machines that are still running after 10,000 hours are right censored.
  2. Left censoring — The failure happened before you started observing. You inspect a machine for the first time and find it has already failed — you know failure occurred, but not when. Or a patient comes in with a disease that has already progressed to a certain stage — you know the disease started but not when.
  3. Interval censoring — You don’t know when the system failed but it happened between two inspection times.

We will work with right censored data for pretty much the entire 12-part series.

Censoring vs. Missing Data. A missing value tells you nothing about a variable whereas a censored value conveys partial but concrete information. A machine that survived beyond 10,000 hours tells you exactly that — it lasted at least that long. That lower bound is real, and throwing it away is a statistical crime.

Here is a formal mathematical setup of right censored data. We have \(n\) independent and identically distributed samples of the form \((Y_i, \delta_i)\) where \(Y_i = \min(T_i, C_i)\) and \(\delta_i = \mathbb{1}\{T_i \leq C_i\}\). Here \(T_i\) is the lifetime of the \(i\)-th system and \(C_i\) is the censoring time for that system. \(\delta_i\) is an indicator variable which is 1 if failure happened before censoring and 0 otherwise. \(Y_i\) is the observed time: the minimum of the true lifetime and the censoring time.

You never observe \(T_i\) and \(C_i\) separately — only their minimum and whether the event got there first. Here’s what you know about the \(i\)-th system based on the observed data:

  • If \(\delta_i = 1\), failure happened before the study ended. You know exactly when.
  • If \(\delta_i = 0\), the unit was still alive when you stopped watching. You only know that \(T_i > Y_i\).
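In code, this data-generating mechanism is a one-liner each for \(Y\) and \(\delta\). A sketch with made-up latent lifetimes and an administrative censoring time (both illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10

T = rng.exponential(8_000, size=n)   # latent lifetimes (never fully observed)
C = np.full(n, 10_000.0)             # administrative censoring at study end

Y = np.minimum(T, C)                 # observed time: min of lifetime and censoring time
delta = (T <= C).astype(int)         # 1 = failure observed, 0 = censored

for y, d in zip(Y, delta):
    print(f"Y = {y:8.1f} h,  delta = {d}")
```

Note that after constructing `Y` and `delta`, any honest analysis must pretend `T` and `C` no longer exist; that is the observer’s predicament.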

Here is what a survival analysis dataset looks like in the real world. Each row is one machine — you never see \(T\) and \(C\) separately, only their minimum and whether the event occurred.

Survival data for 10 machines. \(\delta = 1\) indicates failure observed; \(\delta = 0\) indicates censoring.
| Machine ID | \(Y = \min(T, C)\) (hours) | \(\delta\) |
|------------|---------------------------:|:----------:|
| M01 | 2,341 | 1 |
| M02 | 10,000 | 0 |
| M03 | 7,823 | 1 |
| M04 | 10,000 | 0 |
| M05 | 1,205 | 1 |
| M06 | 9,441 | 1 |
| M07 | 10,000 | 0 |
| M08 | 4,678 | 1 |
| M09 | 10,000 | 0 |
| M10 | 6,102 | 1 |

The Hazard Function

We have established that \(S(t)\) tells us the probability of surviving past time \(t\). But consider a different and more pointed question: given that a system has already survived until time \(t_0\), how likely is it to fail in the next small instant?

Let \(\Delta t\) be a small time interval. The probability that a system which has survived \(t_0\) units of time fails in \((t_0, t_0 + \Delta t]\) is, by the definition of conditional probability:

\[ P(t_0 < T \leq t_0 + \Delta t \mid T > t_0) = \frac{P(t_0 < T \leq t_0 + \Delta t)}{P(T > t_0)} = \frac{F(t_0 + \Delta t) - F(t_0)}{S(t_0)} \]

Let us call this quantity \(G(t_0)\). Dividing both sides by \(\Delta t\):

\[ \frac{G(t_0)}{\Delta t} = \frac{1}{S(t_0)} \cdot \frac{F(t_0 + \Delta t) - F(t_0)}{\Delta t} \]

Now, where have you seen an expression of the form \((f(a + h) - f(a)) / h\) where \(h\) is small? That’s right — calculus. As \(\Delta t \to 0\):

\[ \frac{F(t_0 + \Delta t) - F(t_0)}{\Delta t} \to f(t_0) \]

because the derivative of the CDF is the PDF. So in the limit:

\[ \lim_{\Delta t \to 0} \frac{G(t_0)}{\Delta t} = \frac{f(t_0)}{S(t_0)} \]

This quantity has a name. It is the hazard function of \(T\) evaluated at \(t_0\), denoted \(\lambda(t_0)\):

\[ \boxed{\lambda(t_0) = \frac{f(t_0)}{S(t_0)}} \]
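As a quick numerical check of the boxed ratio, here it is evaluated for an illustrative exponential lifetime. The output is constant in \(t\), a fact that previews the memorylessness discussion coming in Part 2:

```python
import numpy as np
from scipy.stats import expon

dist = expon(scale=8_000)            # illustrative: mean lifetime 8,000 hours

t = np.array([100.0, 5_000.0, 20_000.0])
hazard = dist.pdf(t) / dist.sf(t)    # lambda(t) = f(t) / S(t)

print(hazard)                        # constant at 1/8000 = 0.000125 for all t
```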

Now I know what you’re thinking. The math is straightforward but why in the world would I care about this silly expression?

What does \(\lambda(t)\) actually mean?

Recall that for a continuous random variable, probabilities over small intervals can be approximated using the PDF:

\[ P(t_0 \leq T \leq t_0 + \Delta t) = \int_{t_0}^{t_0 + \Delta t} f(x)\, dx \approx f(t_0) \cdot \Delta t \]

So the conditional probability of failure in \([t_0, t_0 + \Delta t]\) becomes:

\[ \frac{P(t_0 \leq T \leq t_0 + \Delta t)}{S(t_0)} \approx \frac{f(t_0) \cdot \Delta t}{S(t_0)} = \lambda(t_0) \cdot \Delta t \]

This is the key insight. \(\lambda(t_0) \cdot \Delta t\) is approximately the conditional probability that a machine which has survived until \(t_0\) will fail in the next \(\Delta t\) hours. This is why \(\lambda(t)\) is called the rate of failure per unit time — it tells you, at every moment, how risky the next instant is for a system that has made it this far.
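This approximation can be checked directly: compare the exact conditional probability of failure in a short interval with \(\lambda(t_0) \cdot \Delta t\). A sketch using a Weibull lifetime (an arbitrary illustrative choice; the Weibull appears properly in Part 2):

```python
from scipy.stats import weibull_min

dist = weibull_min(c=2.0, scale=8_000)   # illustrative lifetime with increasing hazard

t0, dt = 5_000.0, 10.0                   # small interval of 10 hours

# exact conditional probability of failure in (t0, t0 + dt] given survival to t0
exact = (dist.cdf(t0 + dt) - dist.cdf(t0)) / dist.sf(t0)
# first-order approximation: lambda(t0) * dt
approx = dist.pdf(t0) / dist.sf(t0) * dt

print(f"exact:  {exact:.6f}")
print(f"approx: {approx:.6f}")           # the two closely agree
```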

Let us clear any confusion between the probability density function \(f(t)\) and the hazard function \(\lambda(t)\). The probability density function \(f(t)\) is the unconditional density of failure at \(t\). \(\lambda(t)\) conditions on survival. To make it more explicit, \(f(t)\) is a probability density, not a probability. \(f(5000) = 0.0001\) does not mean “the probability of failure at exactly 5000 hours is 0.0001.” It means the probability of failure in a small interval around 5000 hours is approximately \(f(5000) \cdot \Delta t = 0.0001 \cdot \Delta t\). Here is a concrete example to illustrate the difference.

Consider two machines — call them Machine A and Machine B. At \(t_0 = 5000\) hours, both have the same unconditional failure density: \(f(5000) = 0.0001\). A naive reading suggests they are equally “failure-prone” at this moment. They are not.

  • Machine A has \(S(5000) = 0.9\) — 90% of such machines survive to 5000 hours. Reaching this point is unremarkable.
  • Machine B has \(S(5000) = 0.1\) — only 10% of such machines survive to 5000 hours. This machine is a true survivor against the odds.

Their hazard rates at \(t_0 = 5000\):

\[ \lambda_A(5000) = \frac{f(5000)}{S_A(5000)} = \frac{0.0001}{0.9} \approx 0.000111 \]

\[ \lambda_B(5000) = \frac{f(5000)}{S_B(5000)} = \frac{0.0001}{0.1} = 0.001 \]

Machine B’s hazard rate is 9 times higher than Machine A’s at the exact same moment, despite having the same \(f(5000)\). The reason is simple: Machine B has survived longer than 90% of its kind. The few that remain are under severe stress — and the hazard function knows this. \(f(t)\) does not. This is the power of conditioning on survival. The hazard function carries information that the PDF simply cannot.
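The two-machine comparison in numbers:

```python
f_t0 = 0.0001           # shared unconditional density at t0 = 5,000 hours
S_A, S_B = 0.9, 0.1     # survival probabilities at t0

lam_A = f_t0 / S_A      # ~0.000111 per hour
lam_B = f_t0 / S_B      # 0.001 per hour

print(f"hazard ratio B/A: {lam_B / lam_A:.1f}")   # -> 9.0
```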

Another way to write \(\lambda(t)\): the cumulative hazard function

The hazard function can also be expressed in terms of the survival function. Starting from the definition: \[\lambda(t) = \frac{f(t)}{S(t)} \] We can rewrite the PDF in terms of the survival function:

As we know, the survival function is related to the CDF by \(S(t) = 1 - F(t)\). Differentiating both sides with respect to \(t\) gives us \[f(t) = -\frac{d}{dt}S(t)\]

Substituting this into the hazard function: \[\lambda(t) = \frac{-\frac{d}{dt}S(t)}{S(t)} = -\frac{1}{S(t)} \cdot \frac{d}{dt}S(t)\] This can be rewritten as: \[\lambda(t) = -\frac{d}{dt}\log S(t)\]

This expression shows that the hazard function is the negative derivative of the logarithm of the survival function. Integrating both sides with respect to \(t\) gives us:

\[\boxed{\Lambda(t) = \int_0^t \lambda(u)\,du = -\log S(t)}\]

This integral is known as the cumulative hazard function, often denoted by \(\Lambda(t)\) or \(H(t)\). This relationship between the cumulative hazard function and the survival function is fundamental in survival analysis. We can express the survival function in terms of the cumulative hazard function as follows:

\[\boxed{S(t) = e^{-\Lambda(t)}}\]
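This identity can be verified numerically: integrate a hazard function up to \(t\), exponentiate the negative, and compare with the survival function. A sketch using an illustrative Weibull hazard \(\lambda(t) = (k/\sigma)(t/\sigma)^{k-1}\):

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import weibull_min

k, sigma = 2.0, 8_000.0
dist = weibull_min(c=k, scale=sigma)

def hazard(t):
    # Weibull hazard: lambda(t) = (k/sigma) * (t/sigma)^(k-1)
    return (k / sigma) * (t / sigma) ** (k - 1)

t = 5_000.0
Lambda_t, _ = quad(hazard, 0, t)          # cumulative hazard by numerical integration

print(f"exp(-Lambda(t)): {np.exp(-Lambda_t):.6f}")
print(f"S(t) from SciPy: {dist.sf(t):.6f}")   # the two match
```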

What does the cumulative hazard function \(\Lambda(t)\) mean?

The cumulative hazard function \(\Lambda(t)\) can be interpreted as the accumulated risk of failure or death up to time \(t\) since the start of observation. Think of it as a “risk score” that increases over time. The higher \(\Lambda(t)\) is, the more likely the system is to fail by time \(t\).

However, the most important consequence of the relationship between the cumulative hazard and the survival function is that the probability of survival decays exponentially with the cumulative hazard. This means that if the cumulative hazard increases linearly over time, the survival function will decay exponentially. If the cumulative hazard increases more rapidly, the survival function will decay even faster. This exponential relationship is the bridge between the hazard function (which can be estimated from data as we will see in the later sections) and the survival function (which is often what engineers or clinicians care about).

The Unified Framework of Survival Analysis

We have met the key functions of survival analysis: the PDF \(f(t)\), the survival function \(S(t)\), the hazard function \(\lambda(t)\), and the cumulative hazard function \(\Lambda(t)\). These functions are not independent of each other. Know any one of them and you know them all.

\[f(t) \longleftrightarrow S(t) \longleftrightarrow \Lambda(t) \longleftrightarrow \lambda(t)\]

Here’s how:

From \(f(t)\): \[S(t) = 1 - \int_0^t f(u)\,du, \qquad \lambda(t) = \frac{f(t)}{S(t)}\]

From \(S(t)\): \[f(t) = -\frac{d}{dt}S(t), \qquad \lambda(t) = -\frac{d}{dt}\log S(t)\]

From \(\lambda(t)\): \[\Lambda(t) = \int_0^t \lambda(u)\,du, \qquad S(t) = e^{-\Lambda(t)}, \qquad f(t) = \lambda(t)\cdot e^{-\Lambda(t)}\]
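All of these conversion routes can be checked against each other for one concrete distribution (again an illustrative Weibull):

```python
import numpy as np
from scipy.stats import weibull_min

dist = weibull_min(c=2.0, scale=8_000)   # illustrative lifetime distribution
t = np.linspace(100, 20_000, 5)

f, S = dist.pdf(t), dist.sf(t)
lam = f / S                               # lambda = f / S
Lam = -np.log(S)                          # Lambda = -log S

print(np.allclose(S, np.exp(-Lam)))       # S = exp(-Lambda)
print(np.allclose(f, lam * np.exp(-Lam))) # f = lambda * exp(-Lambda)
```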

In survival analysis, we can choose to work with any of these functions depending on the context and the question at hand. The Cox Proportional Hazards model, for example, is a regression model that directly models the hazard function. The Kaplan-Meier estimator is a non-parametric estimator of the survival function. The choice of which function to work with is often guided by the nature of the data and the specific research question being addressed.

What’s next?

We now have the language of survival analysis. We know what the key functions are and how they relate to each other. In the next part, we will meet some commonly used distributions of survival analysis and see how they are characterized purely by their hazard functions. We will start with the simplest one: the distribution with a constant hazard function, the well-known Exponential distribution. We will understand the memorylessness property of the exponential distribution and its consequences. We will then move on to the linearly increasing hazard function of the Rayleigh distribution, and the decreasing hazard function of the Pareto distribution. We will see how the shape of the hazard function dictates the shape of the survival curve and what that means in real life. And finally, we will meet the workhorse of survival analysis and reliability engineering: the Weibull distribution, which can model a wide variety of hazard shapes and is used in a huge range of applications. We will also see how to estimate the parameters of these distributions from data and how to use them for prediction. The clock is still ticking. See you in Part 2.