<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Madhav&#39;s Blog</title>
<link>https://madhavpr191221.github.io/blog/</link>
<atom:link href="https://madhavpr191221.github.io/blog/index.xml" rel="self" type="application/rss+xml"/>
<description>A blog built with Quarto</description>
<generator>quarto-1.9.36</generator>
<lastBuildDate>Sat, 18 Apr 2026 18:30:00 GMT</lastBuildDate>
<item>
  <title>Part 3: Fitting Survival Distributions to Data</title>
  <dc:creator>Madhav Prashanth Ramachandran</dc:creator>
  <link>https://madhavpr191221.github.io/blog/posts/part-3-fitting-survival-distributions/</link>
  <description><![CDATA[ 





<section id="recap-of-part-2-and-whats-coming-in-part-3" class="level2">
<h2 class="anchored" data-anchor-id="recap-of-part-2-and-whats-coming-in-part-3">Recap of Part 2 and What’s Coming in Part 3</h2>
<p>In <a href="https://madhavpr191221.github.io/blog/posts/part2-distributions-in-survival-analysis/"><strong>Part 2</strong></a>, we built a vocabulary of survival distributions — the Exponential for constant hazard, the Rayleigh for linearly increasing hazard, and the Weibull as the unifying family that contains both as special cases. Along the way we took a detour through the connection between the Rayleigh, Normal, and Exponential distributions, and ended with the bathtub curve — the most common failure pattern in real engineered systems.</p>
<p>But a distribution is not a model until it is fit to data. In <strong>Part 3</strong>, we ask: given a dataset of failure times — some exact, some censored — how do we estimate the parameters of these distributions? We will derive the likelihood function under censoring from first principles, apply maximum likelihood estimation, and assess the quality of our fits using diagnostic tools, with Python code throughout using <code>lifelines</code>.</p>
<section id="maximum-likelihood-estimation" class="level3">
<h3 class="anchored" data-anchor-id="maximum-likelihood-estimation">Maximum Likelihood Estimation</h3>
<p>We begin by reviewing the maximum likelihood estimation (MLE) framework with a simple example. We then extend the framework to handle censored data, which is what makes survival analysis different from classical statistical inference: we derive the likelihood function for censored data, apply MLE to estimate the parameters of the distributions we have discussed, and simulate synthetic data to demonstrate the process in practice. If you are new to MLE and feel lost, any standard statistics textbook or online resource offers a more detailed introduction. The key idea is that MLE provides a systematic way to estimate the parameters of a statistical model: find the parameter values that maximize the likelihood of observing the given data.</p>
<section id="the-likelihood-function" class="level4">
<h4 class="anchored" data-anchor-id="the-likelihood-function">The Likelihood Function</h4>
<p>This is the heart of the matter. Likelihoods are often confused with probabilities: the two look similar in form, but they are conceptually different. In probability, we ask: given a model with known parameters, what is the chance of observing the data? In likelihood, we ask: given the observed data, how plausible is each candidate set of parameters as the source of that data? To give a concrete example, suppose we observe a failure time of 5 hours, and suppose our model says the failure time follows an exponential distribution with a mean of 10 hours. Probability asks: what is the probability of observing a failure between 4 and 6 hours given this model? Likelihood asks: across all possible values of the mean (or rate) parameter, for which value does the observed time of 5 hours seem most plausible?</p>
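<p>To make the distinction concrete in code, here is a quick numerical sketch of the example above (the grid of candidate means is an illustrative choice, not part of the model):</p>

```python
import numpy as np

# Model: exponential failure times with mean 10 hours, i.e. rate lam = 1/10.
lam = 1 / 10

# Probability question: chance of a failure between 4 and 6 hours,
# F(6) - F(4) with CDF F(t) = 1 - exp(-lam * t).
prob = np.exp(-lam * 4) - np.exp(-lam * 6)

# Likelihood question: hold the observation t = 5 fixed and scan candidate
# mean parameters; one data point contributes its pdf value
# f(t) = (1/mean) * exp(-t/mean).
t_obs = 5.0
means = np.linspace(0.5, 30, 1000)
likelihood = (1 / means) * np.exp(-t_obs / means)

best_mean = means[np.argmax(likelihood)]
print(f"P(4 <= T <= 6) under the model: {prob:.4f}")
print(f"most plausible mean for t = 5 : {best_mean:.2f}")  # peaks at mean = t_obs = 5
```

<p>The likelihood is maximized when the mean equals the single observed time, 5 hours, which previews the estimation machinery developed next.</p>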
<p>Here is the setup. We have a sequence of observed failure times <img src="https://latex.codecogs.com/png.latex?t_1,%20t_2,%20%5Cldots,%20t_n"> that we assume are independent and identically distributed (i.i.d.) samples from some distribution with a parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. In the simplest setup, all observations are exact failure times (no censoring). In that case, the likelihood function is the joint probability of observing this data given the parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta">:</p>
<p><img src="https://latex.codecogs.com/png.latex?L(%5Ctheta)%20=%20P(t_1,%20t_2,%20%5Cldots,%20t_n%20;%20%5Ctheta)"></p>
<p>Note that this is a function of <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> - we are treating the data as fixed and asking how likely different values of <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> are to have generated this data. Since most of our work will involve continuous distributions, we will often work with the probability density function (pdf) instead of the probability mass function (pmf). In that case, the likelihood becomes:</p>
<p><img src="https://latex.codecogs.com/png.latex?L(%5Ctheta)%20=%20f(t_1%20;%20%5Ctheta)%20%5Ccdot%20f(t_2%20;%20%5Ctheta)%20%5Ccdot%20%5Cldots%20%5Ccdot%20f(t_n%20;%20%5Ctheta)%20=%20%5Cprod_%7Bi=1%7D%5En%20f(t_i%20;%20%5Ctheta)"></p>
<p>Note that some authors use the conditional probability notation <img src="https://latex.codecogs.com/png.latex?P(t_i%20%7C%20%5Ctheta)"> instead of the joint probability notation <img src="https://latex.codecogs.com/png.latex?P(t_1,%20t_2,%20%5Cldots,%20t_n%20;%20%5Ctheta)">, but the meaning is the same. The likelihood function captures how well the model with parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> explains the observed data, and the MLE process involves finding the value of <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> that maximizes this likelihood function. We use semicolons in the notation to emphasize that we are treating the data as fixed and the parameter as an unknown constant. If the parameter is itself a random variable with a prior distribution, then we are in the realm of Bayesian inference, which is a different framework. In MLE, we do not assign a prior distribution to the parameter; we simply find the value that maximizes the likelihood of the observed data.</p>
</section>
<section id="the-maximum-likelihood-estimation-process" class="level4">
<h4 class="anchored" data-anchor-id="the-maximum-likelihood-estimation-process">The Maximum Likelihood Estimation Process</h4>
<p>The MLE process involves finding the value of <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> that maximizes the likelihood function <img src="https://latex.codecogs.com/png.latex?L(%5Ctheta)">. In practice, it is often easier to work with the log-likelihood function, which is the natural logarithm of the likelihood:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell(%5Ctheta)%20=%20%5Clog%20L(%5Ctheta)%20=%20%5Csum_%7Bi=1%7D%5En%20%5Clog%20f(t_i%20;%20%5Ctheta)"></p>
<p>The MLE estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D_%7B%5Ctext%7BMLE%7D%7D"> is the value of <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> that maximizes <img src="https://latex.codecogs.com/png.latex?%5Cell(%5Ctheta)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D_%7B%5Ctext%7BMLE%7D%7D%20=%20%5Carg%5Cmax_%7B%5Ctheta%7D%20%5Cell(%5Ctheta)"></p>
<p>If the likelihood function is well-behaved (e.g., it is differentiable), we can find the MLE estimate by taking the derivative of the log-likelihood with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, setting it equal to zero, and solving for <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. This gives us the critical points of the log-likelihood function, which we can then evaluate to find the maximum. To show that the critical point we find is indeed a maximum, we can check the second derivative or use other methods if necessary. Let’s see how this works in practice with a simple example.</p>
<p>Suppose we have a dataset of failure times that we believe follows an exponential distribution. The exponential distribution has a single parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda"> (the rate), and its pdf is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?f(t%20;%20%5Clambda)%20=%20%5Clambda%20e%5E%7B-%5Clambda%20t%7D"></p>
<p>Given a dataset of failure times <img src="https://latex.codecogs.com/png.latex?t_1,%20t_2,%20%5Cldots,%20t_n">, the likelihood function for the exponential distribution is:</p>
<p><img src="https://latex.codecogs.com/png.latex?L(%5Clambda)%20=%20%5Cprod_%7Bi=1%7D%5En%20%5Clambda%20e%5E%7B-%5Clambda%20t_i%7D%20=%20%5Clambda%5En%20e%5E%7B-%5Clambda%20%5Csum_%7Bi=1%7D%5En%20t_i%7D"></p>
<p>The log-likelihood function is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell(%5Clambda)%20=%20n%20%5Clog%20%5Clambda%20-%20%5Clambda%20%5Csum_%7Bi=1%7D%5En%20t_i"></p>
<p>To find the MLE estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Clambda%7D">, we take the derivative of <img src="https://latex.codecogs.com/png.latex?%5Cell(%5Clambda)"> with respect to <img src="https://latex.codecogs.com/png.latex?%5Clambda">, set it equal to zero, and solve for <img src="https://latex.codecogs.com/png.latex?%5Clambda">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bd%5Cell%7D%7Bd%5Clambda%7D%20=%20%5Cfrac%7Bn%7D%7B%5Clambda%7D%20-%20%5Csum_%7Bi=1%7D%5En%20t_i%20=%200"></p>
<p>Solving for <img src="https://latex.codecogs.com/png.latex?%5Clambda"> gives us: <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Clambda%7D_%7B%5Ctext%7BMLE%7D%7D%20=%20%5Cfrac%7Bn%7D%7B%5Csum_%7Bi=1%7D%5En%20t_i%7D"></p>
<p>This is the MLE estimate for the rate parameter of the exponential distribution based on our observed failure times. Notice that this estimate is simply the reciprocal of the sample mean of the failure times, which makes intuitive sense given the properties of the exponential distribution. We can verify that this critical point is indeed a maximum by checking the second derivative of the log-likelihood function evaluated at <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Clambda%7D_%7B%5Ctext%7BMLE%7D%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bd%5E2%5Cell%7D%7Bd%5Clambda%5E2%7D%20=%20-%5Cfrac%7Bn%7D%7B%5Clambda%5E2%7D"></p>
<p>which is negative for all <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%3E%200">. Therefore, we have a maximum at <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Clambda%7D_%7B%5Ctext%7BMLE%7D%7D">. In fact, the (only) critical point we find is a global maximum.</p>
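<p>As a sanity check on the closed-form result, the following sketch (simulated data with an assumed true rate of 0.1) compares the analytical estimator with a brute-force grid maximization of the log-likelihood:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
true_lam = 0.1                        # true rate: mean failure time = 10
t = rng.exponential(1 / true_lam, size=500)

# Closed-form MLE derived above: lambda_hat = n / sum(t_i)
lam_hat = len(t) / t.sum()

# Cross-check by maximizing ell(lam) = n*log(lam) - lam*sum(t_i) on a grid
grid = np.linspace(0.01, 0.5, 5000)
ell = len(t) * np.log(grid) - grid * t.sum()
lam_grid = grid[np.argmax(ell)]

print(f"closed-form MLE: {lam_hat:.4f}")
print(f"grid maximizer : {lam_grid:.4f}")  # agrees with the closed form
```

<p>Both approaches land on the reciprocal of the sample mean, close to the true rate of 0.1.</p>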
<p>For a vector of parameters <img src="https://latex.codecogs.com/png.latex?%5Cboldsymbol%7B%5Ctheta%7D"> with components <img src="https://latex.codecogs.com/png.latex?%5Ctheta_1,%20%5Ctheta_2,%20%5Cldots,%20%5Ctheta_p">, the MLE process is similar but we take the gradient of the log-likelihood with respect to the vector of parameters and set it equal to the zero vector to find the critical points. We can then use the Hessian matrix to check whether we have a maximum, minimum, or saddle point. For instance, the Weibull distribution has two parameters, the shape <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> and the scale <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. The MLE process would involve taking the gradient of the log-likelihood with respect to both <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> and <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, setting it equal to zero, and solving for both parameters simultaneously. Let’s derive the MLE equations for the Weibull distribution. Suppose you have a dataset of failure times <img src="https://latex.codecogs.com/png.latex?t_1,%20t_2,%20%5Cldots,%20t_n"> that you believe follows a Weibull distribution with shape parameter <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> and scale parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. The pdf of the Weibull distribution is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?f(t%20;%20%5Cgamma,%20%5Ctheta)%20=%20%5Cfrac%7B%5Cgamma%7D%7B%5Ctheta%7D%20%5Cleft(%20%5Cfrac%7Bt%7D%7B%5Ctheta%7D%20%5Cright)%5E%7B%5Cgamma%20-%201%7D%20e%5E%7B-(t/%5Ctheta)%5E%5Cgamma%7D"></p>
<p>The likelihood function for the Weibull distribution is:</p>
<p><img src="https://latex.codecogs.com/png.latex?L(%5Cgamma,%20%5Ctheta)%20=%20%5Cprod_%7Bi=1%7D%5En%20%5Cfrac%7B%5Cgamma%7D%7B%5Ctheta%7D%20%5Cleft(%20%5Cfrac%7Bt_i%7D%7B%5Ctheta%7D%20%5Cright)%5E%7B%5Cgamma%20-%201%7D%20e%5E%7B-(t_i/%5Ctheta)%5E%5Cgamma%7D"></p>
<p>The log-likelihood function is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell(%5Cgamma,%20%5Ctheta)%20=%20n%5Clog%5Cgamma%20-%20n%5Cgamma%5Clog%5Ctheta%20+%20(%5Cgamma%20-%201)%5Csum_%7Bi=1%7D%5En%20%5Clog%20t_i%20-%20%5Csum_%7Bi=1%7D%5En%20%5Cleft(%5Cfrac%7Bt_i%7D%7B%5Ctheta%7D%5Cright)%5E%5Cgamma"></p>
<p>To find the MLE estimates <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cgamma%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D">, we take the gradient of <img src="https://latex.codecogs.com/png.latex?%5Cell(%5Cgamma,%20%5Ctheta)"> with respect to both parameters, set it equal to zero, and solve for <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> and <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. This will give us a system of equations that we can solve numerically to find the MLE estimates. The resulting equations are nonlinear and do not have a closed-form solution, which is why we need numerical optimization methods to find the MLE estimates for the Weibull distribution. We will see how to do this in practice using <code>lifelines</code> in the simulation below.</p>
<div id="cell-fig-weibull-mle" class="cell" data-execution_count="1">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> lifelines <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> WeibullFitter</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> matplotlib.pyplot <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> plt</span>
<span id="cb1-4"></span>
<span id="cb1-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate synthetic Weibull data</span></span>
<span id="cb1-6">np.random.seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb1-7">true_gamma <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>   <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># shape</span></span>
<span id="cb1-8">true_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">100.0</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># scale</span></span>
<span id="cb1-9">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">200</span>  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># number of samples</span></span>
<span id="cb1-10"></span>
<span id="cb1-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Weibull random samples using inverse CDF</span></span>
<span id="cb1-12">u <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.uniform(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n)</span>
<span id="cb1-13">t <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> true_theta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>np.log(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> u))<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span>true_gamma)</span>
<span id="cb1-14"></span>
<span id="cb1-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># All observed (no censoring yet)</span></span>
<span id="cb1-16">event_observed <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.ones(n)</span>
<span id="cb1-17"></span>
<span id="cb1-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fit</span></span>
<span id="cb1-19">wf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> WeibullFitter()</span>
<span id="cb1-20">wf.fit(t, event_observed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>event_observed)</span>
<span id="cb1-21">wf.print_summary()</span></code></pre></div></div>
<div id="fig-weibull-mle" class="cell-output cell-output-display quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-weibull-mle-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">model</th>
<td>lifelines.WeibullFitter</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">number of observations</th>
<td>200</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">number of events observed</th>
<td>200</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">log-likelihood</th>
<td>-1035.10</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">hypothesis</th>
<td>lambda_ != 1, rho_ != 1</td>
</tr>
</tbody>
</table>

</div>
<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th" style="min-width: 12px"></th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">coef</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">se(coef)</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">coef lower 95%</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">coef upper 95%</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">cmp to</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">z</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">p</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">-log2(p)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">lambda_</th>
<td>97.18</td>
<td>3.63</td>
<td>90.06</td>
<td>104.29</td>
<td>1.00</td>
<td>26.50</td>
<td>&lt;0.005</td>
<td>511.45</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">rho_</th>
<td>1.99</td>
<td>0.11</td>
<td>1.78</td>
<td>2.21</td>
<td>1.00</td>
<td>8.93</td>
<td>&lt;0.005</td>
<td>60.96</td>
</tr>
</tbody>
</table>
<br><div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">AIC</th>
<td>2074.19</td>
</tr>
</tbody>
</table>

</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig quarto-uncaptioned" id="fig-weibull-mle-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1
</figcaption>
</figure>
</div>
</div>
<p><code>lifelines</code> uses <code>lambda_</code> for the scale parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and <code>rho_</code> for the shape parameter <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> — different notation, same parameters. With <img src="https://latex.codecogs.com/png.latex?n%20=%20200"> observations, the estimated parameters are already close to the true values:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cgamma%7D%20=%201.99%20%5Capprox%202.0%20=%20%5Cgamma_%7B%5Ctext%7Btrue%7D%7D,%20%5Cqquad%20%5Chat%7B%5Ctheta%7D%20=%2097.18%20%5Capprox%20100.0%20=%20%5Ctheta_%7B%5Ctext%7Btrue%7D%7D"></p>
<p>MLE recovers the true parameters with high accuracy even at modest sample sizes.</p>
<div id="cell-fig-weibull-fit" class="cell" data-execution_count="2">
<div class="cell-output cell-output-display">
<div id="fig-weibull-fit" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-weibull-fit-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://madhavpr191221.github.io/blog/posts/part-3-fitting-survival-distributions/index_files/figure-html/fig-weibull-fit-output-1.png" width="853" height="470" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-weibull-fit-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Fitted Weibull survival curve vs empirical survival curve
</figcaption>
</figure>
</div>
</div>
</div>
<p>The fitted Weibull survival curve (gold) closely tracks the empirical survival function (blue) across the entire time range. The empirical survival function is simply the fraction of systems still surviving at each time point — no distributional assumptions, just the raw data speaking for itself. The fact that the fitted curve hugs it so tightly tells us two things: MLE found good parameter estimates, and the Weibull is a reasonable model for this data. We will make this “goodness of fit” assessment more rigorous later using diagnostic tools.</p>
<p>But there is a catch — this was the easy case. All 200 observations were exact failure times. In practice, not all machines will have failed by the end of the study — some will still be running when the study ends. How do we modify the likelihood function to account for censoring? This is the key question that makes survival analysis different from classical statistical inference. We derive the censored likelihood function from first principles next.</p>
</section>
</section>
<section id="mle-with-censored-data" class="level3">
<h3 class="anchored" data-anchor-id="mle-with-censored-data">MLE with Censored Data</h3>
<p>Remember that in survival analysis, we often have censored data — we know that a system survived up to a certain time, but we don’t know the exact failure time. This is called right-censoring. We need to modify our likelihood function to account for this type of data. Recall from Part 1 of the series that we observe pairs of the form <img src="https://latex.codecogs.com/png.latex?(Y_i,%20%5Cdelta_i)"> where <img src="https://latex.codecogs.com/png.latex?Y_i%20=%20%5Cmin(T_i,%20C_i)"> is the observed time (either failure time or censoring time) and <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i"> is an indicator variable that is 1 if the event (failure) was observed and 0 if it was censored. If <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20=%201">, we know that the failure time <img src="https://latex.codecogs.com/png.latex?T_i"> was observed and is equal to <img src="https://latex.codecogs.com/png.latex?Y_i">. If <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20=%200">, we know that the system survived up to time <img src="https://latex.codecogs.com/png.latex?Y_i">, but we don’t know the exact failure time.</p>
<p>A brief notational remark before we proceed further. <img src="https://latex.codecogs.com/png.latex?T_i"> and <img src="https://latex.codecogs.com/png.latex?C_i"> are random variables representing the (unknown) failure time and censoring time of the <img src="https://latex.codecogs.com/png.latex?i">-th unit. Their realizations — the actual observed values — are denoted <img src="https://latex.codecogs.com/png.latex?t_i"> and <img src="https://latex.codecogs.com/png.latex?c_i"> in lowercase. <img src="https://latex.codecogs.com/png.latex?Y_i%20=%20%5Cmin(T_i,%20C_i)"> is also a random variable, and its realization is <img src="https://latex.codecogs.com/png.latex?y_i%20=%20%5Cmin(t_i,%20c_i)">. We use uppercase for random variables and lowercase for their observed values throughout.</p>
<p>How does this affect the likelihood function? For an observed failure time (when <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20=%201">), the contribution to the likelihood is simply the pdf evaluated at the observed time <img src="https://latex.codecogs.com/png.latex?y_i">: <img src="https://latex.codecogs.com/png.latex?f(y_i%20;%20%5Ctheta)">. Since <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20=%201"> implies <img src="https://latex.codecogs.com/png.latex?y_i%20=%20t_i">, this equals the pdf evaluated at the failure time <img src="https://latex.codecogs.com/png.latex?t_i">.</p>
<p>For a censored observation (when <img src="https://latex.codecogs.com/png.latex?%5Cdelta_j%20=%200">), the observation tells us that the failure time <img src="https://latex.codecogs.com/png.latex?T_j"> is greater than the observed time <img src="https://latex.codecogs.com/png.latex?y_j">. The contribution to the likelihood is the probability that the system survived past <img src="https://latex.codecogs.com/png.latex?y_j">, which is given by the survival function: <img src="https://latex.codecogs.com/png.latex?S(y_j%20;%20%5Ctheta)%20=%20P(T_j%20%3E%20y_j)">.</p>
<p>The full likelihood function for a dataset with both exact failure times and censored observations is the product of the contributions from all observations:</p>
<p><img src="https://latex.codecogs.com/png.latex?L(%5Ctheta)%20=%20%5Cprod_%7Bi=1%7D%5En%20%5Bf(y_i%20;%20%5Ctheta)%5D%5E%7B%5Cdelta_i%7D%20%5BS(y_i%20;%20%5Ctheta)%5D%5E%7B1%20-%20%5Cdelta_i%7D"></p>
<p>You might be wondering why we have the exponents <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i"> and <img src="https://latex.codecogs.com/png.latex?1%20-%20%5Cdelta_i"> in the likelihood function. This is a common way to write the likelihood function in survival analysis to compactly represent both types of observations. When <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20=%201">, the term <img src="https://latex.codecogs.com/png.latex?%5Bf(y_i%20;%20%5Ctheta)%5D%5E%7B%5Cdelta_i%7D"> becomes <img src="https://latex.codecogs.com/png.latex?f(y_i%20;%20%5Ctheta)"> and the term <img src="https://latex.codecogs.com/png.latex?%5BS(y_i%20;%20%5Ctheta)%5D%5E%7B1%20-%20%5Cdelta_i%7D"> becomes 1, so the contribution to the likelihood is just <img src="https://latex.codecogs.com/png.latex?f(y_i%20;%20%5Ctheta)">. When <img src="https://latex.codecogs.com/png.latex?%5Cdelta_j%20=%200">, the term <img src="https://latex.codecogs.com/png.latex?%5Bf(y_j%20;%20%5Ctheta)%5D%5E%7B%5Cdelta_j%7D"> becomes 1 and the term <img src="https://latex.codecogs.com/png.latex?%5BS(y_j%20;%20%5Ctheta)%5D%5E%7B1%20-%20%5Cdelta_j%7D"> becomes <img src="https://latex.codecogs.com/png.latex?S(y_j%20;%20%5Ctheta)">, so the contribution to the likelihood is just <img src="https://latex.codecogs.com/png.latex?S(y_j%20;%20%5Ctheta)">. This way of writing the likelihood function allows us to handle both types of observations in a unified framework. This is very similar to writing the likelihood function (or loss function) in logistic regression, where we have a binary outcome and we use the observed labels to determine which term contributes to the likelihood for each observation.</p>
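<p>Here is a small sketch of this censored likelihood for the exponential case (synthetic data; the uniform censoring distribution is an arbitrary choice for illustration). For the exponential distribution the censored MLE also has a closed form, the number of observed events divided by the total observed time, which the grid search should reproduce:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n, true_lam = 500, 0.1
t = rng.exponential(1 / true_lam, size=n)   # latent failure times
c = rng.uniform(0, 30, size=n)              # independent censoring times
y = np.minimum(t, c)                        # observed time Y_i = min(T_i, C_i)
delta = (t <= c).astype(float)              # 1 = failure observed, 0 = censored

def log_lik(lam):
    # log L = sum_i [ delta_i * log f(y_i) + (1 - delta_i) * log S(y_i) ]
    # For the exponential: log f(y) = log(lam) - lam*y and log S(y) = -lam*y.
    return np.sum(delta * (np.log(lam) - lam * y) + (1 - delta) * (-lam * y))

grid = np.linspace(0.01, 0.5, 5000)
lam_hat = grid[np.argmax([log_lik(l) for l in grid])]

# Closed form for the censored exponential MLE: (# events) / (total observed time)
closed_form = delta.sum() / y.sum()
print(f"grid MLE    : {lam_hat:.4f}")
print(f"closed form : {closed_form:.4f}")
```

<p>Each unit contributes a pdf term if it failed and a survival term if it was censored, exactly as in the product above.</p>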
<p>Finally, a point worth observing is that the likelihood function for censored data is not a simple product of pdfs, but a product that mixes pdfs and survival functions. How can we combine the pdf (which is a density) and the survival function (which is a probability) in the same likelihood function? The answer lies in the fact that we are not comparing incompatible objects. For a censored observation, <img src="https://latex.codecogs.com/png.latex?S(y_j;%5Ctheta)%20=%20P(T_j%20%3E%20y_j)"> is a genuine probability. For an exact failure, <img src="https://latex.codecogs.com/png.latex?f(y_i;%5Ctheta)"> is a density — but both contribute to the same likelihood because we are asking the same question for each observation: what parameter value <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> makes this observation most plausible? The likelihood function combines these two types of contributions in a unified framework that allows MLE to work even in the presence of censoring.</p>
<p>If you prefer an abstract view and do not want to dwell on the fact that we are mixing densities and probabilities, you can simply treat the likelihood as a product of functions and focus on finding the parameter values that maximize that product. This view strips away the details and lets you concentrate on the optimization problem, but it sacrifices the intuition for why we use the pdf for exact failures and the survival function for censored observations. The concrete view emphasizes the different contributions that the two types of observations make to the likelihood, which helps build intuition about how MLE works in the presence of censoring. Both views are valid and useful in different contexts.</p>
<p>Let’s build a little more intuition for how a true failure (<img src="https://latex.codecogs.com/png.latex?%5Cdelta%20=%201">) and a censored observation (<img src="https://latex.codecogs.com/png.latex?%5Cdelta%20=%200">) each contribute to the likelihood function.</p>
<p>Case 1: <strong><img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20=%201"> (exact failure observed)</strong>.</p>
<p>In this case, we know that the <strong>observed</strong> failure time <img src="https://latex.codecogs.com/png.latex?t_i"> (a realization of <img src="https://latex.codecogs.com/png.latex?T_i">) is less than or equal to the censoring time <img src="https://latex.codecogs.com/png.latex?C_i">. Consider the event that the failure time lies in a small interval around <img src="https://latex.codecogs.com/png.latex?t_i"> and is uncensored (<img src="https://latex.codecogs.com/png.latex?T_i%20%5Cleq%20C_i">). If the censoring time <img src="https://latex.codecogs.com/png.latex?C_i"> and the failure time <img src="https://latex.codecogs.com/png.latex?T_i"> are independent, then the joint probability of this event can be expressed as:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(T_i%20%5Cin%20%5Bt_i,%20t_i%20+%20%5CDelta%20t%5D,%20T_i%20%5Cleq%20C_i)%20=%20P(T_i%20%5Cin%20%5Bt_i,%20t_i%20+%20%5CDelta%20t%5D)%20%5Ccdot%20P(C_i%20%5Cgeq%20t_i)"> The first term <img src="https://latex.codecogs.com/png.latex?P(T_i%20%5Cin%20%5Bt_i,%20t_i%20+%20%5CDelta%20t%5D)"> is approximately equal to the pdf evaluated at <img src="https://latex.codecogs.com/png.latex?t_i">: <img src="https://latex.codecogs.com/png.latex?f(t_i%20;%20%5Ctheta)%20%5CDelta%20t">. The second term <img src="https://latex.codecogs.com/png.latex?P(C_i%20%5Cgeq%20t_i)"> is the probability that the censoring time is greater than or equal to <img src="https://latex.codecogs.com/png.latex?t_i">, which is given by the survival function of the censoring distribution evaluated at <img src="https://latex.codecogs.com/png.latex?t_i">: <img src="https://latex.codecogs.com/png.latex?S_C(t_i)">. Therefore, the contribution to the likelihood from an exact failure observation can be expressed as:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(T_i%20%5Cin%20%5Bt_i,%20t_i%20+%20%5CDelta%20t%5D,%20T_i%20%5Cleq%20C_i)%20=%20f(t_i%20;%20%5Ctheta)%20%5C,%20%5CDelta%20t%20%5Ccdot%20S_C(t_i)"></p>
<p>Since we are maximizing the likelihood with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, the term <img src="https://latex.codecogs.com/png.latex?S_C(t_i)"> can be treated as a constant: it is the survival function of the censoring distribution, representing the probability that the censoring time is greater than or equal to <img src="https://latex.codecogs.com/png.latex?t_i">, and it has nothing to do with the failure parameters <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> that we are trying to estimate. The contribution of an exact failure to the likelihood is therefore effectively proportional to the pdf evaluated at <img src="https://latex.codecogs.com/png.latex?t_i">: <img src="https://latex.codecogs.com/png.latex?f(t_i%20;%20%5Ctheta)">. This is why we use the pdf for exact failure observations in the likelihood function.</p>
<p>Case 2: <strong><img src="https://latex.codecogs.com/png.latex?%5Cdelta_j%20=%200"> (censored observation)</strong>.</p>
<p>In this case, we know that the system survived up to time <img src="https://latex.codecogs.com/png.latex?y_j">, but we don’t know the exact failure time; all we can say with certainty is that <img src="https://latex.codecogs.com/png.latex?T_j%20%3E%20y_j">. Consider the event that the failure time <img src="https://latex.codecogs.com/png.latex?T_j"> is greater than <img src="https://latex.codecogs.com/png.latex?y_j"> and the censoring time <img src="https://latex.codecogs.com/png.latex?C_j"> is in a small interval around <img src="https://latex.codecogs.com/png.latex?y_j">. If the censoring time <img src="https://latex.codecogs.com/png.latex?C_j"> and the failure time <img src="https://latex.codecogs.com/png.latex?T_j"> are independent, then the joint probability of observing a censored observation at time <img src="https://latex.codecogs.com/png.latex?y_j"> can be expressed as:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(T_j%20%3E%20y_j,%20C_j%20%5Cin%20%5By_j,%20y_j%20+%20%5CDelta%20t%5D)%20=%20P(T_j%20%3E%20y_j)%20%5Ccdot%20P(C_j%20%5Cin%20%5By_j,%20y_j%20+%20%5CDelta%20t%5D)"></p>
<p>The first term <img src="https://latex.codecogs.com/png.latex?P(T_j%20%3E%20y_j)"> is the probability that the failure time is greater than <img src="https://latex.codecogs.com/png.latex?y_j">, which is given by the survival function evaluated at <img src="https://latex.codecogs.com/png.latex?y_j">: <img src="https://latex.codecogs.com/png.latex?S(y_j%20;%20%5Ctheta)">. The second term <img src="https://latex.codecogs.com/png.latex?P(C_j%20%5Cin%20%5By_j,%20y_j%20+%20%5CDelta%20t%5D)"> is approximately equal to the pdf of the censoring distribution evaluated at <img src="https://latex.codecogs.com/png.latex?y_j">: <img src="https://latex.codecogs.com/png.latex?f_C(y_j)%20%5CDelta%20t">. Therefore, the contribution to the likelihood from a censored observation can be expressed as:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(T_j%20%3E%20y_j,%20C_j%20%5Cin%20%5By_j,%20y_j%20+%20%5CDelta%20t%5D)%20=%20S(y_j%20;%20%5Ctheta)%20%5Ccdot%20f_C(y_j)%20%5CDelta%20t"></p>
<p>Since we are maximizing the likelihood with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, the term <img src="https://latex.codecogs.com/png.latex?f_C(y_j)%20%5CDelta%20t"> does not depend on <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and can be treated as a constant. Therefore, the contribution to the likelihood from a censored observation is effectively proportional to the survival function evaluated at <img src="https://latex.codecogs.com/png.latex?y_j">: <img src="https://latex.codecogs.com/png.latex?S(y_j%20;%20%5Ctheta)">. This is why we use the survival function for censored observations in the likelihood function. Note that the term <img src="https://latex.codecogs.com/png.latex?f_C(y_j)%20%5CDelta%20t"> is the pdf of the censoring distribution, which represents the probability of observing a censoring event near time <img src="https://latex.codecogs.com/png.latex?y_j"> and therefore has nothing to do with the failure parameters <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> that we are trying to estimate.</p>
<p>Combining both cases and all the observations, the full censored likelihood is:</p>
<p><img src="https://latex.codecogs.com/png.latex?L(%5Ctheta)%20=%20%5Cprod_%7Bi=1%7D%5En%20%5Bf(y_i%20;%20%5Ctheta)%5D%5E%7B%5Cdelta_i%7D%20%5BS(y_i%20;%20%5Ctheta)%5D%5E%7B1%20-%20%5Cdelta_i%7D"></p>
<p>where the <img src="https://latex.codecogs.com/png.latex?%5CDelta%20t"> and censoring distribution terms have been absorbed into a proportionality constant that does not affect the location of the maximum.</p>
<p>The assumption of independence between the failure time and censoring time is crucial for this derivation. This is known as the <strong>non-informative censoring assumption</strong>: knowing when a subject was censored gives us no information about its failure time. This assumption is violated in some real-world scenarios, such as when a machine operator decides to stop a machine that is showing signs of imminent failure, or when a patient drops out of a clinical trial due to worsening health. In such cases, the censoring is informative and the standard MLE approach may yield biased estimates. There are methods to handle informative censoring, such as joint modeling of the failure and censoring processes, but these are beyond the scope of this post.</p>
<section id="the-log-likelihood-function-with-censoring" class="level4">
<h4 class="anchored" data-anchor-id="the-log-likelihood-function-with-censoring">The Log-Likelihood Function with Censoring</h4>
<p>The full censored likelihood function is:</p>
<p><img src="https://latex.codecogs.com/png.latex?L(%5Ctheta)%20=%20%5Cprod_%7Bi=1%7D%5En%20%5Bf(y_i%20;%20%5Ctheta)%5D%5E%7B%5Cdelta_i%7D%20%5BS(y_i%20;%20%5Ctheta)%5D%5E%7B1-%5Cdelta_i%7D"></p>
<p>Taking the natural logarithm of the likelihood function gives us the log-likelihood function:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell(%5Ctheta)%20=%20%5Csum_%7Bi=1%7D%5En%20%5Cleft%5B%20%5Cdelta_i%20%5Clog%20f(y_i%20;%20%5Ctheta)%20+%20(1%20-%20%5Cdelta_i)%20%5Clog%20S(y_i%20;%20%5Ctheta)%20%5Cright%5D"></p>
<p>Based on our previous study of the relationship between the pdf and the survival function, we can express the log-likelihood function in terms of the hazard function <img src="https://latex.codecogs.com/png.latex?%5Clambda(t%20;%20%5Ctheta)">. Recall that the pdf can be expressed as <img src="https://latex.codecogs.com/png.latex?f(t%20;%20%5Ctheta)%20=%20%5Clambda(t%20;%20%5Ctheta)%20S(t%20;%20%5Ctheta)">, so we can rewrite the log-likelihood function as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell(%5Ctheta)%20=%20%5Csum_%7Bi=1%7D%5En%20%5Cleft%5B%20%5Cdelta_i%20%5Clog%20%5Clambda(y_i%20;%20%5Ctheta)%20+%20%5Clog%20S(y_i%20;%20%5Ctheta)%20%5Cright%5D"></p>
<p>Notice that <img src="https://latex.codecogs.com/png.latex?%5Clog%20S(y_i;%5Ctheta)"> appears for <strong>every</strong> observation regardless of whether it failed or was censored — survival information contributes to the likelihood for all units. The hazard term <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20%5Clog%20%5Clambda(y_i;%5Ctheta)"> only contributes for observed failures. This is the elegance of the censored likelihood. Finally, we can express the log-likelihood function in terms of the cumulative hazard function <img src="https://latex.codecogs.com/png.latex?%5CLambda(t%20;%20%5Ctheta)"> using the relationship <img src="https://latex.codecogs.com/png.latex?S(t%20;%20%5Ctheta)%20=%20e%5E%7B-%5CLambda(t%20;%20%5Ctheta)%7D">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell(%5Ctheta)%20=%20%5Csum_%7Bi=1%7D%5En%20%5Cleft%5B%20%5Cdelta_i%20%5Clog%20%5Clambda(y_i%20;%20%5Ctheta)%20-%20%5CLambda(y_i%20;%20%5Ctheta)%20%5Cright%5D"></p>
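<p>As a quick sanity check, we can verify numerically that the pdf/survival form and the hazard/cumulative-hazard form of the censored log-likelihood agree. The sketch below uses plain NumPy and a Weibull with illustrative scale and shape values; none of these numbers come from the dataset used later in this post.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
theta, gamma = 100.0, 2.0   # illustrative Weibull scale and shape
n = 200

# Simulate Weibull failure times via the inverse CDF, with random right censoring
T = theta * (-np.log(rng.uniform(size=n))) ** (1 / gamma)
C = rng.uniform(50, 400, size=n)
y = np.minimum(T, C)
delta = (T <= C).astype(int)

# Weibull building blocks evaluated at the observed times
cum_haz = (y / theta) ** gamma                       # cumulative hazard Λ(y; θ)
haz = (gamma / theta) * (y / theta) ** (gamma - 1)   # hazard λ(y; θ)
surv = np.exp(-cum_haz)                              # survival S(y; θ) = exp(-Λ)
pdf = haz * surv                                     # pdf f(y; θ) = λ(y) S(y)

# The two algebraically equivalent forms of the censored log-likelihood
ll_pdf_form = np.sum(delta * np.log(pdf) + (1 - delta) * np.log(surv))
ll_haz_form = np.sum(delta * np.log(haz) - cum_haz)

print(np.isclose(ll_pdf_form, ll_haz_form))  # the two forms agree
```

This works because <img src="https://latex.codecogs.com/png.latex?%5Clog%20f%20=%20%5Clog%20%5Clambda%20+%20%5Clog%20S">, so the survival term for failures folds into the common <img src="https://latex.codecogs.com/png.latex?-%5CLambda(y_i%20;%20%5Ctheta)"> term shared by all observations.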
<p>Now we use our calculus machinery to find the MLE estimates. Taking the derivative of the log-likelihood function with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and setting it equal to zero gives us the MLE equations that we can solve to find the parameter estimates. The specific forms of these equations will depend on the distribution we are fitting. And sometimes, these equations will be so nonlinear and complex that we won’t be able to solve them analytically. We will see how to handle this in practice using numerical optimization methods in <code>lifelines</code> when we fit the Weibull distribution with censored data in the next section.</p>
<p>But first, let’s calculate the analytical MLE estimates for the exponential distribution with censored data to see how the presence of censoring modifies the MLE equations. The exponential distribution has a single parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda"> (the rate), and its pdf and survival function are given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?f(t%20;%20%5Clambda)%20=%20%5Clambda%20e%5E%7B-%5Clambda%20t%7D"> <img src="https://latex.codecogs.com/png.latex?S(t%20;%20%5Clambda)%20=%20e%5E%7B-%5Clambda%20t%7D"></p>
<p>The log-likelihood function for the exponential distribution with censored data is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell(%5Clambda)%20=%20%5Csum_%7Bi=1%7D%5En%20%5Cleft%5B%20%5Cdelta_i%20%5Clog%20(%5Clambda%20e%5E%7B-%5Clambda%20y_i%7D)%20+%20(1%20-%20%5Cdelta_i)%20%5Clog%20(e%5E%7B-%5Clambda%20y_i%7D)%20%5Cright%5D"></p>
<p>This simplifies to:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cell(%5Clambda)%20=%20%5Csum_%7Bi=1%7D%5En%20%5Cleft%5B%20%5Cdelta_i%20%5Clog%20%5Clambda%20-%20%5Clambda%20y_i%20%5Cright%5D"></p>
<p>Taking the derivative of the log-likelihood function with respect to <img src="https://latex.codecogs.com/png.latex?%5Clambda"> and setting it equal to zero gives us:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bd%5Cell%7D%7Bd%5Clambda%7D%20=%20%5Csum_%7Bi=1%7D%5En%20%5Cleft%5B%20%5Cfrac%7B%5Cdelta_i%7D%7B%5Clambda%7D%20-%20y_i%20%5Cright%5D%20=%200"></p>
<p>Solving for <img src="https://latex.codecogs.com/png.latex?%5Clambda"> gives us the MLE estimate for the exponential distribution with censored data:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Clambda%7D_%7B%5Ctext%7BMLE%7D%7D%20=%20%5Cfrac%7B%5Csum_%7Bi=1%7D%5En%20%5Cdelta_i%7D%7B%5Csum_%7Bi=1%7D%5En%20y_i%7D"></p>
<p>The quantity <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bi=1%7D%5En%20y_i"> is known as the <strong>total time at risk</strong>, which is the sum of the observed times for all units, regardless of whether they failed or were censored. Each unit contributes to the total time at risk based on its observed duration. The quantity <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bi=1%7D%5En%20%5Cdelta_i"> is the total number of observed failures.</p>
<p>Isn’t this marvelous? The MLE estimate for the rate parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda"> in the presence of censoring is simply the total number of observed failures (the sum of <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i">) divided by the total time at risk (the sum of <img src="https://latex.codecogs.com/png.latex?y_i">). If the data were fully observed with no censoring, then <img src="https://latex.codecogs.com/png.latex?%5Csum_%7Bi=1%7D%5En%20%5Cdelta_i%20=%20n"> and we would recover the MLE estimate for the exponential distribution without censoring: <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Clambda%7D_%7B%5Ctext%7BMLE%7D%7D%20=%20%5Cfrac%7Bn%7D%7B%5Csum_%7Bi=1%7D%5En%20y_i%7D"> (I have used <img src="https://latex.codecogs.com/png.latex?y_i"> instead of <img src="https://latex.codecogs.com/png.latex?t_i"> for consistency with the current topic). Censored units contribute to the total time at risk in the denominator but not to the failure count in the numerator, so the estimator uses the partial information they carry instead of discarding them or treating their censoring times as failures. This is the power of the censored likelihood function: it allows us to make valid inferences about the parameters of the failure distribution even when the data are incomplete due to censoring. It is left as an exercise for the reader to verify that this estimate is indeed a maximum by checking the second derivative of the log-likelihood function evaluated at <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Clambda%7D_%7B%5Ctext%7BMLE%7D%7D">.</p>
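<p>We can check this closed-form estimator in a few lines of NumPy. In the sketch below the true rate, sample size, and censoring window are illustrative assumptions; with enough units, failures divided by total time at risk should land very close to the true rate.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
true_rate = 0.01   # assumed true exponential rate (mean lifetime 100)
n = 50_000

T = rng.exponential(scale=1 / true_rate, size=n)   # latent failure times
C = rng.uniform(50, 400, size=n)                   # random right-censoring times
y = np.minimum(T, C)                               # observed durations
delta = (T <= C).astype(int)                       # 1 = failure, 0 = censored

# Closed-form censored MLE: observed failures / total time at risk
lambda_hat = delta.sum() / y.sum()
print(lambda_hat)  # close to 0.01

# Dropping censored units instead of keeping their time at risk shrinks the
# denominator while leaving the numerator unchanged, inflating the estimate
naive = delta.sum() / y[delta == 1].sum()
print(naive > lambda_hat)
```

The comparison at the end illustrates why censored observations must stay in the denominator: excluding them systematically overstates the failure rate.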
</section>
<section id="mle-with-censoring-a-synthetic-data-example" class="level4">
<h4 class="anchored" data-anchor-id="mle-with-censoring-a-synthetic-data-example">MLE with Censoring: A synthetic data example</h4>
<p>Rather than working with abstract numbers and symbols, let’s build a dataset that will accompany us for the rest of the series: a synthetic fleet of 1000 machines, each with its own age, operating conditions, and failure history.</p>
<div id="dataset-generation" class="cell" data-execution_count="3">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb2-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb2-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb2-4"></span>
<span id="cb2-5">np.random.seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">42</span>)</span>
<span id="cb2-6">n <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1000</span></span>
<span id="cb2-7"></span>
<span id="cb2-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Continuous covariates ---</span></span>
<span id="cb2-9">machine_age        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.uniform(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">15</span>, n)</span>
<span id="cb2-10">usage_intensity    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.uniform(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span>, n)</span>
<span id="cb2-11">operating_temp     <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.uniform(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">60</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">120</span>, n)</span>
<span id="cb2-12">load_factor        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.uniform(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.3</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>, n)</span>
<span id="cb2-13">rpm                <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.uniform(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">500</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3000</span>, n)</span>
<span id="cb2-14">vibration_level    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.uniform(<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>, n)</span>
<span id="cb2-15">oil_quality        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.uniform(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n)</span>
<span id="cb2-16">maintenance_count  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.randint(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">21</span>, n).astype(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">float</span>)</span>
<span id="cb2-17"></span>
<span id="cb2-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Categorical covariates ---</span></span>
<span id="cb2-19">environment  <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.choice([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'indoor'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'outdoor'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'harsh'</span>], n, p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.2</span>])</span>
<span id="cb2-20">manufacturer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.choice([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'A'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'B'</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'C'</span>], n, p<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>[<span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.4</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.35</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.25</span>])</span>
<span id="cb2-21"></span>
<span id="cb2-22"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Encode categoricals for survival time generation ---</span></span>
<span id="cb2-23">env_effect <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.where(environment <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'indoor'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>,</span>
<span id="cb2-24">             np.where(environment <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'outdoor'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.85</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.65</span>))</span>
<span id="cb2-25"></span>
<span id="cb2-26">mfr_effect <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.where(manufacturer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'A'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1.0</span>,</span>
<span id="cb2-27">             np.where(manufacturer <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'B'</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.9</span>, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.75</span>))</span>
<span id="cb2-28"></span>
<span id="cb2-29"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- True Weibull parameters ---</span></span>
<span id="cb2-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Scale theta depends on covariates</span></span>
<span id="cb2-31"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Higher stress covariates -&gt; smaller theta -&gt; shorter survival</span></span>
<span id="cb2-32">gamma_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">2.0</span></span>
<span id="cb2-33"></span>
<span id="cb2-34">theta_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (</span>
<span id="cb2-35">    <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">300</span></span>
<span id="cb2-36">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> env_effect</span>
<span id="cb2-37">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> mfr_effect</span>
<span id="cb2-38">    <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> (</span>
<span id="cb2-39">        <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span></span>
<span id="cb2-40">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.03</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> machine_age</span>
<span id="cb2-41">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.15</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> usage_intensity</span>
<span id="cb2-42">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.008</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> operating_temp</span>
<span id="cb2-43">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.10</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> load_factor</span>
<span id="cb2-44">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.0001</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> rpm</span>
<span id="cb2-45">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> vibration_level</span>
<span id="cb2-46">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.10</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> oil_quality</span>
<span id="cb2-47">        <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.02</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> maintenance_count  <span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># more maintenance -&gt; longer survival</span></span>
<span id="cb2-48">    )</span>
<span id="cb2-49">)</span>
<span id="cb2-50"></span>
<span id="cb2-51"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Generate Weibull failure times via inverse CDF ---</span></span>
<span id="cb2-52">u <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.uniform(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, n)</span>
<span id="cb2-53">T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> theta_true <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span> (<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>np.log(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span> u))<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">**</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span> <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">/</span> gamma_true)</span>
<span id="cb2-54"></span>
<span id="cb2-55"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Random right censoring ---</span></span>
<span id="cb2-56">C <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.uniform(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">400</span>, n)</span>
<span id="cb2-57"></span>
<span id="cb2-58"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Observed time and event indicator ---</span></span>
<span id="cb2-59">Y     <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.minimum(T, C)</span>
<span id="cb2-60">delta <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> (T <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> C).astype(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>)</span>
<span id="cb2-61"></span>
<span id="cb2-62"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Build dataframe ---</span></span>
<span id="cb2-63">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.DataFrame({</span>
<span id="cb2-64">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'machine_id'</span>:         [<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f'M</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>i<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:04d}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">'</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">for</span> i <span class="kw" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">in</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">range</span>(n)],</span>
<span id="cb2-65">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'machine_age'</span>:        machine_age.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb2-66">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'usage_intensity'</span>:    usage_intensity.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb2-67">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'operating_temp'</span>:     operating_temp.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb2-68">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'load_factor'</span>:        load_factor.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb2-69">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'rpm'</span>:                rpm.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>).astype(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>),</span>
<span id="cb2-70">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'vibration_level'</span>:    vibration_level.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb2-71">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'oil_quality'</span>:        oil_quality.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>),</span>
<span id="cb2-72">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'maintenance_count'</span>:  maintenance_count.astype(<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">int</span>),</span>
<span id="cb2-73">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'environment'</span>:        environment,</span>
<span id="cb2-74">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'manufacturer'</span>:       manufacturer,</span>
<span id="cb2-75">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'observed_time'</span>:      Y.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">round</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb2-76">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'event_observed'</span>:     delta</span>
<span id="cb2-77">})</span>
<span id="cb2-78"></span>
<span id="cb2-79"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Summary ---</span></span>
<span id="cb2-80"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Total machines    : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>n<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb2-81"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Observed failures : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>delta<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>delta<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">%)"</span>)</span>
<span id="cb2-82"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Censored          : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>delta)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">sum</span>()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> (</span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">100</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">*</span>(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span><span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>delta)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">%)"</span>)</span>
<span id="cb2-83"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Mean observed time: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>Y<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">.</span>mean()<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.1f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> hours"</span>)</span>
<span id="cb2-84"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">First 10 rows:"</span>)</span>
<span id="cb2-85"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(df.head(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">10</span>).to_string(index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>))</span>
<span id="cb2-86"></span>
<span id="cb2-87"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># --- Save ---</span></span>
<span id="cb2-88">os.makedirs(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'../../data'</span>, exist_ok<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>)</span>
<span id="cb2-89">df.to_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'../../data/machine_fleet.csv'</span>, index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb2-90"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">Dataset saved to ../../data/machine_fleet.csv"</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>Total machines    : 1000
Observed failures : 886 (88.6%)
Censored          : 114 (11.4%)
Mean observed time: 81.4 hours

First 10 rows:
machine_id  machine_age  usage_intensity  operating_temp  load_factor  rpm  vibration_level  oil_quality  maintenance_count environment manufacturer  observed_time  event_observed
     M0001         5.62             0.78           75.70         0.77 1930            4.240        0.648                 10     outdoor            A         111.39               1
     M0002        14.26             1.31           74.82         0.86 2514            4.998        0.172                 16     outdoor            A          68.56               1
     M0003        10.98             1.81          114.38         0.48 2400            8.618        0.872                  4      indoor            C          49.03               1
     M0004         8.98             1.60           74.97         0.74  885            3.730        0.613                  6       harsh            A          88.65               1
     M0005         2.34             1.71           76.32         0.70  873            8.762        0.157                 18      indoor            B         205.67               1
     M0006         2.34             1.49          105.56         0.88 1170            1.337        0.962                 14       harsh            B          69.69               1
     M0007         0.87             1.54           86.98         0.93 1403            7.880        0.518                  2     outdoor            B          79.77               0
     M0008        12.99             1.77          106.60         0.31 1521            8.552        0.073                 14      indoor            B          28.99               1
     M0009         9.02             0.87           63.92         0.77 2199            2.227        0.627                  4       harsh            B          37.09               1
     M0010        10.62             1.23           89.25         0.34  642            4.588        0.253                 11      indoor            A         106.60               1

Dataset saved to ../../data/machine_fleet.csv</code></pre>
</div>
</div>
<p>We generate a synthetic fleet of 1000 machines, each with 8 continuous and 2 categorical covariates representing realistic machine characteristics — age, operating temperature, vibration level, manufacturer, and so on. The true failure times are drawn from a Weibull distribution with shape <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20=%202.0"> and a scale parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> that depends on the covariates — higher stress (temperature, vibration, load) shrinks <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and shortens expected lifetime, while more maintenance history extends it.</p>
<p>For now, we work only with the observed times and the event indicator: whether each machine failed or was censored. But in reality, a machine’s lifetime depends on its physical characteristics — how old it is, how hard it runs, the temperature it operates at, how well it has been maintained. Modeling the relationship between these characteristics and the failure time is exactly what survival regression models like the Cox Proportional Hazards model are built to do — and that is where we are headed.</p>
<p>Now we fit a Weibull to the observed times with <code>lifelines</code>. Here is the code.</p>
<div id="cell-weibull-fit-censored" class="cell" data-execution_count="4">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb4-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb4-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> lifelines <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> WeibullFitter</span>
<span id="cb4-3"></span>
<span id="cb4-4">wf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> WeibullFitter()</span>
<span id="cb4-5">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'../../data/machine_fleet.csv'</span>)</span>
<span id="cb4-6"></span>
<span id="cb4-7">wf.fit(df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'observed_time'</span>], event_observed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'event_observed'</span>])</span>
<span id="cb4-8">wf.print_summary()</span></code></pre></div></div>
<div id="weibull-fit-censored" class="cell-output cell-output-display">
<div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">model</th>
<td>lifelines.WeibullFitter</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">number of observations</th>
<td>1000</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">number of events observed</th>
<td>886</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">log-likelihood</th>
<td>-4684.21</td>
</tr>
<tr class="odd">
<th data-quarto-table-cell-role="th">hypothesis</th>
<td>lambda_ != 1, rho_ != 1</td>
</tr>
</tbody>
</table>

</div>
<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<thead>
<tr class="header">
<th data-quarto-table-cell-role="th" style="min-width: 12px"></th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">coef</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">se(coef)</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">coef lower 95%</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">coef upper 95%</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">cmp to</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">z</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">p</th>
<th data-quarto-table-cell-role="th" style="min-width: 12px">-log2(p)</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">lambda_</th>
<td>97.26</td>
<td>1.80</td>
<td>93.72</td>
<td>100.79</td>
<td>1.00</td>
<td>53.41</td>
<td>&lt;0.005</td>
<td>inf</td>
</tr>
<tr class="even">
<th data-quarto-table-cell-role="th">rho_</th>
<td>1.86</td>
<td>0.05</td>
<td>1.76</td>
<td>1.95</td>
<td>1.00</td>
<td>17.69</td>
<td>&lt;0.005</td>
<td>230.18</td>
</tr>
</tbody>
</table>
<br><div>


<table class="dataframe caption-top table table-sm table-striped small" data-border="1">
<tbody>
<tr class="odd">
<th data-quarto-table-cell-role="th">AIC</th>
<td>9372.42</td>
</tr>
</tbody>
</table>

</div>
</div>
</div>
<p>886 out of 1000 machines failed during the study: an 88.6% event rate with 11.4% censored. The fitted shape parameter <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cgamma%7D%20=%201.86"> is close to, but not exactly, the true value of <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20=%202.0">. This is expected — we are fitting a single Weibull to all 1000 machines while ignoring the fact that each machine has a different <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> driven by its covariates. We are forcing one distribution to describe a heterogeneous fleet of machines. This is precisely why survival regression methods exist: to model how machine characteristics influence lifetime. We will get there eventually. But first, let’s assess how well this naive fit describes the data using diagnostic tools.</p>
</section>
</section>
<section id="diagnostic-tool-q-q-plot" class="level3">
<h3 class="anchored" data-anchor-id="diagnostic-tool-q-q-plot">Diagnostic Tool: Q-Q Plot</h3>
<p>The Q-Q plot is a graphical method to assess whether a dataset follows a particular distribution, or whether two datasets come from the same population. For our purposes, we are testing whether the dataset comes from a particular Weibull. Note that the game is rigged: we simulated the data to follow a particular Weibull, so we already know the answer. But the Q-Q plot is a diagnostic tool we will use repeatedly on real data, where we do not know the answer.</p>
<p>The basic idea of the Q-Q plot is to compare theoretical quantiles vs empirical quantiles. If the data fits a particular Weibull, these quantiles should match and the points should fall on a straight line. Here is how we’ll build the Q-Q plot for the Weibull step by step.</p>
<section id="step-1-theoretical-quantiles" class="level4">
<h4 class="anchored" data-anchor-id="step-1-theoretical-quantiles">Step 1: Theoretical Quantiles</h4>
<p>Recall that the <img src="https://latex.codecogs.com/png.latex?p">-th quantile of a probability distribution is the value <img src="https://latex.codecogs.com/png.latex?t_p"> such that <img src="https://latex.codecogs.com/png.latex?P(T%20%5Cleq%20t_p)%20=%20p"> where <img src="https://latex.codecogs.com/png.latex?p%20%5Cin%20%5B0,%201%5D">. For instance, the 0.5 quantile (also called the 50th percentile) is the median. Let’s derive the theoretical quantiles of the Weibull distribution.</p>
<p>We want to find <img src="https://latex.codecogs.com/png.latex?t_p"> such that <img src="https://latex.codecogs.com/png.latex?F(t_p)%20=%20p">. If <img src="https://latex.codecogs.com/png.latex?T%20%5Csim%20%5Ctext%7BWeibull%7D(%5Ctheta,%20%5Cgamma)">, then:</p>
<p><img src="https://latex.codecogs.com/png.latex?1%20-%20%5Cexp%5Cleft%5C%7B-%5Cleft(%5Cfrac%7Bt_p%7D%7B%5Ctheta%7D%5Cright)%5E%5Cgamma%5Cright%5C%7D%20=%20p"></p>
<p>Rearranging:</p>
<p><img src="https://latex.codecogs.com/png.latex?1%20-%20p%20=%20%5Cexp%5Cleft%5C%7B-%5Cleft(%5Cfrac%7Bt_p%7D%7B%5Ctheta%7D%5Cright)%5E%5Cgamma%5Cright%5C%7D"></p>
<p>Taking logarithms of both sides:</p>
<p><img src="https://latex.codecogs.com/png.latex?-%5Cln(1-p)%20=%20%5Cleft(%5Cfrac%7Bt_p%7D%7B%5Ctheta%7D%5Cright)%5E%5Cgamma"></p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cln%5Cleft(%5Cfrac%7B1%7D%7B1-p%7D%5Cright)%20=%20%5Cleft(%5Cfrac%7Bt_p%7D%7B%5Ctheta%7D%5Cright)%5E%5Cgamma"></p>
<p>Taking the <img src="https://latex.codecogs.com/png.latex?%5Cgamma">-th root:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cboxed%7Bt_p%20=%20%5Ctheta%20%5Cleft%5B%5Cln%5Cleft(%5Cfrac%7B1%7D%7B1-p%7D%5Cright)%5Cright%5D%5E%7B1/%5Cgamma%7D%7D"></p>
<p>This is the quantile function (inverse CDF) of the Weibull distribution. Given a probability <img src="https://latex.codecogs.com/png.latex?p">, it tells us the time by which a fraction <img src="https://latex.codecogs.com/png.latex?p"> of systems are expected to have failed.</p>
<p>Here is a special value: <img src="https://latex.codecogs.com/png.latex?p%20=%201%20-%20%5Cfrac%7B1%7D%7Be%7D">. Substituting into the quantile function:</p>
<p><img src="https://latex.codecogs.com/png.latex?t_p%20=%20%5Ctheta%20%5Cleft%5B%5Cln%5Cleft(%5Cfrac%7B1%7D%7B1-(1-%5Cfrac%7B1%7D%7Be%7D)%7D%5Cright)%5Cright%5D%5E%7B1/%5Cgamma%7D%20=%20%5Ctheta%20%5Cleft%5B%5Cln%5Cleft(%5Cfrac%7B1%7D%7B%5Cfrac%7B1%7D%7Be%7D%7D%5Cright)%5Cright%5D%5E%7B1/%5Cgamma%7D%20=%20%5Ctheta%20%5Cleft%5B%5Cln(e)%5Cright%5D%5E%7B1/%5Cgamma%7D%20=%20%5Ctheta%20%5Ccdot%201%5E%7B1/%5Cgamma%7D%20=%20%5Ctheta"></p>
<p>So <img src="https://latex.codecogs.com/png.latex?t_%7B1-1/e%7D%20=%20%5Ctheta"> — the scale parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> is exactly the <img src="https://latex.codecogs.com/png.latex?(1%20-%201/e)%20%5Capprox%2063.2">th percentile of the Weibull distribution, regardless of the shape parameter <img src="https://latex.codecogs.com/png.latex?%5Cgamma">. This confirms what we noted earlier: by time <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, approximately 63.2% of systems will have failed.</p>
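<p>The boxed quantile function is one line of code. A minimal sketch, using the fitted values from the summary above (<code>lambda_</code> plays the role of <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and <code>rho_</code> of <img src="https://latex.codecogs.com/png.latex?%5Cgamma">):</p>

```python
import numpy as np

def weibull_quantile(p, theta, gamma):
    """Time t_p by which a fraction p of systems is expected to have failed."""
    return theta * np.log(1.0 / (1.0 - p)) ** (1.0 / gamma)

theta, gamma = 97.26, 1.86                         # fitted scale and shape
print(weibull_quantile(0.5, theta, gamma))         # median lifetime
print(weibull_quantile(1 - 1/np.e, theta, gamma))  # equals theta, for any gamma
```

<p>The second call confirms the special value derived above: at <img src="https://latex.codecogs.com/png.latex?p%20=%201%20-%201/e"> the quantile is the scale parameter itself, whatever the shape.</p>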
</section>
<section id="step-2-empirical-quantiles" class="level4">
<h4 class="anchored" data-anchor-id="step-2-empirical-quantiles">Step 2: Empirical Quantiles</h4>
<p>The empirical quantiles are the ones you get directly from your dataset.</p>
<p><strong>Step 2a: Sort your data.</strong> Arrange the observed failure times in ascending order:</p>
<p><img src="https://latex.codecogs.com/png.latex?t_%7B(1)%7D%20%5Cleq%20t_%7B(2)%7D%20%5Cleq%20%5Ccdots%20%5Cleq%20t_%7B(n)%7D"></p>
<p>The notation <img src="https://latex.codecogs.com/png.latex?t_%7B(i)%7D"> denotes the <img src="https://latex.codecogs.com/png.latex?i">-th order statistic — <img src="https://latex.codecogs.com/png.latex?t_%7B(1)%7D"> is the smallest observed failure time, <img src="https://latex.codecogs.com/png.latex?t_%7B(2)%7D"> is the next smallest, and <img src="https://latex.codecogs.com/png.latex?t_%7B(n)%7D"> is the largest.</p>
<p><strong>Step 2b: Assign plotting positions.</strong> The <img src="https://latex.codecogs.com/png.latex?i">-th order statistic <img src="https://latex.codecogs.com/png.latex?t_%7B(i)%7D"> corresponds to an approximate quantile level:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D_i%20=%20%5Cfrac%7Bi%20-%200.3%7D%7Bn%20+%200.4%7D"></p>
<p>This is known as the <strong>median rank formula</strong> (Bénard’s approximation to the median ranks). It estimates the probability level associated with the <img src="https://latex.codecogs.com/png.latex?i">-th smallest observation. The corrections <img src="https://latex.codecogs.com/png.latex?-0.3"> and <img src="https://latex.codecogs.com/png.latex?+0.4"> reduce bias compared to the naive estimate <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D_i%20=%20i/n">, particularly in the tails of the distribution.</p>
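<p>A quick numeric illustration of why the corrections matter, sketched with <code>n = 10</code>: the naive estimate reaches exactly 1 at the largest observation, which would map to an infinite theoretical quantile, while the median rank positions stay strictly inside (0, 1).</p>

```python
import numpy as np

n = 10
i = np.arange(1, n + 1)            # ranks of the sorted failure times

p_naive = i / n                    # hits exactly 1.0 at i = n
p_hat = (i - 0.3) / (n + 0.4)      # median rank plotting positions

print(p_naive[-1])                 # 1.0 -> infinite Weibull quantile
print(p_hat[0], p_hat[-1])         # both strictly inside (0, 1)
```
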
</section>
<section id="step-3-constructing-the-q-q-plot" class="level4">
<h4 class="anchored" data-anchor-id="step-3-constructing-the-q-q-plot">Step 3: Constructing the Q-Q Plot</h4>
<p><strong>Step 3a:</strong> Fit the Weibull distribution to get the estimated parameters <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cgamma%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D">.</p>
<p><strong>Step 3b:</strong> For each data point <img src="https://latex.codecogs.com/png.latex?i%20=%201,%202,%20%5Cldots,%20n">:</p>
<ul>
<li>The empirical quantile is <img src="https://latex.codecogs.com/png.latex?t_%7B(i)%7D"> — the <img src="https://latex.codecogs.com/png.latex?i">-th ordered failure time from the data.</li>
<li>The approximate quantile level is <img src="https://latex.codecogs.com/png.latex?%5Chat%7Bp%7D_i%20=%20%5Cfrac%7Bi%20-%200.3%7D%7Bn%20+%200.4%7D"> from the median rank formula.</li>
<li>The theoretical quantile is:</li>
</ul>
<p><img src="https://latex.codecogs.com/png.latex?q_%7B%5Chat%7Bp%7D_i%7D%20=%20%5Chat%7B%5Ctheta%7D%5Cleft%5B%5Cln%5Cleft(%5Cfrac%7B1%7D%7B1%20-%20%5Chat%7Bp%7D_i%7D%5Cright)%5Cright%5D%5E%7B1/%5Chat%7B%5Cgamma%7D%7D"></p>
<p><strong>Step 3c:</strong> Plot the pairs <img src="https://latex.codecogs.com/png.latex?(q_%7B%5Chat%7Bp%7D_i%7D,%5C%20t_%7B(i)%7D)"> for each <img src="https://latex.codecogs.com/png.latex?i%20=%201,%202,%20%5Cldots,%20n">.</p>
<p><strong>What to look for:</strong></p>
<ul>
<li><strong>Straight line through the origin with slope 1</strong> — perfect fit. The theoretical and empirical quantiles agree at every probability level.</li>
<li><strong>Points above the diagonal</strong> — the data has heavier tails than the Weibull predicts. Some machines lasted much longer than expected.</li>
<li><strong>Points below the diagonal</strong> — the data has lighter tails than the Weibull predicts. Failures are happening earlier than the model expects.</li>
<li><strong>S-shaped curve</strong> — the data comes from a mixed distribution or the shape parameter <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> is wrong.</li>
</ul>
<p>We will make these observations more precise and illuminating with a worked example shortly. Do not panic. For now we construct the Q-Q plot using only the observed failures, ignoring censored observations. This is a simplification — a more rigorous approach uses the Kaplan-Meier estimator for the empirical quantiles, which we will introduce in Part 4.</p>
<p>Let’s see how well our fitted Weibull describes the machine fleet data. Here is the Q-Q plot using the 886 observed failures.</p>
<div id="cell-fig-qqplot-fleet" class="cell" data-execution_count="5">
<div class="cell-output cell-output-display">
<div id="fig-qqplot-fleet" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-qqplot-fleet-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://madhavpr191221.github.io/blog/posts/part-3-fitting-survival-distributions/index_files/figure-html/fig-qqplot-fleet-output-1.png" width="663" height="662" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-qqplot-fleet-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3: Q-Q plot: fitted Weibull vs observed failure times (machine fleet)
</figcaption>
</figure>
</div>
</div>
</div>
<p>The Q-Q plot tells a clear story. Most points fall below the <img src="https://latex.codecogs.com/png.latex?y%20=%20x"> diagonal — the empirical quantiles are smaller than the theoretical quantiles at the same probability level, i.e., <img src="https://latex.codecogs.com/png.latex?t_%7B(i)%7D%20%3C%20q_%7B%5Chat%7Bp%7D_i%7D">. What does that mean mathematically? Since the CDF <img src="https://latex.codecogs.com/png.latex?F"> is monotone increasing, this implies:</p>
<p><img src="https://latex.codecogs.com/png.latex?F(t_%7B(i)%7D)%20%3C%20F(q_%7B%5Chat%7Bp%7D_i%7D)%20=%20%5Chat%7Bp%7D_i%20%5Cquad%20%5CLongleftrightarrow%20%5Cquad%20S(t_%7B(i)%7D)%20%3E%201%20-%20%5Chat%7Bp%7D_i"></p>
<p>The fitted Weibull assigns too high a survival probability at the actual observed failure times: it thinks more machines should still be running than actually are. In simple words, the Weibull estimator is too optimistic. It predicts machines will last longer than they actually do.</p>
<p>This is the signature of a heterogeneous fleet being forced into a single distribution. Each machine has its own true <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> driven by its covariates — usage, age, temperature, and so on. A single Weibull cannot capture all of that simultaneously, and the Q-Q plot is telling us exactly that. This is precisely why survival regression exists, and we will explore it in later parts.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>A note on other Q-Q plot patterns
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<ul>
<li><p><strong>Points above the diagonal</strong> — <img src="https://latex.codecogs.com/png.latex?t_%7B(i)%7D%20%3E%20q_%7B%5Chat%7Bp%7D_i%7D">, which implies <img src="https://latex.codecogs.com/png.latex?F(t_%7B(i)%7D)%20%3E%20%5Chat%7Bp%7D_i"> and <img src="https://latex.codecogs.com/png.latex?S(t_%7B(i)%7D)%20%3C%201%20-%20%5Chat%7Bp%7D_i">. The Weibull underestimates survival — it predicts more failures by time <img src="https://latex.codecogs.com/png.latex?t_%7B(i)%7D"> than actually occurred. The data has heavier tails than the model expects.</p></li>
<li><p><strong>S-shaped curve</strong> — points below the diagonal in the lower tail and above in the upper tail (or vice versa). The shape parameter <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> is likely misspecified, or the data comes from a mixture of two distinct populations with different failure regimes.</p></li>
</ul>
</div>
</div>
</div>
</section>
</section>
<section id="empirical-cdf-and-the-binomial-connection" class="level3">
<h3 class="anchored" data-anchor-id="empirical-cdf-and-the-binomial-connection">Empirical CDF and the Binomial Connection</h3>
<p>I cannot help but talk about a beautiful connection between the empirical CDF and the Binomial distribution. Here is how it goes.</p>
<p>Given data <img src="https://latex.codecogs.com/png.latex?t_1,%20t_2,%20%5Cldots,%20t_n">, the empirical CDF estimates the true CDF from the data alone — no distributional assumptions, no parameters to fit. For any time <img src="https://latex.codecogs.com/png.latex?t">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_n(t)%20=%20%5Cfrac%7B%5Ctext%7Bnumber%20of%20observations%7D%20%5Cleq%20t%7D%7Bn%7D"></p>
<p>Using indicator notation, this can be written compactly as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_n(t)%20=%20%5Cfrac%7B1%7D%7Bn%7D%20%5Csum_%7Bi=1%7D%5E%7Bn%7D%20%5Cmathbf%7B1%7D%5C%7Bt_i%20%5Cleq%20t%5C%7D"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7B1%7D%5C%7Bt_i%20%5Cleq%20t%5C%7D"> is 1 if the <img src="https://latex.codecogs.com/png.latex?i">-th observation is less than or equal to <img src="https://latex.codecogs.com/png.latex?t">, and 0 otherwise.</p>
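<p>The definition translates directly into code. A minimal sketch, using the five failure times from the worked example below:</p>

```python
import numpy as np

def ecdf(data, t):
    """Empirical CDF at time t: fraction of observations <= t."""
    data = np.asarray(data, dtype=float)
    return np.count_nonzero(data <= t) / data.size

times = [200, 350, 400, 550, 600]   # observed failure times (hours)
print(ecdf(times, 100))   # 0.0  (no failures yet)
print(ecdf(times, 400))   # 0.6  (three of five machines have failed)
print(ecdf(times, 700))   # 1.0  (all machines have failed)
```
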
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-2-contents" aria-controls="callout-2" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Indicator Notation
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-2" class="callout-2-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>The indicator function <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7B1%7D%5C%7BA%5C%7D"> for an event <img src="https://latex.codecogs.com/png.latex?A"> is defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7B1%7D%5C%7BA%5C%7D%20=%20%5Cbegin%7Bcases%7D%201%20&amp;%20%5Ctext%7Bif%20%7D%20A%20%5Ctext%7B%20occurs%7D%20%5C%5C%200%20&amp;%20%5Ctext%7Botherwise%7D%20%5Cend%7Bcases%7D"></p>
<p>An equivalent notation uses a set <img src="https://latex.codecogs.com/png.latex?A"> and a point <img src="https://latex.codecogs.com/png.latex?x">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7B1%7D_A(x)%20=%20%5Cbegin%7Bcases%7D%201%20&amp;%20%5Ctext%7Bif%20%7D%20x%20%5Cin%20A%20%5C%5C%200%20&amp;%20%5Ctext%7Botherwise%7D%20%5Cend%7Bcases%7D"></p>
<p>In our case, <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7B1%7D%5C%7Bt_i%20%5Cleq%20t%5C%7D"> is shorthand for <img src="https://latex.codecogs.com/png.latex?%5Cmathbf%7B1%7D_%7B(-%5Cinfty,%20t%5D%7D(t_i)"> — it equals 1 if the observation <img src="https://latex.codecogs.com/png.latex?t_i"> falls in the set <img src="https://latex.codecogs.com/png.latex?(-%5Cinfty,%20t%5D">, and 0 otherwise.</p>
</div>
</div>
</div>
<p>Here is a simple example. Consider five observed failure times: <img src="https://latex.codecogs.com/png.latex?t_1%20=%20200,%5C%20t_2%20=%20350,%5C%20t_3%20=%20400,%5C%20t_4%20=%20550,%5C%20t_5%20=%20600"> hours.</p>
<ul>
<li>For <img src="https://latex.codecogs.com/png.latex?t%20%3C%20200">: no failures yet, so <img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_5(t)%20=%200"></li>
<li>For <img src="https://latex.codecogs.com/png.latex?200%20%5Cleq%20t%20%3C%20350">: one failure observed, so <img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_5(t)%20=%20%5Cfrac%7B1%7D%7B5%7D%20=%200.20"></li>
<li>For <img src="https://latex.codecogs.com/png.latex?350%20%5Cleq%20t%20%3C%20400">: two failures observed, so <img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_5(t)%20=%20%5Cfrac%7B2%7D%7B5%7D%20=%200.40"></li>
<li>For <img src="https://latex.codecogs.com/png.latex?400%20%5Cleq%20t%20%3C%20550">: three failures observed, so <img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_5(t)%20=%20%5Cfrac%7B3%7D%7B5%7D%20=%200.60"></li>
<li>For <img src="https://latex.codecogs.com/png.latex?550%20%5Cleq%20t%20%3C%20600">: four failures observed, so <img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_5(t)%20=%20%5Cfrac%7B4%7D%7B5%7D%20=%200.80"></li>
<li>For <img src="https://latex.codecogs.com/png.latex?t%20%5Cgeq%20600">: all five failures observed, so <img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_5(t)%20=%20%5Cfrac%7B5%7D%7B5%7D%20=%201.0"></li>
</ul>
<p>The empirical CDF is a step function — it jumps by <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bn%7D"> at each observed failure time and stays flat in between. Here is a plot of the empirical CDF for this example.</p>
<div id="cell-fig-ecdf-example" class="cell" data-execution_count="6">
<div class="cell-output cell-output-display">
<div id="fig-ecdf-example" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-ecdf-example-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://madhavpr191221.github.io/blog/posts/part-3-fitting-survival-distributions/index_files/figure-html/fig-ecdf-example-output-1.png" width="756" height="470" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-ecdf-example-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;4: Empirical CDF for a simple example with 5 failure times
</figcaption>
</figure>
</div>
</div>
</div>
<p>Each jump represents one observed failure. The empirical CDF makes no assumptions about the underlying distribution — it simply counts. This is what makes it so powerful as a diagnostic tool, and as you will see in Part 4, it is also the foundation of the Kaplan-Meier estimator.</p>
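<p>The counting logic above translates directly into code. A minimal sketch (the function name <code>ecdf</code> is ours, not from any library), using the five failure times from the worked example:</p>

```python
import numpy as np

def ecdf(times, t):
    """Empirical CDF: fraction of observations less than or equal to t."""
    times = np.asarray(times)
    return np.mean(times <= t)

failure_times = [200, 350, 400, 550, 600]  # hours, from the worked example

for t in [100, 250, 375, 450, 575, 700]:
    print(f"F_hat({t}) = {ecdf(failure_times, t):.2f}")
```

<p>Evaluating at points between the jumps reproduces the values in the bullet list: 0.00 before 200 hours, 0.20, 0.40, 0.60, 0.80 in the intermediate intervals, and 1.00 from 600 hours onward.</p>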
<section id="the-binomial-connection" class="level4">
<h4 class="anchored" data-anchor-id="the-binomial-connection">The Binomial Connection</h4>
<p>Fix a time value <img src="https://latex.codecogs.com/png.latex?t">. For the <img src="https://latex.codecogs.com/png.latex?i">-th observation, with corresponding random variable <img src="https://latex.codecogs.com/png.latex?T_i">, define the indicator random variable:</p>
<p><img src="https://latex.codecogs.com/png.latex?X_i%20=%20%5Cbegin%7Bcases%7D%201%20&amp;%20%5Ctext%7Bif%20%7D%20T_i%20%5Cleq%20t%20%5C%5C%200%20&amp;%20%5Ctext%7Botherwise%7D%20%5Cend%7Bcases%7D"></p>
<p>Since <img src="https://latex.codecogs.com/png.latex?X_i"> is a random variable, it has an associated probability. What is <img src="https://latex.codecogs.com/png.latex?P(X_i%20=%201)">? It is simply the probability that <img src="https://latex.codecogs.com/png.latex?T_i%20%5Cleq%20t">, which is exactly the CDF of <img src="https://latex.codecogs.com/png.latex?T_i"> evaluated at <img src="https://latex.codecogs.com/png.latex?t">:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(X_i%20=%201)%20=%20P(T_i%20%5Cleq%20t)%20=%20F(t)"></p>
<p>Doesn’t this remind you of a coin toss? If we think of “failure by time <img src="https://latex.codecogs.com/png.latex?t">” as analogous to obtaining heads, then <img src="https://latex.codecogs.com/png.latex?F(t)"> plays the role of the probability of heads. The indicator <img src="https://latex.codecogs.com/png.latex?X_i"> is therefore a Bernoulli random variable with parameter <img src="https://latex.codecogs.com/png.latex?F(t)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?X_i%20%5Csim%20%5Ctext%7BBernoulli%7D(F(t))"></p>
<p>Now let <img src="https://latex.codecogs.com/png.latex?X%20=%20%5Csum_%7Bi=1%7D%5En%20X_i">. This counts the total number of observations that have failed by time <img src="https://latex.codecogs.com/png.latex?t">. Since the failure times <img src="https://latex.codecogs.com/png.latex?T_1,%20T_2,%20%5Cldots,%20T_n"> are independent and identically distributed, the indicators <img src="https://latex.codecogs.com/png.latex?X_1,%20X_2,%20%5Cldots,%20X_n"> are independent Bernoulli<img src="https://latex.codecogs.com/png.latex?(F(t))"> random variables. A sum of <img src="https://latex.codecogs.com/png.latex?n"> independent Bernoulli<img src="https://latex.codecogs.com/png.latex?(p)"> random variables follows a Binomial<img src="https://latex.codecogs.com/png.latex?(n,%20p)"> distribution, so:</p>
<p><img src="https://latex.codecogs.com/png.latex?X%20%5Csim%20%5Ctext%7BBinomial%7D(n,%5C%20F(t))"></p>
<p>The expectation and variance of <img src="https://latex.codecogs.com/png.latex?X"> are:</p>
<p><img src="https://latex.codecogs.com/png.latex?E%5BX%5D%20=%20n%20%5Ccdot%20F(t),%20%5Cqquad%20%5Ctext%7BVar%7D%5BX%5D%20=%20n%20%5Ccdot%20F(t)%20%5Ccdot%20(1%20-%20F(t))%20=%20n%20%5Ccdot%20F(t)%20%5Ccdot%20S(t)"></p>
</section>
<section id="connection-with-the-empirical-cdf" class="level4">
<h4 class="anchored" data-anchor-id="connection-with-the-empirical-cdf">Connection with the Empirical CDF</h4>
<p>The empirical CDF is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_n(t)%20=%20%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi=1%7D%5En%20%5Cmathbf%7B1%7D%5C%7Bt_i%20%5Cleq%20t%5C%7D%20=%20%5Cfrac%7B1%7D%7Bn%7D%5Csum_%7Bi=1%7D%5En%20X_i%20=%20%5Cfrac%7BX%7D%7Bn%7D"></p>
<p>Since <img src="https://latex.codecogs.com/png.latex?X%20%5Csim%20%5Ctext%7BBinomial%7D(n,%20F(t))">, we have:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(X%20=%20k)%20=%20%5Cbinom%7Bn%7D%7Bk%7D%20%5BF(t)%5D%5Ek%20%5B1%20-%20F(t)%5D%5E%7Bn-k%7D"></p>
<p><img src="https://latex.codecogs.com/png.latex?X"> takes integer values <img src="https://latex.codecogs.com/png.latex?%5C%7B0,%201,%20%5Cldots,%20n%5C%7D">, so <img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_n(t)%20=%20X/n"> takes values <img src="https://latex.codecogs.com/png.latex?%5Cleft%5C%7B0,%20%5Cfrac%7B1%7D%7Bn%7D,%20%5Cfrac%7B2%7D%7Bn%7D,%20%5Cldots,%201%5Cright%5C%7D"> — and the difference between consecutive values is exactly the jump size <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7Bn%7D"> we observed in the step function plot. The probability that the empirical CDF takes the value <img src="https://latex.codecogs.com/png.latex?k/n"> is:</p>
<p><img src="https://latex.codecogs.com/png.latex?P%5Cleft(%5Chat%7BF%7D_n(t)%20=%20%5Cfrac%7Bk%7D%7Bn%7D%5Cright)%20=%20P(X%20=%20k)%20=%20%5Cbinom%7Bn%7D%7Bk%7D%20%5BF(t)%5D%5Ek%20%5B1%20-%20F(t)%5D%5E%7Bn-k%7D"></p>
<p>The moments of <img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_n(t)"> follow directly from those of <img src="https://latex.codecogs.com/png.latex?X">:</p>
<p><img src="https://latex.codecogs.com/png.latex?E%5B%5Chat%7BF%7D_n(t)%5D%20=%20%5Cfrac%7BE%5BX%5D%7D%7Bn%7D%20=%20%5Cfrac%7Bn%20%5Ccdot%20F(t)%7D%7Bn%7D%20=%20F(t)"></p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5B%5Chat%7BF%7D_n(t)%5D%20=%20%5Cfrac%7B%5Ctext%7BVar%7D%5BX%5D%7D%7Bn%5E2%7D%20=%20%5Cfrac%7Bn%20%5Ccdot%20F(t)%20%5Ccdot%20S(t)%7D%7Bn%5E2%7D%20=%20%5Cfrac%7BF(t)%20%5Ccdot%20S(t)%7D%7Bn%7D%20=%20%5Cfrac%7BF(t)(1%20-%20F(t))%7D%7Bn%7D"></p>
<p>Two beautiful results — the empirical CDF is an <strong>unbiased estimator</strong> of the true CDF <img src="https://latex.codecogs.com/png.latex?F(t)">, and its variance shrinks as <img src="https://latex.codecogs.com/png.latex?n"> grows. The larger the dataset, the more precisely <img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_n(t)"> estimates <img src="https://latex.codecogs.com/png.latex?F(t)"> at every time point <img src="https://latex.codecogs.com/png.latex?t">.</p>
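<p>A quick simulation makes both results concrete. This is a sketch under an assumed true distribution (an Exponential with unit scale, chosen purely for convenience — any CDF works the same way):</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, reps, t = 50, 20_000, 1.0
F_t = 1 - np.exp(-t)  # true CDF of an Exponential(1) evaluated at t

# Simulate `reps` datasets of size n; X counts the failures by time t in each
X = (rng.exponential(scale=1.0, size=(reps, n)) <= t).sum(axis=1)
F_hat = X / n

print(f"E[F_hat(t)]   ~ {F_hat.mean():.4f}   (theory: F(t) = {F_t:.4f})")
print(f"Var[F_hat(t)] ~ {F_hat.var():.6f} (theory: F(t)(1-F(t))/n = {F_t*(1-F_t)/n:.6f})")

# The counts X themselves follow Binomial(n, F(t))
k = 32
print(f"P(X = {k}): empirical {np.mean(X == k):.4f}, "
      f"Binomial pmf {stats.binom.pmf(k, n, F_t):.4f}")
```

<p>The empirical mean and variance of <code>F_hat</code> land on top of the theoretical values, and the distribution of the count matches the Binomial pmf.</p>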
</section>
</section>
<section id="the-kolmogorov-smirnov-k-s-test" class="level3">
<h3 class="anchored" data-anchor-id="the-kolmogorov-smirnov-k-s-test">The Kolmogorov-Smirnov (K-S) Test</h3>
<p>The core idea of the K-S test is simple. You have two CDFs:</p>
<ul>
<li><img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_n(t)"> — the empirical CDF, computed from the data</li>
<li><img src="https://latex.codecogs.com/png.latex?F(t;%20%5Chat%7B%5Ctheta%7D,%20%5Chat%7B%5Cgamma%7D)"> — the fitted Weibull CDF</li>
</ul>
<p>At every point <img src="https://latex.codecogs.com/png.latex?t">, these two curves have some vertical distance between them. The K-S test statistic is simply the largest such distance:</p>
<p><img src="https://latex.codecogs.com/png.latex?D_n%20=%20%5Csup_t%20%5Cleft%7C%5Chat%7BF%7D_n(t)%20-%20F(t;%20%5Chat%7B%5Ctheta%7D,%20%5Chat%7B%5Cgamma%7D)%5Cright%7C"></p>
<p><img src="https://latex.codecogs.com/png.latex?D_n%20=%20%5C%7C%5Chat%7BF%7D_n%20-%20F(%5Ccdot%5C,%20;%5Chat%7B%5Ctheta%7D,%20%5Chat%7B%5Cgamma%7D)%5C%7C_%5Cinfty"></p>
<p>where <img src="https://latex.codecogs.com/png.latex?%5Csup"> stands for supremum — the least upper bound of a set of real numbers that is bounded above, i.e., a set for which there exists some number <img src="https://latex.codecogs.com/png.latex?M"> with every element <img src="https://latex.codecogs.com/png.latex?%5Cleq%20M"> (any such <img src="https://latex.codecogs.com/png.latex?M"> is called an upper bound). The axiom of completeness states that any nonempty set of real numbers that is bounded above has a supremum. If you are not familiar with the supremum, replace it with the maximum and you will be fine for all practical purposes.</p>
<p>Intuitively, if the fit is perfect, <img src="https://latex.codecogs.com/png.latex?D_n%20=%200">. If the fit is terrible, <img src="https://latex.codecogs.com/png.latex?D_n"> is large.</p>
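<p>Because the empirical CDF is a step function and the fitted CDF is continuous, the supremum is attained just before or at one of the jump points, so computing <img src="https://latex.codecogs.com/png.latex?D_n"> reduces to a maximum over <img src="https://latex.codecogs.com/png.latex?2n"> one-sided gaps. A sketch on synthetic data (the Exponential here is purely illustrative, with its parameters treated as known):</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = np.sort(rng.exponential(scale=2.0, size=200))
n = len(data)

# Hypothesized CDF with *known* parameters, evaluated at the sorted data
F = stats.expon(scale=2.0).cdf(data)

# The ECDF equals i/n just after the i-th point and (i-1)/n just before it,
# so the supremum is the largest of these 2n one-sided gaps
i = np.arange(1, n + 1)
D_manual = np.max(np.maximum(i / n - F, F - (i - 1) / n))

D_scipy = stats.kstest(data, 'expon', args=(0, 2.0)).statistic
print(f"manual D_n = {D_manual:.6f}, scipy D_n = {D_scipy:.6f}")
```

<p>The hand-rolled maximum agrees with <code>scipy.stats.kstest</code>, which uses exactly this reduction internally.</p>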
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-3-contents" aria-controls="callout-3" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>The Infinity Norm
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-3" class="callout-3-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>A function <img src="https://latex.codecogs.com/png.latex?f"> on a domain <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D"> is said to be <strong>bounded</strong> if there exists a real number <img src="https://latex.codecogs.com/png.latex?M%20%3E%200"> such that <img src="https://latex.codecogs.com/png.latex?%7Cf(x)%7C%20%5Cleq%20M"> for all <img src="https://latex.codecogs.com/png.latex?x%20%5Cin%20%5Cmathcal%7BD%7D">.</p>
<p>The infinity norm (or sup norm) of a bounded function <img src="https://latex.codecogs.com/png.latex?f"> on a domain <img src="https://latex.codecogs.com/png.latex?%5Cmathcal%7BD%7D"> is defined as:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5C%7Cf%5C%7C_%5Cinfty%20=%20%5Csup_%7Bx%20%5Cin%20%5Cmathcal%7BD%7D%7D%20%7Cf(x)%7C"></p>
<p>The infinity norm naturally induces a metric between two functions <img src="https://latex.codecogs.com/png.latex?f"> and <img src="https://latex.codecogs.com/png.latex?g">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5C%7Cf%20-%20g%5C%7C_%5Cinfty%20=%20%5Csup_%7Bx%20%5Cin%20%5Cmathcal%7BD%7D%7D%20%7Cf(x)%20-%20g(x)%7C"></p>
<p>It measures the <strong>largest</strong> absolute value the function attains over its entire domain — the worst-case deviation. In our case, <img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_n%20-%20F(%5Ccdot%5C,;%5Chat%7B%5Ctheta%7D,%5Chat%7B%5Cgamma%7D)"> is a function of time <img src="https://latex.codecogs.com/png.latex?t">, and <img src="https://latex.codecogs.com/png.latex?D_n%20=%20%5C%7C%5Chat%7BF%7D_n%20-%20F%5C%7C_%5Cinfty"> is its largest absolute value over all <img src="https://latex.codecogs.com/png.latex?t%20%5Cgeq%200">. The infinity norm is the natural norm on the space of bounded functions <img src="https://latex.codecogs.com/png.latex?B(%5Cmathbb%7BR%7D)"> and plays a central role in functional analysis and approximation theory.</p>
<p>The infinity norm is indeed a norm — it satisfies non-negativity, homogeneity, and the triangle inequality. Verifying these properties is a standard exercise in real analysis. The triangle inequality in particular:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5C%7Cf%20+%20g%5C%7C_%5Cinfty%20%5Cleq%20%5C%7Cf%5C%7C_%5Cinfty%20+%20%5C%7Cg%5C%7C_%5Cinfty"></p>
<p>follows directly from <img src="https://latex.codecogs.com/png.latex?%7Cf(x)%20+%20g(x)%7C%20%5Cleq%20%7Cf(x)%7C%20+%20%7Cg(x)%7C"> and taking the supremum on both sides. In fact, the space of bounded functions on <img src="https://latex.codecogs.com/png.latex?%5Cmathbb%7BR%7D"> equipped with the infinity norm, denoted <img src="https://latex.codecogs.com/png.latex?B(%5Cmathbb%7BR%7D)">, is not just a normed vector space but a <strong>Banach space</strong> — a complete normed vector space where every Cauchy sequence converges. This completeness property is what makes the infinity norm so powerful in analysis.</p>
</div>
</div>
</div>
<section id="the-hypothesis-test" class="level4">
<h4 class="anchored" data-anchor-id="the-hypothesis-test">The Hypothesis Test</h4>
<p><img src="https://latex.codecogs.com/png.latex?H_0:%20%5Ctext%7Bthe%20data%20comes%20from%20%7D%20F(t;%20%5Chat%7B%5Ctheta%7D,%20%5Chat%7B%5Cgamma%7D)"> <img src="https://latex.codecogs.com/png.latex?H_1:%20%5Ctext%7Bit%20doesn't%7D"></p>
<p>If <img src="https://latex.codecogs.com/png.latex?D_n"> is large enough — larger than what you would expect by chance if <img src="https://latex.codecogs.com/png.latex?H_0"> is true — you reject the null.</p>
</section>
<section id="distribution-of-d_n" class="level4">
<h4 class="anchored" data-anchor-id="distribution-of-d_n">Distribution of <img src="https://latex.codecogs.com/png.latex?D_n"></h4>
<p>Without going into the depths of hell, we simply state that under <img src="https://latex.codecogs.com/png.latex?H_0"> with <strong>known parameters</strong>, <img src="https://latex.codecogs.com/png.latex?%5Csqrt%7Bn%7D%20%5Ccdot%20D_n"> converges in distribution to the Kolmogorov distribution <img src="https://latex.codecogs.com/png.latex?K">. The CDF of <img src="https://latex.codecogs.com/png.latex?K"> has the closed form:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(K%20%5Cleq%20x)%20=%201%20-%202%5Csum_%7Bk=1%7D%5E%7B%5Cinfty%7D%20(-1)%5E%7Bk-1%7D%20e%5E%7B-2k%5E2x%5E2%7D"></p>
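<p>The alternating series converges very quickly — a handful of terms suffices for moderate <img src="https://latex.codecogs.com/png.latex?x">. As a sanity check, a truncated version can be compared against SciPy (<code>scipy.special.kolmogorov</code> returns the complementary CDF <img src="https://latex.codecogs.com/png.latex?P(K%20%3E%20x)">):</p>

```python
import numpy as np
from scipy.special import kolmogorov

def kolmogorov_cdf(x, terms=100):
    """P(K <= x) via the truncated alternating series."""
    k = np.arange(1, terms + 1)
    return 1 - 2 * np.sum((-1.0) ** (k - 1) * np.exp(-2 * k**2 * x**2))

for x in [0.5, 1.0, 1.36, 2.0]:
    print(f"x = {x}: series = {kolmogorov_cdf(x):.6f}, "
          f"scipy = {1 - kolmogorov(x):.6f}")
```

<p>The value <img src="https://latex.codecogs.com/png.latex?x%20%5Capprox%201.36"> is the familiar 5% critical point, where the CDF is approximately 0.95.</p>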
</section>
<section id="a-critical-caveat" class="level4">
<h4 class="anchored" data-anchor-id="a-critical-caveat">A Critical Caveat</h4>
<p>The standard K-S test assumes the parameters <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> are <strong>known in advance</strong> — not estimated from the same data. In our case, we estimated <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Ctheta%7D"> and <img src="https://latex.codecogs.com/png.latex?%5Chat%7B%5Cgamma%7D"> from the machine fleet data using MLE, and then used those estimates to construct <img src="https://latex.codecogs.com/png.latex?F(t;%20%5Chat%7B%5Ctheta%7D,%20%5Chat%7B%5Cgamma%7D)">. This makes <img src="https://latex.codecogs.com/png.latex?D_n"> artificially small — the fitted distribution has already been pulled toward the data, so the two curves are closer than they would be with truly known parameters. As a result, the standard K-S p-values are too optimistic and should not be taken at face value.</p>
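<p>One standard remedy is a parametric bootstrap in the spirit of the Lilliefors test: simulate datasets from the fitted model, re-estimate the parameters on each simulated dataset, and recompute <img src="https://latex.codecogs.com/png.latex?D_n">, so that the null distribution reflects the estimation step. A sketch on synthetic Exponential data, kept deliberately small and simple — this is not the fleet analysis itself:</p>

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.exponential(scale=3.0, size=100)

# Fit by MLE (for the Exponential, the MLE of the scale is the sample mean)
scale_hat = data.mean()
D_obs = stats.kstest(data, 'expon', args=(0, scale_hat)).statistic

# Parametric bootstrap: simulate from the *fitted* model, re-fit, recompute D_n
B = 500
D_boot = np.empty(B)
for b in range(B):
    sim = rng.exponential(scale=scale_hat, size=len(data))
    D_boot[b] = stats.kstest(sim, 'expon', args=(0, sim.mean())).statistic

# Honest p-value that accounts for parameters estimated from the data
p_boot = np.mean(D_boot >= D_obs)
print(f"D_obs = {D_obs:.4f}, bootstrap p-value = {p_boot:.3f}")
```

<p>The bootstrap p-value is typically larger than the one from the standard K-S tables, because the null distribution of <img src="https://latex.codecogs.com/png.latex?D_n"> shifts toward zero once the estimation step is simulated too.</p>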
<p>For our purposes, we use the K-S test as a descriptive tool — a way to quantify how close the fit is — rather than as a strict hypothesis test. The Q-Q plot already told us the story visually. The K-S statistic puts a number on it. Let’s see via code what that number looks like.</p>
<div id="ks-test-fleet" class="cell" data-execution_count="7">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb5-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> pandas <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> pd</span>
<span id="cb5-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> lifelines <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> WeibullFitter</span>
<span id="cb5-4"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> scipy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> stats</span>
<span id="cb5-5"></span>
<span id="cb5-6">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'../../data/machine_fleet.csv'</span>)</span>
<span id="cb5-7"></span>
<span id="cb5-8">wf <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> WeibullFitter()</span>
<span id="cb5-9">wf.fit(df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'observed_time'</span>], event_observed<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'event_observed'</span>])</span>
<span id="cb5-10"></span>
<span id="cb5-11">gamma_hat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> wf.rho_</span>
<span id="cb5-12">theta_hat <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> wf.lambda_</span>
<span id="cb5-13"></span>
<span id="cb5-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use only observed failures</span></span>
<span id="cb5-15">t_obs <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> df[df[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'event_observed'</span>] <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>][<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'observed_time'</span>].values</span>
<span id="cb5-16"></span>
<span id="cb5-17"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># K-S test against fitted Weibull</span></span>
<span id="cb5-18"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># scipy uses the standard Weibull parameterization</span></span>
<span id="cb5-19"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Weibull(c, scale) where c = gamma, scale = theta</span></span>
<span id="cb5-20">ks_stat, p_value <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> stats.kstest(</span>
<span id="cb5-21">    t_obs,</span>
<span id="cb5-22">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'weibull_min'</span>,</span>
<span id="cb5-23">    args<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span>(gamma_hat, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, theta_hat)</span>
<span id="cb5-24">)</span>
<span id="cb5-25"></span>
<span id="cb5-26"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"K-S statistic  : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ks_stat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb5-27"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"p-value        : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>p_value<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb5-28"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"Sample size    : </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">len</span>(t_obs)<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>)</span>
<span id="cb5-29"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>()</span>
<span id="cb5-30"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Interpretation:"</span>)</span>
<span id="cb5-31"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> p_value <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.05</span>:</span>
<span id="cb5-32">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  D_n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ks_stat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> — reject H0 at 5% significance."</span>)</span>
<span id="cb5-33">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"  The fitted Weibull does not describe the data well."</span>)</span>
<span id="cb5-34"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb5-35">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;">f"  D_n = </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{</span>ks_stat<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">:.4f}</span><span class="ss" style="color: #20794D;
background-color: null;
font-style: inherit;"> — fail to reject H0 at 5% significance."</span>)</span>
<span id="cb5-36">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"  The fitted Weibull is a reasonable description of the data."</span>)</span>
<span id="cb5-37"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>()</span>
<span id="cb5-38"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Note: p-value is anti-conservative since parameters were"</span>)</span>
<span id="cb5-39"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"estimated from the same data. Use as descriptive tool only."</span>)</span></code></pre></div></div>
<div class="cell-output cell-output-stdout">
<pre><code>K-S statistic  : 0.0669
p-value        : 0.0007
Sample size    : 886

Interpretation:
  D_n = 0.0669 — reject H0 at 5% significance.
  The fitted Weibull does not describe the data well.

Note: p-value is anti-conservative since parameters were
estimated from the same data. Use as descriptive tool only.</code></pre>
</div>
</div>
<p><img src="https://latex.codecogs.com/png.latex?D_n%20=%200.067"> and <img src="https://latex.codecogs.com/png.latex?p%20=%200.0007"> are both computed from the same test statistic, but they answer different questions. <img src="https://latex.codecogs.com/png.latex?D_n"> measures the size of the deviation — 6.7 percentage points at most, which is practically small. The p-value measures whether a deviation this large is surprising given the sample size. With <img src="https://latex.codecogs.com/png.latex?n%20=%20886">, even a small <img src="https://latex.codecogs.com/png.latex?D_n"> becomes highly significant — the test has enough power to confidently declare the fit imperfect. This is a well known phenomenon in hypothesis testing — with large sample sizes, even small and practically irrelevant deviations from the null become statistically significant. The test is not broken. It is doing exactly what it is designed to do: detect any deviation from <img src="https://latex.codecogs.com/png.latex?H_0">, no matter how small, given enough data. Whether that deviation matters in practice is a separate question that the p-value cannot answer.</p>
<p>Let’s spend a few minutes analyzing the output <img src="https://latex.codecogs.com/png.latex?D_n%20=%200.067"> and <img src="https://latex.codecogs.com/png.latex?p%20=%200.0007"> in practical terms. The small p-value tells us that the observed data is unlikely to have come from the fitted Weibull. Fine. The test statistic <img src="https://latex.codecogs.com/png.latex?D_n"> tells us that the fitted CDF and the empirical CDF are never off by more than 6.7 percentage points at any time point.</p>
<p>This is because, for any time <img src="https://latex.codecogs.com/png.latex?t">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%7C%5Chat%7BF%7D_n(t)%20-%20F(t;%5Chat%7B%5Ctheta%7D,%20%5Chat%7B%5Cgamma%7D)%7C%20%5Cleq%20D_n%20=%200.067"></p>
<p>Now suppose we use the fitted Weibull to make a decision: schedule maintenance at the time t* at which the fitted CDF reaches 30%, i.e., act before 30% of the fleet has failed. Solving for t* gives:</p>
<p><img src="https://latex.codecogs.com/png.latex?t%5E*%20=%20%5Chat%7B%5Ctheta%7D%5Cleft%5B%5Cln%5Cleft(%5Cfrac%7B1%7D%7B1-0.30%7D%5Cright)%5Cright%5D%5E%7B1/%5Chat%7B%5Cgamma%7D%7D%20%5Capprox%2071%5C%20%5Ctext%7Bhours%7D"></p>
<p>We schedule maintenance at 71 hours. But by the K-S bound, the actual observed failure fraction at <img src="https://latex.codecogs.com/png.latex?t=71"> hours satisfies:</p>
<p><img src="https://latex.codecogs.com/png.latex?%7C%5Chat%7BF%7D_n(71)%20-%200.30%7C%20%5Cleq%200.067%20%5Cimplies%20%5Chat%7BF%7D_n(71)%20%5Cin%20%5B0.233,%5C%200.367%5D"></p>
<p>Out of 1000 machines, between 233 and 367 have actually failed by the time we trigger maintenance — a swing of <img src="https://latex.codecogs.com/png.latex?%5Cpm%2067"> machines driven entirely by the model’s imprecision. Whether that uncertainty is acceptable depends on the cost of an unplanned failure vs the cost of preventive maintenance. Let’s assume:</p>
<ul>
<li>Cost of an <strong>unplanned failure</strong> (breakdown, emergency repair, downtime): ₹5,00,000 per machine</li>
<li>Cost of <strong>preventive maintenance</strong> (scheduled, planned): ₹50,000 per machine</li>
</ul>
<p>At <img src="https://latex.codecogs.com/png.latex?t%5E*%20=%2071"> hours, you service all 1000 machines regardless. The question is how many had already failed before you arrived.</p>
<p><strong>Best case</strong> <img src="https://latex.codecogs.com/png.latex?%5Cleft(%5Chat%7BF%7D_n(71)%20=%200.233%5Cright)">: 233 machines had already failed, 767 are still running.</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BTotal%20cost%7D%20=%20233%20%5Ctimes%205%7B,%7D00%7B,%7D000%20+%20767%20%5Ctimes%2050%7B,%7D000%20=%20%5Ctext%7B%E2%82%B911,65,00,000%7D%20+%20%5Ctext%7B%E2%82%B93,83,50,000%7D%20=%20%5Ctext%7B%E2%82%B915,48,50,000%7D"></p>
<p><strong>Worst case</strong> <img src="https://latex.codecogs.com/png.latex?%5Cleft(%5Chat%7BF%7D_n(71)%20=%200.367%5Cright)">: 367 machines had already failed, 633 are still running.</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BTotal%20cost%7D%20=%20367%20%5Ctimes%205%7B,%7D00%7B,%7D000%20+%20633%20%5Ctimes%2050%7B,%7D000%20=%20%5Ctext%7B%E2%82%B918,35,00,000%7D%20+%20%5Ctext%7B%E2%82%B93,16,50,000%7D%20=%20%5Ctext%7B%E2%82%B921,51,50,000%7D"></p>
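The best- and worst-case arithmetic is straightforward to reproduce in a few lines; all numbers below are taken directly from the text (amounts in plain rupees).

```python
# Reproducing the best/worst-case cost arithmetic; all values are from the text.
n = 1000
c_failure = 500_000       # Rs 5,00,000 per unplanned failure
c_maintenance = 50_000    # Rs 50,000 per preventive service

def total_cost(frac_failed):
    failed = round(frac_failed * n)
    return failed * c_failure + (n - failed) * c_maintenance

best = total_cost(0.233)   # Rs 15,48,50,000
worst = total_cost(0.367)  # Rs 21,51,50,000
print(best, worst, worst - best)  # the difference is the Rs 6,03,00,000 swing
```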
<p>The uncertainty from <img src="https://latex.codecogs.com/png.latex?D_n%20=%200.067"> alone translates into a cost swing of:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7B%E2%82%B921,51,50,000%7D%20-%20%5Ctext%7B%E2%82%B915,48,50,000%7D%20=%20%5Ctext%7B%E2%82%B96,03,00,000%7D"></p>
<p>approximately ₹6 crore. And crucially — we cannot identify <em>which</em> machines are driving this uncertainty without knowing their individual characteristics. That requires modeling the effect of covariates on failure time. This is precisely what survival regression is built to do — and where we are headed. Let’s visualize how the cost swing and the trigger time <img src="https://latex.codecogs.com/png.latex?t%5E*"> vary across all possible thresholds from 5% to 95%.</p>
<div id="cell-fig-cost-swing" class="cell" data-execution_count="8">
<div class="cell-output cell-output-display">
<div id="fig-cost-swing" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-cost-swing-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://madhavpr191221.github.io/blog/posts/part-3-fitting-survival-distributions/index_files/figure-html/fig-cost-swing-output-1.png" width="1335" height="470" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-cost-swing-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;5: Cost swing due to model uncertainty (D_n = 0.067) across maintenance thresholds
</figcaption>
</figure>
</div>
</div>
</div>
<p>The cost swing curve is nearly flat at ₹6 crore across all thresholds — this is not a coincidence. Since <img src="https://latex.codecogs.com/png.latex?D_n"> is a uniform bound over all <img src="https://latex.codecogs.com/png.latex?t">, the worst-case cost uncertainty is approximately:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BCost%20swing%7D%20%5Capprox%202%20D_n%20%5Ctimes%20n%20%5Ctimes%20(c_%7B%5Ctext%7Bfailure%7D%7D%20-%20c_%7B%5Ctext%7Bmaintenance%7D%7D)%0A=%202%20%5Ctimes%200.067%20%5Ctimes%201000%20%5Ctimes%204%7B,%7D50%7B,%7D000%20=%20%5Ctext%7B%E2%82%B96.03%20crore%7D"> regardless of which threshold you choose. A better model — one that reduces <img src="https://latex.codecogs.com/png.latex?D_n"> — would shift this entire curve downward uniformly. The right panel shows the trigger time <img src="https://latex.codecogs.com/png.latex?t%5E*"> growing rapidly with threshold — waiting for 90% of the fleet to fail before intervening means waiting nearly 180 hours.</p>
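A small sketch confirms the threshold-independence: because total cost is linear in the failed fraction, shifting that fraction by ±D_n moves the cost by the same amount at every threshold.

```python
import numpy as np

# The cost is linear in the failed fraction p, so a +/- D_n shift in p
# produces the same cost swing at every threshold p*.
D_n, n = 0.067, 1000
c_failure, c_maintenance = 500_000, 50_000

def cost(p):
    return p * n * c_failure + (1 - p) * n * c_maintenance

thresholds = np.arange(0.05, 0.951, 0.05)
swings = cost(thresholds + D_n) - cost(thresholds - D_n)
print(np.allclose(swings, 2 * D_n * n * (c_failure - c_maintenance)))  # True
```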
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-4-contents" aria-controls="callout-4" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Derivation of the Cost Swing Formula
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-4" class="callout-4-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>At threshold <img src="https://latex.codecogs.com/png.latex?p%5E*">, you trigger maintenance at <img src="https://latex.codecogs.com/png.latex?t%5E*"> and service all <img src="https://latex.codecogs.com/png.latex?n"> machines. Of those, <img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_n(t%5E*)%20%5Ctimes%20n"> had already failed and each costs <img src="https://latex.codecogs.com/png.latex?c_%7B%5Ctext%7Bfailure%7D%7D">, while <img src="https://latex.codecogs.com/png.latex?(1%20-%20%5Chat%7BF%7D_n(t%5E*))%20%5Ctimes%20n"> are still running and each costs <img src="https://latex.codecogs.com/png.latex?c_%7B%5Ctext%7Bmaintenance%7D%7D">. The total cost is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BTotal%20cost%7D%20=%20%5Chat%7BF%7D_n(t%5E*)%20%5Ccdot%20n%20%5Ccdot%20c_%7B%5Ctext%7Bfailure%7D%7D%20+%20(1%20-%20%5Chat%7BF%7D_n(t%5E*))%20%5Ccdot%20n%20%5Ccdot%20c_%7B%5Ctext%7Bmaintenance%7D%7D"></p>
<p>By the K-S bound, <img src="https://latex.codecogs.com/png.latex?%5Chat%7BF%7D_n(t%5E*)"> lies in <img src="https://latex.codecogs.com/png.latex?%5Bp%5E*%20-%20D_n,%5C%20p%5E*%20+%20D_n%5D">, so:</p>
<p><img src="https://latex.codecogs.com/png.latex?p_%7B%5Ctext%7Bbest%7D%7D%20=%20p%5E*%20-%20D_n,%20%5Cqquad%20p_%7B%5Ctext%7Bworst%7D%7D%20=%20p%5E*%20+%20D_n"></p>
<p>The cost swing is:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Ctext%7BCost%20swing%7D%20=%20%5Ctext%7BCost%7D_%7B%5Ctext%7Bworst%7D%7D%20-%20%5Ctext%7BCost%7D_%7B%5Ctext%7Bbest%7D%7D"></p>
<p><img src="https://latex.codecogs.com/png.latex?=%20%5Cleft%5Bp_%7B%5Ctext%7Bworst%7D%7D%20%5Ccdot%20n%20%5Ccdot%20c_%7B%5Ctext%7Bfailure%7D%7D%20+%20(1-p_%7B%5Ctext%7Bworst%7D%7D)%20%5Ccdot%20n%20%5Ccdot%20c_%7B%5Ctext%7Bmaintenance%7D%7D%5Cright%5D%20-%20%5Cleft%5Bp_%7B%5Ctext%7Bbest%7D%7D%20%5Ccdot%20n%20%5Ccdot%20c_%7B%5Ctext%7Bfailure%7D%7D%20+%20(1-p_%7B%5Ctext%7Bbest%7D%7D)%20%5Ccdot%20n%20%5Ccdot%20c_%7B%5Ctext%7Bmaintenance%7D%7D%5Cright%5D"></p>
<p><img src="https://latex.codecogs.com/png.latex?=%20n(p_%7B%5Ctext%7Bworst%7D%7D%20-%20p_%7B%5Ctext%7Bbest%7D%7D)(c_%7B%5Ctext%7Bfailure%7D%7D%20-%20c_%7B%5Ctext%7Bmaintenance%7D%7D)"></p>
<p><img src="https://latex.codecogs.com/png.latex?=%20n%20%5Ccdot%202D_n%20%5Ccdot%20(c_%7B%5Ctext%7Bfailure%7D%7D%20-%20c_%7B%5Ctext%7Bmaintenance%7D%7D)"></p>
<p>This is constant across all thresholds <img src="https://latex.codecogs.com/png.latex?p%5E*"> — the cost swing depends only on <img src="https://latex.codecogs.com/png.latex?D_n">, <img src="https://latex.codecogs.com/png.latex?n">, and the cost difference, not on which threshold you choose.</p>
</div>
</div>
</div>
</section>
</section>
<section id="whats-next" class="level3">
<h3 class="anchored" data-anchor-id="whats-next">What’s Next?</h3>
<p>We have covered a lot of ground in Part 3. We started with the likelihood function, a purely theoretical object, and derived the censored likelihood from first principles. We then fit a Weibull to our machine fleet and assessed the fit using two diagnostic tools: the Q-Q plot, which told the story visually, and the K-S test, which put a number on it.</p>
<p>Along the way, we took a detour that went from the supremum norm of functional analysis, <img src="https://latex.codecogs.com/png.latex?D_n%20=%20%5C%7C%5Chat%7BF%7D_n%20-%20F%5C%7C_%5Cinfty">, all the way to a ₹6 crore cost swing for a fleet of 1000 machines. This is the range in which this series (and survival analysis) operates: rigorous mathematics grounded in practical consequences. The K-S test is rooted in advanced concepts from stochastic processes and functional analysis, but its practical meaning is completely accessible: how wrong can my model be, and what does that cost me?</p>
<p>One thing Part 3 has made very clear: a single Weibull is not enough for a heterogeneous fleet of machines. The Q-Q plot showed deviations from the diagonal, and the K-S test rejected the null hypothesis. The cost analysis showed ₹6 crore of uncertainty. We need a better model, one that accounts for the fact that different machines have different failure characteristics.</p>
<p>But before we get to regression, we need a better estimator of the survival function itself, one that handles censored data properly, unlike the naive empirical CDF we used in the Q-Q plot.</p>
<p>In Part 4, we derive the famous <strong>Kaplan-Meier estimator</strong> from first principles, prove <strong>Greenwood’s formula</strong> for its variance, handle tied failure times rigorously, and apply both to our machine fleet. Most survival analysis blogs introduce this estimator at the very beginning, as an entry point to the field. We arrive at it in Part 4, after three parts of solid mathematical groundwork: censoring, hazard functions, parametric distributions, MLE, and the empirical CDF. This is not a detour. This is the foundation the Kaplan-Meier estimator deserves. The clock is still ticking. See you in the next post.</p>


</section>
</section>

 ]]></description>
  <category>survival analysis</category>
  <category>statistics</category>
  <category>python</category>
  <category>MLE</category>
  <guid>https://madhavpr191221.github.io/blog/posts/part-3-fitting-survival-distributions/</guid>
  <pubDate>Sat, 18 Apr 2026 18:30:00 GMT</pubDate>
</item>
<item>
  <title>Part 2: Distributions in Survival Analysis</title>
  <dc:creator>Madhav Prashanth Ramachandran</dc:creator>
  <link>https://madhavpr191221.github.io/blog/posts/part2-distributions-in-survival-analysis/</link>
  <description><![CDATA[ 





<section id="recap-of-part-1-and-what-to-expect-in-part-2" class="level2">
<h2 class="anchored" data-anchor-id="recap-of-part-1-and-what-to-expect-in-part-2">Recap of Part 1 and What to Expect in Part 2</h2>
<p>In <a href="https://madhavpr191221.github.io/blog/posts/part-1-why-survival-analysis-exists/"><strong>Part 1</strong></a>, we established why survival analysis exists as a separate field: regression breaks, censoring is real, and the functions we care about are fundamentally different from conditional expectations. Using basic calculus and probability theory, we introduced the key actors in survival analysis: the survival function, the hazard function, and the cumulative hazard function. We showed that if you know one of these, you know all of them. In Part 2, we ask a natural follow-up question: what does a survival distribution look like? The answer depends entirely on the shape of <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)">, the hazard function. We will see how different shapes of the hazard function lead to different survival curves and what that means in real life. A constant hazard leads to something familiar from undergrad probability. A linearly increasing hazard leads somewhere surprising. And a two-parameter family called the Weibull quietly unifies both of these and more. Finally, we will meet the bathtub curve, a common pattern of failure in real life that cannot be captured by a single parametric distribution but can be approximated by stitching together different phases of the Weibull distribution.</p>
</section>
<section id="distributions-in-survival-analysis" class="level2">
<h2 class="anchored" data-anchor-id="distributions-in-survival-analysis">Distributions in Survival Analysis</h2>
<section id="constant-hazard-the-exponential-distribution" class="level3">
<h3 class="anchored" data-anchor-id="constant-hazard-the-exponential-distribution">Constant Hazard: The Exponential Distribution</h3>
<p>What happens if the hazard function is constant over time? That is, <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20=%20%5Clambda"> for all <img src="https://latex.codecogs.com/png.latex?t">? From the interpretation of the hazard function, this means for any time interval <img src="https://latex.codecogs.com/png.latex?%5CDelta%20t">, the probability of failure in that interval is the same regardless of how long the object has survived so far. That is, for any time intervals <img src="https://latex.codecogs.com/png.latex?%5Bt_1,%20t_1%20+%20%5CDelta%20t%5D"> and <img src="https://latex.codecogs.com/png.latex?%5Bt_2,%20t_2%20+%20%5CDelta%20t%5D">, we have the same probability of failure (conditional on survival up to <img src="https://latex.codecogs.com/png.latex?t_1"> and <img src="https://latex.codecogs.com/png.latex?t_2"> respectively). This is a strong assumption, but it leads to a very simple distribution. Let’s derive it.</p>
<p>We know that <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20=%20%5Clambda"> is constant, so we can write the cumulative hazard function as <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)%20=%20%5Cint_0%5Et%20%5Clambda(s)%20ds%20=%20%5Clambda%20t">. Using the relationship between the survival function and the cumulative hazard function, we have <img src="https://latex.codecogs.com/png.latex?S(t)%20=%20e%5E%7B-%5CLambda(t)%7D%20=%20e%5E%7B-%5Clambda%20t%7D">. The Cumulative distribution function (CDF) is then <img src="https://latex.codecogs.com/png.latex?F(t)%20=%201%20-%20S(t)%20=%201%20-%20e%5E%7B-%5Clambda%20t%7D">. The probability density function (PDF) is the derivative of the CDF, which gives us <img src="https://latex.codecogs.com/png.latex?f(t)%20=%20%5Clambda%20e%5E%7B-%5Clambda%20t%7D">. This distribution is known as the Exponential distribution with <strong>rate</strong> parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda"> - one of the most important continuous distributions in probability theory (probably after the normal distribution). The <strong>rate</strong> parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda"> is in fact the rate of failure per unit time. If you didn’t understand why the parameter lambda is called the rate parameter in your previous courses, you should now. We write <img src="https://latex.codecogs.com/png.latex?T%20%5Csim%20%5Ctext%7BExponential%7D(%5Clambda)"> to denote that the random variable <img src="https://latex.codecogs.com/png.latex?T"> follows an Exponential distribution with rate parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda">.</p>
<p>Once we have a distribution, the next natural quantities to compute are the moments of the distribution. It is a simple exercise in integration to show that the mean <img src="https://latex.codecogs.com/png.latex?E%5BT%5D"> of the exponential distribution is <img src="https://latex.codecogs.com/png.latex?1/%5Clambda"> and the variance <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BVar%7D%5BT%5D"> is <img src="https://latex.codecogs.com/png.latex?1/%5Clambda%5E2">. Other quantities related to the moments are the median and the mode. The median is the time <img src="https://latex.codecogs.com/png.latex?t"> such that <img src="https://latex.codecogs.com/png.latex?F(t)%20=%200.5">, which gives us the half-life (the time at which 50% of the systems have failed) <img src="https://latex.codecogs.com/png.latex?t_%7B1/2%7D%20=%20%5Cln(2)/%5Clambda">. Notice that the half-life is inversely proportional to the rate parameter, which is quite intuitive given the physical interpretation of <img src="https://latex.codecogs.com/png.latex?%5Clambda">. The mode is the time at which the PDF is maximized, which for the exponential distribution is at <img src="https://latex.codecogs.com/png.latex?t=0">. The distribution is right-skewed: most systems fail early, but there is a long tail of systems that survive for a long time. Here is what the Exponential distribution looks like for different values of <img src="https://latex.codecogs.com/png.latex?%5Clambda">.</p>
<div id="cell-fig-exponential" class="cell" data-execution_count="1">
<div class="cell-output cell-output-display">
<div id="fig-exponential" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-exponential-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://madhavpr191221.github.io/blog/posts/part2-distributions-in-survival-analysis/index_files/figure-html/fig-exponential-output-1.png" width="1718" height="469" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-exponential-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Exponential survival functions for different rate parameters
</figcaption>
</figure>
</div>
</div>
</div>
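The closed-form moments above are easy to sanity-check against scipy, which parameterizes the exponential by its scale 1/λ (a quick verification sketch, not part of the original derivation):

```python
import numpy as np
from scipy import stats

# scipy parameterizes the exponential by scale = 1/lambda.
lam = 0.5
T = stats.expon(scale=1 / lam)

assert np.isclose(T.mean(), 1 / lam)            # E[T] = 1/lambda
assert np.isclose(T.var(), 1 / lam ** 2)        # Var[T] = 1/lambda^2
assert np.isclose(T.median(), np.log(2) / lam)  # half-life = ln(2)/lambda
print("all moments match")
```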
<p>An important property of the exponential distribution is that it is <strong>memoryless</strong>. What does that mean?</p>
<p>Suppose the system has already survived for time <img src="https://latex.codecogs.com/png.latex?t_0">. What is the probability that it will survive for an additional time <img src="https://latex.codecogs.com/png.latex?t">? Let’s compute this probability. We want <img src="https://latex.codecogs.com/png.latex?P(T%20%3E%20t_0%20+%20t%20%7C%20T%20%3E%20t_0)">. Using the definition of conditional probability, we have: <img src="https://latex.codecogs.com/png.latex?P(T%20%3E%20t_0%20+%20t%20%5Cmid%20T%20%3E%20t_0)%20=%20%5Cfrac%7BP(%5C%7BT%20%3E%20t_0%20+%20t%5C%7D%20%5Ccap%20%5C%7BT%20%3E%20t_0%5C%7D)%7D%7BP(T%20%3E%20t_0)%7D"></p>
<p>Since <img src="https://latex.codecogs.com/png.latex?%5C%7BT%20%3E%20t_0%20+%20t%5C%7D%20%5Csubseteq%20%5C%7BT%20%3E%20t_0%5C%7D"> (because if the system survives for <img src="https://latex.codecogs.com/png.latex?t_0%20+%20t">, it must have survived for <img src="https://latex.codecogs.com/png.latex?t_0">), we can simplify this to:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(T%20%3E%20t_0%20+%20t%20%5Cmid%20T%20%3E%20t_0)%20=%20%5Cfrac%7BP(T%20%3E%20t_0%20+%20t)%7D%7BP(T%20%3E%20t_0)%7D"></p>
<p>Substituting the survival function for the Exponential distribution, we get:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(T%20%3E%20t_0%20+%20t%20%5Cmid%20T%20%3E%20t_0)%20=%20%5Cfrac%7Be%5E%7B-%5Clambda(t_0%20+%20t)%7D%7D%7Be%5E%7B-%5Clambda%20t_0%7D%7D%20=%20e%5E%7B-%5Clambda%20t%7D%20=%20P(T%20%3E%20t)"></p>
<p>What does this mean? It means that the probability of surviving for an additional time <img src="https://latex.codecogs.com/png.latex?t"> does not depend on how long the system has already survived. In other words, the system has no memory of its past survival time. This property is unique: the exponential is the only memoryless continuous distribution.</p>
<p>Let’s look at a simple example to illustrate this. Suppose we have a light bulb that has an Exponential lifetime with a rate of <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%200.1"> failures per hour. The mean lifetime of the light bulb is <img src="https://latex.codecogs.com/png.latex?1/%5Clambda%20=%2010"> hours. If the light bulb has already been on for 5 hours, what is the probability that it will last for another 5 hours? Using the memoryless property, we can compute this as:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(T%20%3E%2010%20%7C%20T%20%3E%205)%20=%20P(T%20%3E%205)%20=%20e%5E%7B-0.1%20%5Ccdot%205%7D%20=%20e%5E%7B-0.5%7D%20%5Capprox%200.6065"></p>
<p>This means that even though the light bulb has already lasted for 5 hours, it still has a 60.65% chance of lasting for another 5 hours. Taking this to the extreme, if the light bulb has already lasted for 1000 hours, the probability that it will last for another 5 hours is still 60.65%. This is a direct consequence of the memoryless property of the exponential distribution.</p>
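Here is the light bulb example in code, a minimal check of the memoryless property using scipy's survival function `sf`:

```python
import numpy as np
from scipy import stats

lam = 0.1                          # failures per hour
T = stats.expon(scale=1 / lam)     # Exponential(lambda = 0.1)

conditional = T.sf(10) / T.sf(5)   # P(T > 10 | T > 5) = S(10)/S(5)
unconditional = T.sf(5)            # P(T > 5) = e^{-0.5} ~ 0.6065
print(np.isclose(conditional, unconditional))  # True
```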
<p>This is clearly unrealistic for most real-world systems. For example, if a machine has been running for 10 years, it is likely to be more prone to failure than a brand new machine. But it makes the exponential distribution a useful starting point for understanding survival analysis and serves as a building block for more complex distributions. When your data looks like it has a constant hazard, the exponential distribution is a good first choice for modeling it.</p>
<p>Next, we will look at what happens when the hazard function is not constant, but instead increases linearly with time. This leads us to the Rayleigh distribution, which has some surprising properties.</p>
</section>
<section id="linearly-increasing-hazard-the-rayleigh-distribution" class="level3">
<h3 class="anchored" data-anchor-id="linearly-increasing-hazard-the-rayleigh-distribution">Linearly Increasing Hazard: The Rayleigh Distribution</h3>
<p>Suppose for a fixed constant <img src="https://latex.codecogs.com/png.latex?%5Clambda%20%3E%200">, the hazard function increases linearly with time as <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20=%20%5Clambda%20t">. This means that the probability of failure in a small time interval <img src="https://latex.codecogs.com/png.latex?%5CDelta%20t"> increases with time. This is a more realistic assumption for many real-world systems, as they tend to wear out over time. Let’s derive the corresponding distribution (if it exists) and explore its properties. The cumulative hazard function is given by <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)%20=%20%5Cint_0%5Et%20%5Clambda(s)%5C,%20ds%20=%0A%5Cint_0%5Et%20%5Clambda%20s%5C,%20ds%20=%20%5Cfrac%7B%5Clambda%20t%5E2%7D%7B2%7D">. Using the relationship between the survival function and the cumulative hazard function, we have <img src="https://latex.codecogs.com/png.latex?S(t)%20=%20e%5E%7B-%5CLambda(t)%7D%20=%0Ae%5E%7B-%5Cfrac%7B%5Clambda%20t%5E2%7D%7B2%7D%7D">. The CDF is then <img src="https://latex.codecogs.com/png.latex?F(t)%20=%201%20-%20S(t)%20=%201%20-%20e%5E%7B-%5Cfrac%7B%5Clambda%20t%5E2%7D%7B2%7D%7D">. The PDF is the derivative of the CDF, which gives us <img src="https://latex.codecogs.com/png.latex?f(t)%20=%20%5Clambda%20t%20e%5E%7B-%5Cfrac%7B%5Clambda%20t%5E2%7D%7B2%7D%7D">. This distribution is known as the Rayleigh distribution, and we write <img src="https://latex.codecogs.com/png.latex?T%20%5Csim%20%5Ctext%7BRayleigh%7D%5Cleft(%5Cfrac%7B1%7D%7B%5Csqrt%7B%5Clambda%7D%7D%5Cright)">. The mean and variance are:</p>
<p><img src="https://latex.codecogs.com/png.latex?E%5BT%5D%20=%20%5Csqrt%7B%5Cfrac%7B%5Cpi%7D%7B2%5Clambda%7D%7D,%20%5Cqquad%20%5Ctext%7BVar%7D%5BT%5D%20=%20%5Cfrac%7B2%7D%7B%5Clambda%7D%5Cleft(1%20-%20%5Cfrac%7B%5Cpi%7D%7B4%7D%5Cright)"></p>
<p>The mode of the Rayleigh distribution is at <img src="https://latex.codecogs.com/png.latex?t%20=%20%5Csqrt%7B%5Cfrac%7B1%7D%7B%5Clambda%7D%7D">, which is the time at which the PDF is maximized. The median can be computed by solving <img src="https://latex.codecogs.com/png.latex?F(t)%20=%200.5">, which gives us <img src="https://latex.codecogs.com/png.latex?t_%7B1/2%7D%20=%20%5Csqrt%7B%5Cfrac%7B2%5Cln(2)%7D%7B%5Clambda%7D%7D">.</p>
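These formulas can be checked against scipy's Rayleigh, whose scale parameter is σ = 1/√λ in the notation above (a verification sketch, not from the original post):

```python
import numpy as np
from scipy import stats

# scipy's Rayleigh uses scale sigma; here lambda = 1/sigma^2, so sigma = 1/sqrt(lambda).
lam = 0.5
R = stats.rayleigh(scale=1 / np.sqrt(lam))

t = np.linspace(0.01, 6, 200)
assert np.allclose(R.sf(t), np.exp(-lam * t ** 2 / 2))        # S(t) = e^{-lambda t^2 / 2}
assert np.isclose(R.median(), np.sqrt(2 * np.log(2) / lam))   # median t_{1/2}
assert np.isclose(R.mean(), np.sqrt(np.pi / (2 * lam)))       # E[T]
print("Rayleigh formulas verified")
```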
<p>As we will see shortly, the Rayleigh distribution is a special case of the Weibull distribution — nature’s way of telling us that linearly increasing hazard and Weibull are secretly the same thing. But before we get there, let’s look at what the Rayleigh distribution looks like for different values of <img src="https://latex.codecogs.com/png.latex?%5Clambda">.</p>
<div id="cell-fig-rayleigh" class="cell" data-execution_count="2">
<div class="cell-output cell-output-display">
<div id="fig-rayleigh" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-rayleigh-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://madhavpr191221.github.io/blog/posts/part2-distributions-in-survival-analysis/index_files/figure-html/fig-rayleigh-output-1.png" width="1718" height="469" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-rayleigh-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Rayleigh distribution: survival function, density, and hazard for different rate parameters λ
</figcaption>
</figure>
</div>
</div>
</div>
<p>Before we move on, let’s take a short but illuminating detour to understand a connection between independent normal random variables and the Rayleigh distribution. Think of a rotating machine (motor, pump, compressor, etc.) that has two independent sources of random vibration in the horizontal (X) and vertical (Y) directions. Let <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Y"> be independent normal random variables with mean 0 and variance <img src="https://latex.codecogs.com/png.latex?%5Csigma%5E2">. The magnitude of the vibration is given by <img src="https://latex.codecogs.com/png.latex?R%20=%20%5Csqrt%7BX%5E2%20+%20Y%5E2%7D">. Can we find the distribution of <img src="https://latex.codecogs.com/png.latex?R">?</p>
<p>To find the distribution of <img src="https://latex.codecogs.com/png.latex?R">, we can use the fact that <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Y"> are independent normal random variables. The joint distribution of <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Y"> is given by: <img src="https://latex.codecogs.com/png.latex?f_%7BX,Y%7D(x,y)%20=%20%5Cfrac%7B1%7D%7B2%5Cpi%5Csigma%5E2%7D%20e%5E%7B-%5Cfrac%7Bx%5E2%20+%20y%5E2%7D%7B2%5Csigma%5E2%7D%7D"> (Because the joint distribution of two independent normal random variables is the product of their individual distributions).</p>
<p>To find the distribution of <img src="https://latex.codecogs.com/png.latex?R">, let’s fix a value <img src="https://latex.codecogs.com/png.latex?R%20=%20r_0"> and calculate the probability that <img src="https://latex.codecogs.com/png.latex?R%20%5Cleq%20r_0">. Mathematically, we want to compute <img src="https://latex.codecogs.com/png.latex?P(R%20%5Cleq%20r_0)%20=%20P(%5Csqrt%7BX%5E2%20+%20Y%5E2%7D%20%5Cleq%20r_0)">. Writing this in terms of the joint distribution of <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Y">, we have:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(R%20%5Cleq%20r_0)%20=%20%5Ciint_%7Bx%5E2%20+%20y%5E2%20%5Cleq%20r_0%5E2%7D%20f_%7BX,Y%7D(x,y)%20dx%20dy"></p>
<p>By the previous independence expression above, we can write this as:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(R%20%5Cleq%20r_0)%20=%20%5Ciint_%7Bx%5E2%20+%20y%5E2%20%5Cleq%20r_0%5E2%7D%20%5Cfrac%7B1%7D%7B2%5Cpi%5Csigma%5E2%7D%20e%5E%7B-%5Cfrac%7Bx%5E2%20+%20y%5E2%7D%7B2%5Csigma%5E2%7D%7D%20dx%20dy"></p>
<p>Go back to your multivariable calculus notes and recall that the region of integration is a disk of radius <img src="https://latex.codecogs.com/png.latex?r_0"> centered at the origin. It is easier to evaluate this integral in polar coordinates, where <img src="https://latex.codecogs.com/png.latex?x%20=%20r%20%5Ccos(%5Ctheta)"> and <img src="https://latex.codecogs.com/png.latex?y%20=%20r%20%5Csin(%5Ctheta)">. The Jacobian of the transformation from Cartesian to polar coordinates is <img src="https://latex.codecogs.com/png.latex?r">, so we have:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(R%20%5Cleq%20r_0)%20=%20%5Cint_0%5E%7B2%5Cpi%7D%20%5Cint_0%5E%7Br_0%7D%20%5Cfrac%7B1%7D%7B2%5Cpi%5Csigma%5E2%7D%20e%5E%7B-%5Cfrac%7Br%5E2%7D%7B2%5Csigma%5E2%7D%7D%20r%20dr%20d%5Ctheta"></p>
<p>The integrand can be decoupled into a product of a function of <img src="https://latex.codecogs.com/png.latex?r"> and a function of <img src="https://latex.codecogs.com/png.latex?%5Ctheta">, so we can first evaluate the outer integral with respect to <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and get some cancellations:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(R%20%5Cleq%20r_0)%20=%20%5Cint_0%5E%7Br_0%7D%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D%20e%5E%7B-%5Cfrac%7Br%5E2%7D%7B2%5Csigma%5E2%7D%7D%20r%20dr"></p>
<p>Taking the constant <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D"> outside the integral, we have: <img src="https://latex.codecogs.com/png.latex?P(R%20%5Cleq%20r_0)%20=%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D%20%5Cint_0%5E%7Br_0%7D%20r%20e%5E%7B-%5Cfrac%7Br%5E2%7D%7B2%5Csigma%5E2%7D%7D%20dr"></p>
<p>To evaluate this integral, we can use the substitution <img src="https://latex.codecogs.com/png.latex?u%20=%20%5Cfrac%7Br%5E2%7D%7B2%5Csigma%5E2%7D">, which gives us <img src="https://latex.codecogs.com/png.latex?du%20=%20%5Cfrac%7Br%7D%7B%5Csigma%5E2%7D%20dr">. The limits of integration change accordingly: when <img src="https://latex.codecogs.com/png.latex?r%20=%200">, we have <img src="https://latex.codecogs.com/png.latex?u%20=%200">, and when <img src="https://latex.codecogs.com/png.latex?r%20=%20r_0">, we have <img src="https://latex.codecogs.com/png.latex?u%20=%20%5Cfrac%7Br_0%5E2%7D%7B2%5Csigma%5E2%7D">. Substituting these into the integral, we get:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(R%20%5Cleq%20r_0)%20=%20%5Cint_0%5E%7B%5Cfrac%7Br_0%5E2%7D%7B2%5Csigma%5E2%7D%7D%20e%5E%7B-u%7D%20du"></p>
<p>If we set U to be an exponential random variable with rate parameter 1, then the above integral is just the CDF of U evaluated at <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Br_0%5E2%7D%7B2%5Csigma%5E2%7D">. The CDF of an exponential random variable with rate parameter 1 is given by <img src="https://latex.codecogs.com/png.latex?F_U(u)%20=%201%20-%20e%5E%7B-u%7D"> for <img src="https://latex.codecogs.com/png.latex?u%20%5Cgeq%200">. Substituting <img src="https://latex.codecogs.com/png.latex?u%20=%20%5Cfrac%7Br_0%5E2%7D%7B2%5Csigma%5E2%7D">, we get:</p>
<p><img src="https://latex.codecogs.com/png.latex?F_R(r_0)%20=%201%20-%20e%5E%7B-%5Cfrac%7Br_0%5E2%7D%7B2%5Csigma%5E2%7D%7D"></p>
<p>For a general <img src="https://latex.codecogs.com/png.latex?r">, the CDF of <img src="https://latex.codecogs.com/png.latex?R"> is given by:</p>
<p><img src="https://latex.codecogs.com/png.latex?F_R(r)%20=%201%20-%20e%5E%7B-%5Cfrac%7Br%5E2%7D%7B2%5Csigma%5E2%7D%7D"></p>
<p>The PDF of <img src="https://latex.codecogs.com/png.latex?R"> can be obtained by differentiating the CDF with respect to <img src="https://latex.codecogs.com/png.latex?r">, which gives us:</p>
<p><img src="https://latex.codecogs.com/png.latex?f_R(r)%20=%20%5Cfrac%7Br%7D%7B%5Csigma%5E2%7D%20e%5E%7B-%5Cfrac%7Br%5E2%7D%7B2%5Csigma%5E2%7D%7D"></p>
<p>This is exactly the PDF of a Rayleigh distribution with parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D">. To verify that the hazard rate is indeed linearly increasing, we can compute the hazard function as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Clambda_R(r)%20=%20%5Cfrac%7Bf_R(r)%7D%7BS_R(r)%7D%20=%20%5Cfrac%7B%5Cfrac%7Br%7D%7B%5Csigma%5E2%7D%20e%5E%7B-%5Cfrac%7Br%5E2%7D%7B2%5Csigma%5E2%7D%7D%7D%7Be%5E%7B-%5Cfrac%7Br%5E2%7D%7B2%5Csigma%5E2%7D%7D%7D%20=%20%5Cfrac%7Br%7D%7B%5Csigma%5E2%7D"> which is linearly increasing in <img src="https://latex.codecogs.com/png.latex?r">. Thus, the magnitude of the vibration <img src="https://latex.codecogs.com/png.latex?R"> follows a Rayleigh distribution with parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20%5Cfrac%7B1%7D%7B%5Csigma%5E2%7D">, and its hazard function is linearly increasing in <img src="https://latex.codecogs.com/png.latex?r">. Before we end this section, let’s stare at the expression for the CDF of <img src="https://latex.codecogs.com/png.latex?R"> for a moment above. Notice that the substitution <img src="https://latex.codecogs.com/png.latex?u%20=%20%5Cfrac%7Br%5E2%7D%7B2%5Csigma%5E2%7D"> we used earlier was not just a computational trick. If we define <img src="https://latex.codecogs.com/png.latex?W%20=%20%5Cfrac%7BR%5E2%7D%7B2%5Csigma%5E2%7D">, then:</p>
<p><img src="https://latex.codecogs.com/png.latex?P(W%20%5Cleq%20w)%20=%20P%5Cleft(%5Cfrac%7BR%5E2%7D%7B2%5Csigma%5E2%7D%20%5Cleq%20w%5Cright)%20=%201%20-%20e%5E%7B-w%7D"></p>
<p>which is exactly the CDF of a standard <img src="https://latex.codecogs.com/png.latex?%5Ctext%7BExponential%7D(1)"> random variable. So <img src="https://latex.codecogs.com/png.latex?W%20=%20%5Cfrac%7BR%5E2%7D%7B2%5Csigma%5E2%7D%20%5Csim%20%5Ctext%7BExponential%7D(1)">, or equivalently <img src="https://latex.codecogs.com/png.latex?R%5E2%20%5Csim%20%5Ctext%7BExponential%7D%5Cleft(%5Cfrac%7B1%7D%7B2%5Csigma%5E2%7D%5Cright)">. Note that you can show this derivation rigorously with the change of variables formula for PDFs but I am not going to do that here. The key takeaway is that the Rayleigh distribution can be obtained by applying a squaring transformation to an exponential random variable. The substitution variable <img src="https://latex.codecogs.com/png.latex?u"> was the exponential random variable all along — the Rayleigh and exponential distributions are secretly the same family, just related by a squaring transformation.</p>
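The whole detour can be verified by simulation: draw independent normals, form R, and compare empirical survival fractions with the Rayleigh and Exponential(1) predictions (a Monte Carlo sketch with an arbitrary choice of σ):

```python
import numpy as np

rng = np.random.default_rng(42)
sigma, n = 2.0, 200_000

# Independent X, Y ~ N(0, sigma^2); R = sqrt(X^2 + Y^2)
x = rng.normal(0.0, sigma, n)
y = rng.normal(0.0, sigma, n)
r = np.hypot(x, y)

# Rayleigh check: S_R(sigma) = exp(-sigma^2 / (2 sigma^2)) = e^{-1/2}
rayleigh_ok = abs((r > sigma).mean() - np.exp(-0.5)) < 0.01

# W = R^2 / (2 sigma^2) should be Exponential(1): S_W(1) = e^{-1}
w = r ** 2 / (2 * sigma ** 2)
expon_ok = abs((w > 1).mean() - np.exp(-1)) < 0.01

print(rayleigh_ok, expon_ok)  # True True
```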
<p>With the Rayleigh distribution under our belt, we now return to the main story. The exponential and Rayleigh are special cases of a more powerful family: one distribution to rule them all, one distribution to find them, one distribution to bring them all and in the darkness bind them. Enter the Weibull.</p>
</section>
<section id="the-weibull-distribution-a-unifying-family-of-distributions" class="level3">
<h3 class="anchored" data-anchor-id="the-weibull-distribution-a-unifying-family-of-distributions">The Weibull Distribution: A Unifying Family of Distributions</h3>
<p>We have seen two specific distributions that arise from particular shapes of the hazard function: the Exponential distribution from a constant hazard and the Rayleigh distribution from a linearly increasing hazard. A natural question is: What if the hazard follows a power law, i.e., <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20=%20%5Clambda%20t%5E%7B%5Cgamma-1%7D"> for some <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%3E%200">? When <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20=%201">, we recover the exponential distribution with a constant hazard. When <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20=%202">, we recover the Rayleigh distribution with a linearly increasing hazard. And for other values of <img src="https://latex.codecogs.com/png.latex?%5Cgamma">, we get an entire family of distributions known as the Weibull distribution. It is the workhorse of survival analysis and reliability engineering: flexible enough to model increasing, decreasing, or constant hazard, and simple enough to be analytically tractable.</p>
<p>Before we explore the Weibull distribution in detail, we must make a change. So far a single parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda"> has controlled both the shape of the hazard (how fast it grows) and the scale of the distribution (how long the system lasts). Mixing two roles into a single parameter is not ideal and makes it harder to understand the effect of each role on the distribution. Think of the normal distribution: if it were parameterized by a single number controlling both the mean and the variance, interpretation would be a nightmare. For the Weibull distribution, we separate these roles by introducing a scale parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20%3E%200">. Making the substitution <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20%5Cfrac%7B%5Cgamma%7D%7B%5Ctheta%5E%5Cgamma%7D">, the power-law hazard becomes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20=%20%5Cfrac%7B%5Cgamma%7D%7B%5Ctheta%5E%5Cgamma%7D%20t%5E%7B%5Cgamma%20-%201%7D%20=%20%5Cfrac%7B%5Cgamma%7D%7B%5Ctheta%7D%5Cleft(%5Cfrac%7Bt%7D%7B%5Ctheta%7D%5Cright)%5E%7B%5Cgamma-1%7D"></p>
<p>Now <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> controls the <strong>shape</strong> of the hazard — is it growing, shrinking, or flat? And <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> controls the <strong>scale</strong> — at what timescale are failures happening? Two parameters, two jobs, clean interpretation. We write <img src="https://latex.codecogs.com/png.latex?T%20%5Csim%20%5Ctext%7BWeibull%7D(%5Ctheta,%20%5Cgamma)"> to denote that the random variable <img src="https://latex.codecogs.com/png.latex?T"> follows a Weibull distribution with scale parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> and shape parameter <img src="https://latex.codecogs.com/png.latex?%5Cgamma">. The derivation follows the same pattern as before — integrate the hazard to get <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)">, exponentiate to get <img src="https://latex.codecogs.com/png.latex?S(t)">, and differentiate to get <img src="https://latex.codecogs.com/png.latex?f(t)">. I encourage you to verify this yourself.</p>
<p>Integrating the hazard function, we get the cumulative hazard function: <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)%20=%20%5Cint_0%5Et%20%5Clambda(s)%20ds%20=%20%5Cint_0%5Et%20%5Cfrac%7B%5Cgamma%7D%7B%5Ctheta%7D%5Cleft(%5Cfrac%7Bs%7D%7B%5Ctheta%7D%5Cright)%5E%7B%5Cgamma%20-%201%7D%20ds%20=%20%5Cleft(%5Cfrac%7Bt%7D%7B%5Ctheta%7D%5Cright)%5E%5Cgamma"></p>
<p>The probability density function (PDF) of the Weibull distribution is given by: <img src="https://latex.codecogs.com/png.latex?f(t)%20=%20%5Cfrac%7B%5Cgamma%7D%7B%5Ctheta%7D%5Cleft(%5Cfrac%7Bt%7D%7B%5Ctheta%7D%5Cright)%5E%7B%5Cgamma%20-%201%7D%20e%5E%7B-%5Cleft(%5Cfrac%7Bt%7D%7B%5Ctheta%7D%5Cright)%5E%5Cgamma%7D"></p>
<p>The cumulative distribution function (CDF) is: <img src="https://latex.codecogs.com/png.latex?F(t)%20=%201%20-%20e%5E%7B-%5Cleft(%5Cfrac%7Bt%7D%7B%5Ctheta%7D%5Cright)%5E%5Cgamma%7D"></p>
<p>And finally, the survival function is: <img src="https://latex.codecogs.com/png.latex?S(t)%20=%20e%5E%7B-%5Cleft(%5Cfrac%7Bt%7D%7B%5Ctheta%7D%5Cright)%5E%5Cgamma%7D"></p>
<p>The mean and variance of the Weibull distribution can be expressed in terms of the gamma function <img src="https://latex.codecogs.com/png.latex?%5CGamma(%5Ccdot)"> as follows: <img src="https://latex.codecogs.com/png.latex?E%5BT%5D%20=%20%5Ctheta%20%5CGamma%5Cleft(1%20+%20%5Cfrac%7B1%7D%7B%5Cgamma%7D%5Cright),%20%5Cqquad%20%5Ctext%7BVar%7D%5BT%5D%20=%20%5Ctheta%5E2%20%5Cleft%5B%5CGamma%5Cleft(1%20+%20%5Cfrac%7B2%7D%7B%5Cgamma%7D%5Cright)%20-%20%5Cleft(%5CGamma%5Cleft(1%20+%20%5Cfrac%7B1%7D%7B%5Cgamma%7D%5Cright)%5Cright)%5E2%5Cright%5D"></p>
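<p>These formulas are easy to cross-check numerically. The sketch below is my illustration, assuming <code>scipy</code> is available; its <code>weibull_min</code> distribution maps our shape and scale to the arguments <code>c</code> and <code>scale</code>:</p>

```python
import numpy as np
from math import gamma as gamma_fn
from scipy import stats

theta, gam = 2.0, 1.5          # scale and shape parameters
t = np.linspace(0.1, 5.0, 50)

# Formulas from the text
S = np.exp(-(t / theta) ** gam)                                              # survival
f = (gam / theta) * (t / theta) ** (gam - 1) * np.exp(-(t / theta) ** gam)   # PDF

# scipy's weibull_min: c = shape (our gamma), scale = our theta
dist = stats.weibull_min(c=gam, scale=theta)
assert np.allclose(S, dist.sf(t))
assert np.allclose(f, dist.pdf(t))

# Mean via the gamma function matches scipy's mean
assert np.isclose(theta * gamma_fn(1 + 1 / gam), dist.mean())
```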
<p>where the gamma function <img src="https://latex.codecogs.com/png.latex?%5CGamma(z)"> is defined as <img src="https://latex.codecogs.com/png.latex?%5CGamma(z)%20=%20%5Cint_0%5E%5Cinfty%20t%5E%7Bz-1%7D%20e%5E%7B-t%7D%20dt"> for <img src="https://latex.codecogs.com/png.latex?z%20%3E%200">. The mode of the Weibull distribution can be computed by finding the value of <img src="https://latex.codecogs.com/png.latex?t"> that maximizes the PDF, which gives us:</p>
<p><img src="https://latex.codecogs.com/png.latex?t_%7B%5Ctext%7Bmode%7D%7D%20=%20%5Ctheta%20%5Cleft(%5Cfrac%7B%5Cgamma%20-%201%7D%7B%5Cgamma%7D%5Cright)%5E%7B%5Cfrac%7B1%7D%7B%5Cgamma%7D%7D%20%5Cquad%20%5Ctext%7Bfor%20%7D%20%5Cgamma%20%3E%201"></p>
<p>For <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%5Cleq%201">, the PDF is maximized at <img src="https://latex.codecogs.com/png.latex?t%20=%200">, just like the exponential distribution. This makes sense — when the hazard is constant or decreasing, failures are most concentrated near the start.</p>
<p>The median can be computed by solving <img src="https://latex.codecogs.com/png.latex?F(t)%20=%200.5">, which gives us: <img src="https://latex.codecogs.com/png.latex?t_%7B1/2%7D%20=%20%5Ctheta%20(%5Cln(2))%5E%7B%5Cfrac%7B1%7D%7B%5Cgamma%7D%7D"></p>
<p>The scale parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> has a beautiful interpretation: At time <img src="https://latex.codecogs.com/png.latex?t%20=%20%5Ctheta">, the CDF is <img src="https://latex.codecogs.com/png.latex?F(%5Ctheta)%20=%201%20-%20e%5E%7B-1%7D%20%5Capprox%200.632">. This means that regardless of the shape parameter <img src="https://latex.codecogs.com/png.latex?%5Cgamma">, about 63.2% of the systems will have failed by the time we reach the scale parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta">. This is a neat property of the Weibull distribution and gives us an intuitive way to interpret the scale parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta"> — it is the time by which approximately 63.2% of the systems have failed.</p>
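<p>Both the median formula and the 63.2% property take one line each to verify. Here is a quick check with <code>scipy</code> (my illustration, not part of the post's code), sweeping over several shapes to show that the 63.2% property really is shape-independent:</p>

```python
import numpy as np
from scipy import stats

theta = 3.0
for gam in [0.5, 1.0, 2.0, 5.0]:
    d = stats.weibull_min(c=gam, scale=theta)
    # F(theta) = 1 - e^{-1} ≈ 0.632, regardless of the shape parameter
    assert np.isclose(d.cdf(theta), 1 - np.exp(-1))
    # Median formula: theta * (ln 2)^(1/gamma)
    assert np.isclose(d.median(), theta * np.log(2) ** (1 / gam))
```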
<section id="special-cases-of-the-weibull-distribution" class="level4">
<h4 class="anchored" data-anchor-id="special-cases-of-the-weibull-distribution">Special Cases of the Weibull Distribution</h4>
<p>As we have noted earlier, the Weibull distribution is a unifying family: the two distributions we have seen so far are special cases of it.</p>
<p><strong><img src="https://latex.codecogs.com/png.latex?%5Cgamma%20=%201">: The Exponential Distribution.</strong> When the shape parameter <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> is equal to 1, the hazard function simplifies to <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20=%20%5Cfrac%7B1%7D%7B%5Ctheta%7D">, which is a constant hazard. But we already know that a constant hazard corresponds to the exponential distribution. Thus a Weibull distribution with <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20=%201"> is equivalent to an exponential distribution with rate parameter <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20%5Cfrac%7B1%7D%7B%5Ctheta%7D"> or mean <img src="https://latex.codecogs.com/png.latex?E%5BT%5D%20=%20%5Ctheta">. In other words, <img src="https://latex.codecogs.com/png.latex?T%20%5Csim%20%5Ctext%7BWeibull%7D(%5Ctheta,%201)"> is the same as <img src="https://latex.codecogs.com/png.latex?T%20%5Csim%20%5Ctext%7BExponential%7D%5Cleft(%5Cfrac%7B1%7D%7B%5Ctheta%7D%5Cright)">. The physical interpretation of this is that when <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20=%201">, the system has a constant failure rate over time, which is a hallmark of the exponential distribution. Memorylessness (as we have explored earlier) is a direct consequence of constant hazard. Sudden unexpected failures, lightning strikes and random external shocks are examples of phenomena that can be modeled using the exponential distribution.</p>
<p><strong><img src="https://latex.codecogs.com/png.latex?%5Cgamma%20=%202">: The Rayleigh Distribution.</strong> When the shape parameter <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> is equal to 2, the hazard function simplifies to <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20=%20%5Cfrac%7B2%7D%7B%5Ctheta%5E2%7D%20t">, which is a linearly increasing hazard. We have already seen that a linearly increasing hazard <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20=%20%5Clambda%20t"> corresponds to the Rayleigh distribution. Matching coefficients gives <img src="https://latex.codecogs.com/png.latex?%5Clambda%20=%20%5Cfrac%7B2%7D%7B%5Ctheta%5E2%7D">, so <img src="https://latex.codecogs.com/png.latex?T%20%5Csim%20%5Ctext%7BWeibull%7D(%5Ctheta,%202)"> is the same as <img src="https://latex.codecogs.com/png.latex?T%20%5Csim%20%5Ctext%7BRayleigh%7D%5Cleft(%5Cfrac%7B2%7D%7B%5Ctheta%5E2%7D%5Cright)">. The physical interpretation is that when <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20=%202">, the system has a failure rate that increases linearly over time — the older it gets, the more dangerous the next instant. Wear-out failures, fatigue in materials, and aging processes are phenomena that can be modeled using the Rayleigh distribution.</p>
<p><strong><img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%3C%201">: Decreasing Hazard.</strong> When the shape parameter <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> is less than 1, the hazard function decreases over time. This models systems with high early failure rates that stabilize over time — think of manufacturing defects that cause early failures, while the units that survive are robust. This is sometimes called <strong>infant mortality</strong>. As we will see in the next section, decreasing hazard is just one phase of a richer failure pattern known as the <strong>bathtub curve</strong>.</p>
<p><strong><img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%3E%201">: Increasing Hazard.</strong> The hazard function increases over time. This models systems that wear out — the longer they run, the more likely they are to fail. This is the most common scenario in reliability engineering.</p>
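<p>The two exact equivalences above can be confirmed numerically. One caveat in the sketch below (my illustration, assuming <code>scipy</code>): <code>scipy</code> parameterizes the Rayleigh by the sigma of the underlying normals, so a Weibull with shape 2 and scale theta corresponds to <code>scale = theta / sqrt(2)</code> there:</p>

```python
import numpy as np
from scipy import stats

theta = 2.0
t = np.linspace(0.01, 10.0, 200)

# gamma = 1: Weibull(theta, 1) is Exponential with mean theta
assert np.allclose(stats.weibull_min.pdf(t, c=1, scale=theta),
                   stats.expon.pdf(t, scale=theta))

# gamma = 2: Weibull(theta, 2) is Rayleigh with sigma = theta / sqrt(2)
assert np.allclose(stats.weibull_min.pdf(t, c=2, scale=theta),
                   stats.rayleigh.pdf(t, scale=theta / np.sqrt(2)))
```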
<p>Let us look at what the Weibull distribution looks like for different values of <img src="https://latex.codecogs.com/png.latex?%5Cgamma"> and a fixed scale parameter <img src="https://latex.codecogs.com/png.latex?%5Ctheta%20=%201">.</p>
<div id="cell-fig-weibull" class="cell" data-execution_count="3">
<div class="cell-output cell-output-display">
<div id="fig-weibull" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-weibull-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://madhavpr191221.github.io/blog/posts/part2-distributions-in-survival-analysis/index_files/figure-html/fig-weibull-output-1.png" width="1718" height="469" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-weibull-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3: Weibull distribution: effect of shape parameter γ on survival function, density, and hazard (θ = 1)
</figcaption>
</figure>
</div>
</div>
</div>
</section>
</section>
<section id="the-bathtub-curve-a-common-failure-pattern-in-real-life" class="level3">
<h3 class="anchored" data-anchor-id="the-bathtub-curve-a-common-failure-pattern-in-real-life">The Bathtub Curve: A Common Failure Pattern in Real Life</h3>
<p>Unfortunately, real-world failures are often more complex than a single parametric distribution like the Weibull can capture. One common pattern observed in many systems is the <strong>bathtub curve</strong>, which describes a failure rate with three distinct phases:</p>
<ol type="1">
<li><p><strong>Phase 1: Infant Mortality (Decreasing Hazard, <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%3C%201">).</strong> In the early life of a system, there is a high failure rate due to manufacturing defects, installation errors, or weak components. Systems that survive this phase are typically more robust and have a lower failure rate.</p></li>
<li><p><strong>Phase 2: Useful Life (Constant Hazard, <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%5Capprox%201">).</strong> After the initial phase, the failure rate stabilizes and remains relatively constant. This is the “useful life” phase, where failures are mostly random and not due to wear-out. This is the exponential regime: memorylessness is a good approximation here.</p></li>
<li><p><strong>Phase 3: Wear-Out (Increasing Hazard, <img src="https://latex.codecogs.com/png.latex?%5Cgamma%20%3E%201">).</strong> As the system ages, components wear out and the failure rate increases. Failures become more likely as time goes on — the older the system, the more dangerous the next instant.</p></li>
</ol>
<p>Here is a schematic of the bathtub curve:</p>
<div id="cell-fig-bathtub" class="cell" data-execution_count="4">
<div class="cell-output cell-output-display">
<div id="fig-bathtub" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-bathtub-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://madhavpr191221.github.io/blog/posts/part2-distributions-in-survival-analysis/index_files/figure-html/fig-bathtub-output-1.png" width="852" height="374" class="figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-bathtub-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;4: The bathtub curve: three phases of failure
</figcaption>
</figure>
</div>
</div>
</div>
<p>The Weibull distribution can model each phase of the bathtub curve individually by adjusting the shape parameter <img src="https://latex.codecogs.com/png.latex?%5Cgamma">. However, a single Weibull cannot capture all three phases simultaneously — the hazard can only be monotonically increasing, decreasing, or constant. To model the full bathtub curve, reliability engineers often use a mixture of Weibull distributions or more flexible models that allow the hazard to change shape over time. This is an active area of research in reliability engineering and survival analysis, and we will revisit it in later parts of this series.</p>
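<p>One simple way to get a bathtub-shaped hazard is to treat the three phases as independent competing failure modes, whose hazards add. The sketch below is my illustration of that idea, with parameters chosen purely to produce the shape:</p>

```python
import numpy as np

def weibull_hazard(t, theta, gam):
    """Weibull hazard: (gam / theta) * (t / theta)^(gam - 1)."""
    return (gam / theta) * (t / theta) ** (gam - 1)

t = np.linspace(0.01, 10.0, 500)

# Independent competing failure modes: the hazards add.
h = (weibull_hazard(t, 1.0, 0.5)       # infant mortality (gamma < 1)
     + 0.05                            # random shocks (constant hazard)
     + weibull_hazard(t, 8.0, 4.0))    # wear-out (gamma > 1)

# The combined hazard is high early, low in mid-life, high late: a bathtub.
assert h[0] > h[len(h) // 2] and h[-1] > h[len(h) // 2]
```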
</section>
<section id="whats-next" class="level3">
<h3 class="anchored" data-anchor-id="whats-next">What’s Next?</h3>
<p>We now have a vocabulary of distributions to model different shapes of hazard functions: the exponential distribution for constant hazard, the Rayleigh distribution for linearly increasing hazard, and the Weibull distribution for power-law hazard. We have also seen how the Weibull unifies these special cases and provides a flexible framework for modeling a wide range of failure patterns. But a distribution, unfortunately, is not a model. In <strong>Part 3</strong>, we ask: given a dataset of failure times (some censored, some not), how do we fit a distribution to the data? How do we estimate its parameters? We will extend the maximum likelihood estimation framework to handle censored observations — the key ingredient that makes survival analysis different from standard statistical inference. We will revisit familiar diagnostic tools like the Q-Q plot and the K-S test to assess the quality of our fits. We will also write Python code to fit these distributions to real (synthetic) data using the library <code>lifelines</code>. Make no mistake: the series is still very mathematical, but we will also have plenty of code and practical examples to keep things grounded. The clock is still ticking.</p>
</section>
<section id="a-note-on-agentic-predictive-maintenance" class="level3">
<h3 class="anchored" data-anchor-id="a-note-on-agentic-predictive-maintenance">A Note on Agentic Predictive Maintenance</h3>
<p>The mathematical framework we are building in this series is not purely academic. Modern predictive maintenance systems are increasingly agentic, or at least striving to be — AI agents that continuously monitor machine health, estimate survival probabilities in real time, and autonomously trigger maintenance actions before failures occur. The survival function <img src="https://latex.codecogs.com/png.latex?S(t)"> and the hazard function <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)"> are the core quantities these agents reason about. An agent that knows a machine’s hazard rate is spiking can schedule maintenance, reroute workloads, or escalate to a human operator — all without being explicitly programmed for every failure scenario. We will dedicate a future part of this series to this intersection of survival analysis and agentic AI. For now, keep this application in the back of your mind as we build the mathematical foundations. And no — the mathematics is not going anywhere. Every agent decision we discuss will be grounded in the theory and proofs we are building right now.</p>


</section>
</section>

 ]]></description>
  <category>survival analysis</category>
  <category>statistics</category>
  <category>python</category>
  <guid>https://madhavpr191221.github.io/blog/posts/part2-distributions-in-survival-analysis/</guid>
  <pubDate>Fri, 17 Apr 2026 18:30:00 GMT</pubDate>
</item>
<item>
  <title>Part 1: Introduction to Survival Analysis</title>
  <dc:creator>Madhav Prashanth Ramachandran</dc:creator>
  <link>https://madhavpr191221.github.io/blog/posts/part-1-why-survival-analysis-exists/</link>
  <description><![CDATA[ 





<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note
</div>
</div>
<div class="callout-body-container callout-body">
<p><strong>A note on platform and process.</strong> Part 0 of this series was published on <a href="https://medium.com/@madhavwas/a-survival-guide-to-survival-analysis-1ed6faaf8fea">Medium</a>. Starting from Part 1, I have moved to Quarto on GitHub Pages for one reason: proper LaTeX rendering. A series this mathematical deserves a math-native home.</p>
<p>This article is not AI generated. I used Claude for proofreading, LaTeX syntax, and occasional structural feedback. Every derivation, example, and word is mine.</p>
</div>
</div>
<p>Time = 0 has passed. There’s a machine fresh off the assembly line, there’s a patient that got a second shot at life after a surgery, there’s a customer who just signed up with your services. The question is not whether — machines fail, patients die, customers leave. The question is <strong>when</strong>, and what we can say about that when, given everything we know. The branch of statistics that deals with answering this and other related questions is called survival analysis.</p>
<p>If you are trained in basic statistics and machine learning methods, your first instinct would probably be to reach for regression. Time is a continuous variable and (linear) regression — the workhorse of statistics — is the best fit for dealing with continuous variables. Right? No.&nbsp;Wrong. Here are the reasons regression breaks.</p>
<p><strong>1.</strong> Regression does not place any constraints on the range of the response variable whereas the time to failure is always non-negative.</p>
<p><strong>2.</strong> In the linear regression setup, the fitted value is an estimate of the conditional expectation of the response given the covariates, i.e., <img src="https://latex.codecogs.com/png.latex?E%5BY%20%5Cmid%20X%5D">. For survival analysis, conditional expectation is sometimes not the right quantity to estimate. In a clinical setting, you might care about “What fraction of patients survive beyond 10 years post surgery?” In a predictive maintenance setting, an engineer might ask “At what rate are machines failing after 10,000 hours of operation?” These are fundamentally different questions from “what is the average time to failure given these covariates?” — and regression, by construction, can only answer the latter.</p>
<p><strong>3.</strong> And here comes the most important part. Imagine you are studying the time-to-failure of 10 machines. You run the study for 10,000 hours and stop. At the end of your study, 6 machines have failed and you know their exact failure times. The 4 machines that survived are, well, still running and all you know is that their survival time is greater than 10,000. If you naively regress on the six machines that failed and discard the ones that did not, your estimate of the time to failure is biased downward — because the machines you discarded are precisely the most durable ones in your sample. There is a name for this situation which we will get to very shortly.</p>
<section id="the-setup" class="level2">
<h2 class="anchored" data-anchor-id="the-setup">The Setup</h2>
<p>From here on, let’s fix our language. We’ll talk about the lifetime <img src="https://latex.codecogs.com/png.latex?T"> of a system — a machine, a patient, a customer — and the results will be general enough to apply to all of them.</p>
<p>Let <img src="https://latex.codecogs.com/png.latex?T"> be the lifetime of a system — a non-negative continuous random variable with Probability Density Function (PDF) <img src="https://latex.codecogs.com/png.latex?f"> and Cumulative Distribution Function (CDF) <img src="https://latex.codecogs.com/png.latex?F">, which is the probability that the system’s lifetime is at most <img src="https://latex.codecogs.com/png.latex?t"> units. <img src="https://latex.codecogs.com/png.latex?%0AF(t)%20=%20P(T%20%5Cleq%20t)%0A"></p>
<p>The Survival Function of <img src="https://latex.codecogs.com/png.latex?T">, denoted <img src="https://latex.codecogs.com/png.latex?S(t)">, is the probability that the system survives at least <img src="https://latex.codecogs.com/png.latex?t"> units of time:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AS(t)%20=%20P(T%20%3E%20t)%0A"></p>
<p>The CDF and the Survival Function are related by the following simple relationship:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AS(t)%20+%20F(t)%20=%201%0A"></p>
</section>
<section id="censoring" class="level2">
<h2 class="anchored" data-anchor-id="censoring">Censoring</h2>
<p>Censoring is the reason survival analysis exists as a separate field of statistics. It is a situation where the exact time-to-event is unknown for the subject. All we know is that the failure had not occurred by the time we stopped observing — either because the study ended, or the subject was lost to follow-up. Going back to our machine example, the four machines that are still running after 10,000 hours are censored machines.</p>
<p>Censoring has three types.</p>
<ol type="1">
<li><strong>Right censoring</strong> — the most common. The study ends before the event of interest occurs. The four machines that are still running after 10,000 hours are right censored.</li>
<li><strong>Left censoring</strong> — The failure happened before you started observing. You inspect a machine for the first time and find it has already failed — you know failure occurred, but not when. Or a patient comes in with a disease that has already progressed to a certain stage — you know the disease started but not when.</li>
<li><strong>Interval censoring</strong> — You don’t know when the system failed but it happened between two inspection times.</li>
</ol>
<p>We will work with right censored data for pretty much the entire 12-part series.</p>
<blockquote class="blockquote">
<p><strong>Censoring vs.&nbsp;Missing Data.</strong> A missing value tells you nothing about a variable whereas a censored value conveys partial but concrete information. A machine that survived beyond 10,000 hours tells you exactly that — it lasted <em>at least</em> that long. That lower bound is real, and throwing it away is a statistical crime.</p>
</blockquote>
<p>Here is a formal mathematical setup of right censored data. We have <img src="https://latex.codecogs.com/png.latex?n"> independent and identically distributed samples of the form <img src="https://latex.codecogs.com/png.latex?(Y_i,%20%5Cdelta_i)"> where <img src="https://latex.codecogs.com/png.latex?Y_i%20=%20%5Cmin(T_i,%20C_i)"> and <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20=%20%5Cmathbb%7B1%7D%5C%7BT_i%20%5Cleq%20C_i%5C%7D">. Here <img src="https://latex.codecogs.com/png.latex?T_i"> is the lifetime of the <img src="https://latex.codecogs.com/png.latex?i">-th system and <img src="https://latex.codecogs.com/png.latex?C_i"> is the censoring time for that system. <img src="https://latex.codecogs.com/png.latex?%5Cdelta"> is an indicator variable which is 1 if failure happened before censoring, 0 otherwise. <img src="https://latex.codecogs.com/png.latex?Y_i"> is the observed time, which is the minimum of the true lifetime and the censoring time.</p>
<p>You never observe <img src="https://latex.codecogs.com/png.latex?T_i"> and <img src="https://latex.codecogs.com/png.latex?C_i"> separately — only their minimum and whether the event got there first. Here’s what you know about the <img src="https://latex.codecogs.com/png.latex?i">-th system based on the observed data:</p>
<ul>
<li>If <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20=%201">, failure happened before the study ended. You know exactly when.</li>
<li>If <img src="https://latex.codecogs.com/png.latex?%5Cdelta_i%20=%200">, the unit was still alive when you stopped watching. You only know that <img src="https://latex.codecogs.com/png.latex?T_i%20%3E%20Y_i">.</li>
</ul>
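<p>This data-generating mechanism is easy to simulate. The sketch below is my illustration (not the post's code): it assumes exponential lifetimes and administrative censoring at 10,000 hours, mirroring the machine example:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10

T = rng.exponential(scale=8_000, size=n)   # latent lifetimes (never fully observed)
C = np.full(n, 10_000.0)                   # administrative censoring at 10,000 hours

Y = np.minimum(T, C)                       # observed time
delta = (T <= C).astype(int)               # 1 = failure observed, 0 = censored

for y, d in zip(Y, delta):
    print(f"Y = {y:8.1f}  delta = {d}")
```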
<p>Here is what a survival analysis dataset looks like in the real world; each row is one machine.</p>
<table class="caption-top table">
<caption>Survival data for 10 machines. <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20=%201"> indicates failure observed; <img src="https://latex.codecogs.com/png.latex?%5Cdelta%20=%200"> indicates censoring.</caption>
<thead>
<tr class="header">
<th style="text-align: center;">Machine ID</th>
<th style="text-align: center;"><img src="https://latex.codecogs.com/png.latex?Y%20=%20%5Cmin(T,%20C)"> (hours)</th>
<th style="text-align: center;"><img src="https://latex.codecogs.com/png.latex?%5Cdelta"></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">M01</td>
<td style="text-align: center;">2,341</td>
<td style="text-align: center;">1</td>
</tr>
<tr class="even">
<td style="text-align: center;">M02</td>
<td style="text-align: center;">10,000</td>
<td style="text-align: center;">0</td>
</tr>
<tr class="odd">
<td style="text-align: center;">M03</td>
<td style="text-align: center;">7,823</td>
<td style="text-align: center;">1</td>
</tr>
<tr class="even">
<td style="text-align: center;">M04</td>
<td style="text-align: center;">10,000</td>
<td style="text-align: center;">0</td>
</tr>
<tr class="odd">
<td style="text-align: center;">M05</td>
<td style="text-align: center;">1,205</td>
<td style="text-align: center;">1</td>
</tr>
<tr class="even">
<td style="text-align: center;">M06</td>
<td style="text-align: center;">9,441</td>
<td style="text-align: center;">1</td>
</tr>
<tr class="odd">
<td style="text-align: center;">M07</td>
<td style="text-align: center;">10,000</td>
<td style="text-align: center;">0</td>
</tr>
<tr class="even">
<td style="text-align: center;">M08</td>
<td style="text-align: center;">4,678</td>
<td style="text-align: center;">1</td>
</tr>
<tr class="odd">
<td style="text-align: center;">M09</td>
<td style="text-align: center;">10,000</td>
<td style="text-align: center;">0</td>
</tr>
<tr class="even">
<td style="text-align: center;">M10</td>
<td style="text-align: center;">6,102</td>
<td style="text-align: center;">1</td>
</tr>
</tbody>
</table>
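<p>Using the numbers from this table, we can see the downward bias from discarding censored machines that reason 3 warned about. A quick illustration with <code>numpy</code>:</p>

```python
import numpy as np

Y = np.array([2341, 10000, 7823, 10000, 1205, 9441, 10000, 4678, 10000, 6102])
delta = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1])

# Naive estimate: average only the machines that failed
naive_mean = Y[delta == 1].mean()

# Even treating censored times as if they were failure times (also biased
# downward, since those machines lasted *longer* than 10,000 hours) gives
# a much larger number — the censored machines are the durable ones.
mean_with_censored = Y.mean()

print(naive_mean, mean_with_censored)  # 5265.0 7159.0
```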
</section>
<section id="the-hazard-function" class="level2">
<h2 class="anchored" data-anchor-id="the-hazard-function">The Hazard Function</h2>
<p>We have established that <img src="https://latex.codecogs.com/png.latex?S(t)"> tells us the probability of surviving past time <img src="https://latex.codecogs.com/png.latex?t">. But consider a different and more pointed question: given that a system has already survived until time <img src="https://latex.codecogs.com/png.latex?t_0">, how likely is it to fail in the next small instant?</p>
<p>Let <img src="https://latex.codecogs.com/png.latex?%5CDelta%20t"> be a small time interval. The probability that a system which has survived at least <img src="https://latex.codecogs.com/png.latex?t_0"> units of time fails in <img src="https://latex.codecogs.com/png.latex?%5Bt_0,%20t_0%20+%20%5CDelta%20t%5D"> is, by the definition of conditional probability:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AP(t_0%20%5Cleq%20T%20%5Cleq%20t_0%20+%20%5CDelta%20t%20%5Cmid%20T%20%3E%20t_0)%20=%20%5Cfrac%7BP(t_0%20%3C%20T%20%5Cleq%20t_0%20+%20%5CDelta%20t)%7D%7BP(T%20%3E%20t_0)%7D%20=%20%5Cfrac%7BF(t_0%20+%20%5CDelta%20t)%20-%20F(t_0)%7D%7BS(t_0)%7D%0A"></p>
<p>Let us call this quantity <img src="https://latex.codecogs.com/png.latex?G(t_0)">. Dividing both sides by <img src="https://latex.codecogs.com/png.latex?%5CDelta%20t">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BG(t_0)%7D%7B%5CDelta%20t%7D%20=%20%5Cfrac%7B1%7D%7BS(t_0)%7D%20%5Ccdot%20%5Cfrac%7BF(t_0%20+%20%5CDelta%20t)%20-%20F(t_0)%7D%7B%5CDelta%20t%7D%0A"></p>
<p>Now, where have you seen an expression of the form <img src="https://latex.codecogs.com/png.latex?(f(a%20+%20h)%20-%20f(a))%20/%20h"> where <img src="https://latex.codecogs.com/png.latex?h"> is small? That’s right — calculus. As <img src="https://latex.codecogs.com/png.latex?%5CDelta%20t%20%5Cto%200">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BF(t_0%20+%20%5CDelta%20t)%20-%20F(t_0)%7D%7B%5CDelta%20t%7D%20%5Cto%20f(t_0)%0A"></p>
<p>because the derivative of the CDF is the PDF. So in the limit:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clim_%7B%5CDelta%20t%20%5Cto%200%7D%20%5Cfrac%7BG(t_0)%7D%7B%5CDelta%20t%7D%20=%20%5Cfrac%7Bf(t_0)%7D%7BS(t_0)%7D%0A"></p>
<p>This quantity has a name. It is the <strong>hazard function</strong> of <img src="https://latex.codecogs.com/png.latex?T"> evaluated at <img src="https://latex.codecogs.com/png.latex?t_0">, denoted <img src="https://latex.codecogs.com/png.latex?%5Clambda(t_0)">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cboxed%7B%5Clambda(t_0)%20=%20%5Cfrac%7Bf(t_0)%7D%7BS(t_0)%7D%7D%0A"></p>
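<p>To see the limit at work numerically, here is a small sketch (an illustration, not part of the derivation; the Exponential lifetime with rate 0.001 per hour is an arbitrary choice) showing that the quotient above converges to <img src="https://latex.codecogs.com/png.latex?f(t_0)/S(t_0)"> as the interval shrinks:</p>

```python
import math

# Assumed model for illustration: Exponential lifetime, rate 0.001 per hour.
RATE = 0.001

def F(t):  # CDF: probability of failure by time t
    return 1.0 - math.exp(-RATE * t)

def S(t):  # survival function: probability of surviving past t
    return math.exp(-RATE * t)

def f(t):  # PDF: derivative of the CDF
    return RATE * math.exp(-RATE * t)

t0 = 5000.0
hazard_exact = f(t0) / S(t0)  # lambda(t0) = f(t0)/S(t0) = 0.001 here

for dt in (100.0, 10.0, 1.0, 0.01):
    # G(t0)/dt = [F(t0 + dt) - F(t0)] / [S(t0) * dt]
    quotient = (F(t0 + dt) - F(t0)) / (S(t0) * dt)
    print(dt, quotient)  # approaches hazard_exact = 0.001 as dt shrinks
```

For the Exponential distribution the quotient settles at the same constant regardless of <img src="https://latex.codecogs.com/png.latex?t_0"> — a preview of the constant-hazard property discussed in the next part.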
<p>Now I know what you’re thinking: the math is straightforward, but why in the world would I care about this silly expression?</p>
<section id="what-does-lambdat-actually-mean" class="level3">
<h3 class="anchored" data-anchor-id="what-does-lambdat-actually-mean">What does <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)"> actually mean?</h3>
<p>Recall that for a continuous random variable, probabilities over small intervals can be approximated using the PDF:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0AP(t_0%20%5Cleq%20T%20%5Cleq%20t_0%20+%20%5CDelta%20t)%20=%20%5Cint_%7Bt_0%7D%5E%7Bt_0%20+%20%5CDelta%20t%7D%20f(x)%5C,%20dx%20%5Capprox%20f(t_0)%20%5Ccdot%20%5CDelta%20t%0A"></p>
<p>So the conditional probability of failure in <img src="https://latex.codecogs.com/png.latex?%5Bt_0,%20t_0%20+%20%5CDelta%20t%5D"> becomes:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Cfrac%7BP(t_0%20%5Cleq%20T%20%5Cleq%20t_0%20+%20%5CDelta%20t)%7D%7BS(t_0)%7D%20%5Capprox%20%5Cfrac%7Bf(t_0)%20%5Ccdot%20%5CDelta%20t%7D%7BS(t_0)%7D%20=%20%5Clambda(t_0)%20%5Ccdot%20%5CDelta%20t%0A"></p>
<p>This is the key insight. <img src="https://latex.codecogs.com/png.latex?%5Clambda(t_0)%20%5Ccdot%20%5CDelta%20t"> is approximately the conditional probability that a machine which has survived until <img src="https://latex.codecogs.com/png.latex?t_0"> will fail in the next <img src="https://latex.codecogs.com/png.latex?%5CDelta%20t"> hours. This is why <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)"> is called the <strong>rate of failure per unit time</strong> — it tells you, at every moment, how risky the next instant is for a system that has made it this far.</p>
<p>Let us clear up any confusion between the probability density function <img src="https://latex.codecogs.com/png.latex?f(t)"> and the hazard function <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)">. The probability density function <img src="https://latex.codecogs.com/png.latex?f(t)"> is the <strong>unconditional density</strong> of failure at <img src="https://latex.codecogs.com/png.latex?t">, whereas <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)"> conditions on survival up to <img src="https://latex.codecogs.com/png.latex?t">. To be explicit, <img src="https://latex.codecogs.com/png.latex?f(t)"> is a probability <em>density</em>, not a probability. <img src="https://latex.codecogs.com/png.latex?f(5000)%20=%200.0001"> does not mean “the probability of failure at exactly 5000 hours is 0.0001.” It means the probability of failure in a small interval around 5000 hours is approximately <img src="https://latex.codecogs.com/png.latex?f(5000)%20%5Ccdot%20%5CDelta%20t%20=%0A0.0001%20%5Ccdot%20%5CDelta%20t">. Here is a concrete example to illustrate the difference.</p>
<p>Consider two machines — call them <strong>Machine A</strong> and <strong>Machine B</strong>. At <img src="https://latex.codecogs.com/png.latex?t_0%20=%205000"> hours, both have the same unconditional failure density: <img src="https://latex.codecogs.com/png.latex?f(5000)%20=%200.0001">. A naive reading suggests they are equally “failure-prone” at this moment. They are not.</p>
<ul>
<li><strong>Machine A</strong> has <img src="https://latex.codecogs.com/png.latex?S(5000)%20=%200.9"> — 90% of such machines survive to 5000 hours. Reaching this point is unremarkable.</li>
<li><strong>Machine B</strong> has <img src="https://latex.codecogs.com/png.latex?S(5000)%20=%200.1"> — only 10% of such machines survive to 5000 hours. This machine is a true survivor against the odds.</li>
</ul>
<p>Their hazard rates at <img src="https://latex.codecogs.com/png.latex?t_0%20=%205000">:</p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clambda_A(5000)%20=%20%5Cfrac%7Bf(5000)%7D%7BS_A(5000)%7D%20=%20%5Cfrac%7B0.0001%7D%7B0.9%7D%20%5Capprox%200.000111%0A"></p>
<p><img src="https://latex.codecogs.com/png.latex?%0A%5Clambda_B(5000)%20=%20%5Cfrac%7Bf(5000)%7D%7BS_B(5000)%7D%20=%20%5Cfrac%7B0.0001%7D%7B0.1%7D%20=%200.001%0A"></p>
<p>Machine B’s hazard rate is <strong>9 times higher</strong> than Machine A’s at the exact same moment, despite having the same <img src="https://latex.codecogs.com/png.latex?f(5000)">. The reason is simple: Machine B has survived longer than 90% of its kind. The few that remain are under severe stress — and the hazard function knows this. <img src="https://latex.codecogs.com/png.latex?f(t)"> does not. This is the power of conditioning on survival. The hazard function carries information that the PDF simply cannot.</p>
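<p>The Machine A versus Machine B arithmetic above, spelled out in a few lines of code:</p>

```python
# Same unconditional failure density at t0 = 5000 hours for both machines,
# but very different survival probabilities (numbers from the example above).
f_t0 = 0.0001          # shared failure density at t0
S_A, S_B = 0.9, 0.1    # survival probabilities at t0

hazard_A = f_t0 / S_A  # lambda_A(5000) ~ 0.000111
hazard_B = f_t0 / S_B  # lambda_B(5000) = 0.001

print(hazard_B / hazard_A)  # ratio is 9 (up to float rounding)
```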
</section>
<section id="another-way-to-write-lambdat-an-introduction-to-the-cumulative-hazard-function" class="level3">
<h3 class="anchored" data-anchor-id="another-way-to-write-lambdat-an-introduction-to-the-cumulative-hazard-function">Another way to write <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)">, an introduction to the cumulative hazard function</h3>
<p>The hazard function can also be expressed purely in terms of the survival function. Starting from the definition <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20=%20%5Cfrac%7Bf(t)%7D%7BS(t)%7D%0A">, we rewrite the PDF in terms of the survival function.</p>
<p>The survival function is related to the CDF by <img src="https://latex.codecogs.com/png.latex?S(t)%20=%201%20-%20F(t)">. Differentiating both sides with respect to <img src="https://latex.codecogs.com/png.latex?t"> gives us <img src="https://latex.codecogs.com/png.latex?f(t)%20=%20-%5Cfrac%7Bd%7D%7Bdt%7DS(t)">.</p>
<p>Substituting this into the hazard function: <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20=%20%5Cfrac%7B-%5Cfrac%7Bd%7D%7Bdt%7DS(t)%7D%7BS(t)%7D%20=%20-%5Cfrac%7B1%7D%7BS(t)%7D%20%5Ccdot%20%5Cfrac%7Bd%7D%7Bdt%7DS(t)"> By the chain rule, <img src="https://latex.codecogs.com/png.latex?%5Cfrac%7Bd%7D%7Bdt%7D%5Clog%20S(t)%20=%20%5Cfrac%7B1%7D%7BS(t)%7D%20%5Ccdot%20%5Cfrac%7Bd%7D%7Bdt%7DS(t)">, so this can be rewritten as: <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20=%20-%5Cfrac%7Bd%7D%7Bdt%7D%5Clog%20S(t)"></p>
<p>This expression shows that the hazard function is the negative derivative of the logarithm of the survival function. Integrating both sides from <img src="https://latex.codecogs.com/png.latex?0"> to <img src="https://latex.codecogs.com/png.latex?t">, and using the fact that <img src="https://latex.codecogs.com/png.latex?S(0)%20=%201"> so that <img src="https://latex.codecogs.com/png.latex?%5Clog%20S(0)%20=%200">, gives us:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cboxed%7B%5CLambda(t)%20=%20%5Cint_0%5Et%20%5Clambda(u)%5C,du%20=%20-%5Clog%20S(t)%7D"></p>
<p>This integral is known as the <strong>cumulative hazard function</strong>, often denoted by <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)"> or <img src="https://latex.codecogs.com/png.latex?H(t)">. This relationship between the cumulative hazard function and the survival function is fundamental in survival analysis. We can express the survival function in terms of the cumulative hazard function as follows:</p>
<p><img src="https://latex.codecogs.com/png.latex?%5Cboxed%7BS(t)%20=%20e%5E%7B-%5CLambda(t)%7D%7D"></p>
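<p>We can check the identity <img src="https://latex.codecogs.com/png.latex?S(t)%20=%20e%5E%7B-%5CLambda(t)%7D"> numerically. As a sketch (the linearly increasing hazard <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)%20=%20ct"> and the value of <img src="https://latex.codecogs.com/png.latex?c"> are illustrative choices; this hazard shape reappears later as the Rayleigh distribution), we integrate the hazard with the trapezoidal rule and compare against the closed form <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)%20=%20ct%5E2/2">:</p>

```python
import math

c = 1e-6  # arbitrary slope for the illustrative hazard lambda(t) = c*t

def hazard(t):
    return c * t

def cumulative_hazard(t, steps=10_000):
    # Lambda(t) = integral from 0 to t of lambda(u) du, trapezoidal rule
    h = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(i * h) for i in range(1, steps))
    return total * h

t = 2000.0
S_numeric = math.exp(-cumulative_hazard(t))  # S(t) = exp(-Lambda(t))
S_exact = math.exp(-c * t**2 / 2)            # closed form Lambda(t) = c t^2 / 2
print(S_numeric, S_exact)                    # both ~ exp(-2) ~ 0.1353
```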
</section>
<section id="what-does-the-cumulative-hazard-function-lambdat-mean" class="level3">
<h3 class="anchored" data-anchor-id="what-does-the-cumulative-hazard-function-lambdat-mean">What does the cumulative hazard function <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)"> mean?</h3>
<p>The cumulative hazard function <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)"> can be interpreted as the accumulated risk of failure or death up to time <img src="https://latex.codecogs.com/png.latex?t"> since the start of observation. Think of it as a “risk score” that increases over time. The higher <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)"> is, the more likely the system is to fail by time <img src="https://latex.codecogs.com/png.latex?t">.</p>
<p>However, the most important consequence of this relationship is that the survival probability decays exponentially in the cumulative hazard: if <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)"> grows linearly over time, <img src="https://latex.codecogs.com/png.latex?S(t)"> decays exponentially, and if <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)"> grows faster, <img src="https://latex.codecogs.com/png.latex?S(t)"> decays faster still. This exponential relationship is the bridge between the hazard function (which can be estimated from data, as we will see in later sections) and the survival function (which is often what engineers or clinicians actually care about).</p>
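<p>A quick sketch of the linear-cumulative-hazard case (the constant hazard rate of 0.001 per hour is a hypothetical choice): when <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)"> grows linearly, the fraction surviving any fixed window is the same no matter when the window starts — the signature of exponential decay.</p>

```python
import math

rate = 0.001  # hypothetical constant hazard, so Lambda(t) = rate * t is linear

def S(t):
    return math.exp(-rate * t)  # S(t) = exp(-Lambda(t))

# Survival ratio over a 500-hour window, starting at different times:
for start in (0.0, 1000.0, 5000.0):
    print(S(start + 500.0) / S(start))  # always exp(-0.5) ~ 0.6065
```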
</section>
<section id="the-unified-framework-of-survival-analysis" class="level3">
<h3 class="anchored" data-anchor-id="the-unified-framework-of-survival-analysis">The Unified Framework of Survival Analysis</h3>
<p>We have met three key functions in survival analysis: the PDF <img src="https://latex.codecogs.com/png.latex?f(t)">, the survival function <img src="https://latex.codecogs.com/png.latex?S(t)">, and the hazard function <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)">. These functions are not independent of each other. Know any one of them — you know them all.</p>
<p><img src="https://latex.codecogs.com/png.latex?f(t)%20%5Clongleftrightarrow%20S(t)%20%5Clongleftrightarrow%20%5CLambda(t)%20%5Clongleftrightarrow%20%5Clambda(t)"></p>
<p>Here’s how:</p>
<p><strong>From <img src="https://latex.codecogs.com/png.latex?f(t)">:</strong> <img src="https://latex.codecogs.com/png.latex?S(t)%20=%201%20-%20%5Cint_0%5Et%20f(u)%5C,du,%20%5Cqquad%20%5Clambda(t)%20=%20%5Cfrac%7Bf(t)%7D%7BS(t)%7D"></p>
<p><strong>From <img src="https://latex.codecogs.com/png.latex?S(t)">:</strong> <img src="https://latex.codecogs.com/png.latex?f(t)%20=%20-%5Cfrac%7Bd%7D%7Bdt%7DS(t),%20%5Cqquad%20%5Clambda(t)%20=%20-%5Cfrac%7Bd%7D%7Bdt%7D%5Clog%20S(t)"></p>
<p><strong>From <img src="https://latex.codecogs.com/png.latex?%5Clambda(t)">:</strong> <img src="https://latex.codecogs.com/png.latex?%5CLambda(t)%20=%20%5Cint_0%5Et%20%5Clambda(u)%5C,du,%20%5Cqquad%20S(t)%20=%20e%5E%7B-%5CLambda(t)%7D,%20%5Cqquad%20f(t)%20=%20%5Clambda(t)%5Ccdot%20e%5E%7B-%5CLambda(t)%7D"></p>
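<p>The conversion chain above can be sketched in code. Starting from an assumed hazard function (a constant 0.002 per hour, chosen only for illustration), we recover the cumulative hazard, the survival function, and the PDF:</p>

```python
import math

def hazard(t):
    return 0.002  # assumed constant hazard for illustration

def cumulative_hazard(t, steps=10_000):
    # Lambda(t) = integral from 0 to t of lambda(u) du, trapezoidal rule
    h = t / steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    total += sum(hazard(i * h) for i in range(1, steps))
    return total * h

def survival(t):
    return math.exp(-cumulative_hazard(t))  # S(t) = exp(-Lambda(t))

def pdf(t):
    return hazard(t) * survival(t)          # f(t) = lambda(t) * exp(-Lambda(t))

t = 800.0
print(survival(t))  # exp(-1.6) ~ 0.2019
print(pdf(t))       # 0.002 * exp(-1.6) ~ 0.000404
```

Knowing only the hazard, we reconstructed everything else — which is precisely why models like Cox regression can afford to work with the hazard alone.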
<p>In survival analysis, we can choose to work with any of these functions depending on the context and the question at hand. The Cox Proportional Hazards model, for example, is a regression model that directly models the hazard function. The Kaplan-Meier estimator is a non-parametric estimator of the survival function. The choice of which function to work with is often guided by the nature of the data and the specific research question being addressed.</p>
</section>
<section id="whats-next" class="level3">
<h3 class="anchored" data-anchor-id="whats-next">What’s next?</h3>
<p>We now have the language of survival analysis. We know what the key functions are and how they relate to each other. In the next part, we will meet some commonly used distributions in survival analysis and see how each is characterized purely by its hazard function. We will start with the simplest: the distribution with a constant hazard function, the well-known <strong>Exponential distribution</strong>, and we will understand its memorylessness property and its consequences. We will then move on to a linearly increasing hazard, which gives the <strong>Rayleigh distribution</strong>, and a decreasing hazard, exemplified by the <strong>Pareto distribution</strong>. We will see how the shape of the hazard function dictates the shape of the survival curve and what that means in real life. And finally, we will meet the workhorse of survival analysis and reliability engineering — the <strong>Weibull distribution</strong>, which can model a wide variety of hazard shapes and is used in a huge range of applications. We will also see how to estimate the parameters of these distributions from data and how to use them for prediction. The clock is still ticking. See you in Part 2.</p>


</section>
</section>

 ]]></description>
  <category>survival analysis</category>
  <category>statistics</category>
  <guid>https://madhavpr191221.github.io/blog/posts/part-1-why-survival-analysis-exists/</guid>
  <pubDate>Fri, 03 Apr 2026 18:30:00 GMT</pubDate>
  <media:content url="https://madhavpr191221.github.io/blog/posts/part-1-why-survival-analysis-exists/cover.png" medium="image" type="image/png" height="86" width="144"/>
</item>
</channel>
</rss>
