Doubly and Randomly Censored Data
A derivation of the density functions and likelihood expression associated with doubly and randomly censored data.
Censored data is an artifact of partial or incomplete measurements.
A typical scenario is a survival analysis of time-to-event data: a study may end before a final measurement is available (right censoring). Another arises when batch processing log file data: the reported timestamp may reflect the time of processing rather than the true event time (left censoring).
This post derives the density equations for censored data. Given a parameterization \(\theta\), this leads naturally to a log likelihood formulation. As the censoring mechanism is, in general, random, we further allow for the possibility that this too depends on \(\theta\).
Let \(\rvz\) be the uncensored random variable; \(\rvl\) and \(\rvr\), the left and right censoring thresholds, respectively; and \(\rvd\), an observed random variable denoting whether \(\rvz\) was uncensored (0), censored only on the left (1), censored only on the right (2), or censored on both left and right (3).
Let \(\rvy\) be the observed value associated with each value of \(\rvd\), defined as follows:
\[ \rvy = \begin{cases} \rvz & \rvd = 0 \\ \rvl & \rvd = 1 \\ \rvr & \rvd = 2 \\ (\rvl, \rvr) & \rvd = 3 \end{cases} \]
The last case addresses the scenario where a non-negative random variable has limited precision: when measurements are only accurate above some threshold, it may be more sensible to proceed as though we know only that such values fall somewhere between zero and the threshold.
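To make the encoding concrete, here is a minimal sketch classifying a single observation (the helper name `censor` is ours, not part of any library):
"""
Sketch: encoding (d, y) from (z, l, r)
"""
def censor(z, l, r):
    # Map one (z, l, r) triple to the observed pair (d, y) defined above.
    if l < z < r:
        return (0, z)        # fully observed
    if z <= l and z < r:
        return (1, l)        # left censored: know only z <= l
    if l < z and r <= z:
        return (2, r)        # right censored: know only z >= r
    return (3, (l, r))       # doubly censored: know only r <= z <= l

censor(1.2, 0.4, 2.5)  # (0, 1.2)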
We assume that \(\rvl\), \(\rvr\), and \(\rvz\) are independent with distributions \(G_L\), \(G_R\), \(F\) and densities \(g_L\), \(g_R\), \(f\), respectively. Conditioning on one variable at a time and applying independence (e.g., given \(\rvz = u\), the event \(\rvl < u < \rvr\) has probability \(G_L(u)(1 - G_R(u))\)), we obtain:
\[ \prob{ \rvl < \rvz < \rvr } = \int f(u) G_L(u) (1 - G_R(u)) \,du \]
\[ \prob { \left(\rvz \leq \rvl\right) \cap \left(\rvz < \rvr \right)} = \int g_L(u) \underbrace{ \int_{-\infty}^u f(v) (1-G_R(v)) \, dv }_{ I_1(u) } \, du \]
\[ \prob { \left(\rvl < \rvz \right) \cap \left(\rvr \leq \rvz \right)} = \int g_R(u) \underbrace{ \int_u^{\infty} f(v) G_L(v) \, dv }_{ I_2(u) } \, du \]
\[ \prob { \left(\rvz \leq \rvl \right) \cap \left(\rvr \leq \rvz \right)} = \iint 1_{\left\{ v < u\right\}}(u,v) g_L(u) g_R(v) (F(u) - F(v)) \,du \,dv \]
Note that the events above are a partition of the underlying probability space.
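For concreteness, the second of these probabilities unwinds by conditioning first on \(\rvl = u\) and then on \(\rvz = v\); the remaining double integrals follow the same pattern:
\[ \begin{aligned} \prob{\left(\rvz \leq \rvl\right) \cap \left(\rvz < \rvr\right)} &= \int g_L(u) \, \prob{\left(\rvz \leq u\right) \cap \left(\rvz < \rvr\right)} \, du \\ &= \int g_L(u) \int_{-\infty}^{u} f(v) \, \prob{\rvr > v} \, dv \, du \\ &= \int g_L(u) \, I_1(u) \, du. \end{aligned} \]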
We consider the following model, taking \(\mu = 1\) and assuming known variances:
\[ \begin{gather*} \rvz \sim N\left(\mu, 1.0^2 \right), \\ \rvl \sim N\left(\frac{\mu}{2}, 0.25^2\right), \\ \rvr \sim N\left(2 \mu, 1.0^2\right). \end{gather*} \]
We implement this in Python as follows:
"""
Define the random variables
"""
import numpy as np
import pandas as pd
import scipy.stats
mu = 1.0
Z = scipy.stats.norm(loc=mu, scale=1.0)
Yl = scipy.stats.norm(loc=0.5*mu, scale=0.25)
Yr = scipy.stats.norm(loc=2.0*mu, scale=1.0)
(F,f) = (Z.cdf, Z.pdf)
(Gl,gl) = (Yl.cdf,Yl.pdf)
(Gr,gr) = (Yr.cdf, Yr.pdf)
Treating the censoring indicator \(\rvd\) as a multinomial random variable, the standard ML estimates of the cell probabilities are the observed proportions, computed below.
"""
Simulated ML estimates
"""
rng = np.random.default_rng(seed=42)
n = int(1e6)
rZ = Z.rvs(size=n, random_state = rng)
rYl = Yl.rvs(size=n, random_state = rng)
rYr = Yr.rvs(size=n, random_state = rng)
d0_idx = ((rYl < rZ) & (rZ < rYr))
d1_idx = ((rYl >= rZ) & (rZ < rYr))
d2_idx = ((rYl < rZ) & (rZ >= rYr))
d3_idx = ((rZ < rYl) & (rZ > rYr))
p0 = d0_idx.sum() / n
p1 = d1_idx.sum() / n
p2 = d2_idx.sum() / n
p3 = d3_idx.sum() / n
ps = [p0,p1,p2,p3]
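Since each \(p_i\) is an empirical proportion, its Monte Carlo standard error is \(\sqrt{p_i(1-p_i)/n}\). A quick check (the name `ses` is ours):
"""
Monte Carlo standard errors of the estimated proportions
"""
ses = np.array([np.sqrt(p * (1 - p) / n) for p in ps])
ses  # all below 1e-3 at n = 1e6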
Though slower, we can also evaluate the above probabilities directly via numerical integration:
"""
Direct numerical integration
"""
from scipy.integrate import quad, dblquad
fn0 = lambda u: Gl(u) * f(u) * (1-Gr(u))
q0, _ = quad(fn0, -np.inf, np.inf)
fn1 = lambda v,u: gl(u) * f(v) * (1-Gr(v))
q1, _ = dblquad(fn1, -np.inf, np.inf, gfun = lambda v: -np.inf, hfun = lambda u: u)
fn2 = lambda v,u: gr(u) * f(v) * Gl(v)
q2, _ = dblquad(fn2, -np.inf, np.inf, gfun = lambda u: u, hfun = lambda u: np.inf)
fn3 = lambda v,u: gl(u) * gr(v) * (F(u) -F(v))
q3, _ = dblquad(fn3, -np.inf, np.inf, gfun = lambda u: -np.inf, hfun = lambda u: u)
qs = [q0,q1,q2,q3]
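Because the four events partition the sample space, the directly computed probabilities should sum to one, up to quadrature error:
"""
Partition sanity check
"""
sum(qs)  # should be very close to 1.0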
"""
Comparing simulated estimates and direct calculations
"""
df = pd.DataFrame({
'MLE' : ps,
'Direct' : qs
})
df.index.name ='D'
df
D | MLE | Direct
---|---|---
0 | 0.456583 | 0.456117
1 | 0.304057 | 0.304133
2 | 0.229668 | 0.230070
3 | 0.009692 | 0.009680
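The two columns agree to within Monte Carlo error; using the `ses` computed earlier, every discrepancy sits comfortably inside three standard errors:
"""
Agreement within Monte Carlo error
"""
(df['MLE'] - df['Direct']).abs() <= 3 * ses  # True for every row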
The loglikelihood can be read off directly from the density equations above. The only complication is efficiently evaluating the inner integrals, \(I_1(u)\) and \(I_2(u)\).
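The code below calls \(I_1\) and \(I_2\) as functions. One simple, if slow, realization is pointwise quadrature; a faster implementation might instead tabulate both integrals on a grid and interpolate. This sketch is one possibility, not the only one:
"""
Inner integrals via pointwise quadrature
"""
# I1(u) = integral of f(v) (1 - G_R(v)) over v in (-inf, u);
# I2(u) = integral of f(v) G_L(v) over v in (u, inf).
I1 = np.vectorize(lambda u: quad(lambda v: f(v) * (1 - Gr(v)), -np.inf, u)[0])
I2 = np.vectorize(lambda u: quad(lambda v: f(v) * Gl(v), u, np.inf)[0])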
"""
Loglikelihood calculation
"""
rdf = pd.DataFrame({
'Yl' : rYl,
'Z' : rZ,
'Yr' : rYr
}
)
u = rdf.loc[d0_idx,'Z']
ll0 = Z.logpdf(u) + Yl.logcdf(u) + np.log( 1 - Yr.cdf(u) )
u = rdf.loc[d1_idx,'Yl']
ll1 = Yl.logpdf(u) + np.log(I1(u))
u = rdf.loc[d2_idx,'Yr']
ll2 = Yr.logpdf(u) + np.log(I2(u))
u = rdf.loc[d3_idx,'Yl']
v = rdf.loc[d3_idx,'Yr']
ll3 = Yl.logpdf(u) + Yr.logpdf(v) + np.log( Z.cdf(u) - Z.cdf(v) )
ll = ll0.sum() + ll1.sum() + ll2.sum() + ll3.sum()
ll
np.float64(-1747157.5396823615)
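Finally, since both \(\rvz\) and the censoring thresholds depend on \(\mu\), the same computation can be wrapped as a function of \(\mu\) and profiled. A minimal sketch, subsampling the data because the pointwise quadrature above is slow; the helper name `loglik` and the subsample size `m` are ours:
"""
Sketch: the loglikelihood as a function of mu
"""
def loglik(mu_, m=2000):
    # Rebuild all three distributions at the candidate mu_; the censoring
    # thresholds depend on mu_ as well.
    Z_ = scipy.stats.norm(loc=mu_, scale=1.0)
    Yl_ = scipy.stats.norm(loc=0.5 * mu_, scale=0.25)
    Yr_ = scipy.stats.norm(loc=2.0 * mu_, scale=1.0)
    I1_ = np.vectorize(lambda u: quad(lambda v: Z_.pdf(v) * Yr_.sf(v), -np.inf, u)[0])
    I2_ = np.vectorize(lambda u: quad(lambda v: Z_.pdf(v) * Yl_.cdf(v), u, np.inf)[0])
    # Subsample the first m observations for tractability.
    s = rdf.iloc[:m]
    i0, i1, i2, i3 = d0_idx[:m], d1_idx[:m], d2_idx[:m], d3_idx[:m]
    ll0_ = Z_.logpdf(s.loc[i0, 'Z']) + Yl_.logcdf(s.loc[i0, 'Z']) + Yr_.logsf(s.loc[i0, 'Z'])
    ll1_ = Yl_.logpdf(s.loc[i1, 'Yl']) + np.log(I1_(s.loc[i1, 'Yl']))
    ll2_ = Yr_.logpdf(s.loc[i2, 'Yr']) + np.log(I2_(s.loc[i2, 'Yr']))
    ll3_ = (Yl_.logpdf(s.loc[i3, 'Yl']) + Yr_.logpdf(s.loc[i3, 'Yr'])
            + np.log(Z_.cdf(s.loc[i3, 'Yl']) - Z_.cdf(s.loc[i3, 'Yr'])))
    return ll0_.sum() + ll1_.sum() + ll2_.sum() + ll3_.sum()

# Evaluated on a coarse grid, the curve should peak near the true mu = 1.
[loglik(mu_) for mu_ in (0.8, 0.9, 1.0, 1.1, 1.2)]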