$$ \newcommand{\rvz}{ \mathbf{Z} } \newcommand{\rvy}{ \mathbf{Y} } \newcommand{\rvd}{ \mathbf{D} } \newcommand{\rvl}{ \mathbf{Y_L} } \newcommand{\rvr}{ \mathbf{Y_R} } \DeclareMathOperator{\PrOp}{Pr} \newcommand{\prob}[1] { \PrOp \left[ #1 \right] } $$

Abstract

Censored data is an artifact of partial or incomplete measurements.

A typical scenario is survival analysis of time-to-event data: a study may end before the final measurement is available (right censoring). Another arises when batch processing log file data: the reported timestamp may reflect the time of processing rather than the true event time (left censoring).

This post derives the density equations for censored data. Given a parameterization $\theta$, this leads naturally to a log likelihood formulation. As the censoring mechanism is, in general, random, we further allow for the possibility that it, too, depends on $\theta$.

Doubly Censored Data

Notation

Let $\rvz$ be the uncensored random variable; $\rvl$ and $\rvr$, the left and right censoring thresholds, respectively; and $\rvd$, an observed random variable denoting whether $\rvz$ was uncensored (0), censored only on the left (1), censored only on the right (2), or censored on both the left and right (3).

Let $\rvy$ be the observed value associated with each value of $\rvd$. We define this accordingly:

$$ \rvy = \begin{cases} \rvz & \rvd = 0 \\ \rvl & \rvd = 1 \\ \rvr & \rvd = 2 \\ (\rvl, \rvr) & \rvd = 3 \end{cases} $$

The last case addresses the scenario where a non-negative random variable has limited precision: for example, when measurements are only accurate above some threshold, it may be more sensible to proceed as though we know only that these values fall below the threshold.
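
To make the observation mechanism concrete, here is a minimal sketch; the helper `observe` is hypothetical and only illustrates the case definition above:


def observe(z, yl, yr):
    # Map a latent draw (z, yl, yr) to the observed pair (d, y),
    # following the case definition of Y above.
    if yl < z < yr:
        return 0, z          # uncensored
    elif z <= yl and z < yr:
        return 1, yl         # censored only on the left
    elif yl < z and yr <= z:
        return 2, yr         # censored only on the right
    else:
        return 3, (yl, yr)   # censored on both sides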

Density Calculations

We assume that $\rvl, \rvr, \rvz$ are independent with distributions $G_L, G_R, F$ and densities $g_L, g_R, f$, respectively. Conditioning on the relevant variables and using independence then gives the following.

Uncensored, $\rvd = 0$:

$$ \prob{ \rvl < \rvz < \rvr } = \int f(u) G_L(u) (1 - G_R(u)) \,du $$

Only left censored, $\rvd = 1$:

$$ \prob { \left(\rvz \leq \rvl\right) \cap \left(\rvz < \rvr \right)} = \int g_L(u) \underbrace{ \int_{-\infty}^u f(v) (1-G_R(v)) \, dv }_{ I_1(u) } \, du $$

Only right censored, $\rvd = 2$:

$$ \prob { \left(\rvl < \rvz \right) \cap \left(\rvr \leq \rvz \right)} = \int g_R(u) \underbrace{ \int_u^{\infty} f(v) G_L(v) \, dv }_{ I_2(u) } \, du $$

Left and right censored, $\rvd = 3$:

$$ \prob { \left(\rvz \leq \rvl \right) \cap \left(\rvr \leq \rvz \right)} = \iint 1_{\left\{ v < u\right\}}(u,v) g_L(u) g_R(v) (F(u) - F(v)) \,du \,dv $$

Note that the events above partition the underlying probability space, so the four probabilities sum to one; we verify this numerically below.

A Simple Example

We consider the following model, taking $\mu = 1$ and assuming known variances:

$$ \begin{gather*} \rvz \sim N\left(\mu, 1.0^2 \right), \\ \rvl \sim N\left(\frac{\mu}{2}, 0.25^2\right), \\ \rvr \sim N\left(2 \mu, 1.0^2\right). \end{gather*} $$

We implement this in Python as follows:


"""

    Define the random variables

"""


import numpy as np
import pandas as pd
import scipy.stats

mu      = 1.0

# Uncensored variable and the left/right censoring thresholds
Z       = scipy.stats.norm(loc=mu,     scale=1.0)
Yl      = scipy.stats.norm(loc=0.5*mu, scale=0.25)
Yr      = scipy.stats.norm(loc=2.0*mu, scale=1.0)

# Shorthands matching the notation above
(F,f)   = (Z.cdf,  Z.pdf)
(Gl,gl) = (Yl.cdf, Yl.pdf)
(Gr,gr) = (Yr.cdf, Yr.pdf)

Simulated Maximum Likelihood Estimates

We treat the censoring indicator $\rvd$ as a multinomial random variable and compute the standard ML estimators of its cell probabilities below.
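
With $n_k$ denoting the number of observations with $\rvd = k$, these are simply the empirical frequencies:

$$ \hat{p}_k = \frac{n_k}{n}, \qquad k = 0, 1, 2, 3. $$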


"""

     Simulated ML estimates

"""

rng     = np.random.default_rng(seed=42)
n       = int(1e6)

rZ      = Z.rvs(size=n,  random_state = rng)
rYl     = Yl.rvs(size=n, random_state = rng)
rYr     = Yr.rvs(size=n, random_state = rng)

# Classify each draw; non-strict inequalities on the censored sides keep
# the four events an exact partition (ties occur with probability zero).
d0_idx = ((rYl < rZ)  & (rZ < rYr))
d1_idx = ((rYl >= rZ) & (rZ < rYr))
d2_idx = ((rYl < rZ)  & (rZ >= rYr))
d3_idx = ((rZ <= rYl) & (rZ >= rYr))

p0 = d0_idx.sum() / n
p1 = d1_idx.sum() / n
p2 = d2_idx.sum() / n
p3 = d3_idx.sum() / n
ps = [p0,p1,p2,p3]

Direct Numerical Integration

Though slower, we may also evaluate the above probabilities directly via numerical integration, as follows.


"""

    Direct numerical integration

"""

from scipy.integrate import quad, dblquad

# d = 0: a single integral over u
fn0 = lambda u: Gl(u) * f(u) * (1 - Gr(u))
q0, _ = quad(fn0, -np.inf, np.inf)

# d = 1: outer u weighted by gl; inner v runs over (-inf, u)
fn1 = lambda v, u: gl(u) * f(v) * (1 - Gr(v))
q1, _ = dblquad(fn1, -np.inf, np.inf, gfun = lambda u: -np.inf, hfun = lambda u: u)

# d = 2: outer u weighted by gr; inner v runs over (u, inf)
fn2 = lambda v, u: gr(u) * f(v) * Gl(v)
q2, _ = dblquad(fn2, -np.inf, np.inf, gfun = lambda u: u, hfun = lambda u: np.inf)

# d = 3: outer u weighted by gl; inner v, weighted by gr, runs over (-inf, u)
fn3 = lambda v, u: gl(u) * gr(v) * (F(u) - F(v))
q3, _ = dblquad(fn3, -np.inf, np.inf, gfun = lambda u: -np.inf, hfun = lambda u: u)

qs = [q0, q1, q2, q3]
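
Since the four events partition the sample space, the integrals should sum to one; a quick sanity check:

# The four events are disjoint and exhaustive, so the probabilities
# must sum to one up to quadrature error.
assert np.isclose(sum(qs), 1.0)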


"""

    Comparing simulated estimates and direct calculations

"""

df = pd.DataFrame({
    'MLE'    : ps,
    'Direct' : qs
})
df.index.name = 'D'
df

        MLE    Direct
D
0  0.456583  0.456117
1  0.304057  0.304133
2  0.229668  0.230070
3  0.009692  0.009680

The two columns agree to within Monte Carlo error, which here is of order $\sqrt{p(1-p)/n} \approx 5 \times 10^{-4}$.

Log Likelihood

The log likelihood can be read off directly from the density equations above. The only complication is evaluating the inner integrals, $I_1(u)$ and $I_2(u)$, efficiently.
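
Writing the contributions out explicitly, with $(u_i, v_i)$ the observed pair when $\rvd = 3$:

$$ \begin{aligned} \ell(\theta) = & \sum_{i : d_i = 0} \left[ \log f(y_i) + \log G_L(y_i) + \log \left( 1 - G_R(y_i) \right) \right] \\ & + \sum_{i : d_i = 1} \left[ \log g_L(y_i) + \log I_1(y_i) \right] \\ & + \sum_{i : d_i = 2} \left[ \log g_R(y_i) + \log I_2(y_i) \right] \\ & + \sum_{i : d_i = 3} \left[ \log g_L(u_i) + \log g_R(v_i) + \log \left( F(u_i) - F(v_i) \right) \right]. \end{aligned} $$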


import collections
Map = collections.namedtuple('Map', ['t', 'v', 'dvdt'])

# Map the real line onto (-1, 1); dvdt is the Jacobian dv/dt written as a
# function of v, since dv/dt = pi/2 * sec^2(pi t / 2) = pi/2 * (1 + v^2).
tmap = Map(
    t    = lambda v: 2/np.pi * np.arctan(v),    # t in (-1, 1)
    v    = lambda t: np.tan(np.pi/2 * t),       # v in (-inf, inf)
    dvdt = lambda v: np.pi/2 * (v**2 + 1)
)


yp1 = lambda v: f(v) * (1 - Gr(v))   # integrand of I1
yp2 = lambda v: f(v) * Gl(v)         # integrand of I2
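
The substitution $v = \tan\left( \frac{\pi}{2} t \right)$ maps the real line onto $(-1, 1)$ and turns each inner integral into an initial value problem on a bounded interval. For $I_1$, for example,

$$ I_1(u) = \int_{-\infty}^u f(v) (1 - G_R(v)) \, dv = \int_{-1}^{t(u)} f(v(t)) \left( 1 - G_R(v(t)) \right) \frac{dv}{dt} \, dt, \qquad \frac{dv}{dt} = \frac{\pi}{2} \left( 1 + v^2 \right). $$

Solving this ODE once with dense output then lets us evaluate $I_1$ at every observed point in a single pass, instead of invoking a quadrature routine once per observation.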

"""

    Runge-Kutta 45 with dense output

"""

from scipy.integrate import RK45

def I1(u, nb=129):
    # y' = yp1(v(t)) * dv/dt, so y(t) = I1(v(t)); integrate left to right
    # and use dense output to evaluate I1 at every requested point.
    fun = lambda t, y: yp1(tmap.v(t)) * tmap.dvdt( tmap.v(t) )
    y0 = np.array([0.0])
    t_bound = 1.0
    max_step = 2/(nb-1)

    ta = np.asarray(tmap.t(u))
    ia = np.zeros_like(ta)
    sorter = np.argsort(ta)

    t0 = -1.0
    rk45 = RK45(fun, t0, y0, t_bound, max_step = max_step, vectorized=True)
    while True:
        try:
            rk45.step()
        except RuntimeError:
            # step() raises once the solver has already reached t_bound
            if rk45.status == 'finished':
                break
            else:
                raise

        # Evaluate the dense output at the target points inside this step
        idx = sorter[range(*np.searchsorted(ta,[rk45.t_old,rk45.t],sorter=sorter))]
        ia[idx] = rk45.dense_output()(ta[idx])

    return ia


def I2(u, nb=129):
    # Same device as I1, but integrating right to left so that
    # y(t) = I2(v(t)) accumulates the upper tail.
    fun = lambda t, y: -yp2(tmap.v(t)) * tmap.dvdt( tmap.v(t) )
    y0 = np.array([0.0])
    t_bound = -1.0
    max_step = 2/(nb-1)

    ta = np.asarray(tmap.t(u))
    ia = np.zeros_like(ta)
    sorter = np.argsort(ta)

    t0 = 1.0
    rk45 = RK45(fun, t0, y0, t_bound, max_step = max_step, vectorized=True)
    while True:
        try:
            rk45.step()
        except RuntimeError:
            if rk45.status == 'finished':
                break
            else:
                raise

        # The solver moves right to left, so the step interval is [t, t_old]
        idx = sorter[range(*np.searchsorted(ta,[rk45.t,rk45.t_old],sorter=sorter))]
        ia[idx] = rk45.dense_output()(ta[idx])

    return ia
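
As a quick spot check (the test point $u = 0.5$ is an arbitrary choice), both routines can be compared against direct quadrature:

# Hypothetical sanity check: I1 and I2 should match direct quadrature.
u_test = np.array([0.5])
print(I1(u_test)[0], quad(yp1, -np.inf, 0.5)[0])   # expect close agreement
print(I2(u_test)[0], quad(yp2, 0.5, np.inf)[0])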


"""

    Log likelihood calculation

"""

rdf = pd.DataFrame({
    'Yl' : rYl,
    'Z'  : rZ,
    'Yr' : rYr
})

# d = 0: log f(y) + log G_L(y) + log(1 - G_R(y))
u   = rdf.loc[d0_idx,'Z']
ll0 = Z.logpdf(u) + Yl.logcdf(u) + Yr.logsf(u)    # logsf(u) = log(1 - cdf(u))

# d = 1: log g_L(y) + log I_1(y)
u   = rdf.loc[d1_idx,'Yl']
ll1 = Yl.logpdf(u) + np.log(I1(u))

# d = 2: log g_R(y) + log I_2(y)
u   = rdf.loc[d2_idx,'Yr']
ll2 = Yr.logpdf(u) + np.log(I2(u))

# d = 3: log g_L(u) + log g_R(v) + log(F(u) - F(v))
u   = rdf.loc[d3_idx,'Yl']
v   = rdf.loc[d3_idx,'Yr']
ll3 = Yl.logpdf(u) + Yr.logpdf(v) + np.log( Z.cdf(u) - Z.cdf(v) )

ll = ll0.sum() + ll1.sum() + ll2.sum() + ll3.sum()
ll

np.float64(-1747157.5396823615)



Doubly and Randomly Censored Data