
### Dustin Lennon

##### Applied Scientist | dlennon.org

(206) 291-8893

Tags: censored data, likelihood, distribution, derivation, event data

# Doubly and Randomly Censored Data

A derivation of the density functions and likelihood expression associated with doubly and randomly censored data.

Dustin Lennon
February 2021
https://dlennon.org/20210201_censor

#### Abstract

Censored data is an artifact of partial or incomplete measurements.

A typical scenario is a survival analysis of time-to-event data. For example, a study may end before a final measurement is available (right censoring). Another situation arises when batch processing log file data: the reported timestamp may reflect the time of processing rather than the true event time (left censoring).

This post derives the density equations for censored data. Given a parameterization $\theta$, this leads naturally to a log likelihood formulation. As the censoring mechanism is, in general, random, we further allow for the possibility that this too depends on $\theta$.


#### Doubly Censored Data

##### Notation

Let $\rvz$ be the uncensored random variable; $\rvl$ and $\rvr$, the left and right censoring thresholds, respectively; and $\rvd$, an observed random variable denoting whether $\rvz$ was uncensored (0), censored only on the left (1), censored only on the right (2), or censored on both left and right (3).

Let $\rvy$ be the observed value associated with each value of $\rvd$, defined as follows:

$\rvy = \begin{cases} \rvz & \rvd = 0 \\ \rvl & \rvd = 1 \\ \rvr & \rvd = 2 \\ (\rvl, \rvr) & \rvd = 3 \end{cases}$

The last case addresses the scenario where a non-negative random variable has limited precision: for example, when measurements are only accurate above some threshold, it may be more sensible to proceed as though we know only that these values fall below the threshold.
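As a concrete illustration, the map from a latent draw $(\rvz, \rvl, \rvr)$ to the observed pair $(\rvy, \rvd)$ can be coded directly from the cases above (a minimal sketch; the function name `censor` is used only for illustration):

"""
Mapping a latent draw (z, l, r) to the observed pair (y, d)
"""

def censor(z, l, r):
    if l < z < r:               # uncensored
        return z, 0
    if z <= l and z < r:        # censored only on the left
        return l, 1
    if l < z and r <= z:        # censored only on the right
        return r, 2
    return (l, r), 3            # censored on both left and right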

##### Density Calculations

We assume that $\rvl, \rvr, \rvz$ are independent with distributions $G_L, G_R, F$, respectively. Then,

###### Uncensored, $\rvd = 0$:

$\prob{ \rvl < \rvz < \rvr } = \int f(u) G_L(u) (1 - G_R(u)) \,du$
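To see this, condition on $\rvz = u$ and use the independence of $\rvl$ and $\rvr$:

$\prob{ \rvl < \rvz < \rvr } = \int \prob{ \left(\rvl < u\right) \cap \left(u < \rvr\right) } f(u) \, du = \int f(u) \, G_L(u) \, (1 - G_R(u)) \, du$

The remaining cases follow the same argument, conditioning instead on the censoring threshold(s).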

###### Only left censored, $\rvd = 1$:

$\prob { \left(\rvz \leq \rvl\right) \cap \left(\rvz < \rvr \right)} = \int g_L(u) \underbrace{ \int_{-\infty}^u f(v) (1-G_R(v)) \, dv }_{ I_1(u) } \, du$

###### Only right censored, $\rvd = 2$:

$\prob { \left(\rvl < \rvz \right) \cap \left(\rvr \leq \rvz \right)} = \int g_R(u) \underbrace{ \int_u^{\infty} f(v) G_L(v) \, dv }_{ I_2(u) } \, du$

###### Left and right censored, $\rvd = 3$:

$\prob { \left(\rvz \leq \rvl \right) \cap \left(\rvr \leq \rvz \right)} = \iint 1_{\left\{ v < u\right\}}(u,v) g_L(u) g_R(v) (F(u) - F(v)) \,du \,dv$

Note that the events above are a partition of the underlying probability space.


#### A Simple Example

We consider the following model, taking $\mu = 1$ and assuming known variances:

$\begin{gather*} \rvz \sim N\left(\mu, 1.0^2 \right), \\ \rvl \sim N\left(\frac{\mu}{2}, 0.25^2\right), \\ \rvr \sim N\left(2 \mu, 1.0^2\right). \end{gather*}$

We implement this in Python as follows:

"""

Define the random variables

"""

import numpy as np
import pandas as pd
import scipy.stats

mu      = 1.0

Z       = scipy.stats.norm(loc=mu,     scale=1.0)
Yl      = scipy.stats.norm(loc=0.5*mu, scale=0.25)
Yr      = scipy.stats.norm(loc=2.0*mu, scale=1.0)

(F,f)   = (Z.cdf, Z.pdf)
(Gl,gl) = (Yl.cdf, Yl.pdf)
(Gr,gr) = (Yr.cdf, Yr.pdf)

##### Simulated Maximum Likelihood Estimates

We treat the events above as a multinomial random variable and compute the standard ML estimators below.
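For a multinomial sample, the ML estimate of each cell probability is the observed cell proportion, $\hat{p}_k = n_k / n$; this is what the code computes.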

"""

Simulated ML estimates

"""

rng     = np.random.default_rng(seed=42)
n       = int(1e6)

rZ      = Z.rvs(size=n,  random_state = rng)
rYl     = Yl.rvs(size=n, random_state = rng)
rYr     = Yr.rvs(size=n, random_state = rng)

d0_idx = ((rYl < rZ)  & (rZ < rYr))
d1_idx = ((rYl >= rZ) & (rZ < rYr))
d2_idx = ((rYl < rZ)  & (rZ >= rYr))
d3_idx = ((rYl >= rZ) & (rZ >= rYr))

p0 = d0_idx.sum() / n
p1 = d1_idx.sum() / n
p2 = d2_idx.sum() / n
p3 = d3_idx.sum() / n
ps = [p0,p1,p2,p3]
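
As a quick sanity check (using the index arrays just defined), the four events should classify every draw exactly once, so the estimated probabilities sum to one:

"""
Sanity check: the four events partition the sample
"""

counts = d0_idx.astype(int) + d1_idx + d2_idx + d3_idx
assert (counts == 1).all()         # every draw falls in exactly one cell
assert np.isclose(sum(ps), 1.0)    # multinomial proportions sum to one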

##### Direct Numerical Integration

While slow, we may also evaluate the above probabilities directly via numerical integration. These are computed as follows.

"""

Direct numerical integration

"""

fn0 = lambda u: Gl(u) * f(u) * (1-Gr(u))
q0, _ = quad(fn0, -np.inf, np.inf)

fn1 = lambda v,u: gl(u) * f(v) * (1-Gr(v))
q1, _ = dblquad(fn1, -np.inf, np.inf, gfun = lambda v: -np.inf, hfun = lambda u: u)

fn2 = lambda v,u: gr(u) * f(v) * Gl(v)
q2, _ = dblquad(fn2, -np.inf, np.inf, gfun = lambda u: u, hfun = lambda u: np.inf)

fn3 = lambda v,u: gl(u) * gr(v) * (F(u) -F(v))
q3, _ = dblquad(fn3, -np.inf, np.inf, gfun = lambda u: -np.inf, hfun = lambda u: u)

qs = [q0,q1,q2,q3]

"""

Comparing simulated estimates and direct calculations

"""

df = pd.DataFrame({
'MLE' : ps,
'Direct' : qs
})
df.index.name ='D'
df

| D | MLE      | Direct   |
|---|----------|----------|
| 0 | 0.456583 | 0.456117 |
| 1 | 0.304057 | 0.304133 |
| 2 | 0.229668 | 0.230070 |
| 3 | 0.009692 | 0.009680 |

##### Loglikelihood

The log likelihood can be read off directly from the density equations above. The only complication is efficiently evaluating the inner integrals, $I_1(u)$ and $I_2(u)$.
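
One way to do this (a sketch; the grid bounds and resolution below are arbitrary choices that comfortably cover the simulated thresholds) is to tabulate each inner integral on a grid with `scipy.integrate.quad` and interpolate with `scipy.interpolate.interp1d`:

"""
Inner integrals I1 and I2 (one possible implementation)
"""

from scipy.integrate import quad
from scipy.interpolate import interp1d

# Tabulate each inner integral on a grid, then interpolate between grid points.
grid    = np.linspace(-6.0, 8.0, 2001)
i1_vals = np.array([quad(lambda v: f(v) * (1 - Gr(v)), -np.inf, g)[0] for g in grid])
i2_vals = np.array([quad(lambda v: f(v) * Gl(v), g, np.inf)[0] for g in grid])

I1 = interp1d(grid, i1_vals, bounds_error=False, fill_value=(i1_vals[0], i1_vals[-1]))
I2 = interp1d(grid, i2_vals, bounds_error=False, fill_value=(i2_vals[0], i2_vals[-1]))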

"""

Loglikelihood calculation

"""

rdf = pd.DataFrame({
'Yl' : rYl,
'Z' : rZ,
'Yr' : rYr
}
)

u   = rdf.loc[d0_idx,'Z']
ll0 = Z.logpdf(u) + Yl.logcdf(u) + np.log( 1 - Yr.cdf(u) )

u   = rdf.loc[d1_idx,'Yl']
ll1 = Yl.logpdf(u) + np.log(I1(u))

u   = rdf.loc[d2_idx,'Yr']
ll2 = Yr.logpdf(u) + np.log(I2(u))

u   = rdf.loc[d3_idx,'Yl']
v   = rdf.loc[d3_idx,'Yr']
ll3 = Yl.logpdf(u) + Yr.logpdf(v) + np.log( Z.cdf(u) - Z.cdf(v) )

ll = ll0.sum() + ll1.sum() + ll2.sum() + ll3.sum()
ll

-1747157.5396823618
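
Finally, viewing the same expression as a function of $\mu$ yields a maximum likelihood estimator. A minimal sketch of that step, reusing the grid-and-interpolate idea above inside the objective (illustrative only and slow as written; the optimization bounds are arbitrary choices):

"""
Maximum likelihood over mu (sketch)
"""

from scipy.integrate import quad
from scipy.interpolate import interp1d
from scipy.optimize import minimize_scalar

def negloglik(mu):
    # All three distributions depend on mu, as in the model above.
    Z_  = scipy.stats.norm(loc=mu,     scale=1.0)
    Yl_ = scipy.stats.norm(loc=0.5*mu, scale=0.25)
    Yr_ = scipy.stats.norm(loc=2.0*mu, scale=1.0)

    # Rebuild the inner-integral interpolants for this value of mu.
    grid = np.linspace(-8.0, 10.0, 1001)
    i1 = np.array([quad(lambda v: Z_.pdf(v) * Yr_.sf(v), -np.inf, g)[0] for g in grid])
    i2 = np.array([quad(lambda v: Z_.pdf(v) * Yl_.cdf(v), g, np.inf)[0] for g in grid])
    I1_ = interp1d(grid, i1, bounds_error=False, fill_value=(i1[0], i1[-1]))
    I2_ = interp1d(grid, i2, bounds_error=False, fill_value=(i2[0], i2[-1]))

    u  = rdf.loc[d0_idx, 'Z']
    ll = (Z_.logpdf(u) + Yl_.logcdf(u) + Yr_.logsf(u)).sum()

    u  = rdf.loc[d1_idx, 'Yl']
    ll += (Yl_.logpdf(u) + np.log(I1_(u))).sum()

    u  = rdf.loc[d2_idx, 'Yr']
    ll += (Yr_.logpdf(u) + np.log(I2_(u))).sum()

    u  = rdf.loc[d3_idx, 'Yl']
    v  = rdf.loc[d3_idx, 'Yr']
    ll += (Yl_.logpdf(u) + Yr_.logpdf(v) + np.log(Z_.cdf(u) - Z_.cdf(v))).sum()

    return -ll

res = minimize_scalar(negloglik, bounds=(0.5, 1.5), method='bounded')
res.x   # should land near the true value mu = 1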