---
title: "Cash Flow Considerations for Machine Learning"
date: 2020-12-17
tags: [cash flow, loans, survival analysis, logrank statistic, machine learning, lending club]
short: >
  An analysis of early LendingClub data, focused on default rates for various loan tranches. Note that this is packaged as a reveal.js slideshow.
toc_type: slides
---

$$ \text{global} \, \LaTeX \newcommand{\matr}[1]{\mathbf{#1}} \newcommand{\tran}{^{\mkern-1.5mu\mathsf{T}}} \newcommand{\cond}[2] { \left. #1 \, \right\vert #2 } \DeclareMathOperator{\ProbOp}{P} \newcommand{\Pr}[1] { \ProbOp #1 } \newcommand{\Prp}[1] { \ProbOp \left( #1 \right) } \newcommand{\Prs}[1] { \ProbOp \left[ #1 \right] } \newcommand{\Prb}[1] { \ProbOp \left\{ #1 \right\} } \newcommand{\cPr}[2] { \ProbOp \cond{#1}{#2} } \newcommand{\cPrp}[2] { \ProbOp \left( \cond{#1}{#2} \right) } \newcommand{\cPrs}[2] { \ProbOp \left[ \cond{#1}{#2} \right] } \newcommand{\cPrb}[2] { \ProbOp \left\{ \cond{#1}{#2} \right\} } \DeclareMathOperator{\VarOp}{Var} \newcommand{\Var}[1] { \VarOp #1 } \newcommand{\Varp}[1] { \VarOp \left( #1 \right) } \newcommand{\Vars}[1] { \VarOp \left[ #1 \right] } \newcommand{\Varb}[1] { \VarOp \left\{ #1 \right\} } \newcommand{\cVar}[2] { \VarOp \cond{#1}{#2} } \newcommand{\cVarp}[2] { \VarOp \left( \cond{#1}{#2} \right) } \newcommand{\cVars}[2] { \VarOp \left[ \cond{#1}{#2} \right] } \newcommand{\cVarb}[2] { \VarOp \left\{ \cond{#1}{#2} \right\} } \DeclareMathOperator{\EvOp}{E} \newcommand{\Ev}[1] { \EvOp #1 } \newcommand{\Evp}[1] { \EvOp \left( #1 \right) } \newcommand{\Evs}[1] { \EvOp \left[ #1 \right] } \newcommand{\Evb}[1] { \EvOp \left\{ #1 \right\} } \newcommand{\cEv}[2] { \EvOp \cond{#1}{#2} } \newcommand{\cEvp}[2] { \EvOp \left( \cond{#1}{#2} \right) } \newcommand{\cEvs}[2] { \EvOp \left[ \cond{#1}{#2} \right] } \newcommand{\cEvb}[2] { \EvOp \left\{ \cond{#1}{#2} \right\} } \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\argmax}{arg\,max} $$

$$
\text{local} \, \LaTeX
\newcommand{\utif}{ \left( 1 + \frac{r}{12} \right) }
\newcommand{\rnutif}{ \left( 1 + \frac{r^*}{12} \right) }
\DeclareMathOperator{\ProbOp}{Pr}
\DeclareMathOperator{\EvOp}{E}
\DeclareMathOperator{\VarOp}{Var}
% probability operator with parentheses, "prp"
\newcommand{\prp}[1]{ \ProbOp \left(#1\right) }
% probability operator with square brackets, "prc"
\newcommand{\prc}[1]{ \ProbOp \left[#1\right] }
% probability operator with curly braces, "prk"
\newcommand{\prk}[1]{ \ProbOp \left\{#1\right\} }
\newcommand{\Ev}[1]{ \EvOp \left[ {#1} \right] }
\newcommand{\Evk}[1]{ \EvOp \left\{ {#1} \right\} }
\newcommand{\Var}[1]{ \VarOp \left[ {#1} \right] }
$$

Cash Flow Analytics¶

by Dustin Lennon¶

December 2020¶

Agenda¶

  • Loans 101
  • Survival Analysis
  • Metrics
  • Lending Club
  • Machine Learning
  • Summary

Loans 101¶

From the lender's perspective:

  • $P_0$, initial investment (dollars out)
  • $c$, sequence of payments (dollars in)
  • $r$, interest rate (annualized)
  • $T$, term of the loan (e.g., 36 months)
  • arbitrage: these two cash flows must be equivalent

However, this isn't immediate:

$$ \begin{align*} P_0 \utif^ T \ne c T \end{align*} $$

Loans 101: Amortization¶

The resolution is straightforward:

$P_0$ is lent at time 0.

Let $P_k$ denote the balance owed after payment $k$. At time 1, $\utif P_0$ is owed and $c$ is paid, so $P_1 = \utif P_0 - c$.

Recursing,

$$ \begin{align*} P_k & = \utif P_{k-1} - c \\ & = \utif \left[ \utif P_{k-2} -c \right] - c \\ & = \utif^k P_0 - c \sum_{i=0}^{k-1} \utif^i \end{align*} $$

At time $T$, $P_T = 0$: the loan is paid off.
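
The recursion is easy to check numerically. The sketch below is a minimal illustration rather than code from the original analysis: the function name `balances` is an assumption, and the loan parameters are borrowed from the "prepayment" example in the Lending Club section (a \$20,000, 36-month loan at 11.78% with a \$662.19 installment).

```python
def balances(p0, annual_rate, installment, term):
    """Outstanding balance P_k after each monthly payment, k = 0..term."""
    growth = 1.0 + annual_rate / 12.0  # one month of interest, (1 + r/12)
    out = [p0]
    for _ in range(term):
        # P_k = (1 + r/12) * P_{k-1} - c
        out.append(growth * out[-1] - installment)
    return out


p0, r, c, term = 20000.0, 0.1178, 662.19, 36
sched = balances(p0, r, c, term)

# Closed form of the same recursion:
# (1 + r/12)^T P_0 - c * sum_{i<T} (1 + r/12)^i
growth = 1.0 + r / 12.0
closed = growth**term * p0 - c * sum(growth**i for i in range(term))

print(f"P_T by recursion:   {sched[-1]:.2f}")
print(f"P_T by closed form: {closed:.2f}")
print(f"naive comparison:   {growth**term * p0:.2f} vs {c * term:.2f}")
```

The final balance is zero only up to the rounding of the installment to whole cents, and the last line shows why comparing $\utif^T P_0$ against $cT$ directly does not work.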

Loans 101: An arbitrage condition¶

$P_T = 0$ implies:

$$ \begin{equation*} \utif^T P_0 = c \sum_{i=0}^{T-1} \utif^i \end{equation*} $$

Note that the right-hand side is the time-$T$ value of the payments only if each payment is reinvested at rate $r$.
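
Summing the geometric series gives the installment in closed form. This is a standard identity, included here as a worked step rather than taken from the original slides:

$$ \begin{align*} \sum_{i=0}^{T-1} \utif^i & = \frac{\utif^T - 1}{r/12} \\ \Rightarrow \quad c & = P_0 \, \frac{r}{12} \, \frac{\utif^T}{\utif^T - 1} \end{align*} $$

For a \$20,000, 36-month loan at 11.78%, as in one of the Lending Club examples, this gives $c \approx 662.19$, matching the recorded installment.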

Implications¶

  • Positions may need to be unwound.
  • Any risk persists with each reinvestment.
  • Survival analysis "makes sense."
  • Generalizations exist for arbitrary time steps, arbitrary payments.
  • Internal Rate of Return

Survival Analysis¶

Survival Analysis: Kaplan Meier¶

The Kaplan-Meier survival curve is a product over the discrete hazards, $h_i$, at the failure times, $t_i$:

$$ \begin{align*} \mathcal{F}(y) & = \prod_{i:t_i < y} \left( 1 - h_i \right) \end{align*} $$

Survival Analysis: Notation¶

$$ \begin{align*} \widehat{\mathcal{F}}(y) & = \prod_{i:t_i < y} \left( 1 - \hat{h}_i \right) \\ & = \prod_{i:t_i < y} \left( 1 - \frac{a_i}{r_i} \right) \end{align*} $$

  • $r_i$ is the count of units not yet failed or censored at $t_i$, the size of the "risk set"
  • $a_i$ is the count of units that fail in $[t_i, t_{i+1})$
  • discrete accounting at fixed times; e.g., end of month
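
As a concrete illustration of the product-limit form, here is a minimal sketch that computes the estimate from per-period counts. The function name and the toy counts are assumptions made for this example; they are not data from the analysis.

```python
def kaplan_meier(at_risk, failures):
    """Product-limit estimate from per-period counts.

    at_risk[k]  -- r_k, units not yet failed or censored at t_k
    failures[k] -- a_k, units that fail in [t_k, t_{k+1})
    Entry k of the result estimates Pr(Y >= t_k); one trailing entry
    covers the period after the last t_k.
    """
    curve = [1.0]  # empty product: every unit is alive at the first time
    for r_k, a_k in zip(at_risk, failures):
        hazard = a_k / r_k if r_k > 0 else 0.0  # estimated discrete hazard
        curve.append(curve[-1] * (1.0 - hazard))
    return curve


# Toy counts: 10 loans, 2 fail in the first month; 1 of the remaining 8
# fails in the second; 2 more are censored before the third month.
r = [10, 8, 5]
a = [2, 1, 0]
curve = kaplan_meier(r, a)
print([round(s, 3) for s in curve])
```

Censored units never enter a numerator; they only shrink the later risk sets, which is how the estimator uses partial histories.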

Survival Analysis: Definitions¶

$$ \begin{align*} \mathcal{F}(y) & = \prp{Y \geq y} & \mbox{survival probability} \\ h(y) & = \underset{\Delta \to 0}{\lim} \frac{1}{\Delta} \prp{ y\leq Y < y + \Delta \vert Y \geq y} & \mbox{continuous hazard} \\ & = f(y) / \mathcal{F}(y) \\ h_k & = \prp{ Y = t_k \vert Y \geq t_k } & \mbox{discrete hazard} \\ H(y) & = \int_0^y h(u) \, du & \mbox{integrated hazard function} \end{align*} $$

Parametric modeling is often via the hazard function.

Survival Analysis: Likelihood¶

Add in censoring:

$d_i = 1$ indicates that an observation, $y_i$, is an uncensored failure time; $d_i = 0$, a censored one. Let $\mathcal{U} = \{ i : d_i = 1 \}$ and $\mathcal{C} = \{ i : d_i = 0 \}$.

$$ \begin{align*} l(\theta) & \equiv \sum_{i \in \mathcal{U}} \log f(y_i; \theta) + \sum_{i \in \mathcal{C}} \log \mathcal{F}(y_i; \theta) \end{align*} $$

When failure is possible only at discrete times,

  • $f(y_i)$ and $\mathcal{F}(y_i)$ have telescoping expansions in terms of $h_k$ and $(1-h_k)$; and
  • apply a "ragged pivot": aggregate terms over time instead of over unit

$$ l(\theta) = \sum_{k \in \mathcal{T}} \left\{ a_k \log h_k + (r_k - a_k) \log (1-h_k) \right\} $$
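
The step from the per-unit sum to the per-time sum can be spelled out. The following is a sketch under the conventions above, with the added assumption that a unit censored at $t_k$ has already left the risk set counted by $r_k$:

$$ \begin{align*} f(t_k) & = h_k \prod_{j < k} \left( 1 - h_j \right) \\ \mathcal{F}(t_k) & = \prod_{j < k} \left( 1 - h_j \right) \end{align*} $$

A unit failing at $t_k$ therefore contributes $\log h_k + \sum_{j<k} \log(1 - h_j)$, and a unit censored at $t_k$ contributes $\sum_{j<k} \log(1 - h_j)$. Collecting terms by time index, $\log h_k$ appears once per failure at $t_k$, or $a_k$ times, and $\log(1 - h_k)$ appears once for every unit at risk at $t_k$ that does not fail there, or $r_k - a_k$ times.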

Survival Analysis: Modeling¶

Non-parametric $h_k$¶

Set $\frac{\partial l(\theta)}{\partial h_k} = 0$ to recover Kaplan-Meier estimator.
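
Written out, with each $h_k$ treated as a free parameter in $l = \sum_k \left\{ a_k \log h_k + (r_k - a_k) \log (1 - h_k) \right\}$, the stationarity condition is a one-line calculation:

$$ \begin{align*} \frac{\partial l}{\partial h_k} & = \frac{a_k}{h_k} - \frac{r_k - a_k}{1 - h_k} = 0 \\ \Rightarrow \quad a_k \left( 1 - h_k \right) & = \left( r_k - a_k \right) h_k \\ \Rightarrow \quad \hat{h}_k & = \frac{a_k}{r_k} \end{align*} $$

This is the hazard estimate that appears inside the Kaplan-Meier product.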

Parametric $h_k(\theta)$¶

If discretized in time:

  • The likelihood is a product of binomial terms, one per time step, each with $r_k$ trials
  • Model probabilities are derived via the integrated hazard function
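
One way to read the second bullet: the probability of failing in $[t_k, t_{k+1})$, given survival to $t_k$, is $1 - \exp\{-(H(t_{k+1}) - H(t_k))\}$. The sketch below plugs that into the binomial log-likelihood. The constant-hazard model, the toy counts, and the grid scan are all assumptions chosen to keep the example short; they are not the model used in the analysis.

```python
import math


def discrete_hazards(cum_hazard, times):
    """h_k = 1 - exp(-(H(t_{k+1}) - H(t_k))) for consecutive time points."""
    return [
        1.0 - math.exp(-(cum_hazard(t1) - cum_hazard(t0)))
        for t0, t1 in zip(times[:-1], times[1:])
    ]


def log_likelihood(hazards, at_risk, failures):
    """Sum of binomial log-likelihood terms, one per period, r_k trials each."""
    total = 0.0
    for h, r_k, a_k in zip(hazards, at_risk, failures):
        total += a_k * math.log(h) + (r_k - a_k) * math.log(1.0 - h)
    return total


times = [0, 1, 2, 3]  # month boundaries
r = [10, 8, 5]        # toy risk-set sizes
a = [2, 1, 0]         # toy failure counts


def fit_value(lam):
    # Hypothetical one-parameter model: constant hazard, H(y) = lam * y.
    return log_likelihood(discrete_hazards(lambda y: lam * y, times), r, a)


# Crude grid scan over lam, standing in for a proper optimizer.
grid = [0.01 * j for j in range(1, 100)]
best = max(grid, key=fit_value)
print(f"grid MLE of lam: {best:.2f}")
```

For a constant hazard the maximizer has the closed form $\lambda = -\log \left( 1 - \sum_k a_k / \sum_k r_k \right)$, which the grid scan approximates.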

Survival Modeling: Hypothesis testing¶

For each $t_k$, a 2 by 2 contingency table:

| | Group A | Group B | Total |
|---|---|---|---|
| failed | $a_{k,1}$ | $a_{k,2}$ | $a_{k,1} + a_{k,2}$ |
| survived | $r_{k,1} - a_{k,1}$ | $r_{k,2} - a_{k,2}$ | $(r_{k,1} + r_{k,2}) - (a_{k,1} + a_{k,2})$ |
| total | $r_{k,1}$ | $r_{k,2}$ | $r_{k,1} + r_{k,2}$ |

Under the null hypothesis, and conditional on the table margins, the Group A failure count, $A_{k,1}$, is hypergeometric:

$$ \prp{ A_{k,1} = a_{k,1} } = \frac{ {{a_{k,1} + a_{k,2}} \choose a_{k,1}} {{(r_{k,1} + r_{k,2}) - (a_{k,1} + a_{k,2})} \choose {r_{k,1} - a_{k,1}}} } { r_{k,1} + r_{k,2} \choose r_{k,1} } $$

Log rank statistic: aggregate across the time index, $\mathcal{T}$:

$$ Z = \frac{ \sum_{k \in \mathcal{T}} (A_{k,1} - \Ev{A_{k,1}}) }{ \sqrt{ \sum_{k \in \mathcal{T}} \Var{A_{k,1}} } } $$
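
The statistic can be computed directly from the per-period tables using the standard hypergeometric mean and variance. This is a minimal sketch: the function name, argument layout, and toy counts are assumptions for illustration.

```python
import math


def logrank_z(r1, a1, r2, a2):
    """Log rank Z from per-period risk-set sizes (r) and failure counts (a)."""
    observed_minus_expected = 0.0
    variance = 0.0
    for rk1, ak1, rk2, ak2 in zip(r1, a1, r2, a2):
        r, a = rk1 + rk2, ak1 + ak2
        if r < 2:
            continue  # variance is undefined with fewer than two units at risk
        # Hypergeometric mean and variance of A_{k,1} given the table margins.
        mean = rk1 * a / r
        var = rk1 * rk2 * a * (r - a) / (r * r * (r - 1))
        observed_minus_expected += ak1 - mean
        variance += var
    if variance == 0.0:
        return 0.0  # no failures anywhere: nothing to compare
    return observed_minus_expected / math.sqrt(variance)


# Toy counts: all failures occur in group 1.
z = logrank_z(r1=[10, 6], a1=[4, 2], r2=[10, 10], a2=[0, 0])
print(f"Z = {z:.3f}")
```

A positive $Z$ means group 1 fails more often than the pooled hazard predicts; identical groups give $Z = 0$.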

Metrics¶

Metrics: Internal rate of return¶

For a cash flow with no risk of default, IRR is the $r$ that satisfies the earlier arbitrage condition:

$$ \begin{align*} \utif^T P_0 = c \sum_{i=0}^{T-1} \utif^i \end{align*} $$

Dividing both sides by $\utif^T$, we write the present value of the cash flow as

$$ \begin{align*} P_0 = \frac{1}{\utif}c + \frac{1}{\utif^2}c + \cdots + \frac{1}{\utif^T}c \end{align*} $$

and the idea is to appropriately discount each future installment.
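
Because the present value is decreasing in $r$, the IRR can be recovered by bisection. The sketch below is illustrative only: the function names are assumptions, and the inputs are the principal, installment, and term of the "prepayment" loan from the Lending Club examples.

```python
def present_value(rate, installment, term):
    """Discounted value of `term` monthly installments at annualized `rate`."""
    growth = 1.0 + rate / 12.0
    return sum(installment / growth**k for k in range(1, term + 1))


def irr(p0, installment, term, lo=0.0, hi=1.0, tol=1e-10):
    """Bisection on the rate; assumes the root lies between lo and hi."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if present_value(mid, installment, term) > p0:
            lo = mid  # discounting too little: the rate must be higher
        else:
            hi = mid
    return 0.5 * (lo + hi)


rate = irr(20000.0, 662.19, 36)
print(f"IRR: {100 * rate:.2f}%")
```

The recovered rate agrees with the loan's stated 11.78% up to the rounding of the installment.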

Metrics: stochastic discount factors¶

Let $W_T^{(0)}$ be a portfolio with present value \$1; annualized interest rate, $r$; an installment of $f = c / P_0$; and $T$ payments remaining.

arbitrage requirement¶

$$ \begin{align*} \Ev{ W_T^{(0)} } & = \Ev{ D_{1,0} \left( W_{T-1}^{(0)} + f W_{T}^{(1)} \right) } \\ \end{align*} $$

Metrics: arbitrage requirement¶

$$ \begin{align*} \Ev{ W_T^{(0)} } & = \Ev{ D_{1,0} \left( W_{T-1}^{(0)} + f W_{T}^{(1)} \right) } \\ & = \Evk{ D_{1,0} \left. \left( W_{T-1}^{(0)} + f W_T^{(1)} \right) \right\vert Y^{(0)} \geq 1 } \prp{ Y^{(0)} \geq 1 } + 0 \\ & = \frac{ \prp{ Y^{(0)} \geq 1 } }{ \utif } \left\{ \Evk{ \left. W_{T-1}^{(0)} \right\vert Y^{(0)} \geq 1 } + f \Ev{ W_T^{(1)} } \right\} \\ & = d_{1,0} \Evk{ \left. W_{T-1}^{(0)} \right\vert Y^{(0)} \geq 1 } + d_{1,0} f \Ev{ W_T^{(1)} } \end{align*} $$

We can apply the above argument recursively:

$$ \begin{align*} 1 & = \Ev{ W_T^{(0)} } \\ & = d_{1,0} f + d_{2,1} d_{1,0} f + \cdots + \left( d_{T,T-1}\cdots d_{1,0} \right) f \end{align*} $$

yielding the risk-neutral discount terms

$$ d_{k,0} = d_{k,k-1}\cdots d_{1,0} = \frac{ \prp{ Y^{(0)} \geq k } }{ \utif^k } $$

Metrics: Risk neutral IRR¶

We can now meaningfully compute a risk-neutral IRR, $r^*$, for the weighted cash flow:

$$ \begin{align*} P_0 = \frac{\prp{ Y^{(0)} \geq 1 } }{\rnutif}c + \frac{\prp{ Y^{(0)} \geq 2}}{\rnutif^2}c + \cdots + \frac{\prp{ Y^{(0)} \geq T}}{\rnutif^T}c \end{align*} $$
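
Solving this equation for $r^*$ is again a one-dimensional root find, now with each installment weighted by its survival probability. In the sketch below, the function name, the bisection bracket, and the 0.2% monthly hazard are assumptions for illustration; in practice the survival probabilities would be read off an estimated curve.

```python
def risk_neutral_irr(p0, installment, survival, lo=-0.99, hi=1.0, tol=1e-10):
    """Solve P_0 = sum_k Pr(Y >= k) * c / (1 + r*/12)^k for r* by bisection.

    survival[k-1] holds Pr(Y >= k) for k = 1..T.
    Assumes the root lies between lo and hi.
    """
    def weighted_pv(rate):
        growth = 1.0 + rate / 12.0
        return sum(
            s * installment / growth**k
            for k, s in enumerate(survival, start=1)
        )

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weighted_pv(mid) > p0:
            lo = mid  # weighted cash flow is still worth more than P_0
        else:
            hi = mid
    return 0.5 * (lo + hi)


p0, c, term = 20000.0, 662.19, 36
no_default = [1.0] * term
# Hypothetical survival curve: constant 0.2% hazard per month.
risky = [0.998**k for k in range(1, term + 1)]

r_plain = risk_neutral_irr(p0, c, no_default)
r_star = risk_neutral_irr(p0, c, risky)
print(f"IRR without default risk: {100 * r_plain:.2f}%")
print(f"risk-neutral IRR:         {100 * r_star:.2f}%")
```

With survival probabilities all equal to one, this reduces to the ordinary IRR. With a constant monthly hazard $h$, the weights fold into the discount factor, so $1 + r^*/12 = (1 - h)(1 + r/12)$.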

Metrics: Probability of repayment¶

  • Trivial from KM curve, $\prp{ Y \geq T }$

  • Potentially useful as:

    • an ethical lending constraint;
    • a component of a credit score; e.g., the maximal risk-neutral IRR on a loan amount set to 10% of income that has at least a 95% chance of repayment

Lending Club¶

January 2014¶

  • \$5bn in loans issued, \$1.55bn valuation
  • A new financial market?
  • Historical data!

December 2020¶

Lending Club was an American peer-to-peer lending company...

---Wikipedia

But historical data is still interesting...

Lending Club: Covariate set, at origination¶

| column | description |
|---|---|
| loan-amnt | loan amount requested |
| funded-amnt | loan amount funded |
| int-rate | interest rate on loan |
| installment | monthly payment |
| grade | loan quality grade |
| sub-grade | loan quality subgrade |
| purpose | loan category: DEBT-CONSOLIDATION, MEDICAL, ETC. |
| emp-length | employment length |
| home-ownership | home ownership status: RENT, OWN, MORTGAGE, OTHER |
| annual-inc | self-reported annual income |
| is-inc-v | income verified |
| fico | FICO score |
| dti | debt-to-income ratio |
| earliest-cr-line | date of earliest reported credit line |
| open-acc | number of open credit lines |
| revol-bal | total credit revolving balance |
| revol-util | percent credit used |
| total-acc | total number of credit lines in credit file |
| delinq-2yrs | number of 30+ days past-due incidences of delinquency in the borrower's credit file for the past 2 years |
| inq-last-6mths | number of inquiries by creditors in the past 6 months |
| mths-since-last-delinq | months since the borrower's last delinquency |

Lending Club: Covariate set, post-origination¶

| column | description |
|---|---|
| loan-status | current status of the loan: Charged Off, Current, Fully Paid |
| total-rec-int | interest received to date |
| total-rec-prncp | principal received to date |
| total-rec-late-fee | late fees received to date |
| out-prncp | remaining outstanding principal for total amount funded |
| total-pymnt | payments received to date for total amount funded |
| last-pymnt-d | last date payment was received |
| next-pymnt-d | next scheduled payment date |
| pymnt-plan | indicates if a payment plan is in place for the loan |

Lending Club: Challenges¶

The full cash flow is unavailable. For post-origination, on the day of data collection, we know only:

  • last and next payment dates
  • total principal, interest, and fees paid out
  • loan status: current, paid, late, very late, charged off

There is no visibility into the timeliness or completeness of payments, so it is difficult to establish a time of default or to assess prepayment risk.

In [3]:
"""
  Three estimators of loan length: 
    a. last recorded payment        -- "dur_last_pymnt_date"
    b. number of installments paid  -- "dur_total_installment"
    c. interest paid out            -- "dur_total_interest"
"""
examples
Out[3]:
| | charged off | prepayment | standard | malformed loan |
|---|---|---|---|---|
| loan_amnt | 1800 | 20000 | 18000 | 7500 |
| int_rate | 11.89 | 11.78 | 14.72 | 13.8 |
| term | 36 | 36 | 36 | 36 |
| installment | 59.7 | 662.19 | 621.52 | 236.86 |
| issue_d | 2008-12-05 | 2008-12-05 | 2010-08-06 | 2008-09-26 |
| last_pymnt_d | 2010-03-30 | 2010-09-08 | 2013-08-22 | 2011-09-29 |
| total_pymnt | 673.21 | 23036.2 | 22368.3 | 8526.93 |
| total_rec_prncp | 436.01 | 20000 | 18000 | 6950 |
| full_interest | 349.2 | 3838.84 | 4374.72 | 1026.96 |
| total_rec_int | 160.99 | 3003.13 | 4368.26 | 1576.93 |
| loan_status | Charged Off | Fully Paid | Fully Paid | Fully Paid |
| dur_last_pymnt_date | 15.7703 | 21.0928 | 36.5346 | 36.0747 |
| dur_total_installment | 11.2765 | 34.7879 | 35.9896 | 35.9999 |
| dur_total_interest | 10.134 | 19.9876 | 35.1976 | 24.7193 |
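
The notebook does not show how the three duration columns were computed. The sketch below is a reconstruction inferred from the table values, not the original code: the function bodies, the month length of 365.2425 / 12 days, and the linear interpolation in (c) are assumptions, chosen because they reproduce the charged-off column.

```python
from datetime import date

DAYS_PER_MONTH = 365.2425 / 12  # assumed; consistent with the table's values


def dur_last_pymnt_date(issue_d, last_pymnt_d):
    """(a) Months elapsed between the issue date and the last payment."""
    return (last_pymnt_d - issue_d).days / DAYS_PER_MONTH


def dur_total_installment(total_pymnt, installment):
    """(b) Total payments received, in units of one installment."""
    return total_pymnt / installment


def dur_total_interest(loan_amnt, int_rate, installment, term, total_rec_int):
    """(c) Months of scheduled amortization needed to accrue total_rec_int.

    Walks the amortization schedule and interpolates linearly within the
    month in which cumulative scheduled interest passes total_rec_int.
    """
    monthly = int_rate / 100.0 / 12.0
    balance, paid = loan_amnt, 0.0
    for month in range(term):
        interest = balance * monthly
        if paid + interest >= total_rec_int:
            return month + (total_rec_int - paid) / interest
        paid += interest
        balance += interest - installment
    return float(term)  # more interest received than the schedule accrues


# The "charged off" example from the table.
a = dur_last_pymnt_date(date(2008, 12, 5), date(2010, 3, 30))
b = dur_total_installment(673.21, 59.7)
c = dur_total_interest(1800, 11.89, 59.7, 36, 160.99)
print(f"{a:.4f} {b:.4f} {c:.4f}")
```

The three estimates disagree by several months for the charged-off loan, which is the difficulty the notes below describe.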

Charged off¶

could be up to 120 days late

Prepayment¶

censoring should be at time of payoff only if loan had fixed payments until the final payoff

Standard¶

The duration metrics differ slightly because installment × term (22,374.72) does not match the total payments received (22,368.30).

Malformed Loan¶

The loan parameters don't satisfy the arbitrage condition; the mismatch exceeds 0.1%.

Lending Club: Data challenges¶

Low data quality¶

  • Nonconformal loan parameters
  • Impossible payment aggregations

Poor data design¶

  • When did a loan fail? (larger question)
  • Missed opportunity to measure IRR directly

Machine Learning¶

Machine Learning: Where to start?¶

  • We can estimate survival curves for sets of loans (Kaplan Meier).
  • We can quantify the difference between equivalence classes (log rank statistic).
  • We don't want to work too hard.

Random forest¶

  • Split on the log rank statistic.
  • Obtain bootstrap estimates (and variances) of our metrics.
  • Easily handle categorical variables and outliers.

Machine Learning: Random forest, construction¶

Take $B$ bootstrap samples, each of size $N$.

For each bootstrap sample, build a partitioning tree, $T_b$:

  1. Select $m < p$ of the $p$ covariates as split candidates
  2. For each split candidate, determine a set of split values
  3. Compute the best split, the one with the largest log rank statistic
  4. Partition the node and recurse

  • In the tree diagram, blue nodes are terminal: a node is not split further once it is too small or statistically homogeneous
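
Steps 2 and 3 can be sketched for a single numeric covariate: try each observed value as a threshold and keep the one whose two child nodes are most separated by the log rank statistic. Everything here is an assumption for illustration, including the function names, the `min_leaf` stopping rule, the convention that a loan censored in month $d$ is still at risk in month $d$, and the toy data.

```python
import math


def counts(durations, events, horizon):
    """Risk-set sizes r_k and failure counts a_k for months k = 1..horizon."""
    r = [sum(1 for d in durations if d >= k) for k in range(1, horizon + 1)]
    a = [
        sum(1 for d, e in zip(durations, events) if e and d == k)
        for k in range(1, horizon + 1)
    ]
    return r, a


def logrank_z(left, right, horizon):
    """Log rank Z between two (durations, events) groups."""
    (r1, a1), (r2, a2) = counts(*left, horizon), counts(*right, horizon)
    num = var = 0.0
    for rk1, ak1, rk2, ak2 in zip(r1, a1, r2, a2):
        r, a = rk1 + rk2, ak1 + ak2
        if r < 2:
            continue
        num += ak1 - rk1 * a / r
        var += rk1 * rk2 * a * (r - a) / (r * r * (r - 1))
    return num / math.sqrt(var) if var > 0 else 0.0


def best_split(x, durations, events, horizon, min_leaf=2):
    """Return (|Z|, threshold) for the best split of the form x <= threshold."""
    best = (0.0, None)
    for threshold in sorted(set(x))[:-1]:
        left = [i for i, v in enumerate(x) if v <= threshold]
        right = [i for i, v in enumerate(x) if v > threshold]
        if min(len(left), len(right)) < min_leaf:
            continue  # a child this small would be a terminal node anyway
        z = logrank_z(
            ([durations[i] for i in left], [events[i] for i in left]),
            ([durations[i] for i in right], [events[i] for i in right]),
            horizon,
        )
        if abs(z) > best[0]:
            best = (abs(z), threshold)
    return best


# Toy data: low scores default early, high scores are censored at month 12.
score = [600, 620, 640, 700, 720, 740]
months = [3, 5, 4, 12, 12, 12]
defaulted = [True, True, True, False, False, False]
z, threshold = best_split(score, months, defaulted, horizon=12)
print(f"split at score <= {threshold}, |Z| = {z:.3f}")
```

A full tree would apply `best_split` to each of the $m$ candidate covariates at a node, partition on the winner, and recurse on both children.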

Machine Learning: Random forest, scoring¶

Score a loan, $i$, via its out-of-bag (OOB) tree set, $\mathcal{T}_i$, the trees whose bootstrap samples did not include $i$:

  1. For each $t \in \mathcal{T}_i$, find the leaf node that would have contained $i$
  2. Compute any summary statistics, $S_{i,t}$
  3. $\mathcal{X}_i = \left\{ S_{i,t} : t \in \mathcal{T}_i \right\}$ is a predictive distribution of the summary statistics associated with loan $i$
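
The bookkeeping behind the OOB tree set is small enough to sketch. The names below are assumptions, and `leaf_statistic` is a hypothetical callback standing in for steps 1 and 2 (dropping loan $i$ down tree $t$ and summarizing its leaf); the placeholder used here only makes the sketch runnable.

```python
import random


def bootstrap_samples(n, b, seed=0):
    """B bootstrap samples of size N: row indices drawn with replacement."""
    rng = random.Random(seed)
    return [[rng.randrange(n) for _ in range(n)] for _ in range(b)]


def oob_tree_set(i, samples):
    """Indices of the trees whose bootstrap sample never drew loan i."""
    return [t for t, sample in enumerate(samples) if i not in sample]


def predictive_distribution(i, samples, leaf_statistic):
    """Collect S_{i,t} over the OOB trees of loan i."""
    return [leaf_statistic(i, t) for t in oob_tree_set(i, samples)]


samples = bootstrap_samples(n=50, b=200)


def toy_stat(i, t):
    # Placeholder statistic; a real one would query the leaf of tree t.
    return float(t % 5)


dist = predictive_distribution(7, samples, toy_stat)
print(len(dist), "OOB trees for loan 7")
```

For large $N$, a given loan is out-of-bag for roughly $e^{-1} \approx 37\%$ of the trees, so each loan gets a predictive distribution built from trees that never saw it.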

Machine Learning: Random forest, survival curves¶

Machine Learning: Variable importance¶

Summary¶

Summary: Context matters¶

Investor¶

  • Terms of the loan are known.
  • Analysis is retrospective.

Lender¶

  • Terms of the loan are to be specified.
  • Controlled experimentation is possible (perhaps expensive).

Summary: Using an investor-trained model¶

  • Term and loan amounts: likely fixed.
  • One remaining free parameter: interest rate or installment.

Analogous to profile likelihood / parameter scan:

  1. Partition the covariate vector: loan- and user-specific components
  2. Replace loan-specific components by a set of loan parameters (levels)
  3. For each query, use the corresponding survival curve to infer IRR, probability of repayment.