Principal Data Scientist, Conviva
Jun 2021 - Dec 2022
My technical work has focused on improvements to the Stream ID product (community detection). More generally, I've worked to socialize data science best practices and build a more data-driven culture. Tools: Scala, Python, Databricks, SBT, GCP.
- I designed and built a Scala implementation of Conviva's next-generation household ID algorithm. It used more modern, flexible statistical machinery to preprocess and detect communities in unreliably labeled data (a minimal, illustrative sketch appears after this list). The prototype was used for an RFP on a production-sized dataset that required functionality outside the scope of our production algorithm.
- I designed and built a framework to enable reproducible research (sketched after this list). The idea is that an analysis typically comprises multiple tasks that can be described by a DAG, and, in general, each node of the DAG uses a different parameter set. Typically, one wants to build and test each node independently, with upstream calculations cached when possible. The framework manages these incremental results and enables low-overhead, per-task logging through log4j / slf4j. It is designed to be useful for local testing, for deployment as a JAR in Databricks, and as a command-line Databricks job submission tool. It centralizes the implementation of common tasks and largely solves the problem of Databricks notebook drift, enabling a shift where Databricks is used to prototype an idea that is subsequently properly codified within the framework.
- I'm currently providing data science support and consulting to a new product that aims to provide instrumentation as a service beyond the streaming video space.
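
For illustration, a minimal sketch of the general shape of the community-detection step, not the production algorithm: it assumes a Spark GraphX label-propagation pass over a device co-occurrence graph, with a simple confidence filter standing in for the preprocessing of unreliable labels. The names (DeviceEdge, HouseholdSketch, householdsFrom, minEdgeWeight) are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.lib.LabelPropagation

// A noisy co-occurrence edge between two device identifiers (hypothetical shape).
final case class DeviceEdge(src: VertexId, dst: VertexId, weight: Double)

object HouseholdSketch {
  // Drop low-confidence edges (a crude stand-in for preprocessing unreliably
  // labeled data), build a graph, and assign each device a community label.
  def householdsFrom(edges: RDD[DeviceEdge],
                     minEdgeWeight: Double = 0.5,
                     maxSteps: Int = 5): RDD[(VertexId, VertexId)] = {
    val reliable = edges.filter(_.weight >= minEdgeWeight).map(e => (e.src, e.dst))
    val graph = Graph.fromEdgeTuples(reliable, defaultValue = 1)
    // Each vertex ends up labeled with a community id, read here as a household id.
    LabelPropagation.run(graph, maxSteps).vertices
  }
}
```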
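
And a minimal sketch of the core idea behind the reproducible-research framework, not the framework itself: tasks as DAG nodes, each with its own parameter set and its own slf4j logger, computed at most once per run. Every name here (Task, Runner, Preprocess, Summarize) is hypothetical, and the on-disk caching and Databricks job-submission pieces are elided.

```scala
import scala.collection.mutable
import org.slf4j.LoggerFactory

// A node in the analysis DAG: its own parameter set, its own logger, and the
// upstream tasks whose cached outputs it consumes.
abstract class Task[+A](val name: String) {
  protected val log = LoggerFactory.getLogger(s"task.$name") // per-task logging
  def upstream: Seq[Task[Any]] = Seq.empty
  def compute(): A
}

// Walks the DAG, computing each task at most once and caching incremental
// results so a single node can be rebuilt and tested without re-running
// everything upstream.
object Runner {
  private val cache = mutable.Map.empty[String, Any]
  def run[A](task: Task[A]): A =
    cache.getOrElseUpdate(task.name, {
      task.upstream.foreach(t => run(t)) // ensure upstream results exist first
      task.compute()
    }).asInstanceOf[A]
}

// Usage: a tiny two-node DAG where each node carries its own parameters.
final class Preprocess(threshold: Double) extends Task[Seq[Double]]("preprocess") {
  def compute(): Seq[Double] = {
    log.info(s"filtering with threshold=$threshold")
    Seq(0.1, 0.7, 0.9).filter(_ >= threshold)
  }
}
final class Summarize(prep: Preprocess) extends Task[Double]("summarize") {
  override def upstream = Seq(prep)
  def compute(): Double = Runner.run(prep).sum // served from the cache, not recomputed
}

object Example extends App {
  println(Runner.run(new Summarize(new Preprocess(threshold = 0.5))))
}
```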