Dustin Lennon
Applied Scientist
dlennon.org

Curriculum Vitae

Dustin Lennon
February 2024
https://dlennon.org/cv


About

I get excited about building analytical deliverables and take pride in developing code that is robust, reproducible, and scalable. My skill set is half software engineer, half statistician.

Highlights
  • "big picture" experience: scoping a business problem; establishing data requirements; robustifying data pipelines; generating reproducible results
  • expertise in time series forecasting, including real-time anomaly detection and retrospective "root cause" analysis
  • extensive machine learning and statistical modeling, including novel extensions of off-the-shelf algorithms
  • patent inventor; published in peer-reviewed, academic journal

My email is dustin.lennon@gmail.com.

Work Experience

  • Principal Data Scientist, Conviva

    Jun, 2021 - Dec, 2022

    My technical work focused on improvements to the Stream ID product (community detection). More generally, I worked to socialize data science best practices and build a more data-driven culture. Scala, Python, Databricks, SBT, GCP.

    • I designed and built a Scala implementation of Conviva's next generation household ID algorithm. This used more modern, flexible statistical machinery to preprocess and detect communities in unreliably labeled data. The prototype was used for an RFP on a production-sized dataset that required functionality outside the scope of our production algorithm.

    • I designed and built a framework to enable reproducible research. The idea is that an analysis typically comprises multiple tasks that can be described by a DAG, where, in general, each node of the DAG uses a different parameter set. One wants to build and test the functionality of each node independently, and upstream calculations should be cached when possible. The framework manages these incremental results and enables low-overhead, per-task logging through log4j / slf4j. It is designed to be useful for local testing, deployed as a JAR in Databricks, and as a Databricks job submission tool on the command line. It centralizes the implementation of common tasks and largely solves the problem of Databricks notebook drift, enabling a shift where Databricks is used to prototype an idea that is subsequently properly codified within the framework. A minimal sketch of the core idea follows these bullets.

    • Near the end of my tenure, I provided data science support / consulting to a new product aiming to provide instrumentation as a service beyond the streaming video space.
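
    The production framework is Scala; below is a minimal Python sketch of the core idea from the framework bullet above: a DAG of parameterized tasks whose incremental results are cached, so each node can be built and tested independently. All names (Task, run, the cache layout) are hypothetical.

      import hashlib, json, pickle
      from pathlib import Path

      CACHE = Path("cache")          # hypothetical home for incremental results
      CACHE.mkdir(exist_ok=True)

      class Task:
          """A DAG node: a function, its parameter set, and its upstream tasks."""
          def __init__(self, name, fn, params=None, deps=()):
              self.name, self.fn = name, fn
              self.params, self.deps = params or {}, deps

          def key(self):
              # The cache key covers this node's params and its upstream keys,
              # so a change anywhere upstream invalidates downstream results.
              payload = json.dumps({"name": self.name, "params": self.params,
                                    "deps": [d.key() for d in self.deps]},
                                   sort_keys=True)
              return hashlib.sha256(payload.encode()).hexdigest()

          def run(self):
              path = CACHE / f"{self.name}-{self.key()[:12]}.pkl"
              if path.exists():                      # cached upstream result
                  return pickle.loads(path.read_bytes())
              inputs = [d.run() for d in self.deps]  # recurse through the DAG
              result = self.fn(*inputs, **self.params)
              path.write_bytes(pickle.dumps(result))
              return result

      # usage: each node is built, parameterized, and tested independently
      raw = Task("load", lambda n: list(range(n)), {"n": 10})
      agg = Task("sum", lambda xs, scale: scale * sum(xs), {"scale": 2}, deps=(raw,))
      print(agg.run())   # 90 on the first run; cache hits thereafter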

  • Consultant, Inferentialist LLC

    May, 2012 - Jun, 2021

    My consulting company, through which I provided statistical and data science expertise to clients. Engagements included:

    • capacity planning;

    • promoting, designing, and delivering controlled experiments (A/B tests) where possible;

    • marketing analyses, including engagement ladders;

    • customer retention models and churn analysis;

    • data integrity, data consistency, and data quality assessments.

  • Senior Data Scientist, ServiceNow

    Dec, 2016 - Oct, 2017

    I developed, implemented, and tested statistical algorithms for the Operational Intelligence team.

    • Developed a framework for simulating correlated events from (randomly generated) branching processes that included (randomly) censored event data; an illustrative sketch follows these bullets.

    • Developed a novel event correlation algorithm that recovered 98% of correlated event pairs in the simulated datasets and generated new insights in unlabeled customer datasets.

    • Implemented a bivariate Kalman filter with level, trend, and seasonality components that allowed for missing data; combined this with a novel 'split-hypothesis' paradigm to detect anomalous jumps in the state space. US Patent 10635565.
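
    An illustrative Python sketch of the kind of simulator described in the first bullet above; the distributions and parameters are hypothetical, not the ones used at ServiceNow.

      import numpy as np

      rng = np.random.default_rng(0)

      def simulate_events(n_roots=20, offspring_mean=0.8, delay=2.0,
                          p_censor=0.2, horizon=100.0):
          """Each root event spawns a cascade of correlated child events:
          Poisson offspring counts, exponentially distributed delays, and
          random censoring to mimic lossy collection.  offspring_mean < 1
          keeps the branching process subcritical, so cascades terminate."""
          events = [(rng.uniform(0, horizon), None) for _ in range(n_roots)]
          frontier = list(range(n_roots))
          while frontier:
              parents, frontier = frontier, []
              for i in parents:
                  t_parent = events[i][0]
                  for _ in range(rng.poisson(offspring_mean)):
                      t_child = t_parent + rng.exponential(delay)
                      if t_child < horizon:
                          events.append((t_child, i))
                          frontier.append(len(events) - 1)
          # censoring can drop a parent while keeping its children
          observed = [(t, p) for (t, p) in events if rng.random() > p_censor]
          return sorted(observed, key=lambda e: e[0])

      events = simulate_events()   # list of (time, parent_index) pairs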

  • Researcher / Senior Data Scientist, Microsoft

    May, 2014 - Oct, 2015

    I provided statistical support for the O365 Customer Intelligence and Analysis and Bing Experimentation Teams.

    • Extended existing CLI software and incorporated the new tooling into a web service. This allowed external partner teams to preview the ExP platform while waiting for a more formal migration.

    • Designed and built a generic, extensible data summary library in SCOPE/C#. The library was widely adopted, and the Avocado Team internalized it as a core part of a web-facing, 'deep-dive' toolset.

    • Worked cross-functionally with PMs and developers on the OneNote and Exchange Online teams to enable A/B experimentation on the ExP platform. Designed first-run experiments for these teams.

    • Implemented a change point detection algorithm for support ticket volumes which was used to identify unannounced deployments of new instrumentation. Corroborating these identifications allowed us to improve the documentation of, and communication around, the deployment process. It also allowed us to better clean the data, drastically reducing the variability of statistical estimates.

    • Identified classes of support tickets that were strongly correlated with early tenant lifecycle / onboarding issues.

    • Developed an interpretable forecast model for customer support tickets which informed budget allocations related to future staffing requirements.

  • Quantitative Modeler, Zillow

    Jun, 2011 - Feb, 2012

    I worked on fixes for the Zestimate algorithm.

    • Advocated for interpretable home valuation models that incorporated spatial and temporal structures (in contrast to off-the-shelf random forests).

    • Developed a cross-validated, coefficient-of-variation metric to assess the risk of temporal instability in a home's Zestimate history. This indicated that Zestimates with non-physical behavior were far more prevalent than previously thought; a sketch of one reading of the metric follows these bullets.

    • Developed an alternative to the ZHVI (Zillow's proprietary home value index) based on estimating discount curves from longitudinal, repeat sales. This improved the estimator for small samples.

    • Identified, and removed (post hoc), 'spiky' Zestimate behavior in a collection of 100 million Zestimate histories. This resulted in 'corrections' to nearly 4 million time series.
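
    A sketch of one reading of the instability metric; this is a hypothetical illustration, not Zillow's implementation. Each point in a valuation history is compared against the mean of the remaining points, so an isolated spike inflates the score.

      import numpy as np

      def instability_score(history):
          """Leave-one-out coefficient of variation for a valuation history.
          Illustrative reading of the metric, not Zillow's implementation."""
          z = np.asarray(history, dtype=float)
          loo_mean = (z.sum() - z) / (len(z) - 1)   # mean with point i held out
          rel_dev = (z - loo_mean) / loo_mean       # relative deviation per point
          return np.sqrt(np.mean(rel_dev ** 2))     # RMS relative deviation

      smooth = [500, 505, 510, 515, 520]
      spiky = [500, 505, 900, 515, 520]             # non-physical jump
      print(instability_score(smooth), instability_score(spiky))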

  • Machine Learning Intern, Globys

    Oct, 2010 - Jun, 2011

    I built statistical models for up- and cross-sell marketing opportunities for mobile add-on packages.

    • I used the Apriori algorithm to create new features from historical purchases. These attributes had higher predictive power and produced significant lift in our models; a sketch of the feature construction follows these bullets.

    • I provided statistical support for implementing a stepwise, logistic regression model in production.
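
    A minimal sketch of the feature construction over hypothetical purchase data: mine frequent itemsets with an Apriori-style level-wise search, then flag which frequent bundles each subscriber's history contains.

      from itertools import combinations
      from collections import Counter

      # hypothetical purchase histories: a set of add-on packages per subscriber
      baskets = [
          {"data_100mb", "sms_500"}, {"data_100mb", "sms_500", "intl_calls"},
          {"data_100mb"}, {"sms_500", "intl_calls"}, {"data_100mb", "sms_500"},
      ]
      min_support = 0.4   # an itemset must appear in >= 40% of baskets

      # level-wise search: size-(k+1) candidates are joined only from frequent
      # size-k itemsets, which prunes the exponential search space
      frequent = []
      level = sorted({frozenset([i]) for b in baskets for i in b}, key=sorted)
      while level:
          counts = Counter(s for b in baskets for s in level if s <= b)
          keep = [s for s in level if counts[s] / len(baskets) >= min_support]
          frequent += keep
          level = sorted({a | b for a, b in combinations(keep, 2)
                          if len(a | b) == len(a) + 1}, key=sorted)

      # binary model features: does a subscriber's history contain each bundle?
      features = [[int(s <= b) for s in frequent] for b in baskets]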

  • Consultant, Numerix

    Jun, 2009 - Aug, 2010

    I provided expertise on numerical stability issues arising in the multi-factor backward lattice algorithm.

  • Senior Software Developer, Numerix

    Jun, 2006 - Aug, 2007

    I worked on numerical codes for pricing exotic financial derivatives.

    • Reverse engineered a multi-factor, backward-lattice pricing algorithm to diagnose and fix numerical instabilities.

    • Wrote new solvers for calibrating BGM LIBOR interest-rate models to market data.

    • Implemented a PDE solver to price Asian and Lookback options with discrete observation dates.

  • Technical Staff, MIT Lincoln Laboratory

    Sep, 2001 - May, 2002

    Sensor Measurement and Analysis Team

    • Implemented backscatter models and tracking algorithms for RADAR applications.

Projects

  • Capacity Planning (xCloud), Analytics

    Jun, 2020

    Introduced backtesting into an existing methodology, which showed that 13% of presumed-to-be-available resources were never needed.

    Developed a cohort- and market-segmented analysis using a Markov chain model designed to capture invitation lag, onset spike, and steady state behaviors. This new model suggested that the existing methodology consistently overestimated monthly steady state capacity needs by 100% to 300%.
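
    A toy version of the mechanism with hypothetical states and rates: a cohort moves between invited, active, and churned states, and the expected active share exhibits the invitation lag, an onset spike, and a lower steady state.

      import numpy as np

      # hypothetical monthly transition matrix over cohort states:
      # 0 = invited (not yet playing), 1 = active, 2 = churned
      P = np.array([
          [0.60, 0.35, 0.05],   # invitation lag: many invitees wait a month or more
          [0.00, 0.70, 0.30],   # active users retain at 70% month over month
          [0.00, 0.05, 0.95],   # small reactivation flow out of churn
      ])

      cohort = np.array([1.0, 0.0, 0.0])   # everyone starts as a fresh invitee
      active = []
      for month in range(24):
          cohort = cohort @ P
          active.append(cohort[1])         # expected share needing capacity

      # the peak is transient; sizing to it overstates steady-state needs
      print(f"peak active share: {max(active):.3f} (month {int(np.argmax(active))})")
      print(f"steady state:      {active[-1]:.3f}")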

    Deployed calibrated models into Excel to enable real-time, what-if scenarios for the finance team. The Excel add-in used the Python xlwings package.

  • Measuring Effect Size (xCloud), Analytics

    Apr, 2020

    Platform- and title-level analyses quantifying the effect of xCloud participation on invited-user behavior within the larger Xbox ecosystem. We considered user activity on the console in the 28 days prior to receiving an invitation, and user activity on both console and xCloud in the 28 days after first gameplay on xCloud (or synthetic first gameplay).

    Addressed self-selection bias by introducing sample weights for a non-participating, pseudo-control group. The weights were chosen to induce a change of measure that rebalanced the pseudo-control group against a participating, pseudo-treatment group. The rebalancing was with respect to previous console behavior.
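
    A minimal sketch of the change of measure with hypothetical covariates: weight each pseudo-control user by the ratio of treatment to control density of pre-period console usage, so the weighted control group approximately matches the treatment group's prior behavior.

      import numpy as np

      rng = np.random.default_rng(1)

      # hypothetical pre-period console hours; participants skew heavier
      control = rng.gamma(shape=2.0, scale=10.0, size=5000)
      treatment = rng.gamma(shape=3.0, scale=10.0, size=1000)

      # density-ratio weights on a shared binning of prior usage
      edges = np.quantile(control, np.linspace(0, 1, 21))
      t_dens, _ = np.histogram(treatment, bins=edges, density=True)
      c_dens, _ = np.histogram(control, bins=edges, density=True)
      idx = np.clip(np.digitize(control, edges) - 1, 0, len(t_dens) - 1)
      weights = t_dens[idx] / np.maximum(c_dens[idx], 1e-12)

      # after reweighting, control approximately matches treatment pre-period
      print(np.mean(treatment))                    # treatment mean
      print(np.average(control, weights=weights))  # reweighted control mean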

    Addressed invitation lag for the pseudo-control group by synthesizing a first gameplay date. This was accomplished by sampling from the distribution of invitation lag conditioned on the date of invitation obtained from the pseudo-treatment group.
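
    A sketch of the synthetic first-gameplay date with hypothetical columns: for each pseudo-control user, sample an invitation-to-first-play lag from pseudo-treatment users invited in the same week.

      import numpy as np
      import pandas as pd

      rng = np.random.default_rng(2)

      # hypothetical frames: treatment has observed lags, control does not
      treat = pd.DataFrame({"invite_week": rng.integers(0, 10, 800)})
      treat["lag_days"] = rng.exponential(14.0, len(treat)).round()
      ctrl = pd.DataFrame({"invite_week": rng.integers(0, 10, 500)})

      # sample a lag from the treatment distribution, conditioned on the
      # invitation week, to synthesize a first-gameplay date per control user
      lags_by_week = treat.groupby("invite_week")["lag_days"].apply(list)
      ctrl["synthetic_lag"] = [rng.choice(lags_by_week[w])
                               for w in ctrl["invite_week"]]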

    For title-level analyses, developed a family of usage-pattern indicator metrics which provided intuitive estimates of standard business metrics: retention, replacement, extension, new usage, exploration, and conversion.

  • Published Pages, Publishing Platform

    https://dlennon.org/pages

    Jul, 2019

    I designed the publishing platform that hosts this resume and my portfolio. The look and feel was inspired by the jsonresume.org 'elegant' theme. The project depends heavily on the Python Klein library, the Jinja2 templating engine, and pandoc to transform Markdown into HTML. The platform also permits easy inclusion of external, templated HTML; page-specific JavaScript; and registered Python plugins associated with ajax endpoints on the server.
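
    A minimal sketch of the rendering pipeline under simplified assumptions (the real platform adds templated includes, page-specific JavaScript, and plugin endpoints; cv.md and the route are hypothetical): Klein serves the route, pandoc converts the Markdown, and Jinja2 wraps the result.

      import subprocess
      from jinja2 import Template
      from klein import Klein

      app = Klein()
      PAGE = Template("<html><body><h1>{{ title }}</h1>{{ body }}</body></html>")

      def markdown_to_html(md: str) -> str:
          # shell out to pandoc to transform Markdown into HTML
          out = subprocess.run(["pandoc", "-f", "markdown", "-t", "html"],
                               input=md, capture_output=True, text=True, check=True)
          return out.stdout

      @app.route("/cv")
      def cv(request):
          request.setHeader("Content-Type", "text/html")
          with open("cv.md") as f:   # hypothetical Markdown source
              return PAGE.render(title="Curriculum Vitae",
                                 body=markdown_to_html(f.read()))

      app.run("localhost", 8080)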

  • Charity Search, Web Application

    Jul, 2019

    I imported non-profit IRS tax returns into Elasticsearch and built a website to search for local charities.

  • Multivariate Kalman Filter, Python Package

    Apr, 2019

    I developed a multivariate Kalman filter package for non-stationary time-series analysis; a minimal sketch follows the feature list.

    • Full multivariate model enabling fast, online analysis;
    • Native support for level, trend, and seasonality components;
    • Non-informative priors for state space initial conditions;
    • Automatic handling of missing data;
    • Support for modeling intervention effects.
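
    A univariate, local-level sketch of the missing-data handling named in the feature list (the package itself is multivariate, with trend and seasonality): a NaN observation skips the update step, so the prediction carries forward and its variance grows.

      import numpy as np

      def local_level_filter(y, q=0.1, r=1.0, m0=0.0, p0=1e6):
          """Local-level Kalman filter.  q: state noise variance, r:
          observation noise variance; a large p0 approximates a
          non-informative prior on the initial state."""
          m, p = m0, p0
          means, variances = [], []
          for obs in y:
              m_pred, p_pred = m, p + q                # predict step
              if np.isnan(obs):                        # missing observation:
                  m, p = m_pred, p_pred                # no update, variance grows
              else:
                  k = p_pred / (p_pred + r)            # Kalman gain
                  m = m_pred + k * (obs - m_pred)      # update step
                  p = (1 - k) * p_pred
              means.append(m)
              variances.append(p)
          return np.array(means), np.array(variances)

      y = np.array([1.0, 1.2, np.nan, np.nan, 1.1, 5.0])   # a gap, then a jump
      means, variances = local_level_filter(y)
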
  • Twittalytics, Web Application

    http://twittalytics.com

    Aug, 2018

    The app monitors the Twitter stream and maintains a dynamic list of trending hashtags and, for each hashtag, a random sample of relevant tweets.
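
    One standard way to keep a fixed-size uniform sample over an unbounded stream is reservoir sampling; the sketch below shows the technique (my choice for illustration, not necessarily the app's implementation).

      import random

      class Reservoir:
          """Uniform fixed-size sample over an unbounded stream (Algorithm R):
          after n items, each item is in the sample with probability k/n."""
          def __init__(self, k):
              self.k, self.n, self.sample = k, 0, []

          def add(self, item):
              self.n += 1
              if len(self.sample) < self.k:
                  self.sample.append(item)
              else:
                  j = random.randrange(self.n)   # uniform over all items seen
                  if j < self.k:
                      self.sample[j] = item      # evict a random slot

      # one reservoir per trending hashtag (stream handling is hypothetical)
      samples = {}
      for hashtag, tweet in [("#a", "t1"), ("#b", "t2"), ("#a", "t3")]:
          samples.setdefault(hashtag, Reservoir(100)).add(tweet)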

  • Interactive Spatial Heatmaps, Web Application

    Dec, 2013

    I built a website to visualize fused geographic (TIGER/Line) and census (ACS, BLS) datasets. This featured an interactive heatmap that reported spatial statistics aggregated at a metro (MSA) level.

  • Carpool Project, Android Application

    http://carpoolproject.org

    Feb, 2013

    I prototyped a location-tracking app to collect daily commute data with the intention of helping people create viable carpools.

Publications

  • Systems and methods for robust anomaly detection, US Patent No. 10635565

    Published on: Apr 28, 2020

    My contribution was the change detector described in the first claim and Fig. 17: using a secondary Kalman filter to distinguish transient noise from a level shift.
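
    A schematic sketch of the idea as described above, emphatically not the patented algorithm: after a surprising observation, score a 'transient noise' hypothesis (level unchanged) against a 'level shift' hypothesis (level jumps to the new observation) on the next few points, and accept the shift when it fits better. The threshold and horizon are hypothetical.

      import numpy as np

      def split_hypothesis(y, level, r=1.0, z_thresh=3.0, horizon=5):
          """Schematic change detector: a surprise spawns two hypotheses,
          transient noise vs. level shift, compared on subsequent data."""
          shifts = []
          for i, obs in enumerate(y):
              if abs(obs - level) / np.sqrt(r) > z_thresh:
                  window = y[i + 1 : i + 1 + horizon]
                  sse_noise = np.sum((window - level) ** 2)  # transient hypothesis
                  sse_shift = np.sum((window - obs) ** 2)    # shift hypothesis
                  if sse_shift < sse_noise:
                      shifts.append(i)
                      level = obs                            # accept the new level
          return shifts

      rng = np.random.default_rng(4)
      y = np.concatenate([rng.normal(0, 1, 50), rng.normal(8, 1, 50)])
      print(split_hypothesis(y, level=0.0))   # detects the jump near index 50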

  • The Effect of Active Users on Support Tickets, Microsoft Internal

    Published on: Oct 01, 2014

    This work presents a simple statistical analysis characterizing the relation between the number of active / engaged users in the system and the rate at which service request tickets are created.

  • Support Tickets: Confidence Intervals for Population Estimates, Microsoft Internal

    Published on: Aug 01, 2014

    This work showcases two simple models that allow for the construction of confidence intervals for population estimates associated with customer support tickets.
    Why is this important? Because it allows us to separate natural variation in business metrics from abnormal behavior that would warrant further investigation.
    What did we do? We built a model for data loss and a model for label misclassification. These models are used to assess how these two distinct sources of variation affect population estimates such as total minutes spent in customer service.

  • Probabilistic Performance Guarantees for Oversubscribed Resources, Inferentialist

    Published on: Nov 01, 2013

    The paper examines the risk associated with resource allocation in the case of oversubscription. We formulate the problem in a mathematical context and provide a business level parameterization that bounds, in probability, the rate of resource exhaustion.
    To validate the procedure, we run a simulation over different resource consumption scenarios. In the traditional case, we obtain a 26.4% usage rate; hence, 73.6% of our resource pool goes unused. Using the strategy described in the paper, we can guarantee, with 95% confidence, that resources will be available 99% of the time. This relaxation provides a 2.5x increase in utilization, and the median usage rate jumps to 66.7%.
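
    The figures above are from the paper; the sketch below only reproduces the flavor of the procedure under hypothetical demand: size a shared pool at a high quantile of simulated aggregate demand, with a bootstrap margin standing in for the 95% confidence bound.

      import numpy as np

      rng = np.random.default_rng(3)

      # hypothetical setup: 200 subscribers with bursty, mostly idle demand
      n_periods, n_subs = 10_000, 200
      demand = (rng.binomial(1, 0.25, (n_periods, n_subs))
                * rng.gamma(2.0, 1.0, (n_periods, n_subs)))
      total = demand.sum(axis=1)

      # traditional provisioning: reserve every subscriber's individual peak
      traditional = demand.max(axis=0).sum()

      # probabilistic provisioning: cover aggregate demand 99% of the time,
      # taking an upper bootstrap quantile as the 95% confidence adjustment
      boot = [np.quantile(rng.choice(total, total.size), 0.99)
              for _ in range(200)]
      capacity = np.quantile(boot, 0.95)

      print(f"traditional pool:   {traditional:.0f}")
      print(f"probabilistic pool: {capacity:.0f}")
      print(f"exhaustion rate:    {(total > capacity).mean():.4f}")   # about 1%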

  • Optimal Lending Club Portfolios, Inferentialist

    Published on: Oct 01, 2013

    This paper extends the concept of an actively managed Lending Club portfolio. It introduces a novel, random-forest-type algorithm that treats portfolio assets in a survival context. Using historical data provided by the company, we use our algorithm to construct optimal portfolios of Lending Club loans.
    Our results, driven by expected returns, compare favorably to investment strategies based solely on the loan grade assigned by Lending Club. Our optimal, actively managed portfolios have an expected return exceeding 12% annually. In contrast, portfolios constructed on A-grade loans return 6.68%; B-grade loans, 7.49%; and C-grade loans, 8.11%.

  • Measuring Microsatellite Conservation in Mammalian Evolution with a Phylogenetic Birth-Death Model, Genome Biology and Evolution

    Published on: May 16, 2012

    Microsatellites make up about three percent of the human genome, and there is increasing evidence that some microsatellites can have important functions and can be conserved by selection. To investigate this conservation, we performed a genome-wide analysis of human microsatellites and measured their conservation using a binary character birth-death model on a mammalian phylogeny. Using a maximum likelihood method to estimate birth and death rates for different types of microsatellites, we show that the rates at which microsatellites are gained and lost in mammals depend on their sequence composition, length, and position in the genome. Additionally, we use a mixture model to account for unequal death rates among microsatellites across the human genome. We use this model to assign a probability-based conservation score to each microsatellite. We found that microsatellites near the transcription start sites of genes are often highly conserved, and that distance from a microsatellite to the nearest transcription start site is a good predictor of the microsatellite conservation score. An analysis of gene ontology terms for genes that contain microsatellites near their transcription start site reveals that regulatory genes involved in growth and development are highly enriched with conserved microsatellites.

Skills

  • Data Science (Venti)
    Interpretable Models, Time Series, Visualization
  • Data Science (Grande)
    Controlled Experiments (A/B Tests), Feature Engineering, Data Cleaning, ETL Pipelines
  • Machine Learning (Grande)
    Predictive Models
  • Mathematics (Tall)
    Linear Algebra, Optimization, Numerical Analysis
  • Software Development (Grande)
    Python
  • Software Development (Tall)
    R, C/C++, bash (awk/grep/sed), PostgreSQL
  • Software Development (Short)
    Java, MapReduce

Education

  • Statistics, M.S., University of Washington

    Jan, 2006 - Jun, 2010

    GPA 3.81/4.00
    STAT504 - Applied Regression; STAT492 - Stochastic Calculus; STAT516/517 - Stochastic Modeling; STAT581/582/583 - Advanced Theory of Statistical Inference; MATH516 - Numerical Optimization; MATH530 - Convex Analysis; MATH582 - Convex Optimization Algorithms; STAT570 - Introduction to Linear Models; STAT599 - Statistical Consulting; BIOST579 - Data Analysis
  • Applied Mathematics, M.S., University of Washington

    Sep, 2003 - Dec, 2005

    GPA 3.82/4.00
    AMATH584 - Applied Linear Algebra; AMATH515 - Fundamentals of Optimization; AMATH585/586 - Boundary Value and Time Dependent Problems; MATH554 - Linear Analysis; EE520 - Spectral Analysis of Time Series; STAT538 - Statistical Computing; STAT530 - Wavelets
  • Computer Science, B.S.E., Princeton University

    Sep, 1997 - May, 2001

    graduated magna cum laude
    COS341 - Discrete Mathematics; COS423 - Theory of Algorithms; COS451 - Computational Geometry; COS426 - Computer Graphics; COS333 - Advanced Programming Techniques; COS425 - Database Systems; COS318 - Operating Systems; COS471 - Computer Architecture and Organization; ELE301 - Circuits and Signal Processing; ELE482 - Digital Signal Processing

Interests

  • At Work

    interpretable data science; machine learning; convex optimization; randomized algorithms; statistical computing; clean and efficient, intuitive design
  • Away From Work

    family, rock climbing, mountain biking, road cycling, hiking