context
Conviva wanted to participate in an RFP, but the rigidity of the existing, monolithic pipeline made it difficult. In particular, the project required exploratory data analysis, redesigning and generalizing the ingestion portion of the existing pipeline, and implementing a scalable, map-reduce variant of the Louvain community detection algorithm in Scala.
description
The RFP became a pretext for implementing a new and improved Stream ID product. The goals were to introduce a new, more flexible household model; to deliver an easily configurable, highly tunable, and interpretable model; to define a principled approach focused on time granularity and canonical graphical representations; and, finally, to produce something that, if correctly parameterized, could recover the behavior of the existing implementation.
These concerns were guided by customer use cases.
In particular, it was important to be able to tune the model around some notion of time scale: should clusterings reflect yesterday's data, last week's data, or last month's data? Alternatively, perhaps the dynamics of the graph are of interest, and older data should be discounted.
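As a minimal sketch, with hypothetical names, the time-scale choice could be exposed as a configuration that either windows the input or discounts it exponentially:

```scala
import java.time.{Duration, Instant}

// Hypothetical configuration: either restrict the input to a fixed lookback
// window, or keep all history and discount older observations exponentially.
sealed trait TimeScope
final case class FixedWindow(lookback: Duration) extends TimeScope
final case class ExponentialDecay(halfLife: Duration) extends TimeScope

object TimeScope {
  // Multiplier applied to an observation of a given age under the chosen scope.
  def discount(scope: TimeScope, observedAt: Instant, now: Instant): Double =
    scope match {
      case FixedWindow(lookback) =>
        if (Duration.between(observedAt, now).compareTo(lookback) <= 0) 1.0 else 0.0
      case ExponentialDecay(halfLife) =>
        val ageSeconds = Duration.between(observedAt, now).getSeconds.toDouble
        math.pow(0.5, ageSeconds / halfLife.getSeconds.toDouble)
    }
}
```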
It was also clear that customer context should inform whether Stream ID would work with clusters of clients, clusters of addresses, or clusters that include both types of nodes.
At a high level, it made sense to reframe the problem as a clustering on a weighted, colored graph.
But how to define weights? Again, customer context should inform the necessary configurations. I termed these "Association Metrics" and provided implementations for session length (usage), session counts (incidence), time-discounted session length, and time-discounted session counts. Each of these defines a weight between a client and an address. Such weighted, colored graphs can be reduced to single-color subgraphs, with new weights induced by those in the larger graph.
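A minimal Scala sketch of this setup, assuming a simple session data model; the names, the data model, and the min-combiner in the projection are illustrative assumptions, not the production code:

```scala
import java.time.{Duration, Instant}

// Hypothetical data model: a viewing session ties a client to an address.
final case class Session(client: String, address: String,
                         start: Instant, lengthSeconds: Long)

// An association metric turns the sessions shared by a (client, address)
// pair into a single edge weight in the two-colored graph.
trait AssociationMetric {
  def weight(sessions: Seq[Session], now: Instant): Double
}

// Usage: total session length.
object Usage extends AssociationMetric {
  def weight(sessions: Seq[Session], now: Instant): Double =
    sessions.map(_.lengthSeconds.toDouble).sum
}

// Incidence: session count.
object Incidence extends AssociationMetric {
  def weight(sessions: Seq[Session], now: Instant): Double =
    sessions.size.toDouble
}

// Time-discounted usage: older sessions contribute exponentially less.
final case class DiscountedUsage(halfLifeSeconds: Double) extends AssociationMetric {
  def weight(sessions: Seq[Session], now: Instant): Double =
    sessions.map { s =>
      val age = Duration.between(s.start, now).getSeconds.toDouble
      s.lengthSeconds.toDouble * math.pow(0.5, age / halfLifeSeconds)
    }.sum
}

object Projection {
  // Reduce the two-colored client-address graph to a client-client graph:
  // for each shared address, link the two clients with the smaller of their
  // weights to it, then sum over addresses. The min-combiner is illustrative.
  def toClients(weights: Map[(String, String), Double]): Map[(String, String), Double] =
    weights.toSeq
      .groupBy { case ((_, address), _) => address }
      .values
      .flatMap { edges =>
        for {
          ((c1, _), w1) <- edges
          ((c2, _), w2) <- edges
          if c1 < c2
        } yield ((c1, c2), math.min(w1, w2))
      }
      .groupMapReduce(_._1)(_._2)(_ + _)
}
```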
Another metric, one that mimicked the existing approach and was applicable to client-client and address-address graphs, was based on the idea of overlapping intervals. Effectively, this approach considered each client-address edge and computed an interval from the first time the pair was observed to the last time the pair was observed. The weight between two clients (or two addresses) was the overlap of their intervals, aggregated over all addresses at which the clients were coincident. Time-discounted versions of these metrics were also made available. During development, it was discovered that the overlapping-intervals metric tended to create larger subgraphs, which was one of the most common complaints among customers.
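Under the same kind of assumed data model, the overlapping-intervals weight might be sketched as follows (names are hypothetical):

```scala
import java.time.{Duration, Instant}

// For each client-address pair we keep the interval [first seen, last seen];
// the client-client weight is the overlap of the two clients' intervals,
// summed over every address where both clients appear.
final case class Interval(first: Instant, last: Instant) {
  def overlapSeconds(other: Interval): Long = {
    val start = if (first.isAfter(other.first)) first else other.first
    val end   = if (last.isBefore(other.last)) last else other.last
    math.max(0L, Duration.between(start, end).getSeconds)  // 0 if disjoint
  }
}

object OverlapMetric {
  // intervals: (client, address) -> observed interval
  def clientWeights(intervals: Map[(String, String), Interval]): Map[(String, String), Double] =
    intervals.toSeq
      .groupBy { case ((_, address), _) => address }
      .values
      .flatMap { entries =>
        for {
          ((c1, _), i1) <- entries
          ((c2, _), i2) <- entries
          if c1 < c2
        } yield ((c1, c2), i1.overlapSeconds(i2).toDouble)
      }
      .groupMapReduce(_._1)(_._2)(_ + _)
}
```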
Another concern was balancing competing objectives: is it more important to associate every entity (node) with a cluster, or to produce smaller clusters? We addressed this with a post-hoc pruning step, described below.
One line of thinking in vogue at the time was that Conviva should purchase third-party labels that classified IP addresses as residential or non-residential. That meant the v2 solution needed to provide a mechanism for a priori filtering.
Postal codes were another feature of interest. Should they inform whether nodes could be members of the same cluster? So I added a knock-in / knock-out feature that, again, could be toggled to fit the question of interest.
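A minimal sketch of how these two toggles might look, under one possible reading of the semantics (all names are hypothetical): a node-level filter applied before graph construction, and a pairwise rule consulted whenever two nodes might be placed in the same cluster.

```scala
// Hypothetical node attributes derived from third-party labels and postal data.
final case class Node(id: String, isResidential: Boolean, postalCode: Option[String])

sealed trait PostalRule
case object PostalIgnored extends PostalRule  // postal codes play no role
case object KnockIn       extends PostalRule  // matching codes required to co-cluster
case object KnockOut      extends PostalRule  // matching codes forbidden to co-cluster

final case class Constraints(residentialOnly: Boolean, postalRule: PostalRule) {

  // A-priori node filter driven by the purchased residential labels.
  def keepNode(n: Node): Boolean = !residentialOnly || n.isResidential

  // Pairwise constraint; nodes missing a postal code are left unconstrained.
  def mayCoCluster(a: Node, b: Node): Boolean = postalRule match {
    case PostalIgnored => true
    case KnockIn  => (for (x <- a.postalCode; y <- b.postalCode) yield x == y).getOrElse(true)
    case KnockOut => (for (x <- a.postalCode; y <- b.postalCode) yield x != y).getOrElse(true)
  }
}
```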
Another feature component evolved from a request to optionally map IPv6 addresses to a network prefix, and to optionally map client identifiers to a device fingerprint.
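A small sketch of those optional mappings, assuming a /64 prefix for IPv6 and a caller-supplied fingerprint lookup (both assumptions):

```scala
import java.net.InetAddress

// Hypothetical sketch of the optional identifier mappings.
object Canonicalize {

  // Keep the first 64 bits of an IPv6 address (its assumed /64 network
  // prefix) and zero the rest; IPv4 addresses pass through unchanged.
  def toPrefix(ip: String): String = {
    val bytes = InetAddress.getByName(ip).getAddress
    if (bytes.length == 16)                                 // IPv6 is 16 bytes
      InetAddress.getByAddress(bytes.take(8) ++ Array.fill[Byte](8)(0)).getHostAddress
    else ip
  }

  // Collapse a client identifier to a device fingerprint via a caller-supplied
  // function (e.g. a lookup into an assumed fingerprint table).
  def toDevice(clientId: String, fingerprint: String => String): String =
    fingerprint(clientId)
}
```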
With edges now carrying semantic meaning, defined by the choice of association metric, it made sense to enable thresholding of low-value edges.
For the most part, the above amounted to easily configurable preprocessing steps, after which the Louvain algorithm was applied. However, the knock-in / knock-out feature required an additional blacklisting / whitelisting step inside each Louvain iteration.
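One way the pieces might compose, as a sketch: preprocessing as a pipeline of independently toggleable edge transforms, with the knock-in / knock-out check surfacing as a candidate filter inside each Louvain pass. The names and signatures here are assumptions:

```scala
object Pipeline {
  type Edges = Map[(String, String), Double]

  // Drop edges whose weight, under the chosen association metric, falls
  // below a configurable threshold.
  def thresholdEdges(minWeight: Double)(edges: Edges): Edges =
    edges.filter { case (_, w) => w >= minWeight }

  // Compose independently toggleable preprocessing steps.
  def preprocess(steps: Seq[Edges => Edges])(edges: Edges): Edges =
    steps.foldLeft(edges)((acc, step) => step(acc))

  // The knock-in / knock-out hook inside a Louvain pass: a node may only
  // move into a community whose current members it is allowed to co-cluster
  // with. `members` and `mayCoCluster` are assumed, caller-supplied functions.
  def admissibleCommunities(node: String,
                            candidates: Set[Int],
                            members: Int => Set[String],
                            mayCoCluster: (String, String) => Boolean): Set[Int] =
    candidates.filter(c => members(c).forall(mayCoCluster(node, _)))
}
```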
Finally, I instituted a post-processing pruning step in which vertices were removed from a cluster when their removal did not significantly impact the modularity.
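A sketch of that criterion using standard weighted modularity (assuming an undirected graph, no self-loops, and every node assigned a community); a production version would update modularity incrementally instead of recomputing it per candidate:

```scala
object Pruning {
  type Edges = Map[(String, String), Double]

  // Weighted degree of each node: every incident edge contributes its weight.
  private def degrees(edges: Edges): Map[String, Double] =
    edges.toSeq
      .flatMap { case ((u, v), w) => Seq(u -> w, v -> w) }
      .groupMapReduce(_._1)(_._2)(_ + _)

  // Newman's weighted modularity for a node -> community assignment.
  def modularity(edges: Edges, community: Map[String, Int]): Double = {
    val m   = edges.values.sum                              // total edge weight
    val deg = degrees(edges)
    val internal = edges.collect {
      case ((u, v), w) if community(u) == community(v) => w
    }.sum
    val expected = community.keys
      .groupBy(community)                                   // nodes per community
      .values
      .map(ns => math.pow(ns.map(deg.getOrElse(_, 0.0)).sum, 2))
      .sum / (4 * m * m)
    internal / m - expected
  }

  // Detach a vertex into a singleton whenever doing so costs < epsilon modularity.
  def prune(edges: Edges, community: Map[String, Int], epsilon: Double): Map[String, Int] = {
    val fresh = Iterator.from(community.values.max + 1)     // unused community ids
    community.keys.foldLeft(community) { (assign, node) =>
      val detached = assign.updated(node, fresh.next())
      if (modularity(edges, assign) - modularity(edges, detached) < epsilon) detached
      else assign
    }
  }
}
```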