You've found your unicorn! An applied math, statistics, computer science trifecta. I've spent the last twenty years working on all sorts of data and applied science problems, building frameworks that deliver cogent and actionable insights.
Before we dive in, a quick note on this website. It's designed to deliver an adaptive granularity experience; that is, you select the level of detail.
I reviewed the team's existing code and data pipelines and worked with the principal investigators to identify technical debt and stabilize infrastructure.
My technical work focused on improvements to the Stream ID product (community detection). More generally, I tried to socialize data science best practices and build a more data-driven culture.
I provided data science support to the Xbox Cloud Gaming team.
I developed novel statistical algorithms: identifying correlated events in log data; forecasting and alerting for resource-related metrics.
I evangelized in-house A/B testing for partner teams across Microsoft.
I investigated novel statistical and ML models for classifying customer support issues and provided general statistical support to Office 365 business partners.
I provided client-facing statistical support and data science expertise across a variety of problem domains.
I identified valuations of poor quality and applied post hoc corrections. I also worked to identify algorithmic instabilities; built prototypes featuring regularized, interpretable models with spatiotemporal priors; and suggested improvements to existing methodologies.
I built statistical models to improve up- and cross-selling of mobile add-on packages.
I continued to provide solutions for numerical stability issues arising in the multi-factor backward lattice algorithm.
I supported my PhD studies with teaching and research.
I worked on numerical codes for pricing exotic financial derivatives.
I was a graduate teaching assistant: college algebra, calculus, introductory statistics courses, and numerical linear algebra.
I implemented backscatter models and tracking algorithms for RADAR applications.
I set up a Ghost blog / portfolio for featured articles that I've written. It runs as a systemd service in a Docker container, and there's "one-click" tooling to publish from both Jupyter and Markdown formats.
Metayer is an R package that addresses a few of the common pain points associated with evolving an R script / one-off analysis into a proper, productionalized, well documented, data science deliverable.
I built an alternative to DVC that leveraged upstream configurations and abstracted access to incremental, upstream results. This provided a machinery to write cleaner, stage-focused, client code.
The team had adopted DVC as a data versioning tool but wasn't able to use it effectively. For example, it wasn't possible to have results simultaneously available for comparison across multiple runtime configurations.
I built an R package server (an artifactory) so we could fix development environments and replicate them across the team. This served precompiled binaries, so it also provided efficiencies over regularly recompiling source libraries in CI/CD.
Onboarding -- in particular, setting up a development machine -- had been painful. During the process, it quickly became evident that maintaining a consistently versioned compute environment hadn't been a concern. This was true despite a difficult, deprecation-related refactoring that had taken place prior to my arrival. One place where this solution had the potential to improve performance was during a git push and the subsequent git actions: the existing process required that all the R package dependencies be rebuilt, and it would routinely take fifteen minutes to check in code.
I built a framework for encapsulating directory-organized code and used chained R environments to provide module-level polymorphism.
The source code had evolved without discipline. It had been almost a year since the last merge to main, and multiple files, sharing perturbations of replicated or discarded logic, proliferated in the codebase. The prevailing (anti)pattern was to source one variant or another, often haphazardly, into a cascade of R scripts.
I built a robust data ingestion tool for tables available through the POLIS API.
The World Health Organization ("WHO") provides polio data through their POLIS endpoints. The availability of data is lagged, and the historical record is more complete than what would be available at the time of a forecast. The WHO uses a record update strategy rather than an append. Thus, records are not immutable. For backtesting purposes, this means that the end user must take responsibility for maintaining an accurate historical record. Another difficulty with the POLIS dataset is that it is throttled, and data retrieval requires multiple calls to a finicky endpoint delivering records at a mere trickle, only 2000 per call.
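The two difficulties described above (a 2000-record page limit and records that are updated in place) suggest paging through the endpoint and keeping a dated snapshot of every record version. What follows is a minimal sketch of that idea, not the ingestion tool itself. The `fetch_page(offset, limit)` callable and the `Id` key field are hypothetical stand-ins and are not taken from the POLIS API.

```python
from datetime import date
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple

PAGE_SIZE = 2000  # per-call record limit mentioned in the text

Record = Dict[str, object]
History = Dict[Tuple[object, date], Record]


def fetch_all(fetch_page: Callable[[int, int], List[Record]],
              page_size: int = PAGE_SIZE) -> Iterator[Record]:
    """Page through a throttled endpoint until a short (or empty) page arrives.

    `fetch_page` is a hypothetical callable: (offset, limit) -> list of records.
    """
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        if len(page) < page_size:
            return
        offset += page_size


def snapshot(history: History, records: Iterable[Record],
             as_of: date, key: str = "Id") -> None:
    """Store each record under (key, retrieval date) so that later upstream
    updates never overwrite what was known on an earlier date."""
    for rec in records:
        history[(rec[key], as_of)] = dict(rec)


def as_known_on(history: History, as_of: date, key_value: object) -> Optional[Record]:
    """Return the most recent version of a record retrieved on or before `as_of`."""
    dates = [d for (k, d) in history if k == key_value and d <= as_of]
    return history[(key_value, max(dates))] if dates else None
```

For a backtest, `as_known_on` would be queried with the forecast date so that the model only sees the record versions that had been retrieved by then.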
I built a web visualization tool--a circular Sankey diagram--to drive a discussion with Product about the benefits of leveraging customer domain knowledge.
In 2022, Conviva made an effort to extend their core business beyond streaming video. The goal was to provide instrumentation as a service. Now, any platform would be able to monitor user state by leveraging Conviva's reporting layer in their software stack. I provided data science support for this effort, including developing interactive tools for visualizing state transitions in arbitrary state spaces.
I built a reproducible research framework that cached incremental results in an effort to improve data science collaboration and reduce compute costs.
Reproducibility and data versioning became elevated concerns when Data Science was unable to verify the correctness of production metrics. I built a Scala/Databricks library that enabled caching of incremental results. This decomposed the monolithic production pipeline into smaller stages and allowed other data science users to collaborate from a consistent, shared starting point.
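The idea of caching incremental results can be illustrated with a small sketch. This is not the Scala/Databricks library; it is a Python toy that assumes results are picklable and that a stage is identified by its name, its configuration, and the keys of its upstream stages. The names `stage_key` and `run_stage` are hypothetical.

```python
import hashlib
import json
import pickle
from pathlib import Path
from typing import Any, Callable, Sequence, Tuple


def stage_key(name: str, config: dict, upstream_keys: Sequence[str] = ()) -> str:
    """Hash the stage name, its config, and the keys of its upstream stages."""
    payload = json.dumps(
        {"stage": name, "config": config, "upstream": list(upstream_keys)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:16]


def run_stage(cache_dir: Path, name: str, config: dict,
              fn: Callable[..., Any],
              upstream: Sequence[Tuple[str, Any]] = ()) -> Tuple[str, Any]:
    """Return (key, result) for a stage, computing it only on a cache miss.

    `upstream` holds the (key, result) pairs returned by earlier stages;
    their results are passed to `fn` after the config.
    """
    key = stage_key(name, config, [k for k, _ in upstream])
    path = cache_dir / f"{name}-{key}.pkl"
    if path.exists():
        return key, pickle.loads(path.read_bytes())
    result = fn(config, *[r for _, r in upstream])
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(pickle.dumps(result))
    return key, result
```

Because the key depends on the configuration, results for several configurations can sit side by side in the cache, and a collaborator who reruns an unchanged stage reads the stored result instead of recomputing it.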
I spearheaded the technical work that extended the Stream ID product. This was showcased in an RFP that would have otherwise been outside the scope of our existing product.
Conviva wanted to participate in an RFP but the rigidity of the existing, monolithic pipeline made it difficult. In particular, the project required exploratory data analysis, redesigning and generalizing the ingestion portion of the existing pipeline, and implementing a scalable, map-reduce variant of the Louvain community detection algorithm in Scala.
I improved a family of existing metrics, and introduced some new ones, used to compare Stream ID households against a third-party dataset.
Historically, Conviva used third-party data to assess the correctness of the household assignments generated by its Stream ID product. Due to missing data, this entailed a problematic matching problem. I reviewed the existing assessment metrics and offered improvements.
I built an event-based model to simulate ground truth for the household identification problem.
Conviva's Stream ID product is tasked with solving a community detection problem. However, the clustering context is non-standard. In particular, the graph used to induce the clustering has two distinct types of nodes: devices and IP addresses. Moreover, the labels associated with the underlying entities of interest are subject to change without notice. There was no ground truth in the production data, so I built a generative model that produced synthetic data.
I reviewed the existing estimates for xCloud GA resource requirements, built a model that suggested they were too high, and made recommendations for significant reductions.
In early 2020, xCloud was preparing for a GA launch. There was an interest in understanding how beta testers and early adopters were using the system.
I did a retrospective analysis to determine whether introducing the xCloud platform into a user's choice set changed existing behavior with respect to the Xbox console.
Before the GA release, access to the xCloud platform was by invitation only. In all cases, participants were existing Xbox console users. For this cohort of active gamers, one question was whether access to xCloud impacted their usage of other Xbox platforms. If so, an estimate of the effect size was also of interest.
I have intermittently worked on some small-scale Python utility projects.
I built a publishing pipeline / platform to host my CV and portfolio.
I imported non-profit IRS tax returns into ElasticSearch and built a website to search for local charities.
I developed a Python package that implemented a multivariate Kalman Filter.
The backend monitors the Twitter stream and maintains a dynamic list of trending hashtags and, for each hashtag, a random sample of relevant tweets. The front end shows the world what's currently popular on Twitter.
I used a missing data property of Kalman filters to kick off a noise-or-not detector that enabled more sensible alerting in erratic long-tail scenarios.
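One well-known property of Kalman filters is that a missing observation can be handled by running the prediction step and skipping the measurement update, so the state estimate carries forward while its variance grows. The sketch below illustrates only that property, assuming a scalar local-level model with hypothetical noise parameters `q` and `r`. The noise-or-not detector itself is not reproduced here.

```python
from typing import List, Optional, Sequence, Tuple


def local_level_filter(ys: Sequence[Optional[float]], q: float, r: float,
                       m0: float = 0.0, p0: float = 1.0) -> List[Tuple[float, float]]:
    """Scalar Kalman filter for x_t = x_{t-1} + w_t, y_t = x_t + v_t.

    q and r are the process and observation noise variances. An observation
    of None is treated as missing: only the prediction step is applied.
    Returns the filtered (mean, variance) after each time step.
    """
    m, p = m0, p0
    out = []
    for y in ys:
        p = p + q                  # predict: random-walk state, variance grows
        if y is not None:
            k = p / (p + r)        # Kalman gain
            m = m + k * (y - m)    # update the mean toward the observation
            p = (1.0 - k) * p      # update shrinks the variance
        out.append((m, p))
    return out
```

With input `[1.0, None, None]`, the mean is unchanged across the two missing steps while the variance increases by `q` at each one.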
ServiceNow wanted to monitor noisy network resource metrics and to do so without generating spurious alerts.
I developed a statistical algorithm to surface event pairings that had highly correlated arrival times. I tested the algorithm on simulated data from a generative model based on branching processes.
One of ServiceNow's larger customers wanted to know if we could analyze correlated event data and supplied us with a test dataset.
I developed a generic, big data summary tool for columnar data.
In my interactions with partner teams, computing simple summary metrics was routine. However, computing any statistics more complicated than means and variances was rarely attempted.
I developed overall evaluation criteria and helped orchestrate first experiments with Bing partner teams.
In the beginning of 2015, the data scientists on the Bing Experimentation team were loaned out to partner teams to help them prepare their product workflows for experimentation.
I built a stopgap webtool to ensure that partner teams would be able to launch their first experiments without delay.
The vision was that Bing could help Microsoft product teams adopt a culture of controlled experimentation; that the process need not be reinvented but could be outsourced to an existing experimentation platform. We approached a handful of partner teams, offering our collective support and expertise. We asked only that they commit to running at least one experiment. Of course, first experiments are a lot of work, and it took months to modernize existing engineering workflows and cultivate positive momentum with the stakeholders. Unfortunately, on our side, the engineers' delivery timeline slipped. The self-service, programmatic access to the experimentation platform that had been promised wasn't going to be ready for another six months. We wanted to maintain the momentum that we'd developed with our partner teams, so a coworker and I built a bare-bones web service as a stopgap to buy our engineering team more time.
I was asked to build a forecast model with insufficient data and used it as a teachable moment to drive changes in how the organization managed and communicated changes in their data pipelines.
In late 2014, the Office 365 Customer Intelligence Team wanted to understand their growth trajectory but faced issues with low data quality.
I introduced basic statistical ideas to business leaders, and this helped reduce managerial randomization across the org.
In 2014, Office 365 had just launched, and the Office 365 Customer Intelligence Team needed data science support to help answer their business questions. Top priority: costs associated with customer support tickets appeared to be out of control.
I proposed a robust analysis plan for characterizing support tickets and subsequently scaled it back to accommodate a changing timeline.
In 2014, the focus of the Office 365 Customer Intelligence Team was triaging customer support tickets, specifically runaway costs.
I built an app to track location and collect daily commute data with the intention of helping people find a regular carpool.
I designed and built the Inferentialist website.
Using Lending Club data, I built ML-optimized portfolios and showed improved performance relative to portfolios based on predetermined loan grades.
In 2012, Lending Club was a relatively new, and fast-growing, peer-to-peer lending platform. Using historical data provided by the company, our paper described a method for constructing optimal portfolios of Lending Club loans.
I developed a cross-validated, coefficient-of-variation metric to assess the risk of temporal instability in a home's Zestimate history. This indicated that Zestimates with non-physical behavior were far more prevalent than previously thought.
The regional and subregional ZHVI performed poorly due to small samples. I developed a performant alternative that estimated regularized discount curves from longitudinal, repeat sale data.
In 2011, Zillow published a proprietary home value index--the ZHVI--a then-competitor to the Case-Shiller home price index.
I was tasked with identifying and adjusting "spiky" Zestimate behavior in a collection of 100 million Zestimate histories. This resulted in post hoc corrections to nearly 4 million time-series.
I provided statistical support to the implementation team.
I developed a multithreaded code to propagate probability vectors through a phylogenetic tree. This allowed our research team to make inferences on branch length and, consequently, to develop timelines for genome divergence.
I implemented a PDE solver to price Asian and Lookback options with discrete observation dates.
I reverse engineered a multi-factor, backward-lattice pricing algorithm in order to diagnose and fix numerical instabilities.
I developed new non-linear optimization solvers for calibrating BGM Libor interest-rate models to market data.
I maintain several Ubuntu systems and needed a simple bash script that would back up / mirror these machines. Google pointed me to rsync. This blog post describes what I did with it.
A gist, in Python, that uses asyncio with named sockets and illustrates a fork and monitor pattern. It's used here for monitoring heartbeats but could easily be adapted for other process health metrics.
This post follows Golub and Van Loan, introducing Householder reflections and Givens rotations, then using these tools to sketch out implementations of QR, Hessenberg, and Schur decompositions.
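As a small illustration of one of these tools, here is a sketch of a QR factorization built from Givens rotations. It is written in pure Python on nested lists for readability and is not the implementation from the post. The function names `givens` and `qr_givens` are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def givens(a: float, b: float) -> Tuple[float, float]:
    """Return (c, s) such that [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0  # nothing to rotate
    return a / r, b / r


def qr_givens(A: Matrix) -> Tuple[Matrix, Matrix]:
    """Factor an m-by-n matrix as A = Q R with Q orthogonal (m-by-m)
    and R upper triangular (m-by-n), using Givens rotations."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for j in range(n):
        # Zero the subdiagonal of column j from the bottom up.
        for i in range(m - 1, j, -1):
            c, s = givens(R[i - 1][j], R[i][j])
            # Apply the rotation G to rows i-1 and i of R.
            for k in range(n):
                top, bot = R[i - 1][k], R[i][k]
                R[i - 1][k] = c * top + s * bot
                R[i][k] = -s * top + c * bot
            # Apply G^T on the right of Q so that Q R stays equal to A.
            for k in range(m):
                left, right = Q[k][i - 1], Q[k][i]
                Q[k][i - 1] = c * left + s * right
                Q[k][i] = -s * left + c * right
    return Q, R
```

Each rotation touches only two rows, which is why Givens rotations are a natural fit for matrices that are already nearly triangular, such as Hessenberg matrices.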
The post describes a homogeneous Poisson process using a Gamma conjugate prior that can be used to estimate a pooled, per-subject intensity given a collection of realizations.
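The conjugate update behind this kind of model is short enough to sketch. Assuming a Gamma prior in the shape/rate parameterization and one (event count, observation time) pair per subject, the posterior is again a Gamma distribution. The function names and the example numbers below are illustrative and are not taken from the post.

```python
from typing import Sequence, Tuple


def gamma_poisson_update(alpha: float, beta: float,
                         counts: Sequence[int],
                         exposures: Sequence[float]) -> Tuple[float, float]:
    """Posterior (shape, rate) for a Poisson intensity with a Gamma(alpha, beta) prior.

    counts[i] is the number of events observed for subject i over a window
    of length exposures[i]. The likelihood is proportional to
    lambda**sum(counts) * exp(-lambda * sum(exposures)), so the posterior is
    Gamma(alpha + sum(counts), beta + sum(exposures)).
    """
    if len(counts) != len(exposures):
        raise ValueError("counts and exposures must have the same length")
    return alpha + sum(counts), beta + sum(exposures)


def posterior_mean(alpha: float, beta: float) -> float:
    """Mean of a Gamma(shape=alpha, rate=beta) distribution."""
    return alpha / beta


# Three subjects observed for 1, 2, and 1 time units respectively.
a, b = gamma_poisson_update(2.0, 1.0, [3, 5, 4], [1.0, 2.0, 1.0])
pooled_intensity = posterior_mean(a, b)  # events per unit time, per subject
```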
A derivation of the density functions and likelihood expression associated with doubly and randomly censored data.
I needed to merge the glyphs in two TrueType font files. FontForge, in particular its Python extension, was the tool for the job.
This post elucidates the connection between the generalized inverse, the cdf, the quantile function, and the uniform distribution.
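The practical consequence of that connection is inverse transform sampling: if U is uniform on (0, 1) and F^- is the generalized inverse of a cdf F, then F^-(U) has cdf F. Here is a minimal sketch, assuming an exponential distribution for the continuous case and a small hand-written table for the discrete case. The function names are hypothetical.

```python
import bisect
import math
import random
from typing import List, Sequence


def exp_quantile(u: float, rate: float) -> float:
    """Quantile function of the Exponential(rate) distribution, for 0 <= u < 1."""
    return -math.log1p(-u) / rate


def discrete_quantile(values: Sequence[float], cdf: Sequence[float], u: float) -> float:
    """Generalized inverse for a discrete distribution: the smallest value x
    with F(x) >= u. `cdf` must be nondecreasing and end at exactly 1.0."""
    return values[bisect.bisect_left(cdf, u)]


def sample_exponential(n: int, rate: float, rng: random.Random) -> List[float]:
    """Draw n Exponential(rate) variates by pushing uniforms through the quantile."""
    return [exp_quantile(rng.random(), rate) for _ in range(n)]
```

In the discrete case the cdf is a step function with no ordinary inverse, which is exactly where the generalized inverse is needed.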
This post describes and implements an adaptive rejection sampler for log-concave densities.
This post shows how to augment the Namecheap ddclient script to support multiple hosts on a dynamic IP.
This paper constructs a model for shared resource utilization, determines stochastic bounds for resource exhaustion, and simulates results.