You've found your unicorn! An applied math, statistics, computer science trifecta. I've spent the last twenty years working on all sorts of data and applied science problems, building frameworks that deliver cogent and actionable insights.
Before we dive in, a quick note on this website. It's designed to deliver an adaptive granularity experience; that is, you select the level of detail.
I reviewed the team's existing code and data pipelines and worked with the principal investigators to identify technical debt and stabilize infrastructure.
My technical work focused on improvements to the Stream ID product (community detection). More generally, I tried to socialize data science best practices and build a more data-driven culture.
I provided data science support to the Xbox Cloud Gaming team.
I developed novel statistical algorithms for identifying correlated events in log data and for forecasting and alerting on resource-related metrics.
I evangelized in-house A/B testing for partner teams across Microsoft.
I investigated novel statistical and ML models for classifying customer support issues and provided general statistical support to Office 365 business partners.
I provided client-facing statistical support and data science expertise across a variety of problem domains.
I identified valuations of poor quality and applied post hoc corrections. I also worked to identify algorithmic instabilities; built prototypes featuring regularized, interpretable models with spatiotemporal priors; and suggested improvements to existing methodologies.
I built statistical models to improve up- and cross-selling of mobile add-on packages.
I continued to provide solutions for numerical stability issues arising in the multi-factor backward lattice algorithm.
I supported my PhD studies with teaching and research.
I worked on numerical codes for pricing exotic financial derivatives.
I was a graduate teaching assistant for college algebra, calculus, introductory statistics, and numerical linear algebra.
I implemented backscatter models and tracking algorithms for RADAR applications.
I set up a Ghost blog / portfolio for featured articles that I've written. It runs as a systemd service in a Docker container, and there's "one-click" tooling to publish from both Jupyter and Markdown formats.
Metayer is an R package that addresses a few of the common pain points associated with evolving an R script / one-off analysis into a proper, production-ready, well-documented data science deliverable.
I built an alternative to DVC that leveraged upstream configurations and abstracted access to incremental, upstream results. This provided the machinery to write cleaner, stage-focused client code.
I built an R package server (an artifactory) so we could fix development environments and replicate them across the team. This served precompiled binaries, so it also provided efficiencies over regularly recompiling source libraries in CI/CD.
I built a framework for encapsulating directory-organized code and used chained R environments to provide module-level polymorphism.
I built a robust data ingestion tool for tables available through the POLIS API.
I built a web visualization tool--a circular Sankey diagram--to drive a discussion with Product about the benefits of leveraging customer domain knowledge.
I built a reproducible research framework that cached incremental results in an effort to improve data science collaboration and reduce compute costs.
I spearheaded the technical work that extended the Stream ID product. This was showcased in an RFP that would have otherwise been outside the scope of our existing product.
I improved a family of existing metrics, and introduced some new ones, used to compare Stream ID households against a third party dataset.
I built an event-based model to simulate ground truth for the household identification problem.
I reviewed the existing estimates for xCloud GA resource requirements, built a model that suggested they were too high, and made recommendations for significant reductions.
I did a retrospective analysis to determine whether introducing the xCloud platform into a user's choice set changed existing behavior with respect to the Xbox console.
A collection of posts that fall under the umbrella of textbook annotations.
I have intermittently worked on some small-scale Python utility projects.
I built a publishing pipeline / platform to host my CV and portfolio.
I imported non-profit IRS tax returns into ElasticSearch and built a website to search for local charities.
I developed a Python package that implemented a multivariate Kalman filter.
The backend monitors the Twitter stream and maintains a dynamic list of trending hashtags and, for each hashtag, a random sample of relevant tweets. The front end shows the world what's currently popular on Twitter.
I used a missing data property of Kalman filters to kick off a noise-or-not detector that enabled more sensible alerting in erratic long-tail scenarios.
I developed a statistical algorithm to surface event pairings that had highly correlated arrival times. I tested the algorithm on simulated data from a generative model based on branching processes.
I developed a generic, big data summary tool for columnar data.
I developed overall evaluation criteria and helped orchestrate first experiments with Bing partner teams.
I built a stopgap web tool to ensure that partner teams would be able to launch their first experiments without delay.
I was asked to build a forecast model with insufficient data and used it as a teachable moment to drive changes in how the organization managed and communicated changes in their data pipelines.
I introduced basic statistical ideas to business leaders, and this helped reduce managerial randomization across the org.
I proposed a robust analysis plan for characterizing support tickets and subsequently scaled it back to accommodate a changing timeline.
I built an app to track location and collect daily commute data with the intention of helping people find a regular carpool.
I designed and built the Inferentialist website.
Using Lending Club data, I built ML-optimized portfolios and showed improved performance relative to portfolios based on predetermined loan grades.
I developed a cross-validated, coefficient-of-variation metric to assess the risk of temporal instability in a home's Zestimate history. This indicated that Zestimates with non-physical behavior were far more prevalent than previously thought.
The regional and subregional ZHVI performed poorly due to small samples. I developed a performant alternative that estimated regularized discount curves from longitudinal, repeat sale data.
I was tasked with identifying and adjusting "spiky" Zestimate behavior in a collection of 100 million Zestimate histories. This resulted in post hoc corrections to nearly 4 million time series.
I provided statistical support to the implementation team.
I developed multithreaded code to propagate probability vectors through a phylogenetic tree. This allowed our research team to make inferences on branch length and, consequently, to develop timelines for genome divergence.
I implemented a PDE solver to price Asian and Lookback options with discrete observation dates.
I reverse engineered a multi-factor, backward-lattice pricing algorithm in order to diagnose and fix numerical instabilities.
I developed new non-linear optimization solvers for calibrating BGM Libor interest-rate models to market data.
I maintain several Ubuntu systems and needed a simple bash script that would back up / mirror these machines. Google pointed me to rsync. This blog post describes what I did with it.
A gist, in Python, that uses asyncio with named sockets and illustrates a fork-and-monitor pattern. It's used here for monitoring heartbeats but could easily be adapted for other process health metrics.
This is a short piece of code that spawns a child process that handles requests from a named socket. This could be useful for, say, monitoring heartbeats or other process health metrics.
To keep the parent process simple, there is no IPC: the filesystem is used for communication. In particular, the parent need only call a send_heartbeat function at its convenience. The monitor is lazy: it only computes the time since the last heartbeat on request. When the parent terminates, the monitor does too.
This code sets up a named socket in the filesystem, enabling control from the shell. This is useful when debugging. For example,
echo "hello" | socat - UNIX-CLIENT:monitor.socket
echo -n "" | socat - UNIX-CLIENT:monitor.socket
socat - UNIX-CLIENT:monitor.socket
asyncio.open_unix_connection can be a bit fussy with being handed a socket. In particular, it expects an already accepted socket on which it could block indefinitely if, say, the client connects and does nothing. So, we provide a safe_unix_connection async context manager to make sure it doesn't get stuck, and that the writer is closed appropriately when a connection is terminated.
Application logic is loosely encapsulated at the end of the script. It should feel similar to the callback function passed to asyncio.start_server.
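The following is a minimal sketch of the lazy heartbeat idea described above, not the gist itself. It makes several assumptions: the heartbeat is stored as a file's modification time, the monitor runs in the same process rather than a forked child, and a plain timeout stands in for the safe_unix_connection context manager. Apart from send_heartbeat, every name here (seconds_since_heartbeat, handle_client, demo) is hypothetical.

```python
import asyncio
import os
import tempfile
import time


def send_heartbeat(path):
    """Parent side: record a heartbeat by touching a file."""
    with open(path, "a"):
        os.utime(path, None)  # set the modification time to "now"


def seconds_since_heartbeat(path, now=None):
    """Monitor side: computed lazily, only when a client asks."""
    now = time.time() if now is None else now
    return now - os.path.getmtime(path)


async def handle_client(reader, writer, heartbeat_path, timeout=1.0):
    """Answer one request; a silent client is dropped after `timeout` seconds."""
    try:
        await asyncio.wait_for(reader.readline(), timeout)
        age = seconds_since_heartbeat(heartbeat_path)
        writer.write(f"{age:.3f}\n".encode())
        await writer.drain()
    except asyncio.TimeoutError:
        pass  # the client connected and then said nothing
    finally:
        writer.close()
        await writer.wait_closed()


async def demo():
    """Start the monitor on a named socket and query it once."""
    with tempfile.TemporaryDirectory() as tmp:
        hb = os.path.join(tmp, "heartbeat")
        sock = os.path.join(tmp, "monitor.socket")
        send_heartbeat(hb)

        async def on_connect(reader, writer):
            await handle_client(reader, writer, hb)

        server = await asyncio.start_unix_server(on_connect, path=sock)
        async with server:
            reader, writer = await asyncio.open_unix_connection(sock)
            writer.write(b"status\n")
            await writer.drain()
            reply = await reader.readline()
            writer.close()
            await writer.wait_closed()
        return float(reply.decode())


age = asyncio.run(demo())
print(f"seconds since last heartbeat: {age:.3f}")
```

While the server is running, the same socket could be queried from the shell with socat, in the manner of the commands shown earlier.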
This post follows Golub and Van Loan, introducing Householder reflections and Givens rotations, then using these tools to sketch out implementations of QR, Hessenberg, and Schur decompositions.
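As a small illustration of one of these building blocks, here is a sketch of a Givens rotation in plain Python. It assumes the convention in which the transpose of [[c, s], [-s, c]] maps (a, b) to (r, 0). The function names are hypothetical, and math.hypot is used for brevity in place of the hand-scaled formula usually given in textbooks.

```python
import math


def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]]^T applied to (a, b) is (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0  # nothing to zero out
    r = math.hypot(a, b)  # overflow-safe sqrt(a*a + b*b)
    return a / r, -b / r


def apply_givens_transpose(c, s, x, y):
    """Apply [[c, s], [-s, c]]^T to the pair (x, y)."""
    return c * x - s * y, s * x + c * y


c, s = givens(3.0, 4.0)
r, zero = apply_givens_transpose(c, s, 3.0, 4.0)
print(r, zero)  # r is the norm of (3, 4); the second entry is zeroed, up to rounding
```

Applying the same (c, s) pair to two rows of a matrix zeroes one chosen entry at a time, which is the step a Givens-based QR factorization repeats.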
The post describes a homogeneous Poisson process using a Gamma conjugate prior that can be used to estimate a pooled, per-subject intensity given a collection of realizations.
A homogeneous Poisson process is the simplest way to describe events that arrive in time. Here, we are interested in a collection of realizations. An example is user transactions in a system. Over time, we expect each user to produce a sequence of transaction events, and we would like to characterize the rate of these events on a per-user basis. In particular, users with more data should expect a more personalized characterization. Statistically, this can be accomplished using a Bayesian framework.
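The conjugate update behind this idea can be sketched in a few lines. Assume a user's rate has a Gamma(alpha, beta) prior (shape and rate parameterization) and that the user produces n events over an observation window of length t. The posterior is then Gamma(alpha + n, beta + t). The prior values below are hypothetical and fixed by hand; estimating them from the pooled data is outside this sketch.

```python
def posterior_rate(n_events, exposure, alpha=2.0, beta=1.0):
    """Gamma-Poisson update: return the posterior (shape, rate, mean)."""
    shape = alpha + n_events
    rate = beta + exposure
    return shape, rate, shape / rate


# Two users with the same observed rate (1 event per unit time)
# but very different amounts of data. The prior mean is alpha / beta = 2.
users = {"light": (1, 1.0), "heavy": (100, 100.0)}
for name, (n, t) in users.items():
    shape, rate, mean = posterior_rate(n, t)
    print(f"{name}: posterior mean rate = {mean:.3f}")
```

The light user's estimate sits halfway between the prior mean and the observed rate, while the heavy user's estimate is almost entirely determined by that user's own data.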
A derivation of the density functions and likelihood expression associated with doubly and randomly censored data.
Censored data is an artifact of partial or incomplete measurements.
A typical scenario would be a survival analysis of time-to-event data. For example, a study may end before a final measurement is available (right censoring). Another situation might occur when batch processing log file data: the reported timestamp might reflect the time of processing and not the true event time (left censoring).
This post derives the density equations for censored data. Given a parameterization θ, this leads naturally to a log likelihood formulation. As the censoring mechanism is, in general, random, we further allow for the possibility that this too depends on θ.
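To make the likelihood idea concrete, here is a deliberately simplified sketch: an exponential model with rate θ in which the censoring mechanism is assumed not to depend on θ. That is a narrower setting than the one the post derives. An exact observation contributes log f(t), a right-censored one contributes log S(t), and a left-censored one contributes log F(t). The function name and the data are hypothetical.

```python
import math


def exp_loglik(theta, observations):
    """Log likelihood for exponential(rate=theta) data with censoring.

    observations: iterable of (time, kind), kind in {"exact", "right", "left"}.
    """
    total = 0.0
    for t, kind in observations:
        if kind == "exact":
            total += math.log(theta) - theta * t        # log f(t)
        elif kind == "right":
            total += -theta * t                         # log S(t) = log P(T > t)
        elif kind == "left":
            total += math.log1p(-math.exp(-theta * t))  # log F(t) = log P(T <= t)
        else:
            raise ValueError(f"unknown observation kind: {kind}")
    return total


# Two observed events and one subject still event-free at time 3.
data = [(1.0, "exact"), (2.0, "exact"), (3.0, "right")]

# With only exact and right-censored data the maximizer has a closed form:
# (number of observed events) / (total time at risk).
theta_hat = 2 / (1.0 + 2.0 + 3.0)
print(theta_hat, exp_loglik(theta_hat, data))
```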
I needed to merge the glyphs in two TrueType font files. FontForge, in particular its python extension, was the tool for the job.
This post elucidates the connection between the generalized inverse, the CDF, the quantile function, and the uniform distribution.
The probability integral transform is a fundamental concept in statistics that connects the cumulative distribution function, the quantile function, and the uniform distribution. We motivate the need for a generalized inverse of the CDF and prove the result in this context.
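A discrete distribution shows why a generalized inverse is needed: its CDF is a step function with no ordinary inverse. The sketch below defines the quantile function as the smallest x with F(x) >= u and uses it to turn uniform draws into samples. The support, the probabilities, and the name make_quantile are hypothetical choices for illustration.

```python
import bisect
import random
from itertools import accumulate


def make_quantile(support, probs):
    """Build the generalized inverse of a discrete CDF on a sorted support."""
    cdf = list(accumulate(probs))

    def quantile(u):
        # Index of the first support point whose CDF value is >= u.
        i = bisect.bisect_left(cdf, u)
        return support[min(i, len(support) - 1)]  # guard against rounding in cdf[-1]

    return quantile


quantile = make_quantile(support=[10, 20, 30], probs=[0.25, 0.5, 0.25])

# Feeding uniform draws through the quantile function samples the distribution.
rng = random.Random(0)
draws = [quantile(rng.random()) for _ in range(10_000)]
freq_20 = draws.count(20) / len(draws)
print(f"empirical P(X = 20) is about {freq_20:.3f}")
```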
This post describes and implements an adaptive rejection sampler for log-concave densities.
Adaptive rejection sampling is a statistical algorithm for generating samples from a univariate, log-concave density. Because of the adaptive nature of the algorithm, rejection rates are often very low. The exposition of this algorithm follows the example given in Davison’s 2008 text, “Statistical Models.”
This post shows how to augment the Namecheap ddclient script to support multiple hosts on a dynamic IP.
In 2015, I went looking for a solution to the following problem: I have a single Linux server with a dynamically assigned IP address and I want to host several sites on this server. My registrar is Namecheap.com, and their advice is to use a Linux tool called ddclient.
Unfortunately, the example available from Namecheap doesn't cover multiple hosts. A Google search pointed me to thornelabs.net, where the author describes a patch that can be applied to ddclient. Ddclient is written in Perl, so patching is a possibility, but one that feels a bit unsatisfactory.
This paper constructs a model for shared resource utilization, determines stochastic bounds for resource exhaustion, and simulates results.
A friend at a large, Seattle-area company recently approached me with the following problem. Suppose we wanted to oversubscribe the shared resources that we lease to our customers. We've noticed that loads are often quite low. In fact, loads are so low that there must be a way to allocate at least some of that unused capacity without generating too much risk of resource exhaustion. If we could manage to do this, we could provide service to more people at a lower cost! Sure, they might get dropped service on rare occasions, but anyone who wasn't satisfied with a soft guarantee could still pay a premium and have the full dedicated resource slice to which they may have become accustomed. This seemed like a tractable problem.
Here, we propose a mathematical framework for solving a very simple version of the problem described above. It provides intuitive tuning parameters that allow for business level calibration of risks and the corresponding reliability of the accompanying service guarantees.
After developing the mathematical framework, we put it to work in a simulation context of customer usage behavior. In this experiment, most customers use only a fraction of the resource purchased, but there is a non-negligible group of “power” users that consume almost all of what they request. The results are rather striking. Compared to the dedicated slice paradigm, resource utilization in the oversubscribed case increases by a factor of 2.5, and more than twice as many customers can be served by the same, original resource pool.
The methodology is easily extended to the non-IID case by standard modifications to the sampling scheme. Moreover, even better performance is likely if a customer segmentation scheme is incorporated into the underlying stochastic assignment problem.
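The flavor of the experiment can be conveyed with a toy Monte Carlo sketch. This is not the paper's framework. It assumes IID customers who each purchase one unit, a hypothetical usage mix (10% "power" users consuming 90% to 100% of a unit, everyone else consuming 0% to 40%), and a brute-force estimate of the exhaustion probability in place of the stochastic bounds derived in the paper. All names and parameter values are illustrative, and the numbers it produces are not the paper's results.

```python
import random


def draw_usage(rng, p_power=0.1):
    """Fraction of one purchased unit that a customer actually uses."""
    if rng.random() < p_power:
        return rng.uniform(0.9, 1.0)  # "power" user
    return rng.uniform(0.0, 0.4)      # typical light user


def exhaustion_prob(n_customers, capacity, rng, n_trials=1000):
    """Monte Carlo estimate of P(total usage > capacity)."""
    over = 0
    for _ in range(n_trials):
        total = sum(draw_usage(rng) for _ in range(n_customers))
        if total > capacity:
            over += 1
    return over / n_trials


def max_customers(capacity, risk, rng):
    """Largest customer count whose estimated exhaustion risk stays within `risk`."""
    n = capacity  # dedicated slices: one unit per customer can never exhaust
    while exhaustion_prob(n + 1, capacity, rng) <= risk:
        n += 1
    return n


rng = random.Random(42)
capacity = 20
n = max_customers(capacity, risk=0.01, rng=rng)
print(f"{n} customers on {capacity} units (vs. {capacity} with dedicated slices)")
```

The risk argument plays the role of a business-level tuning parameter: loosening it admits more customers onto the same pool at the cost of more frequent exhaustion.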