Implement code, analyze data, communicate insights, repeat: the job of an Applied Scientist is about as cross-functional as it gets. In my experience, this has meant delivering interpretable statistical analyses, building reliable ETL pipelines, and, on occasion, even developing new algorithms; whatever it takes to pragmatically make and support well-informed business decisions.
Below, please enjoy my extended-form resume. If you'd prefer a version curated for your particular scenario, please reach out. My email is firstname.lastname@example.org.
Published Pages, Publishing Platform
Charity Search, Web Application
Multivariate Kalman Filter, Python Package
Twittalytics, Web Application
Carpool Project, Android Application
Interactive Spatial Heatmaps, Web Application
Consultant, Inferentialist LLC
May, 2012 - Present
Promote, design, and deliver controlled experiments (A/B tests) where possible.
Marketing analyses including engagement ladders.
Customer retention models and churn analysis.
Data integrity, data consistency, and data quality assessments.
Founded the company: developed the branding, built the website, contributed to the blog.
Senior Data Scientist, ServiceNow
Dec, 2016 - Oct, 2017
Developed a framework for simulating correlated events from (randomly generated) branching processes that included (randomly) censored event data.
Developed a novel event correlation algorithm which recovered 98 percent of correlated event pairs in the simulated datasets and generated new insights in unlabeled customer datasets.
Implemented a bivariate Kalman Filter with level, trend, and seasonality components that accommodated missing data; combined this with a novel, 'split-hypothesis' paradigm to detect anomalous jumps in the state space.
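The filtering core behind work like this can be sketched briefly. Below is a minimal local-level Kalman filter in Python that tolerates missing observations; the actual implementation described above was bivariate with trend and seasonality terms and a split-hypothesis jump detector, so the variable names and noise parameters here (q, r, the diffuse prior) are purely illustrative.

```python
# Minimal sketch of a local-level Kalman filter that tolerates missing data.
# Illustrative only: the production filter was bivariate with level, trend,
# and seasonality components; q and r below are arbitrary noise settings.

def kalman_filter(observations, q=0.01, r=1.0):
    """Filter a series; None entries are treated as missing observations."""
    level, p = 0.0, 1e6          # diffuse prior on the initial state
    filtered = []
    for y in observations:
        p += q                   # predict: state variance grows by process noise
        if y is not None:        # update only when an observation is present
            k = p / (p + r)      # Kalman gain
            level += k * (y - level)
            p *= (1 - k)
        filtered.append(level)
    return filtered

series = [10.0, 10.2, None, 10.1, 25.0, 10.3]   # toy data with a gap and a spike
print(kalman_filter(series))
```

When an observation is missing, the filter simply carries the prediction forward while its uncertainty grows, which is what makes gaps harmless rather than fatal.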
Researcher / Senior Data Scientist, Microsoft
May, 2014 - Oct, 2015
Extended existing CLI software and incorporated the new tooling into a web service. This allowed external partner teams to preview the ExP platform while waiting for a more formal migration.
Designed and built a generic, extensible data summary library in SCOPE/C#. The library was widely adopted, and the Avocado Team internalized it as a core part of a web-facing, 'deep-dive' toolset.
Worked cross-functionally with PMs and developers on the OneNote and Exchange Online teams to enable A/B experimentation on the ExP platform. Designed first-run experiments for these teams.
Implemented a change point detection algorithm for support ticket volumes which was used to identify unannounced deployments of new instrumentation. Corroborating these identifications allowed us to improve the documentation of, and communication around, the deployment process. It also allowed us to better clean the data, drastically reducing the variability of statistical estimates.
Identified classes of support tickets that were strongly correlated with early tenant lifecycle / onboarding issues.
Developed an interpretable forecast model for customer support tickets which informed budget allocations related to future staffing requirements.
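The change-point detection mentioned above can be illustrated with a standard technique. The specific algorithm used on the support-ticket volumes isn't described here, so the sketch below uses CUSUM, one common choice for detecting a shift in the mean of a count series; the target, threshold, and drift values are arbitrary.

```python
# Illustrative CUSUM change-point sketch for daily ticket counts.
# Assumption: the real algorithm is unspecified; CUSUM is a stand-in, and
# target/threshold/drift are toy values, not calibrated settings.

def cusum_changepoints(xs, target, threshold=5.0, drift=0.5):
    """Flag indices where cumulative deviation from `target` exceeds `threshold`."""
    s_pos = s_neg = 0.0
    alarms = []
    for i, x in enumerate(xs):
        s_pos = max(0.0, s_pos + (x - target) - drift)   # upward shifts
        s_neg = max(0.0, s_neg - (x - target) - drift)   # downward shifts
        if s_pos > threshold or s_neg > threshold:
            alarms.append(i)
            s_pos = s_neg = 0.0                          # restart after an alarm
    return alarms

volumes = [20, 22, 19, 21, 20, 35, 36, 34, 37, 35]       # mean jumps at index 5
print(cusum_changepoints(volumes, target=20.0))
```

An unannounced instrumentation deployment shows up as exactly this kind of sustained level shift, which is why a mean-shift detector is a natural fit for the problem.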
Quantitative Modeler, Zillow
Jun, 2011 - Feb, 2012
Advocated for interpretable home valuation models that incorporated spatial and temporal structures (in contrast to off-the-shelf random forests).
Developed a cross-validated, coefficient-of-variation metric to assess the risk of temporal instability in a home's Zestimate history. This indicated that Zestimates with non-physical behavior were far more prevalent than previously thought.
Developed an alternative to the ZHVI (Zillow's proprietary home value index) based on estimating discount curves from longitudinal, repeat sales. This improved the estimator's performance for small samples.
Identified, and removed (post hoc), 'spiky' Zestimate behavior in a collection of 100 million Zestimate histories. This resulted in 'corrections' to nearly 4 million time series.
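The coefficient-of-variation idea from the temporal-stability work above is simple to sketch. The production metric was cross-validated; the toy version below just computes the ratio of a history's standard deviation to its mean, with made-up example values and thresholds.

```python
# Illustrative coefficient-of-variation check for a valuation time series.
# Assumption: the real metric was cross-validated; the histories and the
# 0.05 / 0.20 thresholds below are invented for demonstration.

from statistics import mean, pstdev

def coefficient_of_variation(history):
    m = mean(history)
    return pstdev(history) / m if m else float("inf")

smooth = [300_000, 302_000, 305_000, 304_000]
spiky  = [300_000, 450_000, 290_000, 460_000]
print(coefficient_of_variation(smooth) < 0.05)   # stable history
print(coefficient_of_variation(spiky) > 0.20)    # flagged as non-physical
```

A home's true value cannot plausibly halve and double within months, so a high CV is a cheap, interpretable flag for non-physical estimate behavior.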
Machine Learning Intern, Globys
Oct, 2010 - Jun, 2011
Used an Apriori algorithm to create new features from historical purchases. These attributes had higher predictive power and produced significant lift in our models.
Provided statistical support for implementing a stepwise, logistic regression model in production.
Jun, 2009 - Aug, 2010
Senior Software Developer, Numerix
Jun, 2006 - Aug, 2007
Reverse engineered a multi-factor, backward-lattice pricing algorithm to diagnose and fix numerical instabilities.
Wrote new solvers for calibrating BGM Libor interest-rate models to market data.
Implemented a PDE solver to price Asian and Lookback options with discrete observation dates.
Technical Staff, MIT Lincoln Laboratory
Sep, 2001 - May, 2002
Implemented backscatter models and tracking algorithms for RADAR applications.
Statistics, M.S., University of Washington
Jan, 2006 - Jun, 2010; GPA 3.81/4.00
Applied Mathematics, M.S., University of Washington
Sep, 2003 - Dec, 2005; GPA 3.82/4.00
Computer Science, B.S.E., Princeton University
Sep, 1997 - May, 2001; graduated magna cum laude
The Effect of Active Users on Support Tickets, Microsoft Internal
Published on: Oct 01, 2014
This work presents a simple statistical analysis characterizing the relation between the number of active / engaged users in the system and the rate at which service request tickets are created.
Support Tickets: Confidence Intervals for Population Estimates, Microsoft Internal
Published on: Aug 01, 2014
This work showcases two simple models that allow for the construction of confidence intervals for population estimates associated with customer support tickets. Why is this important? Because it allows us to separate natural variation in business metrics from abnormal behavior that would warrant further investigation. What did we do? We built a model for data loss and a model for label misclassification. These models are used to assess how these two distinct sources of variation affect population estimates such as total minutes spent in customer service.
Probabilistic Performance Guarantees for Oversubscribed Resources, Inferentialist
Published on: Nov 01, 2013
The paper examines the risk associated with resource allocation in the case of oversubscription. We formulate the problem in a mathematical context and provide a business-level parameterization that bounds, in probability, the rate of resource exhaustion. To validate the procedure, we run a simulation over different resource consumption scenarios. In the traditional case, we obtain a 26.4% usage rate; hence, 73.6% of our resource pool goes unused. Using the strategy described in the paper, we can guarantee, with 95% confidence, that resources will be available 99% of the time. This relaxation provides a 2.5x increase in utilization, and the median usage rate jumps to 66.7%.
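The trade-off the abstract describes can be sketched with a small Monte Carlo simulation: oversubscribe a capped resource pool and estimate how often demand exceeds capacity. The parameters below (100 subscribers, 20 units, 10% activity rate) are invented for illustration and do not reproduce the paper's consumption scenarios or exact parameterization.

```python
# Monte Carlo sketch of the oversubscription trade-off described above.
# Assumption: subscriber counts, capacity, and activity rate are toy values;
# the paper's actual consumption scenarios are not reproduced here.

import random

def exhaustion_rate(subscribers, capacity, p_active, trials=20_000, seed=7):
    """Estimate the probability that simultaneous demand exceeds capacity."""
    random.seed(seed)
    exhausted = 0
    for _ in range(trials):
        demand = sum(random.random() < p_active for _ in range(subscribers))
        exhausted += demand > capacity    # more demand than provisioned units
    return exhausted / trials

# Provisioning 1:1 would leave most of the pool idle; oversubscribing 100
# users onto 20 units keeps exhaustion rare when only ~10% are active at once.
print(exhaustion_rate(subscribers=100, capacity=20, p_active=0.10))
```

Because demand is approximately binomial, capacity a few standard deviations above the mean load keeps the exhaustion probability small even at a 5:1 oversubscription ratio, which is the essence of the guarantee the paper formalizes.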
Optimal Lending Club Portfolios, Inferentialist
Published on: Oct 01, 2013
This paper extends the concept of an actively managed Lending Club portfolio. It introduces a novel, random-forest-type algorithm that treats portfolio assets in a survival context. Using historical data provided by the company, we use our algorithm to construct optimal portfolios of Lending Club loans. Our results, driven by expected returns, compare favorably to investment strategies based solely on the loan grade assigned by Lending Club. Our optimal, actively managed portfolios have an expected return exceeding 12% annually. In contrast, portfolios constructed on A-grade loans return 6.68%; B-grade loans, 7.49%; and C-grade loans, 8.11%.
Measuring Microsatellite Conservation in Mammalian Evolution with a Phylogenetic Birth-Death Model, Genome Biology and Evolution
Published on: May 16, 2012
Microsatellites make up about three percent of the human genome, and there is increasing evidence that some microsatellites can have important functions and can be conserved by selection. To investigate this conservation, we performed a genome-wide analysis of human microsatellites and measured their conservation using a binary character birth-death model on a mammalian phylogeny. Using a maximum likelihood method to estimate birth and death rates for different types of microsatellites, we show that the rates at which microsatellites are gained and lost in mammals depend on their sequence composition, length, and position in the genome. Additionally, we use a mixture model to account for unequal death rates among microsatellites across the human genome. We use this model to assign a probability-based conservation score to each microsatellite. We found that microsatellites near the transcription start sites of genes are often highly conserved, and that distance from a microsatellite to the nearest transcription start site is a good predictor of the microsatellite conservation score. An analysis of gene ontology terms for genes that contain microsatellites near their transcription start site reveals that regulatory genes involved in growth and development are highly enriched with conserved microsatellites.
Away From Work