I am a classically trained statistician with a background in consulting and computational software development. My value derives from framing business questions in a mathematical context, carrying out appropriate analyses, and delivering interpretable results based on data.
I have worked on projects using R, SQL, C/C++, bash: awk/grep/sed, Rails, Python, Java, and C#/SCOPE (Microsoft's version of MapReduce).
In December 2016, I joined the Operational Intelligence team where I focused on developing and testing algorithms for event correlation and anomaly detection in an ITOM environment.
I left because, despite promises to the contrary, we never had any data.
In May 2012, I started Inferentialist LLC, with the belief that the tech industry was missing opportunities to leverage statistical best practices.
In October 2014, I was transitioned to a data reporting role when the O365 Customer Intelligence Team was re-organized under new management. By January, the CI team had effectively collapsed, and I found myself on the Bing Analysis and Experimentation Team. My new role was that of an internal consultant, providing analytics support to partner teams across the company that had expressed interest in onboarding to Bing's existing experimentation platform.
In my 18 months at Microsoft, I had seven different managers.
O365 Customer Intelligence Team
In May 2014, I accepted a research position at Microsoft on the O365 Customer Intelligence Team. Our mandate was to develop machine learning tools that would detect trends in customer service tickets; the goal was to identify common customer complaints and, in an automated fashion, propose relevant solutions.
I was hired to work on, and improve, the Zestimate algorithm. I argued against the existing, off-the-shelf machine-learning approach and in favor of building an interpretable model with spatial and temporal correlation structures.
I joined a team at Globys that was tasked with improving upsell strategies for mobile add-on packages. The initial goal was to derive predictors of upsell from existing, retrospective data. While I was there, I saw the strategy shift toward controlled experiments, the gold standard for assessing the efficacy of online marketing campaigns.
I continued a summer internship with NumeriX, arranged by my academic advisor while on sabbatical. I focused on two sides of a multi-factor SDE derivatives-pricing model: calibration to market prices and numerical pricing algorithms.
My PhD-level coursework was in statistical theory, optimization, stochastic modeling, and computing.
My advisors were Doug Martin (computational finance), Paul Tseng (semidefinite programming), and Vladimir Minin (statistical genetics).
I received a Master's degree from the department in 2012.
The Computational Finance Certificate is an interdisciplinary program that requires several finance courses as well as a "capstone" project. In my case, I took courses in optimization, econometric theory, stochastic calculus, modern portfolio theory, and financial derivatives.
My PhD-level coursework was in numerical analysis for ordinary and partial differential equations, with applications in computational fluid dynamics.
My advisor was Randall Leveque.
I received a Master's degree from the department in 2005.
My undergraduate coursework focused on theory and algorithms. I also completed the undergraduate certificate program in applied and computational mathematics.
My advisors were Bernard Chazelle and Brian Kernighan.
I received my Bachelor of Science in Engineering from the university in 2001, graduating magna cum laude.
I spent some time in early 2016 developing an Android app to make carpooling recommendations. The idea was that the app would monitor a user for a week, assess a user's commuting schedule, and make introductions to other users with similar commute patterns.
Sadly, the project eventually stalled due to a lack of funding and a (then) generally held opinion that UberPool would take over the market. However, I did get a functional prototype up and running on Android Lollipop (5.0) devices. The APK and instructions for sideloading the app are available from carpoolproject.org.
I worked briefly at Zillow, where I was tasked with stabilizing the Zestimate algorithm. At the time, Zillow had just filed a patent for using random forests and k-nearest neighbors in the Zestimate algorithm. Not surprisingly, this off-the-shelf approach missed a number of important physical constraints inherent in the data-generating process.
In particular, the company was often asked to explain inaccurate Zestimate histories. Judging by the Consumer Reports complaints, this still appears to be a major issue for the company.
Since leaving Zillow, I have found time to revisit the home valuation problem. Luckily, King County has made housing sales data publicly available, so, apart from the usual ETL, it was not too difficult to recreate a viable dataset.
It turns out that by introducing appropriate correlation structures, one can produce far more plausible estimates. For example, the following heatmap shows the relative price per square foot of homes in Seattle and aligns well with known 'hot' neighborhoods.
Also included is a relative error plot, restricted to the worst offenders, showcasing various models. Empirically, our model, labeled "Bayesian Hedonic" in the figure, exhibits the best worst-case behavior.
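To give a flavor of why spatial correlation helps, here is a toy sketch (not the actual model): a distance-weighted local average of price per square foot already tracks neighborhoods in a way a pooled, global estimate cannot. The sales data and helper function below are hypothetical, invented for illustration.

```python
import math

# Hypothetical sales: (latitude, longitude, price per square foot).
sales = [
    (47.62, -122.32, 520.0),  # near Capitol Hill
    (47.63, -122.33, 540.0),
    (47.55, -122.27, 310.0),  # further south
    (47.54, -122.28, 300.0),
]

def idw_ppsf(lat, lon, data, power=2.0):
    """Inverse-distance-weighted price per square foot at (lat, lon).

    A stand-in for a proper spatial correlation structure: nearby
    sales get more weight, so estimates reflect the local market.
    """
    num = den = 0.0
    for slat, slon, ppsf in data:
        d = math.hypot(lat - slat, lon - slon) + 1e-9  # avoid divide-by-zero
        w = d ** -power
        num += w * ppsf
        den += w
    return num / den

# A query point near Capitol Hill lands near the local ~530 level,
# while the pooled mean over all four sales is only 417.5.
local = idw_ppsf(47.625, -122.325, sales)
pooled = sum(p for _, _, p in sales) / len(sales)
```

The real model replaces this ad hoc weighting with an explicit covariance structure estimated from the data, but the intuition is the same.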
I periodically contribute to a blog for Inferentialist Consulting documenting my day-to-day experiences as a statistical consultant.
In 2012, I started Inferentialist LLC. Our mission is to provide statistical and data science best practices to Seattle-area companies that might otherwise not have the in-house expertise, or resources, necessary to deploy well-principled solutions.
Since its inception, Inferentialist Consulting has engaged with a number of local startups. The company has also published several independent white papers.
In mid-2013, I was introduced to Lending Club as a vehicle for crowd-sourced lending. The idea is that the company provides a platform where customers apply for loans and investors fund those loans. Most interestingly, Lending Club provided access to all of the loan origination data. Over the next few months, I carried out an analysis showing how to construct optimal portfolios of such loans.
The outcome of the work was twofold. First, it proved far better to eschew the given loan grades in favor of the new optimal strategy. Second, the loans preferred under the optimal strategy were rarely available in practice. It was conjectured that preferred customers had access to loans before retail customers; so while the loans were present in the data, in reality they were always bought out.
The figure contrasts returns based on an optimally constructed portfolio with existing loan classes. For example, a portfolio construction strategy that purchased all Grade B loans would have had an annual return of 8%; an optimal strategy that chose the top 10% of loans based on optimal scores would have returned just shy of 12% annually.
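The shape of that comparison can be sketched with simulated data; the score model, return distribution, and grade assignment below are invented for the sketch and do not reflect the actual Lending Club analysis.

```python
import random
import statistics

random.seed(1)

# Hypothetical loan records: (grade, model_score, annual_return).
# Returns are loosely driven by the score; grades are a noisy proxy.
loans = []
for _ in range(1000):
    score = random.random()
    ret = 0.02 + 0.12 * score + random.gauss(0, 0.02)
    grade = "B" if 0.4 < score < 0.8 else random.choice("ACDEFG")
    loans.append((grade, score, ret))

# Strategy 1: buy every Grade B loan.
grade_b = [r for g, _, r in loans if g == "B"]

# Strategy 2: buy the top 10% of loans ranked by model score.
ranked = sorted(loans, key=lambda t: t[1], reverse=True)
top_decile = [r for _, _, r in ranked[: len(loans) // 10]]

mean_b = statistics.mean(grade_b)
mean_top = statistics.mean(top_decile)
```

When the score is informative, the top-decile portfolio out-earns the grade filter, which is the qualitative result reported above.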
The algorithm I developed used a random forest paradigm but was customized to work with survival data.
Out-of-bag estimates were used to establish confidence bands on individual loans. This meant that risk-return estimates could be computed on a per-loan basis; this is shown in the second figure below, where standard deviation is plotted on the x-axis and expected return on the y-axis.
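The out-of-bag mechanics can be sketched with a small bagged ensemble. The data and one-split base learner here are hypothetical stand-ins, not the survival-adapted forest described above; the point is only how OOB predictions yield a per-loan mean and spread without a separate holdout set.

```python
import random
import statistics

random.seed(0)

# Hypothetical per-loan feature x and realized annual return y.
xs = [float(i) for i in range(30)]
ys = [0.05 + 0.002 * x + random.gauss(0, 0.01) for x in xs]

def fit_stump(sample):
    """Toy base learner: one split on x at the sample median."""
    med = statistics.median(x for x, _ in sample)
    lo = [y for x, y in sample if x <= med] or [0.0]
    hi = [y for x, y in sample if x > med] or [0.0]
    lo_m, hi_m = statistics.mean(lo), statistics.mean(hi)
    return lambda x: lo_m if x <= med else hi_m

n, n_trees = len(xs), 200
oob_preds = [[] for _ in range(n)]
for _ in range(n_trees):
    idx = [random.randrange(n) for _ in range(n)]          # bootstrap sample
    model = fit_stump([(xs[i], ys[i]) for i in idx])
    for i in set(range(n)) - set(idx):                     # out-of-bag rows
        oob_preds[i].append(model(xs[i]))

# Per-loan risk/return: the OOB mean estimates expected return, and the
# OOB standard deviation gives the width of a confidence band.
risk_return = [(statistics.stdev(p), statistics.mean(p))
               for p in oob_preds if len(p) > 1]
```

Each loan is out-of-bag for roughly a third of the trees, so every loan accumulates enough predictions to estimate both coordinates of the risk-return plot.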
The full analysis is available in the white paper.