Principal Data Scientist, Conviva
Jun 2021 - Dec 2022
My technical work has focused on improvements to the Stream ID product (community detection). More generally, I've worked to socialize data science best practices and build a more data-driven culture. Tools: Scala, Python, Databricks, SBT, GCP.
- I designed and built a Scala implementation of Conviva's next-generation household ID algorithm. It used more modern, flexible statistical machinery to preprocess and detect communities in unreliably labeled data (a minimal, illustrative sketch appears after this list). The prototype was used for an RFP on a production-sized dataset that required functionality outside the scope of our production algorithm.
- I designed and built a framework to enable reproducible research (sketched after this list). The idea is that an analysis typically comprises multiple tasks that can be described by a DAG, and, in general, each node of the DAG uses a different parameter set. Typically, one wants to build and test each node independently, with upstream calculations cached when possible. The framework manages these incremental results and enables low-overhead, per-task logging through log4j / slf4j. It is designed to be useful for local testing, for deployment as a JAR in Databricks, and as a command-line Databricks job submission tool. It centralizes the implementation of common tasks and largely solves the problem of Databricks notebook drift, enabling a shift where Databricks is used to prototype an idea that is subsequently properly codified within the framework.
- I'm currently providing data science support and consulting to a new product that aims to provide instrumentation as a service beyond the streaming video space.
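
For illustration, a minimal sketch of the general shape of the community-detection step, not the production algorithm: it assumes a Spark GraphX label-propagation pass over a device co-occurrence graph, with a simple confidence filter standing in for the preprocessing of unreliable labels. The names (DeviceEdge, HouseholdSketch, householdsFrom, minEdgeWeight) are hypothetical.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.{Graph, VertexId}
import org.apache.spark.graphx.lib.LabelPropagation

// A noisy co-occurrence edge between two device identifiers (hypothetical shape).
final case class DeviceEdge(src: VertexId, dst: VertexId, weight: Double)

object HouseholdSketch {
  // Drop low-confidence edges (a crude stand-in for preprocessing unreliably
  // labeled data), build a graph, and assign each device a community label.
  def householdsFrom(edges: RDD[DeviceEdge],
                     minEdgeWeight: Double = 0.5,
                     maxSteps: Int = 5): RDD[(VertexId, VertexId)] = {
    val reliable = edges.filter(_.weight >= minEdgeWeight).map(e => (e.src, e.dst))
    val graph = Graph.fromEdgeTuples(reliable, defaultValue = 1)
    // Each vertex ends up labeled with a community id, read here as a household id.
    LabelPropagation.run(graph, maxSteps).vertices
  }
}
```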
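
And a minimal sketch of the core idea behind the reproducible-research framework, not the framework itself: tasks as DAG nodes, each with its own parameter set and its own slf4j logger, computed at most once per run. Every name here (Task, Runner, Preprocess, Summarize) is hypothetical, and the on-disk caching and Databricks job-submission pieces are elided.

```scala
import scala.collection.mutable
import org.slf4j.LoggerFactory

// A node in the analysis DAG: its own parameter set, its own logger, and the
// upstream tasks whose cached outputs it consumes.
abstract class Task[+A](val name: String) {
  protected val log = LoggerFactory.getLogger(s"task.$name") // per-task logging
  def upstream: Seq[Task[Any]] = Seq.empty
  def compute(): A
}

// Walks the DAG, computing each task at most once and caching incremental
// results so a single node can be rebuilt and tested without re-running
// everything upstream.
object Runner {
  private val cache = mutable.Map.empty[String, Any]
  def run[A](task: Task[A]): A =
    cache.getOrElseUpdate(task.name, {
      task.upstream.foreach(t => run(t)) // ensure upstream results exist first
      task.compute()
    }).asInstanceOf[A]
}

// Usage: a tiny two-node DAG where each node carries its own parameters.
final class Preprocess(threshold: Double) extends Task[Seq[Double]]("preprocess") {
  def compute(): Seq[Double] = {
    log.info(s"filtering with threshold=$threshold")
    Seq(0.1, 0.7, 0.9).filter(_ >= threshold)
  }
}
final class Summarize(prep: Preprocess) extends Task[Double]("summarize") {
  override def upstream = Seq(prep)
  def compute(): Double = Runner.run(prep).sum // served from the cache, not recomputed
}

object Example extends App {
  println(Runner.run(new Summarize(new Preprocess(threshold = 0.5))))
}
```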