You've found your unicorn! An applied math, statistics, and computer science trifecta. I've spent the last twenty years working on all sorts of data and applied science problems, building frameworks that deliver cogent and actionable insights.
Before we dive in, a quick note on this website. It's designed to deliver an adaptive granularity experience; that is, you select the level of detail.
Conviva is a privately held B2B company in the streaming video analytics space. They provide a client-side QoE (quality of experience) reporting layer and a corresponding backend analytics service for many of the streaming video platforms in use today. One of Conviva's core products is Stream ID, which aggregates devices into households. At a high level, it addresses a community detection problem involving devices, IP addresses, and the inherently unstable labels used to identify these entities.
In 2006, Numerix offered a product that allowed financial institutions to price exotic derivatives based on interest and foreign exchange rates. Underpinning these complex financial assets were arbitrage-free (martingale) measures and stochastic differential equations, and the raison d'être of their software was to expose an API to this numerical machinery in a familiar Excel workbook.
This is a short piece of code that spawns a child process to handle requests from a named socket. This could be useful for, say, monitoring heartbeats or other process health metrics.
To keep the parent process simple, there is no explicit IPC machinery: the filesystem is used for communication. In particular, the parent need only call a send_heartbeat function at its convenience. The monitor is lazy: it only computes the time since the last heartbeat on request. When the parent terminates, the monitor does too.
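A minimal sketch of the filesystem half, assuming the heartbeat is just a file whose modification time gets bumped (the file name here is illustrative; send_heartbeat is the name used above):

import os
import time

HEARTBEAT_FILE = "monitor.heartbeat"  # hypothetical path

def send_heartbeat(path=HEARTBEAT_FILE):
    # Called by the parent at its convenience; just bumps the file's mtime.
    with open(path, "a"):
        os.utime(path, None)  # set atime/mtime to "now"

def seconds_since_heartbeat(path=HEARTBEAT_FILE):
    # Called by the monitor only when a request arrives (it's lazy).
    return time.time() - os.path.getmtime(path)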
This code sets up a named socket in the filesystem, enabling control from the shell. This is handy when debugging. For example:
echo "hello" | socat - UNIX-CLIENT:monitor.socket
echo -n "" | socat - UNIX-CLIENT:monitor.socket
socat - UNIX-CLIENT:monitor.socket
asyncio.open_unix_connection can be a bit fussy when handed a socket. In particular, it expects an already accepted socket, on which it could block indefinitely if, say, the client connects and then does nothing. So, we provide a safe_unix_connection async context manager to make sure it doesn't get stuck, and that the writer is closed appropriately when a connection is terminated.
Application logic is loosely encapsulated at the end of the script. It should feel similar to the callback function passed to asyncio.start_server.
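Here is a minimal sketch of how these pieces might fit together (safe_unix_connection is the helper named above; the file names, the five-second timeout, and the handler body are my illustrative assumptions, not necessarily the post's):

import asyncio
import contextlib
import os
import socket
import time

SOCKET_PATH = "monitor.socket"        # matches the socat examples above
HEARTBEAT_FILE = "monitor.heartbeat"  # hypothetical; see the earlier sketch

@contextlib.asynccontextmanager
async def safe_unix_connection(sock):
    # Wrap an already accepted socket in a (reader, writer) pair and
    # guarantee the writer is closed once the connection terminates.
    reader, writer = await asyncio.open_unix_connection(sock=sock)
    try:
        yield reader, writer
    finally:
        writer.close()
        await writer.wait_closed()

async def handle(reader, writer):
    # Application logic; similar in spirit to an asyncio.start_server callback.
    try:
        # A silent client can't hang us: give up after five seconds.
        line = await asyncio.wait_for(reader.readline(), timeout=5.0)
    except asyncio.TimeoutError:
        return
    if line.strip():
        # Lazy monitoring: the elapsed time is computed only on request.
        elapsed = time.time() - os.path.getmtime(HEARTBEAT_FILE)
        writer.write(f"{elapsed:.1f}s since last heartbeat\n".encode())
        await writer.drain()

async def main():
    with contextlib.suppress(FileNotFoundError):
        os.unlink(SOCKET_PATH)
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(SOCKET_PATH)
    srv.listen()
    srv.setblocking(False)
    loop = asyncio.get_running_loop()
    while True:  # serve one connection at a time; enough for a debug monitor
        conn, _ = await loop.sock_accept(srv)
        async with safe_unix_connection(conn) as (reader, writer):
            await handle(reader, writer)

if __name__ == "__main__":
    asyncio.run(main())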
A homogeneous Poisson process is the simplest way to describe events that arrive in time. Here, we are interested in a collection of realizations. An example is user transactions in a system. Over time, we expect each user to produce a sequence of transaction events, and we would like to characterize the rate of these events on a per-user basis. In particular, users with more data should expect a more personalized characterization. Statistically, this can be accomplished using a Bayesian framework.
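For the simplest concrete instance (the Gamma prior here is the textbook conjugate choice, not necessarily the one the post settles on): if user i produces nᵢ events over an observation window of length Tᵢ, and the rate λᵢ carries a Gamma(α, β) prior, then

λᵢ | nᵢ, Tᵢ ~ Gamma(α + nᵢ, β + Tᵢ),  with E[λᵢ | data] = (α + nᵢ) / (β + Tᵢ).

Users with short histories are shrunk toward the prior mean α/β, while users with long histories are characterized largely by their own data; this is exactly the personalization effect described above.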
Censored data is an artifact of partial or incomplete measurement.
A typical scenario is a survival analysis of time-to-event data. For example, a study may end before a final measurement is available (right censoring). Another situation arises when batch processing log file data: the reported timestamp might reflect the time of processing and not the true event time (left censoring).
This post derives the density equations for censored data. Given a parameterization θ, this leads naturally to a log-likelihood formulation. As the censoring mechanism is, in general, random, we further allow for the possibility that it, too, depends on θ.
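For orientation, the familiar special case with non-informative right censoring (simpler than the θ-dependent censoring the post allows) reads as follows: with event indicator δᵢ = 1 for an observed event time and δᵢ = 0 for a right-censored one, density f, and survival function S = 1 − F,

ℓ(θ) = Σᵢ [ δᵢ log f(tᵢ; θ) + (1 − δᵢ) log S(tᵢ; θ) ].

Observed events contribute the density; censored ones contribute only the probability of surviving past the censoring time.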
The probability integral transform is a fundamental concept in statistics that connects the cumulative distribution function, the quantile function, and the uniform distribution. We motivate the need for a generalized inverse of the CDF and prove the result in this context.
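In symbols, the result being proved: define the generalized inverse (quantile function)

F⁻(u) = inf { x ∈ ℝ : F(x) ≥ u },  u ∈ (0, 1).

Then for U ~ Uniform(0, 1), F⁻(U) has CDF F; conversely, if X has a continuous CDF F, then F(X) ~ Uniform(0, 1).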
Adaptive rejection sampling is a statistical algorithm for generating samples from a univariate, log-concave density. Because of the adaptive nature of the algorithm, rejection rates are often very low. The exposition of this algorithm follows the example given in Davison’s 2008 text, “Statistical Models.”
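For reference, the core construction in the tangent-based scheme of Gilks and Wild (1992), omitting the squeeze step: write h = log f (concave) and let u_k be the piecewise-linear upper hull formed by the tangents to h at the current support points x₁ < … < x_k,

u_k(x) = min_j [ h(x_j) + h′(x_j) (x − x_j) ],  g_k(x) ∝ exp(u_k(x)).

A draw x* from the piecewise-exponential envelope g_k is accepted with probability exp(h(x*) − u_k(x*)); on rejection, x* joins the support points, so the envelope tightens exactly where it was loose. This is why the rejection rate falls as sampling proceeds.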
In 2015, I went looking for a solution to the following problem: I have a single Linux server with a dynamically assigned IP address and I want to host several sites on this server. My registrar is Namecheap.com, and their advice is to use a Linux tool called ddclient.
Unfortunately, the example available from Namecheap doesn't cover multiple hosts. A Google search pointed me to thornelabs.net, where the author describes a patch that can be applied to ddclient. Ddclient is written in Perl, so patching is a possibility, but one that feels a bit unsatisfactory.
A friend at a large, Seattle-area company recently approached me with the following problem. Suppose we wanted to oversubscribe the shared resources that we lease to our customers. We've noticed that loads are often quite low. In fact, loads are so low that there must be a way to allocate at least some of that unused capacity without generating too much risk of resource exhaustion. If we could manage this, we could serve more people at lower cost! Sure, they might get dropped service on rare occasions, but anyone who wasn't satisfied with a soft guarantee could still pay a premium and keep the full dedicated resource slice to which they may have become accustomed. This seemed like a tractable problem.
Here, we propose a mathematical framework for solving a very simple version of the problem described above. It provides intuitive tuning parameters that allow for business-level calibration of risk and the corresponding reliability of the accompanying service guarantees.
After developing the mathematical framework, we put it to work in a simulation of customer usage behavior. In this experiment, most customers use only a fraction of the resource purchased, but there is a non-negligible group of “power” users that consume almost all of what they request. The results are rather striking. Compared to the dedicated-slice paradigm, resource utilization in the oversubscribed case increases by a factor of 2.5, and more than twice as many customers can be served by the same resource pool.
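As a toy version of this calibration exercise (the mixture, its parameters, and the 1% risk budget below are made-up illustrations, not the post's actual numbers): fix a tolerated exhaustion probability ε and scan for the largest headcount whose simulated demand stays within capacity often enough.

import numpy as np

rng = np.random.default_rng(0)

CAPACITY = 100.0  # pool size: 100 one-unit slices
EPSILON = 0.01    # tolerated probability of exhausting the pool
TRIALS = 5_000    # Monte Carlo samples per candidate headcount

def exhaustion_prob(n):
    # Hypothetical usage mixture: 90% light users (mean 0.2 of a slice),
    # 10% "power" users (mean 0.8 of a slice).
    power = rng.random((TRIALS, n)) < 0.10
    usage = np.where(power,
                     rng.beta(8, 2, size=(TRIALS, n)),
                     rng.beta(2, 8, size=(TRIALS, n)))
    return float((usage.sum(axis=1) > CAPACITY).mean())

# Dedicated slices serve exactly 100 customers; scan larger headcounts
# and keep those whose exhaustion risk stays within the ε budget.
feasible = [n for n in range(100, 501, 20) if exhaustion_prob(n) <= EPSILON]
print(f"serve up to {max(feasible)} customers at exhaustion risk <= {EPSILON}")

Here ε is the business-level knob: tightening it trades headcount for reliability, which is precisely the calibration described above.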
The methodology is easily extended to the non-IID case by standard modifications to the sampling scheme. Moreover, even better performance is likely if a customer segmentation scheme is incorporated into the underlying stochastic assignment problem.