Probabilistic Performance Guarantees for Oversubscribed Resources
This work was originally published as an Inferentialist white
paper.
Introduction
A friend, also an employee at a large, Seattle-area company, recently
approached me with the following problem. Suppose we wanted to
oversubscribe the shared resources that we lease to our customers. We’ve
noticed that loads are often quite low. In fact, loads are so low that
there must be a way to allocate at least some of that unused capacity
without generating too much risk of resource exhaustion. If we could
manage to do this, we could provide service to more people at a lower
cost! Sure, they might get dropped service on rare occasions, but anyone
who wasn't satisfied with a soft guarantee could still pay a premium
and have the full dedicated resource slice to which they may have become
accustomed. This seemed like a tractable problem. So after closing out
the bar tab, I went home and sketched out the ideas for this paper on my
markerboard.
Here, we propose a mathematical framework for solving a very simple
version of the problem described above. It provides intuitive tuning
parameters that allow for business-level calibration of risks and the
corresponding reliability of the accompanying service guarantees.
After developing the mathematical framework, we put it to work in a
simulation of customer usage behavior. In this experiment, most
customers use only a fraction of the resource purchased, but there is a
non-negligible group of “power” users that consume almost all of what
they request. The results are rather striking. Compared to the dedicated
slice paradigm, resource utilization in the oversubscribed case
increases by a factor of 2.5, and more than twice as many customers can
be served by the same, original resource pool.
The methodology is easily extended to the non-IID case by standard
modifications to the sampling scheme. Moreover, even better performance
will be likely if a customer segmentation scheme is incorporated into
the underlying stochastic assignment problem.
Mathematical Formulation of the Problem
Suppose individual $i$ consumes $X_i$ units of the resource, where $X_i$
is a random variable taking values in $[0, T_i]$. Thus, individual $i$
has paid for access to $T_i$ units but consumes some unknown quantity
less than or equal to $T_i$.

Let $T = \sum_{i=1}^{N} T_i$ and $X = \sum_{i=1}^{N} X_i$ be the total
resource purchased and total resource consumed, respectively. $X$ must,
necessarily, take values in $[0, T]$. Without loss of generality
(normalizing the size of the resource pool), we can assume that the
resource is exhausted when $X > 1$.
For a given $N$, choose an acceptable $\epsilon$, the risk tolerance for
resource exhaustion, such that

$$P(X > 1) \leq \epsilon,$$

i.e., the probability of resource exhaustion is less than or equal to
$\epsilon$.
Then we seek a solution to the following:

$$N^* = \max \left\{ N : P(X > 1) \leq \epsilon \right\}.$$

Namely, find the largest $N$ that doesn't violate the stated risk
tolerance.
In order to obtain $N^*$, we need to compute an estimate of the
objective function,

$$p_N = P(X > 1).$$

The most straightforward approach is to take samples from the usage
distribution. A single sample, say the $k$-th of $K$ total, will be
comprised of $N$ observations and will give rise to the following
statistics:

$$X^{(k)} = \sum_{i=1}^{N} X_i^{(k)}, \tag{1}$$

$$c = \sum_{k=1}^{K} \mathbf{1}\left\{ X^{(k)} > 1 \right\}, \tag{2}$$

$$\hat{p}_N = \frac{c}{K}, \tag{3}$$

where these statistics are characterized by the usual distributions; in
particular, $c \sim \mathrm{Binomial}(K, p_N)$.
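To make the estimator concrete, here is a minimal Python sketch. The
function `sample_usage` is a hypothetical stand-in for draws from the
usage distribution (it is not specified by the framework itself);
`estimate_p` computes the statistics of Equations (1)-(3).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_usage(N):
    # Hypothetical stand-in for the usage distribution: N draws of
    # X_i in [0, T_i], here with every T_i = 0.1.
    return 0.1 * rng.beta(2.0, 8.0, size=N)

def estimate_p(N, K):
    # Equations (1)-(3): totals holds the X^(k); c counts the samples
    # whose total usage exceeds the normalized resource; p_hat = c / K.
    totals = np.array([sample_usage(N).sum() for _ in range(K)])
    c = int((totals > 1.0).sum())
    return c, c / K

c, p_hat = estimate_p(N=15, K=1_000)
print(c, p_hat)
```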
Note that $p_N$ is tacitly a function of the population of interest from
which the samples are chosen. It is worth keeping in mind that this
population should remain constant across the $K$ samples. In the current
use case, one could imagine different populations corresponding to
computational use, web site hosting, etc., or even blends of these. In
fact, blends will be particularly useful when dealing with the
stochastic assignment problem that would arise when needing to match new
users to existing, active resources.
The plan, then, will be to compute a sequence of estimates, increasing
$N$ until we find that we have violated the risk tolerance. At each
iteration, take $K$ samples of size $N$ and compute the statistics given
in Equations (1), (2), and (3).
The key step is to define a stopping criterion that gives us control
over the probability that we accept a non-violation when, in fact, the
risk tolerance probability was exceeded. This is accomplished by
choosing a threshold $c^*$ such that, if we observe $c > c^*$, we reject
the hypothesis that $p_N$ is less than $\epsilon$. In particular, we
choose $c^*$ such that

$$P(c \leq c^* \mid p_N > \epsilon) \leq \alpha. \tag{4}$$

Equation (4) should be interpreted as establishing a bound on our
willingness to be wrong in the following sense: with probability at most
$\alpha$, we will incorrectly identify $N$ to be within risk tolerance
when, in fact, this is not the case. So, $\epsilon$ and $\alpha$ are
business-level decisions. $\epsilon$ is the guarantee on the frequency
of resource exhaustion and $1 - \alpha$ is how confident we are that
such a guarantee can be made.
The context implies a need for estimates to be conservative; avoiding
resource exhaustion is certainly preferable to customer fallout from
consistently missing resource guarantees. To that end, note that it is
sufficient to enforce the following:

$$P(c \leq c^* \mid p_N = \epsilon) \leq \alpha. \tag{5}$$

Effectively, we only need to consider Equation (4) in the most difficult
case, namely when $p_N = \epsilon$: since $P(c \leq c^* \mid p_N)$
decreases as $p_N$ grows, the boundary case dominates all
$p_N > \epsilon$.
See Table 1 for calculations of $c^*$ when $\epsilon = 0.01$ and
$\alpha = 0.05$. For a collection of $K$ samples, the value of $c^*$
would be the maximum number of resource exceedances allowed in order to
give an $\alpha$-level guarantee that the resource is exhausted less
than $100\,\epsilon$ percent of the time.
        K       c*
    1,000        4
   10,000       83
  100,000      948
1,000,000    9,836
Table 1: The claim that a resource will remain unexhausted 99% of the
time can be made with 95% confidence if the observed $c$ is at most
$c^*$.
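The $c^*$ column follows directly from the binomial distribution of
$c$. A minimal sketch with `scipy.stats.binom`, which should reproduce
Table 1 for $\epsilon = 0.01$ and $\alpha = 0.05$:

```python
from scipy.stats import binom

def c_star(K, eps=0.01, alpha=0.05):
    # Largest c such that P(Binomial(K, eps) <= c) <= alpha, i.e., the
    # most exceedances we may observe and still make the claim.
    c = int(binom.ppf(alpha, K, eps))  # smallest c with CDF >= alpha
    while c >= 0 and binom.cdf(c, K, eps) > alpha:
        c -= 1
    return c

for K in (1_000, 10_000, 100_000, 1_000_000):
    print(f"{K:>9,}  {c_star(K):>5,}")
```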
Simulation Example
Without real data, the best we can do is simulate. We assume the
following:
- Each user is sold access to a 10% slice of a given resource.
- The usage distribution is a mixture of beta distributions: 90% of
users have low resource needs, consuming an average of 20% of their
allocated slice; the remaining 10% of users have high resource needs
and, on average, consume 90% of their allocation.
Figure 1 shows the probability density functions.
For a resource that does not allow oversubscription, our scenario
would allow 10 users to share the resource. A histogram (Figure 2) of
usage levels shows that this resource is woefully underutilized. The
guarantee of no resource exhaustion means that we sacrifice 74% of
potential resource utilization.
Next, we explore the potential benefits of allowing oversubscription.
For a sequence of increasing $N$, taking $\epsilon = 0.01$ and
$\alpha = 0.05$ as in Table 1, we simulate $K$ samples, each of size
$N$, drawn from the mixture distribution. For each $N$, we compute $c$
and compare it to $c^*$. The first $N$ for which $c > c^*$ is the
stopping point.
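The search can be sketched end to end as below. The beta shape
parameters are assumptions chosen only to match the stated means (20%
and 90% of a slice), and $K = 10{,}000$ is an illustrative choice; the
exact simulation settings are not given in the text, so the resulting
stopping point is indicative rather than a reproduction of Figure 3.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
SLICE = 0.1  # each user is sold a 10% slice

def sample_usage(N):
    # Mixture of betas: 90% low users (mean 20% of slice), 10% power
    # users (mean 90%). Shape parameters are assumed, not from the text.
    power = rng.random(N) < 0.10
    frac = np.where(power, rng.beta(9.0, 1.0, N), rng.beta(2.0, 8.0, N))
    return SLICE * frac

def find_n_star(K=10_000, eps=0.01, alpha=0.05):
    c_star = int(binom.ppf(alpha, K, eps))  # threshold as in Table 1
    while binom.cdf(c_star, K, eps) > alpha:
        c_star -= 1
    N = 10  # dedicated-slice baseline: exhaustion is impossible here
    while True:
        totals = np.array([sample_usage(N + 1).sum() for _ in range(K)])
        if int((totals > 1.0).sum()) > c_star:
            return N  # N + 1 violates the tolerance; stop at N
        N += 1

print(find_n_star())
```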
The results in Figure 3 show the optimal $N^*$ that enforces our risk
tolerance, more than doubling the 10 users of the dedicated-slice
scheme. Resource utilization under this scheme is shown in Figure 4,
where median resource utilization is now at 66.7%. The gray, vertical
line shows the 99% quantile is 0.994, just shy of 1. This is exactly
what we would hope: 99% of the time, resource utilization is less than
0.994. Thus, we can more than double the number of happy users while
substantially increasing resource utilization.
In Figure 3, the confidence region reports the $1 - \alpha$ quantile
(divided by $K$) of a random variable distributed as
$\mathrm{Binomial}(K, \hat{p}_N)$, where $\hat{p}_N$ is the MLE. This is
a one-sided confidence interval based on the exact binomial distribution
rather than a normal approximation.
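Under that reading, the bound is a one-liner; note that the $1 - \alpha$
quantile level is inferred from context rather than stated explicitly.

```python
from scipy.stats import binom

def exact_upper_bound(c, K, alpha=0.05):
    # One-sided exact-binomial bound: the (1 - alpha) quantile of
    # Binomial(K, c / K), rescaled back to a proportion.
    return binom.ppf(1 - alpha, K, c / K) / K

print(exact_upper_bound(c=83, K=10_000))  # e.g., a row of Table 1
```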
Further Development
Changing the sampling scheme to account for live data should be
straightforward. Moreover, this scheme could easily be adapted to handle
different usage populations. For example, the simplest scenario might be
to use a classification engine to identify which usage distribution
might best apply to a new customer entering the system.
A more interesting use case of the methodology described here would
be as a component of an algorithm that matches new customers to
resources already under load. This is a non-trivial, stochastic optimal
packing problem and would certainly be an interesting project with which
to get involved.