context
Conviva wanted to participate in an RFP, but the rigidity of the existing, monolithic pipeline made it difficult. In particular, the project required exploratory data analysis, redesigning and generalizing the ingestion portion of the existing pipeline, and implementing a scalable, map-reduce variant of the Louvain community detection algorithm in Scala.
description
The RFP became a pretext for implementing a new and improved Stream ID product. The goals were to introduce a new, more flexible household model; to deliver an easily configurable, highly tunable, and interpretable model; to define a principled approach focused on time granularity and canonical graphical representations; and, finally, to produce something that, if correctly parameterized, could recover the behavior of the existing implementation.
These concerns were guided by customer use cases.
In particular, it was important to be able to tune the model around some notion of time scale: should clusterings reflect yesterday's data, last week's data, or last month's data? Alternatively, perhaps the dynamics of the graph are of interest, and older data should be discounted.
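As a minimal sketch, with hypothetical names, the time-scale choice could be exposed as a configuration that either windows the input or discounts it exponentially:

```scala
import java.time.{Duration, Instant}

// Hypothetical configuration: either restrict the input to a fixed lookback
// window, or keep all history and discount older observations exponentially.
sealed trait TimeScope
final case class FixedWindow(lookback: Duration) extends TimeScope
final case class ExponentialDecay(halfLife: Duration) extends TimeScope

object TimeScope {
  // Multiplier applied to an observation of a given age under the chosen scope.
  def discount(scope: TimeScope, observedAt: Instant, now: Instant): Double =
    scope match {
      case FixedWindow(lookback) =>
        if (Duration.between(observedAt, now).compareTo(lookback) <= 0) 1.0 else 0.0
      case ExponentialDecay(halfLife) =>
        val ageSeconds = Duration.between(observedAt, now).getSeconds.toDouble
        math.pow(0.5, ageSeconds / halfLife.getSeconds.toDouble)
    }
}
```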
It was also clear that customer context should inform whether Stream ID would work with clusters of clients, clusters of addresses, or clusters that include both types of nodes.
At a high level, it made sense to reframe the problem as a clustering on a weighted, colored graph.
But how to define weights? Again, customer context should inform the necessary configurations. I termed these "Association Metrics" and provided implementations for session length (usage), session counts (incidence), time-discounted session length, and time-discounted session counts. Each of these defines a weight between a client and an address. Such weighted, colored graphs can be reduced to single-color subgraphs, with new weights induced by those in the larger graph.
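A minimal Scala sketch of this setup, assuming a simple session data model; the names, the data model, and the min-combiner in the projection are illustrative assumptions, not the production code:

```scala
import java.time.{Duration, Instant}

// Hypothetical data model: a viewing session ties a client to an address.
final case class Session(client: String, address: String,
                         start: Instant, lengthSeconds: Long)

// An association metric turns the sessions shared by a (client, address)
// pair into a single edge weight in the two-colored graph.
trait AssociationMetric {
  def weight(sessions: Seq[Session], now: Instant): Double
}

// Usage: total session length.
object Usage extends AssociationMetric {
  def weight(sessions: Seq[Session], now: Instant): Double =
    sessions.map(_.lengthSeconds.toDouble).sum
}

// Incidence: session count.
object Incidence extends AssociationMetric {
  def weight(sessions: Seq[Session], now: Instant): Double =
    sessions.size.toDouble
}

// Time-discounted usage: older sessions contribute exponentially less.
final case class DiscountedUsage(halfLifeSeconds: Double) extends AssociationMetric {
  def weight(sessions: Seq[Session], now: Instant): Double =
    sessions.map { s =>
      val age = Duration.between(s.start, now).getSeconds.toDouble
      s.lengthSeconds.toDouble * math.pow(0.5, age / halfLifeSeconds)
    }.sum
}

object Projection {
  // Reduce the two-colored client-address graph to a client-client graph:
  // for each shared address, link the two clients with the smaller of their
  // weights to it, then sum over addresses. The min-combiner is illustrative.
  def toClients(weights: Map[(String, String), Double]): Map[(String, String), Double] =
    weights.toSeq
      .groupBy { case ((_, address), _) => address }
      .values
      .flatMap { edges =>
        for {
          ((c1, _), w1) <- edges
          ((c2, _), w2) <- edges
          if c1 < c2
        } yield ((c1, c2), math.min(w1, w2))
      }
      .groupMapReduce(_._1)(_._2)(_ + _)
}
```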
Another metric, one that mimicked the existing approach and was applicable to client-client and address-address graphs, was based on the idea of overlapping intervals. Effectively, this approach considered each client-address edge and computed an interval from the first time the pair was observed to the last time the pair was observed. The weight between two clients (or two addresses) was the overlap of their intervals, aggregated over all addresses at which the clients were coincident. Time-discounted versions of these metrics were also made available. During development, it was discovered that the overlapping-intervals metric tended to create larger subgraphs, which was one of the most common complaints among customers.
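Under the same kind of assumed data model, the overlapping-intervals weight might be sketched as follows (names are hypothetical):

```scala
import java.time.{Duration, Instant}

// For each client-address pair we keep the interval [first seen, last seen];
// the client-client weight is the overlap of the two clients' intervals,
// summed over every address where both clients appear.
final case class Interval(first: Instant, last: Instant) {
  def overlapSeconds(other: Interval): Long = {
    val start = if (first.isAfter(other.first)) first else other.first
    val end   = if (last.isBefore(other.last)) last else other.last
    math.max(0L, Duration.between(start, end).getSeconds)  // 0 if disjoint
  }
}

object OverlapMetric {
  // intervals: (client, address) -> observed interval
  def clientWeights(intervals: Map[(String, String), Interval]): Map[(String, String), Double] =
    intervals.toSeq
      .groupBy { case ((_, address), _) => address }
      .values
      .flatMap { entries =>
        for {
          ((c1, _), i1) <- entries
          ((c2, _), i2) <- entries
          if c1 < c2
        } yield ((c1, c2), i1.overlapSeconds(i2).toDouble)
      }
      .groupMapReduce(_._1)(_._2)(_ + _)
}
```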
Another concern was balancing competing objectives: is it more important to associate every entity (node) with a cluster, or to produce smaller clusters? We addressed this with a post-hoc pruning step, described below.
One line of thinking in vogue at the time was that Conviva should purchase third-party labels that classified IP addresses as residential or non-residential. That meant the v2 solution needed to provide a mechanism for a priori filtering.
Postal codes were another feature of interest. Should they inform whether nodes could be members of the same cluster? So I added a knock-in / knock-out feature that, again, could be toggled to fit the question of interest.
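A minimal sketch of how these two toggles might look, under one possible reading of the semantics (all names are hypothetical): a node-level filter applied before graph construction, and a pairwise rule consulted whenever two nodes might be placed in the same cluster.

```scala
// Hypothetical node attributes derived from third-party labels and postal data.
final case class Node(id: String, isResidential: Boolean, postalCode: Option[String])

sealed trait PostalRule
case object PostalIgnored extends PostalRule  // postal codes play no role
case object KnockIn       extends PostalRule  // matching codes required to co-cluster
case object KnockOut      extends PostalRule  // matching codes forbidden to co-cluster

final case class Constraints(residentialOnly: Boolean, postalRule: PostalRule) {

  // A-priori node filter driven by the purchased residential labels.
  def keepNode(n: Node): Boolean = !residentialOnly || n.isResidential

  // Pairwise constraint; nodes missing a postal code are left unconstrained.
  def mayCoCluster(a: Node, b: Node): Boolean = postalRule match {
    case PostalIgnored => true
    case KnockIn  => (for (x <- a.postalCode; y <- b.postalCode) yield x == y).getOrElse(true)
    case KnockOut => (for (x <- a.postalCode; y <- b.postalCode) yield x != y).getOrElse(true)
  }
}
```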
Another feature component evolved from a request to optionally map IPv6 addresses to a network prefix, and to optionally map client identifiers to a device fingerprint.
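A small sketch of those optional mappings, assuming a /64 prefix for IPv6 and a caller-supplied fingerprint lookup (both assumptions):

```scala
import java.net.InetAddress

// Hypothetical sketch of the optional identifier mappings.
object Canonicalize {

  // Keep the first 64 bits of an IPv6 address (its assumed /64 network
  // prefix) and zero the rest; IPv4 addresses pass through unchanged.
  def toPrefix(ip: String): String = {
    val bytes = InetAddress.getByName(ip).getAddress
    if (bytes.length == 16)                                 // IPv6 is 16 bytes
      InetAddress.getByAddress(bytes.take(8) ++ Array.fill[Byte](8)(0)).getHostAddress
    else ip
  }

  // Collapse a client identifier to a device fingerprint via a caller-supplied
  // function (e.g. a lookup into an assumed fingerprint table).
  def toDevice(clientId: String, fingerprint: String => String): String =
    fingerprint(clientId)
}
```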
With edges now carrying semantic meaning, defined by the choice of association metric, it made sense to enable thresholding of low-value edges.
For the most part, the above amounted to easily configurable preprocessing steps, after which the Louvain algorithm was applied. However, the knock-in / knock-out feature required an additional blacklisting / whitelisting step inside each Louvain iteration.
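One way the pieces might compose, as a sketch: preprocessing as a pipeline of independently toggleable edge transforms, with the knock-in / knock-out check surfacing as a candidate filter inside each Louvain pass. The names and signatures here are assumptions:

```scala
object Pipeline {
  type Edges = Map[(String, String), Double]

  // Drop edges whose weight, under the chosen association metric, falls
  // below a configurable threshold.
  def thresholdEdges(minWeight: Double)(edges: Edges): Edges =
    edges.filter { case (_, w) => w >= minWeight }

  // Compose independently toggleable preprocessing steps.
  def preprocess(steps: Seq[Edges => Edges])(edges: Edges): Edges =
    steps.foldLeft(edges)((acc, step) => step(acc))

  // The knock-in / knock-out hook inside a Louvain pass: a node may only
  // move into a community whose current members it is allowed to co-cluster
  // with. `members` and `mayCoCluster` are assumed, caller-supplied functions.
  def admissibleCommunities(node: String,
                            candidates: Set[Int],
                            members: Int => Set[String],
                            mayCoCluster: (String, String) => Boolean): Set[Int] =
    candidates.filter(c => members(c).forall(mayCoCluster(node, _)))
}
```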
Finally, I instituted a post-processing pruning step in which vertices were removed from a cluster when their removal did not significantly impact the modularity.
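A sketch of that criterion using standard weighted modularity (assuming an undirected graph, no self-loops, and every node assigned a community); a production version would update modularity incrementally instead of recomputing it per candidate:

```scala
object Pruning {
  type Edges = Map[(String, String), Double]

  // Weighted degree of each node: every incident edge contributes its weight.
  private def degrees(edges: Edges): Map[String, Double] =
    edges.toSeq
      .flatMap { case ((u, v), w) => Seq(u -> w, v -> w) }
      .groupMapReduce(_._1)(_._2)(_ + _)

  // Newman's weighted modularity for a node -> community assignment.
  def modularity(edges: Edges, community: Map[String, Int]): Double = {
    val m   = edges.values.sum                              // total edge weight
    val deg = degrees(edges)
    val internal = edges.collect {
      case ((u, v), w) if community(u) == community(v) => w
    }.sum
    val expected = community.keys
      .groupBy(community)                                   // nodes per community
      .values
      .map(ns => math.pow(ns.map(deg.getOrElse(_, 0.0)).sum, 2))
      .sum / (4 * m * m)
    internal / m - expected
  }

  // Detach a vertex into a singleton whenever doing so costs < epsilon modularity.
  def prune(edges: Edges, community: Map[String, Int], epsilon: Double): Map[String, Int] = {
    val fresh = Iterator.from(community.values.max + 1)     // unused community ids
    community.keys.foldLeft(community) { (assign, node) =>
      val detached = assign.updated(node, fresh.next())
      if (modularity(edges, assign) - modularity(edges, detached) < epsilon) detached
      else assign
    }
  }
}
```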