Cross-device tracking, or record linkage (RL), is one of the biggest challenges for digital marketers today. Correctly identifying the same users behind millions of distinct devices is not only a problem of identifying and applying the correct Machine Learning (ML) technique, but also of doing so in a feasible way, since the vast amount of data associated with problems like this one can challenge the boundaries of technologies like Apache Spark.
In the context I want to write about today, record linkage is the task of finding records in a data set that refer to the same user across different data sources (e.g., data files, books, websites, databases). It is a vital task when joining data sets based on users that may or may not share common identifiers (e.g., database key, URI, national identification number).
A data set that has undergone RL-oriented reconciliation may be referred to as being cross-linked. Record linkage is called data linkage in many jurisdictions, but it is the same process.
Suppose you are given a dataset characterised by the following specification:
The question is: what kind of techniques would you use in order to identify the same user behind different devices, i.e. to correctly group devices that belong to the same user?
This is a multifaceted problem with no single, optimal solution, because of the complexity of the available information and of the many ways one can use that data to identify the same users behind distinct devices. Arguably, there is enough information within the above dataset to achieve a highly accurate solution by simply taking a brute-force approach: comparing and analysing all pair-combinations of all records (a cartesian product). However, if we are talking about processing millions of records with numerous features, and doing so daily through a batch process, such a solution is infeasible. And in reality, as we will cover later in this article, we would typically want to enrich this data set with additional data and apply a number of feature engineering techniques, compounding the computational challenge.
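To get a sense of why the brute-force route is off the table, here is a quick back-of-the-envelope count of the pairwise comparisons it implies (a minimal Python sketch; the ten-million device count is purely illustrative, not a figure from the dataset):

# Brute-force record linkage means comparing every pair of device records.
n = 10_000_000                       # illustrative number of device records
pairs = n * (n - 1) // 2             # number of distinct pairs, i.e. n-choose-2
print(f"{pairs:,} candidate pairs")  # roughly 50 trillion comparisons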
The real challenge here is finding a shortcut: how to avoid unnecessarily comparing records that are very unlikely to be linked in any way. Generally, one needs to come up with a tractable solution that would optimally employ available resources and technologies to tackle the problem, and this is what this article is about.
SO, SHOULD YOU GO WITH A SUPERVISED OR WITH AN UNSUPERVISED ML APPROACH?
Choosing between the two depends on the available information. Supervised learning requires a labelled dataset; in other words, one needs a reliable way of identifying records of devices that belong to the same user. If such labels can somehow be acquired, those records can be used as a training set to produce a supervised learning model. In the case of our data, a reliable way of identifying the same user behind different devices could be based on IP information: simply put, if two devices share multiple common IPs, then it is very likely that they belong to the same user. However, IPs cannot be the only relationship indicator, and even if they serve as a credible way of "labelling" records to produce a reliable training set, it is very often the case that only a small fraction of records satisfies this requirement. The training sample would therefore be very small, and you would end up with an overfitted model that is unable to generalise. On top of that, an IP is only a proxy for a user: for example, a single public IP may be shared across many private IPs (via NAT) into which we have no visibility.
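For completeness, if one did want to attempt the supervised route, the IP-based labelling described above might look something like the following PySpark sketch. The input path, column names (device_id, ip) and the shared-IP threshold are all assumptions made for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cross-device").getOrCreate()

# Hypothetical raw data: one row per observed event (device_id, ip, url, timestamp, ...).
events = spark.read.parquet("events.parquet")

# Distinct (device, IP) observations.
device_ips = events.select("device_id", "ip").distinct()

# Self-join on IP to find device pairs that have been seen behind the same addresses.
a = device_ips.alias("a")
b = device_ips.alias("b")
pairs = (
    a.join(b, F.col("a.ip") == F.col("b.ip"))
     .where(F.col("a.device_id") < F.col("b.device_id"))      # drop self-pairs and duplicates
     .groupBy(F.col("a.device_id").alias("dev_a"),
              F.col("b.device_id").alias("dev_b"))
     .agg(F.countDistinct(F.col("a.ip")).alias("shared_ips"))
)

# Treat a pair as "same user" only when it shares several distinct IPs; the threshold is arbitrary.
labelled = pairs.withColumn("same_user", (F.col("shared_ips") >= 3).cast("int"))

Even then, the resulting labelled pairs would typically cover only a small fraction of the devices, which is exactly the overfitting concern raised above.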
In this respect, an unsupervised approach is more reasonable though very challenging in terms of its complexity — i.e. the number of combinations one needs to consider. In order to minimise that number one has to rely on numerous statistics as well as common sense.
REDUCING THE COMPLEXITY OF THE PROBLEM: A PRAGMATIC ANALYSIS
For instance, it is relatively safe to assume that most users only have a single mobile device; ergo, mobile devices can serve as centroids (distinct users) in an unsupervised clustering approach. Therefore, it seems redundant (or unnecessarily pedantic) to merge mobile devices. It is worth noting that tablets may operate as mobile devices, but here I am only referring to mobile phones. Business phones, however, pose a problem. Nevertheless, it all comes down to how we use each device. It is very common for people not to use business mobiles for personal purposes, so at worst, these would be picked up as separate users. There are ways to counter this problem, but an analysis that deep would deviate from the purpose of this post, and thus we choose to ignore it.
Further, in relation to the processing challenges, one has to account for two things.
First, a considerable amount of data is available for every device. The initial clustering steps will start after a GROUP BY operation in which records are grouped on unique device IDs (MAC addresses). For every device, a large number of features would emerge relevant to:
Second, that information has to be processed in a way that allows for an initial clustering step, necessary for splitting the initial set into smaller, more manageable subsets of relevant information. This processing involves identifying and quantifying features relevant to each unique device (e.g. related to location or web content), which would in turn serve as the input for the ML algorithms applied later. In short, the overall approach follows the classic "divide and conquer" paradigm, whereby a problem is divided into a set of more manageable and less complex sub-problems, the solutions of which combine into the overall solution of the problem.
DATA ENRICHMENT & FEATURE ENGINEERING
Let's start by assuming the raw dataset consists of the information analysed in the above description. At this stage, every record concerns a unique device and an activity, along with an IP and possibly web-content information (or a mobile application used). The first step would be to enrich these records with additional location information.
By running a simple curl command, as follows:
christoshadjinikolis$ curl ipinfo.io/86.23.38.108
{
  "ip": "96.33.48.118",
  "hostname": "cpc1-finc16-2-0-cust.xxxxx.xxxx-.virginm.net",
  "city": "Edgware",
  "region": "England",
  "country": "GB",
  "loc": "49.6710,-0.5687",
  "org": "Sky Limited",
  "postal": "SE17"
}
One can easily obtain valuable information about the location from which the URL/mobile app was accessed. Each newly acquired field can extend the record's existing columns/features.
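As a sketch of how this enrichment might be wired into Spark, one could wrap such a lookup in a UDF and apply it to the distinct IPs only (continuing with the hypothetical events DataFrame from the earlier sketch; note that the free ipinfo.io endpoint is rate-limited, so a production job would more likely use a local GeoIP database):

import json
import urllib.request

from pyspark.sql import functions as F
from pyspark.sql.types import MapType, StringType

def ip_lookup(ip):
    """Query ipinfo.io for one IP and keep the location-related fields (no error handling)."""
    with urllib.request.urlopen(f"https://ipinfo.io/{ip}/json", timeout=5) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    return {k: data.get(k) for k in ("city", "region", "country", "loc", "org", "postal")}

ip_lookup_udf = F.udf(ip_lookup, MapType(StringType(), StringType()))

# Look up only the distinct IPs (calling an external service once per raw record would be wasteful),
# then join the results back onto the raw events.
ip_info = events.select("ip").distinct().withColumn("info", ip_lookup_udf("ip"))

enriched = (events.join(ip_info, on="ip", how="left")
                  .withColumn("city", F.col("info")["city"])
                  .withColumn("country", F.col("info")["country"])
                  .withColumn("loc", F.col("info")["loc"]))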
The next step would be to GROUP BY unique (device) id (at this point I am assuming use of the Apache Spark framework, and that all operations are applied on Spark DataFrames). For every device, all information related to the same feature would be integrated into a list, e.g. a list of IPs, a list of locs and so on.
However, in the case of websites/applications accessed by a device, those would be kept as distinct features. For example, if a device has accessed facebook, then facebook would be a feature and its value would be equal to the number of times that the device has accessed it, e.g. if it had accessed facebook 4 times that value would be 4, 10 for google, 5 for expedia etc. As a result, a significant number of features would emerge for every device.
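In PySpark terms, the two steps above roughly amount to a groupBy with collect_list aggregations plus a pivot over the visited URLs. This is only a sketch over the assumed schema used throughout this post:

from pyspark.sql import functions as F

# 'enriched' is the per-event DataFrame from the previous sketch.
# Collapse every device into a single row, collecting the per-feature lists.
per_device = (enriched.groupBy("device_id")
                      .agg(F.collect_list("ip").alias("ips"),
                           F.collect_list("loc").alias("locs"),
                           F.collect_list("timestamp").alias("timestamps")))

# Turn each visited site into its own count feature (e.g. facebook = 4, google = 10, ...).
# Beware: pivot creates one column per distinct URL, so frequency filtering matters.
url_counts = (enriched.groupBy("device_id")
                      .pivot("url")
                      .count()
                      .na.fill(0))

device_features = per_device.join(url_counts, on="device_id", how="left")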
We are all creatures of routine. As such, we follow very specific patterns in our everyday lives: we go to work in the morning, where we spend most of our day, and return home at night, where we eat and sleep. It is therefore generally expected that if a device has multiple IPs, those would cluster around at most two main locations. For the list of location data under the loc feature, I would therefore suggest the following device-centric approach:
The process averages to new centroid values in every iteration, concluding with only two 'averaged' loc values that represent the two most frequently visited locations for the device.
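A minimal sketch of such a per-device, two-centroid averaging step is shown below. Since each device only carries a handful of coordinates, plain Python (e.g. wrapped in a UDF applied to the locs lists) is sufficient, and squared Euclidean distance on raw latitude/longitude is a deliberate simplification:

import random

def two_location_centroids(points, iterations=20):
    """Tiny k-means with k = 2 over one device's (lat, lon) observations.
    Returns two averaged coordinates meant to approximate the device's two most
    frequently visited locations (e.g. 'home' and 'work')."""
    if len(points) < 2:
        return list(points)
    centroids = random.sample(points, 2)
    for _ in range(iterations):
        clusters = ([], [])
        for lat, lon in points:
            # Assign each point to the nearer of the two current centroids.
            d0 = (lat - centroids[0][0]) ** 2 + (lon - centroids[0][1]) ** 2
            d1 = (lat - centroids[1][0]) ** 2 + (lon - centroids[1][1]) ** 2
            clusters[0 if d0 <= d1 else 1].append((lat, lon))
        # Re-average each cluster into a new centroid (keep the old one if a cluster is empty).
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

Applied to each device's locs list, this could populate hypothetical home_lat/home_lon and work_lat/work_lon columns, which are reused in a later sketch.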
A similar approach can be employed to identify activity periods for each device: for example, break the 24h day cycle into three 8h-long parts and use the device activity to identify the sleeping/inactive part. These three parts can then be used as additional binary features. One can of course break these "activity periods" into more than just three parts, and we could even allow the time ranges to overlap; however, this would increase the solution's complexity.
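A sketch of how these activity-period flags could be derived from the raw timestamps, again assuming the enriched DataFrame and schema from the earlier sketches:

from pyspark.sql import functions as F

# Bucket each event into one of three 8-hour windows: 00-08, 08-16, 16-24.
with_period = enriched.withColumn("period", (F.hour("timestamp") / 8).cast("int"))  # 0, 1 or 2

# Count events per device per window, then turn the counts into binary activity flags.
activity_flags = (with_period.groupBy("device_id")
                             .pivot("period", [0, 1, 2])
                             .count()
                             .na.fill(0))

activity_flags = activity_flags.select(
    "device_id",
    *[(F.col(str(p)) > 0).cast("int").alias(f"active_{p*8:02d}_{(p+1)*8:02d}") for p in (0, 1, 2)]
)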
THE FILTERING PRAGMATISM
Next comes the filtering pragmatism. At this point a catalytic approach is necessary to significantly reduce the number of comparisons one has to attempt in order to identify devices possibly owned by the same users. This step must be based on various statistical/demographic characteristics of how people use the internet and mobile applications. Here, I present only one such characteristic, which is rather simple yet adequate for achieving a concrete subset division. After all:
"...when clustering data, no more assumptions should be made than necessary!" (yes—I am paraphrasing Ocam's razor).
The characteristic in question relates to how often users access the same websites throughout their day, irrespective of whether they do so from a tablet, a mobile phone or a laptop, i.e. "what" they access and "when". In addition, it is generally the case that people, due to routine, operate mostly around two central location cores, i.e. the "where". Hence, location data can also be used to divide the original set into smaller, internally more relevant and more manageable subsets. The frequency with which a user accesses a certain site, together with the available location and timestamp data, can therefore be very helpful.
I would start by splitting the data into sets where the country is the same. If a device has multiple country values, I would assume that the device's country is the most frequent one.
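Picking the most frequent country per device is a simple window operation; a sketch under the same assumed schema:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count how often each device was seen in each country, keep the most frequent one.
country_counts = enriched.groupBy("device_id", "country").count()

w = Window.partitionBy("device_id").orderBy(F.desc("count"))
device_country = (country_counts
                  .withColumn("rank", F.row_number().over(w))
                  .where(F.col("rank") == 1)
                  .select("device_id", "country"))

# Join the dominant country back onto the per-device feature table so the data can be split by it.
device_features = device_features.join(device_country, on="device_id", how="left")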
I would then apply DIvisive ANAlysis clustering (DIANA) or the k-means algorithm (with k = 2, applied iteratively until satisfactorily small subsets are produced) only on the URL-frequency features and the time-range features (note that time ranges can also be modelled to represent the frequency with which one is active during certain periods of the day, rather than using a binary representation). This is a challenging step due to the immense number of possible websites accessed by each device. To reduce that number, one may impose filtering restrictions: sites accessed less frequently than a certain threshold are filtered out. Furthermore, as there are on the order of 2^n ways of splitting a cluster of n records, heuristics are necessary. A reasonable approach would be to prioritise the most popular websites first (based on relevant statistics) and extend the search beyond those to a desirable depth. Obviously, this process needs to be terminated before the subdivided clusters are exhaustively reduced to singletons.
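Spark MLlib does not ship DIANA, but its BisectingKMeans is a divisive clusterer that behaves much like repeatedly applying k-means with k = 2, so it can stand in here. The following sketch assumes the feature tables built in the earlier sketches; the number of leaf clusters is arbitrary, and URL-derived column names would likely need sanitising first (dots in column names are awkward in Spark):

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import BisectingKMeans

# Assemble the URL-count features and the activity flags into one vector per device.
clustering_input = device_features.join(activity_flags, on="device_id", how="left").na.fill(0)
feature_cols = ([c for c in url_counts.columns if c != "device_id"] +
                [c for c in activity_flags.columns if c != "device_id"])

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
vectors = assembler.transform(clustering_input)

# Bisecting k-means repeatedly splits the largest cluster in two, mirroring the
# "apply k-means with k = 2 until the subsets are manageable" idea; k caps the
# number of leaf clusters and is purely illustrative here.
bkm = BisectingKMeans(k=50, featuresCol="features", predictionCol="subset_id", seed=42)
subsets = bkm.fit(vectors).transform(vectors)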
I would then use the loc data related to the device's two main geo-coordinates identified in the previous steps. Here, the main objective would again be to further split the devices into subsets that share common, general location data. Focus, however, would concentrate around the two mean geo-coordinates assumed to represent the two most frequent locations where a device is usually found. Note that the overall accuracy of the approach could increase if this number were increased to three locations. A note on the order of steps: steps 2 and 3 could be executed in reverse order and produce better or worse results; however, step 3 is more expensive, and trying it before step 2 would increase the complexity of the solution.
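A possible sketch of this location-based refinement, assuming the per-device home/work coordinates from the earlier two-centroid step have been joined onto the subsets produced above:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

# Assumes hypothetical columns home_lat, home_lon, work_lat, work_lon exist on 'subsets'.
loc_assembler = VectorAssembler(
    inputCols=["home_lat", "home_lon", "work_lat", "work_lon"],
    outputCol="loc_features")

loc_vectors = loc_assembler.transform(subsets)

# A coarse k-means over the two main locations; intersecting its labels with the
# content-based subset_id from the previous step yields the finer partition.
# k is again an arbitrary illustrative value.
km = KMeans(k=10, featuresCol="loc_features", predictionCol="loc_subset_id", seed=42)
final_subsets = km.fit(loc_vectors).transform(loc_vectors)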
A PROBABILISTIC CLUSTERING APPROACH
In this final step, assuming that the previous steps have yielded relatively small and thus more manageable subsets, I would apply a k-means clustering algorithm with a custom-made distance formula for merging devices (every record is a device at this stage) into a single device representing a single user. The candidate centroids in every produced subset would be all of the mobile devices (i.e. k = the number of mobile devices). In every iteration, the comparison formula would yield a match probability for every pairing of a centroid with an unmatched record, which, if above a certain threshold, would be accepted for merging. The iterations would continue until no more merging is triggered. Finally, the (non-centroid) records (devices) that remain singular could enter a new clustering round, repeating all previous steps and finally trying to join one of the newly formed centroids (those produced in previous rounds).
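The merging loop for a single subset could look roughly like the following plain-Python sketch. The similarity function, which would compare shared IPs, overlapping locations, common sites and so on to produce a match probability, is left abstract, and the threshold and round limit are arbitrary:

def cluster_subset(centroids, others, similarity, threshold=0.8, max_rounds=10):
    """Greedy merge loop for one subset. 'centroids' are the mobile devices, 'others'
    the remaining device records, and 'similarity(cluster, device)' a user-supplied
    function returning a match probability in [0, 1]."""
    clusters = {c["device_id"]: [c] for c in centroids}
    unmatched = list(others)

    for _ in range(max_rounds):
        merged_any = False
        still_unmatched = []
        for device in unmatched:
            # Score the device against every cluster (centroid plus devices merged so far).
            best_id, best_score = None, 0.0
            for cid, members in clusters.items():
                score = similarity(members, device)
                if score > best_score:
                    best_id, best_score = cid, score
            if best_score >= threshold:
                clusters[best_id].append(device)   # merge: this cluster now represents one user
                merged_any = True
            else:
                still_unmatched.append(device)
        unmatched = still_unmatched
        if not merged_any:
            break   # no merges triggered in this round, so stop iterating

    return clusters, unmatched   # 'unmatched' devices can enter the next clustering round

Because each cluster grows as devices are merged into it, later rounds can trigger matches that earlier rounds missed, which is what the "iterate until no more merging is triggered" condition captures.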
EXPECTED ACCURACY & OTHER CONSIDERATIONS
There are still numerous factors that can be taken into account. In this analysis, I focused on the most obvious or easiest-to-work-with ones, in order to provide an abstract (high-level) analysis of the problem. A more concrete approach would consider additional features or user characteristics related to our behavioural patterns:
Furthermore, simple rules can also be employed to assist with the clustering approach. For instance, it is highly unlikely that one user would access two websites from two different devices at the same time, or even with relatively similar timestamps (unless they are characterised by advanced multitasking skills!). Also, timestamps are something I barely touched on in my analysis, and they can definitely be exploited in more than one way.
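As a sketch of how such a rule could be computed (within the already reduced subsets, since it is a pairwise comparison), one might count how often two devices are active within a few seconds of each other; the window and threshold below are arbitrary illustrative values:

from pyspark.sql import functions as F

# 'enriched' is the per-event DataFrame from the earlier sketches.
a = enriched.select(F.col("device_id").alias("dev_a"), F.col("timestamp").alias("ts_a"))
b = enriched.select(F.col("device_id").alias("dev_b"), F.col("timestamp").alias("ts_b"))

simultaneous = (
    a.join(b, (F.col("dev_a") < F.col("dev_b")) &
              (F.abs(F.col("ts_a").cast("long") - F.col("ts_b").cast("long")) <= 10))
     .groupBy("dev_a", "dev_b")
     .count()
     .where(F.col("count") >= 20)   # device pairs that are active simultaneously too often
)

# Pairs in 'simultaneous' can then be excluded from ever being merged into the same user.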
In terms of the overall accuracy of the solution, I would expect fairly reasonable results despite the simplicity of the approach. I would also expect, however, that the process would be very educational in terms of identifying and validating significant factors in this clustering process, evaluating the credibility of certain assumptions, and thus improving the proposed solution accordingly. After all, Machine Learning is partly about trial and error! In this sense, failing fast can lead to a good-enough approximation at a later stage, faster than anticipated.
Hopefully this post will prompt some discussion. I'd love to hear from anyone thinking of (or actually) applying this type of approach, or an alternative. Get in touch!