Today's post stems from a conversation I had with a colleague concerning the tradeoff between digital identity and digital privacy. Privacy and ownership of your own data are increasingly being pushed in the technology market, both as a regulatory issue and as a product people will actually pay a premium for.
This however clashes with two key pillars of the current Internet.
The first pillar is that the Internet is free. Free content on the internet is subsidized by advertising - an industry which generates around 200 billion USD in revenue per year, and which is expected to grow to half a trillion by 2025. For that to happen, our ad overlords need to know what you look at and how that influences your spending decisions.
The second pillar is that the internet is anonymous. And anonymity is a boon for bad actors - such as scammers, fraudsters, and thieves. Around 25 billion USD a year are lost in fraudulent credit card transactions - just the tip of the iceberg when you consider that the industry at large doesn't care about (and doesn't track) rampant fraud on so-called non-guaranteed payment methods, identity theft, online scams, etc.
With this much money in play, I seriously couldn't believe a cookie wall or toothless iOS update would ever prevent us from being fingerprinted wherever we surf.
And sure enough, it's plenty possible to uniquely identify your device without triggering any national or international regulation, without asking for consent, and without triggering any privacy measure on any platform (including Apple's).
Here's how.
Recipe difficulty:
- Statistics: 🔥🔥 - A short digression on how to calculate hash collision probabilities, but you can safely skip it.
- Technical: 🔥🔥 - A few things to wrap your head around.
- Time required: 🕜 - A couple hours, longer if you start researching more clever ways of tracing people online.
What it is:
A short exploration of client-side techniques to profile your devices in a unique way, entirely legally, but without ever asking for permission.
Why it matters:
Mostly to make the reader realize that Privacy as a Product is largely a cover. Also, if you wanted to generate unique device identifiers, say, for a risk engine, analytics suite, customer profiling or backend performance tuning - this is how you do it.
How it works:
I set out to investigate a set of fingerprinting techniques with the following constraints:
- No requirement for the user to accept, input, or install anything.
- Does not breach privacy regulation.
- Difficult to tamper with without significant technical effort.
- Identifies a single device with high likelihood, with significant stability across multiple client visits.
To understand how this can be accomplished, here's a very short overview using the simplest identifier of all: your device's screen size in pixels.
We can use javascript to query your browser for your screen's size, like this:
var screenSize = String('width: '+screen.width+', height: '+screen.height);
This is done in your browser, without any specific permission request. Constraint #1 is therefore fulfilled, as this is silent fingerprinting.
Storing this data does not breach any GDPR-class regulation, since your device's screen size is not Personally Identifiable Data - it is just likely to be unique. And it does maintain privacy: all we need to do for that is take the measurement, hash it (more on this later) into a fixed-length element that cannot be reversed back into raw data, and submit it for storage.
The third and fourth constraints, however, are not yet met. First of all, this fingerprint is relatively easy to tamper with: just change your screen resolution and you have a brand new device fingerprint. More importantly, there's a very limited number of possible screen size combinations and a very large number of devices, which means we wouldn't be able to tell your device apart from other devices with the same screen with any high likelihood.
We can fix both issues by adding more fingerprint datapoints. The more measurements are added, the more unique your set of fingerprints becomes. And of course, the harder it becomes to tamper with your device to assume an entirely new identity.
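As a rough illustration of combining several datapoints, here is a minimal sketch. The helper name `collectDatapoints` and the particular choice of properties are my own assumptions for the example, not the list used in the full exploration. The `env` parameter exists only so the function can be tried outside a browser with a stub.

```javascript
// Hypothetical helper: gathers a few datapoints into one stable string.
// `env` defaults to the browser globals; pass a stub object to run elsewhere.
function collectDatapoints(env) {
  env = env || { screen: window.screen, navigator: window.navigator };
  var parts = [
    'width:' + env.screen.width,
    'height:' + env.screen.height,
    'colorDepth:' + env.screen.colorDepth,
    'language:' + env.navigator.language,
    'platform:' + env.navigator.platform
  ];
  // Joining in a fixed order keeps the string identical across visits.
  return parts.join('|');
}
```

In a browser, calling `collectDatapoints()` with no argument reads the real `screen` and `navigator` objects. Each extra property narrows down the set of devices that share the same string.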
There's an entire cottage industry devoted to finding more and more clever ways of fingerprinting devices. For this article, I'm going to focus on three general classes: the browser data class, the audio class and the video class.
Browser data
The first and most obvious way to uniquely identify the device and make the fingerprint tamper-proof is to add more data points. The browser itself can leak an extensive number of data items to the client app. Most of these are pretty trivial to fingerprint - the really interesting part is that the breadth of data available makes it very easy to cross-reference the data received.
For instance, we can obtain the browser's user agent and a list of installed plugins using the navigator object, like this.
var userAgent = navigator.userAgent;
var installedPlugins = navigator.plugins;
Since some plugins are only available on certain platforms, this gives us a way to detect device tampering, for instance if the User Agent is being spoofed but the list of plugins points to the actual browser being used. Of course, you can also maintain a blacklist of suspicious plugins, and even of Chrome/Firefox extensions, that can be used to spoof a device.
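The cross-referencing idea can be sketched as a small rule check. Everything here is illustrative: `pluginNames` and `findInconsistencies` are hypothetical helpers, and the rule format (a plugin name paired with a token the User Agent is expected to contain) is an assumption, as is the made-up plugin name in the usage note.

```javascript
// navigator.plugins is array-like, so copy the plugin names into a plain array.
function pluginNames(plugins) {
  return Array.prototype.map.call(plugins, function (p) { return p.name; });
}

// Returns the rules that fire: the plugin is present, but the User Agent
// does not contain the token we would expect on that plugin's platform.
function findInconsistencies(userAgent, names, rules) {
  return rules.filter(function (rule) {
    return names.indexOf(rule.plugin) !== -1 &&
           userAgent.indexOf(rule.expectedUaToken) === -1;
  });
}
```

In a browser this could be called as `findInconsistencies(navigator.userAgent, pluginNames(navigator.plugins), rules)`, with `rules` being something like `[{ plugin: 'ExamplePlugin', expectedUaToken: 'Windows' }]`. A non-empty result hints that one of the two values may have been spoofed.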
As the number of data items collected increases, the chances of having a full spoof decrease considerably. But we can do even worse, by profiling the hardware itself.
Audio data
The first way to do so is to use the Web Audio API, which is supported in virtually all modern browsers. By creating a new AudioContext like this:
var audioCtx = new (window.AudioContext || window.webkitAudioContext)();
we are able to profile a system's audio information and extract unique information about your audio hardware. This is of course a lot more unique, and more difficult to spoof than the old browser data, while still being privacy-proof and requiring no consent.
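One commonly described way to turn this into a fingerprint is to render a fixed signal offline and condense the resulting samples into a single value. The sketch below follows that approach and is not necessarily what the full exploration does. It assumes a browser with a promise-based `OfflineAudioContext.startRendering()`; the oscillator settings and the `summarizeSamples` helper are arbitrary choices for the example.

```javascript
// Condenses rendered samples into one value by summing absolute amplitudes.
function summarizeSamples(samples) {
  var sum = 0;
  for (var i = 0; i < samples.length; i++) {
    sum += Math.abs(samples[i]);
  }
  return sum.toString();
}

// Resolves to a fingerprint string, or to null when the API is unavailable.
function audioFingerprint() {
  var Offline = (typeof window !== 'undefined') && window.OfflineAudioContext;
  if (!Offline) {
    return Promise.resolve(null);
  }
  // One channel, one second of audio at 44.1 kHz, rendered silently.
  var ctx = new Offline(1, 44100, 44100);
  var oscillator = ctx.createOscillator();
  oscillator.type = 'triangle';
  oscillator.frequency.value = 10000;
  // The compressor is where implementations tend to differ slightly.
  var compressor = ctx.createDynamicsCompressor();
  oscillator.connect(compressor);
  compressor.connect(ctx.destination);
  oscillator.start(0);
  return ctx.startRendering().then(function (buffer) {
    return summarizeSamples(buffer.getChannelData(0));
  });
}
```

Nothing is ever played through the speakers: the signal is rendered into a buffer, and small differences in how each system processes it show up in the summed value.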
Video data
We can do the same to a computer's GPU, by profiling the unique way your computer draws a certain image using the Canvas graphics environment, like this:
var graphicsCtx = document.createElement('canvas').getContext('2d');
By drawing some pixels and reading the canvas content back to a string format, it's possible to uniquely profile your graphics system.
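A minimal sketch of the draw-and-read-back step, assuming a 2D canvas is available. The text, colours and sizes are arbitrary choices for the example, and the optional `doc` parameter is only there so the function can be exercised with a stub outside a browser.

```javascript
// Draws a fixed scene and returns the canvas content as a data URL string.
// Returns null when no document or 2D context is available.
function canvasFingerprint(doc) {
  doc = doc || (typeof document !== 'undefined' ? document : null);
  if (!doc) {
    return null;
  }
  var canvas = doc.createElement('canvas');
  canvas.width = 240;
  canvas.height = 60;
  var ctx = canvas.getContext('2d');
  if (!ctx) {
    return null;
  }
  // Text rendering depends on fonts, anti-aliasing and the graphics stack.
  ctx.textBaseline = 'top';
  ctx.font = '16px Arial';
  ctx.fillStyle = '#f60';
  ctx.fillRect(10, 10, 100, 30);
  ctx.fillStyle = '#069';
  ctx.fillText('fingerprint sketch', 4, 20);
  return canvas.toDataURL();
}
```

The returned string is long, so in practice it would be hashed before being stored or compared.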
Hashing
The output from any profiling we've done so far is a string containing a set of likely unique identifiers about your system. What we want now is to take these very long fingerprints and reduce them to a (likely unique) set of device fingerprint hashes.
A hash is basically the output of a function that maps arbitrarily sized data onto a fixed-length identifier. This identifier is usually shorter than the input data, which has several significant benefits:
- A shorter hash is easier and faster to transmit back to the backend for storage
- Regardless of the amount of data collected during fingerprinting, the storage and comparison of hashed data will always have a fixed upper processing bound
- A fixed hash length effectively maps data down to a lower dimension, which can be quite useful, e.g. if we want to feed our data to a machine learning engine.
- A cryptographic, non-reversible hash allows for storage and handling of sensitive data.
The disadvantage of a hash is the risk of collisions; that is, two completely different devices with different fingerprints yielding exactly the same hash. This is a statistical trade-off, and the likelihood of two hashes colliding (assuming the hash is generated from a uniform distribution) can be derived as follows:
Imagine we have a hash function that maps arbitrary values onto N unique hashes. The first input we hash is unique with absolute certainty. The second value we hash now has
\[ \frac{N-1}{N} \]
chances of being unique. As we hash a third, a fourth, ... up to a k-th element, the probability that all k hashes are unique becomes the expression:
\[ \frac{N-1}{N}\times\frac{N-2}{N}\times\dots\times\frac{N-(k-2)}{N}\times\frac{N-(k-1)}{N} \]
To simplify this product we go back to the Taylor expansion for the exponential function:
\[ e^{x} = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + ... \]
When x is much smaller than 1 in absolute value, the higher powers of x become negligible. Hence, \(\text{for } |x| \ll 1 \rightarrow e^{x} \approx 1 + x \).
This means we can approximate certain fractions with an exponential, as long as a is much smaller than b:
\[ \frac{b-a}{b} = 1 + (-\frac{a}{b}) \approx e^{-\frac{a}{b}} \]
Taking this back to our product, we see that:
\[ \frac{N}{N}\times\frac{N-1}{N}\times\dots\times\frac{N-(k-1)}{N} \approx e^{-\frac{0}{N}} \times e^{-\frac{1}{N}}\times\dots\times e^{-\frac{k-1}{N}} \]
The exponents add up to \( 0 + 1 + \dots + (k-1) = \frac{k(k-1)}{2} \), which is roughly \( \frac{k^2}{2} \) for large k. This means the probability that all k hashed values are unique, for N possible hashes, can be simplified as:
\[ e^{-\frac{k^2}{2N}} \]
The probability of having at least one collision is \( 1 - e^{-\frac{k^2}{2N}} \)
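To get a feel for the numbers, the approximation can be evaluated directly. This is a small sketch: the function names are made up for the example, and the second helper simply inverts the formula to find how many hashes give a chosen collision probability.

```javascript
// Approximate probability of at least one collision among k hashes,
// drawn uniformly from N possible values: 1 - e^(-k^2 / 2N).
function collisionProbability(k, N) {
  return -Math.expm1(-(k * k) / (2 * N));
}

// Inverts the formula: how many hashes until the collision probability is p?
function hashesForProbability(p, N) {
  return Math.sqrt(-2 * N * Math.log(1 - p));
}
```

For a 32-bit hash, `hashesForProbability(0.5, Math.pow(2, 32))` comes out at roughly 77,000, which is why short hashes only suit small populations of devices.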
In our case, I have chosen for simplicity a hashing function that replicates Java's hashCode. It produces 32-bit values, for which we'd expect a collision after roughly 77,000 hashes (the point where the formula above gives a 50% chance for \( N = 2^{32} \)). I do trim the hashcode to a fixed length, hence increasing the probability of collision - for a production system I would recommend using something like SHA-1, for which an accidental collision is only expected after around \( 2^{80} \) hashes.
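For reference, a Java-style string hash can be written in a few lines of JavaScript. This is a sketch of the general technique rather than the exact code behind this post: the function names are mine, and `toFixedHex` pads the result to eight hex digits instead of trimming it.

```javascript
// Same recurrence as Java's String.hashCode: h = 31 * h + charCode,
// kept within a signed 32-bit integer at every step.
function javaStyleHashCode(str) {
  var h = 0;
  for (var i = 0; i < str.length; i++) {
    h = (Math.imul(31, h) + str.charCodeAt(i)) | 0;
  }
  return h;
}

// Renders the hash as an unsigned, fixed-length (8 character) hex string.
function toFixedHex(h) {
  return (h >>> 0).toString(16).padStart(8, '0');
}
```

For example, `javaStyleHashCode('abc')` returns 96354, and `toFixedHex(96354)` returns `'00017862'`.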
Bringing it all together
Today we have looked at how to generate fingerprints for three different classes of device data: browser data, audio data and video data. By smartly segmenting these classes into subclasses, each with a different fingerprint, it's possible to develop heuristics that allow systems to be identified with relative confidence even after basic tampering - for instance, by accepting partial matches of N out of all fingerprints, or by discarding tampered fingerprints.
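The partial-match heuristic could look something like the sketch below. It assumes each device is stored as a plain object of per-class hashes; the class names and the threshold in the usage note are made up for the example.

```javascript
// Counts how many per-class fingerprints are identical in both records.
function countMatches(stored, observed) {
  return Object.keys(stored).filter(function (key) {
    return observed[key] !== undefined && observed[key] === stored[key];
  }).length;
}

// Treats two records as the same device when enough classes still match.
function isSameDevice(stored, observed, minMatches) {
  return countMatches(stored, observed) >= minMatches;
}
```

With `stored = { browser: 'a1', audio: 'b2', video: 'c3' }` and an observed record where only the browser hash has changed, `isSameDevice(stored, observed, 2)` still returns true, so a spoofed User Agent alone would not be enough to produce a new identity.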
The bottom line is that it is very feasible to fingerprint a device despite changes in the device itself (whether voluntary or not), and that silent, regulation-compliant fingerprinting solutions are very much a thing.
If you're curious about the full code exploration, feel free to tinker with the repl below. Pressing on the play button will (likely) uniquely identify your device.
Remember this next time you shop online 🕵️