Mobile devices have become integral to our lives: always with us, keeping our physical selves plugged in to the digital world.

As we increase our consumption of data on the move, so too do we increase the amount of personal information we transmit. Devices increasingly include functionality such as tracking fitness, health and well-being, which means more personal data elements, such as weight and physical activity, are being shared across apps.

Device manufacturers and operating systems work to keep personal information private, but in a quickly changing ecosystem, tying up all the loose ends gets complicated. And that opens the door for malicious agents to build compromising user profiles.

Given this context, protecting user privacy is simply good corporate responsibility. This applies to any company with access to large volumes of user information, such as ad networks or popular apps.

Implementing algorithms that inherently protect user privacy will result in better systems and user experiences overall, while increasing relevance and value to the user.

The Problem with Granularity

The growing volume of data, driven by the increasing granularity of information available on mobile, brings with it two issues: handling scale and understanding the noise that comes from over-specification.

Infrastructure solutions to handle scale are straightforward. However, any knowledge-based activity involving human touchpoints can break down dramatically at even modest scale.

Consider the case of classifying a user as an “upscale” or “budget” buyer based on the places she shops. Doing this by hand becomes intractable for as few as a thousand users. The next approach might be to classify shops into these buckets and automatically assign users based on their shopping trends. However, even this method fails once the number of shops grows beyond a certain scale.

Granular data makes interpretation difficult by mixing in noise with the signal. 

By designing systems that can handle both scale and noise, privacy protection can be built in from the start. Removing all human intervention from data processing allows easy scaling while strengthening privacy. Clustering information removes noise and, at the same time, provides anonymization.

Two Machine Learning Models

To understand this further, consider the two classes of machine learning models: supervised and unsupervised.

Supervised 

In supervised models, the system is trained on historical outcomes and a set of features associated with each outcome. In user-based modeling, these features could include user characteristics and behavioral attributes. Fortunately, we can replace specific information with abstract labels and do just as well. 

For example, if a user has purchased a laptop and a camera but not a TV, we could replace that with an abstract vector that looks like (A:1,B:1,C:0); there is no need to specify what A, B and C stand for. Techniques such as principal component analysis and information entropy reduction automatically evaluate how relevant each of these features is to predicting the desired outcome, so we can weight them appropriately.

Each feature vector becomes a fingerprint that describes the user without revealing any personal information.
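
As a rough illustration, here is a minimal sketch, assuming scikit-learn is available, of evaluating feature relevance on such anonymized vectors. The labels A, B and C, the data and the outcomes are all made up:

```python
# Minimal sketch: feature relevance on anonymized fingerprints.
# Assumes scikit-learn; the vectors and outcomes are illustrative.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_classif

# Each row is one user's fingerprint, e.g. (A:1, B:1, C:0).
# Nothing here reveals what A, B or C actually stand for.
X = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
])
y = np.array([1, 1, 0, 0])  # historical outcomes, e.g. purchased or not

# Principal component analysis: variance explained per component.
pca = PCA(n_components=2).fit(X)
print(pca.explained_variance_ratio_)

# Information-based relevance of each abstract feature to the outcome.
print(mutual_info_classif(X, y, discrete_features=True))
```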

Unsupervised

In unsupervised models, algorithms create abstract clusters based on shared characteristics. If the user is described using anonymous feature vectors, then these clusters are in an abstract many-dimensional space, which prevents typecasting users into common stereotypes. 
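
A minimal sketch of this idea, again assuming scikit-learn (the fingerprints themselves are made up), might look like this:

```python
# Minimal sketch: clustering anonymous fingerprints in an abstract
# feature space. Assumes scikit-learn; data is illustrative.
import numpy as np
from sklearn.cluster import KMeans

fingerprints = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.1, 0.9, 0.7],
    [0.0, 0.8, 0.9],
])

# Users are grouped by similarity alone; clusters 0 and 1 are abstract
# labels, not stereotypes like "upscale" or "budget".
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(fingerprints)
print(kmeans.labels_)
```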

Techniques like deep learning take this a step further by automating feature extraction from raw data, creating fingerprints that are not human-interpretable.
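
To make that concrete, here is a minimal sketch using PyTorch (an assumption; any deep learning framework would do) of a tiny autoencoder whose bottleneck layer becomes an opaque fingerprint. The layer sizes and random data are purely illustrative:

```python
# Minimal sketch: a tiny autoencoder whose bottleneck serves as an
# opaque user fingerprint. Assumes PyTorch; sizes and data are made up.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(16, 4), nn.ReLU())
decoder = nn.Sequential(nn.Linear(4, 16))
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

x = torch.rand(32, 16)  # raw behavioral signals: 32 users x 16 features
for _ in range(200):
    opt.zero_grad()
    z = encoder(x)  # 4-dimensional learned fingerprint
    loss = nn.functional.mse_loss(decoder(z), x)
    loss.backward()
    opt.step()

# The 4-dimensional code describes each user, but its dimensions carry
# no human-readable meaning.
print(encoder(x)[0])
```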

Noise reduction techniques include creating small groups that are likely to behave consistently. For example, users belonging to the same nine-digit ZIP code (also referred to as ZIP+4) are commonly grouped together. Creating these groups makes prediction analysis more robust by being specific enough without dilution, while at the same time providing anonymity for users.
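
For instance, a sketch of such binning in plain Python (the ZIP+4 codes and spend figures are hypothetical) could expose only group-level averages:

```python
# Minimal sketch: binning users into small, behaviorally consistent
# groups (here, ZIP+4) and reporting only group-level aggregates.
from collections import defaultdict

# Hypothetical per-user records; codes and values are made up.
users = [
    {"zip4": "94103-1001", "spend": 120.0},
    {"zip4": "94103-1001", "spend": 95.0},
    {"zip4": "10001-2200", "spend": 40.0},
]

groups = defaultdict(list)
for user in users:
    groups[user["zip4"]].append(user["spend"])

# Only the aggregate per bin is exposed downstream, never the individual.
aggregates = {zip4: sum(v) / len(v) for zip4, v in groups.items()}
print(aggregates)
```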

Most mobile devices have the processing power of small computers and can run this binning themselves. Using a distributed, map-reduce paradigm, the device can map personal information into abstract buckets locally and share only the aggregates with the rest of the ecosystem, so raw personal data never has to leave the device.
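
A sketch of the device-side “map” step might look like the following; the bucket boundaries and the weight readings are assumptions for illustration:

```python
# Minimal sketch of the device-side "map" step: sensitive values are
# bucketed locally, and only abstract bucket counts leave the device.

def map_to_bucket(weight_kg: float) -> str:
    """Map a sensitive value to a coarse, anonymous bucket on-device."""
    if weight_kg < 60:
        return "W0"
    if weight_kg < 80:
        return "W1"
    return "W2"

def device_report(readings: list[float]) -> dict[str, int]:
    """Aggregate locally; the raw readings never leave the device."""
    counts: dict[str, int] = {}
    for reading in readings:
        bucket = map_to_bucket(reading)
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts

# The server (the "reduce" side) only ever sees abstract aggregates.
print(device_report([58.2, 72.5, 74.0, 91.3]))
```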

Exploration or Exploitation

These techniques help keep personal information out of human hands, but they require careful design to prevent the creation of filter bubbles (only seeing what you have liked before) and personalized recommendations in bad taste (for example: “Are you overweight?” ads targeted at recently engaged women).

Good design entails a balance between showing users new content (exploration) and bombarding them with known material (exploitation). As with all businesses, careful curation of content is critical to maintaining user experience.
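
One common way to strike this balance is an epsilon-greedy strategy. The sketch below is a simplification, and the content buckets, engagement numbers and epsilon value are all assumed:

```python
# Minimal sketch: an epsilon-greedy recommender balancing exploration
# and exploitation. All names and numbers are illustrative.
import random

EPSILON = 0.1  # fraction of the time we explore new content

# Hypothetical average engagement per content bucket.
engagement = {"news": 0.30, "sports": 0.55, "travel": 0.10}

def recommend() -> str:
    if random.random() < EPSILON:
        # Exploration: surface something the user has not asked for.
        return random.choice(list(engagement))
    # Exploitation: show the historically best-performing content.
    return max(engagement, key=engagement.get)

print([recommend() for _ in range(5)])
```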

Systems that handle scale and noise robustly protect user privacy automatically. Taking it one step further, handling exploration versus exploitation efficiently creates rich user experiences while preventing algorithmic stereotyping.

Title image "Shy" (CC BY 2.0) by Tom Edgington