Click & Collect neon sign
PHOTO: Henrik Dønnestad

Mobile ad fraud is one of the biggest challenges the mobile marketing industry is currently facing.

Fraud rates have almost doubled in the last year. A recent study by my firm showed that 7.3 percent of all paid installs tracked were rejected once analyzed. Some advertisers lose up to 80 percent of their ad budgets, while the total financial impact around the globe is expected to surpass $4.9 billion in 2018.

Just as in other industries, a growing number of market players are touting machine learning as a miracle solution to the ad fraud epidemic, asserting that it will have a huge impact on how the industry deals with fraud everywhere.

These claims rest on the fact that machine learning-based technology can "learn" over time from the data it's fed. Proponents say machine learning can spot patterns in data that are too complex for a human to notice, and that it can handle immense data sets like the ones seen in mobile advertising.

However, machine learning is a relatively new technology, and there is a long way to go before it’s reliable enough to tackle something like mobile fraud. This article takes a look at a few of the limitations surrounding machine learning-based fraud detection, and goes into detail as to why the solution is not yet ready for primetime.

Separating Theory From Application

As it stands, machine learning has a key theoretical problem. Consider the following analogy: Imagine you want to drink water from a spring, but there are signs the water is contaminated. So, you decide to check first if the water is safe, and then if it isn’t, the second part of your plan is to devise a method to remove any potential contaminants. This would mean figuring out what the pollutants look like, what harm they do, and coming up with a way to filter all of them.

With some difficulty, you create a sophisticated machine that teaches itself how to detect any potential signs of a problem and will inform you about the types of pollution it finds. Your machine proves to be great at detecting exactly what kind of pollution it spots, especially as it sees more cases over time. However, can you then state that the water is definitely safe to drink? Can you be confident all the pollutants were caught by the machine, even the ones that didn’t exist when you first turned it on? Can you also be sure that it will only remove the pollutants, leaving the safe water behind?

Where Machine Learning Fails

Using machine learning to filter spoofing of all kinds, rather than to target one specific fraud method, can lead to difficulties. This is because fake users have to be filtered out of a data set that also contains real users, with a whole host of unclear edge cases. Essentially, the water is running low, so you don’t want to waste a drop.

One scenario that illustrates the difficulty of applying machine learning to fraud prevention is a fraudster who uses the real device information of a known user (i.e., an OS version, IDFA and set of device settings) to commit fraud. If the fraudster spoofs an install for an app that was never downloaded on the device, the machine learning algorithm will have a hard time categorizing the fraud correctly. Given that the device really exists, and that other attributions from other apps on this device were considered non-fraudulent, how could the algorithm detect that this time the install is fraudulent?

Essentially, it’s this inability to know which data point is genuine and which is not that creates a real challenge when it comes to “training” the neural network. Machine learning could end up creating some extremely complicated rule sets that use seemingly unrelated identifiers in bizarre combinations. Just like in the earlier analogy, this could lead to removing good water or, in this case, real users. This affects not only an advertiser but also its partners: explaining to a network that it was not credited with installs because of an overly convoluted algorithm will be a tough job.
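To put rough numbers on the "good water" problem, here is a minimal back-of-the-envelope sketch. It takes the 7.3 percent rejection figure cited earlier as a stand-in for the share of fraudulent installs. The detector's recall (95 percent), its false positive rate (2 percent) and the campaign size are purely hypothetical assumptions chosen for illustration, not measurements of any real product.

```python
def rejection_breakdown(total_installs, fraud_rate, recall, false_positive_rate):
    """Estimate how many real users a fraud filter would reject.

    recall and false_positive_rate are hypothetical detector
    characteristics, not figures from any real system.
    """
    fraud = round(total_installs * fraud_rate)
    legit = total_installs - fraud

    fraud_caught = round(fraud * recall)                      # true positives
    real_users_rejected = round(legit * false_positive_rate)  # false positives
    total_rejected = fraud_caught + real_users_rejected

    return {
        "fraud_caught": fraud_caught,
        "fraud_missed": fraud - fraud_caught,
        "real_users_rejected": real_users_rejected,
        "share_of_rejections_that_are_real": real_users_rejected / total_rejected,
    }


# Assumed campaign: 100,000 paid installs, 7.3 percent of them fraudulent.
result = rejection_breakdown(100_000, 0.073, recall=0.95, false_positive_rate=0.02)
for name, value in result.items():
    print(name, value)
```

Under these assumed rates, roughly one in five rejected installs would belong to a real user, even though the detector looks accurate on paper. Because genuine installs far outnumber fraudulent ones, a small error rate on real users translates into a large share of the rejections.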

If questioned too often, however, vendors who sell anti-fraud tools based on machine learning may, in turn, decide to hide their decisions behind a black box with less transparency, and we can all agree that no one wants that.

Why Black Boxes Are a Bad Idea

Imagine a network settling a dispute with a client over rejected attributions from a recent campaign. The network lacks any data to explain the rejections, and has to rely on the word of the client, who in turn relies on the attribution service that is monitoring for fraud. While this may not be an issue for a small fraction of a network’s traffic, once it starts affecting larger datasets, relationships may start to sour.

Once a provider loses the ability to (or doesn’t want to) explain why an attribution was rejected, it becomes an opinion-based rejection. Unlike facts and logical filters, opinions can be argued over or disagreed with.
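As a contrast to an opinion-based rejection, here is a minimal sketch of what a logical filter with a stated reason could look like. The rule itself (rejecting installs whose click-to-install time falls below a fixed minimum), the 10-second threshold and all names are hypothetical assumptions made for illustration. They are not taken from any particular attribution product.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold, chosen only for this example.
MIN_CLICK_TO_INSTALL_SECONDS = 10.0


@dataclass(frozen=True)
class Attribution:
    install_id: str
    click_to_install_seconds: float


@dataclass(frozen=True)
class Decision:
    install_id: str
    rejected: bool
    reason: Optional[str]  # always filled in when rejected is True


def evaluate(attribution: Attribution) -> Decision:
    """Apply one deterministic rule and record why it fired."""
    seconds = attribution.click_to_install_seconds
    if seconds < MIN_CLICK_TO_INSTALL_SECONDS:
        reason = (
            f"click_to_install_seconds={seconds} is below the "
            f"minimum of {MIN_CLICK_TO_INSTALL_SECONDS}"
        )
        return Decision(attribution.install_id, True, reason)
    return Decision(attribution.install_id, False, None)


print(evaluate(Attribution("install-001", 3.0)))
print(evaluate(Attribution("install-002", 45.0)))
```

Whatever rule is actually used, the point of the design is that every rejection carries a reason a network can check against its own data, so a dispute becomes a question of fact rather than a matter of trust.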

However, once you start down this path, you’ll end up in a situation where networks could try to portray every filter as just another opinion to ignore. Transparency goes hand-in-hand with objectivity.

Why It's Just Too Early

Ultimately, machine learning has the potential to become an excellent means of detection, but it shouldn’t be relied upon for rejection, at least not yet. In its current state, edge cases will be missed. Instead of focusing on a technology that might lead to less transparency, hard work needs to be done to build filters the right way: filters that stop fraud without also rejecting installs from legitimate sources.
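One way to picture the split between detection and rejection is the sketch below. It assumes a hypothetical setup in which a machine learning model produces an anomaly score between 0 and 1 and a separate set of explainable rules reports a violation, if any. The names and the 0.8 review threshold are illustrative assumptions only.

```python
from typing import Optional, Tuple

# Hypothetical score above which an install is sent for investigation.
REVIEW_THRESHOLD = 0.8


def route_install(anomaly_score: float, rule_violation: Optional[str]) -> Tuple[str, str]:
    """Return (action, explanation) for one install.

    Only an explicit rule violation can reject an install.
    The model score can only flag an install for review.
    """
    if rule_violation is not None:
        return ("reject", rule_violation)
    if anomaly_score >= REVIEW_THRESHOLD:
        return ("review", f"anomaly score {anomaly_score:.2f} is at or above {REVIEW_THRESHOLD}")
    return ("accept", "no rule violated and score below review threshold")


print(route_install(0.95, None))
print(route_install(0.10, "click-to-install time below minimum"))
print(route_install(0.10, None))
```

In this arrangement a high score never costs a network an install by itself. It only tells an analyst where to look, and anything learned from that investigation can later be turned into a new explainable rule.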

Going back to our analogy of the polluted spring — with machine learning, you’d know for certain there’s pollution. But that doesn’t mean it’s time to rely on that logic to begin filtering water.

Your best bet? With investigation and proper filtering, you can dig deeper, find where the pollution comes from, and stop it at the source.