Winston Churchill famously said, "To improve is to change; to be perfect is to change often."

If you are an advertising agency, an e-commerce site or any website that relies heavily on clicks, purchases and form completions, split-testing is not only valuable, it's crucial. If you are not constantly split-testing, you are behind the curve and paying opportunity costs every day.

To scale properly, however, websites must be aware of the most common pitfalls in a growing split-testing operation. The vulnerability is most apparent when interpreting the results of a test through Key Performance Indicators (KPIs).

The most common misinterpretation of split-testing KPIs is the Phantom Lift. Understanding how to adjust KPIs to make more insightful decisions is the key to extracting the full value from split-testing.

What Is Phantom Lift ...

And why does it occur?

In simplest terms, the Phantom Lift is what we see when we run many seemingly successful split-tests, yet our overall results fall short of our expectations. There are three lifts to keep track of: the Expected Lift, the Actual Lift and their difference, the Phantom Lift.

[Expected Lift] – [Actual Lift] = [Phantom Lift]

For example, suppose I use simple A/B split-testing with the goal of improving the click-through rate on a site. In this example, the KPI will be the relative percentage increase. That is, if a test variation performs better than a control variation, we say the test performs "x percent" better than the control.

Let's say over the course of a year we run 100 split tests, one right after the other, with the goal of improving our meager 20 percent click-through rate. After each test, we compare the control variation to the test variation. If our test variation performed better than the control, it becomes the new control variation and we record the lift the test provided.

If not, we do nothing and proceed with the next test. At the end of this process, we find we had 50 winners, which produced an average of a 4 percent lift.

Fantastic! We should have at least a seven times better click-through rate than we did before (1.04^50), right?

Not exactly. You will notice that a seven times better click-through rate would mean our click-through rate is now 140 percent, which is impossible. In reality, our click-through rate after all of our testing is only 25 percent. This is an improvement, to be sure, but not nearly as large as we would have hoped.
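To make the arithmetic concrete, here is a minimal Python sketch of the example above. The variable names are my own, and expressing each lift as a relative increase over the starting click-through rate is an assumption about how to apply the formula, not the only way to do it.

```python
baseline_ctr = 0.20   # starting click-through rate from the example
final_ctr = 0.25      # click-through rate actually measured after all tests
winners = 50          # number of winning tests
avg_lift = 0.04       # average relative lift recorded per winner

# Expected Lift: compound the recorded lifts as if every one of them were real.
expected_multiplier = (1 + avg_lift) ** winners
expected_ctr = baseline_ctr * expected_multiplier

# Actual Lift: what the click-through rate really did.
actual_multiplier = final_ctr / baseline_ctr

# Phantom Lift: Expected Lift minus Actual Lift, both as relative increases.
expected_lift = expected_multiplier - 1
actual_lift = actual_multiplier - 1
phantom_lift = expected_lift - actual_lift

print(f"Expected multiplier: {expected_multiplier:.2f}x")
print(f"Implied click-through rate: {expected_ctr:.0%}")
print(f"Actual multiplier: {actual_multiplier:.2f}x")
print(f"Phantom lift: {phantom_lift:.0%}")
```

The implied click-through rate comes out above 100 percent, which is the giveaway that the recorded lifts cannot all be real.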

Why the Hell Should You Care?

The example above isn’t just mathematical sleight-of-hand or an elaborate computational ruse.

The problem starts with variance in the data. Variance can be thought of as the volatility of information. The less volatility, the easier it is to learn from data and drive decision making.

What so often happens is that this volatility is grossly unaccounted for, and we let too many false positives drive decision-making.

At this point, the phrases “small sample size” and “statistical significance” tend to creep into the discourse. But a third buzzword is equally important: publication bias. While usually found in the context of academic research, its relevance to split-testing is undeniable.

In our split-testing example earlier, we cherry-picked winning tests and ignored the information losing tests provided. The results we published led us to a false conclusion, in the end.
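A small simulation can show how this cherry-picking inflates the recorded lift. The Python sketch below is illustrative only: the function name, the sample size of 2,000 visitors per variation and the assumption that each new variation's true effect is tiny and centered on zero are all hypothetical choices of mine, not figures from a real testing program.

```python
import random


def simulate_phantom_lift(num_tests=100, visitors_per_arm=2000,
                          true_effect_sd=0.02, seed=0):
    """Run sequential A/B tests; return (winners, expected, actual) multipliers."""
    rng = random.Random(seed)
    start_rate = 0.20
    control_rate = start_rate   # true rate of the current control (unknown in real life)
    expected_multiplier = 1.0   # product of the lifts we recorded from winners
    winners = 0

    def observed_rate(true_rate):
        # Each visitor clicks with probability true_rate.
        clicks = sum(rng.random() < true_rate for _ in range(visitors_per_arm))
        return clicks / visitors_per_arm

    for _ in range(num_tests):
        # Assumption: the variation's true relative effect is small and zero-centered.
        effect = rng.gauss(0, true_effect_sd)
        variant_rate = min(0.99, max(0.01, control_rate * (1 + effect)))
        obs_control = observed_rate(control_rate)
        obs_variant = observed_rate(variant_rate)
        # Promote the variation whenever it merely looks better, as in the example.
        if obs_control > 0 and obs_variant > obs_control:
            winners += 1
            expected_multiplier *= obs_variant / obs_control
            control_rate = variant_rate

    actual_multiplier = control_rate / start_rate
    return winners, expected_multiplier, actual_multiplier


winners, expected, actual = simulate_phantom_lift()
print(f"Winners: {winners}")
print(f"Expected multiplier from recorded lifts: {expected:.2f}x")
print(f"Actual multiplier in the true rate: {actual:.2f}x")
```

Because only the tests that happened to look good are recorded, noise in the winners' favor is counted as lift, while noise against the losers is thrown away.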

How Do We Address It?

Choose your KPI wisely. The above method was clearly insufficient, but we can augment how we interpret our results to better protect ourselves against variance and selection bias.

Popular strategies include leveraging statistical significance, confidence intervals, and Bayesian inference. No worthwhile strategy will be entirely immune to variance or false positives, but awareness of the problem is a huge step in the right direction.
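As one illustration of the first strategy, the Python sketch below runs a two-sided, two-proportion z-test before declaring a winner. It is a minimal sketch under stated assumptions: the function name and the click counts are hypothetical, and the normal approximation it relies on is only reasonable for fairly large samples.

```python
import math


def two_proportion_p_value(clicks_a, visitors_a, clicks_b, visitors_b):
    """Two-sided p-value for the difference between two click-through rates."""
    rate_a = clicks_a / visitors_a
    rate_b = clicks_b / visitors_b
    # Pooled rate under the null hypothesis that both variations perform the same.
    pooled = (clicks_a + clicks_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical counts: a 4 percent relative lift (20.0% vs. 20.8%) at two sample sizes.
p_small = two_proportion_p_value(400, 2000, 416, 2000)
p_large = two_proportion_p_value(40000, 200000, 41600, 200000)
print(f"2,000 visitors per variation:   p = {p_small:.3f}")
print(f"200,000 visitors per variation: p = {p_large:.2e}")
```

The same 4 percent lift is far from significant with 2,000 visitors per variation and clearly significant with 200,000, which is why a recorded lift on its own says little about whether a winner is real.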

Lies, Damn Lies and Statistics?

So what is our takeaway here? That there are lies, damn lies and statistics?

More realistically, false positives, if left unnoticed, compound into large disparities between expected and actual performance. Take care with your KPIs and temper expectations with well-reasoned pessimism.