Summary: Many analytics results are wrong. Methodology flaws can be fixed through careful analysis, but better tests will be more expensive and identify fewer design winners. This lowers the ROI of analytics and will reduce how often the method is used.
You conduct an A/B test to determine which of two designs performs better, for example by producing the higher conversion rate. Fortunately, Design A wins, and the difference between A and B is statistically significant at p<0.05. What happens if you rerun the exact same test to replicate this finding?
You would expect Design A to win again, except for maybe one out of 20 times due to randomness. However, empirical evidence from several companies shows website analytics results fail to replicate at a much higher rate than anticipated.
This is troublesome.
Replication Fails
Replication means that you expect the same result if you conduct the same experiment twice. When I drop an apple, it falls to the ground. Since this can be replicated endlessly, we can have great faith in Sir Isaac Newton’s law of gravity.
When measuring a sample to represent a larger population, the recorded metric will naturally vary around the true mean. That’s expected, which is why we calculate margins of error and statistical significance, typically expressed as a p-value: the probability of observing a difference at least as large as the one we recorded if there were, in fact, no difference between the two conditions.
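To make this concrete, here is a minimal sketch of how the stat-sig calculation for a two-proportion A/B comparison might look (Python with NumPy and SciPy; the visitor and conversion counts are invented, and a pooled z-test is just one of several reasonable choices):

```python
import numpy as np
from scipy import stats

# Hypothetical numbers: 10,000 visitors per condition,
# 520 conversions for Design A and 450 for Design B.
n_a, conv_a = 10_000, 520
n_b, conv_b = 10_000, 450

p_a, p_b = conv_a / n_a, conv_b / n_b

# Pooled two-proportion z-test.
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_a - p_b) / se
p_value = 2 * stats.norm.sf(abs(z))  # two-sided p-value

print(f"Design A: {p_a:.2%}, Design B: {p_b:.2%}, p = {p_value:.4f}")
# If p < 0.05, we would conventionally declare a stat-sig winner,
# which is exactly the conclusion this article urges you to double-check.
```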
If we have a low p-value, the risk of drawing an erroneous conclusion from a study is supposed to be minor, and we can confidently proceed to launch the winning variation on our website and lean back to watch the dollars roll in.
So much for the theory you learned in college. Sadly, even though, in theory, theory and practice are the same, in practice they’re different.
Eline van Baal reported that many analytics experiments at NS International failed to replicate.
Ron Kohavi reported substantial failed replications at Airbnb.
As an aside, if you only follow one person for analytics insights, it should be Ron Kohavi, not me. (You should still follow me for my insights from qualitative research and my ability to make quant research findings understandable and relevant. And for my willingness to buck the orthodoxy that stifles so much online commentary these days.)
Math is hard, so this is an article where you must pay closer attention than my articles usually require. (“Math Genius” by Midjourney.)
When analytics tests fail to replicate, we must question our deepest beliefs. I used to think that if we run an A/B test and A wins with stat-sig, then we run with A. (As with much of this article, I owe Dr. Kohavi for the abbreviation stat-sig for “statistical significance,” which is a mouthful to either say or write. As Steve Jobs said, “true artists steal,” and I don’t think Kohavi minds that I copied the more efficient term from him.)
A/A Testing: Check if Your Tests Have Any Meaning
The first way to check if your analytics have any meaning is to run an A/A test, in which the two test conditions are identical. You don’t vary anything; you test your current design against itself by splitting the traffic like you would for an A/B test.
Since design A is identical to itself, the test should report no stat-sig difference between the two conditions, except for about 5% of the time if you rely on p<0.05 as your significance criterion. (Remember that the 0.05 threshold is the probability of declaring a difference when no difference exists between the conditions. So, with no real difference, you’ll erroneously declare a winner 5% of the time, which is expected.)
Something is wrong if your analytics say that A is different from A much more than 5% of the time.
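Here is a minimal simulation sketch of repeated A/A tests (Python with NumPy and SciPy; the traffic volume and the 5% conversion rate are assumptions). With a sound setup, roughly 5% of runs should come out “significant” at p<0.05; a markedly higher rate signals a problem:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def aa_test(n_per_arm=20_000, true_rate=0.05, alpha=0.05):
    """Simulate one A/A test: both arms have the same true conversion rate."""
    conv_a = rng.binomial(n_per_arm, true_rate)
    conv_b = rng.binomial(n_per_arm, true_rate)
    # Chi-squared test on the 2x2 table of conversions vs. non-conversions.
    table = [[conv_a, n_per_arm - conv_a], [conv_b, n_per_arm - conv_b]]
    _, p, _, _ = stats.chi2_contingency(table)
    return p < alpha  # True means a (false) "winner" was declared

runs = 1_000
false_positives = sum(aa_test() for _ in range(runs))
print(f"'Significant' A/A results: {false_positives / runs:.1%} (expect about 5%)")
```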
Your next check is to replicate some of your past analytics experiments and see whether you usually get the same result.
Reasons for Replication Fails
There are 5 main reasons that replication can fail:
Random chance. The curse of statistics is that we always have a small probability of being wrong, even when we do everything right. The reason for all those p-values is to ensure this probability is low enough that we can live with it. But it’s always there. If you run many replication experiments, expect a few to fail due to sheer randomness, which is no cause for alarm. If many fail, you have a problem.
Biased samples. I already mentioned the common problem of an underlying systematic difference in the traffic assigned to conditions A and B.
False analysis. We expect to be wrong 5% of the time, but more than 5% of our conclusions can be wrong if we analyze the studies incorrectly. For example, if you test 20 variants and find that exactly one of them is a winner with stat-sig at 5%, it’s likely not a winner at all. You are really running 20 studies, not one (even if you think of all 20 variants as part of a single test), and 5% of 20 is one expected false positive, so your single “winner” may well be that fluke, in which case 100% of your declared winners are wrong. Similarly, let’s say you run 101 tests, but your ability to guess at design improvements is so bad that only 1 of those 101 variations is a true winner, whereas the other 100 variants would make your website worse. If 5% of your results are wrong, you’ll erroneously declare 5 winners from the 100 bad designs. Thus, even if you’re lucky enough to correctly declare the 1 good design a winner, you’ll end up with 6 “winners,” of which only 17% are improvements to your website, whereas 83% of the “winners” will make your site worse if implemented. (A short simulation after this list illustrates the 20-variant case.)
Methodology flaws. If there is an underlying error in how the test is set up, the results will also be flawed, meaning that they cannot be trusted and are likely to fail replication. The most common error is to split traffic between conditions A and B non-randomly so that one of the samples has a systematic bias. In one of his talks, Dr. Kohavi mentions a test that initially declared one of the conditions to be a decisive winner. Further investigation found that the difference was caused by bot traffic concentrated in one of the two samples. The result changed once the bots (which are not humans and don’t behave like customers) were removed from the analysis.
False assumptions. All statistical formulas depend on assumptions about the data. If the assumptions are wrong, the calculations are wrong as well. So, your carefully computed p-value would not represent the actual probability of an erroneous conclusion but might be completely misleading. The most common assumption in almost all statistics packages is that the data follows a normal distribution, an issue I’ll discuss in depth in the next section.
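To illustrate the 20-variant arithmetic from the “false analysis” item above, here is a simulation sketch (Python with NumPy and SciPy; sample sizes and the 5% conversion rate are invented, and every variant is deliberately no better than the control):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def one_comparison(n_per_arm=20_000, rate=0.05, alpha=0.05):
    """One control-vs-variant test where the variant is NOT truly better."""
    conv_c = rng.binomial(n_per_arm, rate)
    conv_v = rng.binomial(n_per_arm, rate)
    table = [[conv_c, n_per_arm - conv_c], [conv_v, n_per_arm - conv_v]]
    _, p, _, _ = stats.chi2_contingency(table)
    return p < alpha  # True = a false "winner" is declared

runs, n_variants = 500, 20
experiments_with_false_winner = sum(
    any(one_comparison() for _ in range(n_variants)) for _ in range(runs)
)
print(f"20-variant experiments declaring at least one false 'winner': "
      f"{experiments_with_false_winner / runs:.0%}")
# With 20 independent tests at alpha = 0.05, expect roughly
# 1 - 0.95**20, i.e., about 64%, even though no variant is truly better.
```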
If you conduct A/A tests or replication tests and discover more errors than would be expected from pure randomness (which we can never avoid), you likely have one of the above problems and need to dig into your test methodology with a critical eye to discover what’s wrong.
What If the Normal Distribution Doesn’t Hold?
We can discover methodology flaws through careful analysis. But what if the deepest of all our statistical assumptions is wrong? What if the data doesn’t follow a normal distribution?
The famous “bell curve” is the normal distribution, which looks like the following illustration. It’s called “normal” because it’s the most common and expected distribution for almost all types of data.
The normal distribution is common because of the Central Limit Theorem (CLT), which states that the distribution of the sum (or average) of a large number of independent, identically distributed random variables approaches a normal distribution, regardless of the original distribution of the variables. Almost all the phenomena we want to study are the sum of countless smaller components. For example, many small details combine to determine whether a visitor to an ecommerce site turns into a buying customer. Enough factors impact the outcome that we expect conversion data to follow the normal distribution.
Histogram of a hypothetical user behavior under the (possibly false) assumption that it follows a normal distribution. Each emoji represents roughly 1% of the users. If the data follows a normal distribution, 1.4% of the observations will fall within the tail area highlighted in red. But if there are too many outliers, our assumptions fail. This, again, would mean that most statistical analyses of the data will fail.
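To see the Central Limit Theorem pull skewed raw data toward a bell shape, here is a minimal sketch (Python with NumPy and SciPy; the exponential “session length” data is an invented stand-in for a per-user metric):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Raw per-user values: heavily skewed (exponential), nothing like a bell curve.
raw = rng.exponential(scale=3.0, size=100_000)

# Averages over groups of 200 users: the CLT pulls these toward a bell shape.
means = rng.exponential(scale=3.0, size=(10_000, 200)).mean(axis=1)

print(f"Skewness of raw values:  {stats.skew(raw):.2f}")    # about 2: very skewed
print(f"Skewness of group means: {stats.skew(means):.2f}")  # near 0: roughly normal
```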
In 2006, I analyzed a dataset of 1,520 timed task attempts with various websites. 6% of the samples were outliers, falling far from the values predicted by a normal distribution. I don’t know whether this result generalizes to the type of measures collected in modern website analytics. But I fear something similar could be afoot, with large numbers of outliers or user segments whose behaviors differ drastically from the mainstream.
All your statistical analyses must be redone if the data doesn’t follow a normal distribution. As a first check, you can eyeball your data in a Q-Q plot to see how closely it follows the normal distribution. (If the normal distribution says that 10% of your data values should be below a specific number, the Q-Q plot will show whether the actual value at the tenth percentile is about the same.)
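Here is a minimal sketch of that eyeball check (Python with SciPy and Matplotlib; the simulated metric, including a small heavy-tailed outlier segment, is an invented stand-in for your own per-user measurements). scipy.stats.probplot draws the Q-Q plot, and points that wander off the straight line indicate departures from normality:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)

# Stand-in for a per-user metric: mostly normal, but with a heavy-tailed
# outlier segment mixed in (e.g., a small group behaving very differently).
metric = np.concatenate([
    rng.normal(loc=30, scale=5, size=9_500),    # mainstream users
    rng.normal(loc=30, scale=40, size=500),     # outlier segment
])

# Q-Q plot: sample quantiles vs. the quantiles a normal distribution predicts.
stats.probplot(metric, dist="norm", plot=plt)
plt.title("Q-Q plot: straight line = normal; curved tails = outlier problem")
plt.show()
```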
What Should You Do About Stats Problems?
By now, you are probably rather alarmed.
Well, you should be. There’s a considerable risk that you have been worshipping a false deity in the form of flawed data analytics and used bogus findings to drive your design decisions.
I have two action items for you:
Try some replication experiments and a few A/A tests, as outlined above.
If the results from step 1 are scary, dig into your methodology to see where you went astray.
Do Statistical Flaws Matter?
As mentioned, many of your design decisions may be based on bogus analytics findings. But how bad is that?
Quite often, when statistics are wrong, they are only slightly wrong. The main exception is if you have a major bias, in which case the statistics can be very wrong and lead to terrible design decisions. So, you should track down the reasons for any trouble you have identified in your analysis practice.
If all your fundamentals are sound and you are only hit by random errors, then the mistakes will likely be small. Let’s say that A is genuinely better than B but that B is declared the winner from analytics due to a random error. These random swings will mostly happen when A is only slightly better than B. Thus, the penalty of implementing B instead of A will be minor in most cases.
But let’s say your fundamentals are iffy. Take the problem with false positives vs. false negatives:
A false positive is when you declare a winner that isn’t actually better. False positives are also called Type I errors, but I recommend against this terminology because many people can’t remember which is Type I and which is Type II (a false negative).
A false negative is when you fail to declare a winner even though you have one: your sample doesn’t provide enough statistical power to reach stat-sig, so you are forced to conclude that there is no difference between the variants. (The simulation below shows how often this happens with an underpowered test.)
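To illustrate how punishing false negatives can be, here is a simulation sketch (Python with NumPy and SciPy; the conversion rates and the deliberately undersized sample are assumptions) in which A truly is better than B, yet the test rarely reaches stat-sig:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def underpowered_test(n_per_arm=2_000, rate_a=0.053, rate_b=0.050, alpha=0.05):
    """A is truly better than B, but the sample is far too small to show it."""
    conv_a = rng.binomial(n_per_arm, rate_a)
    conv_b = rng.binomial(n_per_arm, rate_b)
    table = [[conv_a, n_per_arm - conv_a], [conv_b, n_per_arm - conv_b]]
    _, p, _, _ = stats.chi2_contingency(table)
    return p < alpha  # True = we correctly detected A's advantage

runs = 1_000
detected = sum(underpowered_test() for _ in range(runs))
print(f"True winner detected in only {detected / runs:.0%} of runs")
# The remaining runs are false negatives: a real improvement, but no stat-sig result.
```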
With AI, ideation is free, and with analytics, testing appears cheap. This is a dangerous cocktail: if you test many false hypotheses (i.e., lousy design variants), you will get many false positives simply due to randomness. You might get even more false positives due to fundamental methodology flaws, but randomness alone will be enough to doom you if your ideation process produces large numbers of poor ideas that you proceed to test.
This is a reason to slow down and include a filter between the ideation and test steps, where you decline to run an analytics test on ideas that, for example, fail heuristic evaluation or qualitative usability testing.
You want to make data-driven design decisions, but only when the data is reliable. Run fewer, but better analytics experiments. (Midjourney)
Lower ROI from Analytics
Website analytics is built on quicksand. Core assumptions fail in practice. Business-critical decisions often rely on misleading data. These observations lead to two conclusions:
Analytics is riskier than we thought. Much can go wrong, and you must be more careful, including running tests on your methodology and findings to see whether they hold up. This is no reason to avoid analytics, but the need for added carefulness makes it a much more expensive methodology.
Analytics is less valuable than we thought. It’s now more expensive, and we will get fewer robust findings if we insist on the necessary low p-values of no more than 0.01. This radically unbalances the cost-benefit ratio against analytics.
We should not discard analytics. It still has a role to play, especially for websites that satisfy two criteria that, luckily, often go together:
Vast amounts of traffic, on the order of several hundred thousand visitors per day, so we can run experiments with sufficient sample sizes to get low p-values and high statistical power (around 200,000 users per condition; see the sample-size sketch after this list).
Enormous amounts of money at stake, so that even a tiny lift of, say, 0.1% is worth millions of dollars. Analytics is the only way we can measure minor differences between design options.
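As a rough sense of scale, here is a sketch of the standard two-proportion sample-size approximation (Python with SciPy; the 2% baseline conversion rate, two-sided alpha of 0.05, and 80% power are assumptions, and other choices shift the answer):

```python
from scipy import stats

def n_per_arm(p_base, lift_abs, alpha=0.05, power=0.80):
    """Approximate sample size per condition for a two-proportion z-test."""
    p1, p2 = p_base, p_base + lift_abs
    z_alpha = stats.norm.ppf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = stats.norm.ppf(power)           # desired statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2

# Baseline conversion of 2%, looking for a 0.1 percentage-point absolute lift.
print(f"{n_per_arm(0.02, 0.001):,.0f} users per condition")
# Exact figures depend on the assumed baseline, lift, and power, but land in the
# hundreds of thousands per condition, which is why only high-traffic sites qualify.
```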
On the other hand:
If you have too little traffic to get sound analytics data, don’t rely on analytics.
If you are mainly after big gains, you can identify them faster and more robustly with qualitative usability testing.