By Jakob Nielsen

A/B Testing Paradox: UX Maturity Reduces Your Gains

Summary: A dataset of 74 A/B tests shows an average KPI lift of 0.2% from the statistically significant experiments. This produced an ROI of 620% in the case study, but as UX maturity climbs, the expected gains dwindle.

 

Ron Kohavi posted an interesting analysis of a dataset of 74 A/B tests. As always with Kohavi (probably the world’s leading analytics expert), it’s worth reading the full thing if you’re interested in analytics. For everybody else, here are some key takeaways.


Average Gains from A/B Testing

  • The average lift was 0.15%. (This is how much better the new design is compared with the existing design.)

  • The average difference between conditions was 0.4%. This is the expected gain if we had to pick between A and B without already having one of them installed as the working version.

  • 54% of cases had a positive lift (i.e., the new design tested better than the existing design). The average lift for these positive cases was 0.5%.


In the practical world, we won’t make any changes if our hypothesized alternate design tests worse. In this scenario, slightly more than half the time (the 54% of cases with a positive lift) we would gain 0.5% on average, and the rest of the time we would keep the existing design and gain nothing, for an expected KPI improvement across all cases of roughly 0.3%.


This is a simplified analysis that doesn’t consider statistical significance.
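To make that simplified arithmetic concrete, here is a quick back-of-the-envelope sketch in Python. The 54% win rate and the 0.5% average lift among winning cases are the figures from the dataset cited above; the snippet itself is only illustrative.

```python
# Expected gain when we only ship designs that test better,
# ignoring statistical significance (figures from the dataset cited above).
win_rate = 0.54                 # share of tests where the new design beat the old one
avg_lift_when_winning = 0.005   # 0.5% average lift among those winning cases

# In the losing 46% of cases we keep the old design and gain nothing.
expected_gain = win_rate * avg_lift_when_winning
print(f"Expected KPI gain per experiment: {expected_gain:.2%}")  # ~0.27%, i.e., roughly 0.3%
```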



Each experimentation round only improves the targeted KPI by a pinch. (Ideogram plus Leonardo upscale)


19% of Experiments Show a Statistically Significant Improvement

Let’s say we use the old-school criterion of p<0.05 as our definition of statistical significance. (Kohavi has often argued that this is too weak a criterion, but since 5% is the norm, I’ll use it here.)


Only 23 of the 74 A/B tests have a stat-sig outcome. In 9 of these 23 cases, the old design won, meaning that we should make no change to our shipping implementation. Only 14 of the 74 cases had a stat-sig difference in favor of the new design. This means that we should launch the new design in 19% of cases and stick with the old design in 81% of cases.


Actually, it doesn’t necessarily mean this. In contrast to Kohavi, I have often argued that it’s fine to launch a new design even if the statistics don’t give us high confidence that it’s better. This is acceptable in cases with low risk, because using the “winning” design (even if the “win” isn’t significant) will create gains on average and rarely cause big losses. (If you’re in a situation where you can’t accept an occasional loss, then don’t do this.)


  • The average lift for cases with a stat-sig positive test was 1.0%.

  • Across all 74 experiments, the expected gain (if only implementing the 14 stat-sig improvements) was 0.2%.


Now we’re talking. The dataset covers two years’ worth of analytics experiments from a fashion ecommerce site. On average, the company would implement 7 design improvements per year (14 stat-sig wins over two years), for an annualized lift of 7.2%.
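The 0.2% and 7.2% figures follow from the same counts. Here is a minimal sketch of that arithmetic; the only assumption beyond the cited numbers is that the 7 annual wins compound multiplicatively.

```python
# Expected gain per experiment if we only implement the 14 stat-sig improvements,
# and the annualized lift from 7 such improvements of 1.0% each (compounded).
total_tests = 74
stat_sig_new_wins = 14
avg_stat_sig_lift = 0.01   # 1.0% average lift for stat-sig positive tests

expected_gain_per_test = stat_sig_new_wins * avg_stat_sig_lift / total_tests
annualized_lift = (1 + avg_stat_sig_lift) ** 7 - 1   # 7 improvements per year, compounding

print(f"Expected gain per experiment: {expected_gain_per_test:.2%}")  # ~0.19%, rounded to 0.2%
print(f"Annualized lift: {annualized_lift:.1%}")                      # ~7.2%
```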


Sketch out a lot of different designs: that’s the way ahead in UX. Especially now that ideation is free with AI, you shouldn’t stop after a few design ideas. (Midjourney)


Design ideas are great, but if you measure them, you’ll find that only 19% of your good ideas lead to statistically significant gains for the business. (According to the case study data discussed here.) (Midjourney: I “cheated” to achieve character consistency for my UX designer, because I used strong variations of the same original image to make her do two different things: design and measure.)


ROI = 620%

The company has about $5 billion in annual sales. We don’t know its profit margin, but for fashion in general, cost of goods sold is usually half of sales, with additional marginal costs due to fulfillment. Let’s say that the margin is 25%. This means that the 7.2% in extra sales (about $360 million) is worth $90 million in extra profit. (7 improvements of 1.0% compound to slightly more than 7.0%.)


What’s the cost of running the UX team and the analytics team that are necessary to come up with the alternate designs and run the experiments? Again, I don’t know, but let’s say we need 10 UXers and 5 analytics specialists, for a total staff of 15. Let’s further add 35 engineers to implement the 74 design ideas so that they can go live on the site as a “B” condition and then bullet-proof the 7 chosen designs that get to become the new real website.


50 staff at, let’s say, $250K loaded cost per year, for a total cost of $12.5M. (Actual take-home salaries are obviously much lower according to UX salary statistics, but big companies have a lot of overhead.)


Thus, the company will spend $12.5M each year on design experimentation to gain $90M. This is worth doing! (Especially since the expense is only incurred during the year of those experiments, whereas the increased sales will continue to accrue in subsequent years.)

If we consider only a one-year horizon, the ROI is 620%. This is higher than almost all other investments a company might make.
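Putting the revenue, margin, and staffing estimates above into one place, here is a rough model of that ROI figure. All of these inputs are the back-of-the-envelope assumptions discussed above, not reported company numbers.

```python
# Rough ROI model using the estimates above: $5B sales, 25% margin,
# 7.2% annualized lift, and 50 staff at $250K loaded cost each.
annual_sales = 5_000_000_000
margin = 0.25
annualized_lift = 0.072
staff = 50
loaded_cost_per_person = 250_000

extra_profit = annual_sales * annualized_lift * margin   # ~$90M in extra profit
program_cost = staff * loaded_cost_per_person            # $12.5M per year
roi = (extra_profit - program_cost) / program_cost       # net gain relative to cost

print(f"Extra profit: ${extra_profit/1e6:.0f}M, cost: ${program_cost/1e6:.1f}M, ROI: {roi:.0%}")  # ~620%
```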


Increasing Maturity Reduces Gains

So far, so good. We’ve concluded that big companies should institute a culture of incremental UX experimentation to drive up KPIs gradually. Design and analytics are both well worth their cost and have high ROI.


What if CEOs take my advice? Then, their gains will gradually drop.


This is a bit buried in Kohavi’s post, but he says that it’s common to see stronger results in the beginning when a company starts a program of systematic A/B experimentation with design alternatives. After a few years of such experimentation, results usually turn weaker.


Kohavi ascribes this outcome to a difference in the type of design changes that are subjected to formal experimentation in an A/B test. For an “organization early in its maturity cycle, […] such organizations are more careful to do user studies, QA features and build solid releases.” This means that they get strong results, on average. In contrast, he says, “as the trust in the experimentation platform’s safety net grows, more MVP (Minimum Viable Products) implementations are tested in an agile fashion with less confidence in the feature, less QA (so more are aborted due to egregious issues), and the success rate declines.”


In other words, higher maturity means that management is willing to take more risks because they know that the average gains will be worth the expense of experimenting with weaker design ideas.


I find this a credible argument, and I tend to believe Kohavi in such matters due to his extensive experience. I will add one more argument, based on my experience in UX.


Harvesting the Low-Hanging Fruit

My argument is the cliché of the low-hanging fruit. Cliché it may be, but these fruits are real in companies without a strong UX culture. There’s so much bad design in the world, and it takes years of UX work to root it out.


In the beginning, when a company moves from a happy-go-lucky design process to gradually building up its UX maturity (including experimentation maturity), the early UX team is overwhelmed with opportunities to make rich gains for little effort.

Once the design has been through years of refinement, the hard work begins. We can always get better, but we’ve run out of low-hanging fruit. That’s why ROI from usability (and from each analytics experiment) will decline over time. After 10 years — let alone 20 years, which is the common time needed to achieve full UX maturity — of systematic design-improvement efforts, you can’t expect to continue to cash in a 620% ROI annually.


At some point, ROI becomes so low that the company should invest its funds in other ventures and not hire more UX staff. I doubt that many companies are actually at this fully mature stage yet, but give UX 20 more years to grow, and we’ll see such cases more often.


There’s a lot of fruit on the UX tree, but once the low-hanging fruit has been picked, it becomes harder work to climb up to harvest the rest. (Midjourney)
