
A/B Testing at Scale: Lessons from Dyson's Global Platform


Testing Across 52 Markets

During my time at Dyson, I worked on the global e-commerce platform that served 52 markets across multiple languages, currencies, and cultural contexts. Running A/B tests at this scale taught me lessons that no amount of reading could replicate. What works in the US market might fail in Japan. A winning variation in Germany could be neutral in Australia.

Here is what I learned about running meaningful A/B tests on a global platform.

The Infrastructure Challenge

Running tests across 52 markets means managing test configurations for different locales, ensuring statistical significance per market, and avoiding interactions between concurrent tests. We used Adobe Target as our primary experimentation platform, integrated with Adobe Analytics for measurement.

Market Segmentation

Not every test should run in every market. We categorized tests into three tiers:

  • Global tests: Fundamental UX changes tested across all markets simultaneously
  • Regional tests: Tests specific to market clusters (EMEA, APAC, Americas)
  • Local tests: Market-specific optimizations for high-traffic locales

Global tests require enormous sample sizes because you need statistical significance per market, not just overall. A test that looks positive globally might be carried entirely by one or two large markets while performing poorly in the smaller ones.
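To see why per-market significance is so expensive, here is a minimal sketch of the required sample size for a two-proportion z-test using the standard normal approximation (the baseline rate and lift below are illustrative, not Dyson's actual numbers):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_base: float, mde_rel: float,
                        alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-arm sample size for a two-proportion z-test.

    p_base:  baseline conversion rate (e.g. 0.03)
    mde_rel: minimum detectable relative lift (e.g. 0.05 for +5%)
    """
    p_var = p_base * (1 + mde_rel)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    pooled = (p_base + p_var) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p_base * (1 - p_base)
                             + p_var * (1 - p_var)) ** 0.5) ** 2
    return ceil(numerator / (p_base - p_var) ** 2)

# A 3% baseline with a 5% relative lift needs this many users per arm,
# in EVERY market where you want a confirmatory read:
n = sample_size_per_arm(0.03, 0.05)
print(n)  # roughly 200,000 users per arm, per market
```

Multiply that by two arms and 52 markets and it becomes clear why only the highest-traffic locales can support small-lift tests on their own.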

Statistical Pitfalls

The Multiple Comparisons Problem

When you test across 52 markets, you are essentially running 52 simultaneous statistical tests. At a 95% confidence level, you expect roughly 2-3 false positives by pure chance. We addressed this with:

  • Bonferroni correction: Adjusting the significance threshold based on the number of comparisons
  • Pre-registered hypotheses: Defining which markets are primary and which are exploratory before the test starts
  • Sequential testing: Using always-valid confidence intervals rather than fixed-horizon tests
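The Bonferroni correction is the simplest of the three to apply. A sketch with hypothetical per-market results (market codes and counts invented for illustration):

```python
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, n_a: int,
                           conv_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical results: (control conv, control n, variant conv, variant n)
results = {
    "DE": (310, 10000, 400, 10000),
    "JP": (290, 10000, 305, 10000),
    "AU": (300, 10000, 302, 10000),
}

alpha = 0.05
adjusted_alpha = alpha / len(results)  # Bonferroni: divide by comparison count

for market, (ca, na, cb, nb) in results.items():
    p = two_proportion_p_value(ca, na, cb, nb)
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"{market}: p={p:.4f} ({verdict} at alpha={adjusted_alpha:.4f})")
```

Bonferroni is conservative at 52 markets, which is exactly why pre-registering a handful of primary markets matters: it keeps the denominator of the correction small.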

Simpson's Paradox

This bit us more than once. A variation can show a positive overall effect while being negative in every individual market segment. This happens when the variation shifts traffic composition between segments. Always look at segmented results, not just the top line.
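A small numeric illustration of the paradox, with invented counts. The variant here loses in both markets, but because it shifts exposure toward the higher-converting market, the blended rate goes up:

```python
# Hypothetical data illustrating Simpson's paradox in an A/B test.
# (conversions, visitors) per market:
control = {"US": (100, 1000), "JP": (20, 1000)}
variant = {"US": (170, 1800), "JP": (3, 200)}

def rate(conv: int, n: int) -> float:
    return conv / n

for market in control:
    c, v = rate(*control[market]), rate(*variant[market])
    print(f"{market}: control {c:.1%} vs variant {v:.1%}")  # variant worse in both

c_total = rate(sum(c for c, _ in control.values()),
               sum(n for _, n in control.values()))
v_total = rate(sum(c for c, _ in variant.values()),
               sum(n for _, n in variant.values()))
print(f"Overall: control {c_total:.1%} vs variant {v_total:.1%}")  # variant "wins"
```

US drops from 10.0% to 9.4% and JP from 2.0% to 1.5%, yet the overall rate climbs from 6.0% to 8.7%, purely because the variant's traffic mix is US-heavy.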

What Actually Moves the Needle

After running hundreds of tests, patterns emerged about what reliably improves conversion:

  • Page load performance: Shaving 500ms off load time consistently improved conversion by 1-2% across all markets
  • Trust signals: Warranty information, review counts, and security badges had outsized impact in markets with lower brand recognition
  • Checkout simplification: Every field removed from checkout improved completion rates. This was universal across markets.
  • Localised social proof: Showing reviews from the same market performed better than showing global reviews

What Rarely Worked

Some commonly tested changes that rarely produced significant results:

  • Button colour changes (the classic example)
  • Hero image swaps without copy changes
  • Navigation reordering without information architecture changes
  • Minor copy tweaks that did not change the core message

The lesson: test big, meaningful changes to user experience, not cosmetic tweaks.

Process That Scaled

We developed a process that kept testing organized across teams and markets:

  1. Hypothesis document: Every test started with a one-page hypothesis including the change, expected impact, and target metrics
  2. Technical review: Engineering validated feasibility and flagged potential interactions with other tests
  3. QA across markets: Variations were tested in at least 5 representative markets before full rollout
  4. Analysis template: Standardized reporting that included per-market breakdowns and interaction checks

This process added overhead but prevented the common failure mode of launching broken tests or drawing incorrect conclusions from noisy data.
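The hypothesis document in step 1 does not need to be elaborate. A sketch of the kind of structured record that works, with hypothetical field names and values:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One-page test hypothesis, pre-registered before launch."""
    change: str                     # what the variation does
    expected_impact: str            # directional prediction
    primary_metric: str             # the single decision metric
    primary_markets: list[str]      # confirmatory markets
    exploratory_markets: list[str]  # analysed, never used to declare a win

checkout_test = Hypothesis(
    change="Remove the optional 'company name' field from checkout",
    expected_impact="Higher checkout completion rate",
    primary_metric="checkout_completion_rate",
    primary_markets=["GB", "DE", "JP"],
    exploratory_markets=["AU", "SG"],
)
```

The point of writing this down before launch is that the primary/exploratory split cannot be quietly redrawn after the results come in.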

Tooling Recommendations

For teams building A/B testing capabilities:

  • Invest in logging: You cannot analyze what you did not track. Log everything: impressions, interactions, and downstream conversions
  • Build a test catalog: Track all running and historical tests in a central system. Without this, test interactions are invisible.
  • Automate significance checks: Manual monitoring of test results leads to early stopping and biased conclusions. Set automated alerts for when tests reach significance.
  • Plan for rollback: Every test should have a kill switch. When a variation performs poorly, you need to end it immediately.
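A test catalog does not have to be sophisticated to catch interactions. A minimal sketch, assuming each test records the markets and page surfaces it touches (names invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class Test:
    name: str
    markets: set[str]
    pages: set[str]

def find_interactions(tests: list[Test]) -> list[tuple[str, str]]:
    """Flag pairs of tests that share both a market and a page surface,
    since overlapping audiences can contaminate each other's results."""
    clashes = []
    for i, a in enumerate(tests):
        for b in tests[i + 1:]:
            if a.markets & b.markets and a.pages & b.pages:
                clashes.append((a.name, b.name))
    return clashes

catalog = [
    Test("checkout-fields-v2", {"DE", "FR"}, {"checkout"}),
    Test("trust-badges", {"DE", "JP"}, {"checkout", "pdp"}),
    Test("hero-copy", {"US"}, {"home"}),
]
print(find_interactions(catalog))  # [('checkout-fields-v2', 'trust-badges')]
```

Even this crude overlap check surfaces the most common clash: two teams independently testing on the same page in the same market.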

Applying These Lessons to AI

I now apply the same rigor to testing AI features. When I add an AI-powered recommendation or content generation feature, I A/B test it against the existing approach with proper statistical methodology. The scale is smaller, but the principles are identical: pre-register your hypothesis, measure what matters, and do not stop the test early because it looks good.