How I Use Copilot Agents for A/B Test Ideation at Scale
The A/B Testing Bottleneck
In my role as Digital Optimisation Executive at Dyson, a big part of my work is running A/B tests to improve website performance. The testing itself is straightforward once you have a hypothesis. The bottleneck is ideation: generating a steady stream of well-reasoned, data-informed test hypotheses that have a genuine chance of moving metrics.
Traditionally, this involves manually reviewing analytics data, heatmaps, session recordings, competitor sites, and user research, then synthesising all of it into structured test hypotheses. It works, but it is slow and limited by how much information one person can process.
I built a system using AI agents that dramatically accelerates this process.
The Agent System
My ideation system uses three specialised agents working in sequence:
Agent 1: Data Analyst
This agent processes quantitative data and identifies opportunities. It ingests:
- Google Analytics data (page performance, conversion funnels, drop-off points)
- Heatmap summaries (where users click, how far they scroll)
- Previous test results (what worked, what failed, and why)
```python
import json

class DataAnalystAgent:
    def __init__(self, ai_client):
        # ai_client: any async LLM client exposing a complete() method
        self.ai_client = ai_client

    async def analyse(self, data_sources: dict) -> list[Opportunity]:
        # Combine all quantitative inputs into a single analysis prompt.
        prompt = f"""You are a digital analytics expert. Analyse this website data
and identify the top opportunities for A/B testing.

For each opportunity, provide:
- page_or_section: where on the site
- metric_impact: which metric this could improve
- evidence: what data supports this opportunity
- estimated_impact: low/medium/high based on traffic and current performance

Analytics data: {json.dumps(data_sources['analytics'])}
Heatmap insights: {json.dumps(data_sources['heatmaps'])}
Previous test results: {json.dumps(data_sources['past_tests'])}

Return a JSON array of opportunities, maximum 15."""
        response = await self.ai_client.complete(prompt)
        # Parse the model's JSON output into typed Opportunity objects.
        return [Opportunity(**o) for o in json.loads(response)]
```
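The Opportunity type referenced above is not shown in the snippet. Here is a minimal sketch of how it might be defined, assuming Pydantic for validation; the model itself is illustrative, and only the field names come from the prompt:

```python
from pydantic import BaseModel

class Opportunity(BaseModel):
    # Field names mirror the keys requested in the Data Analyst prompt.
    page_or_section: str
    metric_impact: str
    evidence: str
    estimated_impact: str  # "low" | "medium" | "high"
```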
Agent 2: Hypothesis Generator
This agent takes the opportunities identified by the Data Analyst and transforms them into structured test hypotheses. Each hypothesis follows a rigorous format:
```python
class HypothesisGeneratorAgent:
    def __init__(self, ai_client):
        self.ai_client = ai_client

    async def generate(self, opportunities: list[Opportunity]) -> list[Hypothesis]:
        prompt = f"""You are a conversion rate optimisation expert. For each opportunity,
generate a structured A/B test hypothesis.

Use this format for each hypothesis:
- hypothesis_id: unique identifier
- if_we: what change we make
- because: why we think this will work (based on evidence)
- we_expect: what metric will change and by how much
- primary_metric: the main metric to measure
- secondary_metrics: other metrics to watch
- audience: who this affects
- estimated_duration_days: how long to run the test

Opportunities: {json.dumps([o.dict() for o in opportunities])}

Generate 2-3 hypotheses per opportunity. Be specific and measurable."""
        response = await self.ai_client.complete(prompt)
        # Parse the model's JSON output into typed Hypothesis objects.
        return [Hypothesis(**h) for h in json.loads(response)]
```
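As with Opportunity, the Hypothesis model is assumed rather than shown. A minimal Pydantic sketch mirroring the fields requested in the prompt might look like this:

```python
from pydantic import BaseModel

class Hypothesis(BaseModel):
    # One record per "if we... because... we expect" statement.
    hypothesis_id: str
    if_we: str
    because: str
    we_expect: str
    primary_metric: str
    secondary_metrics: list[str]
    audience: str
    estimated_duration_days: int
```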
Agent 3: Prioritiser
The final agent scores and ranks hypotheses using a framework based on ICE (Impact, Confidence, Ease):
```python
class PrioritiserAgent:
    def __init__(self, ai_client):
        self.ai_client = ai_client

    async def prioritise(self, hypotheses: list[Hypothesis],
                         constraints: dict) -> list[RankedHypothesis]:
        prompt = f"""Score each hypothesis on three dimensions (1-10):

Impact: How much will this move the primary metric if successful?
Confidence: How confident are we this will work, based on the evidence?
Ease: How easy is this to implement and test?

Consider these constraints:
- Development capacity: {constraints['dev_hours_available']} hours this sprint
- Current test slots: {constraints['available_slots']} tests can run simultaneously
- Priority pages: {constraints['priority_pages']}

Hypotheses: {json.dumps([h.dict() for h in hypotheses])}

Return each hypothesis with ice_score (impact * confidence * ease / 1000),
and a recommended_priority rank."""
        response = await self.ai_client.complete(prompt)
        # Parse the scored hypotheses and rank the highest ICE scores first.
        return sorted(
            [RankedHypothesis(**h) for h in json.loads(response)],
            key=lambda h: h.ice_score, reverse=True
        )
```
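The snippets above show each agent in isolation. The glue that runs them in sequence is straightforward; below is a rough sketch, in which the RankedHypothesis model and the run_ideation helper are illustrative names rather than the production code:

```python
class RankedHypothesis(Hypothesis):
    # Scores returned by the Prioritiser (1-10 each) plus the derived ranking.
    impact: int
    confidence: int
    ease: int
    ice_score: float          # impact * confidence * ease / 1000
    recommended_priority: int

async def run_ideation(ai_client, data_sources: dict,
                       constraints: dict) -> list[RankedHypothesis]:
    """Chain the three agents: data -> opportunities -> hypotheses -> ranked list."""
    analyst = DataAnalystAgent(ai_client)
    generator = HypothesisGeneratorAgent(ai_client)
    prioritiser = PrioritiserAgent(ai_client)

    opportunities = await analyst.analyse(data_sources)
    hypotheses = await generator.generate(opportunities)
    return await prioritiser.prioritise(hypotheses, constraints)
```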
The Output
The system produces a prioritised list of test hypotheses, each with:
- A clear hypothesis statement in "if we... because... we expect" format
- Supporting evidence from the data
- Specific metrics to measure
- An estimated test duration
- An ICE priority score
- Implementation notes
This output feeds directly into our experimentation roadmap. Instead of spending a day generating 5 to 10 hypotheses manually, the system produces 20 to 30 scored hypotheses in under 5 minutes.
Quality of AI-Generated Hypotheses
An important question: are AI-generated hypotheses actually good? In my experience, roughly 60 to 70 percent of the generated hypotheses are viable without modification. Another 20 percent need minor refinement. The remaining 10 to 15 percent are not useful, typically because they are too vague, impractical, or based on a misinterpretation of the data.
Compare this to the manual process where every hypothesis takes significant time to develop. Even accounting for the ones you discard, the AI system produces more viable hypotheses per hour by a large margin.
Human Review Is Essential
I want to be clear: I review every hypothesis before it goes into the testing roadmap. The AI does not understand business context, brand guidelines, or political sensitivities the way a human does. Some technically sound hypotheses are impractical for reasons the AI cannot know.
The system accelerates ideation. It does not replace strategic thinking. I spend less time generating hypotheses and more time evaluating and refining them, which is a better use of my expertise.
Integration with the Testing Workflow
The prioritised hypotheses are exported in a structured format that integrates with our experimentation platform. Each hypothesis becomes a test brief with the elements below (a rough export sketch follows the list):
- The hypothesis statement and supporting evidence
- Wireframe suggestions (text descriptions that the design team can work from)
- Success criteria and metrics to track
- Estimated sample size and test duration
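As a rough illustration of that export, each ranked hypothesis can be flattened into a brief-shaped dictionary before it is pushed to the platform. The to_test_brief helper below is my own sketch, not a specific platform's API:

```python
def to_test_brief(h: RankedHypothesis) -> dict:
    # Wireframe notes and exact sample sizes are added during human review.
    return {
        "title": f"{h.hypothesis_id}: {h.if_we}",
        "hypothesis": f"If we {h.if_we}, because {h.because}, we expect {h.we_expect}.",
        "primary_metric": h.primary_metric,
        "secondary_metrics": h.secondary_metrics,
        "audience": h.audience,
        "estimated_duration_days": h.estimated_duration_days,
        "ice_score": h.ice_score,
        "priority": h.recommended_priority,
    }
```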
Scaling the Process
The real power of this approach is scale. I can run the ideation system against different sections of the site, different user segments, or different time periods, generating targeted hypotheses for each context. This would take weeks manually but takes minutes with the agent system.
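In practice this is just the same pipeline called once per context. A minimal sketch, assuming each context supplies its own pre-filtered data sources:

```python
import asyncio

async def ideate_per_context(ai_client, contexts: dict[str, dict], constraints: dict):
    """Run the ideation pipeline once per site section, segment, or time period.

    `contexts` maps a label (e.g. "UK haircare PDPs") to its filtered data sources.
    """
    results = await asyncio.gather(*(
        run_ideation(ai_client, data_sources, constraints)
        for data_sources in contexts.values()
    ))
    return dict(zip(contexts.keys(), results))
```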
For a site of Dyson's scale, with multiple markets and product lines, this kind of scalable ideation is essential for maintaining a healthy experimentation velocity.
What I Have Learned
Building this system reinforced something I believe strongly about AI in the workplace: the best applications augment human capability rather than replacing it. I am not a worse optimisation professional because I use AI for ideation. I am a better one, because I can consider more data, generate more hypotheses, and spend more of my time on the strategic decisions that require human judgement. That is what AI agents should do: handle the volume so humans can focus on the value.