How to Write a Winning Hypothesis for CRO Experiments
“Let’s test a bigger button” is not a hypothesis. It tells you what to change but says nothing about who it helps, why it should work, or how you would know if you were wrong. An A/B test hypothesis is a specific, testable, and falsifiable prediction about what will happen when you make a change, and the reason you expect it. The reason is the part most teams skip, and it is the part that turns a coin-flip guess into a real experiment.
Here is the discipline in one line, borrowed from a useful framing: a hypothesis is a structured argument that connects a specific problem you found through research to a proposed intervention informed by evidence to an expected, measurable outcome. Get that argument right and a losing test still teaches you something. Get it wrong and you burn traffic to learn nothing.
What makes a hypothesis, and not a guess
The load-bearing idea is falsifiability. You must be able to prove the hypothesis wrong. If there is no result that would count as a failure, you do not have a hypothesis, you have a wish. “Removing the phone field will probably help conversions a bit” cannot be falsified because it predicts nothing precise. “Removing the phone field will increase form completions” can be: the test either moves the metric in the predicted direction at the size you set, or it does not.
This is why an A/B test is only ever as good as its hypothesis and its primary metric. Changing something without a falsifiable prediction means you cannot interpret the result either way, so you have spent traffic and learned nothing.
The anatomy of a complete hypothesis
A finished hypothesis is not one sentence. It is a short record you write down before launch. A complete one contains all of the following:
| Element | What it is |
|---|---|
| Observation / evidence | The research finding that prompted the idea |
| Change | The single thing you are altering |
| Audience / segment | Who sees the variant |
| Page / area | Where the change lives |
| Expected outcome + direction | What moves, and up or down |
| Primary metric | One metric that decides the result |
| Mechanism | The behavioural or psychological reason it should work |
| MDE | The smallest effect you care to detect |
| Sample size + duration | Decided up front, before you launch |
If any of those are missing, finish them before the test goes live. The last three are the ones marketers skip most often, and they are exactly the ones that decide whether your result means anything. Our CRO research methods guide covers how to find the evidence that fills the first row.
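One way to enforce that rule is to hold the record in a small structure that reports which elements are still empty. The sketch below is only an illustration: the class name `HypothesisRecord`, its field names, and the sample values (including the "All visitors" audience) are assumptions that mirror the table, not a standard schema.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class HypothesisRecord:
    """One pre-launch hypothesis record; field names are illustrative."""
    evidence: Optional[str] = None
    change: Optional[str] = None
    audience: Optional[str] = None
    page: Optional[str] = None
    expected_direction: Optional[str] = None   # "increase" or "decrease"
    primary_metric: Optional[str] = None
    mechanism: Optional[str] = None
    mde_relative: Optional[float] = None       # e.g. 0.10 for a 10% relative lift
    sample_size_per_variant: Optional[int] = None
    duration_days: Optional[int] = None

    def missing_fields(self) -> list[str]:
        """Return the names of the elements that are still empty."""
        return [f.name for f in fields(self) if getattr(self, f.name) in (None, "")]

    def ready_to_launch(self) -> bool:
        """A record is ready only when nothing is missing."""
        return not self.missing_fields()


# Hypothetical draft: evidence and change are filled in, the statistics are not.
draft = HypothesisRecord(
    evidence="34% abandonment at the phone-number field",
    change="Remove the phone-number field",
    audience="All visitors",
    page="Lead form",
    expected_direction="increase",
    primary_metric="Form completions",
)
print(draft.missing_fields())
# ['mechanism', 'mde_relative', 'sample_size_per_variant', 'duration_days']
```

The draft fails on the elements that get skipped most often, which is the point of checking before launch rather than after.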
Three formulas, ranked
You can write a sound hypothesis with any of these. They are ordered from simplest to most rigorous.
1. If / Then / Because (the workhorse)
If we [make this change], then [metric] will [increase or decrease], because [user behaviour or reason].
A good default. The “because” forces a mechanism, which is what stops it being a guess.
2. Research-led / observation-first (best for serious CRO)
Because we observed [data or evidence], we believe that [change] for [audience] on [page] will cause [impact on metric].
This front-loads the evidence, which makes the argument auditable and stops opinion-led tests sneaking in.
3. The lever framework (the British-practitioner model)
The UK agency Conversion.com structures it around levers:
We believe [lever] for [audience] on [area] will impact [KPI].
Their model: research surfaces barriers (lack of trust, price concern, missing information) and motivations (social proof, guarantees, clear USPs). Those become levers. A hypothesis tests one lever. The nuance worth keeping is that you are looking for evidence of a lever, not conclusive proof, and you can run a cheap minimum viable experiment to validate a lever before investing heavily. You can read their full experimentation framework for the detail.
Weak versus strong, side by side
Weak: “Let’s remove the phone number field from the lead form.”
Strong: “Because form analytics show 34% abandonment at the phone-number field and surveys indicate concern about unwanted sales calls, we believe removing it will increase form completions, because it lowers the perceived cost and risk of submitting the form.”
A checkout example: “If we reduce the number of form fields in checkout, we expect cart abandonment to fall and conversions to rise, because GA shows a 40% drop-off at this stage and form analytics show users struggling with ‘Address line 2’ and ‘landline phone number’.”
Both strong versions name the evidence, the change, the metric, and the mechanism. Copy that pattern.
Where hypotheses come from
You do not invent hypotheses, you find them. The raw material sits in your own data:
- Quantitative: analytics funnels and drop-off points, form analytics, conversion rates by segment.
- Qualitative: heatmaps, session replay, on-site surveys, customer support tickets and chat logs.
Read those for two things: barriers (what stops people converting) and motivations (what would push them over the line). Each becomes a candidate lever, and each lever becomes a hypothesis. A hypothesis with no evidence behind the “because” is the single most common reason tests fail to teach anything.
The statistics you must bake in
Most CRO articles wave at this section. It is the part that separates a test you can trust from a number you talked yourself into.
Null and alternative. The null hypothesis (H0) says your change has no effect. The alternative (H1) says it has the predicted effect. You run the test to gather enough evidence to reject the null. Stating it plainly keeps you honest about what “winning” means.
P-value, stated correctly. The p-value is the probability of seeing your data, or something more extreme, if the null were true. It is not the probability that your variant is better, and it is not the probability that your hypothesis is true. It only tells you how unusual your data would be in a world where the change did nothing. Plenty of CRO pages get this backwards; getting it right is a credibility test you should pass.
Significance and confidence. The industry standard is 95% confidence, a significance level of 0.05. A result significant at 95% means there is less than a 5% chance of seeing a difference this large if the two versions were truly identical.
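To make the p-value definition concrete, here is a minimal sketch of a two-sided, two-proportion z-test using only the Python standard library. Your testing tool may use a different method (sequential or Bayesian, for example), so treat this as one common approach and not as what any particular tool does. The function name and the visitor and conversion counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for H0: both versions share the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)        # rate if H0 were true
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0                                   # no variation at all: no evidence
    z = (p_b - p_a) / se
    # Probability of a gap at least this large, in either direction, under H0.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: control converts 300 of 10,000, variant 360 of 10,000.
p = two_proportion_p_value(conv_a=300, n_a=10_000, conv_b=360, n_b=10_000)
print(f"p-value: {p:.4f}")
```

With these made-up numbers the p-value comes out at roughly 0.018, below the 0.05 threshold. Read it as "data this extreme would be unusual if the change did nothing", not as "there is a 98% chance the variant is better".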
Power. Usually set to 80%, this is the probability that your test detects an effect of at least the MDE size when that effect really exists. A sample-size calculator needs four inputs: baseline conversion rate, minimum detectable effect, confidence (95%), and power (80%).
MDE. The minimum detectable effect is the smallest improvement you care to detect, usually expressed as a percentage lift relative to the baseline. It is inversely related to sample size: the smaller the effect you want to catch, the more traffic you need, and the cost rises steeply, since halving the MDE roughly quadruples the required sample. Decide it inside the hypothesis, before the test runs, not after.
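The four calculator inputs can be combined as in the sketch below, which uses the standard normal-approximation formula for comparing two proportions. Commercial calculators vary in the exact formula, so expect their figures to differ somewhat. The function name and the 3% baseline are assumptions chosen for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            confidence: float = 0.95, power: float = 0.80) -> int:
    """Visitors needed in each variant for a two-sided two-proportion test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)               # rate if the MDE is achieved
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical page: 3% baseline conversion rate.
n_10 = sample_size_per_variant(baseline=0.03, relative_mde=0.10)
n_05 = sample_size_per_variant(baseline=0.03, relative_mde=0.05)
print(n_10, n_05)
```

At a 3% baseline and a 10% relative MDE this works out to roughly 53,000 visitors per variant, which is around 1,600 conversions. Dropping the MDE to 5% pushes the requirement to about four times that.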
Rough magnitudes. A page converting at 2 to 5% typically needs roughly 1,000 to 2,000 conversions per variant to detect a 10 to 20% relative lift at 95% confidence. Run for a minimum of 14 days even if a calculator says you can stop sooner. The standard window runs from two weeks at the short end to four to six weeks at the long end, and a test should rarely run past six to eight weeks. Always cover full weekly business cycles so weekends and paydays do not skew the result.
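The duration rules above reduce to simple arithmetic: divide the total required sample by daily traffic, apply the 14-day floor, and round up to whole weeks. A minimal sketch, assuming an even traffic split and steady daily visitors. The function name and the traffic figures are hypothetical.

```python
from math import ceil


def planned_duration_days(sample_per_variant: int, variants: int,
                          daily_visitors: int, min_days: int = 14) -> int:
    """Days to run: never under min_days, always a whole number of weeks."""
    raw_days = ceil(sample_per_variant * variants / daily_visitors)
    whole_weeks = ceil(max(raw_days, min_days) / 7)
    return whole_weeks * 7


# Hypothetical test: 53,000 visitors per variant, 2 variants, 4,000 visitors a day.
print(planned_duration_days(53_000, 2, 4_000))   # 28
# A small requirement still runs the full 14 days.
print(planned_duration_days(5_000, 2, 4_000))    # 14
```

If the answer lands well beyond the six to eight week cap, that is usually a sign the MDE is too small for the traffic you have, so revisit the hypothesis before launching.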
Prioritising your hypotheses
When you have thirty ideas and traffic for three tests a month, you need a scoring system.
| Framework | Scores | Notes |
|---|---|---|
| ICE | Impact, Confidence, Ease (1 to 10, multiplied) | Fast weekly scoring |
| PIE | Potential, Importance, Ease | Born in CRO; favours high-traffic pages with quick upside |
| PXL | Binary yes/no objective criteria | Peep Laja’s model; the most rigorous, least biased |
ICE and PIE are quick but subjective, since a 7 to one person is a 4 to another. PXL (from Speero) replaces the 1-to-10 guesswork with objective yes/no questions like “is it above the fold?”, “is it on a high-traffic page?”, and “is the change noticeable within five seconds?”, which strips out a lot of bias. Re-score your backlog quarterly and put dates against the top items, because a testing roadmap without dates is just a wish list. The Speero prioritisation blueprint sets out PXL in full. For how this fits a wider testing operation, see how to build a CRO programme.
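The difference between the two scoring styles is easy to see in code. The sketch below multiplies three 1-to-10 ratings for ICE and counts yes answers for a PXL-style checklist. Treating every yes as one point is a simplifying assumption of this sketch, not the published PXL weighting, and the function names, backlog items, and scores are all hypothetical.

```python
def ice_score(impact: int, confidence: int, ease: int) -> int:
    """Multiply three subjective 1-to-10 ratings."""
    for value in (impact, confidence, ease):
        if not 1 <= value <= 10:
            raise ValueError("ICE inputs must be between 1 and 10")
    return impact * confidence * ease


def pxl_style_score(answers: dict[str, bool]) -> int:
    """Count the yes answers; equal weights are an assumption of this sketch."""
    return sum(answers.values())


# Hypothetical backlog with made-up ratings.
backlog = {
    "Remove phone field": ice_score(7, 8, 9),
    "Add value-proposition video": ice_score(6, 4, 3),
}
ranked = sorted(backlog, key=backlog.get, reverse=True)
print(ranked)

checklist = {
    "above the fold": True,
    "high-traffic page": True,
    "noticeable within five seconds": False,
}
print(pxl_style_score(checklist))   # 2
```

Two people can argue about whether impact is a 6 or a 7, but they will rarely disagree about whether a change is above the fold, which is why the yes/no version is less prone to bias.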
The mistakes that kill tests
Peeking. Stopping a test the moment it flashes “significant” inflates your real false-positive rate far above 5%, commonly to 20 to 30%, and with enough repeated looks it can be inflated five to ten times over. The fix lives in the hypothesis: pre-commit your sample size and stopping rule, then hold your nerve. Alternatively, use a tool built for continuous monitoring (sequential testing or Bayesian methods) that accounts for repeated looks.
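You can see the peeking effect for yourself with a small simulation of A/A tests, where both arms share the same true conversion rate, so every “significant” result is a false positive. This is a rough sketch, not a rigorous study: the 10% conversion rate, ten looks, batch size, and seed are arbitrary assumptions, and the function names are invented.

```python
import random
from math import sqrt

Z_CRIT = 1.96  # two-sided threshold for 95% confidence


def z_stat(conv_a: int, conv_b: int, n: int) -> float:
    """Two-proportion z statistic for two arms with n visitors each."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return 0.0 if se == 0 else (conv_b - conv_a) / n / se


def false_positive_rates(sims: int = 400, looks: int = 10, batch: int = 200,
                         rate: float = 0.10, seed: int = 1) -> tuple[float, float]:
    """Simulate A/A tests; return (rate when peeking, rate with one final look)."""
    rng = random.Random(seed)
    peeking_hits = final_hits = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        crossed = False
        z = 0.0
        for look in range(1, looks + 1):
            # Each look adds another batch of visitors to both arms.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            z = z_stat(conv_a, conv_b, look * batch)
            if abs(z) > Z_CRIT:
                crossed = True      # a peeker would have stopped and declared a winner
        peeking_hits += crossed
        final_hits += abs(z) > Z_CRIT
    return peeking_hits / sims, final_hits / sims


peeking, single_look = false_positive_rates()
print(f"peeking at every look: {peeking:.1%}, single final look: {single_look:.1%}")
```

The single-look rate should sit near the nominal 5%, while the peeking rate comes out several times higher, even though nothing about the two arms is actually different.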
HiPPO tests. Experiments driven by the Highest Paid Person’s Opinion fail at about the same rate as random changes. The whole point of hypothesis discipline is to convert “the director wants it green” into an evidence-grounded, testable prediction that can lose.
Testing several things at once. Change two variables and you cannot attribute the result to either. One change per test.
Audiences too small, no primary metric, or writing the hypothesis after the test. Each of these makes the result uninterpretable. Write the hypothesis down before launch, every time.
Realistic expectations
Honesty here will save your relationship with stakeholders. A meta-analysis of roughly 20,000 experiments across more than 1,000 Optimizely customers found only about 10% produced a statistically significant uplift on the primary metric. Google and Bing report win rates somewhere in the 10 to 20% range. Across Microsoft’s experiments, roughly a third win, a third are neutral, and a third are negative. The widely shared claim that “88% of A/B tests fail” is the same story told for shock value; the 10% win figure is the rigorous version.
Aim for a 10% win rate and a 100% learning rate. A well-formed hypothesis is structured so that both winning and losing outcomes teach you something. If you add a value-proposition video and it loses, you have learned the barrier is probably not value communication, so you redirect toward price or trust. That is a useful result, not a failed one.
Two ideas from the experimentation literature raise your standards further. Pair your primary metric with an Overall Evaluation Criterion and guardrail metrics, so you do not win revenue while quietly wrecking the customer experience. And keep Twyman’s Law in mind: any figure that looks unusually interesting is usually wrong. If your typical variation is under 1% and a test suddenly shows a 10% swing, investigate before you celebrate. The reference work here is Kohavi, Tang and Xu’s Trustworthy Online Controlled Experiments, with supporting papers at exp-platform.com.
A UK-specific caveat: consent and PECR
If you run A/B tests on UK traffic, this affects your sample before a single hypothesis runs. A/B testing cookies and analytics cookies are not exempt under the consent exceptions in the Privacy and Electronic Communications Regulations (PECR). They require consent before they fire. That consent must meet the UK GDPR standard: freely given, specific, informed, and unambiguous. In practice, non-essential scripts (including your testing tool) stay off until the user opts in, toggles are off by default, and “Accept” and “Reject” are given equal prominence.
The practical consequence: your experiment only runs on consented traffic. That shrinks your usable sample, lengthens your test duration, and can bias the sample toward consent-happy users, so factor it into the sample-size and duration figures inside your hypothesis. The detail is on the ICO’s cookies and similar technologies guidance. One thing to watch: the Data (Use and Access) Act 2025 has now changed the picture, raising the maximum PECR penalty in line with UK GDPR and carving out a narrow exemption for certain low-risk analytics cookies (set transparently, with an opt-out), while advertising and most tracking cookies still need consent. Check the current ICO position before you rely on any of this.
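The effect of consent on duration is plain arithmetic: only the consenting share of visitors can enter the test. A minimal sketch, where the function name and the 60% consent rate are hypothetical placeholders, so use the rate your own consent banner actually achieves.

```python
from math import ceil


def consent_adjusted_days(sample_per_variant: int, variants: int,
                          daily_visitors: int, consent_rate: float) -> int:
    """Days needed when only consenting visitors can be bucketed."""
    if not 0 < consent_rate <= 1:
        raise ValueError("consent_rate must be greater than 0 and at most 1")
    eligible_per_day = daily_visitors * consent_rate
    return ceil(sample_per_variant * variants / eligible_per_day)


# Hypothetical: 53,000 per variant, 2 variants, 4,000 visitors a day.
print(consent_adjusted_days(53_000, 2, 4_000, consent_rate=1.0))   # 27
print(consent_adjusted_days(53_000, 2, 4_000, consent_rate=0.6))   # 45
```

This only corrects the duration. It does nothing about the possible bias toward consent-happy users, which you still need to keep in mind when reading the result.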
A template you can copy
Fill in every bracket before you launch. If you cannot fill one, you are not ready to test.
Because we observed [evidence: what the data or research showed], we believe that [the single change] for [audience / segment] on [page or area] will [increase / decrease] [primary metric], because [behavioural or psychological mechanism]. We will detect a minimum effect of [MDE], requiring [sample size] per variant over roughly [duration, minimum 14 days], at 95% confidence and 80% power. We commit to this sample size and will not stop early.
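If you keep hypotheses in a spreadsheet or a script, the “fill in every bracket” rule can be checked automatically. The sketch below turns the brackets into named placeholders and refuses to render until each one has a value. The placeholder names and the function name are assumptions, and the MDE, sample size, and duration in the example are illustrative figures, not recommendations.

```python
from string import Formatter

TEMPLATE = (
    "Because we observed {evidence}, we believe that {change} for {audience} "
    "on {page} will {direction} {primary_metric}, because {mechanism}. "
    "We will detect a minimum effect of {mde}, requiring {sample_size} per variant "
    "over roughly {duration}, at 95% confidence and 80% power. "
    "We commit to this sample size and will not stop early."
)

# Read the placeholder names straight from the template so the two never drift apart.
REQUIRED = [name for _, name, _, _ in Formatter().parse(TEMPLATE) if name]


def render_hypothesis(values: dict[str, str]) -> str:
    """Fill the template, or fail loudly if any bracket is still empty."""
    missing = [key for key in REQUIRED if not values.get(key, "").strip()]
    if missing:
        raise ValueError(f"Not ready to test, missing: {', '.join(missing)}")
    return TEMPLATE.format(**values)


# Example values; the last three are hypothetical placeholders.
example = {
    "evidence": "34% abandonment at the phone-number field",
    "change": "removing the phone-number field",
    "audience": "all visitors",
    "page": "the lead form",
    "direction": "increase",
    "primary_metric": "form completions",
    "mechanism": "it lowers the perceived cost of submitting the form",
    "mde": "10% relative",
    "sample_size": "53,000 visitors",
    "duration": "28 days",
}
print(render_hypothesis(example))
```

Passing a dictionary with any key missing or blank raises an error that names the gap, which is a cheap gate to put in front of a test launch.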
Frequently asked questions
What is the difference between an A/B test hypothesis and a guess? A guess says what to change. A hypothesis says what to change, who for, what metric will move and in which direction, and why, in a way that could be proven wrong. The “why” and the falsifiability are what make it a hypothesis rather than an idea.
What is the difference between the null and alternative hypothesis? The null hypothesis says your change has no effect (or, in a one-sided test, an effect in the wrong direction). The alternative says it has the predicted effect. You run the test to gather enough evidence to reject the null. If you cannot reject it, the test did not show a winner.
How much traffic do I need to start A/B testing? Enough to reach your required sample size in a reasonable window. As a rough guide, a page converting at 2 to 5% needs around 1,000 to 2,000 conversions per variant to detect a 10 to 20% relative lift at 95% confidence. Use a sample-size calculator with your baseline rate, MDE, confidence and power.
Can I A/B test on a low-traffic site? Yes, with adjustments. Test bigger, bolder changes rather than subtle tweaks, accept a larger minimum detectable effect, limit yourself to two variants, and run the test longer. Small, incremental changes are the ones low-traffic sites cannot measure reliably.
Can I stop a test early once it hits significance? No. Stopping the moment a test flashes significant (peeking) inflates your false-positive rate well above 5%, often to 20 to 30%. Set your sample size and duration in the hypothesis and hold to them, or use a tool with sequential testing or Bayesian methods designed for continuous monitoring.
Why did my A/B test fail or show no winner? A flat or losing result is normal: only about 10% of tests produce a significant uplift. If the hypothesis was well formed, a loss still tells you something, namely that the barrier you targeted was not the real one. Read the segment data, then redirect to a different lever.
Do I need consent to run A/B tests in the UK? In almost all cases, yes. A/B testing and analytics cookies are not exempt under PECR, so they need consent before they fire, meeting the UK GDPR standard. That means your test runs only on consented traffic, which you should account for in your sample-size and duration planning.
Once your hypothesis is sound, the next decisions are which tool will run it and how it handles statistics. See our guide to the best A/B testing tools for 2026 and the broader conversion rate optimisation guide.