# How to test a page without wasting traffic
2026 · research · 9 sections
Five families of conversion optimization algorithms — from the 50-year-old A/B test to contextual bandits — and how they compose into a system that actually learns.
## The problem
You have a landing page. You think changing the headline might improve conversions. You make the change. How do you know if it worked?
The naive answer: show the new version to everyone and check the numbers next month. The problem: you might have just run an experiment with no control group, no statistical rigor, and no way to separate your change from everything else that happened that month.
The serious answer is a framework. And it turns out there isn't one framework — there are five families of algorithms, each doing a different thing. They're not competing. They compose.
The fundamental tension underneath all of them is the same: explore vs exploit. You need to learn what works (explore) while also making money from what you already know works (exploit). Every algorithm is a different answer to how much of each you should be doing at any given moment.
## Classical A/B testing
Split traffic 50/50. Wait until you have enough visitors. Run a hypothesis test. If the p-value is below 0.05, declare a winner.
This is the oldest approach and still the most widely used. It has two properties that make it appealing: precise effect-size estimates and rock-solid statistical guarantees — if you follow the rules.
The rules are strict:

1. Decide the sample size in advance, using a power calculation.
2. Don't look at the results until that sample size is reached.
3. Pick one primary metric up front and judge the test on it alone.
Almost nobody follows rule 2. Most people peek at results during the test and stop early when “it looks good.” This inflates your false positive rate from 5% to 20–30%. The math only works if you wait.
One more thing: a p-value of 0.03 does not mean “3% chance the result is wrong.” It means: “if there were actually no difference between A and B, there's a 3% chance you'd see data this extreme.” That's a subtler claim — and a much weaker one.
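To make the mechanics concrete, here's a minimal fixed-horizon sketch in Python. The formulas are the standard normal approximations for two proportions; the conversion counts are made up.

```python
from scipy.stats import norm

def required_n(p_base, lift, alpha=0.05, power=0.8):
    """Visitors per arm needed to detect an absolute lift (normal approximation)."""
    p_new = p_base + lift
    z_alpha, z_power = norm.ppf(1 - alpha / 2), norm.ppf(power)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return int((z_alpha + z_power) ** 2 * variance / lift**2) + 1

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - norm.cdf(abs(z)))

n = required_n(0.05, 0.005)                 # ~31,000 visitors per arm for 5% -> 5.5%
print(n)
print(two_proportion_p_value(1550, n, 1725, n))   # analyze once, at the end
```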
Use it when: the decision is permanent (a redesign, new pricing, a major UX change), you need a precise measurement of the effect size, and you have enough traffic to wait.
## Bayesian A/B testing
Instead of asking “is the difference statistically significant?”, ask “what's the probability that B is better than A?”
You start with a prior belief about each variant's conversion rate — something vague, like “probably between 1% and 20%.” As data comes in, you update that belief using Bayes' theorem. The result is a posterior distribution: a curve showing how likely each conversion rate is, given what you've seen.
The output is more intuitive. “94% probability B beats A” is easier to act on than “p = 0.03.” You can also quantify risk: “if I pick B and I'm wrong, the expected cost is 0.1% conversion.” That's a real business number.
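The whole analysis fits in a few lines with a Beta-Binomial model. This is a sketch with made-up counts and a deliberately vague Beta(1, 1) prior:

```python
import numpy as np

rng = np.random.default_rng(0)
# Beta(1, 1) prior: "anywhere between 0% and 100%, no strong opinion."
# Illustrative counts: A converted 520/10,000, B converted 580/10,000.
posterior_a = rng.beta(1 + 520, 1 + 9_480, size=100_000)
posterior_b = rng.beta(1 + 580, 1 + 9_420, size=100_000)

print("P(B beats A):", (posterior_b > posterior_a).mean())
# Expected cost of shipping B if B is actually worse: the risk number.
print("expected loss:", np.maximum(posterior_a - posterior_b, 0).mean())
```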
The caveat: your prior matters, especially with small samples. Two teams using different priors on the same data can reach different conclusions. It's not a bug — it's the honest acknowledgment that beliefs enter the analysis. Classical A/B testing has the same problem but hides it better.
Who uses it: VWO (SmartStats), GrowthBook, AB Tasty, Kameleoon. Google Optimize used Bayesian before shutting down.
## Sequential testing
This is the modern compromise: get the rigor of frequentist testing, but allow peeking.
The peeking problem in classical A/B testing is real and well-documented. Every time you check your results mid-experiment, you're effectively running another hypothesis test. Over multiple checks, your false positive rate compounds. By the time you've peeked ten times, your nominal 5% threshold is actually closer to 30%.
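The inflation is easy to reproduce. Here's a toy simulation under a true null (both variants convert at 5%), with illustrative batch sizes:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def stops_early(looks=10, n_per_look=2_000, p=0.05, alpha=0.05):
    """Run one null A/A experiment, peeking after every batch of visitors."""
    conv_a = conv_b = n = 0
    for _ in range(looks):
        conv_a += rng.binomial(n_per_look, p)
        conv_b += rng.binomial(n_per_look, p)
        n += n_per_look
        pooled = (conv_a + conv_b) / (2 * n)
        se = (pooled * (1 - pooled) * 2 / n) ** 0.5
        z = abs(conv_b - conv_a) / n / se
        if 2 * (1 - norm.cdf(z)) < alpha:
            return True          # "looks good", stopped early
    return False

# False positive rate across 2,000 simulated null experiments: well above 5%.
print(sum(stops_early() for _ in range(2_000)) / 2_000)
```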
Sequential testing solves this structurally. There are two flavors:

- **Group sequential tests**: pre-register a fixed number of interim looks and pay for each with a stricter significance threshold (alpha-spending boundaries in the O'Brien-Fleming style).
- **Always-valid inference**: p-values and confidence intervals constructed so the guarantee holds at every moment, which means you can check continuously and stop whenever you like.
The key paper behind always-valid inference is Johari et al., “Always Valid Inference,” published in 2015 and deployed at Optimizely. It fundamentally changed how the industry thinks about experiment monitoring.
If you're building an autonomous system that needs to decide when to stop experiments without human oversight, mSPRT is probably the right statistical foundation. It's the only framework where “check whenever you want” is mathematically safe.
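Here's a sketch of the core object for a normal mean, following the mixture form in Johari et al. The mixture width `tau2` is a tuning choice, and applying this to conversion data requires a normal approximation of the difference in rates:

```python
import numpy as np

def msprt_stop(xs, theta0=0.0, sigma2=1.0, tau2=1.0, alpha=0.05):
    """Mixture SPRT for a normal mean. Peeking after every observation
    is safe: reject whenever the mixture likelihood ratio reaches 1/alpha."""
    total = 0.0
    for n, x in enumerate(xs, start=1):
        total += x
        xbar = total / n
        lam = np.sqrt(sigma2 / (sigma2 + n * tau2)) * np.exp(
            n**2 * tau2 * (xbar - theta0) ** 2
            / (2 * sigma2 * (sigma2 + n * tau2))
        )
        if lam >= 1 / alpha:
            return f"reject after {n} observations"
    return "no decision yet"

rng = np.random.default_rng(2)
print(msprt_stop(rng.normal(0.5, 1.0, 500)))  # real effect: stops early
print(msprt_stop(rng.normal(0.0, 1.0, 500)))  # true null: usually no decision
```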
## Multi-armed bandits
The name comes from a casino analogy. You have five slot machines, each with a different (unknown) payout rate. You have 1,000 pulls. How do you maximize total winnings?
The naive strategy: pull each machine 100 times, pick the apparent best, and spend the remaining 500 pulls on it. That wastes 400 pulls on machines that turn out to be worse.
A bandit algorithm does better: start by trying each machine a few times, then gradually shift pulls toward the ones that seem best — while occasionally trying the others in case you were wrong about them.
In landing page terms: each variant is a slot machine, each visitor is a pull, a conversion is a payout. The algorithm decides which variant each visitor sees, shifting traffic toward winners in real time.
Three main algorithms, in increasing sophistication:

1. **Epsilon-greedy**: show the current best variant most of the time; with probability ε, show a random one.
2. **UCB (Upper Confidence Bound)**: rank variants by estimated rate plus an optimism bonus that shrinks as data accumulates, and show the top-ranked one.
3. **Thompson Sampling**: maintain a posterior over each variant's rate, draw one sample from each, and show the variant with the highest draw.
Thompson Sampling is the industry favorite. Netflix, Optimizely, Statsig, and Google Optimize all use or used it. Its elegance is that uncertainty drives exploration naturally — variants you're unsure about generate high samples more often, so they get tested more. No tuning required.
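The entire algorithm is a few lines. A toy sketch against simulated visitors, with made-up conversion rates:

```python
import numpy as np

rng = np.random.default_rng(3)
true_rates = [0.040, 0.050, 0.062, 0.050]    # unknown to the algorithm
wins = np.ones(4)                             # Beta(1, 1) prior per variant
losses = np.ones(4)

for _ in range(20_000):
    draws = rng.beta(wins, losses)            # one sample per posterior
    arm = int(np.argmax(draws))               # uncertain variants win draws often
    converted = rng.random() < true_rates[arm]    # stand-in for a live visitor
    wins[arm] += converted
    losses[arm] += 1 - converted

print("traffic share per variant:", (wins + losses - 2) / 20_000)
```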
The key distinction between A/B testing and bandits isn't methodology — it's the goal:
| | A/B Test | Bandit |
|---|---|---|
| Goal | Learn the truth | Make money during the test |
| Traffic split | Fixed — usually 50/50 | Dynamic — winners get more |
| Regret | High — half traffic on losers | Low — losers starved early |
| Precision | High — clean causal estimate | Lower — allocation bias |
| Best for | Permanent decisions | Ongoing choices |
*Figure: simulated traffic allocation. An A/B test holds each of four variants at a fixed 25%; Thompson Sampling shifts traffic toward the winner.*
They're not substitutes. Use A/B when you need a precise answer for a permanent decision. Use bandits when you want to optimize an ongoing choice and care about revenue during the test.
## Variance reduction (CUPED)
CUPED — Controlled-experiment Using Pre-Experiment Data — is an orthogonal technique invented by Microsoft Research in 2013. It doesn't change what you're testing. It makes the test faster.
Conversion rates are noisy. If your page converts at 5%, detecting a 0.5-percentage-point improvement (5% to 5.5%) requires tens of thousands of visitors per arm to be statistically confident. Most small sites can't get there in any reasonable timeframe.
CUPED's insight: if you know something about each user before the experiment — their past visit frequency, their past conversion history, their device type — you can use that to subtract out noise from the measurement. It's covariate adjustment: regress the outcome on the pre-experiment metric, analyze the residuals.
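The adjustment itself is one line of algebra. A sketch with a synthetic covariate standing in for pre-experiment behavior:

```python
import numpy as np

def cuped_adjust(y, x):
    """Remove the part of the outcome explained by a pre-experiment covariate."""
    theta = np.cov(y, x)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())

rng = np.random.default_rng(4)
x = rng.poisson(5, size=10_000).astype(float)     # e.g. past visit counts
y = 0.1 * x + rng.normal(0, 0.3, size=10_000)     # outcome correlated with x
y_adj = cuped_adjust(y, x)

# Variance ratio is 1 - corr(x, y)^2: here roughly 0.64, i.e. ~36% less noise.
print(np.var(y_adj) / np.var(y))
```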
The result: 25–50% reduction in required sample size. A test that needed four weeks now needs two or three. Nearly every major platform has implemented it — Microsoft, Netflix, Booking.com, Airbnb, Statsig, Eppo, GrowthBook.
The caveat for landing pages: CUPED requires pre-experiment data about users. For pages with mostly new or anonymous visitors, you may not have useful covariates. The benefit scales with how much you know about your users before they arrive.
## Where it gets interesting: contextual bandits
Standard bandits find the single best variant for all users. But what if the best headline for a mobile user from India is different from the best headline for a desktop user in the US? What if morning traffic converts on urgency and evening traffic converts on social proof?
Contextual bandits extend standard MABs by incorporating user context — device, location, time of day, traffic source, past behavior — into the decision. Instead of learning “variant B is best,” they learn “variant B is best for mobile users in high-intent sessions, variant C is best for desktop users from organic search.”
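LinUCB, the canonical algorithm from Li et al. (2010), is compact enough to sketch. The feature encoding and `alpha` below are assumptions for illustration, not anyone's production setup:

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB (Li et al., 2010): one ridge regression per variant."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # Gram matrix + ridge
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                # per-arm coefficient estimate
            # predicted reward plus an optimism bonus for unfamiliar contexts
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

bandit = LinUCB(n_arms=3, dim=4)
x = np.array([1.0, 1.0, 0.0, 0.3])   # e.g. [bias, is_mobile, is_paid, hour/24]
arm = bandit.choose(x)               # which headline this visitor sees
bandit.update(arm, x, reward=1.0)    # 1.0 = converted, 0.0 = bounced
```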
The production proof points are substantial: Yahoo's front-page news recommendation ran on LinUCB, Netflix selects title artwork per member with contextual bandits, and Microsoft productized the approach as the Decision Service that became Azure Personalizer, built on Vowpal Wabbit.
The gap between “find the best version for everyone” and “find the best version for each visitor” is a step-function in value. Contextual bandits are the mechanism.
## The real insight: composition
The most important thing about these five families is that you don't pick one. A production optimization system uses several in concert:
- **Bandits: what to show.** Decides which variant each visitor sees. Adapts traffic allocation in real time.
- **Sequential testing: when to stop.** Validates whether the winner is real. Tells an autonomous system when to ship.
- **CUPED: how to go faster.** Subtracts noise using pre-experiment data. 25–50% fewer visitors needed.
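Here's a toy end-to-end loop: Thompson Sampling allocates traffic while an mSPRT-style check decides when to ship. This is an architectural sketch, not any vendor's implementation; the threshold and `tau2` are assumptions, CUPED is omitted for brevity, and a real system would correct for the allocation bias the bandit introduces into the test statistic.

```python
import numpy as np

rng = np.random.default_rng(5)
true_rates = [0.050, 0.060]                  # unknown to the system
wins, losses = np.ones(2), np.ones(2)

def mixture_lr(diff, se2, tau2=0.01):
    """mSPRT-style mixture likelihood ratio for a normal estimate of the lift."""
    return np.sqrt(se2 / (se2 + tau2)) * np.exp(
        diff**2 * tau2 / (2 * se2 * (se2 + tau2)))

for visitor in range(1, 200_001):
    arm = int(np.argmax(rng.beta(wins, losses)))     # bandit: what to show
    converted = rng.random() < true_rates[arm]
    wins[arm] += converted
    losses[arm] += 1 - converted
    if visitor % 1_000 == 0:                         # sequential: when to stop
        n = wins + losses - 2
        p = (wins - 1) / n
        se2 = p[0] * (1 - p[0]) / n[0] + p[1] * (1 - p[1]) / n[1]
        if se2 > 0 and mixture_lr(p[1] - p[0], se2) >= 20:   # 1/alpha, alpha=0.05
            print(f"ship variant {int(np.argmax(p))} after {visitor:,} visitors")
            break
else:
    print("no decision within the traffic budget")
```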
The platforms that figured this out first are the ones dominating now. Statsig runs sequential testing and Bayesian analysis on top of Thompson Sampling on top of CUPED. Optimizely's Stats Engine (mSPRT) was deployed in 2015 and is still considered state of the art.
The one company doing something fundamentally different is Evolv AI (formerly Sentient Ascend). They use evolutionary algorithms — not bandits — to search a massive combinatorial space of variants. Their 2018 AAAI paper showed testing 28,800 variant combinations in three weeks. You can't run 28,800 A/B tests sequentially. Evolutionary search can navigate that space by treating the variant population as a gene pool and selecting toward conversion.
That's the frontier. Not “which of these 4 headlines converts best” but “across 15 editable elements with 4 options each, what combination maximizes conversion, and how fast can an algorithm find it?”
## Why I spent time on this
I'm researching the architecture of an autonomous landing page optimization system — something that generates variants, runs experiments, and ships winners without a human in the loop. The statistical engine underneath that system needs to be right.
After going through the literature, the architecture that makes sense is: contextual bandits for real-time traffic allocation, mSPRT for autonomous stopping decisions, and CUPED for faster signal. The evolutionary layer (Evolv's approach) sits above all of this — it's how you search the combinatorial space, not how you validate individual experiments.
None of this is speculative. Every component has been deployed in production by Optimizely, Netflix, Microsoft, or Booking.com. The question is how to compose them into a single autonomous loop, and whether the loop can be fast enough to be useful at the scale of a small business — not just a company running 10,000 experiments a year.
That's what I'm building next.