$ run optimization-algorithms.md

How to test a page without wasting traffic

2026 · research · 9 sections

Five families of conversion optimization algorithms — from the decades-old A/B test to contextual bandits — and how they compose into a system that actually learns.

// 00

## The problem

You have a landing page. You think changing the headline might improve conversions. You make the change. How do you know if it worked?

The naive answer: show the new version to everyone and check the numbers next month. The problem: you might have just run an experiment with no control group, no statistical rigor, and no way to separate your change from everything else that happened that month.

The serious answer is a framework. And it turns out there isn't one framework — there are five families of algorithms, each doing a different thing. They're not competing. They compose.

The fundamental tension underneath all of them is the same: explore vs exploit. You need to learn what works (explore) while also making money from what you already know works (exploit). Every algorithm is a different answer to how much of each you should be doing at any given moment.

// 01

## Classical A/B testing

[Frequentist]

Split traffic 50/50. Wait until you have enough visitors. Run a hypothesis test. If the p-value is below 0.05, declare a winner.

This is the oldest approach and still the most widely used. It has two properties that make it appealing: precise effect size estimates and rock-solid statistical guarantees — if you follow the rules.
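"Enough visitors" has a concrete formula. A minimal sketch of the standard pre-test sample size calculation for a two-proportion z-test (the function name and defaults are mine; the normal approximation is the textbook one):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_base, mde, alpha=0.05, power=0.8):
    """Visitors needed per arm to detect an absolute lift of `mde`
    over baseline rate `p_base` with a two-sided two-proportion z-test."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_beta = nd.inv_cdf(power)            # 0.84 for 80% power
    p_bar = p_base + mde / 2              # average rate under the alternative
    return ceil(2 * p_bar * (1 - p_bar) * (z_alpha + z_beta) ** 2 / mde ** 2)

# a 1-point lift on a 5% baseline needs roughly 8,000 visitors per arm
print(sample_size_per_arm(0.05, 0.01))
```

Halving the detectable lift roughly quadruples the required sample, which is why classical tests get expensive for small sites.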

// [interactive demo] experiment · homepage headline. Control: “Build software faster” vs Variant: “Ship in hours, not weeks”, each arm tracking CVR and visitors.
// a 50/50 split test. both variants collect the same data. declare a winner only after reaching the pre-planned sample size.

The rules are strict:

// note: 1. Calculate your required sample size before you start.
// note: 2. Do not look at the results until you reach it.
// note: 3. Run the test once. If you re-run after seeing results, your p-values are meaningless.

Almost nobody follows rule 2. Most people peek at results during the test and stop early when “it looks good.” This inflates your false positive rate from 5% to 20–30%. The math only works if you wait.
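That inflation is easy to demonstrate by simulation. A sketch (names and parameters are mine): run A/A experiments where both arms are identical, peek ten times, and stop at the first "significant" result. Since there is no real difference, every stop is a false positive:

```python
import random
from math import sqrt
from statistics import NormalDist

def peeking_false_positive_rate(trials=500, n_per_arm=2000, peeks=10, p=0.05, seed=1):
    """A/A simulation: both arms convert at rate p, so every
    'significant' early stop is a false positive."""
    z_crit = NormalDist().inv_cdf(0.975)               # nominal 5%, two-sided
    checkpoints = {n_per_arm * (i + 1) // peeks for i in range(peeks)}
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(trials):
        conv_a = conv_b = 0
        for n in range(1, n_per_arm + 1):
            conv_a += rng.random() < p
            conv_b += rng.random() < p
            if n in checkpoints:                       # the peek
                pool = (conv_a + conv_b) / (2 * n)
                se = sqrt(2 * pool * (1 - pool) / n)
                if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
                    false_positives += 1               # stopped on noise
                    break
    return false_positives / trials

# with ten peeks, the realized false positive rate lands far above the nominal 5%
print(peeking_false_positive_rate())
```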

One more thing: a p-value of 0.03 does not mean “3% chance the result is wrong.” It means: “if there were actually no difference between A and B, there's a 3% chance you'd see data this extreme.” That's a subtler claim — and a much weaker one.

Use it when: the decision is permanent (a redesign, new pricing, a major UX change), you need a precise measurement of the effect size, and you have enough traffic to wait.

// 02

## Bayesian A/B testing

[Probabilistic]

Instead of asking “is the difference statistically significant?”, ask “what's the probability that B is better than A?”

You start with a prior belief about each variant's conversion rate — something vague, like “probably between 1% and 20%.” As data comes in, you update that belief using Bayes' theorem. The result is a posterior distribution: a curve showing how likely each conversion rate is, given what you've seen.

The output is more intuitive. “94% probability B beats A” is easier to act on than “p = 0.03.” You can also quantify risk: “if I pick B and I'm wrong, the expected cost is 0.1% conversion.” That's a real business number.
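A minimal sketch of the Beta-Binomial machinery behind those numbers, using flat Beta(1,1) priors and Monte Carlo over the posteriors (function name and counts are mine):

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """P(B > A) and the expected loss of shipping B, under
    independent Beta(1,1) priors updated with the observed counts."""
    rng = random.Random(seed)
    wins, loss = 0, 0.0
    for _ in range(draws):
        pa = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        pb = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += pb > pa
        loss += max(pa - pb, 0)        # cost if B is actually worse
    return wins / draws, loss / draws

p_win, exp_loss = prob_b_beats_a(50, 1000, 70, 1000)
print(f"P(B > A) = {p_win:.1%}, expected loss = {exp_loss:.4f}")
```

With 50/1000 vs 70/1000 conversions, P(B > A) comes out well above 90%, and the expected loss of shipping B is a small fraction of a point of conversion: the "real business number" described above.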

The caveat: your prior matters, especially with small samples. Two teams using different priors on the same data can reach different conclusions. It's not a bug — it's the honest acknowledgment that beliefs enter the analysis. Classical A/B testing has the same problem but hides it better.

Who uses it: VWO (SmartStats), GrowthBook, AB Tasty, Kameleoon. Google Optimize used Bayesian before shutting down.

// 03

## Sequential testing

[Always-valid inference]

This is the modern compromise: get the rigor of frequentist testing, but allow peeking.

The peeking problem in classical A/B testing is real and well-documented. Every time you check your results mid-experiment, you're effectively running another hypothesis test. Over multiple checks, your false positive rate compounds. By the time you've peeked ten times, your nominal 5% threshold is actually closer to 30%.

Sequential testing solves this structurally. There are two flavors:

// note: Group Sequential Tests (GST) — pre-plan when you'll look (e.g. weekly for 4 weeks). Use stricter thresholds at early looks, more lenient ones at later looks. The total false positive rate across all looks stays at 5%. Used by Spotify and Booking.com.
// note: Always-Valid Inference (mSPRT) — check whenever you want, no pre-planning required. Uses “confidence sequences” instead of confidence intervals. Slightly lower power than GST, but total flexibility. Used by Optimizely, Netflix, Uber, and Amplitude.
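A simplified sketch of the mixture idea behind mSPRT, for paired observations under a normal approximation with a N(0, τ) mixing prior on the true lift. The function, its parameters, and the plug-in variance are my simplifications for illustration, not Optimizely's production implementation:

```python
from math import sqrt, exp

def msprt_first_rejection(diffs, sigma2, tau=1e-4, alpha=0.05):
    """Scan a stream of per-pair outcome differences (x_i - y_i) and
    return the first n at which the mixture likelihood ratio crosses
    1/alpha. Always-valid: under no true lift, P(ever crossing) <= alpha,
    no matter how often you check."""
    s = 0.0
    for n, d in enumerate(diffs, start=1):
        s += d
        v = sigma2 + n * tau
        lam = sqrt(sigma2 / v) * exp(tau * s * s / (2 * sigma2 * v))
        if lam >= 1 / alpha:
            return n           # safe to stop and reject "no difference"
    return None                # never significant, at any peek
```

Because the ratio is valid at every n simultaneously, "check whenever you want" is exactly what the loop does.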

The key paper behind always-valid inference is Johari et al., “Always Valid Inference,” published in 2015 and deployed at Optimizely. It fundamentally changed how the industry thinks about experiment monitoring.

If you're building an autonomous system that needs to decide when to stop experiments without human oversight, mSPRT is probably the right statistical foundation. It's the only framework where “check whenever you want” is mathematically safe.

// 04

## Multi-armed bandits

[Explore-exploit in real time]

The name comes from a casino analogy. You have five slot machines, each with a different (unknown) payout rate. You have 1,000 pulls. How do you maximize total winnings?

The naive strategy: pull each machine 100 times (500 pulls of pure exploration), pick the apparent best, and commit the remaining 500 pulls to it. That wastes 400 exploration pulls on machines you could have ruled out far earlier — and if the noisy trial phase picked the wrong winner, the other 500 are wasted too.

A bandit algorithm does better: start by trying each machine a few times, then gradually shift pulls toward the ones that seem best — while occasionally trying the others in case you were wrong about them.

In landing page terms: each variant is a slot machine, each visitor is a pull, a conversion is a payout. The algorithm decides which variant each visitor sees, shifting traffic toward winners in real time.

Three main algorithms, in increasing sophistication:

// note: Epsilon-Greedy — 90% of the time, show the current best. 10% of the time, show a random variant. Simple, but the exploration is equally wasted on clearly bad variants and uncertain ones.
// note: Upper Confidence Bound (UCB) — for each variant, compute: estimated conversion rate + uncertainty bonus. Show the variant with the highest total. Variants you've tested heavily have small bonuses (you know them). Untested variants have large bonuses (they might be great). Elegant and principled.
// note: Thompson Sampling — maintain a probability distribution for each variant's conversion rate. Each time a visitor arrives, sample from each distribution, show the variant whose sample is highest. As data accumulates, distributions narrow, and the best variant dominates automatically.

Thompson Sampling is the industry favorite. Netflix, Optimizely, Statsig, and Google Optimize all use or used it. Its elegance is that uncertainty drives exploration naturally — variants you're unsure about generate high samples more often, so they get tested more. No tuning required.
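A sketch of that Beta-Bernoulli loop (the true rates below are hypothetical and hidden from the algorithm, which only sees conversions):

```python
import random

def thompson_sampling(true_rates, n_visitors=3000, seed=7):
    """Route each visitor to the variant whose sampled conversion
    rate is highest; return the pull counts per variant."""
    rng = random.Random(seed)
    k = len(true_rates)
    wins, losses, pulls = [0] * k, [0] * k, [0] * k
    for _ in range(n_visitors):
        # one draw per variant from its Beta posterior
        samples = [rng.betavariate(1 + wins[i], 1 + losses[i]) for i in range(k)]
        arm = samples.index(max(samples))
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:   # did this visitor convert?
            wins[arm] += 1
        else:
            losses[arm] += 1
    return pulls

# the 10% variant ends up with the bulk of the traffic
print(thompson_sampling([0.04, 0.05, 0.10, 0.05]))
```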

// [interactive demo] Thompson Sampling over four variants, each starting with an equal 25% of traffic; allocation updates as visitors arrive.
// thompson sampling in real time. learned by exploring, shifted traffic as confidence grew.

The key distinction between A/B testing and bandits isn't methodology — it's goal:

| | A/B Test | Bandit |
| --- | --- | --- |
| Goal | Learn the truth | Make money during the test |
| Traffic split | Fixed — usually 50/50 | Dynamic — winners get more |
| Regret | High — half traffic on losers | Low — losers starved early |
| Precision | High — clean causal estimate | Lower — allocation bias |
| Best for | Permanent decisions | Ongoing choices |
// [interactive demo] traffic allocation over 400 visitors: an A/B test holding a fixed 25% per variant vs a Thompson Sampling bandit, each tracking total conversions.
// same visitors, same variants. the bandit earns more because it shifts traffic toward the winner.

They're not substitutes. Use A/B when you need a precise answer for a permanent decision. Use bandits when you want to optimize an ongoing choice and care about revenue during the test.

// 05

## Variance reduction (CUPED)

[Free speed]

CUPED — Controlled-experiment Using Pre-Experiment Data — is an orthogonal technique introduced by Microsoft Research in 2013. It doesn't change what you're testing. It makes the test faster.

Conversion rates are noisy. If your page converts at 5%, detecting a 0.5% improvement requires tens of thousands of visitors to be statistically confident. Most small sites can't get there in any reasonable timeframe.

CUPED's insight: if you know something about each user before the experiment — their past visit frequency, their past conversion history, their device type — you can use that to subtract out noise from the measurement. It's covariate adjustment: regress the outcome on the pre-experiment metric, analyze the residuals.
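The adjustment itself is a few lines. A sketch with simulated data (the covariate and coefficients are made up; the mechanics are the covariate adjustment just described):

```python
import random
from statistics import mean, variance

def cuped_adjust(y, x):
    """CUPED: remove the part of metric y predicted by a pre-experiment
    covariate x (same users, measured before exposure). theta is the
    OLS slope of y on x; the adjusted metric keeps the same mean
    but has lower variance."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov / variance(x)
    return [yi - theta * (xi - mx) for yi, xi in zip(y, x)]

# hypothetical users whose post-experiment engagement correlates
# with their pre-experiment engagement
rng = random.Random(3)
pre = [rng.gauss(10, 2) for _ in range(5000)]
post = [0.5 * p + rng.gauss(0, 1) for p in pre]
adj = cuped_adjust(post, pre)
print(variance(post), variance(adj))   # adjusted variance is substantially lower
```

Lower variance on the same metric is exactly the "free speed": the effect estimate is unchanged, but the noise it sits in is smaller.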

The result: 25–50% reduction in required sample size. A test that needed four weeks now needs two or three. Nearly every major platform has implemented it — Microsoft, Netflix, Booking.com, Airbnb, Statsig, Eppo, GrowthBook.

The caveat for landing pages: CUPED requires pre-experiment data about users. For pages with mostly new or anonymous visitors, you may not have useful covariates. The benefit scales with how much you know about your users before they arrive.

// 06

## Where it gets interesting: contextual bandits

[Personalization at the algorithm level]

Standard bandits find the single best variant for all users. But what if the best headline for a mobile user from India is different from the best headline for a desktop user in the US? What if morning traffic converts on urgency and evening traffic converts on social proof?

Contextual bandits extend standard MABs by incorporating user context — device, location, time of day, traffic source, past behavior — into the decision. Instead of learning “variant B is best,” they learn “variant B is best for mobile users in high-intent sessions, variant C is best for desktop users from organic search.”
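The simplest possible version, assuming discrete context buckets, is one Thompson Sampler per segment. All names and rates below are hypothetical; production systems use algorithms like LinUCB that generalize across contexts rather than splitting them into independent bandits:

```python
import random

def contextual_thompson(true_rates, n_visitors=6000, seed=11):
    """One Beta-Bernoulli Thompson Sampler per context segment.
    true_rates: {segment: [rate per variant]}, hidden from the algorithm."""
    rng = random.Random(seed)
    segments = list(true_rates)
    k = len(true_rates[segments[0]])
    post = {s: [[0, 0] for _ in range(k)] for s in segments}   # [wins, losses]
    shown = {s: [0] * k for s in segments}
    for _ in range(n_visitors):
        s = rng.choice(segments)            # context arrives with the visitor
        samples = [rng.betavariate(1 + w, 1 + l) for w, l in post[s]]
        arm = samples.index(max(samples))
        shown[s][arm] += 1
        converted = rng.random() < true_rates[s][arm]
        post[s][arm][0 if converted else 1] += 1
    return shown

# mobile converts best on variant 0, desktop on variant 1; the bandit learns both
shown = contextual_thompson({"mobile": [0.10, 0.04], "desktop": [0.04, 0.10]})
print(shown)
```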

The production proof points are substantial:

// note: Netflix uses contextual bandits to choose which thumbnail to show for each title, per user. Different users see different artwork for the same show. Engagement improved significantly.
// note: Microsoft deployed a contextual bandit on MSN.com and measured a 26% increase in clicks. One of the most cited production contextual bandit results in the literature.
// note: Expedia uses linear contextual bandits for multivariate web optimization.

The gap between “find the best version for everyone” and “find the best version for each visitor” is a step-function in value. Contextual bandits are the mechanism.

// 07

## The real insight: composition

The most important thing about these five families is that you don't pick one. A production optimization system uses several in concert:

// What to show

Thompson Sampling · Contextual Bandits

Decides which variant each visitor sees. Adapts traffic allocation in real time.

// When to stop

mSPRT · Bayesian posteriors

Validates whether the winner is real. Tells an autonomous system when to ship.

// How to go faster

CUPED · Variance reduction

Subtracts noise using pre-experiment data. 25–50% fewer visitors needed.

// a production system uses all three. top decides. middle validates. bottom accelerates.

The platforms that figured this out first are the ones dominating now. Statsig runs sequential testing and Bayesian analysis on top of Thompson Sampling on top of CUPED. Optimizely's Stats Engine (mSPRT) was deployed in 2015 and is still considered state of the art.

The one company doing something fundamentally different is Evolv AI (formerly Sentient Ascend). They use evolutionary algorithms — not bandits — to search a massive combinatorial space of variants. Their 2018 AAAI paper showed testing 28,800 variant combinations in three weeks. You can't run 28,800 A/B tests sequentially. Evolutionary search navigates that space by treating the variant population as a gene pool and selecting toward conversion.

That's the frontier. Not “which of these 4 headlines converts best” but “across 15 editable elements with 4 options each, what combination maximizes conversion, and how fast can an algorithm find it?”

// 08

## Why I spent time on this

I'm researching the architecture of an autonomous landing page optimization system — something that generates variants, runs experiments, and ships winners without a human in the loop. The statistical engine underneath that system needs to be right.

After going through the literature, the architecture that makes sense is: contextual bandits for real-time traffic allocation, mSPRT for autonomous stopping decisions, and CUPED for faster signal. The evolutionary layer (Evolv's approach) sits above all of this — it's how you search the combinatorial space, not how you validate individual experiments.

None of this is speculative. Every component has been deployed in production by Optimizely, Netflix, Microsoft, or Booking.com. The question is how to compose them into a single autonomous loop, and whether the loop can be fast enough to be useful at the scale of a small business — not just a company running 10,000 experiments a year.

That's what I'm building next.

$