How A/B Testing Statistics Actually Work | Beginner Guide

Imagine you run a website and you want to know whether a green button gets more clicks than an orange one. You could just pick your favorite color and hope for the best. Or you could run an A/B test — and let the math tell you which one actually works. But here's what most beginner explanations skip: the "math" part isn't magic. It's a specific statistical engine with moving parts you can genuinely understand. Once you do, you'll know not just what an A/B test is, but why you can trust (or distrust) its results.

What A/B Testing Actually Is

A/B testing (also called split testing) randomly assigns users to two or more variants and measures a predefined metric to determine which performs better. In plain terms: you split your audience, show each group a different version of something — a button, a headline, a checkout flow — and then count which version produces more of whatever outcome you care about, like purchases or sign-ups.

The word "randomly" is doing a lot of work in that definition. Random assignment is the foundation of the whole enterprise. If you assigned the green button to mobile users and the orange button to desktop users, any difference you saw could be because of the device type, not the color. Randomness is what lets you isolate the variable you actually changed.
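One common way to get assignment that is effectively random but still repeatable is to hash a user identifier. The sketch below is only an illustration: the function name `assign_variant`, the experiment name, and the user IDs are all made up for this example, and real platforms may do this differently.

```python
import hashlib

def assign_variant(user_id, experiment, variants=("orange", "green")):
    """Map a user to a variant pseudo-randomly but repeatably (hypothetical helper)."""
    # Including the experiment name keeps buckets independent across experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]

# Assign 10,000 made-up users and check that the split is roughly even.
counts = {"orange": 0, "green": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}", "button-color")] += 1
print(counts)
```

Because the bucket depends only on the hash, it has nothing to do with device type, time of day, or anything else about the visitor, and the same visitor sees the same button on every visit.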

The "predefined metric" part matters too. You need to decide what you're measuring before you start — clicks, conversions, time on page. Choosing your goal after seeing the data is a bit like deciding what you were aiming for after firing the arrow. It corrupts the result.

The Core Statistical Question: Could This Have Happened by Chance?

Suppose you run your test and find that the green button got a 5% click rate while the orange button got a 4.5% click rate. Green wins, right? Not so fast. With any experiment, there's always a chance that random variation — pure luck in who happened to visit your site that day — created a gap that doesn't reflect any real difference between the buttons.

This is the central question statistics is designed to answer: Is the difference we observed likely to be real, or could it plausibly be noise?
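To get a feel for how much noise alone can do, here is a small simulation sketch. It assumes both buttons have exactly the same true click rate (5%) and 1,000 visitors each; those numbers, and the helper name `click_gap`, are invented for illustration.

```python
import random

def click_gap(n_per_group, true_rate, rng):
    """Simulate two groups with the SAME true click rate; return the gap in clicks."""
    clicks_a = sum(rng.random() < true_rate for _ in range(n_per_group))
    clicks_b = sum(rng.random() < true_rate for _ in range(n_per_group))
    return abs(clicks_a - clicks_b)

rng = random.Random(42)  # fixed seed so the run is repeatable
trials = 1_000
gaps = [click_gap(1_000, 0.05, rng) for _ in range(trials)]

# A gap of 5 clicks per 1,000 visitors equals 0.5 percentage points.
share_large = sum(gap >= 5 for gap in gaps) / trials
print(f"Identical buttons, gap of 0.5 points or more: {share_large:.0%} of runs")
```

With samples this small, a gap as large as the 5% versus 4.5% example shows up in a large share of runs even though nothing real separates the two groups.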

The Null Hypothesis: Assume Nothing Is Going On

The statistical engine starts with a deliberately skeptical assumption called the null hypothesis. It says: "These two variants perform identically. Any difference we observe is just random chance." Your job, as the experimenter, is to collect enough evidence to reject that assumption.

Think of it like a courtroom. The null hypothesis is "innocent until proven guilty." You need evidence strong enough to overturn the default assumption of no difference.

The p-Value: Measuring the Strength of Your Evidence

The main tool for measuring that evidence is the p-value. It's one of the most misunderstood numbers in science, so let's be precise. The p-value tells you: If there were truly no difference between the variants, how likely is it that random chance alone would produce a gap at least this large?

A small p-value means the result would be very surprising if the null hypothesis were true — so surprising that you're willing to say the null hypothesis is probably wrong, and a real difference exists.

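Statistical significance in A/B testing is commonly set at a p-value threshold of 0.05. That means you call a result significant only when a gap at least this large would show up no more than 5% of the time if the variants were truly identical. So if your p-value is 0.03, you can say: "There's only a 3% chance I'd see a gap this big if these buttons were actually identical. I'm confident enough to call this result real." Note what the p-value is not: it is not the probability that the result is due to chance, and it is not the probability that the null hypothesis is true.

If your p-value is 0.4, the evidence is weak: a gap this big would turn up about 40% of the time even if the buttons were identical. That's not convincing evidence of anything.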
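For click rates, one standard way to compute a p-value is a two-proportion z-test. The sketch below uses only Python's standard library; the function name and the visitor counts are assumptions chosen to match the 4.5% versus 5% example, and a real platform may use a different test.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value for the difference between two click rates (z-test)."""
    rate_a = clicks_a / n_a
    rate_b = clicks_b / n_b
    # Under the null hypothesis both variants share one rate, so pool the data.
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Same 4.5% vs 5% click rates, two different (made-up) traffic levels.
p_small = two_proportion_p_value(90, 2_000, 100, 2_000)
p_large = two_proportion_p_value(900, 20_000, 1_000, 20_000)
print(f"2,000 visitors per variant:  p = {p_small:.3f}")
print(f"20,000 visitors per variant: p = {p_large:.3f}")
```

The same half-point gap is unconvincing with 2,000 visitors per variant and crosses the 0.05 threshold with 20,000, which is why the amount of data matters so much.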

Sample Size: Why You Can't Just Test 20 People

Here's a practical truth: statistical significance doesn't appear on command. It depends on how much data you collect. Running a test on 20 visitors and seeing one extra click on the green button tells you almost nothing. With a small sample, random variation can easily swamp any real signal.

Before starting a test, experienced teams calculate a required sample size — the minimum number of observations needed so that, if a real difference of a given size exists, the test has a good chance of detecting it. This concept is called statistical power. A test with low power might miss a real improvement entirely, because the noise drowns out the signal.

The smaller the difference you're trying to detect, the more data you need. Moving a conversion rate from 2% to 2.1% requires vastly more visitors than moving it from 2% to 4%. Understanding this is why serious A/B testing requires patience — you often need to wait weeks to collect enough traffic before the results mean anything.
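A rough version of the sample size calculation can be sketched with a textbook normal-approximation formula for comparing two proportions. The defaults below (a 0.05 significance level and 80% power) are common conventions rather than anything this article prescribes, and the function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist

def required_sample_size(p_base, p_target, alpha=0.05, power=0.8):
    """Approximate visitors needed PER VARIANT to detect p_base -> p_target."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance level
    z_power = NormalDist().inv_cdf(power)          # desired statistical power
    p_bar = (p_base + p_target) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p_base * (1 - p_base) + p_target * (1 - p_target))
    ) ** 2
    return ceil(numerator / (p_target - p_base) ** 2)

big_lift = required_sample_size(0.02, 0.04)     # 2% -> 4%
small_lift = required_sample_size(0.02, 0.021)  # 2% -> 2.1%
print(f"2% -> 4%:   {big_lift:,} visitors per variant")
print(f"2% -> 2.1%: {small_lift:,} visitors per variant")
```

Under these assumptions the large lift needs on the order of a thousand visitors per variant, while the small lift needs hundreds of thousands. The required sample grows roughly with the inverse square of the difference you want to detect.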

The Peeking Problem: Why Checking Early Can Ruin Everything

Here's a trap almost every beginner falls into. You launch your test, check the dashboard the next morning, see that green is winning by a mile, and declare victory. You stop the test early and ship the green button. This feels rational. Why wait if you already have an answer?

But this instinct is statistically dangerous. The "peeking problem" in A/B testing refers to stopping an experiment early when results look promising, which inflates false positive rates — a documented pitfall studied in experimentation literature.

Here's why it goes wrong. Over the course of any experiment, results fluctuate. If you keep checking and stop the moment you see something exciting, you're essentially giving yourself multiple chances to find a false positive. Imagine flipping a fair coin 100 times: at some point along the way, you might see a streak of 7 heads in a row. If you stopped there and declared "this coin is biased," you'd be wrong. The 0.05 threshold only keeps the false positive rate at 5% when you evaluate the result once, at the predetermined end point; every extra look that could trigger a stop pushes the real rate higher.

The fix is discipline: decide your sample size in advance, run the test to completion, and look at the result once. Many platforms now offer "sequential testing" methods that let you check results more often without corrupting the statistics — but that requires specialized approaches, not just repeated glances at the default dashboard.
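The inflation from peeking is easy to see in a simulation. The sketch below runs many A/A tests, where both variants are identical by construction, so every "significant" result is a false positive. The traffic numbers, the peeking schedule, and the helper names are all assumptions made up for this example.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(clicks_a, n_a, clicks_b, n_b):
    """Two-sided p-value from a two-proportion z-test."""
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation yet, so no evidence either way
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (clicks_b / n_b - clicks_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def run_aa_test(rng, rate=0.05, total=2_000, peek_every=100):
    """Return (flagged at any peek, flagged at the planned end) for one A/A test."""
    a = b = 0
    flagged_at_peek = False
    for n in range(1, total + 1):
        a += rng.random() < rate
        b += rng.random() < rate
        if n % peek_every == 0 and p_value(a, n, b, n) < 0.05:
            flagged_at_peek = True  # a peeker would have stopped and shipped here
    return flagged_at_peek, p_value(a, total, b, total) < 0.05

rng = random.Random(7)
results = [run_aa_test(rng) for _ in range(500)]
peeking_rate = sum(peek for peek, _ in results) / len(results)
single_look_rate = sum(end for _, end in results) / len(results)
print(f"False positives with 20 peeks:   {peeking_rate:.1%}")
print(f"False positives with one look:   {single_look_rate:.1%}")
```

Looking once at the end keeps the false positive rate near the intended 5%. Stopping at the first significant peek produces a rate several times higher, even though nothing about the buttons differs.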

An Alternative Approach: Bayesian A/B Testing

Everything described so far is called the frequentist approach — the classical method that frames results as "significant" or "not significant" based on p-values. It's powerful but sometimes frustrating, because it gives you a binary verdict without a natural way to say "how confident are we, exactly?"

There's a different statistical philosophy that has become popular in the industry. Bayesian A/B testing is an alternative statistical approach to the classical (frequentist) method, allowing teams to express results as a probability that one variant is better rather than a binary significant/not-significant verdict.

Instead of asking "can we reject the null hypothesis?", Bayesian testing asks "given the data we've collected, what's the probability that version A is better than version B?" The result might be: "There's a 94% chance the green button outperforms the orange one." That's a much more intuitive way to communicate uncertainty to a product team.

Bayesian methods also handle small sample sizes more gracefully in some situations, and they let you incorporate prior knowledge — for instance, if you've run similar tests before, you can factor in what you already knew. The trade-off is that they require more careful setup and interpretation. Neither approach is universally better; each has contexts where it shines.
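One simple way to produce that kind of probability is a Beta-Binomial model with Monte Carlo sampling. The sketch below assumes a flat Beta(1, 1) prior, meaning no prior knowledge at all; the function name and the click counts are invented for illustration, and commercial tools may use different priors or models.

```python
import random

def prob_b_beats_a(clicks_a, n_a, clicks_b, n_b, draws=20_000, seed=0):
    """Estimate P(rate_b > rate_a) by sampling from each variant's Beta posterior."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Flat Beta(1, 1) prior + observed data -> Beta(1 + clicks, 1 + misses).
        sample_a = rng.betavariate(1 + clicks_a, 1 + n_a - clicks_a)
        sample_b = rng.betavariate(1 + clicks_b, 1 + n_b - clicks_b)
        wins += sample_b > sample_a
    return wins / draws

# Orange: 90 clicks from 2,000 visitors. Green: 100 clicks from 2,000 visitors.
chance_green_wins = prob_b_beats_a(90, 2_000, 100, 2_000)
print(f"Probability green beats orange: {chance_green_wins:.0%}")
```

To fold in what earlier tests taught you, you would replace the two 1s with prior parameters that reflect that history. Choosing them well is part of the careful setup the Bayesian approach demands.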

The Tool Landscape: Where Teams Actually Run These Tests

For years, many teams ran their A/B tests on Google Optimize, one of the most widely used testing platforms. Google sunset the product in September 2023, pushing many teams toward alternatives like VWO, Optimizely, or open-source solutions. This shift is a good reminder that the statistical principles underlying A/B testing are separate from any particular tool: the concepts of p-values, sample size, and the peeking problem apply regardless of which software you use.

Open-source options have also become more viable, letting teams run experiments directly in their own infrastructure with full control over the statistical methods applied.

Putting It All Together: What a Rigorous Test Actually Looks Like

A trustworthy A/B test follows a clear sequence:

1. Define the metric. Decide what you're measuring before you touch anything else.
2. Calculate required sample size. Determine how many visitors you need based on the minimum difference you'd care about detecting.
3. Randomly assign users. Each visitor is assigned to a variant at random, keeping the groups comparable.
4. Run to completion. Don't stop early. Don't peek and react.
5. Evaluate the result. Apply your statistical test. Did you cross the significance threshold? What does the confidence interval look like?
6. Decide and document. Ship the winner if the evidence is strong, or accept that neither variant was detectably better, which is also a valid, useful answer.
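The evaluation step mentions a confidence interval, so here is one way to compute it for the difference between two click rates. This is a sketch using the simple normal-approximation (Wald) interval; the function name and the counts are assumptions for illustration, and other interval methods exist.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(clicks_a, n_a, clicks_b, n_b, level=0.95):
    """Normal-approximation interval for (rate_b - rate_a)."""
    rate_a = clicks_a / n_a
    rate_b = clicks_b / n_b
    # Unpooled standard error: each variant keeps its own observed rate.
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = rate_b - rate_a
    return diff - z * se, diff + z * se

# Orange: 900 of 20,000 clicked. Green: 1,000 of 20,000 clicked.
low, high = diff_confidence_interval(900, 20_000, 1_000, 20_000)
print(f"Green minus orange: {low:+.2%} to {high:+.2%}")
```

An interval that excludes zero agrees with a significant result, and its width tells you how precisely the size of the improvement has been pinned down, which the p-value alone does not.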

Why This All Matters Beyond Button Colors

The same statistical engine that determines which button color works better is used to test pricing pages, onboarding flows, email subject lines, recommendation algorithms, and medical treatments. The domain changes; the logic doesn't.

Understanding the mechanics — random assignment, the null hypothesis, p-values, sample size, the peeking problem — gives you something valuable: the ability to read a test result critically. You can ask the right questions. Was the sample size large enough? Did they peek? Was the metric chosen in advance? Those questions separate rigorous experiments from ones that just look like rigorous experiments.

A/B testing, done properly, turns guesswork into evidence. But "done properly" has a specific meaning — and now you know what that meaning is.

Sources

Every factual claim in this article was independently verified against the following sources:

A/B testing - Wikipedia — en.wikipedia.org
Statistical Significance Calculator for A/B Testing — surveymonkey.com
Google Optimize Alternatives in 2026: The Honest Picks — goprecision.co
Where Experimentation goes wrong | GrowthBook Docs — docs.growthbook.io
Bayesian vs. Frequentist A/B Testing: What's the Difference? — cxl.com
