How A/B Testing Statistics Actually Work | Beginner Guide

You've probably experienced A/B testing without knowing it. You visit a website and see a green "Sign Up" button. Your friend visits the same site and sees an orange one. That's not an accident — the company is running an experiment. But here's the part most people don't realize: changing the button and watching which one gets more clicks is the easy part. The hard part is knowing whether the difference is real. That's where statistics come in.

What A/B Testing Is Actually Trying to Solve

Imagine you flip a coin ten times and get seven heads. Does that mean the coin is rigged? Maybe — or maybe you just got lucky. The same problem haunts website experiments. If the orange button gets 52% more clicks than the green one over a weekend, is orange genuinely better, or did you just happen to catch more enthusiastic clickers on Saturday afternoon?
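To put a number on the coin example, here is a minimal sketch that adds up the binomial probabilities for a fair coin. The function name `prob_at_least` is just an illustrative choice, not something from a particular library.

```python
from math import comb

def prob_at_least(k: int, n: int, p: float = 0.5) -> float:
    """Probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance that a perfectly fair coin shows 7 or more heads in 10 flips.
p_seven_plus = prob_at_least(7, 10)
print(f"P(at least 7 heads in 10 flips) = {p_seven_plus:.3f}")  # prints 0.172
```

A fair coin does this roughly one time in six, so seven heads on its own is weak evidence of a rigged coin.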

A/B testing uses statistics to answer that question honestly. You split your visitors into two groups at random: Group A sees the original design (called the "control"), and Group B sees the new version (the "variant"). Then you measure a specific outcome — a click, a sign-up, a purchase — and ask: is the difference between these two groups large enough that it's unlikely to be pure chance?
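The random split can be done in several ways. One common pattern, shown here as a sketch rather than as what any particular tool does, is to hash a visitor ID so the same person always lands in the same group. The function name, the experiment label, and the ID format are all assumptions made up for illustration.

```python
import hashlib

def assign_group(user_id: str, experiment: str = "button-color") -> str:
    """Deterministically place a visitor in group A (control) or B (variant).

    Hashing the experiment name together with the user ID gives a split that
    behaves like a coin flip across visitors but is stable for each visitor.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

# The same visitor gets the same answer on every page load.
print(assign_group("user-123"))
```

The point of the hash is consistency: a visitor who saw the orange button yesterday should not see the green one today.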

The Statistical Engine: Null Hypothesis Testing

The method most A/B testing tools use under the hood is called null hypothesis significance testing, or NHST. It sounds intimidating, but the logic is surprisingly intuitive once you see it.

You start by assuming your test changed nothing. This baseline assumption — "the variant performs the same as the control" — is called the null hypothesis. The job of the statistics is to figure out how surprising your observed results would be if that assumption were true.

The measure of that surprise is called a p-value. A p-value is the probability of seeing a result at least as extreme as yours, purely by chance, if the null hypothesis were correct. A small p-value means your data would be very unlikely under pure chance — which is evidence the difference is real.

By convention, a p-value below 0.05 is used to declare a result statistically significant. In practical terms, p < 0.05 means there's less than a 5% chance you'd see a difference at least this large if the variant actually did nothing. That threshold is a convention, not a law of nature, but it's the one you'll see almost everywhere.
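For conversion-style metrics, one standard way to get a p-value is a two-proportion z-test. The sketch below assumes that test and uses invented visitor counts. Real platforms may use a different test, so treat it as an illustration of the idea rather than a description of any specific tool.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null hypothesis both groups share one underlying rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Probability of a z-score at least this far from zero in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Made-up numbers: 500 of 10,000 control visitors convert, 570 of 10,000 variant visitors.
p = two_proportion_p_value(500, 10_000, 570, 10_000)
print(f"p-value = {p:.4f}, significant at 0.05: {p < 0.05}")
```

With these invented counts the p-value falls below 0.05, so the result would be called statistically significant under the usual convention.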

What p < 0.05 Does NOT Mean

This is worth pausing on, because it's the most commonly misunderstood part of A/B testing statistics. A p-value below 0.05 does not mean there's a 95% chance your variant is better. It doesn't tell you how big the improvement is. It only tells you that, assuming nothing changed, data like yours would be unusual. It's a signal to take the result seriously — not a guarantee of truth.
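A quick way to see that a p-value says nothing about size is to compare two invented experiments: a tiny lift measured on a huge audience, and a large lift measured on a small one. The sketch assumes a two-proportion z-test, and all counts are made up.

```python
from math import sqrt
from statistics import NormalDist

def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Tiny lift (5.0% -> 5.1%) with two million visitors per group.
p_tiny_lift = p_value(100_000, 2_000_000, 102_000, 2_000_000)

# Big lift (5% -> 8%) with only 300 visitors per group.
p_big_lift = p_value(15, 300, 24, 300)

print(f"tiny lift, huge sample: p = {p_tiny_lift:.6f}")  # well below 0.05
print(f"big lift, small sample: p = {p_big_lift:.3f}")   # above 0.05
```

The tiny lift clears the significance bar and the big one does not, which is why the p-value should always be read alongside the measured difference itself.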

Why You Can't Just Run a Test Until It Looks Good

Here's a tempting but deeply flawed approach to A/B testing: check results every day, and stop the test as soon as the numbers look favorable. It feels practical. Why keep the experiment running once you have your answer?

The problem is that this approach breaks the math. Running a test and stopping it the moment results look favorable — known as "peeking" — inflates false positive rates, a documented problem called optional stopping. A false positive is when you declare a winner even though the difference was just random noise.

Here's why peeking is so dangerous: if you check results frequently and stop the test any time p dips below 0.05, you dramatically increase your chances of landing on a lucky fluctuation. The p-value was calculated assuming a fixed sample size decided in advance. When you treat it as a live score to watch and react to, it loses its meaning.
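You can watch the inflation happen in a simulation. The sketch below runs many A/A tests, where both groups share the same true conversion rate, so every "winner" is a false positive. It compares stopping at the first significant peek with analyzing once at the end. The parameter values (a 10% conversion rate, a peek every 100 visitors) are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value from a two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # no conversions (or all conversions) so far: nothing to test
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def simulate_false_positives(n_experiments=500, n_per_group=2000,
                             check_every=100, rate=0.10, alpha=0.05, seed=1):
    """Return (false positive rate with peeking, rate with one final analysis)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_experiments):
        conv_a = conv_b = 0
        would_have_stopped = False
        for visitors in range(1, n_per_group + 1):
            # Both groups convert at the same true rate: there is no real effect.
            conv_a += int(rng.random() < rate)
            conv_b += int(rng.random() < rate)
            if visitors % check_every == 0 and not would_have_stopped:
                if p_value(conv_a, visitors, conv_b, visitors) < alpha:
                    would_have_stopped = True  # a peeker declares a winner here
        peeking_hits += int(would_have_stopped)
        fixed_hits += int(p_value(conv_a, n_per_group, conv_b, n_per_group) < alpha)
    return peeking_hits / n_experiments, fixed_hits / n_experiments

peeking_rate, fixed_rate = simulate_false_positives()
print(f"false positives with peeking:    {peeking_rate:.1%}")
print(f"false positives, single analysis: {fixed_rate:.1%}")
```

The single-analysis rate should sit near the promised 5%, while the peeking rate comes out several times higher, even though nothing about the two groups actually differs.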

The fix is to decide your stopping point before the test begins — and to commit to it.

How to Know When You Have Enough Data: Statistical Power

Before you launch an A/B test, you need to figure out how many visitors you'll need to collect. Run the test on too few people, and a real improvement might look like noise. Run it on too many, and you've wasted time and traffic. The calculation that balances this is called a power analysis.

The minimum sample size for a valid A/B test can be calculated before the test begins from three inputs: the significance threshold (typically 0.05), the statistical power (typically set at 80%), and the expected effect size relative to your baseline conversion rate.

Let's unpack the two new terms:

Statistical power (80%) means your test has an 80% chance of detecting a real difference if one truly exists. In other words, if the orange button genuinely is better, you'll catch it eight times out of ten. The other two times, you'd miss it — that's an acceptable error rate by convention.
Expected effect size is your honest guess at how big the improvement might be. If you think the new design will boost conversions from 5% to 5.5%, that's a small effect, and you'll need a lot of data to detect it reliably. If you think it'll jump from 5% to 8%, you'll need far fewer visitors.
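Here is a rough sketch of that calculation using the usual normal-approximation formula for comparing two proportions. Online calculators may use slightly different formulas, so expect their numbers to differ a little. The function name and the default values for the significance threshold and power are illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(baseline: float, expected: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed in EACH group for a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff for significance
    z_power = NormalDist().inv_cdf(power)          # cushion needed for power
    variance = baseline * (1 - baseline) + expected * (1 - expected)
    return ceil((z_alpha + z_power) ** 2 * variance / (expected - baseline) ** 2)

# The two scenarios from the text.
small_effect = sample_size_per_group(0.05, 0.055)  # 5% -> 5.5%
large_effect = sample_size_per_group(0.05, 0.08)   # 5% -> 8%
print(f"5% -> 5.5%: about {small_effect:,} visitors per group")
print(f"5% -> 8%:   about {large_effect:,} visitors per group")
```

Under these assumptions the small effect needs on the order of 31,000 visitors per group, while the large one needs only around 1,000, which is why the effect size guess matters so much.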

Sample size calculators are freely available online, and most A/B testing platforms include them. The key discipline is doing this calculation upfront and not adjusting it mid-test based on what you see.

A Different Approach: Bayesian A/B Testing

The method described above (p-values, null hypotheses, fixed sample sizes) is called the frequentist approach. It's the dominant method in industry, but it has a well-known rival: a school of statistics called Bayesian inference.

Bayesian A/B testing is an alternative to frequentist methods that produces a probability that one variant is better rather than a binary significant/not-significant result.

This difference is actually huge in practice. Instead of asking "is this result statistically significant at p < 0.05?", a Bayesian test might tell you: "there's an 87% probability that the orange button outperforms the green one." That's an answer non-statisticians can immediately reason about. You can weigh that 87% against the cost of rolling out the change and make a judgment call.

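Bayesian methods also fit continuous monitoring more naturally: you can update your probability estimate as new data arrives, which maps onto how teams actually want to work. That is not a free pass, though. If you stop the moment the probability crosses a threshold, you can still end up acting on a lucky fluctuation, so it helps to agree on a decision rule in advance.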

The trade-off is that Bayesian tests require you to specify a prior — a starting belief about how likely improvements are before you see any data. Critics argue that priors introduce subjectivity. Frequentist tests make no such assumption. Both approaches have genuine strengths, and sophisticated teams often understand both.
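To make the Bayesian version concrete, here is a minimal sketch using a Beta prior on each conversion rate and random draws from the resulting posteriors. The flat Beta(1, 1) prior, the function name, and the visitor counts are all assumptions chosen for illustration. A real tool may use a different prior or an exact calculation.

```python
import random

def prob_b_beats_a(conv_a: int, n_a: int, conv_b: int, n_b: int,
                   prior_alpha: float = 1.0, prior_beta: float = 1.0,
                   draws: int = 20_000, seed: int = 0) -> float:
    """Estimate P(variant rate > control rate) by sampling both posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for a conversion rate: Beta(prior + successes, prior + failures).
        rate_a = rng.betavariate(prior_alpha + conv_a, prior_beta + n_a - conv_a)
        rate_b = rng.betavariate(prior_alpha + conv_b, prior_beta + n_b - conv_b)
        wins += int(rate_b > rate_a)
    return wins / draws

# Made-up numbers: 500 of 10,000 control conversions vs 570 of 10,000 variant conversions.
probability = prob_b_beats_a(500, 10_000, 570, 10_000)
print(f"Probability the variant is better: {probability:.1%}")
```

Changing `prior_alpha` and `prior_beta` is where the subjectivity enters: a skeptical prior pulls the answer toward "no difference" until enough data accumulates to overrule it.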

The Tools That Run These Calculations

For many years, product and marketing teams used dedicated A/B testing platforms to handle both the experiment mechanics and the statistical calculations. Google's Optimize platform, which was widely used for A/B testing, was shut down in September 2023, pushing teams toward alternatives like Optimizely and VWO.

Whether you use a dedicated platform or build your own testing infrastructure, the statistical engine underneath follows the same principles: control versus variant, random assignment, a pre-specified sample size, and a decision rule agreed upon before data collection begins.

Putting It All Together

Here's a clean summary of what a statistically honest A/B test actually looks like step by step:

1. Define your metric: exactly what outcome are you measuring (clicks, sign-ups, purchases)?
2. Estimate the effect size: what's the smallest improvement that would actually matter to you?
3. Calculate your sample size: using power (usually 80%) and your effect size estimate, determine how many visitors you need before you can draw conclusions.
4. Run the test: split visitors randomly and collect data without peeking.
5. Analyze once: when you hit your pre-specified sample size, calculate your p-value or Bayesian probability and make your decision.
6. Act accordingly: if the result is significant, ship the winner. If not, treat the result as inconclusive rather than as proof that the variant does nothing, and move on.

The math behind A/B testing exists to protect you from a very human tendency: seeing patterns in noise. Conversion rates bounce around naturally. Your brain will find a story in any data if you stare at it long enough. The statistical framework is a set of rules you agree to follow before you know the answer — precisely so that the answer means something when you get it.

Understanding why those rules exist doesn't just make you a better reader of test results. It makes you a clearer thinker about evidence in general.

Sources

Every factual claim in this article was independently verified against the following sources:

Why 0.05 is the standard significance level in A/B testing — statsig.com
A/B Testing Sample Size Guide 2026 - How to Calculate | ExperimentHQ — experimenthq.io
The Peeking Problem in A/B Testing: Why Early Results Lie | DRIP — dripagency.de
Google Optimize Alternatives in 2026: 4 Honest Picks for Every Budget - Sigmize — sigmize.com
Bayesian vs Frequentist A/B Testing: A Practitioner's Honest Guide | Atticus Li — atticusli.com
