What Is A/B Testing?
A/B testing (split testing) is a controlled experiment comparing two versions of a webpage, email, or product feature. Version A is the control (existing), and Version B is the variation (proposed change). Users are randomly assigned to each group, and their behavior is measured to determine which version performs better.
The key question is whether the observed difference is statistically significant or due to random chance. This calculator uses a two-proportion z-test to determine significance at the 95% confidence level (p < 0.05). A significant result means the observed difference is unlikely to have occurred by chance alone.
Statistical Formula
The z-score measures how many standard deviations the observed difference between the two proportions lies from zero. Using the pooled proportion p = (x_A + x_B) / (n_A + n_B), where x is the number of conversions and n the number of users in each group, the test statistic is z = (p_B − p_A) / sqrt(p(1 − p)(1/n_A + 1/n_B)). The p-value is then calculated from this z-score using the standard normal distribution.
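A minimal sketch of this test in Python, using only the standard library (the function name and the example conversion counts are illustrative, not part of the calculator):

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test with a pooled standard error."""
    p_a = conv_a / n_a                        # control conversion rate
    p_b = conv_b / n_b                        # variation conversion rate
    p_pool = (conv_a + conv_b) / (n_a + n_b)  # pooled proportion under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value
    return z, p_value

# Hypothetical data: 120/2400 conversions (A) vs 160/2400 (B)
z, p = two_proportion_z_test(120, 2400, 160, 2400)
print(f"z = {z:.2f}, p = {p:.4f}")  # significant at 95% confidence if p < 0.05
```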
Sample Size Requirements
| Baseline Rate | Min Detectable Effect | Sample per Group |
|---|---|---|
| 3% | 10% relative | ~87,000 |
| 3% | 20% relative | ~22,000 |
| 10% | 10% relative | ~26,000 |
| 10% | 20% relative | ~6,500 |
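The table's exact assumptions (significance level and power) are not stated, so a given calculator may not reproduce it exactly. A common closed-form estimate, sketched here at α = 0.05 with configurable power, shows how the required sample depends on baseline rate and minimum detectable effect (the function name is illustrative):

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p1, relative_mde, alpha=0.05, power=0.80):
    """Closed-form sample size for a two-sided two-proportion z-test."""
    p2 = p1 * (1 + relative_mde)                   # rate implied by the effect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # e.g. 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return ceil(n)

# 3% baseline, 10% relative effect, 80% power
print(sample_size_per_group(0.03, 0.10))
```

Halving the relative effect roughly quadruples the required sample, which is why small improvements to low baseline rates demand the largest tests.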
Common Mistakes
- Peeking: Checking results before reaching required sample size inflates false positive rates.
- Too many variations: Testing many variants increases false positives without Bonferroni correction.
- Short duration: Tests under one full week miss day-of-week effects.
- Ignoring segments: Overall results may mask opposite effects in different user segments.
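The "too many variations" problem can be quantified: with k independent comparisons each run at α = 0.05, the chance of at least one false positive is 1 − (1 − α)^k. A small sketch of that inflation and the Bonferroni fix (function names are illustrative):

```python
def familywise_false_positive_rate(alpha, k):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha, k):
    """Per-comparison threshold that keeps the familywise rate near alpha."""
    return alpha / k

# Five variants tested against control at alpha = 0.05:
print(familywise_false_positive_rate(0.05, 5))  # ~0.226, not 0.05
print(bonferroni_alpha(0.05, 5))                # test each at p < 0.01 instead
```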
Frequently Asked Questions
What p-value indicates significance?
By convention, p < 0.05 is considered statistically significant: if there were truly no difference between the versions, a result at least this extreme would occur less than 5% of the time. Some organizations use stricter thresholds like p < 0.01.
How long should I run an A/B test?
Run until you reach the required sample size for your desired statistical power (typically 80%). As a minimum, run for at least one full week. Never stop a test early just because you see significance.
What is statistical power?
Statistical power (typically 80%) is the probability of detecting a real effect when one exists. Higher power requires larger sample sizes but reduces the chance of missing a true improvement (Type II error).
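The trade-off described here can be checked numerically. A rough sketch of the power of a two-sided two-proportion z-test for a given per-group sample size (the rates and counts below are illustrative):

```python
from statistics import NormalDist

def power_of_test(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test, n per group."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # critical value, e.g. 1.96
    se = ((p1 * (1 - p1) + p2 * (1 - p2)) / n) ** 0.5
    # Probability the observed z clears the critical value when the true
    # difference really is p2 - p1 (the far tail is negligible and ignored)
    return nd.cdf(abs(p2 - p1) / se - z_alpha)

# Hypothetical: 5% -> 6% conversion, about 8,200 users per group
print(round(power_of_test(0.05, 0.06, 8200), 2))
```

Doubling the sample size does not double the power; past roughly 80%, each additional point of power costs disproportionately more traffic.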