A/B Testing Cheat Sheet
This comprehensive guide serves as a quick reference for various concepts, steps, and techniques in A/B
tests. With this guide, you will be equipped with the knowledge and tools necessary to answer interview
questions related to A/B testing.
Table of Contents
The Basics of A/B Tests
Selecting Metrics for Experimentation (video)
Selecting Randomization Units (video)
General Considerations
Different Choices of Randomization Units
Randomization Unit vs. Unit of Analysis
Choosing a Target Population
Computing Sample Size (video)
Determine Test Duration
Analyzing Results (video)
Sanity Checks
Hypothesis Tests (video)
Statistical and Practical Significance
Common Problems and Pitfalls (video)
Alternatives to A/B Tests
Selecting Metrics for Experimentation
Goal Metrics (Success Metrics)
A single metric, or a very small set of metrics, that captures the ultimate success you are striving towards.
Stable: it should not be necessary to update goal metrics every time you launch a new feature.
Driver Metrics
Reflect hypotheses on the drivers of success and indicate whether we are moving in the right direction to
move the goal metrics.
Actionable
Resistant to gaming
User funnel
As a rough rule of thumb, limit yourself to about 5 key metrics (success and driver metrics). When dealing
with a lot of metrics, an OEC (Overall Evaluation Criterion), a combination of multiple key metrics,
can be used. Devising an OEC makes the tradeoffs explicit and makes the exact definition of
success clear. The OEC can be a weighted sum of normalized metrics (each normalized to a
predefined range, say 0-1), as in the sketch below.
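To make the weighted-sum OEC concrete, here is a minimal Python sketch; the metric names, normalization ranges, and weights are hypothetical examples, not recommendations from this guide.

```python
# Toy OEC: weighted sum of key metrics, each normalized to [0, 1].
# Metric names, ranges, and weights below are hypothetical examples.

def normalize(value, lo, hi):
    """Min-max normalize a metric to [0, 1] given a predefined range."""
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))

def oec(metrics, ranges, weights):
    """Weighted sum of normalized metrics."""
    return sum(weights[m] * normalize(v, *ranges[m]) for m, v in metrics.items())

metrics = {"conversion_rate": 0.031, "revenue_per_user": 4.2, "dau_ratio": 0.55}
ranges  = {"conversion_rate": (0.0, 0.05), "revenue_per_user": (0.0, 10.0), "dau_ratio": (0.0, 1.0)}
weights = {"conversion_rate": 0.5, "revenue_per_user": 0.3, "dau_ratio": 0.2}

print(round(oec(metrics, ranges, weights), 3))  # single number summarizing success
```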
Organizational Guardrails
Ensure we move towards success with the right balance and without violating important
constraints
E.g., website/app performance: latency (wait time for pages to load), error logs (number of
error messages), client crashes (number of crashes per user); business goals: revenue
Trust-related guardrails
E.g., the Sample Ratio Mismatch (SRM) guardrail, and checking that the cache hit ratio is the same
between Control and Treatment.
Selecting Randomization Units
General Considerations
1. Visibility of the change
For changes visible to users, we should use a user ID or a cookie as the randomization unit.
For changes invisible to users, e.g., a change in latency, it depends on what we want to measure. A
user ID or a cookie is still a good option if we want to see what happens over time.
2. Variability
If the randomization unit is the same as the unit of analysis, the empirically computed variability is
similar to the analytically computed variability.
If the randomization unit is coarser than the unit of analysis, e.g., the randomization unit is the user
and we wish to analyze the click-through rate (the unit of analysis is a page view), the variability of
the metric will be much higher. This is because the independence assumption is invalid: we are
dividing groups of correlated units, which increases the variability (see the simulation sketch after this list).
3. Ethical considerations
May face security and confidentiality issues when using identifiable randomization units.
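To illustrate the variability point from consideration 2 above, here is a small simulation sketch (all rates and sizes are made up): when page views are correlated within users and randomization is by user, a standard error computed as if page views were independent understates the true run-to-run variability.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, views_per_user = 2_000, 10

def experiment_ctr():
    # Heterogeneous click propensities across users (mean CTR ~ 0.10):
    # page views within a user are correlated because they share the user's rate.
    user_rates = rng.beta(0.5, 4.5, size=n_users)
    clicks = rng.binomial(views_per_user, user_rates)
    return clicks.sum() / (n_users * views_per_user)   # page-view-level CTR

# Naive SE that (wrongly) treats every page view as independent
ctr = experiment_ctr()
naive_se = np.sqrt(ctr * (1 - ctr) / (n_users * views_per_user))

# Empirical variability of the CTR across many user-randomized experiments
empirical_se = np.std([experiment_ctr() for _ in range(500)])

print(f"naive SE: {naive_se:.5f}   empirical SE: {empirical_se:.5f}")
# The empirical SE is noticeably larger, so page-view-level formulas understate the variance.
```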
Different Choices of Randomization Units
User ID-Based: Every signed-in user is a randomization unit.
Pros: It allows for long-term measurements, such as user retention and users' learning effect.
Cons: Identifiable.
Cookie-Based: Every cookie (anonymous ID) is a randomization unit.
Pros: Anonymous.
Session-Based (or Page View-Based): Every user session is a randomization unit. A session starts
when a user logs in and ends when a user logs out or after 30 min of inactivity.
Pros: Finer level of granularity creates more units, and the test will have more power to detect
smaller changes.
Cons: May lead to inconsistent user experience, so it’s appropriate when changes are not visible
to the user
IP-Based: Every IP address is a randomization unit. Every device on a network is assigned an
IP address.
Pros: May be the only option for certain experiments, e.g., testing latency using one hosting service
versus another
Cons: Changes when users change places, creating an inconsistent experience. Many users may
share the same IP address. Therefore, not recommended unless it’s the only option.
Randomization Unit vs. Unit of Analysis
The general recommendation is that the randomization unit be the same as (or coarser than) the
unit of analysis.
e.g., the randomization unit is the user and we analyze the click-through rate (the unit of analysis
is a page view).
The caveat is that in this case, we need to pay attention to the variability of the unit of analysis as
explained earlier.
It does not work if the randomization unit is finer than the unit of analysis.
e.g., the randomization unit is a page view and we analyze user-level metrics.
This is because the user’s experience is likely to include a mix of variants (i.e., some in Control
and some in Treatment), and computing user-level metrics will not be meaningful.
Choosing a Target Population
Consider geographic region, platform (mobile vs. tablet vs. laptop), device type, user demographics
(age, gender, country, etc.), usage or engagement level (analyze the user journey), etc.
Be careful if you select users based on usage and your treatment affects usage. This violates
the stable unit treatment value assumption.
Computing Sample Size
α (Type I error): rejecting the null hypothesis when it is actually true, i.e., incorrectly rejecting the
null hypothesis (a false positive).
Significance level: the probability that we reject H0 even when the treatment has no effect, i.e., the
probability of committing a Type I error (α).
β (Type II error): failing to reject the null hypothesis when the alternative hypothesis is true, i.e.,
incorrectly accepting the null hypothesis (a false negative).
Statistical power (1 − β): the probability that we reject H0 when the treatment indeed has an effect. This
measures how sensitive the experiment is. If power is too low, we can't detect true effects; if it is set
unrealistically high (e.g., 0.99), we may never finish the experiment.
Variances:
Because the samples are independent, Var(Δ) = Var(Ȳ_t) + Var(Ȳ_c), where Δ = Ȳ_t − Ȳ_c is the difference
between the Treatment average and the Control average. Variances are often estimated either
from historical data or from A/A tests.
Determine Test Duration
Test Duration = Sample Size / Randomization Units per Day
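A minimal sketch of a standard sample-size calculation feeding the duration rule above. The baseline rate, minimum detectable effect, and daily traffic are hypothetical, and the per-group formula n = (z_(1−α/2) + z_(1−β))² · (σ_t² + σ_c²) / δ² is the common textbook version rather than one prescribed by this guide.

```python
from scipy.stats import norm

# Hypothetical inputs
alpha, power = 0.05, 0.80
p_baseline = 0.10          # control conversion rate (e.g., from historical data)
mde = 0.01                 # minimum detectable effect (absolute), i.e., delta
units_per_day = 5_000      # randomization units entering the experiment per day

# Variance of a Bernoulli metric in each group
var_c = p_baseline * (1 - p_baseline)
var_t = (p_baseline + mde) * (1 - p_baseline - mde)

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

# Per-group sample size: n = (z_{1-a/2} + z_{1-b})^2 * (var_t + var_c) / delta^2
n_per_group = (z_alpha + z_beta) ** 2 * (var_t + var_c) / mde ** 2

total_n = 2 * n_per_group
duration_days = total_n / units_per_day   # Test Duration = Sample Size / Units per Day

print(f"n per group ~ {n_per_group:,.0f}, total ~ {total_n:,.0f}, duration ~ {duration_days:.1f} days")
```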
Ramp-up plan:
Mitigate risk (0-5%): start with team members, company employees, loyal users, etc., in case of bugs
or other risks; these people tend to be more forgiving.
Long-term holdout (optional): be aware of opportunity costs and ethics, because held-out users won't
enjoy new features for a while.
Analyzing Results
Sanity Checks
Sample Ratio Mismatch (SRM): for the study population, we want 50% in the treatment and
50% in the control. If our study population was 1,000 with 800 in the treatment and 200 in the
control, we have a sample ratio mismatch, which indicates a problem with the assignment mechanism.
Normality test: when the sample size is big enough, by the central limit theorem (CLT), the sampling
distribution of μ_t − μ_c should be normally distributed.
Organizational guardrail metrics are used to ensure that the organization's performance follows the
standard we expect.
Website/App performance
Business goals
Engagement: e.g., time spent per user, daily active users (DAU), and page views per user.
Z-test or T-test: Both tests can be used to compare proportions or group means and test for
significant differences between them.
Note: in the test-statistic computation, the "standard deviation" is the standard deviation of the
sampling distribution of the proportion, i.e., the standard error (SE). SE should be used in the
computation instead of SD.
Chi-Squared Test
Example: checking for SRM. Using the chi-squared test as a goodness-of-fit test (see the "fairness
of dice" example on the Wikipedia page) is analogous to testing whether the treatment/control
assignment mechanism is a fair game (it should be 50/50).
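As a concrete version of the SRM example above, here is a minimal chi-squared check in Python (assuming scipy is available); the 800/200 counts come from the sanity-check example.

```python
from scipy.stats import chisquare

# Observed assignment counts vs. the expected 50/50 split (from the SRM example above)
observed = [800, 200]                     # treatment, control
total = sum(observed)
expected = [total * 0.5, total * 0.5]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.3g}")

# A tiny p-value means the 50/50 assignment assumption is violated (an SRM),
# so fix the assignment mechanism before analyzing any metrics.
```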
These failures should be a priority concern before moving on to analyzing the data. Is this a
one-time issue, or will it persist or become worse over time? These are supposed to be invariant
metrics; we do not want them to differ between groups.
Hypothesis Tests
Z-test or t-test
P-value
Definition: If H0 is true, what's the probability of seeing an outcome (e.g., a t-statistic) at least
this extreme?
How to use: If the p-value is below your threshold of significance (typically 0.05), then you can
reject the null hypothesis.
Assumptions
Normality: When the sample size is big enough, by the central limit theorem (CLT), the
sampling distribution of the difference in the means between the two groups should be
normally distributed.
If the sample isn't large enough for the sampling distribution to be normal, consider a
non-parametric test or bootstrapping instead.
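A minimal two-proportion z-test sketch tying together the SE note and the p-value definition above; the conversion counts are made up, and the pooled-SE construction is the standard textbook form, not something this guide prescribes.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical results: conversions and sample sizes for control and treatment
x_c, n_c = 1_000, 20_000      # control: 5.0% conversion
x_t, n_t = 1_120, 20_000      # treatment: 5.6% conversion

p_c, p_t = x_c / n_c, x_t / n_t
p_pool = (x_c + x_t) / (n_c + n_t)

# Standard error of the difference in proportions under H0 (pooled),
# i.e., the SD of the sampling distribution -- not the sample SD.
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))

z = (p_t - p_c) / se
p_value = 2 * norm.sf(abs(z))             # two-sided p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# Reject H0 at the 0.05 level if p_value < 0.05.
```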
Statistical and Practical Significance
1. Statistically and practically significant: The result is both statistically significant (p < .05 and the 95% CI does not
contain 0) and practically significant, so we should obviously launch it. → Launch!
2. Not practically significant:
Scenario 1: The change is neither statistically significant (the 95% CI contains 0) nor practically significant (the 95% CI
lies entirely within the practical significance boundaries), so it is not worth launching. → The change does not do much.
Either decide to iterate or abandon this idea.
Scenario 2: Statistically significant (95% CI doesn’t contain 0) but not practically significant → if
implementing a new algorithm is costly, then it’s probably not worth launching; if the cost is low, then it
doesn’t hurt to launch.
3. Not statistically significant (possibly underpowered):
Scenario 1: The 95% CI contains 0, and the CI also extends beyond the practical significance boundary. → There is
not enough power to draw a strong conclusion and we do not have enough data to make any launch
decision. Run a follow-up test with more units, providing greater statistical power.
Scenario 2: Likely practically significant. Even though our best guess (i.e., the point estimate) is beyond
the practical significance boundary, it is also possible that there is no impact at all. → Repeat this
test but with greater power to gain more precision in the result.
Both scenarios suggest our experiment may be underpowered, so we should probably run new
experiments with more units if time and resources allow.
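The launch scenarios above can be summarized in a small helper function; this is only a sketch of the decision rules as described, with a hypothetical practical-significance boundary and the assumption that larger metric values are better.

```python
def launch_decision(ci_low, ci_high, practical_boundary):
    """Classify an experiment result from its 95% CI for the treatment effect.

    practical_boundary: the smallest effect worth launching (hypothetical value
    chosen by the team); assumes larger metric values are better.
    """
    statistically_sig = not (ci_low <= 0 <= ci_high)

    if statistically_sig and ci_low >= practical_boundary:
        return "Launch: statistically and practically significant"
    if statistically_sig and ci_high < practical_boundary:
        return "Statistically but not practically significant: launch only if cheap"
    if not statistically_sig and ci_high < practical_boundary:
        return "Neither statistically nor practically significant: iterate or abandon"
    return "Underpowered / inconclusive: rerun with more units"

# Hypothetical CIs (same units as the metric), practical boundary = 0.01
print(launch_decision(0.012, 0.030, 0.01))   # launch
print(launch_decision(-0.002, 0.025, 0.01))  # underpowered / inconclusive
```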
Common Problems and Pitfalls
1. Multiple success metrics (multiple hypotheses): when the significance level (false positive
probability) is 5% for each metric and there are N metrics, Pr(at least one metric is a false positive) =
1 − (1 − 0.05)^N, which is much greater than 5%.
Solution: group metrics into "expected to change", "not sure", and "not expected to change".
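To see how quickly false positives accumulate, here is a small sketch of the formula above, plus a Bonferroni adjustment as one common correction (Bonferroni is not mentioned in this guide; it is shown only as a standard option).

```python
# Family-wise error rate with N independent metrics, each tested at alpha:
# Pr(at least one false positive) = 1 - (1 - alpha)^N
alpha, n_metrics = 0.05, 10
fwer = 1 - (1 - alpha) ** n_metrics
print(f"FWER with {n_metrics} metrics at alpha={alpha}: {fwer:.2f}")   # ~0.40

# One common remedy: Bonferroni correction, i.e., test each metric at alpha / N
alpha_bonferroni = alpha / n_metrics
fwer_corrected = 1 - (1 - alpha_bonferroni) ** n_metrics
print(f"FWER with Bonferroni-adjusted alpha={alpha_bonferroni}: {fwer_corrected:.3f}")  # ~0.049
```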
2. Post-experiment result segmentation: multiple hypotheses are squeezed into one experiment, so there
is again a higher chance of false positive results. The overall result can also contradict the segmented
results (Simpson's paradox).
3. Results are not statistically significant
Causes
P-hacking: stopping the experiment earlier than the designed duration as soon as the observed
p-value drops below the threshold.
The experiment ran as designed but there are not enough randomization units.
High variance
Solutions
If the experiment is still running, we should run the experiment until enough units are
collected.
Clean the data to reduce variance: remove or cap outliers (e.g., capping/winsorizing), or use a log
transformation (but don't log-transform revenue!)
Use trigger analysis, i.e., only include impacted units (e.g., the conversion rate may be 0.5% when
you include users from the top of the funnel but 50% right before the change). The caveat
is that when generalizing to all users, the true effect could be anywhere between 0 and the observed
effect.
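A small sketch of the outlier-capping idea from the variance-reduction bullet; the synthetic revenue distribution and the 99th-percentile cap are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical heavy-tailed metric (e.g., revenue per user) with a few extreme values
revenue = rng.lognormal(mean=1.0, sigma=1.0, size=10_000)

# Cap (winsorize) at the 99th percentile to reduce variance from extreme values
cap = np.percentile(revenue, 99)
revenue_capped = np.minimum(revenue, cap)

print(f"variance before capping: {revenue.var():.2f}")
print(f"variance after capping:  {revenue_capped.var():.2f}")
# Lower variance -> smaller standard errors -> more power for the same sample size.
```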
4. Results change over time
Causes
Seasonality
Market change
Solutions
Long-term monitoring
5. Network Effects
Use isolation methods. Ensure little or no spillover between the control and treatment units
Cluster-based randomization
Randomize based on groups of people who are more likely to interact with fellow group
members, rather than outsiders
Geo-based randomization
Time-based randomization
Select a random time and place all users in either control or treatment groups for a short
period of time.
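A minimal sketch of cluster-based randomization as described above: whole clusters (e.g., geos or social groups) are assigned to a variant so that users who interact see the same experience. The hashing scheme, salt, and cluster IDs are illustrative assumptions.

```python
import hashlib

def assign_cluster(cluster_id: str, salt: str = "experiment_42") -> str:
    """Deterministically assign an entire cluster to control or treatment.

    Hash-based bucketing keeps assignments stable across sessions; the salt
    (a hypothetical experiment name) keeps different experiments independent.
    """
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < 50 else "control"

# Every user inherits the assignment of their cluster, so users who interact
# with each other (same cluster) see the same variant -> less spillover.
users = {"u1": "geo_NYC", "u2": "geo_NYC", "u3": "geo_SF"}
for user, cluster in users.items():
    print(user, cluster, assign_cluster(cluster))
```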
Alternatives to A/B Tests
Qualitative Analysis
Conduct user experience research: great for generating hypotheses; terrible for scaling.
Focus groups: A bit more scalable but users may fall into groupthink
Human evaluation: having human raters rate results or label data is useful for debugging, but raters'
judgments may differ from those of actual users.
Quantitative Analysis
Conduct retrospective analysis by analyzing users’ activity logs: Use historical data to understand
baselines, metric distributions, form hypotheses, etc.
Causal inference: interrupted time series (the same group goes through control and treatment over time),
interleaved experiments (results from two rankers are de-duplicated and mixed together), regression.
These methods require making many assumptions, and incorrect assumptions can lead to a lack of validity.
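As one illustration of these methods, here is a minimal interrupted time series sketch using segmented regression in statsmodels; the synthetic data, intervention day, and model form (linear trend plus a post-intervention level shift) are all assumptions for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Synthetic daily metric: 60 days pre-intervention, 60 days post, with a level shift of +5
days = np.arange(120)
post = (days >= 60).astype(float)          # 1 after the change ships to everyone
metric = 100 + 0.1 * days + 5 * post + rng.normal(0, 2, size=days.size)

# Segmented regression: metric ~ intercept + trend + post-intervention level shift
X = sm.add_constant(np.column_stack([days, post]))
model = sm.OLS(metric, X).fit()

print(model.params)        # [intercept, trend, level shift]; level shift should be ~5
print(model.conf_int()[2]) # 95% CI for the estimated level shift
```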