A/B Testing Cheat Sheet
This comprehensive guide serves as a quick reference for various concepts, steps, and techniques in A/B
tests. With this guide, you will be equipped with the knowledge and tools necessary to answer interview
questions related to A/B testing.
Table of Contents
The Basics of A/B Tests
Selecting Metrics for Experimentation (video)
Selecting Randomization Units (video)
General Considerations
Different Choices of Randomization Units
Randomization Unit vs. Unit of Analysis
Choosing a Target Population
Computing Sample Size (video)
Determine Test Duration
Analyzing Results (video)
Sanity Checks
Hypothesis Tests (video)
Statistical and Practical Significance
Common Problems and Pitfalls (video)
Alternatives to A/B Tests
Selecting Metrics for Experimentation
Goal Metrics (Success Metrics)
A single metric, or a very small set of metrics, that captures the ultimate success you are striving towards.
Stable: it should not be necessary to update goal metrics every time you launch a new feature.
Driver Metrics
Reflect hypotheses on the drivers of success and indicate whether we are moving in the right direction to
move the goal metrics.
Actionable
Resistant to gaming
User funnel
As a rough rule of thumb, limit yourself to about 5 key metrics (success and driver metrics). When dealing
with a lot of metrics, an OEC (Overall Evaluation Criterion), a combination of multiple key metrics,
can be used. Devising an OEC makes the tradeoffs explicit and makes the exact definition of
success clear. The OEC can be a weighted sum of normalized metrics (each normalized to a
predefined range, say 0-1), as in the sketch below.
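To make the weighted-sum OEC concrete, here is a minimal Python sketch; the metric names, normalization ranges, and weights are hypothetical examples, not recommendations from this guide.

```python
# Toy OEC: weighted sum of key metrics, each normalized to [0, 1].
# Metric names, ranges, and weights below are hypothetical examples.

def normalize(value, lo, hi):
    """Min-max normalize a metric to [0, 1] given a predefined range."""
    return max(0.0, min(1.0, (value - lo) / (hi - lo)))

def oec(metrics, ranges, weights):
    """Weighted sum of normalized metrics."""
    return sum(weights[m] * normalize(v, *ranges[m]) for m, v in metrics.items())

metrics = {"conversion_rate": 0.031, "revenue_per_user": 4.2, "dau_ratio": 0.55}
ranges  = {"conversion_rate": (0.0, 0.05), "revenue_per_user": (0.0, 10.0), "dau_ratio": (0.0, 1.0)}
weights = {"conversion_rate": 0.5, "revenue_per_user": 0.3, "dau_ratio": 0.2}

print(round(oec(metrics, ranges, weights), 3))  # single number summarizing success
```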
Organizational Guardrails
Ensure we move towards success with the right balance and without violating important
constraints
E.g., website/app performance: latency (wait time for pages to load), error logs (number of
error messages), client crashes (number of crashes per user); business goals: revenue
Trust-related guardrails
E.g., the Sample Ratio Mismatch (SRM) guardrail, and checking that the cache hit ratio is the same
between Control and Treatment.
Selecting Randomization Units
General Considerations
1. Visibility of the change
For changes visible to users, we should use a user ID or a cookie as the randomization unit.
For changes invisible to users, e.g., a change in latency, it depends on what we want to measure. A
user ID or a cookie is still a good option if we want to see what happens over time.
2. Variability
If the randomization unit is the same as the unit of analysis, the empirically computed variability is
similar to the analytically computed variability.
If the randomization unit is coarser than the unit of analysis, e.g., the randomization unit is the user
and we wish to analyze the click-through rate (the unit of analysis is a page view), the variability of
the metric will be much higher. This is because the independence assumption is invalid: we are
dividing groups of correlated units, which increases the variability (see the simulation sketch after this list).
3. Ethical considerations
May face security and confidentiality issues when using identifiable randomization units.
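To illustrate the variability point from consideration 2 above, here is a small simulation sketch (all rates and sizes are made up): when page views are correlated within users and randomization is by user, a standard error computed as if page views were independent understates the true run-to-run variability.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, views_per_user = 2_000, 10

def experiment_ctr():
    # Heterogeneous click propensities across users (mean CTR ~ 0.10):
    # page views within a user are correlated because they share the user's rate.
    user_rates = rng.beta(0.5, 4.5, size=n_users)
    clicks = rng.binomial(views_per_user, user_rates)
    return clicks.sum() / (n_users * views_per_user)   # page-view-level CTR

# Naive SE that (wrongly) treats every page view as independent
ctr = experiment_ctr()
naive_se = np.sqrt(ctr * (1 - ctr) / (n_users * views_per_user))

# Empirical variability of the CTR across many user-randomized experiments
empirical_se = np.std([experiment_ctr() for _ in range(500)])

print(f"naive SE: {naive_se:.5f}   empirical SE: {empirical_se:.5f}")
# The empirical SE is noticeably larger, so page-view-level formulas understate the variance.
```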
Different Choices of Randomization Units
User ID-Based: Every signed-in user is a randomization unit.
Pros: It allows for long-term measurements, such as user retention and users' learning effect.
Cons: Identifiable.
Cookie-Based: Every cookie (anonymous ID) is a randomization unit.
Pros: Anonymous.
Session-Based (or Page View-Based): Every user session is a randomization unit. A session starts
when a user logs in and ends when a user logs out or after 30 min of inactivity.
Pros: Finer level of granularity creates more units, and the test will have more power to detect
smaller changes.
Cons: May lead to inconsistent user experience, so it’s appropriate when changes are not visible
to the user
IP-Based: Every IP address is a randomization unit. Every device on a network is assigned an
IP address.
Pros: May be the only option for certain experiments, e.g., testing latency using one hosting service
versus another
Cons: Changes when users change places, creating an inconsistent experience. Many users may
share the same IP address. Therefore, not recommended unless it’s the only option.
Randomization Unit vs. Unit of Analysis
The general recommendation is that the randomization unit be the same as (or coarser than) the
unit of analysis.
e.g., the randomization unit is the user and we analyze the click-through rate (the unit of analysis
is a page view).
The caveat is that in this case, we need to pay attention to the variability of the unit of analysis as
explained earlier.
It does not work if the randomization unit is finer than the unit of analysis.
e.g., the randomization unit is a page view and we analyze user-level metrics.
This is because the user’s experience is likely to include a mix of variants (i.e., some in Control
and some in Treatment), and computing user-level metrics will not be meaningful.
Choosing a Target Population
Consider geographic region, platform (mobile vs. tablet vs. laptop), device type, user demographics
(age, gender, country, etc.), usage or engagement level (analyze the user journey), etc.
Be careful if you select users based on usage and your treatment affects usage. This violates
the stable unit treatment value assumption.
Computing Sample Size
α (Type I error): rejecting the null hypothesis when it is actually true, i.e., incorrectly rejecting the
null hypothesis (a false positive).
Significance level: the probability that we reject H0 even when the treatment has no effect, i.e., the
probability of committing a Type I error (α).
β (Type II error): failing to reject the null hypothesis when the alternative hypothesis is true, i.e.,
incorrectly accepting the null hypothesis (a false negative).
Statistical power (1 − β): the probability that we reject H0 when the treatment indeed has an effect. This
measures how sensitive the experiment is. If power is too low, we can't detect true effects; if it is set
unrealistically high (e.g., 0.99), we may never finish the experiment.
Variances:
Because the samples are independent, Var(Δ) = Var(Ȳ_t) + Var(Ȳ_c), where Δ = Ȳ_t − Ȳ_c is the difference
between the Treatment average and the Control average. Variances are often estimated either
from historical data or from A/A tests.
Determine Test Duration
Test Duration = Sample Size / Randomization Units per Day
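A minimal sketch of a standard sample-size calculation feeding the duration rule above. The baseline rate, minimum detectable effect, and daily traffic are hypothetical, and the per-group formula n = (z_(1−α/2) + z_(1−β))² · (σ_t² + σ_c²) / δ² is the common textbook version rather than one prescribed by this guide.

```python
from scipy.stats import norm

# Hypothetical inputs
alpha, power = 0.05, 0.80
p_baseline = 0.10          # control conversion rate (e.g., from historical data)
mde = 0.01                 # minimum detectable effect (absolute), i.e., delta
units_per_day = 5_000      # randomization units entering the experiment per day

# Variance of a Bernoulli metric in each group
var_c = p_baseline * (1 - p_baseline)
var_t = (p_baseline + mde) * (1 - p_baseline - mde)

z_alpha = norm.ppf(1 - alpha / 2)
z_beta = norm.ppf(power)

# Per-group sample size: n = (z_{1-a/2} + z_{1-b})^2 * (var_t + var_c) / delta^2
n_per_group = (z_alpha + z_beta) ** 2 * (var_t + var_c) / mde ** 2

total_n = 2 * n_per_group
duration_days = total_n / units_per_day   # Test Duration = Sample Size / Units per Day

print(f"n per group ~ {n_per_group:,.0f}, total ~ {total_n:,.0f}, duration ~ {duration_days:.1f} days")
```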
Ramp-up plan:
Mitigate risk (0-5%): start with team members, company employees, loyal users, etc., in case of bugs
or other risks; these people tend to be more forgiving.
Long-term holdout (optional): be aware of opportunity costs and ethics, because held-out users won't
enjoy new features for a while.
Analyzing Results
Sanity Checks
Sample Ratio Mismatch (SRM): for the study population, we want 50% in the treatment and
50% in the control. If our study population was 1,000 with 800 in the treatment and 200 in the
control, we have a sample ratio mismatch, which indicates a problem with the assignment mechanism.
Normality test: when the sample size is big enough, by the central limit theorem (CLT), the sampling
distribution of μ_t − μ_c should be normally distributed.
Organizational guardrail metrics are used to ensure that the organization's performance follows the
standard we expect.
Website/App performance
Business goals
Engagement: e.g., time spent per user, daily active users (DAU), and page views per user.
Z-test or T-test: Both tests can be used to compare proportions or group means and test for
significant differences between them.
Note: in the test-statistic computation, the "standard deviation" is the standard deviation of the
sampling distribution of the proportion, i.e., the standard error (SE). SE should be used in the
computation instead of SD.
Chi-Squared Test
Example: checking for SRM. Using the chi-squared test as a goodness-of-fit test (see the "fairness
of dice" example on the Wikipedia page) is analogous to testing whether the treatment/control
assignment mechanism is a fair game (it should be 50/50).
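As a concrete version of the SRM example above, here is a minimal chi-squared check in Python (assuming scipy is available); the 800/200 counts come from the sanity-check example.

```python
from scipy.stats import chisquare

# Observed assignment counts vs. the expected 50/50 split (from the SRM example above)
observed = [800, 200]                     # treatment, control
total = sum(observed)
expected = [total * 0.5, total * 0.5]

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {stat:.1f}, p = {p_value:.3g}")

# A tiny p-value means the 50/50 assignment assumption is violated (an SRM),
# so fix the assignment mechanism before analyzing any metrics.
```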
These failures should be a priority concern before moving on to analyzing the data. Is this a
one-time issue, or will it persist or become worse over time? These are supposed to be invariant
metrics; we do not want them to differ between groups.
Hypothesis Tests
Z-test or t-test
P-value
Definition: If H0 is true, what's the probability of seeing an outcome (e.g., a t-statistic) at least
this extreme?
How to use: If the p-value is below your threshold of significance (typically 0.05), then you can
reject the null hypothesis.
Assumptions
Normality: When the sample size is big enough, by the central limit theorem (CLT), the
sampling distribution of the difference in the means between the two groups should be
normally distributed.
If the sample isn't large enough for the sampling distribution to be normal, consider a
non-parametric test or bootstrapping instead.
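A minimal two-proportion z-test sketch tying together the SE note and the p-value definition above; the conversion counts are made up, and the pooled-SE construction is the standard textbook form, not something this guide prescribes.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical results: conversions and sample sizes for control and treatment
x_c, n_c = 1_000, 20_000      # control: 5.0% conversion
x_t, n_t = 1_120, 20_000      # treatment: 5.6% conversion

p_c, p_t = x_c / n_c, x_t / n_t
p_pool = (x_c + x_t) / (n_c + n_t)

# Standard error of the difference in proportions under H0 (pooled),
# i.e., the SD of the sampling distribution -- not the sample SD.
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))

z = (p_t - p_c) / se
p_value = 2 * norm.sf(abs(z))             # two-sided p-value

print(f"z = {z:.2f}, p-value = {p_value:.4f}")
# Reject H0 at the 0.05 level if p_value < 0.05.
```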
Statistical and Practical Significance
1. Statistically and practically significant: The result is both statistically significant (p < .05 and the 95% CI does not
contain 0) and practically significant, so we should obviously launch it. → Launch!
2. Not practically significant:
Scenario 1: The change is neither statistically significant (the 95% CI contains 0) nor practically significant (the 95% CI
lies entirely within the practical significance boundaries), so it is not worth launching. → The change does not do much.
Either decide to iterate or abandon this idea.
Scenario 2: Statistically significant (95% CI doesn’t contain 0) but not practically significant → if
implementing a new algorithm is costly, then it’s probably not worth launching; if the cost is low, then it
doesn’t hurt to launch.
3. Not statistically significant (possibly underpowered):
Scenario 1: The 95% CI contains 0, and the CI also extends beyond the practical significance boundary. → There is
not enough power to draw a strong conclusion and we do not have enough data to make any launch
decision. Run a follow-up test with more units, providing greater statistical power.
Scenario 2: Likely practically significant. Even though our best guess (i.e., the point estimate) is beyond
the practical significance boundary, it is also possible that there is no impact at all. → Repeat this
test but with greater power to gain more precision in the result.
Both scenarios suggest our experiment may be underpowered, so we should probably run new
experiments with more units if time and resources allow.
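The launch scenarios above can be summarized in a small helper function; this is only a sketch of the decision rules as described, with a hypothetical practical-significance boundary and the assumption that larger metric values are better.

```python
def launch_decision(ci_low, ci_high, practical_boundary):
    """Classify an experiment result from its 95% CI for the treatment effect.

    practical_boundary: the smallest effect worth launching (hypothetical value
    chosen by the team); assumes larger metric values are better.
    """
    statistically_sig = not (ci_low <= 0 <= ci_high)

    if statistically_sig and ci_low >= practical_boundary:
        return "Launch: statistically and practically significant"
    if statistically_sig and ci_high < practical_boundary:
        return "Statistically but not practically significant: launch only if cheap"
    if not statistically_sig and ci_high < practical_boundary:
        return "Neither statistically nor practically significant: iterate or abandon"
    return "Underpowered / inconclusive: rerun with more units"

# Hypothetical CIs (same units as the metric), practical boundary = 0.01
print(launch_decision(0.012, 0.030, 0.01))   # launch
print(launch_decision(-0.002, 0.025, 0.01))  # underpowered / inconclusive
```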
Common Problems and Pitfalls
1. Multiple success metrics (multiple hypotheses): when the significance level (false positive
probability) is 5% for each metric and there are N metrics, Pr(at least one metric is a false positive) =
1 − (1 − 0.05)^N, which is much greater than 5%.
Solution: group metrics into "expected to change", "not sure", and "not expected to change".
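To see how quickly false positives accumulate, here is a small sketch of the formula above, plus a Bonferroni adjustment as one common correction (Bonferroni is not mentioned in this guide; it is shown only as a standard option).

```python
# Family-wise error rate with N independent metrics, each tested at alpha:
# Pr(at least one false positive) = 1 - (1 - alpha)^N
alpha, n_metrics = 0.05, 10
fwer = 1 - (1 - alpha) ** n_metrics
print(f"FWER with {n_metrics} metrics at alpha={alpha}: {fwer:.2f}")   # ~0.40

# One common remedy: Bonferroni correction, i.e., test each metric at alpha / N
alpha_bonferroni = alpha / n_metrics
fwer_corrected = 1 - (1 - alpha_bonferroni) ** n_metrics
print(f"FWER with Bonferroni-adjusted alpha={alpha_bonferroni}: {fwer_corrected:.3f}")  # ~0.049
```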
2. Post-experiment result segmentation: multiple hypotheses are squeezed into one experiment, so there
is again a higher chance of false positive results. The overall result can also contradict the segmented
results (Simpson's paradox).
3. Results are not statistically significant
Causes
P-hacking: stopping the experiment earlier than the designed duration as soon as the observed
p-value drops below the threshold.
The experiment ran as designed but there are not enough randomization units.
High variance
Solutions
If the experiment is still running, we should run the experiment until enough units are
collected.
Clean the data to reduce variance: remove or cap outliers (e.g., capping/winsorizing), or use a log
transformation (but don't log-transform revenue!)
Use trigger analysis, i.e., only include impacted units (e.g., the conversion rate may be 0.5% when
you include users from the top of the funnel but 50% right before the change). The caveat
is that when generalizing to all users, the true effect could be anywhere between 0 and the observed
effect.
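A small sketch of the outlier-capping idea from the variance-reduction bullet; the synthetic revenue distribution and the 99th-percentile cap are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical heavy-tailed metric (e.g., revenue per user) with a few extreme values
revenue = rng.lognormal(mean=1.0, sigma=1.0, size=10_000)

# Cap (winsorize) at the 99th percentile to reduce variance from extreme values
cap = np.percentile(revenue, 99)
revenue_capped = np.minimum(revenue, cap)

print(f"variance before capping: {revenue.var():.2f}")
print(f"variance after capping:  {revenue_capped.var():.2f}")
# Lower variance -> smaller standard errors -> more power for the same sample size.
```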
4. Results change over time
Causes
Seasonality
Market change
Solutions
Long-term monitoring
5. Network Effects
Use isolation methods. Ensure little or no spillover between the control and treatment units
Cluster-based randomization
Randomize based on groups of people who are more likely to interact with fellow group
members, rather than outsiders
Geo-based randomization
Time-based randomization
Select a random time and place all users in either control or treatment groups for a short
period of time.
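A minimal sketch of cluster-based randomization as described above: whole clusters (e.g., geos or social groups) are assigned to a variant so that users who interact see the same experience. The hashing scheme, salt, and cluster IDs are illustrative assumptions.

```python
import hashlib

def assign_cluster(cluster_id: str, salt: str = "experiment_42") -> str:
    """Deterministically assign an entire cluster to control or treatment.

    Hash-based bucketing keeps assignments stable across sessions; the salt
    (a hypothetical experiment name) keeps different experiments independent.
    """
    digest = hashlib.sha256(f"{salt}:{cluster_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < 50 else "control"

# Every user inherits the assignment of their cluster, so users who interact
# with each other (same cluster) see the same variant -> less spillover.
users = {"u1": "geo_NYC", "u2": "geo_NYC", "u3": "geo_SF"}
for user, cluster in users.items():
    print(user, cluster, assign_cluster(cluster))
```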
Alternatives to A/B Tests
Qualitative Analysis
Conduct user experience research: great for generating hypotheses; terrible for scaling.
Focus groups: A bit more scalable but users may fall into groupthink
Human evaluation: having human raters rate results or label data is useful for debugging, but raters'
judgments may differ from those of actual users.
Quantitative Analysis
Conduct retrospective analysis by analyzing users’ activity logs: Use historical data to understand
baselines, metric distributions, form hypotheses, etc.
Causal inference: interrupted time series (the same group goes through control and treatment over time),
interleaved experiments (results from two rankers are de-duplicated and mixed together), regression.
These methods require making many assumptions, and incorrect assumptions can lead to a lack of validity.
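As one illustration of these methods, here is a minimal interrupted time series sketch using segmented regression in statsmodels; the synthetic data, intervention day, and model form (linear trend plus a post-intervention level shift) are all assumptions for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)

# Synthetic daily metric: 60 days pre-intervention, 60 days post, with a level shift of +5
days = np.arange(120)
post = (days >= 60).astype(float)          # 1 after the change ships to everyone
metric = 100 + 0.1 * days + 5 * post + rng.normal(0, 2, size=days.size)

# Segmented regression: metric ~ intercept + trend + post-intervention level shift
X = sm.add_constant(np.column_stack([days, post]))
model = sm.OLS(metric, X).fit()

print(model.params)        # [intercept, trend, level shift]; level shift should be ~5
print(model.conf_int()[2]) # 95% CI for the estimated level shift
```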