AB Test Notes

AB testing notes to crack data scientist/analyst interviews

A/B Test - Definition

In an A/B test the experimenter sets up two experiences: “A,” the control, is usually the current
system and considered the “champion,” and “B,” the treatment, is a modification that attempts
to improve something—the “challenger.”

All elements are the same except for one variable

Determine sample size (key parameters)


Type I error, also known as a “false positive”: rejecting the null hypothesis when it is actually true. A larger value means a less reliable test. In A/B testing, this is concluding there is a difference when in fact there is none.
Type II error, also known as a “false negative”: failing to reject the null hypothesis when it is actually false. A larger value means a less reliable test. In A/B testing, this is failing to detect a difference that actually exists.

Sensitivity (true positive rate): the ability of a test to correctly identify patients with a disease.
Specificity (true negative rate): the ability of a test to correctly identify people without the
disease.
Beta: the probability of a false negative (Type II error).
Statistical power (true positive rate): the probability that the test correctly rejects the null
hypothesis when the alternative is true; power = 1 − beta.
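A minimal simulation sketch of these definitions (the effect size, variance, sample size, and number of simulations are all illustrative assumptions): power is the fraction of repeated experiments in which the test rejects the null when a true difference exists, and alpha caps the false positive rate when there is none.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulated_power(delta=2.0, sigma=10.0, n=400, alpha=0.05, n_sims=2000):
    """Estimate power: P(reject H0 | true effect = delta). All inputs are illustrative."""
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, sigma, n)       # A: no change
        treatment = rng.normal(delta, sigma, n)   # B: true lift = delta
        _, p_value = stats.ttest_ind(treatment, control)
        if p_value < alpha:                       # significance threshold (Type I control)
            rejections += 1
    return rejections / n_sims

print(simulated_power())  # roughly 0.8 for these illustrative numbers
```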

Multiple testing problem


E.g., the false positive rate is 5% and there are 3 treatment groups. What is the chance of at least one false
positive?

P(no false positive) = (1 − 0.05)^3 = 0.95^3 ≈ 0.857


P(at least 1 false positive) = 1 − P(no false positive) ≈ 0.143
The family-wise Type I error rate is over 14%.

Need to lower the per-test significance threshold to deal with the multiple testing problem.

Bonferroni correction:
Per-test threshold = significance level / number of tests
E.g., significance level 0.05 with 10 tests: 0.05/10 = 0.005
This method is often too conservative.
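A small sketch (using the illustrative numbers above) of the family-wise error rate computation and the Bonferroni-corrected threshold:

```python
alpha = 0.05      # per-test false positive rate
m = 3             # number of independent comparisons

p_no_false_positive = (1 - alpha) ** m
p_at_least_one = 1 - p_no_false_positive
print(f"P(at least one false positive) = {p_at_least_one:.3f}")   # ~0.143

# Bonferroni correction: shrink the per-test threshold so the
# family-wise error rate stays near the desired alpha.
m_tests = 10
bonferroni_threshold = alpha / m_tests
print(f"Bonferroni per-test threshold = {bonferroni_threshold}")   # 0.005
```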

Novelty Effect & Primacy Effect


Novelty Effect: users are curious about the change and use the feature more at first.
Primacy Effect: users are reluctant to change and initially use the feature less.

Neither effect lasts long; an A/B test shows a larger or smaller initial effect because of novelty or primacy.
If a test is very successful initially but the treatment effect declines quickly after a week, it is
likely due to the novelty effect.

How to deal with novelty and primacy effect:


- Rule out the possibility by running the test only on first-time users
o Not suitable for mature products or products with low traffic, because there are
few new users and we may not have enough randomization units.
o E.g., given a new button/link, will advertisers appeal rejected ads more?
Could choose users that have never used the ‘appeal’ function.
- If the test is already running, compare first-time users to existing users within the treatment
group
- Or run the test long enough for the effect to stabilize
- Take users that appear in the first day or two, and plot their treatment effect
over time (see the sketch below)
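A minimal sketch of that last idea on simulated data (all column names, group labels, and effect shapes are illustrative assumptions): track the daily treatment effect for the fixed cohort of users who entered in the first day or two; if it shrinks over time, novelty is a likely cause.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic per-user daily log; in practice this comes from the experiment's logging pipeline.
n_users, n_days = 2000, 14
users = pd.DataFrame({
    "user_id": np.arange(n_users),
    "group": rng.choice(["control", "treatment"], n_users),
    "first_seen_day": rng.integers(0, n_days, n_users),
})
log = users.loc[users.index.repeat(n_days)].copy()
log["day"] = np.tile(np.arange(n_days), n_users)
log = log[log["day"] >= log["first_seen_day"]]

# Simulated novelty: treatment lift starts at ~1 extra post/day and decays with exposure time.
lift = (log["group"] == "treatment") * np.exp(-0.3 * (log["day"] - log["first_seen_day"]))
log["posts"] = rng.poisson(3, len(log)) + lift

# Keep only the cohort that entered on day 0 or 1, then look at their daily effect.
cohort = log[log["first_seen_day"] <= 1]
daily = cohort.groupby(["day", "group"])["posts"].mean().unstack("group")
print((daily["treatment"] - daily["control"]).round(2))  # shrinking effect -> novelty
```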

Interference between variants


Typical design
- Split users randomly
- Users are independent

Cases when the independence assumption fails


- Social networks, e.g., FB, LinkedIn
- Two-sided market, e.g., Uber, Airbnb

Network effect
- User behaviors are influenced by others
- The effect can spill over into the control group
- The measured difference underestimates the treatment effect
E.g., we are testing a new feature intended to increase posts created per user. Users are assigned
randomly. The test won by 1% in terms of the number of posts. What will happen when the
new feature is launched to all users? Will the lift still be 1%?
Ans: The lift will be more than 1%. Suppose people in the treatment group post more
often. Their friends, who are in the control group, may also post more after seeing
more posts. So the detected difference between control and treatment is smaller than the true
effect.

(From Trustworthy Online Controlled Experiments)
E.g., LinkedIn launches a better “People You May Know” recommender for the treatment group. If the
primary metric is the number of invitations sent, invitations are likely to increase in both groups, so the
treatment effect (delta) is biased downward.

Network effect mitigation (see the sketch below)
- Create network clusters
o People interact mostly within their cluster
o Assign clusters randomly to the treatment and control groups
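A minimal sketch of cluster-based randomization, assuming the clusters have already been computed from the interaction graph (the cluster_of mapping below is a hypothetical input):

```python
import random

# Hypothetical input: user -> cluster label, produced by a graph clustering step.
cluster_of = {"u1": 0, "u2": 0, "u3": 1, "u4": 1, "u5": 2, "u6": 2}

# Randomize at the cluster level, not the user level, so that users who
# interact with each other mostly share the same variant.
clusters = sorted(set(cluster_of.values()))
random.seed(7)
assignment = {c: random.choice(["control", "treatment"]) for c in clusters}

# Every user inherits the assignment of their cluster.
user_variant = {u: assignment[c] for u, c in cluster_of.items()}
print(user_variant)
```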

Two-sided markets
- Resources are shared between the control and treatment groups
E.g., can coupons make people use Uber more?
Ans: The treatment group attracts more drivers, so fewer drivers are available for the control
group. The measured difference becomes larger than it should be: actual effect < measured treatment effect.

Two-sided market mitigation


- Geo-based randomization
o Split by geolocation
o E.g., New York vs. San Francisco
o High variance, since each market is unique
- Time-based randomization / switchback testing (see the sketch below)
o Split by day of the week
o Assign all users to either treatment or control at the same time
o Only works when the treatment effect materializes over a short time window (e.g., Uber surge pricing)
o Caveat: people behave differently on different days of the week
o Does not work when the treatment effect takes a long time to appear (e.g., referral programs
or coupons, which users may take a long time to redeem)
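A minimal sketch of switchback (time-based) randomization, assuming the unit of randomization is a (city, day) block; the city list and date range are illustrative:

```python
import random
from datetime import date, timedelta

cities = ["NYC", "SF"]                         # illustrative markets
start = date(2024, 1, 1)
days = [start + timedelta(days=i) for i in range(14)]

# Flip every (city, day) block between control and treatment so that, within a
# block, all users share the same variant and do not compete for shared
# resources (e.g., drivers).
random.seed(11)
schedule = {(city, d): random.choice(["control", "treatment"])
            for city in cities for d in days}

for (city, d), variant in sorted(schedule.items())[:5]:
    print(city, d, variant)
```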

A/B test steps


1. Prerequisites
a. Define key metrics (e.g., DAU, session time)
b. The change needs to be easy to make (cannot redesign the whole website)
c. Have enough randomization units (e.g., thousands of users)
2. Experiment design
a. Hypothesis
b. What population to select
i. A selected population (from the user funnel) vs. all users (fairly represented, reduces
contamination, reduces bias)
ii. May need to consider network effects and two-sided markets
E.g., email campaign: change the link position in the email.
User funnel: receive email – view email – click link in email – visit official website
– add to cart.
We would choose people who view the email.
c. Sample size (α, β, σ², δ); see the sketch after this list
i. Power of the test: 80% (power = 1 − β, where β is the probability of a false negative)
ii. Significance level: 5%
iii. Rule of thumb: n ≈ 2σ²(z_α/2 + z_β)² / δ² per group
iv. Assume we want to detect a $2/user boost in revenue (δ = $2)
v. Cluster sampling to reduce network effects; make treatment and
control as independent as possible
d. Duration of the experiment
i. Novelty and primacy effects
ii. Seasonality (if during holidays, may need to run longer)
iii. Day-of-week effect (people behave differently on different days of the week,
so run for at least a full week)
iv. Ramp up from a small proportion of traffic (e.g., 5% the first day, 10% the second
day, then 50% in each group going forward). This reduces the risk of a bad
launch and lets us adjust if anything goes wrong.
e. Create the variation
3. Running experiment
a. Collect data
4. Result and decision
a. Sanity check to make sure results are reliable.
i. Randomization
ii. User experiences are consistent in different groups
iii. Guardrail metrics (e.g., loading latency, bounce rate)
b. Consider trade-offs between different metrics (less ad spend? less time spent? maybe
more posts from friends and fewer ads). It is helpful to think about the product life cycle
and which metrics we want to focus on at this stage.
i. E.g., coupons improve user engagement, but may decrease total revenue.
c. Costs of launching a change
i. Implementation cost and opportunity cost
ii. Cost of engineering maintenance (more complex infrastructure)
5. Post-launch monitoring
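A minimal sketch of the rule-of-thumb sample size calculation in step 2c (the values of σ and δ are illustrative assumptions, e.g., detecting a $2/user revenue lift):

```python
import math
from scipy.stats import norm

def sample_size_per_group(sigma, delta, alpha=0.05, power=0.80):
    """Rule of thumb: n ≈ 2 * sigma^2 * (z_{alpha/2} + z_beta)^2 / delta^2 per group."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance level
    z_beta = norm.ppf(power)            # power = 1 - beta
    return math.ceil(2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2)

# Illustrative numbers: per-user revenue std dev of $20, want to detect a $2 lift.
print(sample_size_per_group(sigma=20, delta=2))   # ~1570 users per group
```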

If the confidence interval covers the null hypothesis value (diff = 0), the result is not


significant in terms of p-value, but it may still be significant in terms of practical results: e.g., if the
practical significance threshold is 2, the point estimate 2.45 >= 2 exceeds it even though the test is inconclusive.
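A tiny sketch of that decision logic; the point estimate and threshold come from the note above, while the confidence interval bounds are assumed for illustration:

```python
# Illustrative numbers; the CI bounds are assumptions chosen so the CI covers 0.
point_estimate = 2.45          # observed difference
ci_low, ci_high = -0.5, 5.4    # assumed 95% CI
practical_threshold = 2.0      # minimum effect worth launching

statistically_significant = not (ci_low <= 0 <= ci_high)
practically_significant = point_estimate >= practical_threshold

print("statistically significant:", statistically_significant)   # False
print("practically significant:", practically_significant)       # True: may be worth more data
```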

Concept Questions
1. What assumptions are made for t-test/hypothesis test?
The common assumptions made when doing a t-test include those regarding the scale of
measurement, random sampling, normality of the data distribution, adequacy of the sample size, and
equality (homogeneity) of variance.
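A minimal sketch, on simulated data, of checking the equal-variance assumption and falling back to Welch's t-test when it fails (the data and thresholds are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(10.0, 2.0, 500)     # simulated metric for group A
treatment = rng.normal(10.3, 3.0, 500)   # simulated metric for group B (unequal variance)

# Levene's test for equality of variances (one of the t-test assumptions).
_, p_levene = stats.levene(control, treatment)
equal_var = p_levene > 0.05

# Standard t-test if variances look equal, otherwise Welch's t-test.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=equal_var)
print(f"equal_var={equal_var}, t={t_stat:.2f}, p={p_value:.4f}")
```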

2. When is A/B testing a good idea? When is it a bad idea?


A/B testing is a good idea when the change (web design, pricing, service) is easy to make, there is meaningful
traffic (a big enough sample size), there is enough time to run the test, and there is an informed hypothesis.
Conversely, it is a bad idea when these conditions do not hold.

3. What is a null hypothesis?

A null hypothesis is a hypothesis used in statistics that proposes that there is no
difference between certain characteristics of a population.

4. What is statistical significance?

In statistical hypothesis testing, a result has statistical significance when it is very unlikely to
have occurred given the null hypothesis.
Statistical significance refers to the claim that a result from data generated by testing or
experimentation is not likely to occur randomly or by chance but is instead likely to be
attributable to a specific cause.

More on A/B test


Opportunity sizing
https://shopify.engineering/shopify-data-guide-opportunity-sizing

Confounders
