AB Testing
[1] https://www.statology.org/random-selection-vs-random-assignment/
checkout['gender'].value_counts(normalize=True)
F    0.507556
M    0.492444
Name: gender, dtype: float64
sample_df = checkout.sample(n=3000)
sample_df['gender'].value_counts(normalize=True)
M    0.506333
F    0.493667
Name: gender, dtype: float64
checkout.groupby('checkout_page')['gender'].value_counts(normalize=True)
checkout_page  gender
A              M         0.505000
               F         0.495000
B              F         0.507333
               M         0.492667
C              F         0.520333
               M         0.479667
Name: gender, dtype: float64
[1] https://jamanetwork.com/journals/jama/article-abstract/392650
[1] Kohavi, Ron, Tang, Diane & Xu, Ya. Trustworthy Online Controlled Experiments. Cambridge University Press.
# Create pairplots
import seaborn as sns

sns.pairplot(admissions[['Serial No.',\
                         'GRE Score', 'Chance of Admit']])
0.8026104595903503
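The value above is presumably the Pearson correlation between 'GRE Score' and 'Chance of Admit'. A minimal sketch of computing such a correlation with pandas, using synthetic stand-in data since the admissions file itself isn't shown:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the admissions data: a noisy linear relationship
rng = np.random.default_rng(0)
gre = rng.uniform(290, 340, size=500)
chance = 0.02 * gre - 5.5 + rng.normal(0, 0.05, size=500)
admissions = pd.DataFrame({'GRE Score': gre, 'Chance of Admit': chance})

# Pearson correlation between the two columns
r = admissions['GRE Score'].corr(admissions['Chance of Admit'])
print(r)  # close to 1: a strong positive linear relationship
```

A value near 0.80, like the one printed above, indicates a strong positive linear relationship between GRE score and admission chance.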
Primary (goal/north-star) metrics:
    Best describe the success of the business or mission
Granular metrics:
    Best explain users' behavior
Proportions:
    Signup rate: signups / total visitors
    Page abandonment rate: page abandoners / total visitors
Ratios:
    Click-through rate (CTR): clicks / page visits, or clicks / ad impressions
Non-gameable:
    A metric such as time on page can be gamed (e.g., bright colors that hold attention) without any real improvement
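The difference between a proportion and a ratio can be sketched with hypothetical event counts (all numbers below are made up for illustration):

```python
# Hypothetical aggregate counts illustrating proportion vs. ratio metrics
total_visitors = 10_000
signups = 480
page_abandoners = 3_200
clicks = 1_500
page_visits = 25_000   # one visitor can generate many page visits

# Proportions: numerator users are a subset of the denominator users
signup_rate = signups / total_visitors
abandonment_rate = page_abandoners / total_visitors

# Ratio: numerator and denominator count different event types
ctr = clicks / page_visits

print(signup_rate, abandonment_rate, ctr)  # 0.048 0.32 0.06
```

The key distinction: a proportion is bounded by 1 because each user is counted at most once, while a ratio relates two different event counts and need not be bounded.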
checkout.groupby('gender')['purchased'].mean()
gender
F    0.908056
M    0.780009
Name: purchased, dtype: float64
checkout[(checkout['browser']=='chrome')|(checkout['browser']=='safari')]\
    .groupby('gender')['order_value'].mean()
gender
F    29.814161
M    30.383431
Name: order_value, dtype: float64
         order_value  purchased
browser
chrome     30.016625   0.839088
firefox    29.887491   0.851725
safari     30.119808   0.844337
As measured by metric(s) M
Example of the alternative hypothesis:
Based on user experience research, we believe that if we update our checkout page design, then the percentage of purchasing customers will increase.
P-value
Probability of obtaining a result at least as extreme as the one observed, assuming the Null hypothesis is true.
If p-value < α: Reject the Null hypothesis
If p-value ≥ α: Fail to reject the Null hypothesis
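The decision rule above can be sketched in a few lines (the p-value here is a hypothetical test result):

```python
# Decision rule: compare a p-value against the significance level alpha
alpha = 0.05        # pre-specified significance level
p_value = 0.0058    # hypothetical test result

if p_value < alpha:
    decision = 'Reject Null hypothesis'
else:
    decision = 'Fail to reject Null hypothesis'

print(decision)  # Reject Null hypothesis
```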
std_A = checkout[checkout['checkout_page']=='A']['order_value'].std()
print(std_A)
2.418
Set the adjusted α* to the individual test α divided by the number of tests m
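The Bonferroni correction described above can be sketched as follows (the p-values are hypothetical):

```python
# Bonferroni correction: divide the per-test alpha by the number of tests m
alpha = 0.05
m = 3                      # e.g., pairwise comparisons A vs B, A vs C, B vs C
alpha_star = alpha / m
print(alpha_star)          # 0.016666666666666666

# Equivalently, a test is significant only if its p-value < alpha / m
p_values = [0.001, 0.04, 0.2]
significant = [p < alpha_star for p in p_values]
print(significant)         # [True, False, False]
```

Note that 0.04 would be significant at α = 0.05 on its own, but not after correcting for three tests.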
checkout.groupby('checkout_page')['order_value'].agg(['mean','std','count'])
experiment
control 4071
exposed 4006
Power_divergenceResult(statistic=0.5230902562832735, pvalue=0.4695264353014863)
SRM likely not present
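A result like the Power_divergenceResult above comes from a chi-square goodness-of-fit test on the assignment counts. Assuming the intended allocation was an even 50/50 split, scipy reproduces it:

```python
from scipy.stats import chisquare

# Observed assignment counts per experiment group
observed = [4071, 4006]   # control, exposed

# chisquare defaults to equal expected counts across groups
result = chisquare(observed)
print(result)

# A large p-value means no evidence of sample ratio mismatch (SRM)
alpha = 0.05
print('SRM likely not present' if result.pvalue > alpha else 'SRM likely present')
```

Here p ≈ 0.47 is well above 0.05, so the small imbalance between 4071 and 4006 is consistent with chance.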
[1] Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for Practitioners.
print(simp_imbalanced.groupby('Variant').mean())
         Conversion
Variant
A              0.80
B              0.64
print(simp_imbalanced.groupby(['Variant','Device']).count())
                Conversion
Variant Device
A       Phone           40
        Tablet          10
B       Phone           10
        Tablet          40
print(simp_balanced.groupby('Variant').mean())
         Conversion
Variant
A              0.70
B              0.52
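Simpson's paradox in its fullest form is when the overall comparison actually reverses within every segment. A self-contained sketch with hypothetical numbers (not the course's data) where B wins overall yet A wins on every device:

```python
import pandas as pd

# Hypothetical user-level data: B wins overall, yet A wins within each device
rows = []
for variant, device, n, conversions in [
        ('A', 'Phone',   87,  81),   # A on Phone:  ~93% convert
        ('A', 'Tablet', 263, 192),   # A on Tablet: ~73% convert
        ('B', 'Phone',  270, 234),   # B on Phone:  ~87% convert
        ('B', 'Tablet',  80,  55),   # B on Tablet: ~69% convert
]:
    rows += [{'Variant': variant, 'Device': device,
              'Conversion': int(i < conversions)} for i in range(n)]
df = pd.DataFrame(rows)

overall = df.groupby('Variant')['Conversion'].mean()
by_device = df.groupby(['Variant', 'Device'])['Conversion'].mean()
print(overall)     # B > A overall
print(by_device)   # A > B on both Phone and Tablet
```

The reversal happens because A's traffic is dominated by the harder Tablet segment while B's is dominated by the easier Phone segment, which is why checking segment balance (as in the tables above) matters.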
Change aversion: users avoid trying a new feature due to familiarity with the old one.
Confidence intervals
A 95% CI is produced by a procedure that captures the true difference in 95% of repeated experiments.
p-value: 0.0058
Group A 95% CI : [0.8072, 0.8349]
Group B 95% CI : [0.8349, 0.8608]
0.847  (true purchase rate)
Simulated 95% CIs from repeated samples (* marks intervals that do not contain 0.847):
(0.7912669777384846, 0.9087330222615153)
(0.8385342148455946, 0.9414657851544054)
(0.8265485838585659, 0.9334514161414341)
(0.7568067872454262, 0.8831932127545737)
(0.8506543911914558, 0.9493456088085442)*
(0.8385342148455946, 0.9414657851544054)
(0.7230037568938057, 0.8569962431061944)
(0.8146830076144598, 0.9253169923855402)
(0.8029257122801267, 0.9170742877198733)
(0.8146830076144598, 0.9253169923855402)
(0.8506543911914558, 0.9493456088085442)*
(0.7454722433688197, 0.8745277566311804)
...
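Each interval in the list above is a 95% CI for the purchase rate computed from one simulated sample. A normal-approximation CI for a proportion can be sketched as follows (the counts are hypothetical):

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation 95% CI for a proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

# Hypothetical sample: 127 purchases out of 150 users
lo, hi = proportion_ci(127, 150)
print((lo, hi))
```

Rerunning this on fresh samples yields a different interval each time; by construction, about 5% of them will miss the true rate, which is what the starred intervals illustrate.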
checkout.groupby('checkout_page')['time_on_page'].mean()
checkout_page
A 44.668527
B 42.723772
C 42.223772
Calculate the metrics per variant
Analyze the difference using t-test
import pingouin

ttest = pingouin.ttest(x=checkout[checkout['checkout_page']=='C']['time_on_page'],
                       y=checkout[checkout['checkout_page']=='B']['time_on_page'],
                       paired=False,
                       alternative="two-sided")
print(ttest)
2. Independence: each observation/data point is independent. Not accounting for dependencies inflates error rates.
3. Normality: the data are approximately normally distributed.
Unpaired data
                    mean  count
checkout_page
A              44.668527   3000
B              42.723772   3000
C              42.223772   3000
Null: There is no significant difference in signup rates between landing page designs C and D
# Calculate p-value
from scipy import stats
print('p-value=', stats.chi2_contingency(table, correction=False)[1])
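The contingency `table` passed to the test isn't shown above; a self-contained sketch with hypothetical signup counts for landing pages C and D:

```python
from scipy import stats

# Hypothetical contingency table:
# rows = landing pages C and D, columns = [signed up, did not sign up]
table = [[320, 2680],
         [370, 2630]]

# correction=False disables Yates' continuity correction
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
print('p-value=', p)
```

With these made-up counts the signup rates are 10.7% vs 12.3%, and the resulting p-value falls just below 0.05, so the null of equal signup rates would be rejected.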
Unit of analysis:
The entity being analyzed in an A/B test
               order_value                     purchased
                      mean           sum count      mean     sum count
checkout_page
A                24.956437  61417.791564  2461  0.820333  2461.0  3000
B                29.876202  75915.430125  2541  0.847000  2541.0  3000
C                34.917589  90890.484142  2603  0.867667  2603.0  3000
checkout.groupby('checkout_page')['order_value'].sum()/\
checkout.groupby('checkout_page')['purchased'].count()
checkout_page
A    20.472597
B    25.305143
C    30.296828
dtype: float64
[1] Budylin, Roman, Drutsa, Alexey, Katsev, Ilya & Tsoy, Valeriya (2018). Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments. 55-63. 10.1145/3159652.3159699.
{'mean_control': 20.472597188012,
'mean_treatment': 25.30514337484097,
'difference': 4.833,
'diff_CI': '[4.257, 5.408]',
'p-value': 5.954978880467735e-61}
Avoid making decisions by peeking at the results before reaching the designed sample size, as this inflates error rates, similar to multiple comparisons.
Users may behave differently on weekends versus weekdays, so experiments should run for whole weeks to capture overall behavior.
Isolation
Change one variable at a time to attribute impact.