
What is A/B testing?

A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Intro to A/B testing
An A/B test is...

an experiment designed to determine which version of a design performs better

based on metric(s): signup rate, average sales per user, etc.

using random assignment and statistical analysis of the results

A/B TESTING IN PYTHON


To A/B test or not to test?
Good use of A/B testing:
Optimizing conversion rates
Releasing new app features
Evaluating incremental effects of ads
Assessing the impact of drug trials

Do not A/B test if:
No sufficient traffic/"small" sample size
No clear logical hypothesis
Ethical considerations
High opportunity cost

A/B TESTING IN PYTHON


A/B testing fundamental steps
1. Specify the goal and designs/experiences

2. Randomly sample users for enrollment


3. Randomly assign users to:
control variant: current state
treatment/test variant(s): new design

4. Log user actions and compute metrics


5. Test for statistically significant differences
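
A minimal sketch of steps 2 and 3, assuming a hypothetical pandas DataFrame of eligible users named users (not part of the course data):

import numpy as np
import pandas as pd

# Hypothetical pool of eligible users (assumption for illustration)
users = pd.DataFrame({'user_id': range(10000)})

# Step 2: randomly sample users for enrollment
enrolled = users.sample(n=6000, random_state=42).copy()

# Step 3: randomly assign each enrolled user to control or treatment
rng = np.random.default_rng(42)
enrolled['variant'] = rng.choice(['control', 'treatment'], size=len(enrolled))

# Check the resulting allocation ratio
print(enrolled['variant'].value_counts(normalize=True))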

A/B TESTING IN PYTHON


Value of randomization
Generalizability and representativeness

Minimizing bias between groups


Establishing causality by isolating treatment effect

1 https://www.statology.org/random-selection-vs-random-assignment/

A/B TESTING IN PYTHON


Python example of random assignment
checkout.info()

RangeIndex: 9000 entries, 0 to 8999


Data columns (total 6 columns):
# Column Non-Null Count Dtype
0 user_id 9000 non-null int64
1 checkout_page 9000 non-null object
2 order_value 7605 non-null float64
3 purchased 9000 non-null float64
4 gender 9000 non-null object
5 browser 9000 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 422.0+ KB

A/B TESTING IN PYTHON


Python example of random assignment
checkout['gender'].value_counts(normalize=True)

F 0.507556
M 0.492444
Name: gender, dtype: float64

sample_df = checkout.sample(n=3000)
sample_df['gender'].value_counts(normalize=True)

M 0.506333
F 0.493667
Name: gender, dtype: float64

A/B TESTING IN PYTHON


Python example of random assignment
checkout.groupby('checkout_page')['gender'].value_counts(normalize=True)

checkout_page gender
A M 0.505000
F 0.495000
B F 0.507333
M 0.492667
C F 0.520333
M 0.479667
Name: gender, dtype: float64

A/B TESTING IN PYTHON


Why run
experiments?
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
The value of A/B testing
Reduce uncertainty around the impact of new designs and features

Decision-making becomes scientific and evidence-based rather than intuition-driven


High return on investment: simple changes can lead to major wins

Continuous optimization at the mature stage of the business


Correlation does not imply causation

A/B TESTING IN PYTHON


Hierarchy of evidence

1 https://jamanetwork.com/journals/jama/article-abstract/392650

A/B TESTING IN PYTHON


Do error messages reduce churn?
Microsoft Office 365 spurious correlation example:1

Spurious correlation: a strong correlation that appears to be causal but is not.

1 Kohavi, Ron,Tang, Diane,Xu, Ya. Trustworthy Online Controlled Experiments. Cambridge University Press.

A/B TESTING IN PYTHON


Pearson's correlation coefficient
A score that measures the strength of a linear relationship between two variables.

r > 0: positive correlation

r = 0: no linear correlation

r < 0: negative correlation


Pearson's correlation coefficient (r) formula:
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √( Σ(xᵢ - x̄)² Σ(yᵢ - ȳ)² )

Assumes: normal distribution and linearity

A/B TESTING IN PYTHON


Correlations visual inspection
# Import visualization library seaborn
import seaborn as sns

# Create pairplots
sns.pairplot(admissions[['Serial No.',\
'GRE Score', 'Chance of Admit']])

A/B TESTING IN PYTHON


Pearson correlation heatmap
# Import visualization library seaborn
import seaborn as sns

# Print Pearson correlation coefficient


print(admissions['GRE Score']\
.corr(admissions['Chance of Admit']))

0.8026104595903503

# Plot correlations heatmap


sns.heatmap(admissions.corr(),annot=True)

A/B TESTING IN PYTHON


Metrics design and
estimation
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Types of metrics

Primary (goal/north-star) metrics:
Best describe the success of the business or mission

Granular metrics:
Best explain users' behavior
More sensitive and actionable

Example: signup rate = (clicks/visitors) x (signups/clicks), a goal metric decomposed into granular components

Instrumentation/guardrail metrics:
Outside the scope of this course

A/B TESTING IN PYTHON


Types of metrics
Quantitative categorization

Means/percentiles: average sales, median time on page

Proportions:
Signup rate: signups/total visitors
Page abandonment rate: page abandoners/total visitors
Ratios:
Click-through-rate(CTR): clicks/page visits or clicks/ad impressions

Revenue per session


Metrics can be combined to form a more comprehensive success/failure criterion
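
A rough sketch of computing and decomposing a proportion metric, assuming hypothetical daily funnel counts (visitors, clicks, signups):

import pandas as pd

# Hypothetical daily funnel counts (assumption for illustration)
funnel = pd.DataFrame({'visitors': [1000, 1200],
                       'clicks': [300, 420],
                       'signups': [60, 90]})

# Proportion metric: signup rate = signups / total visitors
signup_rate = funnel['signups'].sum() / funnel['visitors'].sum()

# Decomposition into granular components: (clicks/visitors) x (signups/clicks)
decomposed = (funnel['clicks'].sum() / funnel['visitors'].sum()) * \
             (funnel['signups'].sum() / funnel['clicks'].sum())

print(signup_rate, decomposed)  # both ~0.068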

A/B TESTING IN PYTHON


Metrics requirements
Stable/robust against the unimportant differences

Sensitive to the important changes


Measurable within logging limitations

Non-gameable (e.g., time on page can be gamed with attention-grabbing bright colors)

A/B TESTING IN PYTHON


Python metrics estimation
checkout.groupby('gender')['purchased'].mean()

gender
F 0.908056
M 0.780009
Name: purchased, dtype: float64

checkout[(checkout['browser']=='chrome')|(checkout['browser']=='safari')]\
.groupby('gender')['order_value'].mean()

gender
F 29.814161
M 30.383431
Name: order_value, dtype: float64

A/B TESTING IN PYTHON


Python metrics estimation
checkout.groupby('browser')[['order_value', 'purchased']].mean()

order_value purchased
browser
chrome 30.016625 0.839088
firefox 29.887491 0.851725
safari 30.119808 0.844337

A/B TESTING IN PYTHON


Hypothesis
formulation and
distributions
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Defining hypotheses
A hypothesis is:
a statement explaining an event
a starting point for further investigation

an idea we want to test


A strong hypothesis:
is testable, declarative, concise, and logical

enables systematic iteration


is easier to generalize and confirm understanding
results in actionable/focused recommendations

A/B TESTING IN PYTHON


Hypothesis format
General framing format:
Based on X, we believe that if we do Y
Then Z will happen

As measured by metric(s) M
Example of the alternative hypothesis:
Based on user experience research, we believe that if we update our checkout page
design
Then the percentage of purchasing customers will increase

As measured by purchase rate


Null hypothesis: ...the percentage of purchasing customers will not change...
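
In symbols, letting p_control and p_treatment denote the purchase rates of the current and updated checkout pages:
Null hypothesis H0: p_treatment - p_control = 0
Alternative hypothesis H1: p_treatment - p_control ≠ 0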

A/B TESTING IN PYTHON


Calculating sample statistics
# Calculate the number of users in groups A and B
n_A = checkout[checkout['checkout_page'] == 'A']['purchased'].count()
n_B = checkout[checkout['checkout_page'] == 'B']['purchased'].count()
print('Group A users:',n_A)
print('Group B users:',n_B)

Group A users: 3000


Group B users: 3000

# Calculate the mean purchase rates of groups A and B


p_A = checkout[checkout['checkout_page'] == 'A']['purchased'].mean()
p_B = checkout[checkout['checkout_page'] == 'B']['purchased'].mean()
print('Group A mean purchase rate:',p_A)
print('Group B mean purchase rate:',p_B)

Group A mean purchase rate: 0.820


Group B mean purchase rate: 0.847

A/B TESTING IN PYTHON


Simulating and plotting distributions
The number of purchasers in n trials with
purchasing probability p is Binomially
distributed.

# Import numerical, plotting, and stats libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
# Create x-axis range and Binomial distributions A and B
x = np.arange(n_A*p_A - 100, n_B*p_B + 100)
binom_a = binom.pmf(x, n_A, p_A)
binom_b = binom.pmf(x, n_B, p_B)
# Plot Binomial distributions A and B
plt.bar(x, binom_a, alpha=0.4, label='Checkout A')
plt.bar(x, binom_b, alpha=0.4, label='Checkout B')
plt.xlabel('Purchased')
plt.ylabel('PMF')
plt.title('PMF of Checkouts Binomial distribution')
plt.show()

A/B TESTING IN PYTHON


Central limit theorem
For a sufficiently large sample size, the distribution of the sample means will be

normally distributed around the true population mean

with a standard deviation equal to the standard error of the mean


irrespective of the distribution of the underlying data

A/B TESTING IN PYTHON


Central limit theorem in python
# Import libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Set random seed for repeatability
np.random.seed(47)
# Create an empty list to hold means
sampled_means = []
# Create loop to simulate 1000 sample means
for i in range(1000):
    # Take a sample of n=100
    sample = checkout['purchased'].sample(100, replace=True)
    # Get the sample mean and append to list
    sample_mean = np.mean(sample)
    sampled_means.append(sample_mean)
# Plot distribution
sns.displot(sampled_means, kde=True)
plt.show()

A/B TESTING IN PYTHON


Hypothesis mathematical representation
# Import norm from scipy library and plotting libraries
from scipy.stats import norm
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Create x-axis range and normal distributions A and B
x = np.linspace(0.775, 0.9, 500)
norm_a = norm.pdf(x, p_A, np.sqrt(p_A*(1-p_A) / n_A))
norm_b = norm.pdf(x, p_B, np.sqrt(p_B*(1-p_B) / n_B))
# Plot normal distributions A and B
fig, ax = plt.subplots()
sns.lineplot(x=x, y=norm_a, ax=ax, label='Checkout A')
sns.lineplot(x=x, y=norm_b, color='orange', ax=ax, label='Checkout B')
ax.axvline(p_A, linestyle='--')
ax.axvline(p_B, linestyle='--')
plt.xlabel('Purchased Proportion')
plt.ylabel('PDF')
plt.legend(loc="upper left")
plt.show()

A/B TESTING IN PYTHON


Experimental design:
setting up testing
parameters
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Distribution parameters
The observed difference d between variants follows a normal distribution (Null vs. alternative hypothesis distributions)

If the observed difference d is unlikely under the Null hypothesis:

reject the Null hypothesis

A/B TESTING IN PYTHON


Design parameters and error types
Power (1 - β)
β = Type II error = False negative
Commonly set at 80%

Minimum Detectable Effect (MDE)


Smallest difference we care to capture

A/B TESTING IN PYTHON


Design parameters and error types
Significance level α
α = Type I error = False positive
Commonly set at 5%

P-value
Probability of obtaining a result at least as extreme as the one observed,
assuming the Null hypothesis is true.
If p-value < α
Reject Null hypothesis
If p-value > α
Fail to reject Null hypothesis
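
A minimal sketch of this decision rule, assuming a z-statistic has already been computed for the test:

from scipy.stats import norm

alpha = 0.05   # significance level (Type I error rate)
z_stat = 2.1   # hypothetical test statistic (assumption for illustration)

# Two-tailed p-value: probability of a result at least as extreme under the Null
p_value = 2 * (1 - norm.cdf(abs(z_stat)))

if p_value < alpha:
    print(f"p-value {p_value:.4f} < {alpha}: reject the Null hypothesis")
else:
    print(f"p-value {p_value:.4f} >= {alpha}: fail to reject the Null hypothesis")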

A/B TESTING IN PYTHON


Experiment parameters analogy
Analogy for explaining statistical power and parameters:

1. Time at store = sample size/experiment duration

2. Bag of chips size = effect size/MDE


3. Store cleanliness/organization = data variance

A/B TESTING IN PYTHON


Experimental design:
power analysis
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Effect size
Cohen's d for differences in means

Cohen's h for differences in proportions

Rule of thumb:
Small effect = 0.2
Medium effect = 0.5
Large effect = 0.8

# Calculate standardized effect size
from statsmodels.stats.proportion import proportion_effectsize
effect_size_std = proportion_effectsize(.33, .3)
print(effect_size_std)

0.0645

# Calculate standardized effect size
from statsmodels.stats.proportion import proportion_effectsize
effect_size_std = proportion_effectsize(p_B, p_A)
print(effect_size_std)

0.0716

A/B TESTING IN PYTHON


Sample size estimation for proportions
# Import power module
from statsmodels.stats import power
# Calculate sample size
sample_size = power.TTestIndPower().solve_power(effect_size=effect_size_std,
power=.80,
alpha=.05,
nobs1=None)
print(sample_size)

3057.547

A/B TESTING IN PYTHON


Effect of sample size and MDE on power
# Import t-test power package and numerical/plotting libraries
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestIndPower
# Specify parameters for power analysis
sample_sizes = np.array(range(10, 120))
effect_sizes = np.array([0.2, 0.5, 0.8])
# Plot power curves
TTestIndPower().plot_power(nobs=sample_sizes, effect_size=effect_sizes)
plt.show()

A/B TESTING IN PYTHON


Sample size estimation for means
# Calculate the baseline mean order value
mean_A = checkout[checkout['checkout_page']=='A']['order_value'].mean()
print(mean_A)

24.9564

std_A = checkout[checkout['checkout_page']=='A']['order_value'].std()
print(std_A)

2.418

# Specify the desired minimum average order value


mean_new = 26

# Calculate the standardized effect size


std_effect_size=(mean_new-mean_A)/std_A

A/B TESTING IN PYTHON


Sample size estimation for means
sample_size = power.TTestIndPower().solve_power(effect_size=std_effect_size,
power=.80,
alpha=.05,
nobs1=None)
print(sample_size)

85.306

A/B TESTING IN PYTHON


Multiple
comparisons tests
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Introduction to the multiple comparisons problem
Single comparison:
Control (A) versus Treatment (B)
One metric
No subcategories

Multiple comparisons:
Multiple variants (A/B/n tests)
Multiple metrics
Granular categories

A/B TESTING IN PYTHON


Family-wise error rate
P(making Type I error) = α = 0.05

P(not making Type I error) = 1 - α


P(not making Type I error in m tests) = (1 - α)^m

P(making at least one Type I error in m tests) = 1 - (1 - α)^m = FWER


Family-wise Error Rate (FWER): the probability of making one or more type I errors when
performing multiple hypothesis tests.

For a single test, FWER = 1 - (1 - α)^1 = α = 0.05


But what if we perform more than one test?

A/B TESTING IN PYTHON


Family-wise error rate
Example: FWER for 10 tests = 1 - (1 - α)^10 ≈ 40%

import matplotlib.pyplot as plt
import numpy as np

alpha = 0.05
x = np.linspace(0, 20, 21)
y = 1 - (1 - alpha)**x
plt.plot(x, y, marker='o')
plt.title('FWER vs Number of Tests')
plt.xlabel('Number of Tests')
plt.ylabel('FWER')
plt.show()

A/B TESTING IN PYTHON


Correction methods
The simplest and most popular approach is the Bonferroni Correction

Set the adjusted α* to the individual test α divided by the number of tests m: α* = α / m

Less stringent Sidak correction


Set FWER to the desired α, then solve for the per-test α_s: α_s = 1 - (1 - α)^(1/m)
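
A quick sketch of both adjustments for m tests at a desired family-wise α of 0.05:

alpha = 0.05   # desired family-wise error rate
m = 3          # number of tests

# Bonferroni: divide the individual test alpha by the number of tests
alpha_bonferroni = alpha / m

# Sidak: solve FWER = 1 - (1 - alpha_sidak)**m for alpha_sidak
alpha_sidak = 1 - (1 - alpha) ** (1 / m)

print(alpha_bonferroni)  # ~0.0167
print(alpha_sidak)       # ~0.0170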

A/B TESTING IN PYTHON


Bonferroni correction example
Without correction, all three tests are considered significant,
but the probability of making a type I error is inflated to 14%.
With a Bonferroni correction, A versus D is no longer significant, but FWER is controlled at
0.049.

A/B TESTING IN PYTHON


statsmodels multipletests method
import statsmodels.stats.multitest as smt
pvals = [0.023,0.0005,0.00004]

corrected = smt.multipletests(pvals, alpha=0.05, method='bonferroni')

print("Significant Test:", corrected[0])


print("Corrected P-values:", corrected[1])
print("Bonferroni Corrected alpha: {:.4f}".format(corrected[3]))

Significant Test: [False True True]


Corrected P-values: [0.069 0.0015 0.00012]
Bonferroni Corrected alpha: 0.0167

A/B TESTING IN PYTHON


Data cleaning and
exploratory analysis
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Cleaning missing values
Missing values
Drop, ignore, impute

# Calculate the mean order value


checkout.order_value.mean()

30.0096

# Replace missing values with zeros and get mean


checkout['order_value'].fillna(0).mean()

25.3581

A/B TESTING IN PYTHON


Cleaning duplicates
Duplicates
Identical rows should be dropped
# Check for duplicate rows due to logging issues
print(len(checkout))
print(len(checkout.drop_duplicates(keep='first')))

9000
9000

A/B TESTING IN PYTHON


Cleaning duplicates
Duplicates
Duplicate users should be handled with care.
# Unique users in group B
print(checkout[checkout['checkout_page'] == 'B']['user_id'].nunique())
# Unique users who purchased at least once
print(checkout[checkout['checkout_page'] == 'B'].groupby('user_id')['purchased'].max().sum())
# Total purchase events in group B
print(checkout[checkout['checkout_page'] == 'B']['purchased'].sum())

2938
2491.0
2541.0

A/B TESTING IN PYTHON


EDA summary stats
Mean, count, and standard deviation summary

checkout.groupby('checkout_page')['order_value'].agg({'mean','std','count'})

mean count std


checkout_page
A 24.956437 2461 2.418837
B 29.876202 2541 7.277644
C 34.917589 2603 4.869816

A/B TESTING IN PYTHON


EDA plotting
Bar plots

sns.barplot(x=checkout['checkout_page'], y=checkout['order_value'], estimator=np.mean)


plt.title('Average Order Value per Checkout Page Variant')
plt.xlabel('Checkout Page Variant')
plt.ylabel('Order Value [$]')

A/B TESTING IN PYTHON


EDA plotting
Histograms

sns.displot(data=checkout, x='order_value', hue = 'checkout_page', kde=True)

A/B TESTING IN PYTHON


EDA plotting
Time series (line plots)

sns.lineplot(data=AdSmart,x='date', y='yes', hue='experiment', ci=False)

1 Adsmart Kaggle dataset: https://www.kaggle.com/datasets/osuolaleemmanuel/ad-ab-testing

A/B TESTING IN PYTHON


Sanity checks:
Internal validity
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Sample Ratio Mismatch (SRM)
Sample Ratio Mismatch (SRM):
Allocation across variants deviates from the designed ratio

Detected with a Chi-square goodness of fit test

A/B TESTING IN PYTHON


SRM python example
# Calculate the unique IDs per variant
AdSmart.groupby('experiment')['auction_id'].nunique()

experiment
control 4071
exposed 4006

# Assign the unique counts to each variant


control_users=AdSmart[AdSmart['experiment']=='control']['auction_id'].nunique()
exposed_users=AdSmart[AdSmart['experiment']=='exposed']['auction_id'].nunique()
total_users=control_users+exposed_users
# Calculate allocation ratios per variant
control_perc = control_users / total_users
exposed_perc = exposed_users / total_users
print("Percentage of users in the Control group:",100*round(control_perc,5),"%")
print("Percentage of users in the Exposed group:",100*round(exposed_perc,5),"%")

Percentage of users in the Control group: 50.402 %


Percentage of users in the Exposed group: 49.598 %

1 Adsmart Kaggle dataset: https://www.kaggle.com/datasets/osuolaleemmanuel/ad-ab-testing

A/B TESTING IN PYTHON


SRM python example
# Create lists of observed and expected counts per variant
observed = [control_users, exposed_users]
expected = [total_users/2, total_users/2]
# Import chisquare from scipy library
from scipy.stats import chisquare
# Run chisquare test on observed and expected lists
chi = chisquare(observed, f_exp=expected)
# Print test results and interpretation
print(chi)
if chi[1] < 0.01:
    print("SRM may be present")
else:
    print("SRM likely not present")

Power_divergenceResult(statistic=0.5230902562832735, pvalue=0.4695264353014863)
SRM likely not present

1 Adsmart Kaggle dataset: https://www.kaggle.com/datasets/osuolaleemmanuel/ad-ab-testing

A/B TESTING IN PYTHON


SRM root-causing
Common causes of SRM:1

Assignment: incorrect bucketing or faulty randomization functions

Execution: delayed variants starting time or ramp up rates


Data logging: logging delays or bot filtering

Interference: experimenter pausing a variant

1Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for
Practitioners

A/B TESTING IN PYTHON


A/A tests
A/A test
Presents an identical experience to two groups of users
Reveals bugs in experimental setup

No statistically significant differences between the metrics are expected


False positives can still happen at the specified α (5% of the time); see the simulation sketch after this list

Reveals imbalances in distributions across groups (e.g. browsers, devices, etc.)
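
A minimal simulation sketch of that false-positive rate, assuming identical conversion rates in both groups:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

np.random.seed(42)
alpha = 0.05
n_users = 2000       # users per group in each simulated A/A test
true_rate = 0.10     # identical conversion rate in both groups (assumption)
false_positives = 0

for _ in range(1000):
    conv_a = np.random.binomial(n_users, true_rate)
    conv_b = np.random.binomial(n_users, true_rate)
    _, pvalue = proportions_ztest([conv_a, conv_b], nobs=[n_users, n_users])
    if pvalue < alpha:
        false_positives += 1

print(false_positives / 1000)  # close to alpha (0.05)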

A/B TESTING IN PYTHON


Distributions balance Python example
Balanced browsers distribution (valid test):

checkout.groupby('checkout_page')['browser'].value_counts(normalize=True)

checkout_page  browser
A              chrome     0.341333
               safari     0.332000
               firefox    0.326667
B              safari     0.352000
               firefox    0.325000
               chrome     0.323000
C              safari     0.346000
               chrome     0.330000
               firefox    0.324000

Imbalanced browsers distribution (invalid test):

AdSmart.groupby('experiment')['browser'].value_counts(normalize=True)

experiment  browser
control     Chrome Mobile                 0.591992
            Facebook                      0.137804
            Samsung Internet              0.120855
            Chrome Mobile WebView         0.071727
            Mobile Safari                 0.060427
            Chrome Mobile iOS             0.008352
            Mobile Safari UI/WKWebView    0.007369
exposed     Chrome Mobile                 0.535197
            Chrome Mobile WebView         0.298802
            Samsung Internet              0.082876
            Facebook                      0.050674
            Mobile Safari                 0.022716
            Chrome Mobile iOS             0.004244

1 Adsmart Kaggle dataset: https://www.kaggle.com/datasets/osuolaleemmanuel/ad-ab-testing

A/B TESTING IN PYTHON


Sanity checks:
external validity
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Simpson's paradox
Simpson's Paradox: a statistical phenomenon where certain trends between variables emerge,
disappear or reverse when the population is divided into segments.

print(simp_imbalanced.groupby('Variant').mean())

Variant Conversion
A 0.80
B 0.64

print(simp_imbalanced.groupby(['Variant','Device']).mean())

Variant Device Conversion


A Phone 0.875
Tablet 0.500
B Phone 0.900
Tablet 0.575

A/B TESTING IN PYTHON


Simpson's paradox
simp_imbalanced.groupby(['Variant','Device'])\
['Device'].count()

Variant Device
A Phone 40
Tablet 10
B Phone 10
Tablet 40
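
A rough reconstruction of this imbalanced dataset from the counts and conversion rates shown above, useful for reproducing the paradox:

import pandas as pd

def make_cell(variant, device, n, conv_rate):
    # Build n rows for one variant/device cell with a fixed conversion rate
    conversions = int(round(n * conv_rate))
    return pd.DataFrame({'Variant': variant,
                         'Device': device,
                         'Conversion': [1] * conversions + [0] * (n - conversions)})

simp_imbalanced = pd.concat([make_cell('A', 'Phone', 40, 0.875),
                             make_cell('A', 'Tablet', 10, 0.500),
                             make_cell('B', 'Phone', 10, 0.900),
                             make_cell('B', 'Tablet', 40, 0.575)],
                            ignore_index=True)

# Aggregating reverses the segment-level ordering: A wins overall, B wins in every segment
print(simp_imbalanced.groupby('Variant')['Conversion'].mean())
print(simp_imbalanced.groupby(['Variant', 'Device'])['Conversion'].mean())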

A/B TESTING IN PYTHON


Simpson's paradox
simp_balanced.groupby(['Variant','Device'])['Device'].count()

Variant  Device
A        Phone     40
         Tablet    10
B        Phone     40
         Tablet    10

print(simp_balanced.groupby(['Variant','Device']).mean())

Variant  Device    Conversion
A        Phone     0.750
         Tablet    0.500
B        Phone     0.575
         Tablet    0.300

print(simp_balanced.groupby('Variant').mean())

Variant  Conversion
A        0.70
B        0.52

A/B TESTING IN PYTHON


Novelty effect
Novelty effect
A short-lived improvement in metrics caused by users' curiosity about a new feature.
Change aversion
The opposite of the novelty effect.

Users avoid trying a new feature because they are accustomed to the old one.

A/B TESTING IN PYTHON


Novelty effect visual inspection
# Plot Lift in CTR vs test days
novelty.plot('date', 'CTR_lift')
plt.ylim([0, 0.09])
plt.title('Lift in CTR vs Test Duration')
plt.show()

A/B TESTING IN PYTHON


Correcting for novelty effects
Increasing the test duration
Start including data after treatment effect stabilizes.
Examine new and returning user cohorts
New users are by default less likely to experience novelty effects.

Returning users compare the new experience against the one they are used to.

A/B TESTING IN PYTHON


Analyzing difference
in proportions A/B
tests
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Framework for difference in proportions
If p-value < α
Reject Null hypothesis
If p-value > α
Fail to reject Null hypothesis

Confidence intervals
95% CI is the range that captures the
true difference 95% of the time

Like fishing with a net instead of a spear


Centered around the observed difference
between the treatment and the control

A/B TESTING IN PYTHON


Two sample proportions z-test
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
# Calculate the number of users in groups A and B
n_A = checkout[checkout['checkout_page'] == 'A']['user_id'].nunique()
n_B = checkout[checkout['checkout_page'] == 'B']['user_id'].nunique()
print('Group A users:',n_A)
print('Group B users:',n_B)

Group A users: 2940


Group B users: 2938

# Compute unique purchasers in each group


purchased_A = checkout[checkout['checkout_page'] == 'A'].groupby('user_id')['purchased'].max().sum()
purchased_B = checkout[checkout['checkout_page'] == 'B'].groupby('user_id')['purchased'].max().sum()
# Assign groups lists
purchasers_abtest = [purchased_A, purchased_B]
n_abtest = [n_A, n_B]

A/B TESTING IN PYTHON


Two sample proportions z-test
# Calculate p-value and confidence intervals
z_stat, pvalue = proportions_ztest(purchasers_abtest, nobs=n_abtest)
(A_lo95, B_lo95), (A_up95, B_up95) = proportion_confint(purchasers_abtest, nobs=n_abtest, alpha=0.05)
# Print the p-value and confidence intervals
print(f'p-value: {pvalue:.4f}')
print(f'Group A 95% CI : [{A_lo95:.4f}, {A_up95:.4f}]')
print(f'Group B 95% CI : [{B_lo95:.4f}, {B_up95:.4f}]')

p-value: 0.0058
Group A 95% CI : [0.8072, 0.8349]
Group B 95% CI : [0.8349, 0.8608]

A/B TESTING IN PYTHON


Confidence intervals for proportions
# Set random seed for repeatability
np.random.seed(34)
# Calculate the average purchase rate for group B
pop_mean = checkout[checkout['checkout_page'] == 'B']['purchased'].mean()
print(pop_mean)

0.847

A/B TESTING IN PYTHON


Confidence intervals for proportions
# Calculate 20 90% confidence intervals for 20 random samples of size 100 each
for i in range(20):
    confidence_interval = proportion_confint(
        count=checkout[checkout['checkout_page'] == 'B'].sample(100)['purchased'].sum(),
        nobs=100,
        alpha=(1 - 0.90))
    print(confidence_interval)

(0.7912669777384846, 0.9087330222615153)
(0.8385342148455946, 0.9414657851544054)
(0.8265485838585659, 0.9334514161414341)
(0.7568067872454262, 0.8831932127545737)
(0.8506543911914558, 0.9493456088085442)*
(0.8385342148455946, 0.9414657851544054)
(0.7230037568938057, 0.8569962431061944)
(0.8146830076144598, 0.9253169923855402)
(0.8029257122801267, 0.9170742877198733)
(0.8146830076144598, 0.9253169923855402)
(0.8506543911914558, 0.9493456088085442)*
(0.7454722433688197, 0.8745277566311804)
...

A/B TESTING IN PYTHON


Analyzing difference
in means A/B tests
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Framework for difference in means
Calculate required sample size

Run experiment and perform sanity checks

Calculate the metrics per variant

checkout.groupby('checkout_page')['time_on_page'].mean()

checkout_page
A 44.668527
B 42.723772
C 42.223772

Analyze the difference using a t-test
If p-value < α: Reject Null hypothesis
If p-value > α: Fail to reject Null hypothesis

A/B TESTING IN PYTHON


Pingouin t-test
checkout.groupby('checkout_page')['time_on_page'].mean()

checkout_page
A 44.668527
B 42.723772
C 42.223772

# Import pingouin and run an independent two-sided t-test
import pingouin
ttest = pingouin.ttest(x=checkout[checkout['checkout_page']=='C']['time_on_page'],
                       y=checkout[checkout['checkout_page']=='B']['time_on_page'],
                       paired=False,
                       alternative="two-sided")
print(ttest)

T dof alternative p-val CI95% cohen-d BF10 power


T-test -1.995423 5998 two-sided 0.046042 [-0.99, -0.01] 0.051522 0.212 0.514054

A/B TESTING IN PYTHON


Pingouin pairwise
pairwise = pingouin.pairwise_tests(data = checkout,
dv = "time_on_page",
between = "checkout_page",
padjust = "bonf")
print(pairwise)

Contrast A B Paired Parametric T dof alternative \


0 checkout_page A B False True 7.026673 5998.0 two-sided
1 checkout_page A C False True 8.833244 5998.0 two-sided
2 checkout_page B C False True 1.995423 5998.0 two-sided

p-unc p-corr p-adjust BF10 hedges


0 2.349604e-12 7.048812e-12 bonf 1.305e+09 0.181405
1 1.316118e-18 3.948354e-18 bonf 1.811e+15 0.228045
2 4.604195e-02 1.381258e-01 bonf 0.212 0.051515

A/B TESTING IN PYTHON


Non-parametric
statistical tests
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Parametric tests assumptions
1. Random sampling
Data is randomly sampled from the population.
Investigate the data collection/sampling process.

2. Independence
Each observation/data point is independent.
Not accounting for dependencies inflates error rates.

3. Normality
Normally distributed data (a quick check is sketched after this list).

Large "enough" sample size.


Two-sample t-test: n >= 30 in each group.
Two-sample proportions test: >= 10 successes and >= 10 failures in each group.
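
One way to sanity-check the normality assumption before choosing a test is pingouin's normality function; a sketch on the checkout data used earlier:

import pingouin

# Shapiro-Wilk normality test of order value within each checkout page variant
normality = pingouin.normality(data=checkout.dropna(subset=['order_value']),
                               dv='order_value',
                               group='checkout_page')
print(normality)  # the 'normal' column flags groups consistent with a normal distribution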

A/B TESTING IN PYTHON


Mann-Whitney U test
Non-parametric test for statistical significance

Determines if two independent samples have the same parent distribution


Rank sum test

Unpaired data

A/B TESTING IN PYTHON


Mann-Whitney U test in python
# Calculate the mean and count of time on page by variant
print(checkout.groupby('checkout_page')['time_on_page'].agg({'mean', 'count'}))

mean count
checkout_page
A 44.668527 3000
B 42.723772 3000
C 42.223772 3000

# Set random seed for repeatability


np.random.seed(40)
# Take a random sample of size 25 from each variant
ToP_samp_A = checkout[checkout['checkout_page'] == 'A'].sample(25)['time_on_page']
ToP_samp_B = checkout[checkout['checkout_page'] == 'B'].sample(25)['time_on_page']

A/B TESTING IN PYTHON


Mann-Whitney U test in python
# Run a Mann-Whitney U test
mwu_test = pingouin.mwu(x=ToP_samp_A,
y=ToP_samp_B,
alternative='two-sided')
# Print the test results
print(mwu_test)

U-val alternative p-val RBC CLES


MWU 441.0 two-sided 0.013007 -0.4112 0.7056

A/B TESTING IN PYTHON


Chi-square test of independence
Free from parametric test assumptions

Tests whether two or more categorical variables are independent


Null hypothesis: The variables are independent.

Alternative hypothesis: The variables are not independent.

A/B TESTING IN PYTHON


Chi-square test in python
Homepage signup rates A/B test

Null: There is no significant difference in signup rates between landing page designs C and D

Alternative: There is a significant difference in signup rates between landing page designs C and D

# Calculate the number of users in groups C and D


n_C = homepage[homepage['landing_page'] == 'C']['user_id'].nunique()
n_D = homepage[homepage['landing_page'] == 'D']['user_id'].nunique()

# Compute unique signups in each group


signup_C = homepage[homepage['landing_page'] == 'C'].groupby('user_id')['signup'].max().sum()
no_signup_C = n_C - signup_C
signup_D = homepage[homepage['landing_page'] == 'D'].groupby('user_id')['signup'].max().sum()
no_signup_D = n_D - signup_D

A/B TESTING IN PYTHON


Chi-square test in python
# Import scipy stats and create the signups contingency table
from scipy import stats
table = [[signup_C, no_signup_C], [signup_D, no_signup_D]]
print('Group C signup rate:',round(signup_C/n_C,3))
print('Group D signup rate:',round(signup_D/n_D,3))

# Calculate p-value
print('p-value=',stats.chi2_contingency(table,correction=False)[1])

Group C signup rate: 0.064


Group D signup rate: 0.048
p-value= 0.009165

A/B TESTING IN PYTHON


Ratio metrics and
the delta method
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Ratio metrics A/B testing
Mean metrics

Unit of analysis:
The entity being analyzed in an A/B test

Denominator in ratio metrics


Randomization unit:
The subject randomly allocated to each variant

A/B TESTING IN PYTHON


Ratio metrics A/B testing
Per-user Ratio metrics

A/B TESTING IN PYTHON


Delta method motivation
print(checkout.groupby('checkout_page')[['order_value','purchased']].agg({'sum','count','mean'}))

order_value purchased
mean sum count mean sum count
checkout_page
A 24.956437 61417.791564 2461 0.820333 2461.0 3000
B 29.876202 75915.430125 2541 0.847000 2541.0 3000
C 34.917589 90890.484142 2603 0.867667 2603.0 3000

checkout.groupby('checkout_page')['order_value'].sum() / \
checkout.groupby('checkout_page')['purchased'].count()

checkout_page
A 20.472597
B 25.305143
C 30.296828
dtype: float64

A/B TESTING IN PYTHON


Delta method variance
Delta method ratio metrics variance estimation:1

# Delta method variance of ratio metric
def var_delta(x, y):
    x_bar = np.mean(x)
    y_bar = np.mean(y)
    x_var = np.var(x, ddof=1)
    y_var = np.var(y, ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0][1]
    # Note that we divide by len(x) here because the denominator of the test statistic is the standard error (= sqrt(var/n))
    var_ratio = (x_var/y_bar**2 + y_var*(x_bar**2/y_bar**4) - 2*cov_xy*(x_bar/y_bar**3)) / len(x)
    return var_ratio

1Budylin, Roman & Drutsa, Alexey & Katsev, Ilya & Tsoy, Valeriya. (2018). Consistent Transformation of Ratio
Metrics for Efficient Online Controlled Experiments. 55-63. 10.1145/3159652.3159699.

A/B TESTING IN PYTHON


Delta method z-test
# Delta method ztest calculation
ztest_delta(x_control, y_control, x_treatment, y_treatment, alpha=0.05)

Input arguments:
x_control : control variant user-level ratio numerator column
y_control : control variant user-level ratio denominator column
x_treatment : treatment variant user-level ratio numerator column
y_treatment : treatment variant user-level ratio denominator column

Output:
mean_control : control group ratio metric mean
mean_treatment : treatment group ratio metric mean
difference : difference between treatment and control means
diff_CI : confidence interval of the difference in means
p-value : the two-tailed z-test p-value
1 https://medium.com/@ahmadnuraziz3/applying-delta-method-for-a-b-tests-analysis-8b1d13411c22
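
ztest_delta is not a statsmodels function; the following is a minimal sketch consistent with the description above, reusing var_delta from the previous slide:

import numpy as np
from scipy.stats import norm

def ztest_delta(x_control, y_control, x_treatment, y_treatment, alpha=0.05):
    # Ratio metric means per group (e.g. order value per page view)
    mean_control = np.sum(x_control) / np.sum(y_control)
    mean_treatment = np.sum(x_treatment) / np.sum(y_treatment)
    difference = mean_treatment - mean_control

    # Delta-method variances of each group's ratio estimate (var_delta already divides by n)
    se_diff = np.sqrt(var_delta(x_control, y_control) + var_delta(x_treatment, y_treatment))

    # Two-tailed z-test and confidence interval for the difference
    z = difference / se_diff
    p_value = 2 * (1 - norm.cdf(abs(z)))
    z_crit = norm.ppf(1 - alpha / 2)
    ci_low, ci_high = difference - z_crit * se_diff, difference + z_crit * se_diff

    return {'mean_control': mean_control,
            'mean_treatment': mean_treatment,
            'difference': difference,
            'diff_CI': f'[{ci_low:.3f}, {ci_high:.3f}]',
            'p-value': p_value}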

A/B TESTING IN PYTHON


Python example
# Create DataFrames for per user metrics for variants A and B
A_per_user = pd.DataFrame({'order_value':checkout[checkout['checkout_page']=='A'].groupby('user_id')['order_value'].sum()
,'page_view':checkout[checkout['checkout_page']=='A'].groupby('user_id')['user_id'].count()})
B_per_user = pd.DataFrame({'order_value':checkout[checkout['checkout_page']=='B'].groupby('user_id')['order_value'].sum()
,'page_view':checkout[checkout['checkout_page']=='B'].groupby('user_id')['user_id'].count()})

# Assign the control and treatment ratio columns


x_control = A_per_user['order_value']
y_control = A_per_user['page_view']
x_treatment = B_per_user['order_value']
y_treatment = B_per_user['page_view']

# Run a z-test for ratio metrics


ztest_delta(x_control,y_control,x_treatment,y_treatment)

{'mean_control': 20.472597188012,
'mean_treatment': 25.30514337484097,
'difference': 4.833,
'diff_CI': '[4.257, 5.408]',
'p-value': 5.954978880467735e-61}

A/B TESTING IN PYTHON


A/B Testing best
practices and
advanced topics
intro
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Best practices
Avoid peeking

Avoid making decisions by peeking at the results before reaching the designed sample size,
as this inflates error rates similar to multiple comparisons.

Account for day-of-the-week effects

Users may behave differently on weekends versus weekdays, so run the test over full weeks to capture overall behavior.

A/B TESTING IN PYTHON


Best practices
Simplicity/feasibility:
Do we need to build the full feature?
Painted door tests

Isolation
Change one variable at a time to attribute impact.

A/B TESTING IN PYTHON


Advanced topics
Multifactorial design and interaction effects
Measures the isolated effect of each variable
Uncovers interaction/synergistic effects

Bayesian A/B testing (a short sketch follows this list)


Incorporates prior data into the current experiment
Views population parameters as distributions

More intuitive interpretation of test results


SUTVA violation and network effects
One user's assignment in a test impacts other users' behavior
Common in social network A/B tests
One solution: cluster-based assignment
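
A tiny Beta-Binomial sketch of the Bayesian idea above, assuming uniform Beta(1, 1) priors and hypothetical conversion counts:

import numpy as np

np.random.seed(42)

# Hypothetical observed conversions out of 1000 users per variant (assumption)
conv_control, n_control = 100, 1000
conv_treatment, n_treatment = 120, 1000

# Uniform Beta(1, 1) priors updated with the data give Beta posteriors over the rates
posterior_control = np.random.beta(1 + conv_control, 1 + n_control - conv_control, size=100_000)
posterior_treatment = np.random.beta(1 + conv_treatment, 1 + n_treatment - conv_treatment, size=100_000)

# Probability that the treatment's conversion rate exceeds the control's
print((posterior_treatment > posterior_control).mean())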

A/B TESTING IN PYTHON


Wrap-up: A/B
testing in python
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
A/B testing summary
Chapter 1
A/B testing steps and use-cases
Metrics definition and estimation
.sample() , .corr() , pairplot , heatmap

Chapter 2
Formulating A/B testing hypotheses
Error rates, power, effect size
Power analysis: sample size estimation
Multiple comparisons corrections

Chapter 3
Data cleaning and EDA
Sanity checks for validation
Analyzing difference in proportions
proportions_ztest , proportion_confint

Chapter 4
Analyzing differences in means
Non-parametric tests
Delta method for ratio metrics
Best practices and advanced topics

A/B TESTING IN PYTHON


Congratulations!
A/B TESTING IN PYTHON
