
What is A/B testing?

A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Intro to A/B testing
An A/B test is...

an experiment designed to determine which version of a design performs better

based on metric(s): signup rate, average sales per user, etc.

using random assignment and statistical analysis of the results

A/B TESTING IN PYTHON


To A/B test or not to test?
Good use of A/B testing:
Optimizing conversion rates
Releasing new app features
Evaluating incremental effects of ads
Assessing the impact of drug trials

Do not A/B test if:
No sufficient traffic/"small" sample size
No clear logical hypothesis
Ethical considerations
High opportunity cost

A/B TESTING IN PYTHON


A/B testing fundamental steps
1. Specify the goal and designs/experiences

2. Randomly sample users for enrollment


3. Randomly assign users to:
control variant: current state
treatment/test variant(s): new design

4. Log user actions and compute metrics


5. Test for statistically significant differences
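
A minimal sketch of steps 2 and 3, assuming a hypothetical pandas DataFrame of eligible users named users (not part of the course data):

import numpy as np
import pandas as pd

# Hypothetical pool of eligible users (assumption for illustration)
users = pd.DataFrame({'user_id': range(10000)})

# Step 2: randomly sample users for enrollment
enrolled = users.sample(n=6000, random_state=42).copy()

# Step 3: randomly assign each enrolled user to control or treatment
rng = np.random.default_rng(42)
enrolled['variant'] = rng.choice(['control', 'treatment'], size=len(enrolled))

# Check the resulting allocation ratio
print(enrolled['variant'].value_counts(normalize=True))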

A/B TESTING IN PYTHON


Value of randomization
Generalizability and representativeness

Minimizing bias between groups


Establishing causality by isolating treatment effect

1 https://www.statology.org/random-selection-vs-random-assignment/

A/B TESTING IN PYTHON


Python example of random assignment
checkout.info()

RangeIndex: 9000 entries, 0 to 8999


Data columns (total 6 columns):
# Column Non-Null Count Dtype
0 user_id 9000 non-null int64
1 checkout_page 9000 non-null object
2 order_value 7605 non-null float64
3 purchased 9000 non-null float64
4 gender 9000 non-null object
5 browser 9000 non-null object
dtypes: float64(2), int64(1), object(3)
memory usage: 422.0+ KB

A/B TESTING IN PYTHON


Python example of random assignment
checkout['gender'].value_counts(normalize=True)

F 0.507556
M 0.492444
Name: gender, dtype: float64

sample_df = checkout.sample(n=3000)
sample_df['gender'].value_counts(normalize=True)

M 0.506333
F 0.493667
Name: gender, dtype: float64

A/B TESTING IN PYTHON


Python example of random assignment
checkout.groupby('checkout_page')['gender'].value_counts(normalize=True)

checkout_page gender
A M 0.505000
F 0.495000
B F 0.507333
M 0.492667
C F 0.520333
M 0.479667
Name: gender, dtype: float64

A/B TESTING IN PYTHON


Why run
experiments?
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
The value of A/B testing
Reduce uncertainty around the impact of new designs and features

Decision-making becomes scientific and evidence-based rather than intuition-driven


High return on investment: simple changes can lead to major wins

Continuous optimization at the mature stage of the business


Correlation does not imply causation

A/B TESTING IN PYTHON


Hierarchy of evidence

1 https://jamanetwork.com/journals/jama/article-abstract/392650

A/B TESTING IN PYTHON


Do error messages reduce churn?
Microsoft Office 365 spurious correlation example:1

Spurious correlation: a strong correlation that appears to be causal but is not.

1 Kohavi, Ron,Tang, Diane,Xu, Ya. Trustworthy Online Controlled Experiments. Cambridge University Press.

A/B TESTING IN PYTHON


Pearson's correlation coefficient
A score that measures the strength of a linear relationship between two variables.

r > 0: positive correlation

r = 0: no linear correlation

r < 0: negative correlation


Pearson's correlation coefficient (r) formula:
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √( Σ(xᵢ - x̄)² Σ(yᵢ - ȳ)² )

Assumes: normal distribution and linearity

A/B TESTING IN PYTHON


Correlations visual inspection
# Import visualization library seaborn
import seaborn as sns

# Create pairplots
sns.pairplot(admissions[['Serial No.',\
'GRE Score', 'Chance of Admit']])

A/B TESTING IN PYTHON


Pearson correlation heatmap
# Import visualization library seaborn
import seaborn as sns

# Print Pearson correlation coefficient


print(admissions['GRE Score']\
.corr(admissions['Chance of Admit']))

0.8026104595903503

# Plot correlations heatmap


sns.heatmap(admissions.corr(),annot=True)

A/B TESTING IN PYTHON


Metrics design and
estimation
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Types of metrics

Primary (goal/north-star) metrics:
Best describe the success of the business or mission

Granular metrics:
Best explain users' behavior
More sensitive and actionable

Example: signup rate = (clicks/visitors) x (signups/clicks), a goal metric decomposed into granular components

Instrumentation/guardrail metrics:
Outside the scope of this course

A/B TESTING IN PYTHON


Types of metrics
Quantitative categorization

Means/percentiles: average sales, median time on page

Proportions:
Signup rate: signups/total visitors
Page abandonment rate: page abandoners/total visitors
Ratios:
Click-through-rate(CTR): clicks/page visits or clicks/ad impressions

Revenue per session


Metrics can be combined to form a more comprehensive success/failure criterion
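
A rough sketch of computing and decomposing a proportion metric, assuming hypothetical daily funnel counts (visitors, clicks, signups):

import pandas as pd

# Hypothetical daily funnel counts (assumption for illustration)
funnel = pd.DataFrame({'visitors': [1000, 1200],
                       'clicks': [300, 420],
                       'signups': [60, 90]})

# Proportion metric: signup rate = signups / total visitors
signup_rate = funnel['signups'].sum() / funnel['visitors'].sum()

# Decomposition into granular components: (clicks/visitors) x (signups/clicks)
decomposed = (funnel['clicks'].sum() / funnel['visitors'].sum()) * \
             (funnel['signups'].sum() / funnel['clicks'].sum())

print(signup_rate, decomposed)  # both ~0.068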

A/B TESTING IN PYTHON


Metrics requirements
Stable/robust against the unimportant differences

Sensitive to the important changes


Measurable within logging limitations

Non-gameable (e.g., time on page can be gamed with attention-grabbing bright colors)

A/B TESTING IN PYTHON


Python metrics estimation
checkout.groupby('gender')['purchased'].mean()

gender
F 0.908056
M 0.780009
Name: purchased, dtype: float64

checkout[(checkout['browser']=='chrome')|(checkout['browser']=='safari')]\
.groupby('gender')['order_value'].mean()

gender
F 29.814161
M 30.383431
Name: order_value, dtype: float64

A/B TESTING IN PYTHON


Python metrics estimation
checkout.groupby('browser')[['order_value', 'purchased']].mean()

order_value purchased
browser
chrome 30.016625 0.839088
firefox 29.887491 0.851725
safari 30.119808 0.844337

A/B TESTING IN PYTHON


Hypothesis
formulation and
distributions
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Defining hypotheses
A hypothesis is:
a statement explaining an event
a starting point for further investigation

an idea we want to test


A strong hypothesis:
is testable, declarative, concise, and logical

enables systematic iteration


is easier to generalize and confirm understanding
results in actionable/focused recommendations

A/B TESTING IN PYTHON


Hypothesis format
General framing format:
Based on X, we believe that if we do Y
Then Z will happen

As measured by metric(s) M
Example of the alternative hypothesis:
Based on user experience research, we believe that if we update our checkout page
design
Then the percentage of purchasing customers will increase

As measured by purchase rate


Null hypothesis: ...the percentage of purchasing customers will not change...
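
In symbols, letting p_control and p_treatment denote the purchase rates of the current and updated checkout pages:
Null hypothesis H0: p_treatment - p_control = 0
Alternative hypothesis H1: p_treatment - p_control ≠ 0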

A/B TESTING IN PYTHON


Calculating sample statistics
# Calculate the number of users in groups A and B
n_A = checkout[checkout['checkout_page'] == 'A']['purchased'].count()
n_B = checkout[checkout['checkout_page'] == 'B']['purchased'].count()
print('Group A users:',n_A)
print('Group B users:',n_B)

Group A users: 3000


Group B users: 3000

# Calculate the mean purchase rates of groups A and B


p_A = checkout[checkout['checkout_page'] == 'A']['purchased'].mean()
p_B = checkout[checkout['checkout_page'] == 'B']['purchased'].mean()
print('Group A mean purchase rate:',p_A)
print('Group B mean purchase rate:',p_B)

Group A mean purchase rate: 0.820


Group B mean purchase rate: 0.847

A/B TESTING IN PYTHON


Simulating and plotting distributions
The number of purchasers in n trials with
purchasing probability p is Binomially
distributed.

# Import numerical, plotting, and stats libraries
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom
# Create x-axis range and Binomial distributions A and B
x = np.arange(n_A*p_A - 100, n_B*p_B + 100)
binom_a = binom.pmf(x, n_A, p_A)
binom_b = binom.pmf(x, n_B, p_B)
# Plot Binomial distributions A and B
plt.bar(x, binom_a, alpha=0.4, label='Checkout A')
plt.bar(x, binom_b, alpha=0.4, label='Checkout B')
plt.xlabel('Purchased')
plt.ylabel('PMF')
plt.title('PMF of Checkouts Binomial distribution')
plt.show()

A/B TESTING IN PYTHON


Central limit theorem
For a sufficiently large sample size, the distribution of the sample means will be

normally distributed around the true population mean

with a standard deviation equal to the standard error of the mean


irrespective of the distribution of the underlying data

A/B TESTING IN PYTHON


Central limit theorem in python
# Import libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Set random seed for repeatability
np.random.seed(47)
# Create an empty list to hold means
sampled_means = []
# Create loop to simulate 1000 sample means
for i in range(1000):
    # Take a sample of n=100
    sample = checkout['purchased'].sample(100, replace=True)
    # Get the sample mean and append to list
    sample_mean = np.mean(sample)
    sampled_means.append(sample_mean)
# Plot distribution
sns.displot(sampled_means, kde=True)
plt.show()

A/B TESTING IN PYTHON


Hypothesis mathematical representation
# Import norm from scipy library and plotting libraries
from scipy.stats import norm
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Create x-axis range and normal distributions A and B
x = np.linspace(0.775, 0.9, 500)
norm_a = norm.pdf(x, p_A, np.sqrt(p_A*(1-p_A) / n_A))
norm_b = norm.pdf(x, p_B, np.sqrt(p_B*(1-p_B) / n_B))
# Plot normal distributions A and B
fig, ax = plt.subplots()
sns.lineplot(x=x, y=norm_a, ax=ax, label='Checkout A')
sns.lineplot(x=x, y=norm_b, color='orange', ax=ax, label='Checkout B')
ax.axvline(p_A, linestyle='--')
ax.axvline(p_B, linestyle='--')
plt.xlabel('Purchased Proportion')
plt.ylabel('PDF')
plt.legend(loc="upper left")
plt.show()

A/B TESTING IN PYTHON


Experimental design:
setting up testing
parameters
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Distribution parameters
The observed difference d between variants follows a normal distribution (Null vs. alternative hypothesis distributions)

If the observed difference d is unlikely under the Null hypothesis:

reject the Null hypothesis

A/B TESTING IN PYTHON


Design parameters and error types
Power (1 - β)
β = Type II error = False negative
Commonly set at 80%

Minimum Detectable Effect (MDE)


Smallest difference we care to capture

A/B TESTING IN PYTHON


Design parameters and error types
Significance level α
α = Type I error = False positive
Commonly set at 5%

P-value
Probability of obtaining a result at least as extreme as the one observed,
assuming the Null hypothesis is true.
If p-value < α
Reject Null hypothesis
If p-value > α
Fail to reject Null hypothesis
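
A minimal sketch of this decision rule, assuming a z-statistic has already been computed for the test:

from scipy.stats import norm

alpha = 0.05   # significance level (Type I error rate)
z_stat = 2.1   # hypothetical test statistic (assumption for illustration)

# Two-tailed p-value: probability of a result at least as extreme under the Null
p_value = 2 * (1 - norm.cdf(abs(z_stat)))

if p_value < alpha:
    print(f"p-value {p_value:.4f} < {alpha}: reject the Null hypothesis")
else:
    print(f"p-value {p_value:.4f} >= {alpha}: fail to reject the Null hypothesis")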

A/B TESTING IN PYTHON


Experiment parameters analogy
Analogy for explaining statistical power and parameters:

1. Time at store = sample size/experiment duration

2. Bag of chips size = effect size/MDE


3. Store cleanliness/organization = data variance

A/B TESTING IN PYTHON


Experimental design:
power analysis
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Effect size
Cohen's d for differences in means

Cohen's h for differences in proportions

Rule of thumb:
Small effect = 0.2
Medium effect = 0.5
Large effect = 0.8

# Calculate standardized effect size
from statsmodels.stats.proportion import proportion_effectsize
effect_size_std = proportion_effectsize(.33, .3)
print(effect_size_std)

0.0645

# Calculate standardized effect size
from statsmodels.stats.proportion import proportion_effectsize
effect_size_std = proportion_effectsize(p_B, p_A)
print(effect_size_std)

0.0716

A/B TESTING IN PYTHON


Sample size estimation for proportions
# Import power module
from statsmodels.stats import power
# Calculate sample size
sample_size = power.TTestIndPower().solve_power(effect_size=effect_size_std,
power=.80,
alpha=.05,
nobs1=None)
print(sample_size)

3057.547

A/B TESTING IN PYTHON


Effect of sample size and MDE on power
# Import t-test power package and numerical/plotting libraries
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestIndPower
# Specify parameters for power analysis
sample_sizes = np.array(range(10, 120))
effect_sizes = np.array([0.2, 0.5, 0.8])
# Plot power curves
TTestIndPower().plot_power(nobs=sample_sizes, effect_size=effect_sizes)
plt.show()

A/B TESTING IN PYTHON


Sample size estimation for means
# Calculate the baseline mean order value
mean_A = checkout[checkout['checkout_page']=='A']['order_value'].mean()
print(mean_A)

24.9564

std_A = checkout[checkout['checkout_page']=='A']['order_value'].std()
print(std_A)

2.418

# Specify the desired minimum average order value


mean_new = 26

# Calculate the standardized effect size


std_effect_size=(mean_new-mean_A)/std_A

A/B TESTING IN PYTHON


Sample size estimation for means
sample_size = power.TTestIndPower().solve_power(effect_size=std_effect_size,
power=.80,
alpha=.05,
nobs1=None)
print(sample_size)

85.306

A/B TESTING IN PYTHON


Multiple
comparisons tests
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Introduction to the multiple comparisons problem
Single comparison:
Control (A) versus Treatment (B)
One metric
No subcategories

Multiple comparisons:
Multiple variants (A/B/n tests)
Multiple metrics
Granular categories

A/B TESTING IN PYTHON


Family-wise error rate
P(making Type I error) = α = 0.05

P(not making Type I error) = 1 - α


P(not making Type I error in m tests) = (1 - α)^m

P(making at least one Type I error in m tests) = 1 - (1 - α)^m = FWER


Family-wise Error Rate (FWER): the probability of making one or more type I errors when
performing multiple hypothesis tests.

For a single test, FWER = 1 - (1 - α)^1 = α = 0.05


But what if we perform more than one test?

A/B TESTING IN PYTHON


Family-wise error rate
Example: FWER for 10 tests = 1 - (1 - α)^10 ≈ 40%

import matplotlib.pyplot as plt
import numpy as np

alpha = 0.05
x = np.linspace(0, 20, 21)
y = 1 - (1 - alpha)**x
plt.plot(x, y, marker='o')
plt.title('FWER vs Number of Tests')
plt.xlabel('Number of Tests')
plt.ylabel('FWER')
plt.show()

A/B TESTING IN PYTHON


Correction methods
The simplest and most popular approach is the Bonferroni Correction

Set the adjusted α* to the individual test α divided by the number of tests m: α* = α / m

Less stringent Sidak correction


Set FWER to the desired α, then solve for the per-test α_s: α_s = 1 - (1 - α)^(1/m)
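
A quick sketch of both adjustments for m tests at a desired family-wise α of 0.05:

alpha = 0.05   # desired family-wise error rate
m = 3          # number of tests

# Bonferroni: divide the individual test alpha by the number of tests
alpha_bonferroni = alpha / m

# Sidak: solve FWER = 1 - (1 - alpha_sidak)**m for alpha_sidak
alpha_sidak = 1 - (1 - alpha) ** (1 / m)

print(alpha_bonferroni)  # ~0.0167
print(alpha_sidak)       # ~0.0170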

A/B TESTING IN PYTHON


Bonferroni correction example
Without correction, all three tests are considered significant,
but the probability of making a type I error is inflated to 14%.
With a Bonferroni correction, A versus D is no longer significant, but FWER is controlled at
0.049.

A/B TESTING IN PYTHON


statsmodels multipletests method
import statsmodels.stats.multitest as smt
pvals = [0.023,0.0005,0.00004]

corrected = smt.multipletests(pvals, alpha=0.05, method='bonferroni')

print("Significant Test:", corrected[0])


print("Corrected P-values:", corrected[1])
print("Bonferroni Corrected alpha: {:.4f}".format(corrected[3]))

Significant Test: [False True True]


Corrected P-values: [0.069 0.0015 0.00012]
Bonferroni Corrected alpha: 0.0167

A/B TESTING IN PYTHON


Data cleaning and
exploratory analysis
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Cleaning missing values
Missing values
Drop, ignore, impute

# Calculate the mean order value


checkout.order_value.mean()

30.0096

# Replace missing values with zeros and get mean


checkout['order_value'].fillna(0).mean()

25.3581

A/B TESTING IN PYTHON


Cleaning duplicates
Duplicates
Identical rows should be dropped
# Check for duplicate rows due to logging issues
print(len(checkout))
print(len(checkout.drop_duplicates(keep='first')))

9000
9000

A/B TESTING IN PYTHON


Cleaning duplicates
Duplicates
Duplicate users should be handled with care.
# Unique users in group B
print(checkout[checkout['checkout_page'] == 'B']['user_id'].nunique())
# Unique users who purchased at least once
print(checkout[checkout['checkout_page'] == 'B'].groupby('user_id')['purchased'].max().sum())
# Total purchase events in group B
print(checkout[checkout['checkout_page'] == 'B']['purchased'].sum())

2938
2491.0
2541.0

A/B TESTING IN PYTHON


EDA summary stats
Mean, count, and standard deviation summary

checkout.groupby('checkout_page')['order_value'].agg({'mean','std','count'})

mean count std


checkout_page
A 24.956437 2461 2.418837
B 29.876202 2541 7.277644
C 34.917589 2603 4.869816

A/B TESTING IN PYTHON


EDA plotting
Bar plots

sns.barplot(x=checkout['checkout_page'], y=checkout['order_value'], estimator=np.mean)


plt.title('Average Order Value per Checkout Page Variant')
plt.xlabel('Checkout Page Variant')
plt.ylabel('Order Value [$]')

A/B TESTING IN PYTHON


EDA plotting
Histograms

sns.displot(data=checkout, x='order_value', hue = 'checkout_page', kde=True)

A/B TESTING IN PYTHON


EDA plotting
Time series (line plots)

sns.lineplot(data=AdSmart,x='date', y='yes', hue='experiment', ci=False)

1 Adsmart Kaggle dataset: https://www.kaggle.com/datasets/osuolaleemmanuel/ad-ab-testing

A/B TESTING IN PYTHON


Sanity checks:
Internal validity
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Sample Ratio Mismatch (SRM)
Sample Ratio Mismatch (SRM):
Allocation across variants deviates from the designed ratio

Detected with a Chi-square goodness of fit test

A/B TESTING IN PYTHON


SRM python example
# Calculate the unique IDs per variant
AdSmart.groupby('experiment')['auction_id'].nunique()

experiment
control 4071
exposed 4006

# Assign the unique counts to each variant


control_users=AdSmart[AdSmart['experiment']=='control']['auction_id'].nunique()
exposed_users=AdSmart[AdSmart['experiment']=='exposed']['auction_id'].nunique()
total_users=control_users+exposed_users
# Calculate allocation ratios per variant
control_perc = control_users / total_users
exposed_perc = exposed_users / total_users
print("Percentage of users in the Control group:",100*round(control_perc,5),"%")
print("Percentage of users in the Exposed group:",100*round(exposed_perc,5),"%")

Percentage of users in the Control group: 50.402 %


Percentage of users in the Exposed group: 49.598 %

1 Adsmart Kaggle dataset: https://www.kaggle.com/datasets/osuolaleemmanuel/ad-ab-testing

A/B TESTING IN PYTHON


SRM python example
# Create lists of observed and expected counts per variant
observed = [control_users, exposed_users]
expected = [total_users/2, total_users/2]
# Import chisquare from scipy library
from scipy.stats import chisquare
# Run chisquare test on observed and expected lists
chi = chisquare(observed, f_exp=expected)
# Print test results and interpretation
print(chi)
if chi[1] < 0.01:
    print("SRM may be present")
else:
    print("SRM likely not present")

Power_divergenceResult(statistic=0.5230902562832735, pvalue=0.4695264353014863)
SRM likely not present

1 Adsmart Kaggle dataset: https://www.kaggle.com/datasets/osuolaleemmanuel/ad-ab-testing

A/B TESTING IN PYTHON


SRM root-causing
Common causes of SRM:1

Assignment: incorrect bucketing or faulty randomization functions

Execution: delayed variants starting time or ramp up rates


Data logging: logging delays or bot filtering

Interference: experimenter pausing a variant

1Diagnosing Sample Ratio Mismatch in Online Controlled Experiments: A Taxonomy and Rules of Thumb for
Practitioners

A/B TESTING IN PYTHON


A/A tests
A/A test
Presents an identical experience to two groups of users
Reveals bugs in experimental setup

No statistically significant differences between the metrics are expected


False positives can still happen at the specified α (5% of the time); see the simulation sketch after this list

Reveals imbalances in distributions across groups (e.g. browsers, devices, etc.)
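
A minimal simulation sketch of that false-positive rate, assuming identical conversion rates in both groups:

import numpy as np
from statsmodels.stats.proportion import proportions_ztest

np.random.seed(42)
alpha = 0.05
n_users = 2000       # users per group in each simulated A/A test
true_rate = 0.10     # identical conversion rate in both groups (assumption)
false_positives = 0

for _ in range(1000):
    conv_a = np.random.binomial(n_users, true_rate)
    conv_b = np.random.binomial(n_users, true_rate)
    _, pvalue = proportions_ztest([conv_a, conv_b], nobs=[n_users, n_users])
    if pvalue < alpha:
        false_positives += 1

print(false_positives / 1000)  # close to alpha (0.05)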

A/B TESTING IN PYTHON


Distributions balance Python example
Balanced browsers distribution (valid test):

checkout.groupby('checkout_page')['browser'].value_counts(normalize=True)

checkout_page  browser
A              chrome     0.341333
               safari     0.332000
               firefox    0.326667
B              safari     0.352000
               firefox    0.325000
               chrome     0.323000
C              safari     0.346000
               chrome     0.330000
               firefox    0.324000

Imbalanced browsers distribution (invalid test):

AdSmart.groupby('experiment')['browser'].value_counts(normalize=True)

experiment  browser
control     Chrome Mobile                 0.591992
            Facebook                      0.137804
            Samsung Internet              0.120855
            Chrome Mobile WebView         0.071727
            Mobile Safari                 0.060427
            Chrome Mobile iOS             0.008352
            Mobile Safari UI/WKWebView    0.007369
exposed     Chrome Mobile                 0.535197
            Chrome Mobile WebView         0.298802
            Samsung Internet              0.082876
            Facebook                      0.050674
            Mobile Safari                 0.022716
            Chrome Mobile iOS             0.004244

1 Adsmart Kaggle dataset: https://www.kaggle.com/datasets/osuolaleemmanuel/ad-ab-testing

A/B TESTING IN PYTHON


Sanity checks:
external validity
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Simpson's paradox
Simpson's Paradox: a statistical phenomenon where certain trends between variables emerge,
disappear or reverse when the population is divided into segments.

print(simp_imbalanced.groupby('Variant').mean())

Variant Conversion
A 0.80
B 0.64

print(simp_imbalanced.groupby(['Variant','Device']).mean())

Variant Device Conversion


A Phone 0.875
Tablet 0.500
B Phone 0.900
Tablet 0.575

A/B TESTING IN PYTHON


Simpson's paradox
simp_imbalanced.groupby(['Variant','Device'])\
['Device'].count()

Variant Device
A Phone 40
Tablet 10
B Phone 10
Tablet 40
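
A rough reconstruction of this imbalanced dataset from the counts and conversion rates shown above, useful for reproducing the paradox:

import pandas as pd

def make_cell(variant, device, n, conv_rate):
    # Build n rows for one variant/device cell with a fixed conversion rate
    conversions = int(round(n * conv_rate))
    return pd.DataFrame({'Variant': variant,
                         'Device': device,
                         'Conversion': [1] * conversions + [0] * (n - conversions)})

simp_imbalanced = pd.concat([make_cell('A', 'Phone', 40, 0.875),
                             make_cell('A', 'Tablet', 10, 0.500),
                             make_cell('B', 'Phone', 10, 0.900),
                             make_cell('B', 'Tablet', 40, 0.575)],
                            ignore_index=True)

# Aggregating reverses the segment-level ordering: A wins overall, B wins in every segment
print(simp_imbalanced.groupby('Variant')['Conversion'].mean())
print(simp_imbalanced.groupby(['Variant', 'Device'])['Conversion'].mean())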

A/B TESTING IN PYTHON


Simpson's paradox
simp_balanced.groupby(['Variant','Device'])['Device'].count()

Variant  Device
A        Phone     40
         Tablet    10
B        Phone     40
         Tablet    10

print(simp_balanced.groupby(['Variant','Device']).mean())

Variant  Device    Conversion
A        Phone     0.750
         Tablet    0.500
B        Phone     0.575
         Tablet    0.300

print(simp_balanced.groupby('Variant').mean())

Variant  Conversion
A        0.70
B        0.52

A/B TESTING IN PYTHON


Novelty effect
Novelty effect
A short-lived improvement in metrics caused by users' curiosity about a new feature.
Change aversion
The opposite of the novelty effect.

Users avoid trying a new feature because they are accustomed to the old one.

A/B TESTING IN PYTHON


Novelty effect visual inspection
# Plot Lift in CTR vs test days
novelty.plot('date', 'CTR_lift')
plt.ylim([0, 0.09])
plt.title('Lift in CTR vs Test Duration')
plt.show()

A/B TESTING IN PYTHON


Correcting for novelty effects
Increasing the test duration
Start including data after treatment effect stabilizes.
Examine new and returning user cohorts
New users are by default less likely to experience novelty effects.

Returning users compare the new experience against the one they are used to.

A/B TESTING IN PYTHON


Analyzing difference
in proportions A/B
tests
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Framework for difference in proportions
If p-value < α
Reject Null hypothesis
If p-value > α
Fail to reject Null hypothesis

Confidence intervals
95% CI is the range that captures the
true difference 95% of the time

Like fishing with a net instead of a spear


Centered around the observed difference
between the treatment and the control

A/B TESTING IN PYTHON


Two sample proportions z-test
from statsmodels.stats.proportion import proportions_ztest, proportion_confint
# Calculate the number of users in groups A and B
n_A = checkout[checkout['checkout_page'] == 'A']['user_id'].nunique()
n_B = checkout[checkout['checkout_page'] == 'B']['user_id'].nunique()
print('Group A users:',n_A)
print('Group B users:',n_B)

Group A users: 2940


Group B users: 2938

# Compute unique purchasers in each group


purchased_A = checkout[checkout['checkout_page'] == 'A'].groupby('user_id')['purchased'].max().sum()
purchased_B = checkout[checkout['checkout_page'] == 'B'].groupby('user_id')['purchased'].max().sum()
# Assign groups lists
purchasers_abtest = [purchased_A, purchased_B]
n_abtest = [n_A, n_B]

A/B TESTING IN PYTHON


Two sample proportions z-test
# Calculate p-value and confidence intervals
z_stat, pvalue = proportions_ztest(purchasers_abtest, nobs=n_abtest)
(A_lo95, B_lo95), (A_up95, B_up95) = proportion_confint(purchasers_abtest, nobs=n_abtest, alpha=0.05)
# Print the p-value and confidence intervals
print(f'p-value: {pvalue:.4f}')
print(f'Group A 95% CI : [{A_lo95:.4f}, {A_up95:.4f}]')
print(f'Group B 95% CI : [{B_lo95:.4f}, {B_up95:.4f}]')

p-value: 0.0058
Group A 95% CI : [0.8072, 0.8349]
Group B 95% CI : [0.8349, 0.8608]

A/B TESTING IN PYTHON


Confidence intervals for proportions
# Set random seed for repeatability
np.random.seed(34)
# Calculate the average purchase rate for group B
pop_mean = checkout[checkout['checkout_page'] == 'B']['purchased'].mean()
print(pop_mean)

0.847

A/B TESTING IN PYTHON


Confidence intervals for proportions
# Calculate 20 90% confidence intervals for 20 random samples of size 100 each
for i in range(20):
    confidence_interval = proportion_confint(
        count=checkout[checkout['checkout_page'] == 'B'].sample(100)['purchased'].sum(),
        nobs=100,
        alpha=(1 - 0.90))
    print(confidence_interval)

(0.7912669777384846, 0.9087330222615153)
(0.8385342148455946, 0.9414657851544054)
(0.8265485838585659, 0.9334514161414341)
(0.7568067872454262, 0.8831932127545737)
(0.8506543911914558, 0.9493456088085442)*
(0.8385342148455946, 0.9414657851544054)
(0.7230037568938057, 0.8569962431061944)
(0.8146830076144598, 0.9253169923855402)
(0.8029257122801267, 0.9170742877198733)
(0.8146830076144598, 0.9253169923855402)
(0.8506543911914558, 0.9493456088085442)*
(0.7454722433688197, 0.8745277566311804)
...

A/B TESTING IN PYTHON


Analyzing difference
in means A/B tests
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Framework for difference in means
Calculate required sample size

Run experiment and perform sanity checks

Calculate the metrics per variant

checkout.groupby('checkout_page')['time_on_page'].mean()

checkout_page
A 44.668527
B 42.723772
C 42.223772

Analyze the difference using a t-test
If p-value < α: Reject Null hypothesis
If p-value > α: Fail to reject Null hypothesis

A/B TESTING IN PYTHON


Pingouin t-test
checkout.groupby('checkout_page')['time_on_page'].mean()

checkout_page
A 44.668527
B 42.723772
C 42.223772

# Import pingouin and run an independent two-sided t-test
import pingouin
ttest = pingouin.ttest(x=checkout[checkout['checkout_page']=='C']['time_on_page'],
                       y=checkout[checkout['checkout_page']=='B']['time_on_page'],
                       paired=False,
                       alternative="two-sided")
print(ttest)

T dof alternative p-val CI95% cohen-d BF10 power


T-test -1.995423 5998 two-sided 0.046042 [-0.99, -0.01] 0.051522 0.212 0.514054

A/B TESTING IN PYTHON


Pingouin pairwise
pairwise = pingouin.pairwise_tests(data = checkout,
dv = "time_on_page",
between = "checkout_page",
padjust = "bonf")
print(pairwise)

Contrast A B Paired Parametric T dof alternative \


0 checkout_page A B False True 7.026673 5998.0 two-sided
1 checkout_page A C False True 8.833244 5998.0 two-sided
2 checkout_page B C False True 1.995423 5998.0 two-sided

p-unc p-corr p-adjust BF10 hedges


0 2.349604e-12 7.048812e-12 bonf 1.305e+09 0.181405
1 1.316118e-18 3.948354e-18 bonf 1.811e+15 0.228045
2 4.604195e-02 1.381258e-01 bonf 0.212 0.051515

A/B TESTING IN PYTHON


Non-parametric
statistical tests
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Parametric tests assumptions
1. Random sampling
Data is randomly sampled from the population.
Investigate the data collection/sampling process.

2. Independence
Each observation/data point is independent.
Not accounting for dependencies inflates error rates.

3. Normality
Normally distributed data (a quick check is sketched after this list).

Large "enough" sample size.


Two-sample t-test: n >= 30 in each group.
Two-sample proportions test: >= 10 successes and >= 10 failures in each group.
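
One way to sanity-check the normality assumption before choosing a test is pingouin's normality function; a sketch on the checkout data used earlier:

import pingouin

# Shapiro-Wilk normality test of order value within each checkout page variant
normality = pingouin.normality(data=checkout.dropna(subset=['order_value']),
                               dv='order_value',
                               group='checkout_page')
print(normality)  # the 'normal' column flags groups consistent with a normal distribution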

A/B TESTING IN PYTHON


Mann-Whitney U test
Non-parametric test for statistical significance

Determines if two independent samples have the same parent distribution


Rank sum test

Unpaired data

A/B TESTING IN PYTHON


Mann-Whitney U test in python
# Calculate the mean and count of time on page by variant
print(checkout.groupby('checkout_page')['time_on_page'].agg({'mean', 'count'}))

mean count
checkout_page
A 44.668527 3000
B 42.723772 3000
C 42.223772 3000

# Set random seed for repeatability


np.random.seed(40)
# Take a random sample of size 25 from each variant
ToP_samp_A = checkout[checkout['checkout_page'] == 'A'].sample(25)['time_on_page']
ToP_samp_B = checkout[checkout['checkout_page'] == 'B'].sample(25)['time_on_page']

A/B TESTING IN PYTHON


Mann-Whitney U test in python
# Run a Mann-Whitney U test
mwu_test = pingouin.mwu(x=ToP_samp_A,
y=ToP_samp_B,
alternative='two-sided')
# Print the test results
print(mwu_test)

U-val alternative p-val RBC CLES


MWU 441.0 two-sided 0.013007 -0.4112 0.7056

A/B TESTING IN PYTHON


Chi-square test of independence
Free from parametric test assumptions

Tests whether two or more categorical variables are independent


Null hypothesis: The variables are independent.

Alternative hypothesis: The variables are not independent.

A/B TESTING IN PYTHON


Chi-square test in python
Homepage signup rates A/B test

Null: There is no significant difference in signup rates between landing page designs C and D

Alternative: There is a significant difference in signup rates between landing page designs C and D

# Calculate the number of users in groups C and D


n_C = homepage[homepage['landing_page'] == 'C']['user_id'].nunique()
n_D = homepage[homepage['landing_page'] == 'D']['user_id'].nunique()

# Compute unique signups in each group


signup_C = homepage[homepage['landing_page'] == 'C'].groupby('user_id')['signup'].max().sum()
no_signup_C = n_C - signup_C
signup_D = homepage[homepage['landing_page'] == 'D'].groupby('user_id')['signup'].max().sum()
no_signup_D = n_D - signup_D

A/B TESTING IN PYTHON


Chi-square test in python
# Import scipy stats and create the signups contingency table
from scipy import stats
table = [[signup_C, no_signup_C], [signup_D, no_signup_D]]
print('Group C signup rate:',round(signup_C/n_C,3))
print('Group D signup rate:',round(signup_D/n_D,3))

# Calculate p-value
print('p-value=',stats.chi2_contingency(table,correction=False)[1])

Group C signup rate: 0.064


Group D signup rate: 0.048
p-value= 0.009165

A/B TESTING IN PYTHON


Ratio metrics and
the delta method
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Ratio metrics A/B testing
Mean metrics

Unit of analysis:
The entity being analyzed in an A/B test

Denominator in ratio metrics


Randomization unit:
The subject randomly allocated to each variant

A/B TESTING IN PYTHON


Ratio metrics A/B testing
Per-user Ratio metrics

A/B TESTING IN PYTHON


Delta method motivation
print(checkout.groupby('checkout_page')[['order_value','purchased']].agg({'sum','count','mean'}))

order_value purchased
mean sum count mean sum count
checkout_page
A 24.956437 61417.791564 2461 0.820333 2461.0 3000
B 29.876202 75915.430125 2541 0.847000 2541.0 3000
C 34.917589 90890.484142 2603 0.867667 2603.0 3000

checkout.groupby('checkout_page')['order_value'].sum() / \
checkout.groupby('checkout_page')['purchased'].count()

checkout_page
A 20.472597
B 25.305143
C 30.296828
dtype: float64

A/B TESTING IN PYTHON


Delta method variance
Delta method ratio metrics variance estimation:1

# Delta method variance of ratio metric
def var_delta(x, y):
    x_bar = np.mean(x)
    y_bar = np.mean(y)
    x_var = np.var(x, ddof=1)
    y_var = np.var(y, ddof=1)
    cov_xy = np.cov(x, y, ddof=1)[0][1]
    # Note that we divide by len(x) here because the denominator of the test statistic is the standard error (= sqrt(var/n))
    var_ratio = (x_var/y_bar**2 + y_var*(x_bar**2/y_bar**4) - 2*cov_xy*(x_bar/y_bar**3)) / len(x)
    return var_ratio

1Budylin, Roman & Drutsa, Alexey & Katsev, Ilya & Tsoy, Valeriya. (2018). Consistent Transformation of Ratio
Metrics for Efficient Online Controlled Experiments. 55-63. 10.1145/3159652.3159699.

A/B TESTING IN PYTHON


Delta method z-test
# Delta method ztest calculation
ztest_delta(x_control, y_control, x_treatment, y_treatment, alpha=0.05)

Input arguments:
x_control : control variant user-level ratio numerator column
y_control : control variant user-level ratio denominator column
x_treatment : treatment variant user-level ratio numerator column
y_treatment : treatment variant user-level ratio denominator column

Output:
mean_control : control group ratio metric mean
mean_treatment : treatment group ratio metric mean
difference : difference between treatment and control means
diff_CI : confidence interval of the difference in means
p-value : the two-tailed z-test p-value
1 https://medium.com/@ahmadnuraziz3/applying-delta-method-for-a-b-tests-analysis-8b1d13411c22
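
ztest_delta is not a statsmodels function; the following is a minimal sketch consistent with the description above, reusing var_delta from the previous slide:

import numpy as np
from scipy.stats import norm

def ztest_delta(x_control, y_control, x_treatment, y_treatment, alpha=0.05):
    # Ratio metric means per group (e.g. order value per page view)
    mean_control = np.sum(x_control) / np.sum(y_control)
    mean_treatment = np.sum(x_treatment) / np.sum(y_treatment)
    difference = mean_treatment - mean_control

    # Delta-method variances of each group's ratio estimate (var_delta already divides by n)
    se_diff = np.sqrt(var_delta(x_control, y_control) + var_delta(x_treatment, y_treatment))

    # Two-tailed z-test and confidence interval for the difference
    z = difference / se_diff
    p_value = 2 * (1 - norm.cdf(abs(z)))
    z_crit = norm.ppf(1 - alpha / 2)
    ci_low, ci_high = difference - z_crit * se_diff, difference + z_crit * se_diff

    return {'mean_control': mean_control,
            'mean_treatment': mean_treatment,
            'difference': difference,
            'diff_CI': f'[{ci_low:.3f}, {ci_high:.3f}]',
            'p-value': p_value}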

A/B TESTING IN PYTHON


Python example
# Create DataFrames for per user metrics for variants A and B
A_per_user = pd.DataFrame({'order_value':checkout[checkout['checkout_page']=='A'].groupby('user_id')['order_value'].sum()
,'page_view':checkout[checkout['checkout_page']=='A'].groupby('user_id')['user_id'].count()})
B_per_user = pd.DataFrame({'order_value':checkout[checkout['checkout_page']=='B'].groupby('user_id')['order_value'].sum()
,'page_view':checkout[checkout['checkout_page']=='B'].groupby('user_id')['user_id'].count()})

# Assign the control and treatment ratio columns


x_control = A_per_user['order_value']
y_control = A_per_user['page_view']
x_treatment = B_per_user['order_value']
y_treatment = B_per_user['page_view']

# Run a z-test for ratio metrics


ztest_delta(x_control,y_control,x_treatment,y_treatment)

{'mean_control': 20.472597188012,
'mean_treatment': 25.30514337484097,
'difference': 4.833,
'diff_CI': '[4.257, 5.408]',
'p-value': 5.954978880467735e-61}

A/B TESTING IN PYTHON


A/B Testing best
practices and
advanced topics
intro
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
Best practices
Avoid peeking

Avoid making decisions by peeking at the results before reaching the designed sample size,
as this inflates error rates similar to multiple comparisons.

Account for day-of-the-week effects

Users may behave differently on weekends versus weekdays, so run the test over full weeks to capture overall behavior.

A/B TESTING IN PYTHON


Best practices
Simplicity/feasibility:
Do we need to build the full feature?
Painted door tests

Isolation
Change one variable at a time to attribute impact.

A/B TESTING IN PYTHON


Advanced topics
Multifactorial design and interaction effects
Measures the isolated effect of each variable
Uncovers interaction/synergistic effects

Bayesian A/B testing (a short sketch follows this list)


Incorporates prior data into the current experiment
Views population parameters as distributions

More intuitive interpretation of test results


SUTVA violation and network effects
One user's assignment in a test impacts other users' behavior
Common in social network A/B tests
One solution: cluster-based assignment
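
A tiny Beta-Binomial sketch of the Bayesian idea above, assuming uniform Beta(1, 1) priors and hypothetical conversion counts:

import numpy as np

np.random.seed(42)

# Hypothetical observed conversions out of 1000 users per variant (assumption)
conv_control, n_control = 100, 1000
conv_treatment, n_treatment = 120, 1000

# Uniform Beta(1, 1) priors updated with the data give Beta posteriors over the rates
posterior_control = np.random.beta(1 + conv_control, 1 + n_control - conv_control, size=100_000)
posterior_treatment = np.random.beta(1 + conv_treatment, 1 + n_treatment - conv_treatment, size=100_000)

# Probability that the treatment's conversion rate exceeds the control's
print((posterior_treatment > posterior_control).mean())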

A/B TESTING IN PYTHON


Wrap-up: A/B
testing in python
A/B TESTING IN PYTHON

Moe Lotfy, PhD


Principal Data Science Manager
A/B testing summary
Chapter 1
A/B testing steps and use-cases
Metrics definition and estimation
.sample() , .corr() , pairplot , heatmap

Chapter 2
Formulating A/B testing hypotheses
Error rates, power, effect size
Power analysis: sample size estimation
Multiple comparisons corrections

Chapter 3
Data cleaning and EDA
Sanity checks for validation
Analyzing difference in proportions
proportions_ztest , proportion_confint

Chapter 4
Analyzing differences in means
Non-parametric tests
Delta method for ratio metrics
Best practices and advanced topics

A/B TESTING IN PYTHON


Congratulations!
A/B TESTING IN PYTHON
