```
library(tidyverse)
library(data.table)
set.seed(63)
knitr::opts_chunk$set(warning = F, message = F)
```

A/B tests involve comparing two different groups, typically with a randomised trial. Traditionally this might be a clinical trial where a new drug is given to a treatment group who are compared with a control group who did not receive the drug. Other early A/B tests were done in agriculture, where fields would be split into sections that were treated differently. A/B tests are also used widely in business to test things like new website features, emails, or machine learning models.

This post describes some functions to simulate A/B tests. Subsequent posts will then make use of this code to explore topics like early stopping for A/B tests. The type of A/B test being simulated here is a simple two-group test where the outcome is binary (0 = fail, 1 = success). In the context of a new website feature this could represent clicks or conversion. For emails, a binary outcome might be whether the email was opened. The function will allow for multiple weeks of testing to be simulated. This feature will then facilitate future analysis of early stopping.

# Simulating a week of a test

The basic building block for simulating A/B tests is getting a week's worth of data for the test and control groups.
The core function used here is `rbinom()`.
When first tackling this problem I played around with `Rcpp` to create my own version of `rbinom()`.
However, a quick bit of benchmarking showed it was no faster than the `rbinom()` function built into R.
If we print out this function we can see why.

`rbinom`

```
## function (n, size, prob)
## .Call(C_rbinom, n, size, prob)
## <bytecode: 0x7f888d76ade8>
## <environment: namespace:stats>
```

The output shows that `rbinom()` is already calling C with `.Call(C_rbinom, n, size, prob)`!
While it was fun trying out `Rcpp`, I could have saved myself some time by not optimising too early.
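As a rough illustration (not the original benchmark, and timings will vary by machine), a quick draw of ten million samples shows how fast the built-in `rbinom()` already is:

```
# time a large draw of Bernoulli samples with the built-in rbinom()
# this typically completes in a fraction of a second
system.time(rbinom(1e7, 1, 0.5))
```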

The function below simulates a single week of testing.
We're making use of `data.table` for its combination of speed and relatively simple syntax.
I initially worked on a set of functions using the `tidyverse` but found them to be too slow.

A `data.table` is returned by the function with two columns, `control` and `treatment`.
For the `control` column, we get `n_per_week` values from a binomial distribution with `baseline` probability of being a success (1).
For the `treatment` column, the probability of the value being 1 is shifted by `treatment_effect`.

```
sim_week <- function(n_per_week, baseline, treatment_effect) {
  data.table(
    control = rbinom(n_per_week, 1, baseline),
    treatment = rbinom(n_per_week, 1, baseline + treatment_effect)
  )
}
```

Below the function is called with `treatment_effect = 0`, meaning there's no real difference between the control and treatment groups.

```
sim_week(1000, 0.5, 0) %>%
  colMeans()
```

```
## control treatment
## 0.498 0.518
```

The `treatment_effect` could be set to 0.1 to indicate a 60% chance of the outcome being 1 for the treatment group, versus 50% for control.

```
sim_week(1000, 0.5, 0.1) %>%
  colMeans()
```

```
## control treatment
## 0.483 0.593
```

# Simulate multiple weeks of testing

Now our basic building block is in place, we can move to a slightly more complicated problem: simulating multiple weeks of a test.
The complication here stems from the fact that weeks aren't independent.
The data for the second week of a test accumulates on top of the first week.
In terms of arguments to the function, the only thing that is new is `n_weeks`, which is the number of weeks of data we want to simulate.

```
sim_weeks <-
  function(n_per_week,
           baseline,
           treatment_effect,
           n_weeks) {
    # create list where each element is the data
    # for a week created by sim_week()
    weeks_list = replicate(n_weeks,
                           sim_week(n_per_week, baseline, treatment_effect),
                           simplify = F)
    # collapse the list of weeks into a
    # data.table with a week id column
    weeks_dt <- rbindlist(weeks_list, idcol = 'week')
    weeks_dt %>%
      # melt into long format with week as the id
      melt(id.vars = 'week',
           variable.name = 'group',
           value.name = 'success') %>%
      # sum success flag by week and group (treat vs control)
      .[, .(successes_per_week = sum(success)), by = c('week', 'group')] %>%
      # create the cumulative successes for each week by group
      .[, successes := cumsum(successes_per_week), by = 'group'] %>%
      # create the cumulative failures by week
      .[, failures := week * n_per_week - successes] %>%
      # order by week
      .[order(week)]
  }
```

To begin with, the function creates a list with the *new* data for each week.
These samples are independent because they only reflect the new data for each week, without combining with previous weeks.
This list is then combined into a single `data.table` with a column to identify the week.
There's a nice function (`rbindlist()`) provided by `data.table` for this.
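A tiny standalone illustration of `rbindlist()` with `idcol` (toy data, not from the simulation):

```
library(data.table)
# two toy data.tables standing in for two weeks of new data
lst <- list(data.table(x = 1:2), data.table(x = 3:4))
# idcol adds a column recording which list element each row came from
rbindlist(lst, idcol = 'week')
# the week column is 1, 1, 2, 2 and x runs 1 to 4
```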

```
# create list where each element is the data
# for a week created by sim_week()
weeks_list = replicate(n_weeks,
                       sim_week(n_per_week, baseline, treatment_effect),
                       simplify = F)
# collapse the list of weeks into a
# data.table with a week id column
weeks_dt <- rbindlist(weeks_list, idcol = 'week')
```

The next step is to take these samples and combine them with previous weeks.
We take our `data.table` containing the *new* data for each week and transform it following a number of steps.

- `melt()` into a long format.
- Create a column called `successes_per_week` by summing the binary `success` value by week and group (treatment vs control).
- Calculate the cumulative sum of successes within each group, ordered by week, into a column called `successes`.
- Create a `failures` column containing the cumulative number of zeros for each week.
- Order by week.

```
weeks_dt %>%
  # melt into long format with week as the id
  melt(id.vars = 'week',
       variable.name = 'group',
       value.name = 'success') %>%
  # sum success flag by week and group (treat vs control)
  .[, .(successes_per_week = sum(success)), by = c('week', 'group')] %>%
  # create the cumulative successes for each week by group
  .[, successes := cumsum(successes_per_week), by = 'group'] %>%
  # create the cumulative failures by week
  .[, failures := week * n_per_week - successes] %>%
  # order by week
  .[order(week)]
```

Calling this function, we can see the output it gives us. This has been designed to make statistical tests for each week easy to add, which we’ll turn to next.

```
sim_weeks(500, 0.5, 0.2, 2) %>%
  head()
```

```
## week group successes_per_week successes failures
## 1: 1 control 247 247 253
## 2: 1 treatment 342 342 158
## 3: 2 control 226 473 527
## 4: 2 treatment 345 687 313
```
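One quick sanity check (not in the original post) uses the `sim_weeks()` function defined above: in every row, the cumulative `successes` and `failures` should sum to `week * n_per_week`, since together they account for every observation in a group so far.

```
dat <- sim_weeks(500, 0.5, 0.2, 4)
# cumulative successes and failures must cover all observations to date
stopifnot(all(dat$successes + dat$failures == dat$week * 500))
```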

# Add statistical tests each week

In a future post we'll explore early stopping for A/B tests.
To facilitate this, the next function runs a `prop.test()` for each week of cumulative data.

```
sim_test <-
  function(n_per_week,
           baseline,
           treatment_effect,
           n_weeks) {
    # get the data
    sim_result <-
      sim_weeks(n_per_week, baseline, treatment_effect, n_weeks)
    # create an empty dt to populate below
    dt_prop_test_result = data.table(
      week = 1:n_weeks,
      control_prop = vector(mode = 'numeric', length = n_weeks),
      treatment_prop = vector(mode = 'numeric', length = n_weeks),
      p_value = vector(mode = 'numeric', length = n_weeks)
    )
    for (w in 1:n_weeks) {
      # create a matrix with successes and fails in
      # the columns, control and treatment in the rows
      mat = as.matrix(sim_result[week == w, .(successes, failures)])
      # do the prop.test
      prop_test_result = prop.test(mat)
      # update the results data.table
      dt_prop_test_result[week == w, `:=`(
        control_prop = prop_test_result$estimate[1],
        treatment_prop = prop_test_result$estimate[2],
        p_value = prop_test_result$p.value
      )]
    }
    return(dt_prop_test_result)
  }
```

Breaking this down a bit, the function starts by simply calling `sim_weeks()`.
Next, an empty `data.table` is instantiated to be filled with values.
This will then be populated with the week, proportions in the control and treatment groups, and the \(p\)-value for the `prop.test()`.

```
# create an empty dt to populate below
dt_prop_test_result = data.table(
  week = 1:n_weeks,
  control_prop = vector(mode = 'numeric', length = n_weeks),
  treatment_prop = vector(mode = 'numeric', length = n_weeks),
  p_value = vector(mode = 'numeric', length = n_weeks)
)
```

Finally, the function loops through the weeks, first creating a matrix of successes and fails in the format favoured by `prop.test()`.
The `prop.test()` is then run and the results are added into `dt_prop_test_result`.

```
for (w in 1:n_weeks) {
  # create a matrix with successes and fails in
  # the columns, control and treatment in the rows
  mat = as.matrix(sim_result[week == w, .(successes, failures)])
  # do the prop.test
  prop_test_result = prop.test(mat)
  # update the results data.table
  dt_prop_test_result[week == w, `:=`(
    control_prop = prop_test_result$estimate[1],
    treatment_prop = prop_test_result$estimate[2],
    p_value = prop_test_result$p.value
  )]
}
```
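The matrix format favoured by `prop.test()` can be illustrated with fixed counts (made up for illustration): each row is a group and the columns hold successes and failures.

```
# rows: control, treatment; columns: successes, failures
mat <- matrix(c(247, 253,
                342, 158),
              nrow = 2, byrow = TRUE)
# the estimates are the row proportions, 247/500 and 342/500
prop.test(mat)$estimate
```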

Running the function we can see the output in a neat format for analysis.

```
sim_test(500, 0.5, 0.1, 2) %>%
  head()
```

```
## week control_prop treatment_prop p_value
## 1: 1 0.490 0.624 2.650377e-05
## 2: 2 0.502 0.597 2.394067e-05
```

# Simulate multiple tests

The final thing that we’ll want to do to run simulation experiments is run multiple tests. The function below allows us to do that.

```
sim_tests <-
  function(n_per_week,
           baseline,
           treatment_effect,
           n_weeks,
           n_tests) {
    # replicate the test n_tests times
    tests_list = replicate(n_tests,
                           sim_test(n_per_week, baseline, treatment_effect, n_weeks),
                           simplify = F)
    # return a data.table with an id col called test
    rbindlist(tests_list, idcol = 'test')
  }
```

First off, a list of results from running `sim_test()` `n_tests` times is created.

```
# replicate the test n_tests times
tests_list = replicate(
  n_tests,
  sim_test(n_per_week, baseline, treatment_effect, n_weeks),
  simplify = F
)
```

This is then combined into a single `data.table` with an id column called `test`.

```
# return a data.table with an id col called test
rbindlist(tests_list, idcol = 'test')
```

Running the final function gives an output that can then be used to analyse A/B tests.

```
sim_tests(
  n_per_week = 500,
  baseline = 0.5,
  treatment_effect = 0,
  n_weeks = 2,
  n_tests = 5
) %>%
  head(10)
```

```
## test week control_prop treatment_prop p_value
## 1: 1 1 0.462 0.516 0.10001468
## 2: 1 2 0.467 0.511 0.05441929
## 3: 2 1 0.478 0.506 0.41090763
## 4: 2 2 0.480 0.506 0.26350568
## 5: 3 1 0.446 0.524 0.01619845
## 6: 3 2 0.487 0.510 0.32517730
## 7: 4 1 0.514 0.508 0.89931895
## 8: 4 2 0.519 0.481 0.09798734
## 9: 5 1 0.498 0.478 0.56910237
## 10: 5 2 0.488 0.508 0.39548488
```

A simple example of how this function could be used is to check the false-positive rate for tests. In the example below there's no true treatment effect, so we expect roughly 5% of p-values to fall below 0.05. A future post will look into some of these topics in more depth.

```
dat_sim_results <-
  sim_tests(
    n_per_week = 500,
    baseline = 0.5,
    treatment_effect = 0,
    n_weeks = 1,
    n_tests = 1000
  )
# false positive rate
mean(dat_sim_results$p_value < 0.05)
```

`## [1] 0.056`

# Benchmarking

On my 2015 MacBook Pro, the `sim_tests()` function takes about 10 seconds to simulate 500 tests with 1000 customers per week and 4 weeks.

```
bench = bench::mark(
  sim_tests = sim_tests(1000, 0.5, 0, 4, 500),
  min_iterations = 5,
  check = FALSE
)
```

# Appendix: code

```
#======================================
# simulate a week of a test
#======================================
sim_week <- function(n_per_week, baseline, treatment_effect) {
  data.table(
    control = rbinom(n_per_week, 1, baseline),
    treatment = rbinom(n_per_week, 1, baseline + treatment_effect)
  )
}

#======================================
# simulate weeks of testing
#======================================
sim_weeks <-
  function(n_per_week,
           baseline,
           treatment_effect,
           n_weeks) {
    # create list where each element is the data
    # for a week created by sim_week()
    weeks_list = replicate(n_weeks,
                           sim_week(n_per_week, baseline, treatment_effect),
                           simplify = F)
    # collapse the list of weeks into a
    # data.table with a week id column
    weeks_dt <- rbindlist(weeks_list, idcol = 'week')
    weeks_dt %>%
      # melt into long format with week as the id
      melt(id.vars = 'week',
           variable.name = 'group',
           value.name = 'success') %>%
      # sum success flag by week and group (treat vs control)
      .[, .(successes_per_week = sum(success)), by = c('week', 'group')] %>%
      # create the cumulative successes for each week by group
      .[, successes := cumsum(successes_per_week), by = 'group'] %>%
      # create the cumulative failures by week
      .[, failures := week * n_per_week - successes] %>%
      # order by week
      .[order(week)]
  }

#======================================
# simulate weeks of testing with
# a prop.test for each week
#======================================
sim_test <-
  function(n_per_week,
           baseline,
           treatment_effect,
           n_weeks) {
    # get the data
    sim_result <-
      sim_weeks(n_per_week, baseline, treatment_effect, n_weeks)
    # create an empty dt to populate below
    dt_prop_test_result = data.table(
      week = 1:n_weeks,
      control_prop = vector(mode = 'numeric', length = n_weeks),
      treatment_prop = vector(mode = 'numeric', length = n_weeks),
      p_value = vector(mode = 'numeric', length = n_weeks)
    )
    for (w in 1:n_weeks) {
      # create a matrix with successes and fails in
      # the columns, control and treatment in the rows
      mat = as.matrix(sim_result[week == w, .(successes, failures)])
      # do the prop.test
      prop_test_result = prop.test(mat)
      # update the results data.table
      dt_prop_test_result[week == w, `:=`(
        control_prop = prop_test_result$estimate[1],
        treatment_prop = prop_test_result$estimate[2],
        p_value = prop_test_result$p.value
      )]
    }
    return(dt_prop_test_result)
  }

#======================================
# simulate a set of tests of the same
# type using the above functions
#======================================
sim_tests <-
  function(n_per_week,
           baseline,
           treatment_effect,
           n_weeks,
           n_tests) {
    # replicate the test n_tests times
    tests_list = replicate(n_tests,
                           sim_test(n_per_week, baseline, treatment_effect, n_weeks),
                           simplify = F)
    # return a data.table with an id col called test
    rbindlist(tests_list, idcol = 'test')
  }
```