If an A/B test has multiple treatment groups, both the cost of testing each group and the probability of a false positive increase. When there are multiple treatment groups, I usually use two reinforcement learning methods, the Upper Confidence Bound (UCB) algorithm and Thompson Sampling, to find the best treatment group in the minimum amount of time. You can see how I use reinforcement learning to conduct A/B testing in my A/B Testing Reinforcement Learning Project.
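As a minimal sketch of the Thompson Sampling idea for picking among treatment groups, assuming Bernoulli (conversion-style) rewards with Beta priors; the conversion rates, number of rounds, and seed below are illustrative, not from the project:

```python
import random

def thompson_sampling(true_rates, n_rounds=10000, seed=42):
    """Allocate traffic across variants with Beta-Bernoulli Thompson Sampling.

    true_rates: hypothetical conversion rate of each variant (unknown in practice).
    Returns the number of times each variant was shown.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    successes = [0] * k  # posterior per arm is Beta(1 + successes, 1 + failures)
    failures = [0] * k
    for _ in range(n_rounds):
        # Draw one plausible conversion rate from each arm's posterior,
        # then show the variant whose draw is highest.
        samples = [rng.betavariate(1 + successes[i], 1 + failures[i])
                   for i in range(k)]
        arm = samples.index(max(samples))
        # Simulate the user's response and update that arm's posterior.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [successes[i] + failures[i] for i in range(k)]

pulls = thompson_sampling([0.04, 0.05, 0.07])
```

Over many rounds the allocation concentrates on the best-performing variant, which is why bandit methods spend less traffic on losing treatments than a fixed-split A/B test.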

1. Type II error rate β, or power, since power = 1 - β: if you know one, you know the other.

2. Significance level α, usually 0.05.

3. Minimum detectable effect.

The required sample size n approximately equals 16σ²/δ², where σ² is the sample variance and δ is the difference between treatment and control; the constant 16 comes from α = 0.05 and power = 0.80 (i.e., β = 0.2). A Type I error (false positive) is generally considered worse than a Type II error (false negative).

Now we only need to find the two parameters in our formula. 1. The sample variance, which can be estimated from the dataset. 2. The difference between treatment and control, which is the minimum detectable effect: the smallest difference that would matter in practice, e.g., a 5% increase in revenue. Then we can determine our sample size for A/B testing. We need more samples if the sample variance is larger, and fewer samples if δ is larger.
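The rule of thumb above can be sketched in a few lines; the baseline conversion rate and minimum detectable effect below are illustrative assumptions:

```python
import math

def sample_size_per_group(variance, delta, multiplier=16):
    """Rule-of-thumb sample size per group: n ~= 16 * sigma^2 / delta^2,
    where the multiplier 16 corresponds to alpha = 0.05 (two-sided)
    and power = 0.80."""
    return math.ceil(multiplier * variance / delta ** 2)

# Example: binary conversion metric with baseline rate p = 0.10,
# so variance = p * (1 - p) = 0.09, and a minimum detectable
# effect of 2 percentage points (delta = 0.02).
n = sample_size_per_group(variance=0.09, delta=0.02)
print(n)  # 3600 users per group
```

Note how the formula behaves as described: doubling the variance doubles n, while doubling δ cuts n by a factor of four.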

For Social Networks Companies:

1. Creating network clusters to represent groups of users who are more likely to interact with people inside the group than with people outside it.

2. Ego-cluster randomization. A cluster is composed of an “ego” (a focal individual), and her “alters” (the individuals she is immediately connected to).

For Two-sided Market Companies:

1. Geo-based randomization. Selecting users from different locations can work well, but the drawback is that the variance between groups tends to be large.

2. Time-based randomization. Some companies select a day of the week and assign all users to one group on that day. It works when the network effect lasts only a short period; it doesn't work for long-running experiments, e.g., a referral program.

There are two types of people. Some people don't like change, which is called the primacy effect or change aversion. Some people like change, which is called the novelty effect. However, neither effect lasts long, as people's behavior stabilizes after a certain amount of time. If an A/B test shows an unusually large or small initial effect, it is probably due to the novelty or primacy effect. There are two ways to address these issues.

1. Compare new users’ results in the control group to those in the treatment group to evaluate the novelty effect.

2. Compare first-time users’ results with existing users’ results in the treatment group to get an actual estimate of the impact of the novelty or primacy effect.

Pr(FP = 0) = 0.95 * 0.95 * 0.95 = 0.857

Pr(FP >= 1) = 1 - Pr(FP = 0) = 0.143

With only 3 treatment groups (4 variants), the probability of at least one false positive (Type I error) exceeds 14%. This is called the “multiple testing” problem.
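The arithmetic above generalizes to any number of independent comparisons; a minimal sketch:

```python
def prob_at_least_one_fp(alpha, m):
    """Probability of at least one false positive across m independent
    tests, each run at significance level alpha (the family-wise error
    rate): 1 - (1 - alpha)^m."""
    return 1 - (1 - alpha) ** m

print(round(prob_at_least_one_fp(0.05, 3), 3))   # 0.143, as above
print(round(prob_at_least_one_fp(0.05, 10), 3))  # 0.401 with 10 comparisons
```

The rate grows quickly: with 10 comparisons at α = 0.05, the chance of at least one false positive is already about 40%.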

To address this issue, we can use two methods.

1. Bonferroni correction. Divide the significance level (e.g., 0.05) by the number of tests, and use the result as the per-test threshold.

2. False Discovery Rate (FDR). FDR = E[# of false positives / # of rejections]. It is a way of conceptualizing the rate of Type I errors when conducting multiple comparisons. If you track a huge number of metrics, it may be acceptable to tolerate some false positives.
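Both corrections can be sketched in plain Python; the p-values below are illustrative, and the FDR method shown is the Benjamini-Hochberg procedure, the standard way to control FDR:

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis only if its p-value <= alpha / (number of tests)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg procedure to control the FDR at level alpha:
    find the largest rank k with p_(k) <= (k / m) * alpha, then reject
    the k hypotheses with the smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            rejected[i] = True
    return rejected

p = [0.01, 0.02, 0.03, 0.04, 0.25]
print(bonferroni(p))          # [True, False, False, False, False]
print(benjamini_hochberg(p))  # [True, True, True, True, False]
```

The example shows the trade-off: Bonferroni controls the family-wise error rate and is very conservative, while Benjamini-Hochberg rejects more hypotheses at the cost of allowing a controlled fraction of false discoveries.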

Click-through rate: the proportion of search sessions where the user clicked on one of the results displayed.

Zero results rate: the proportion of searches that yielded 0 results.

and other metrics outside the scope of this task. EL uses JavaScript to asynchronously send messages (events) to our servers when the user has performed specific actions. In this task, you will analyze a subset of our event logs.

Copyright © Jason Fang. All rights reserved.