I continue my series of articles about A/B testing in LiveOps. In the first article, I covered ideas for A/B tests, deteriorating tests, and A/A and A/A/B tests. In the second, I explained how to choose the right metrics, generate variants, and prepare a sample. This time I will talk about statistical significance and approaches to interpreting results, and draw some conclusions.
Interpretation of Results
While collecting data, you will be tempted to peek at the results. When we start a test, we often have an internal favorite, or a version that the producer really likes. So it happens that the test is still running, we peek at the results, and we see that the favorite is already winning. We decide to stop the test to save time and money, and often this is a mistake. If we had waited for the whole allocated audience to finish the test, the results could have been different. It is like stopping a match between FC Barcelona and a little-known team 15 minutes before the end because Barcelona has scored a goal. We all know that the full 90 minutes of a match can bring surprises. So you can peek at the results, but you should make the decision only after the entire planned sample has gone through the test.
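To make the cost of stopping early concrete, here is a minimal simulation sketch. It is an illustration under assumed numbers, not a result from a real project: both groups get the same 20% conversion rate (an A/A test), and a two-proportion z-test at the usual 5% threshold is checked at 10 evenly spaced looks. The function names and every parameter value are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a two-proportion z-test (normal approximation)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so nothing to detect
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_peeking(runs=500, users_per_group=2000, looks=10,
                     rate=0.2, alpha=0.05, seed=42):
    """Simulate A/A tests (no real difference) and compare two policies:
    declaring a winner at ANY look vs. only at the final look."""
    rng = random.Random(seed)
    step = users_per_group // looks
    hits_any_look = 0
    hits_final_look = 0
    for _ in range(runs):
        conv_a = conv_b = 0
        any_significant = False
        p = 1.0
        for look in range(1, looks + 1):
            # Add the next batch of users to each group.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            p = p_value(conv_a, look * step, conv_b, look * step)
            if p < alpha:
                any_significant = True
        hits_any_look += any_significant
        hits_final_look += p < alpha
    return hits_any_look / runs, hits_final_look / runs


peek_rate, final_rate = simulate_peeking()
print(f"false winners when stopping at any look: {peek_rate:.1%}")
print(f"false winners when waiting for the end:  {final_rate:.1%}")
```

Since the groups are identical, every "winner" here is a false one. Stopping at the first significant look can only raise that rate compared with waiting for the full sample, and in simulations like this it usually ends up well above the nominal 5%.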
Significance is a measure of certainty. We can say that a test result is significant if we are confident that the difference we observed on a small sample is not a product of chance and would hold for a larger audience. Significance depends on many factors, such as the sample size and the size of the effect, and we ourselves choose the share of mistakes we are willing to accept.
There are two approaches to measuring significance:
1. The frequentist approach. This is the one students learn at university.
- The probability here is the long-run frequency of an event.
- It is used to test statistical hypotheses.
- The output is a p-value.
The frequentist approach is the more conventional one; it is described in most textbooks on mathematical statistics. Its tests come with many conditions, and all of them must be met for the results to be valid.
2. The Bayesian approach.
- The probability here is a degree of confidence (subjective probability).
- As a result, we get the probability of success for each of the options.
This approach is less demanding on the source data: we do not have to check the distribution and we generally need less data, but the price for this is the complexity of Bayesian calculations.
Application of the Frequentist and Bayesian Approaches to Various Metrics
LiveOps influence various metrics. There are:
1. Binomial metrics. They are usually measured as a percentage:
- 0 or 1 (Yes or No).
- Conversion (paid/didn’t pay, clicked/didn’t click).
- Retention (returned/didn’t return).
2. Non-binomial metrics. They are measured not in percentages but in, for example, money or minutes.
Let’s take a look at several cases:
The frequentist approach and binomial metrics (e.g. retention or conversions)
Here you can use classic tests: the z-test for two proportions or, on large samples, Student's t-test, which gives nearly the same result.
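As an illustration, a two-proportion z-test can be sketched with the Python standard library alone. The function name and the conversion numbers below are made up for the example, and the normal approximation assumes reasonably large groups.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis "A and B convert equally".
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical numbers: 50% vs 51% conversion on two audience sizes.
for users in (1_000, 100_000):
    z, p = two_proportion_z_test(users // 2, users, users * 51 // 100, users)
    print(f"{users} users per group: z = {z:.2f}, p-value = {p:.4f}")
```

With 1,000 users per group, a 50% vs 51% split gives a p-value far above 0.05, so the difference could easily be noise. The same split on 100,000 users per group gives a p-value far below 0.05.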
The Bayesian approach and binomial metrics
In this case, the probability is somewhat subjective. It is estimated before the test (the prior probability) and updated after the test (the posterior probability). It is important to understand that this approach works differently from the frequentist one.
The advantage of this method is that you get, figuratively speaking, the probability of success for each of the test groups rather than a bare p-value. This is convenient to interpret, although harder to calculate.
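One common way to get such a probability for a binomial metric is a Beta-Binomial model. The sketch below is an assumption-laden illustration, not the only possible Bayesian setup: it uses a uniform Beta(1, 1) prior for each group and estimates the probability that B converts better than A by sampling from the two posteriors. The function name and the numbers are hypothetical.

```python
import random


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=7):
    """Estimate P(rate_B > rate_A) by Monte Carlo sampling.

    With a uniform Beta(1, 1) prior, the posterior of a conversion rate
    after observing `conv` successes out of `n` users is
    Beta(1 + conv, 1 + n - conv).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical numbers: 500/1000 conversions in A, 510/1000 in B.
print(f"P(B is better than A) = {prob_b_beats_a(500, 1000, 510, 1000):.2f}")
```

The result reads directly as "B is better than A with probability X", which is the kind of statement a producer usually wants. The choice of prior is part of the model, so a different prior would shift the numbers, especially on small samples.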
Other methods of statistical analysis are used for non-binomial metrics. This is important to consider because in LiveOps most of the changes are about money.
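The article does not name a specific method for non-binomial metrics, so the following is only one possible option: a percentile bootstrap for the difference in average revenue per user. Bootstrapping is a reasonable fit for revenue because it does not assume a normal distribution. The function name and the revenue figures are invented for the illustration.

```python
import random


def bootstrap_mean_diff(a, b, resamples=1000, seed=1):
    """95% percentile-bootstrap interval for mean(b) - mean(a)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(resamples):
        # Resample each group with replacement, keeping its original size.
        sample_a = rng.choices(a, k=len(a))
        sample_b = rng.choices(b, k=len(b))
        diffs.append(sum(sample_b) / len(b) - sum(sample_a) / len(a))
    diffs.sort()
    lower = diffs[int(0.025 * resamples)]
    upper = diffs[int(0.975 * resamples) - 1]
    return lower, upper


# Synthetic, made-up revenue per user: most users pay nothing.
group_a = [0.0] * 900 + [5.0] * 100
group_b = [0.0] * 880 + [5.0] * 120
low, high = bootstrap_mean_diff(group_a, group_b)
print(f"95% interval for the revenue difference: [{low:.3f}, {high:.3f}]")
```

If the interval contains zero, the data does not rule out "no difference" between the groups, even if the raw averages differ.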
Summing Up: Common Errors in A/B Testing
1. Wrong hypothesis and testing of changes that are difficult to track (we do not always clearly understand how many users we have and what changes we want to see).
2. Interpreting the experimental results in your own favor (peeking and stopping early; switching to a different metric after the test shows no result on the originally chosen one).
3. Intuition (you shouldn't use it during tests at all).
4. Test groups that differ in audience composition:
- new/not new;
- traffic sources (one traffic source gets the A version of the test, another traffic source gets the B version);
- paying/not paying.
5. Too few users (only very noticeable hypotheses can be tested on a small audience).
6. Running several tests at the same time (multivariate testing is fine, but you need to test hypotheses that influence each other as little as possible).
7. Assuming that test results carry over between projects or over time (what worked for one project may not work for another; moreover, what worked for this project may not work for it again in a year).
8. Ignoring or misreading statistical significance.
9. Lack of prior testing.
10. Wrong choice of metrics.
11. Wrong sample size (too small or too large).
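For errors 5 and 11, a rough estimate of the required audience can be made before the test starts. The sketch below uses the standard normal-approximation formula for comparing two proportions; the function name, the 5% significance level, and the 80% power are assumptions chosen for the example, not values from this series.

```python
from math import ceil
from statistics import NormalDist


def users_per_group(base_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate users needed in EACH group to detect the change
    from base_rate to expected_rate with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = base_rate * (1 - base_rate) + expected_rate * (1 - expected_rate)
    effect = expected_rate - base_rate
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical targets: lifting a 50% metric to 51% or to 55%.
print(users_per_group(0.50, 0.51))
print(users_per_group(0.50, 0.55))
```

Under these assumptions, detecting a lift from 50% to 51% takes roughly 39,000 users per group, while a lift to 55% takes far fewer. This is why only very noticeable hypotheses can be tested on a small audience.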
A/B testing is not so simple. And, strangely enough, 51 is not always more than 50.