
A/B Tests for Analysing LiveOps. Part 3

A/B testing of LiveOps part 3: statistical significance, approaches to the interpretation of results, and some conclusions

I continue my series of articles about A/B testing of LiveOps. In the first article, I talked about ideas for A/B tests, deteriorating tests, A/A tests, and A/A/B tests. In the second, I explained how to choose the right metrics, generate options, and prepare a sample. This time I will talk about statistical significance and approaches to the interpretation of results, and draw some conclusions.

Interpretation of Results

While collecting data, you will be tempted to peek at the results. When we start a test, we often have an internal favorite or a version that the producer really likes. It can happen that the test is still running, but we have peeked at the results and see that the favorite is already winning. We decide to stop the test to save time and money, and this is often a mistake: if we had waited for the whole allocated audience to finish the test, the results could have been different. It is like stopping a match between FC Barcelona and a little-known team 15 minutes before the end just because Barcelona has scored a goal. We all know that the full 90 minutes of a match can spring surprises. So you can peek at the results, but you always need to wait for the entire sample to go through the test before drawing conclusions.
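To see why stopping early is risky, here is a minimal simulation sketch (it is an illustration, not part of the original analysis). It assumes an A/A setup in which both groups have the same true conversion rate, so every "significant" result is a false alarm. The function names, the 20% rate, and the ten equally spaced checks are all illustrative assumptions.

```python
import math
import random

def z_test_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return math.erfc(abs(z) / math.sqrt(2))

def simulate_peeking(n_tests=1000, n_users=1000, checks=10,
                     rate=0.2, alpha=0.05, seed=1):
    """Run many A/A tests and compare two decision rules on the same data.

    Returns (share of tests where ANY intermediate look was significant,
             share of tests where only the FINAL look was significant).
    """
    rng = random.Random(seed)
    step = n_users // checks  # users added to each group between looks
    any_look_hits = 0
    final_look_hits = 0
    for _ in range(n_tests):
        conv_a = conv_b = 0
        any_significant = False
        significant = False
        for check in range(1, checks + 1):
            # Both groups share the same true rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(step))
            conv_b += sum(rng.random() < rate for _ in range(step))
            n = step * check
            significant = z_test_p_value(conv_a, n, conv_b, n) < alpha
            any_significant = any_significant or significant
        any_look_hits += any_significant
        final_look_hits += significant  # result of the last look only
    return any_look_hits / n_tests, final_look_hits / n_tests

peek_rate, final_rate = simulate_peeking()
print(f"false alarms when stopping at the first significant look: {peek_rate:.1%}")
print(f"false alarms when checking once at the end: {final_rate:.1%}")
```

Under these assumptions, stopping at the first "significant" look raises false alarms noticeably more often than a single check at the end, even though the two groups are identical.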

Statistical Significance

Significance is a measure of certainty. We can say that a test result is significant if we are confident that the difference we observed on a small sample reflects a real effect rather than random noise, so that it can be expected to hold for a larger audience. Significance depends on many factors, and we choose for ourselves the percentage of mistakes we are willing to accept.

There are two approaches to measuring significance:

1. The frequentist (frequency) approach. This is the one students learn about at universities.

  • The probability here is the frequency of an event.
  • It is used to test statistical hypotheses.
  • The output is a p-value.

The frequentist approach is the more conventional one; it is described in most textbooks on mathematical statistics. However, many conditions must be met for its results to be valid.

2. The Bayesian approach.

  • The probability here is a degree of confidence (subjective probability).
  • As a result, we get the probability of success for each of the options.

This approach is less demanding on the source data: we do not have to check the distribution, and we generally need less data. The price for this is the complexity of Bayesian calculations.

Application of the Frequentist and Bayesian Approaches to Various Metrics

LiveOps influence various metrics. These fall into two types:

1. Binomial metrics. They are usually measured as a percentage:

  • 0 or 1 (Yes or No).
  • Conversion (paid/didn’t pay, clicked/didn’t click).
  • Retention (returned/didn’t return).

2. Non-binomial metrics. They are not measured in percentages but in other units, for example in money or minutes.

Let’s take a look at several cases:

The frequentist approach and binomial metrics (e.g. retention or conversion)

Here you can use a classic test for comparing two groups: the z-test for two proportions or, on large samples, Student's t-test, which gives nearly the same result.
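As a rough illustration of the frequentist route for a binomial metric, the sketch below implements a pooled two-proportion z-test using only the Python standard library. The function name and the conversion numbers are hypothetical; in practice you would more likely rely on a statistics package than on hand-written formulas.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test. Returns (z statistic, two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Under the null hypothesis both groups share one conversion rate.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: 500 of 10,000 users converted in A, 510 of 10,000 in B.
z, p = two_proportion_z_test(500, 10_000, 510, 10_000)
print(f"z = {z:.3f}, p-value = {p:.3f}")
```

In this made-up example, 5.1% against 5.0% on 10,000 users per group gives a p-value far above 0.05, so the difference cannot be told apart from noise.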

The Bayesian approach and binomial metrics

In this case, the probability is somewhat subjective. It is evaluated before the test (the prior probability) and after the test (the posterior probability). It is important to understand that this approach works somewhat differently from the frequentist approach.

The advantage of this method is that the result is, figuratively speaking, the probability of success for each of the test groups rather than a bare p-value. This is quite convenient to interpret, although difficult to calculate.
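One common way to obtain such a probability for a binomial metric is a Beta-Binomial model, sketched below. The uniform Beta(1, 1) prior, the Monte Carlo estimate, the function name, and the numbers are all assumptions made for illustration, not the method prescribed by this article.

```python
import random

def prob_b_beats_a(success_a, n_a, success_b, n_b, draws=20_000, seed=0):
    """Estimate P(rate_B > rate_A) by sampling from the two posteriors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Prior Beta(1, 1) updated with observed successes and failures
        # gives the posterior Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + success_a, 1 + n_a - success_a)
        rate_b = rng.betavariate(1 + success_b, 1 + n_b - success_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical data: 500 of 10,000 users converted in A, 510 of 10,000 in B.
print(f"P(B is better than A) = {prob_b_beats_a(500, 10_000, 510, 10_000):.2f}")
```

With this hypothetical data the estimate comes out at roughly 0.6: B is somewhat more likely to be the better option, but it is far from certain.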

Other methods of statistical analysis are used for non-binomial metrics. This is important to consider because in LiveOps most of the changes are about money.
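The article does not name these other methods. One option that is often used for skewed money metrics is the bootstrap, sketched here as a percentile interval for the difference in average revenue per user. The choice of the bootstrap, the function name, and the revenue data are all assumptions made for this example.

```python
import random
import statistics

def bootstrap_mean_diff_ci(values_a, values_b, n_resamples=1000,
                           confidence=0.95, seed=0):
    """Percentile bootstrap interval for mean(B) - mean(A)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(values_a, k=len(values_a))
        resample_b = rng.choices(values_b, k=len(values_b))
        diffs.append(statistics.fmean(resample_b) - statistics.fmean(resample_a))
    diffs.sort()
    tail = (1 - confidence) / 2
    lower = diffs[int(tail * n_resamples)]
    upper = diffs[int((1 - tail) * n_resamples) - 1]
    return lower, upper

# Hypothetical revenue per user: most users pay nothing, a few pay a lot.
revenue_a = [0.0] * 450 + [5.0] * 40 + [50.0] * 10
revenue_b = [0.0] * 440 + [5.0] * 45 + [50.0] * 15
low, high = bootstrap_mean_diff_ci(revenue_a, revenue_b)
print(f"95% interval for the difference in average revenue: [{low:.2f}, {high:.2f}]")
```

If the resulting interval includes zero, the data does not show a clear difference in revenue between the groups.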

Summing Up: Common Errors in A/B Testing

1. Wrong hypothesis and testing of changes that are difficult to track (we do not always clearly understand how many users we have and what changes we want to see).

2. Interpreting the experimental results in your own favor (peeking at the results; or, when the chosen metric shows no effect, swapping it for another metric after the test results are in).

3. Intuition (you shouldn't use it during tests at all).

4. Audience (test groups that differ in composition):

  • new/not new;
  • traffic sources (one traffic source gets the A version of the test, another traffic source gets the B version);
  • paying/not paying.

5. Too few users (only very noticeable hypotheses can be tested on a small audience).

6. Running several tests at the same time (multivariate testing is fine, but you need to test hypotheses that influence each other as little as possible).

7. Test results can vary from project to project and over time (what worked for one project may not work for another; moreover, what worked for this project may not work for it again in a year).

8. Statistical significance (ignoring it or interpreting it incorrectly).

9. Lack of prior testing.

10. Wrong choice of metrics.

11. Wrong sample size (too small or too large).
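Errors 5 and 11 both come down to sample size. As a rough sketch, the standard normal-approximation formula for comparing two proportions estimates how many users each group needs. The function name, the default 5% significance level, and the 80% power are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist

def users_per_group(base_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided two-proportion test.

    base_rate: conversion rate of the control group, e.g. 0.05
    min_detectable_lift: smallest absolute change worth detecting, e.g. 0.01
    """
    p1 = base_rate
    p2 = base_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_power = NormalDist().inv_cdf(power)          # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical case: 5% base conversion, looking for a lift to 6% or to 5.1%.
print(users_per_group(0.05, 0.01))
print(users_per_group(0.05, 0.001))
```

The smaller the effect you want to detect, the more users you need, which is why only very noticeable hypotheses can be tested on a small audience.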

A/B testing is not so simple. And, strangely enough, 51 is not always more than 50.
