I continue with my series of articles about A/B testing of LiveOps. In the previous article, I talked about ideas for A/B tests, deteriorating tests, A/A and A/A/B tests. This time I will explain how to choose the right metrics, generate options and prepare a sample.
It is better to use the metrics to which you can apply cohort analysis. These are the metrics that tell us about the quality of the project and are tied to the specific registration date. E.g.
Conversions of various types (for example, what percentage of users converted to payment at the end of the week, or what percentage of users converted to fifth and higher payments at the end of the month).
D1, D7, D30 retention.
Monetization metrics, e.g. ARPU, first payment, etc.
Any other custom metrics that you want to influence with your test.
Take, for example, cumulative ARPU metric:
If we look at the day 7 cumulative ARPU on this chart, we can understand how much money a user pays for their first 7 days in the game. This is a very good metric, because it indirectly takes into account the retention and, at the same time, directly takes into account monetization. It all means that this metric very clearly indicates a qualitative change in the product. If the change we introduce is at least somewhat related to monetization, then this metric can be used for A/B testing.
It seems that an A/B test can only have two options - A and B, but in fact, there can be many more since there is also multivariate testing. For example, we want to run two tests - a price test and a text test. The price is $2 or $3 and the text is “buy now” or “buy”. We can run these tests either simultaneously (two tests at the same time), or sequentially (first we carry out one, then the other). The first option will be multivariate testing. That is, we have two price options and two text options, we multiply them and get 4 options and we check them at the same time. This is convenient only if we have a small number of options. But if we have more than two of them, for example, 10 color options and 10 button size options, then we get 100 options total. In this case, there will not be enough people in each group to achieve statistical significance (read about it in the next article).
The most important thing when running an A/B test is that you use a group of people that never encountered the changed functionality. Therefore, tests are often run on a tutorial or on the download form, that is, on those people who first install the game/application. You can also run an A/B test on experienced users who, for example, have never visited an in-game store. But most A/B tests are done on beginners.
When planning an A/B test, we need to understand:
How many groups do you plan to have?
What level of significance do you want to achieve (you need more people for the test if you don’t want to make a mistake)?
What kind of change are you making?
This is a rough formula that I do not recommend to use:
But it has an interesting indicator, which is worth paying attention to. The numerator has a squared standard deviation (the less stable the metric is, the more users we need to run the test), and the denominator is the change that we want to achieve. Thus, the smaller the change we want to achieve, the more users (squared!) we need for this. For example, if we have 30% retention, but by adding a Live Ops event, we want to make it 40%, then we use a small number of users to run the test. If retention was 30%, and we want to check whether it became 30.1%, and the result should be statistically significant, then we will need a lot more users.
Most importantly, the picking must be random. It is fundamentally wrong to send users who, for example, came on Wednesday, into one group, and users who came on Thursday, into another. These are different users, they are motivated differently, they joined the project on different days. The right way to do it is to combine both of them in one segment and randomly distribute them into two groups.
Next Monday I will post an article about results interpretation.