Hacking Google Optimize: On Bayes, p-values, A/A tests, and forgotten metrics


Google Optimize is one of my favorite tools because it allows anyone to quickly build A/B tests; in my courses, participants are often amazed at how quickly such a test can be online. Of course, the preparatory work, the clean formulation of a hypothesis, is not done that quickly, but it is also no fun to wait months for a test to go live. I don't want to go into more detail about the advantages of Google Optimize here, but instead point out three subtleties that are not so obvious.

Using Google Optimize data in the Google Analytics raw data

The Google Analytics API also provides access to the Google Optimize data that flows into Analytics, which means the raw Analytics data can be analyzed with respect to a Google Optimize test. This is especially interesting if something can't be used as a KPI in Optimize, if you forgot to set a KPI in Google Optimize, or if you want to analyze side effects. Some of this can also be done afterwards with segments, but hey, this is about hacking (in the sense of tinkering, not crime): you also do things because you can, not because they are always necessary.

The two important Optimize dimensions are called ga:experimentId and ga:experimentVariant, and there is now also a combination of the two called ga:experimentCombination. However, if you only run one test, it is sufficient to query only the dimension ga:experimentVariant. 0 is the original variant (control group), and the other variants are numbered upwards from there. If you have several tests running, simply look up the experiment ID in the Google Optimize interface; it can be found in the right-hand column under Google Analytics. It is usually quite cryptic, as you can see in the picture.
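For the sake of completeness, a minimal sketch of such a query with the googleAnalyticsR package looks roughly like this; the view ID and date range are placeholders, the metrics are just examples, and the dimension names are given without the ga: prefix, as this package expects:

library(googleAnalyticsR)

ga_auth()  # authenticate against the Google Analytics API

optimize_data <- google_analytics(
  viewId     = "12345678",                    # placeholder view ID
  date_range = c("2019-06-01", "2019-06-28"), # placeholder test period
  metrics    = c("sessions", "avgSessionDuration"),
  dimensions = c("experimentCombination", "deviceCategory"),
  anti_sample = TRUE                          # avoid sampling where possible
)

head(optimize_data)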

In my example, I have two experiments running, so I can output the combination next to three custom dimensions (Client ID, Hit Type and UNIX Timestamp) and the page title (I cut off the Client ID a bit in the image, even though it is only a pseudonymized data point). In the second picture, we see the two experiments and the respective variants in one field. In the test whose ID starts with c-M, a student hypothesized that visitors to my site would view more pages and spend more time on the site if the search box were placed higher up. I didn't believe it, but believing is not knowing, so we ran the test with session duration as the KPI. I had forgotten to set the number of searches as a second KPI. Well, it's good that I have the raw data, even if I could of course build a segment for it.

As we can also see in the screenshot, users are in both tests at the same time, since the other test should not affect this one. Now, during the test period of four weeks, there were only three users on my site who searched for something; one of them searched for the same query several times, one searched for two different terms. With such a small number of cases, we don't even need to think about significance. In the meantime, it looked as if the variant with the search box further up would actually win, but more on that in the last section. The question now is: why can the variant be better at all if hardly anyone searched? Or did the mere presence of the search box lead to a longer session duration? Very unlikely!

Let’s take a closer look…

Note that in the raw data there are now two entries for each hit of a user, one per test. Also, not every user will be in a test, even if 100% of the traffic is targeted, which can already be seen in Google Analytics. We can also check whether the random assignment of test and control group participants has resulted in a reasonably even distribution of users (e.g. mobile versus desktop, etc.). Of course, this is also possible in the interface.
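A sketch of such a check on the raw data could look like this; the data frame, the column names (clientId, experimentVariant, deviceCategory) and the counts are invented for illustration:

library(dplyr)

# Invented raw hits: one row per hit with the Client ID custom dimension,
# the Optimize variant (0 = original) and the device category.
raw_hits <- tibble::tribble(
  ~clientId, ~experimentVariant, ~deviceCategory,
  "111.222", "0", "desktop",
  "111.222", "0", "desktop",
  "333.444", "1", "mobile",
  "555.666", "1", "desktop"
)

raw_hits %>%
  distinct(clientId, experimentVariant, deviceCategory) %>%  # one row per user
  count(experimentVariant, deviceCategory) %>%               # users per cell
  group_by(experimentVariant) %>%
  mutate(share = n / sum(n))                                 # share within each variant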

The first thing I notice when I pull the data from the API is that the values don't match those from the GUI. At first, this is quite worrying. If I only look at users and sessions, the values match exactly. If I add the experimentCombination dimension, the numbers no longer fit, and it's not because of the differences between API v3 and v4. It's not uncommon for data to mismatch, most often because of sampling, but that can't be the case here. Interestingly, the numbers within the GUI also don't match when I look at the data under Experiments and compare it to the audience dashboard. However, the figures from the API agree with the data from the experiments report. So be careful when building segments!

If I pull the data including my Client ID dimension, I get slightly fewer users. This is explained by the fact that not every user gets such an ID written into the custom dimension: the user most likely has a Client ID (or certainly, because otherwise GA would not be able to identify them as an individual user), but in some cases the ID does not get written into the dimension, so that it contains e.g. "False" instead.

Now let’s take a look at some data. For example, I’m interested in whether Optimize manages to get the same distribution across devices as I have on the site:

The majority of my traffic still takes place on the desktop. What does it look like in Optimize?

The distribution is definitely different. That in itself is not surprising, because no Optimize experiment should be served on AMP pages; what is surprising is rather that experiments took place on mobile devices here at all. And these cases have different values in relation to the target KPI, as you can also see in Analytics:

So we can't draw conclusions about the whole site from the test results, but we also don't know how big the effect of the unexpected mobile users on the test result is. To find out, we would have to redetermine the winner. But how is the winner determined in the first place? For example, we could use a chi-square test based on the observed average session duration:

> chisq.test(x)

        Pearson's Chi-squared test with Yates' continuity correction

data:  x
X-squared = 1.5037, df = 1, p-value = 0.2201

In this case, p is above 0.05; more on p in the next section. If the chi-square test is even the right test here, it would show that the difference is not statistically significant. However, this is not the test that Google Optimize uses.
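For anyone who wants to reproduce this kind of test in R, here is a sketch with made-up counts (these are not the numbers behind the output above); the rows are original versus variant, the columns sessions above versus below the average session duration:

# Made-up 2x2 table: original vs. variant, sessions above vs. below
# the average session duration.
x <- matrix(c(120, 80,
              135, 65),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("original", "variant"),
                            c("above_avg", "below_avg")))

chisq.test(x)  # applies Yates' continuity correction for 2x2 tables by default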

Bayesian Inference versus NHST

What exactly is happening under the hood? Let's take a look at how Google Optimize calculates whether a variant has won or not. Unlike Adobe Test & Target, for example, or most significance calculators such as Conversion's (although Conversion doesn't even say what kind of test they're using), Google Optimize is not based on a t-test, Mann-Whitney U, or chi-square test, but on a Bayesian inference method. What does that mean?

Two different schools of thought collide here: that of the so-called frequentists (NHST stands for Null Hypothesis Significance Testing) and that of the proponents of Bayesian inference. These approaches have been and still are intensively debated in statistics, and I am not the right person to pass judgement here. But I will try to shed some light on the two approaches for non-statisticians.

Most A/B testing tools perform hypothesis tests. You have two groups of roughly the same size, one group is subjected to a "treatment", and then it is observed whether the defined KPI changes "significantly" in the test group. For significance, the p-value is usually consulted; if it is below 0.05, or however the significance level has been defined, the null hypothesis is rejected. Although you don't see anything about null hypotheses etc. in the tool interfaces, probably so as not to confuse users, the conceptual framework behind them assumes exactly that. For example, if you test whether a red button is clicked more often than a blue one, the null hypothesis would be that both are clicked equally often. The background to this is that a hypothesis cannot always be proven. But if the opposite of the hypothesis is rather unlikely, then the hypothesis itself can be assumed to be rather likely. That is all the p-value is about.
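Sticking with the button example, such a test looks roughly like this in R; the click and impression numbers are invented:

# Invented numbers: clicks and impressions for the red and the blue button.
clicks      <- c(red = 180, blue = 150)
impressions <- c(red = 2000, blue = 2000)

# Two-sample test of proportions; the null hypothesis is that both buttons
# are clicked equally often. A small p-value speaks against that hypothesis.
prop.test(clicks, impressions)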

Now, the p-value is not a simple story; even scientists struggle to explain the p-value in an understandable way, and there is an ongoing discussion as to whether it makes sense at all. The p-value says nothing about how "true" a test result is. It only says something about how likely it is to obtain this result (or a more extreme one) if the null hypothesis is true. A p-value of 0.03 means that the probability of such a result occurring under a true null hypothesis is 3%. Conversely, this says nothing about how "true" the alternative hypothesis is, and the inverse of the p-value (97%) is not the probability that one variant will beat the other.
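A small simulation makes this interpretation tangible: if the null hypothesis is true by construction (both variants convert at the same rate), "significant" p-values still show up at roughly the rate of the significance level. The conversion rate and sample size below are arbitrary:

# Simulate many A/B tests in which the null hypothesis is true by construction:
# both variants convert at 5%.
set.seed(1)
p_values <- replicate(10000, {
  conv_a <- rbinom(1, 1000, 0.05)
  conv_b <- rbinom(1, 1000, 0.05)
  prop.test(c(conv_a, conv_b), c(1000, 1000))$p.value
})

mean(p_values < 0.05)  # roughly 0.05: about 5% false positives, as expected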

Another common problem with A/B testing is that the sample size is not defined beforehand. The p-value can change over the course of an experiment, and so results that were statistically significant can become non-significant after a few days because the number of cases has changed. In addition, it is not only significance that is of interest, but also the power (selectivity) of a test, which only very few testing tools display.
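In R, the required sample size for a simple conversion test can be estimated up front, for example with power.prop.test; the baseline conversion rate and the uplift below are assumptions:

# How many sessions per variant are needed to detect an uplift from a
# 5% to a 6% conversion rate with 80% power at a 5% significance level?
power.prop.test(p1 = 0.05, p2 = 0.06, sig.level = 0.05, power = 0.8)

The output contains n, the required number of cases per group.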

But these are mainly problems with the tools, not with the frequentist approach most tools use. The "problem" with the frequentist approach is that the model does not change when new data comes in. For example, returning visitors may get used to a change on the page at some point, so that an initial A/B test predicts a big impact while the actual effect is much smaller, because the frequentist approach simply counts the total number of conversions, not their development over time. In Bayesian inference, however, newly incoming data is taken into account to refine the model; decreasing conversion rates would influence the model. Data that exists "beforehand", so to speak, and influences the assumptions about the effect in an experiment is called the prior probability, or "priors" (I write priors because it's faster). The example in the Google Help Center (which is also often used elsewhere) is that if you misplace your cell phone in the house, according to Bayesian inference you may use the knowledge that you tend to leave your phone in the bedroom, and you may also follow its ringing. With the frequentist approach, you are not allowed to do that.
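To make the contrast concrete, here is a minimal sketch of the Bayesian way of thinking with a Beta-Binomial model and an uninformative prior; this illustrates the general idea, not the actual model Google Optimize uses, and the conversion counts are invented:

# Beta(1, 1) is an uninformative prior; the posterior for a conversion rate
# with c conversions in n sessions is Beta(1 + c, 1 + n - c).
set.seed(1)
draws_original <- rbeta(100000, 1 + 40, 1 + 1000 - 40)  # 40/1000 conversions
draws_variant  <- rbeta(100000, 1 + 55, 1 + 1000 - 55)  # 55/1000 conversions

# "Probability to beat the original", the kind of number Bayesian tools report:
mean(draws_variant > draws_original)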

And this is exactly where the problem arises: how do we know that the "priors" are relevant to our current question? Or, as the Optimizely blog puts it:

The prior information you have today may not be equally applicable in the future.

The exciting question now is how Google arrives at the priors in Optimize. Google makes the following statement about this:

Despite the nomenclature, however, priors don’t necessarily come from previous data; they’re simply used as logical inputs into our modeling.

Many of the priors we use are uninformative – in other words, they don’t affect the results much. We use uninformative priors for conversion rates, for example, because we don’t assume that we know how a new variant is going to perform before we’ve seen any data for it.

These two blog excerpts already make it clear how differently the usefulness of Bayesian inference is understood. At the same time, it is obvious that, as with any other tool, we lack transparency about how exactly the calculations are carried out. Another reason why, if you want to be on the safe side, you need the raw data to run your own tests.

The Bayesian approach requires more computing time, which is probably why most tools don't use it. There is also criticism of Bayesian inference. The main problem, however, is that most users know far too little about what exactly the A/B testing tools do and how reliable the results are.

Why an A/A test can also be beneficial

Now the question arises as to why there was any difference in session duration at all when hardly anyone searched. This is where an A/A test can help. A/A test? That's right, such a thing exists, and it helps to identify the variance of one's own site. I once had a wonderful test in which I measured the AdSense click-through rate after a design change. The change was very successful. To be on the safe side, I tested again; this time the change had worse values. Now, of course, it may be that worse ads were simply served and the click-through rate therefore deteriorated. But it could also simply be that the page itself has a certain variance. And this can be found out by running an A/A test (or by using past raw data for such a test). In such a test, nothing is changed in the test variant, and then you watch whether one of the main KPIs changes or not. Theoretically, nothing should change. But what if it does? Then we have identified a variance that lies in the page and the traffic itself, and that we should take into account in future tests.
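Such an A/A test can also be simulated on historical raw data: split users at random into two groups that saw exactly the same page and compare the KPI. The data frame and column names below are invented for illustration:

library(dplyr)

# Invented session data: one row per user with a session duration in seconds.
set.seed(1)
sessions <- tibble::tibble(
  clientId        = as.character(1:2000),
  sessionDuration = rexp(2000, rate = 1 / 90)  # made-up durations, mean ~90s
)

# Random split into two groups that saw exactly the same page.
aa <- sessions %>%
  mutate(group = sample(c("A1", "A2"), n(), replace = TRUE))

aa %>%
  group_by(group) %>%
  summarise(mean_duration = mean(sessionDuration))

t.test(sessionDuration ~ group, data = aa)  # should usually not be "significant"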

Conclusion

  • Differences in test results can be caused by a pre-existing variance of a page. This is where an A/A test helps to get to know the variance.
  • The results may differ if different tools are used, as the tools have different approaches to how they determine the “winners”.
  • Raw data makes it possible to apply your own test statistics or to verify test results, as the tools offer little transparency about how they arrive at their results. For example, as in my case, it can turn out that the test was not served evenly at all, and the results are therefore not so clearly usable.
  • The raw data is sometimes very different from the values in the GUI, which cannot be explained.
  • The p-value is only part of the truth and is often misunderstood.
  • With frequentist approaches, you should think in advance about how large the sample size of an A/B test needs to be.
  • A fool with a tool is still a fool.