When AB testing doesn’t cut it
Today I am going to talk about experimentation in data science: why it is so important, and some of the techniques we might consider when AB testing is not appropriate. Experiments are designed to identify causal relationships between variables, a concept that matters in many fields and is particularly relevant for data scientists today. Let’s say we are a data scientist working in a product team. In all likelihood, a large part of our role will be to identify whether new features will have a positive impact on the metrics we care about. For example, if we introduce a new feature making it easier for users to recommend our app to their friends, will this improve user growth? These are the types of questions that product teams will be interested in, and experiments can help provide an answer. However, causality is rarely easy to identify, and there are many situations where we will need to think a bit deeper about the design of our experiments so we do not make incorrect inferences. When this is the case, we can often use techniques taken from econometrics, and I will discuss some of these below. Hopefully, by the end, you will have a better understanding of when these techniques apply and how to use them effectively.
Most people reading this have probably heard of AB testing, as it is an extremely common method of experimentation used in industry to understand the impact of changes we make to our product. It could be as simple as changing the layout of a web page or the colour of a button and measuring the effect this change has on a key metric such as click-through rate. I won’t get into the specifics too much here as I want to focus more on the alternative techniques, but for those interested in learning more about AB testing the following course on Udacity provides a really good overview. In general, we can take two different approaches to AB testing: the Frequentist approach or the Bayesian approach, each of which has its own advantages and disadvantages.
I would say frequentist AB testing is by far the most common type of AB testing and follows directly from the principles of frequentist statistics. The goal here is to measure the causal effect of our treatment by checking whether the difference in our metric between the A and B groups is statistically significant at some significance level; 5 or 1 per cent is typically chosen. More specifically, we define a null and an alternative hypothesis and determine whether or not we can reject the null. The statistical test we use depends on the type of metric, but chi-square tests and t-tests are commonly used in practice. However, there are some limitations to the frequentist methodology, and I think it is a bit harder to interpret and explain than the Bayesian approach. Perhaps because the underlying maths is more complex in the Bayesian setting, the Bayesian approach is not as commonly used. A key point about the frequentist approach is that the parameter we are trying to estimate is treated as a fixed, unknown constant. Therefore, there is no probability distribution associated with it.
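To make the frequentist workflow a bit more concrete, here is a minimal sketch of a chi-square test on conversion counts using `scipy.stats.chi2_contingency`. The function name `ab_test_chi2` and the conversion numbers are made up for illustration and are not from a real experiment.

```python
import numpy as np
from scipy.stats import chi2_contingency


def ab_test_chi2(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Chi-square test of independence on a 2x2 conversion table.

    Null hypothesis: the conversion rate is the same in groups A and B.
    """
    # Rows are variants, columns are (converted, did not convert).
    table = np.array([
        [conv_a, n_a - conv_a],
        [conv_b, n_b - conv_b],
    ])
    # correction=False switches off the Yates continuity correction.
    chi2, p_value, dof, expected = chi2_contingency(table, correction=False)
    return {
        "rate_a": conv_a / n_a,
        "rate_b": conv_b / n_b,
        "p_value": float(p_value),
        "reject_null": bool(p_value < alpha),
    }


# Hypothetical data: 200/2000 conversions in A, 260/2000 in B.
result = ab_test_chi2(200, 2000, 260, 2000)
print(result)
```

If `reject_null` is true, we conclude at the chosen significance level that the two conversion rates differ. For a continuous metric such as revenue per user, a t-test (for example `scipy.stats.ttest_ind`) would be the more natural choice.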
The key difference in the Bayesian approach is that our metric is treated as a random variable and therefore has a probability distribution. This is quite useful as we can now incorporate uncertainty about our estimates and make probabilistic statements, which are often much more intuitive to people than the frequentist interpretation. Another advantage of using a Bayesian approach is that we may reach a solution faster than with frequentist AB testing, using fewer resources, as we do not necessarily need to assign equal numbers of users to each variant. Choosing which approach to take will depend on the individual situation and is largely up to the data scientist. Whichever method you choose, both are powerful ways to identify causal effects.
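As a rough sketch of what such a probabilistic statement can look like, the snippet below treats each variant’s conversion rate as a random variable with a Beta posterior and estimates the probability that B beats A by sampling. The uniform Beta(1, 1) prior, the function name `prob_b_beats_a` and the counts are all assumptions chosen for illustration.

```python
import numpy as np


def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_alpha=1.0, prior_beta=1.0,
                   n_samples=100_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under a Beta-Binomial model."""
    rng = np.random.default_rng(seed)
    # With a Beta prior and binomial data, the posterior is
    # Beta(prior_alpha + conversions, prior_beta + non-conversions).
    post_a = rng.beta(prior_alpha + conv_a, prior_beta + n_a - conv_a, n_samples)
    post_b = rng.beta(prior_alpha + conv_b, prior_beta + n_b - conv_b, n_samples)
    # Share of posterior draws in which B has the higher conversion rate.
    return float(np.mean(post_b > post_a))


# Hypothetical data: 200/2000 conversions in A, 260/2000 in B.
p = prob_b_beats_a(200, 2000, 260, 2000)
print(f"P(B > A) = {p:.3f}")
```

The output reads directly as "the probability that B is better than A", which is usually easier to explain to stakeholders than a p-value. Note that this sketch uses a fixed split between variants; allocating traffic adaptively would need additional machinery that is not shown here.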
In many cases, however, AB testing is just not a suitable technique to identify causality. For example, for AB testing to be valid we must have random assignment to the A and B groups. This is not always possible, as some interventions may target individuals for a specific reason, making them fundamentally different from other users. In other words, selection into each group is non-random. I will discuss and provide code for a specific example of this that I ran into recently a bit later in the post.
Another situation where AB testing may not be valid is when we have confounding. In this situation, looking at correlations between variables may be misleading. We want to know if X causes Y, but it may be the case that some other variable Z drives both. This makes it impossible to disentangle the effect of just X on Y, making it very difficult to infer anything about causality. When Z is left out of the analysis, this is often called omitted variable bias, and it will result in us either over- or underestimating the true impact of X on Y. In addition to this, it may not be feasible from a business standpoint to design a randomised experiment, as it may cost too much money or be seen as unfair if we give some users new features and do not provide them to other users as well. In these circumstances, we must rely on quasi-random experiments.
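A small simulation can make the confounding problem tangible. In the sketch below, Z drives both X and Y, and the true effect of X on Y is set to 1.0. All coefficients, the sample size and the helper `ols` are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Z is the confounder: it influences both X and Y.
z = rng.normal(size=n)
x = 0.8 * z + rng.normal(size=n)
# True causal effect of X on Y is 1.0; Z adds its own effect of 2.0.
y = 1.0 * x + 2.0 * z + rng.normal(size=n)


def ols(target, *regressors):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    design = np.column_stack([np.ones(len(target)), *regressors])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


naive = ols(y, x)[1]        # Z omitted: slope absorbs part of Z's effect
adjusted = ols(y, x, z)[1]  # Z included: slope is close to the true 1.0

print(f"naive estimate:    {naive:.2f}")
print(f"adjusted estimate: {adjusted:.2f}")
```

Leaving Z out pushes the estimated effect of X well above its true value, while controlling for Z recovers it. In practice we often cannot observe Z at all, which is why simply adding controls is not always enough.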
Ok, so we have discussed some of the reasons we may not be able to apply AB testing, but what can we do instead? This is where econometrics comes in. Compared to machine learning, econometrics is much more focused on causality, and as a result economists and social scientists have developed a wide range of statistical techniques aimed at understanding the causal impact of one variable on another. Below I will show some of the techniques that data scientists could and should borrow from econometrics to avoid making incorrect inferences from experiments that suffer from the problems mentioned above.