Roy Hung


Implementing a Bayesian A/B Testing Framework in Django

Implementation and simulation of Bayesian A/B tests in Django

  • A/B Testing
  • #Python #Django

Bayesian A/B testing is widely used in many online applications due to the advantages it has over its frequentist counterpart. One such advantage is the ability to learn and exploit the superior variant on-the-fly even before the experiment is concluded. Another is the ease of interpretation of the estimates. In this post, I will provide a walkthrough on how a bayesian A/B testing framework can be implemented in django.

The first part of the post will be a demonstration of bayesian A/B testing using simulations, followed by the math behind the main concepts. The second part will discuss the implementation of the testing framework in a django application. The code for the django project can be found here, and the code for the simulator here.

Bayesian A/B Test Simulator
[Interactive mini-app: Bayesian A/B Test Simulator, showing the beta posterior distribution for each variant, sliders for the true conversion rates \mu_A, \mu_B, \mu_C, and Table 1: Summary of Results listing n, \alpha, \beta, conversions, impressions, and conversion rate for variants A, B, and C.]

Scenario and Simulation

Let's put ourselves in the context of running an ecommerce website. We have three variants of a new homepage design and we'd like to know which version (A/B/C) will generate the highest clickthrough rate on the "Add To Cart" button on the page. For our purposes, we consider a user clicking on the button to be the desirable outcome, which we will term a conversion. Every user visiting the page also generates an impression - a count of the number of times the variant is shown.

In a bayesian A/B test, an agent will, based on a predetermined strategy, decide which variant to serve a user visiting the page. The agent then records the impression and the conversion (if any) generated by that visit, after which it updates its beliefs on how well each variant is performing. Its updated beliefs will in turn inform its strategy going forward on how often it should serve out each variant.

In the Bayesian A/B Test Simulator mini-app above, you get to be an all-knowing, all-powerful being that gets to decide the "true" conversion rates (\mu_A,\mu_B, \mu_C) behind each variant A, B, and C. These rates represent the probability that a user exposed to the variant will generate a conversion. These "ground truth" conversion rates are not known to the agent running the test, and it is the job of the agent to estimate what these conversion rates are.

When you click Simulate, an experiment is run with N=10000 simulated page views to our "homepage". With each simulated page view, our agent will update its belief system according to whether the visiting user generates a conversion or not. The agent models the mean conversion rates using a beta distribution as its posterior probability belief (more on that below). You can use the range slider in the mini-app to see how the agent's beliefs (i.e., the distributions) change over time, as more data is collected.
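For readers who prefer code, the sketch below shows roughly what one run of such a simulation looks like when the agent uses Thompson sampling (discussed in detail later). The ground-truth rates and the seed are made up for illustration; this is not the mini-app's actual code.

import numpy as np

rng = np.random.default_rng(0)
mu = {'A': 0.10, 'B': 0.13, 'C': 0.08}        # hidden "true" conversion rates
alpha = {k: 1 for k in mu}                    # conversions (prior count of 1)
beta = {k: 1 for k in mu}                     # non-conversions (prior count of 1)

for _ in range(10_000):                       # N = 10000 simulated page views
    # Thompson sampling: draw from each variant's posterior, serve the largest draw
    draws = {k: rng.beta(alpha[k], beta[k]) for k in mu}
    served = max(draws, key=draws.get)
    converted = rng.random() < mu[served]     # simulate the user's response
    alpha[served] += int(converted)           # update the served variant's posterior
    beta[served] += int(not converted)

for k in mu:
    print(k, alpha[k] / (alpha[k] + beta[k])) # estimates converge towards mu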

Two observations can be made by playing with the simulator.

  • The first is that the agent's beliefs will eventually converge towards the "true" conversion rates that you have set. By N=10000, its estimates of the conversion rates as displayed in Table 1 are very close to the ground truth that you have set (\mu_A,\mu_B, \mu_C), and this is consistent across all strategies available (i.e., thompson sampling, epsilon-greedy, uniform, UCB1).

  • The second observation is that, by the end of the experiment, the agent is serving up the best performing variant much more often than the other two variants. This means that it has already found the best performer and is exploiting it as much as possible. In fact, this behavior can be seen as early as N=100. The exception to this case is when the "uniform" strategy is used, where each variant has an equal probability to be selected regardless of its performance (details of each strategy are discussed below).

It is perhaps the second observation that makes bayesian A/B tests attractive. The agent is already adding value before the experiment ends, by showing the better variant more often. Compare this to frequentist experiments where you can only determine the better variant after the experiment ends. Moreover, with a bayesian setup, you potentially require a much smaller sample size to crown the winner.

The Bayesian Paradigm and Conjugate Priors

When estimating a parameter \theta, bayesian statistics views \theta as a random variable, as opposed to the frequentist view that there is only one true value of \theta. Frequentists will try to fit a distribution to the observed data, and the estimate \hat{\theta} will be the one that maximizes the probability of producing the observed data under the given distribution. In bayesian statistics, we are estimating the posterior P(\theta | X) instead, where X is the data observed.

Estimating P(\theta | X) is a difficult task, especially if we do not know how P(\theta | X) is distributed (and usually we don't). However, as we will see, by using Bayes' theorem and choosing the right prior distribution, we can simplify this problem greatly.

P(\theta | X) = \frac{P(X | \theta)P(\theta)}{P(X)} = \frac{P(X, \theta)}{P(X)}

We know that the likelihood P(X | \theta) follows a binomial distribution as each data point is generated with a probability of success or failure and X is the number of successes (conversions) out of all impressions made. Looking at the numerator, we have:

P(X,\theta) = {N\choose X}\theta^X(1-\theta)^{N-X}P(\theta)

P(X,\theta) = \frac{\Gamma(N+1)}{\Gamma(X+1)\Gamma(N-X+1)}\theta^X(1-\theta)^{N-X}P(\theta)

Where X is the total number of successes,

X=\sum^{N}_{i}x_i

x_i = \begin{cases} 1, \text{if page view generates conversion} \\ 0, \text{otherwise} \\ \end{cases}

Remember that if we can get the posterior P(\theta|X) to follow a known distribution, we can greatly simplify the problem because we won't have to take integrals over an unwieldy expression to calculate probabilities. With that in mind, let's consider using a Beta distribution for the prior P(\theta), where \theta \sim Beta(a,b).

P(X,\theta) = \frac{\Gamma(N+1)}{\Gamma(X+1)\Gamma(N-X+1)}\theta^X(1-\theta)^{N-X}\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1}

P(X,\theta) = \frac{\Gamma(N+1)\Gamma(a+b)}{\Gamma(X+1)\Gamma(N-X+1)\Gamma(a)\Gamma(b)}\theta^{X+a-1}(1-\theta)^{N-X+b-1}

P(X,\theta) = \lambda \theta^{X+a-1}(1-\theta)^{N-X+b-1}

Where \lambda = \frac{\Gamma(N+1)\Gamma(a+b)}{\Gamma(X+1)\Gamma(N-X+1)\Gamma(a)\Gamma(b)}

Now let's deal with the denominator P(X):

P(X) = \displaystyle\int^1_0 P(X,\theta) d\theta = \int^1_0 \lambda \theta^{X+a-1}(1-\theta)^{N-X+b-1}d\theta

With some manipulation, we can coerce the expression to be integrated over a beta distribution.

P(X) = \lambda\frac{\Gamma(X+a)\Gamma(N-X+b)}{\Gamma(N+a+b)} \displaystyle\int^1_0 \frac{\Gamma(N+a+b)}{\Gamma(X+a)\Gamma(N-X+b)}\theta^{X+a-1}(1-\theta)^{N-X+b-1} d\theta

The integral portion is now over a beta distribution Beta(X+a, N-X+b) and so it integrates to 1.

P(X) = \lambda\frac{\Gamma(X+a)\Gamma(N-X+b)}{\Gamma(N+a+b)}

Back to our posterior, we have:

P(\theta | X) = \frac{P(X, \theta)}{P(X)}

= \frac{\lambda \theta^{X+a-1}(1-\theta)^{N-X+b-1}}{\lambda\frac{\Gamma(X+a)\Gamma(N-X+b)}{\Gamma(N+a+b)}}

= \frac{\Gamma(N+a+b)}{\Gamma(X+a)\Gamma(N-X+b)}\theta^{X+a-1}(1-\theta)^{N-X+b-1}

\therefore \ \theta | X \sim Beta(X+a, N-X+b)

Our posterior now has a closed-form solution: it is simply the density function of a beta distribution.

To summarise, we have chosen a beta distribution for the prior, and in doing so, we have shown that the posterior also follows a beta distribution. When the prior and the posterior belong to the same family of distributions, they are said to be conjugate. The beta prior is known as the conjugate prior for a binomial likelihood function.

This is the reason why we model our posterior with a beta distribution. The decision to choose a beta distribution for the prior was motivated by the fact that it gives us a posterior with a familiar distribution to work with. Nevertheless, there are other useful properties. For example, the expectation of the beta distribution is E(\theta)=\frac{\alpha}{\alpha + \beta}, which in our context (with \alpha as conversions and \beta as impressions minus conversions) works out to \frac{conversions}{impressions}, i.e., our conversion rate.
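As a quick sanity check, the snippet below illustrates the conjugate update with scipy, using made-up counts; it is not part of the django app.

from scipy import stats

a, b = 1, 1        # Beta(1, 1) prior, i.e., uniform over [0, 1]
X, N = 30, 200     # hypothetical conversions and impressions

posterior = stats.beta(a + X, b + N - X)   # theta | X ~ Beta(X + a, N - X + b)
print(posterior.mean())            # (X + a) / (N + a + b), roughly 0.153, close to X / N = 0.15
print(posterior.interval(0.95))    # 95% credible interval for the conversion rate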

In the next section, we'll implement a bayesian A/B testing app in an application setting using django, and discuss the different explore-exploit algorithms used by the agent to serve the different content variations.

Implementing Bayesian A/B Testing in Django

What would bayesian A/B testing look like in a real application setting? You can see a demonstration of a web page undergoing A/B testing here. The variant of the page served to you is decided by the A/B testing agent. Each visit generates an impression, and if you click the "Add To Cart" button, it generates a conversion. Note that this demo is separate from the simulation mini-app above.

With bayesian A/B testing, you can see the results in real time. For the demo page under testing, you can view the dashboard of the results for each variant here. You may also simulate page views and reset the experiment. However, do take note that this demo is open to the public, so other users can view, simulate, and reset the experiment as well.

You can view the code for the django implementation here.

Views

In django, the typical function-based view looks like this:

from django.shortcuts import render

def homepage(request):

    template = "myproject/homepage.html"
    context = {
        'foo':'Hello World',
    }
    return render(
        request,
        template,
        context
    )

In our A/B testing context, we have three versions to serve (A/B/C) and each version has its own HTML template. The decision of which variant to serve is made by the ab_assign function, which returns the selected Variant.

from django.shortcuts import render
from abtest.models import Campaign
# ab_assign is the variant-assignment helper discussed below;
# import it from wherever you place it in the abtest app

def homepage(request):

    assigned_variant = ab_assign(
        request=request,
        campaign=Campaign.objects.get(name="Test Homepage"),
        default_template='abtest/homepage.html',
        algo='thompson'
    )
    template = assigned_variant['html_template']
    context = {
        'foo':'Hello World',
    }
    return render(
        request,
        template,
        context
    )

While this experiment uses server-side rendered views, the A/B test framework here is not limited in this respect. One can easily use the same ab_assign function for an API that indicates to the client-side which version to render.
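For instance, a minimal JSON endpoint built on the same helper might look like the sketch below; the view name and response shape are illustrative, not part of the repo.

from django.http import JsonResponse

from abtest.models import Campaign

def homepage_variant_api(request):
    # Same assignment logic as the server-rendered view (ab_assign imported as above),
    # but we return only the variant code and let the client decide what to render
    assigned_variant = ab_assign(
        request=request,
        campaign=Campaign.objects.get(name="Test Homepage"),
        default_template='abtest/homepage.html',
        algo='thompson'
    )
    return JsonResponse({'variant': assigned_variant['code']})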

Models

We only need two models for the A/B test app. The purpose of these models is simply to let you record the experiments that are being run and the different variants for each experiment, and to store the impression and conversion counts for each variant (i.e., your \alpha and \beta shape parameters for the beta distribution).

Essentially, the Variant model has a many-to-one relationship with the Campaign model, since each Campaign experiment can have any number of versions (Variants) for an A/B test.

The models shown here are simplified for this post.

import uuid

import scipy.stats
from django.db import models


class Campaign(models.Model):

    """ A record for AB Tests conducted
    """
    
    code = models.UUIDField(
        default=uuid.uuid4, 
        editable=False,
        help_text='AB test campaign code'
    )
    name = models.CharField(
        unique=True,
        max_length=255,
        help_text='Name of AB test'
    )
    description = models.TextField(
        blank=True,
        default='',
        help_text='Description of AB test'
    )

class Variant(models.Model):

    """ Model to store the different variants 
    for an AB test campaign. Variants are the different
    versions served to users (A/B/C...)
    """

    campaign = models.ForeignKey(
        Campaign,
        related_name='variants',
        on_delete=models.CASCADE,
    )
    code = models.CharField(
        max_length=32,
        help_text='Variant code, (i.e., A, B, C etc)'
    )
    name = models.CharField(
        max_length=64,
        help_text='Name of variant'
    )
    impressions = models.IntegerField(
        default=1,
        help_text='Number of times variant was shown/visited'
    )
    conversions = models.IntegerField(
        default=1,
        help_text='Number of conversions for variant'
    )
    conversion_rate = models.FloatField(
        default=1.0,
        help_text='conversions / impressions'
    )
    html_template = models.FilePathField(
        null=True,
        help_text='Path to HTML template for variant View'
    )

    def beta_pdf(self, x_vals):
        # Beta posterior density at the given x values (0 < x < 1),
        # with alpha = conversions and beta = impressions - conversions
        y_vals = list(scipy.stats.beta.pdf(
            x_vals, 
            max(self.conversions,1),
            max(self.impressions-self.conversions,1)
            )
        )
        return y_vals

Explore-Exploit Strategies

Suppose there are K variants, and when shown variant k \in \{1,...,K\}, the user generates a conversion with probability \theta_k \in [0,1], which is unknown to the agent. The problem the agent faces in deciding which variant to serve the user is commonly known as the multi-armed bandit problem. Faced with uncertainty, the agent is posed a dilemma: it can get a higher payoff (conversion rate) by exploiting the variant that has the best conversion rate based on the data collected so far, or it can explore by serving up alternatives that could turn out to be the better performer in the long run.

Within our main ab_assign function, the agent takes in an algo parameter that determines which strategy to use in assigning a variant to the user. The strategies available in the application are:

  • Uniform
  • Epsilon-Greedy
  • Thompson Sampling
  • Upper Confidence Bound (UCB1)

Uniform

Starting with the simplest possible strategy, the agent simply selects a variant at random with equal probability. As such, the choice of variant is independent of the belief system of the agent.

assigned_variant = random.sample(list(variants), 1)[0]

This strategy is more akin to the frequentist way of testing, but we include it here as a baseline against which the other strategies can be compared.

Epsilon-Greedy

In the epsilon-greedy algorithm, we choose a value for \epsilon where \epsilon \in [0,1] that determines the probability that the agent will choose to explore rather than exploit. This means that the agent will choose to select a variant at random with a probability of \epsilon and will select the current best variant with probability of 1-\epsilon. The best-performing variant at a given point in time is the variant with the highest conversion rate, \hat{\mu}_k=\frac{\alpha_k}{\alpha_k+\beta_k}

def epsilon_greedy(variant_vals, eps=0.1):
    """Epsilon-greedy algorithm implementation 
    on Variant model values.
    """

    if random.random() < eps: 
        # If random number < eps, exploration is chosen over 
        # exploitation
        selected_variant = random.sample(list(variant_vals), 1)[0]

    else:
        # If random number >= eps, exploitation is chosen over
        # exploration
        best_conversion_rate = 0.0
        selected_variant = None
        for var in variant_vals:
            if var['conversion_rate'] > best_conversion_rate:
                best_conversion_rate = var['conversion_rate']
                selected_variant = var
            elif var['conversion_rate'] == best_conversion_rate:
                # Break tie - randomly choose between current and best
                selected_variant = random.sample([var, selected_variant], 1)[0]

    return selected_variant

One drawback of this algorithm is that it does not adapt very well to the data collected. For example, if a clear winner with a much higher conversion rate emerges early in the experiment, the algorithm will under-exploit it and continue to serve the inferior variants with a total probability of \frac{\epsilon(K-1)}{K} (with K=3 and \epsilon=0.1, the two inferior variants are still served about 6.7% of the time, no matter how poorly they perform). In addition, one would also have to decide on the value of \epsilon, which can be a fairly arbitrary choice.

Thompson Sampling

In Thompson Sampling, we sample \hat{\theta}_k \sim Beta(\alpha_k,\beta_k) for each variant k and serve the variant that corresponds to the largest \hat{\theta} sampled. \alpha_k and \beta_k here are determined by the Variant's counts at that point in time (\alpha_k = conversions, \beta_k = impressions - conversions).

def thompson_sampling(variant_vals):
    """Thompson Sampling algorithm implementation 
    on Variant model values."""

    selected_variant = None
    best_sample = 0.0
    for var in variant_vals:
        sample = np.random.beta(
            max(var['conversions'], 1), 
            max(var['impressions'] - var['conversions'],1 )
        )
        if sample > best_sample:
            best_sample = sample
            selected_variant = var

    return selected_variant

It turns out that this process overcomes the problems faced by epsilon-greedy. On one hand, the better variants get sampled more often and will have smaller variances. Poorer performing variants get shown less often and thus have larger variances, due to their lower impression and conversion counts. The larger variances in turn increase the probability that those variants get shown.

In addition, if the estimated means differ greatly, the poorer performing variant will get sampled much less, even if its variance is high. This means that the worse a variant performs, the less likely it is to be explored. In this respect, the thompson sampling algorithm does not under-exploit the best performing variant.

Upper Confidence Bound (UCB1)

k^* = argmax_k\left(\hat{\mu}_k + \sqrt{\frac{2\ln N}{N_k}}\right)

In UCB1, the agent selects the variant with the highest UCB1 score. The score is calculated as the estimated mean plus an extra exploration term \sqrt{\frac{2\ln N}{N_k}}, where N is the total number of impressions across all variants and N_k is the number of impressions for variant k. This added term allows the agent to give greater weight to variants that are relatively less explored: if N is high relative to N_k, the term grows larger and increases the score.

def UCB1(variant_vals):
    """Upper Confidence Bound algorithm implementation 
    on Variant model values.
    """

    selected_variant = None
    best_score = 0.0
    total_impressions = sum([ var['impressions'] for var in variant_vals ])
    for var in variant_vals:
        score = var['conversion_rate'] + np.sqrt(2*np.log(total_impressions)/var['impressions'])
        if score > best_score:
            best_score = score
            selected_variant = var
        elif score == best_score:
            # Tie breaker
            selected_variant = random.sample([var, selected_variant], 1)[0]

    return selected_variant

Similar to Thompson Sampling, the larger the difference in means, the less often the inferior variant will be shown, since a greater value of the exploration term is required for that variant to be selected. As such, the testing behaviour of UCB1 closely resembles Thompson Sampling, with the best-performing variant accounting for a larger share of the pages served.

Putting it all together...

With all the strategies at hand, the ab_assign function serves as a wrapper for the algorithms, and returns the selected variant based on the chosen strategy.

def ab_assign(campaign, algo='thompson', eps=0.1):

    """ Main function for A/B testing. Used in Django Views.
    Determines the HTML template to serve for a given request
    (i.e., Variant A/B/C ). 

    Common explore-exploit algorithms for Bayesian A/B testing 
    are available, and are used to determine the stochastic 
    assignment of the variant to the user/request.
    """

    variants = campaign.variants.all().values(
        'code',
        'impressions',
        'conversions',
        'conversion_rate',
        'html_template',
    ) 
    if algo == 'thompson':
        assigned_variant = thompson_sampling(variants)
    elif algo == 'egreedy':
        assigned_variant = epsilon_greedy(variants, eps=eps)
    elif algo == 'UCB1':
        assigned_variant = UCB1(variants)
    else:
        # 'uniform' or any unrecognised value falls back to random assignment
        assigned_variant = random.sample(list(variants), 1)[0]

    return assigned_variant

Sticky sessions & Repeat Observations

In our experimental design, we have assumed that observations are independent of each other. However, in our current setup, this assumption is easily violated by the same user visiting the page multiple times. The first problem is that the user may see different versions across repeat visits. We can mitigate this issue by creating sticky sessions: we record the assigned variant in the user's session variable, and show the user the same variant that was assigned on their first visit. We can modify our ab_assign function to do this.

def ab_assign(request, campaign, default_template, 
            sticky_session=True, algo='thompson', eps=0.1):

    """ Main function for A/B testing. Used in Django Views.
    Determines the HTML template to serve for a given request
    (i.e., Variant A/B/C ). 
    """

    variants = campaign.variants.all().values(
        'code',
        'impressions',
        'conversions',
        'conversion_rate',
        'html_template',
    ) 

    # Sticky sessions - User gets previously assigned template
    campaign_code = str(campaign.code)
    if request.session.get(campaign_code):
        if request.session.get(campaign_code).get('code') and sticky_session:
            return request.session.get(campaign_code)
    else:
        # Register new session variable
        request.session[campaign_code] = {
            'i': 1, # Session impressions
            'c': 0, # Session conversions
        }

    if algo == 'thompson':
        assigned_variant = thompson_sampling(variants)
    elif algo == 'egreedy':
        assigned_variant = epsilon_greedy(variants, eps=eps)
    elif algo == 'UCB1':
        assigned_variant = UCB1(variants)
    else:
        # 'uniform' or any unrecognised value falls back to random assignment
        assigned_variant = random.sample(list(variants), 1)[0]

    # Record assigned template in session variable
    request.session[campaign_code] = {
        **request.session[campaign_code], 
        **assigned_variant 
    }
    request.session.modified = True

    return assigned_variant

Unless the session expires (two weeks by default in django) or the user clears their cookies, the same user will be shown only one version consistently when sticky sessions are turned on.
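If two weeks is not the window you want for sticky assignments, Django's SESSION_COOKIE_AGE setting (1209600 seconds, i.e., two weeks, by default) can be adjusted; for example:

# settings.py
SESSION_COOKIE_AGE = 60 * 60 * 24 * 7   # keep assignments sticky for one week instead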

The second problem relates to the uniqueness of observations. Some users may contribute a disproportionate number of impressions and conversions simply by being more active on your application. Whether this is actually a problem depends entirely on the context of the experiment. Nevertheless, there may be situations where we only want to register a maximum of one impression and one conversion per user. One such situation could be a test of a site-wide design change (e.g., a change in logo or corporate colours) measured against a cart checkout event. In that scenario, a user may generate many impressions as they browse the site but will only register one conversion upon checkout.

Addressing this second problem will depend on how we collect our responses and update the beta parameters of the Variants, which will be discussed in the following section.

Updating Beliefs

We now need to write the code to collect the information we need from our visitor on the test page and update the parameters of each Variant's posterior distribution. We'll create an API endpoint that accepts POST requests indicating whether an impression or a conversion occurred in the current session. This API will be called from Javascript via an AJAX request.

For the API backend, we'll use the Django Rest Framework.

serializers.py

from rest_framework import serializers


class ABResponseSerializer(serializers.Serializer):

    campaign_code = serializers.CharField(max_length=36) 
    variant_code = serializers.CharField(max_length=32)
    register_impression = serializers.BooleanField()
    register_conversion = serializers.BooleanField()
    params = serializers.JSONField(required=False)

api.py

In essence, this API updates the Variant's overall impression and conversion counts. In doing so, it also updates the session variable to record the number of impressions and conversions for the particular session.

from rest_framework import status
from rest_framework.response import Response
from rest_framework.views import APIView

# adjust import paths to your project layout
from abtest.models import Campaign, Variant
from abtest.serializers import ABResponseSerializer


class ABResponse(APIView):

    """ API to collect responses from users.
    This API registers the impressions generated from page views 
    and is also used to register conversions.
    AJAX call to be made using Javascript in the A/B test page. 
    """

    def post(self, request, format=None):

        serializer = ABResponseSerializer(data=request.data)

        if serializer.is_valid(raise_exception=True):
            campaign_code = serializer.data.get('campaign_code')
            variant_code = serializer.data.get('variant_code')
            register_impression = serializer.data.get('register_impression')
            register_conversion = serializer.data.get('register_conversion')
            params = serializer.data.get('params')

            try:
                campaign = Campaign.objects.get(code=campaign_code)
            except Campaign.DoesNotExist:
                return Response(
                    {'details':'Campaign not found'}, 
                    status=status.HTTP_404_NOT_FOUND
                )

            if not campaign.active:
                return Response({'details':'Campaign is inactive'})

            try:
                variant = Variant.objects.get(
                    code=variant_code,
                    campaign=campaign,
                )
            except Variant.DoesNotExist:
                return Response(
                    {'details':'Variant not found'}, 
                    status=status.HTTP_404_NOT_FOUND
                )

            session_vars = request.session.get(campaign_code)
            if not session_vars:
                return Response(
                    {'details':'Unable to find session variables for campaign.'}, 
                    status=status.HTTP_404_NOT_FOUND
                ) 

            try:
                session_impressions = session_vars['i']
                session_conversions = session_vars['c']
            except KeyError:
                return Response(
                    {'details':'Unable to retrieve session impressions and conversions'}, 
                    status=status.HTTP_404_NOT_FOUND
                )

            ## Update variant impressions / conversions
            if campaign.allow_repeat:
                # When repeated impressions and conversions are allowed for 
                # The same user/session
                variant.impressions = variant.impressions + int(register_impression)
                variant.conversions = variant.conversions + int(register_conversion)
                variant.conversion_rate = variant.conversions / variant.impressions
                variant.save()
            else:
                # Not allowing repeated impressions / conversions

                if session_impressions == 1:
                    # Add to variant impressions as this is first impression
                    variant.impressions = variant.impressions + int(register_impression)
                if session_conversions == 0 and register_conversion:
                    # Add to variant conversions as this is first conversion
                    variant.conversions = variant.conversions + int(register_conversion)

                variant.conversion_rate = variant.conversions / variant.impressions
                variant.save()
                
            ## Update session impressions / conversions
            request.session[campaign_code]['i'] = session_impressions + int(register_impression)
            request.session[campaign_code]['c'] = session_conversions + int(register_conversion)
            request.session.modified = True

            return Response({'details':'Response registered'})

In our urls.py, we'll register this API under the url path /api/experiment/response.
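A minimal urls.py entry might look like the sketch below; the import path for ABResponse is an assumption about the project layout.

from django.urls import path

from abtest.api import ABResponse   # assumed location of the APIView above

urlpatterns = [
    path('api/experiment/response', ABResponse.as_view(), name='ab_response'),
]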

Registering Impressions and Conversions

On the client side, once the page's DOM content has loaded, we'll make an AJAX call to the API to register an impression. The Javascript code is as follows:

function submitResponseAB(campaign_code, variant_code, 
  register_impression, register_conversion, params){

  // AJAX Post API call function to submit response
  // To register if an impression or a conversion has been made

  var params = typeof params !== 'undefined' ? params : {};
  var xhttp = new XMLHttpRequest();
  xhttp.open('POST', '/api/experiment/response', true);
  xhttp.setRequestHeader('Content-Type', 'application/json');
  xhttp.setRequestHeader('X-CSRFToken', getCookie('csrftoken'));
  xhttp.send(
    JSON.stringify({
      campaign_code : campaign_code,
      variant_code : variant_code,
      register_impression : register_impression,
      register_conversion : register_conversion,
      params : params,
    })
  );
}

function getCookie (name) {
  var value = '; ' + document.cookie;
  var parts = value.split('; ' + name + '=');
  if (parts.length === 2) {
    return parts
      .pop()
      .split(';')
      .shift()
  }
}

In the HTML template of each variant (in this case, variant A), the AJAX call is made only when the DOM content is loaded, to ensure that the user has at least seen the page before counting it as an impression.

<script>
  document.addEventListener("DOMContentLoaded", function(event) { 
    // Register an impression only when DOM content is loaded
    submitResponseAB('', 'A', true, false);
  });
</script>

To register conversions, we can simply make the function call after the event of interest. In our example, we register a conversion after an add to cart event.

function addToCart() {
  // Assuming clicking the add to cart function is deemed a conversion
  // ...Regular code goes here...
  
  // Register a conversion 
  submitResponseAB('', 'A', false, true);
}

Stopping Rules

At a certain point in the experiment, we'll need to ask ourselves when we can stop the test. This may not matter in a frequentist setting, where the sample size is fixed or the A/B user groups have been pre-determined beforehand. However, in the bayesian context, the sequential updating of the agent's beliefs may reveal a clear winning variant earlier than expected. As such, we can and should apply stopping rules to determine the end of an experiment.

A good place to start would be to ask: at a given point in the experiment, what is the probability that variant X indeed has a better conversion rate than variant Y, where X \sim Beta(a, b) and Y \sim Beta(c, d)? Here, groundwork has been laid by Evan Miller and Chris Stucchio in computing this probability in closed form.

P(X>Y) = h(a,b,c,d) = 1 - \sum\limits^{c-1}_{j=0} \frac{B(a+j,b+d)}{(d+j)B(1+j,d)B(a,b)}

Intuitively, the experimenter can set a threshold and stop the experiment when P(X>Y) is larger than, say, 0.9. This can be interpreted as a 90% probability that variant X is superior to Y. Note that this interpretation is a lot more straightforward than p-values or confidence intervals from the frequentist approach.

However, one drawback of this metric is that it does not take into account the cost of making a terrible mistake. In other words, it treats small errors and large errors as equally bad. Instead, we can do better by asking: what is our expected loss in conversion rate if we pick variant X over Y when in fact Y is the better variant? The experimenter can pick a loss threshold (i.e., the maximum loss in conversion rate the experimenter is willing to bear if the selected variant turns out to be the inferior one) and stop the experiment once the expected loss falls below that threshold.

The expected loss function as derived by Chris Stucchio:

\frac{B(a+1,b)}{B(a,b)}h(a+1,b,c,d)- \frac{B(c+1,d)}{B(c,d)}h(a,b,c+1,d)

import numpy as np
from scipy.special import betaln

def h(a, b, c, d):
    """Closed form solution for P(X>Y)
    Where X ~ Beta(a,b), Y ~ Beta(c,d)  
    """

    total = 0.0 
    for j in range(c):
        total += np.exp(betaln(a+j, b+d) - np.log(d+j) - betaln(1+j, d) - betaln(a, b))
    return 1 - total

def loss(a, b, c, d):
    """Expected loss function built on P(X>Y)
    Where X ~ Beta(a,b), Y ~ Beta(c,d)  
    """

    return np.exp(betaln(a+1,b)-betaln(a,b))*h(a+1,b,c,d) - \
           np.exp(betaln(c+1,d)-betaln(c,d))*h(a,b,c+1,d)

We won't cover how the loss function is incorporated into the experiment, as this can be highly situational. One possibility is to invoke a function that checks whether the expected loss has fallen below the user-defined threshold at scheduled times (e.g., once a day using a cron job), or at fixed checkpoints in the experiment (e.g., once every X page views), as sketched below.
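As an illustration, a periodic check comparing the current top two variants could look like the sketch below, reusing the h()/loss() helpers above. The function name, the ordering query, and the threshold are illustrative, and the argument order assumes loss(a, b, c, d) = E[max(\theta_X - \theta_Y, 0)], i.e., the expected cost of deploying Y.

def check_stopping_rule(campaign, loss_threshold=0.001):
    # Current leader and runner-up by observed conversion rate
    leader, challenger = campaign.variants.order_by('-conversion_rate')[:2]

    def shape_params(variant):
        # alpha = conversions, beta = impressions - conversions (floored at 1),
        # mirroring how the Variant posteriors are parameterised elsewhere
        return max(variant.conversions, 1), max(variant.impressions - variant.conversions, 1)

    a, b = shape_params(challenger)   # X ~ Beta(a, b): the variant we would discard
    c, d = shape_params(leader)       # Y ~ Beta(c, d): the variant we would keep

    # Expected drop in conversion rate if we deploy the leader but the challenger
    # was in fact better; stop once this is tolerably small
    return loss(a, b, c, d) < loss_threshold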

Summary

With that, we have successfully implemented a basic bayesian A/B testing framework for our application. From the back-end models to the front-end AJAX calls, this implementation is my take on a straightforward way to perform bayesian A/B testing in django. Still, this is by no means a complete solution, but a starting point. There are other issues to address (e.g., stopping rules for three or more variants), and other improvements that can be made on top of this basic implementation (e.g., logging page requests for the test variants). For completeness' sake, one could also run the standard frequentist significance tests once the experiment ends. Nevertheless, I hope this post will be useful to anyone looking to build a more robust A/B testing framework for their application.