In this post we explore A/B testing techniques and their applications, particularly in online marketplaces and in situations where standard methods fail.

## The fundamentals of A/B testing

A/B testing, at its core, involves comparing two or more groups (A and B) to determine the effectiveness of a specific treatment or change. A “treatment” could be anything from a new pricing model to a redesigned user interface.

In some situations, such as web design, it is simple to split your visitors into group A or B by diverting traffic on arrival, which makes randomisation straightforward. This type of randomisation isn’t always possible, or suitable, when testing the treatment effects of data science models. For example, if you need to apply your model to a predefined subgroup, how do you design your control? We will examine three techniques that can prove useful when standard A/B testing is not available.

### Successful A/B testing relies on:

- **Randomisation:** Ensuring both groups are similar.
- **Independence:** Individual behaviours within each group don’t influence each other.
- **Sufficient sample size:** Having enough data points for statistically significant results.
- **No carryover effects:** Previous exposure to treatments doesn’t impact future responses.
- **No contamination:** External factors don’t disproportionately influence one group.

Upon visiting the client’s site, each user is randomly allocated to either test or control. The test group receive the treatment and the control group do not.
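As a minimal sketch of this allocation step, the snippet below assigns each visitor to test or control by hashing a user ID. The function name, the experiment label and the 50/50 split are assumptions made for illustration; they are not details of any particular client setup. Hashing, rather than drawing a fresh random number on every visit, keeps a returning user in the same group.

```python
import hashlib


def assign_group(user_id: str, experiment: str = "price-guidance-demo") -> str:
    """Deterministically map a user to 'test' or 'control' (hypothetical helper)."""
    # Salting with the experiment name gives each experiment its own split.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 100  # pseudo-random bucket in 0..99
    return "test" if bucket < 50 else "control"


# Illustrative user IDs only; check that the split is close to 50/50.
groups = [assign_group(f"user-{i}") for i in range(10_000)]
test_share = groups.count("test") / len(groups)
print(f"share of users in test: {test_share:.3f}")
```

Changing the `50` threshold would give an uneven split, for example a small test group for a risky change.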

## Navigating real-world challenges

In a perfect world, these assumptions would always hold true. However, real-world scenarios, especially within online marketplaces, introduce complexities that demand adaptable solutions.

### Case study: Marketplace price guidance

In an online marketplace, sellers can set their own prices on items. If an item sells, the marketplace host receives a commission on the sale. If it does not sell, neither the marketplace host nor the seller makes any money. The marketplace host may want to test the impact of offering price guidance to its sellers, so that they can set prices in a way that maximises the likelihood of a sale.

The aim for the host here is to increase the likelihood of a sale — and therefore the likelihood of a commission for the host.

A naïve approach to testing the effectiveness of price guidance in increasing revenue for the host would be to randomly split individual sellers into test and control groups and only apply price guidance to the test group.

However, this violates the independence assumption:

- **Cannibalisation:** Buyers might shift purchases from control sellers (with potentially higher prices) to test sellers, creating a false positive. Search results for items will return a mix of items in test and control. We may see increased sales for the test group, but at no net positive to the marketplace host. These sales may have happened anyway. We’d end up measuring an increase in sales due to the test group that is not truly present.
- **Spillover effects:** Control sellers might observe price changes from test sellers and adjust their pricing strategies, further skewing the results. They are no longer a good control.

## Cluster-based randomisation to the rescue

A smarter approach utilises cluster-based randomisation. Instead of individual sellers, entire product categories become the unit of randomisation.

We make the argument that while a customer may be willing to switch between shopping for an iPhone 15 and an iPhone 15 Pro, they are much less likely to switch between shopping for an iPhone 15 and a hamster cage.

Splitting at this level minimises cannibalisation and spillover effects, as buyers are less likely to switch between unrelated categories. This lets us measure treatment effects more accurately and carry out our causal inference with more confidence, reducing the risk of inferring an improvement in the test cohort over control when there actually isn’t one.
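A minimal sketch of cluster-based assignment, assuming each seller lists in exactly one category. The seller names, categories and seed are made up for illustration. The category, not the seller, is randomised, and every seller inherits the arm of their category.

```python
import random


def assign_categories(categories, seed=42):
    """Randomly split whole categories between test and control."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    unique = sorted(set(categories))
    rng.shuffle(unique)
    half = len(unique) // 2
    return {cat: ("test" if i < half else "control") for i, cat in enumerate(unique)}


# Hypothetical sellers mapped to the category they list in.
seller_category = {
    "seller_1": "smartphones",
    "seller_2": "smartphones",
    "seller_3": "pet supplies",
    "seller_4": "pet supplies",
    "seller_5": "garden tools",
    "seller_6": "board games",
}

category_arm = assign_categories(seller_category.values())
# Each seller inherits the arm of their category, so close substitutes stay together.
seller_arm = {seller: category_arm[cat] for seller, cat in seller_category.items()}
print(seller_arm)
```

One design consequence worth noting: the effective sample size is now closer to the number of categories than the number of sellers, so the analysis should account for clustering.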

## Beyond straightforward A/B testing: Synthetic control

What if you can’t conduct a randomised experiment? Consider synthetic control analysis, a powerful technique for situations where historical data is abundant.

Synthetic control (SC) is a causal inference technique that is particularly useful when you have a single treated unit and very few control units, but repeated observations of each unit through time.

The **canonical use case** is when you want to know the **impact of the treatment in one geography** (like a state) and you use the **other untreated states as controls**. Here we estimate the effect of Proposition 99 (a ballot proposition passed in 1988 that increased cigarette tax in California) on cigarette sales.

By using optimisation techniques, we can find an optimal set of weights combining control units to replicate the behaviour of California before the introduction of the policy.
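To make the weight-finding step concrete, here is a minimal pure-Python sketch. It assumes the common formulation in which weights are non-negative, sum to one, and minimise the squared pre-treatment gap between the treated unit and the weighted controls. The solver (projected gradient descent) and all numbers are illustrative choices; this is toy data, not the Proposition 99 data, and real analyses typically also match on covariates.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : w_j >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    running, theta = 0.0, 0.0
    for i, u_i in enumerate(u, start=1):
        running += u_i
        candidate = (running - 1.0) / i
        if u_i - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def fit_weights(treated, controls, steps=5000):
    """Projected gradient descent on the pre-treatment squared error."""
    n = len(controls)
    w = [1.0 / n] * n  # start from equal weights
    # Conservative step size: 1 / (2 * squared Frobenius norm of the controls).
    step = 1.0 / (2.0 * sum(x * x for series in controls for x in series))
    for _ in range(steps):
        residual = [
            y - sum(w[j] * controls[j][t] for j in range(n))
            for t, y in enumerate(treated)
        ]
        grad = [-2.0 * sum(r * x for r, x in zip(residual, controls[j])) for j in range(n)]
        w = project_to_simplex([w_j - step * g for w_j, g in zip(w, grad)])
    return w


# Toy pre-treatment series for three hypothetical control units.
controls_pre = [
    [10, 12, 14, 16, 18, 20],
    [20, 18, 17, 15, 14, 12],
    [5, 9, 6, 10, 7, 11],
]
# Constructed as 0.7 * first control + 0.3 * second control.
treated_pre = [13.0, 13.8, 14.9, 15.7, 16.8, 17.6]

weights = fit_weights(treated_pre, controls_pre)

# After treatment, the weighted controls stand in for the untreated counterfactual.
controls_post = [[22, 24], [11, 10], [8, 12]]
treated_post = [15.0, 15.5]
synthetic_post = [
    sum(w * series[t] for w, series in zip(weights, controls_post))
    for t in range(len(treated_post))
]
effect = [obs - syn for obs, syn in zip(treated_post, synthetic_post)]
print([round(w, 3) for w in weights], [round(e, 2) for e in effect])
```

The estimated effect in each post-treatment period is the gap between the observed treated series and its synthetic counterpart.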

## Sequential A/B testing: Battling impatient stakeholders

Before conducting any A/B test, you will need to determine your required sample size.

This number is influenced by:

- **Minimum detectable effect:** The smallest change in our success variable we will be able to detect, given it is there
- **Significance level:** The percent of the time a difference will be detected, assuming one does not exist (false positive)
- **Statistical power:** The percent of the time the minimum effect size will be detected, assuming it exists (true positive)
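As a rough sketch of how these three inputs combine, the function below uses the standard normal-approximation formula for comparing two conversion rates with a two-sided test. The choice of test, the function name and the example numbers (a 10% baseline and a 2 percentage point absolute uplift) are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(baseline_rate, minimum_detectable_effect, alpha=0.05, power=0.80):
    """Approximate observations needed per group for a two-sided two-proportion z-test."""
    p1 = baseline_rate
    p2 = baseline_rate + minimum_detectable_effect  # absolute uplift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance level
    z_power = NormalDist().inv_cdf(power)  # statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


n = sample_size_per_group(baseline_rate=0.10, minimum_detectable_effect=0.02)
print(f"observations needed per group: {n}")
```

Halving the minimum detectable effect roughly quadruples the required sample size, which is why small expected uplifts lead to long-running tests.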

The time taken to gather sufficient observations to declare statistical significance can sometimes be long, but stakeholders often crave quick insights. Sequential A/B testing offers a (potential) solution. Instead of waiting for a fixed sample size, you continuously analyse the data.

### The problem with peeking early

Stakeholders sometimes ask, mid-test, ‘I know you can’t say for sure yet, but how does it look?’ It is a mistake to draw conclusions before you have gathered your stated required number of observations. The following example shows why.

*Say we calculate that for our chosen significance level, we require 500 observations. The following table shows the four possible scenarios after 200 and 500 observations, and the conclusions made at the end of the experiment if we allow it to run its course.*

Scenarios 2 and 4 result in a significant effect detected.

*Say instead, that we observe each scenario and stop the experiment once we observe a significant result, or we reach 500 observations, whichever is faster. What we then see is that at the end of the experiment, we are making the opposite conclusion about scenario 3.*

In this situation, we conclude that scenario 3 shows a significant result, since we observed a ‘significant’ uplift at 200 observations. However, because we stopped before collecting sufficient observations, our reported significance level is incorrect. By giving ourselves more than one chance to see a significant result, we have inflated the false positive rate above the level we stated.
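A small A/A simulation illustrates the problem. In the sketch below both arms share the same conversion rate, so every ‘significant’ result is a false positive. The 20% conversion rate, 500 observations per arm, a look every 50 observations and the pooled z-test are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

Z_CRITICAL = NormalDist().inv_cdf(0.975)  # two-sided test at the 5% level


def is_significant(successes_a, successes_b, n_per_arm):
    """Two-sided pooled z-test for a difference in conversion rates."""
    pooled = (successes_a + successes_b) / (2 * n_per_arm)
    std_err = sqrt(2 * pooled * (1 - pooled) / n_per_arm)
    if std_err == 0:
        return False
    z = (successes_b - successes_a) / n_per_arm / std_err
    return abs(z) > Z_CRITICAL


def false_positive_rates(n_experiments=1000, n_per_arm=500, look_every=50, rate=0.2, seed=1):
    """Compare testing once at the end with stopping at the first significant look."""
    rng = random.Random(seed)
    fixed_hits = peeking_hits = 0
    for _ in range(n_experiments):
        a = b = 0
        flagged_at_any_look = False
        for i in range(1, n_per_arm + 1):
            a += rng.random() < rate  # both arms convert at the same rate
            b += rng.random() < rate
            if i % look_every == 0 and is_significant(a, b, i):
                flagged_at_any_look = True
        peeking_hits += flagged_at_any_look
        fixed_hits += is_significant(a, b, n_per_arm)
    return fixed_hits / n_experiments, peeking_hits / n_experiments


fixed_rate, peeking_rate = false_positive_rates()
print(f"test once at the end: {fixed_rate:.3f}")
print(f"stop at first significant look: {peeking_rate:.3f}")
```

The single end-of-test check should land near the nominal 5%, while the peeking rule flags noticeably more experiments, because each extra look is another chance for noise to cross the threshold.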

### Sequential testing procedure

Reference: https://www.evanmiller.org/sequential-ab-testing.html

- At the beginning of the experiment, choose a sample size 𝑁.
- Assign subjects randomly to the treatment and control, with 50% probability each.
- Track the number of incoming successes from the treatment group. Call this number 𝑇.
- Track the number of incoming successes from the control group. Call this number 𝐶.
- If 𝑇−𝐶 reaches 2√𝑁, stop the test. Declare the treatment to be the winner.
- If 𝑇+𝐶 reaches 𝑁, stop the test. Declare no winner.
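The steps above can be sketched in a few lines of Python. The function name, the return values and the way successes are streamed in are assumptions for illustration; only the two stopping rules come from the procedure itself.

```python
from math import sqrt


def sequential_test(successes, n):
    """Apply the stopping rules to a stream of successes.

    `successes` yields "treatment" or "control" for each incoming success.
    Returns (decision, successes_seen).
    """
    threshold = 2 * sqrt(n)
    t = c = 0
    for group in successes:
        if group == "treatment":
            t += 1
        else:
            c += 1
        if t - c >= threshold:  # treatment is far enough ahead
            return "treatment wins", t + c
        if t + c >= n:  # budget of successes used up
            return "no winner", t + c
    return "still running", t + c


# Every success comes from treatment: stops as soon as T - C hits 2 * sqrt(100) = 20.
print(sequential_test(["treatment"] * 100, n=100))  # ('treatment wins', 20)

# Successes alternate between arms: T - C never gets close, so the test runs to N.
print(sequential_test(["treatment", "control"] * 50, n=100))  # ('no winner', 100)
```

Note that the counters track successes (conversions), not visitors, so 𝑁 is a budget of total conversions across both groups.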

The difference 𝑑 = 𝑇−𝐶 can be described as a *one-dimensional random walk*.

If the treatment has the same conversion rate as the control, the random walk will be symmetrical: because the next success is as likely to come from the treatment as from the control, after each success 𝑑 is just as likely to increase as it is to decrease.

But if the conversion rates are not equal, the random walk will tend to be biased in one direction or the other.

This method can reduce the number of required observations by over 50%, without having to make assumptions about the distributions of treatment effects.

In this simulation, the test group has a 30% uplift over control. The stated sample size, 𝑁, is 200, so if 𝑇−𝐶 reaches 2√200 ≈ 28.3 we can stop the experiment and declare the test cohort the winner. We can see it reaches this value at approximately 170 observations, a saving of 30 observations (15%).

In this simulation, the test group and control group have the same conversion rate. With the same stated sample size as above, the difference 𝑇−𝐶 never reaches our cutoff value, so in this example we are not able to stop the test early.

## Conclusion

A/B testing is a powerful tool for proving the value of data science work. However, it’s essential to be mindful of real-world complexities and adapt your approach accordingly. Whether leveraging cluster-based randomisation, synthetic control analysis, or sequential testing, the key lies in choosing the technique that best suits your specific needs and constraints.

A well-run A/B test tells you with confidence whether your project was a success. If it was, you can stop dev work on the current model and move it to a full productionised roll-out. If not, you can iterate on the current model, or kill the project early and move on to the next project in your backlog.

If you're looking for help implementing a successful A/B testing programme in your team—or if you'd like to discuss how AI can help your business run more efficiently—reach out today.

**References and further reading**

https://mixtape.scunning.com/10-synthetic_control

https://www.evanmiller.org/sequential-ab-testing.html

https://innovation.ebayinc.com/tech/research/the-design-of-a-b-tests-in-an-online-marketplace/