Since I joined Stack Exchange as a Data Scientist in June, one of my first projects has been reconsidering the A/B testing system used to evaluate new features and changes to the site. Our current approach relies on computing a p-value to measure our confidence in a new feature.

Unfortunately, this leads to a common pitfall in performing A/B testing: the habit of looking at a test while it's running, then stopping the test as soon as the p-value drops below a particular threshold. This seems reasonable, but in doing so you're making the p-value no longer trustworthy, and making it substantially more likely that you'll implement features that offer no improvement. How Not To Run an A/B Test gives a good explanation of this problem.

One solution is to pre-commit to running your experiment for a particular amount of time, never stopping early or extending it further. But this is impractical in a business setting, where you might want to stop a test early once you see a positive change, or keep a not-yet-significant test running longer than you'd planned. (For more on this, see A/B Testing Rigorously (without losing your job).)

An often-proposed alternative is to rely on a Bayesian procedure rather than frequentist hypothesis testing. It is often claimed that Bayesian methods, unlike frequentist tests, are immune to this problem, and allow you to peek at your test while it's running and stop once you've collected enough evidence. For instance, the author of "How Not To Run an A/B Test" followed up with A Formula for Bayesian A/B Testing:

> Bayesian statistics are useful in experimental contexts because you can stop a test whenever you please and the results will still be valid.

Similarly, Chris Stucchio writes in Easy Evaluation of Decision Rules in Bayesian A/B Testing:

> (In other words, it is immune to the "peeking" problem described in my previous article).

Swrve offers a similar justification in Why Use a Bayesian Approach to A/B Testing:

> As we observe results during the test, we update our model to determine a new model (a posteriori distribution) which captures our belief about the population based on the data we've observed so far. At any point in time we can use this model to determine if our observations support a winning conclusion, or if there still is not enough evidence to make a call. This A/B testing procedure has two main advantages over the standard Student's T-Test. The first is that, unlike the Student's T-Test, you can stop the test early if there is a clear winner or run it for longer if you need more samples.

We were interested in switching to this method, but we wanted to examine this advantage more closely, and thus ran some simulations and analyses. You can find the knitr code for this analysis here, along with a package of related functions here.
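The cost of peeking at a frequentist test is easy to demonstrate empirically. The sketch below (not part of the original analysis; the 10% base conversion rate, peek interval, horizon, and sample sizes are arbitrary choices for illustration) runs A/A tests, where both groups share the same true conversion rate, and measures how often a repeatedly inspected two-proportion z-test falsely declares a winner, compared with a single look at a fixed horizon:

```python
import math
import random

def two_sample_p(succ_a, n_a, succ_b, n_b):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    pooled = (succ_a + succ_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = abs(succ_a / n_a - succ_b / n_b) / se
    # survival probability of |Z| under the standard normal, doubled
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def false_positive(peek_every, max_n=2000, rate=0.1, alpha=0.05, rng=random):
    """Run one A/A test (no true difference between groups); return True if
    any peek at the p-value (falsely) reaches significance."""
    succ_a = succ_b = 0
    for n in range(1, max_n + 1):
        succ_a += rng.random() < rate
        succ_b += rng.random() < rate
        if n % peek_every == 0 and two_sample_p(succ_a, n, succ_b, n) < alpha:
            return True  # stopped early on a spurious "winner"
    return False

random.seed(42)
trials = 1000
peeking = sum(false_positive(peek_every=100) for _ in range(trials)) / trials
fixed = sum(false_positive(peek_every=2000) for _ in range(trials)) / trials
print(f"false positive rate, peeking every 100 observations: {peeking:.1%}")
print(f"false positive rate, single look at n=2000: {fixed:.1%}")
```

The single-look test holds its nominal 5% false positive rate, while stopping at the first significant peek inflates it well beyond that, which is exactly the problem described above.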
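For the Bayesian procedure the quoted posts describe, conversion rates are commonly modeled with a Beta-Binomial: a Beta prior is updated with the observed successes and failures, and the decision rests on the posterior probability that one variant beats the other. A minimal sketch, assuming uniform Beta(1, 1) priors and made-up conversion counts (this is illustrative, not Stack Exchange's actual procedure):

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=50_000, rng=random):
    """Monte Carlo estimate of P(rate_B > rate_A) under independent
    Beta(1, 1) priors updated with the observed conversion counts."""
    wins = 0
    for _ in range(draws):
        # posterior for each variant: Beta(1 + successes, 1 + failures)
        pa = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        pb = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += pb > pa
    return wins / draws

random.seed(0)
# made-up counts: 120/1000 conversions for A, 140/1000 for B
p = prob_b_beats_a(120, 1000, 140, 1000)
print(f"P(B > A) = {p:.3f}")
```

At any point during a test you can recompute this probability with the counts so far and stop once it crosses a decision threshold; that is the "peek whenever you like" usage the quotes endorse, and the claim the simulations in this post set out to examine.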