Wednesday, June 17, 2020

Intro Statistics for RPGs: The Wheaton Dice Curse

I watched Critical Role when I first got into RPGs, though I didn't get too far into the series. Still, I remember the episodes guest-starring Will Wheaton, and the 'Wheaton Dice Curse', his tendency to get very bad rolls, including far more natural 1s than should be, well, natural. I looked it up, and found the folks over at Critical Role Stats (which is a thing that exists, apparently) had already addressed it, although their analysis was very shallow to me. 

Given that I took an intro stats course this past quarter, I thought it would be worthwhile to give the topic a deeper look, and in doing so both refresh my own knowledge and teach y'all some basic stats. 

Today's lesson will be introducing variance, Bernoulli and Binomial distributions, p-values, elementary hypothesis testing, z-scores and the CLT. I'll be using RStudio, a free dev environment for the R programming language. If you have it on your computer, I'll put down the code as we go so you can follow along. Otherwise, I'll just show my work, and you can take my word that the program spits out what it does. For anyone wanting to learn some basic stats on their own time, I recommend OpenIntroStats, which is a free digital textbook. It's what I used in my college class, it's well written and clear, and has  ton of practice problems. 

Examining the Dataset

The dataset is Wheaton's rolls over two Critical Role sessions, in which he rolled 54 times. Not great, but I've used smaller, and the measurements we'll be using take scale into account. CR Stats made a histogram of the frequency of each number on the d20, from Wheaton's 54 separate rolls. 


This is pretty far off our expected frequencies. In particular, the number of natural 1s and 2s are way off. We'll copy down the numbers into a list, and apply the c() function to turn them into a vector. Then we'll assign it to an easy to remember and convenient name to call it later. We can use the var() function on our dataset to return the variance value. Then we can take the square root of the variance to find the standard deviation.

x<- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 6, 6, 7, 7, 8, 8, 9, 9, 9, 9, 11, 11, 12, 13, 13, 14, 14, 14, 15, 15, 17, 18, 19, 19, 20)
var(x)

Returns 32.90566, standard deviation 5.736346. Variation and standard deviation are both sequence invariant, so you could put it in the original order of rolls or order them like this, and get the same answer. However, these measures are sensitive to scale; if you maintained the same ratio of rolls, but over a larger number of rolls, the variance would be lower. Interestingly, the variance of this dataset is lower than that of an expected set of similar length. It's not just biased towards lower numbers, it's more concentrated there are well.

Natural 1s

Predicting natural 1s is a Bernoulli variable. Bernoulli variables are those which can either fail or succeed a criterion with a certain probability, on a set number of trials. The classic Bernoulli variable is the fair coin flip, which is either heads or tails with a probability of 1/2. 

Our d20, landing on either a natural 1 (success) or anything else (a failure) is a Bernoulli 1/20 variable, Bern~(.05). So, how far out of line are Wheaton's natural 1s? Well, our expected number of natural 1s is simply the number of trials times the probability, so on 54 trials, we should expect 2.7 natural 1s. Instead, we get 10.

10 instead of 2.7 is a big difference, but how far off is it if Wheaton just had bad luck on this set of rolls? Here we'll do some basic hypothesis testing.

Hypothesis Testing the Curse

Hypothesis testing is where we set some hypotheses and a test metric for them with a preselected alpha-level, the threshold for p-values. In this case, we're going to perform a hypothesis test on our number of natural 1s, to see how rare this outcome really is under our null hypothesis.

Our null hypothesis, H0, is our assumption that the die is fair and Wheaton's rolls are luck alone. Our alternate hypothesis, HA or H1, is that Wheaton is, in reality cursed. First, we're going to select our alpha level. The alpha level is a number between 0 and 1 that reflects the likelihood of finding an example more extreme than our current observation by sheer chance. The lower the number, the less you expose yourself to false rejections.

A Digression on P-Values

You've likely heard of statistical significance, and linked it to a p-value less than .05. This means that if the observed p-value is less than .05, you are confident enough to reject the null, and the finding is considered 'statistically significant'. If not, you are unwilling to, and it isn't. What the .05 value reflects is a tolerance for false rejections 1 in 20 times. That is, if you had 20 studies with a p-value of .05, you'd expect one to be wrong. If you set your p-value to .01, you're expecting a false rejection by pure chance 1 in 100 times. 

The expected p-values are going to differ based on the size and rarity of the effect you're trying to measure; in high-energy physics, you're routinely looking at p-values on the order of 10^-6 or -7, due to the regularity of the systems being studied. In more chaotic fields, like biology, you'll get laughed out of the room if you have a p-value that small since very few experiments in the field have that kind of sensitivity and it means you probably screwed up. In the social sciences, even p<.05 is often too generous.

[This is a very basic explanation, and real life is quite different. Your p-value is only valuable if you've adequately controlled for confounders; if you haven't, you probably have some systematic bias in your data, and your study might be worthless if your mistake was really bad. P-hacking, which is the act of modifying your p-values by messing with your experiment, is rampant, especially in the social sciences, and far more than 1 in 20 such studies fail to replicate. If you want to dig in more, I recommend this somewhat technical post and this much more entertaining less technical post]

DND Meme - What a nat 1 dex save looks like - YouTube

Back to the Curse

How precise do we want to be? Well, dice rolling is mechanical physics, and while using math to determine which face will come up is very complex, the effects are regular and, like coin flipping, we expect the chaos to even out. There's also the possibility of the dice in use being biased towards lower numbers. But the hypothesis we're looking to test here is whether or not Will Wheaton is cursed, and this is going to depend a lot based on how much you believe in curses. So we'll use multiple confidence levels to illustrate.

Alice is a big believer in witchcraft, hexes and curses, and thinks Wheaton got what was coming to him for his portrayal of Wesley Crusher. Still, she's not certain this is the effect of a curse, so she'll put forward an alpha level of .05. 

Bob is a little superstitious, but is much less bullish on the topic than Alice. He'll put forward an alpha level of 1 in a thousand, .001. 

Cassandra is a hard-nosed secularist, and disdains the very notion of a curse. She puts forward an alpha level of one in ten million, .0000001. She doesn't expect to be convinced. 

And before we go forward, let's formalize our hypotheses.
H0, the null: Will Wheaton's d20s are regular polyhedral objects, and rolling natural 1s on them is adequately described by Bern~(.05).
HA, the alternate: Will Wheaton is cursed, and as a result his dice roll natural 1s with a probability greater than .05. 

Now to set up the actual measurement. Bernoulli variables are neat, but they're a measure focused around single observations. How do we analyze a collection of many Bernoulli variables?

We use the Binomial distribution. There's a few criteria we need to pass before using the Binomial is valid; the trials have to be independent, meaning that each die roll has no effect on another, the number of trials has to be fixed, each must be classified as either a success or failure, and must have the same probability across the trials. So long as the curse modifies the probability across the whole on a consistent basis, we're good.

The code looks like this: pbinom(observed value, number of trials, success probability)

We'll also add another argument at the end, lower=F, which means we're looking at the upper end of the distribution. It's equivalent to using 1-pbinom(), but it looks neater. We'll also use binomial definitions for mean and standard deviation to find those for later.

μ = np = .05*54 = 2.7
sd=sqrtn*(p*(1-p)) = sqrt(54*.05*(.95)) = 1.601562
Observed value = 10

pbinom(10, 54, .05, lower=F) 
Returns 6.315881e-05, or .00006315881

Also, if you don't have RStudio, you can still find this using a calculator, or even get pretty close with a pen and paper. We established that the mean in 2.7, and our observed value is 10. 10-2.7 = 7.3, and we can divide this by our standard deviation to get a value of around 4.5. This means that our observed value is about 4.5 standard deviations away from our estimate. You might hear this called '4.5 sigma', it means the same thing.

However you do it, you find a very big difference. 4.5 Sigma is WAAAY out of the ordinary, and the 6.315881e-05 figure tells us that the chances of getting this by chance is around 6 in a hundred thousand. 

Cassandra is unimpressed. It's less likely than she expected, but nowhere near unlikely enough to challenge her low prior belief in curses.

Bob and Alice's estimates, on the other hand, are burst open. They now believe that Will Wheaton is cursed, and have a fancy-shmancy number to show for it.

Nat 1 (@Natural_One1) | Twitter

Testing the Mean

Mind you, all we've shown here is that natural 1s are occurring with more frequency than should be expected. We can do the same for every other number if we want. The natural 2s are also interesting, and with the work done above, finding the answer for that roll by either method should be trivial. I've included the answer at the bottom. 

But that's just certain rolls. A more common form of hypothesis testing looks at the mean of these distributions. We can also look at these with similar methods. When hypothesis testing means, the hypotheses look like this.
H0: μ = expected mean (10.5 in our case)
HA: μ = greater or less than expected mean, depending on investigator belief.
The null is sometimes formalized as (observed mean) - (expected mean) = 0, which better fits the normal distribution which is most often used for this.

Expected mean for a null distribution: 10.5
mean(x) = 7

Huh. 7 exactly. So, if the true mean should be 10.5, how unlikely is it we get to 7 by chance instead? I'll quickly calculate this out using a method of normal approximation called the Central Limit Theorem and the pnorm() function, since we're looking at probabilities on the normal distribution instead. We'll use a formula to find a value called the z-score, which is stabilized to the standard normal distribution. Note also that we'll use the standard deviation we calculated at the start of the post. This can be risky with small sample sizes, since it might be calculated from unusual data, but as a rule of thumb it's usually permissible to do so with sample sizes greater than 30, and here we have 54 trials.

P(μ < 7) = P( z <((7-10.5)/(5.736346/sqrt(54)))) = P(z < -4.483628) 
We then take pnorm(-3.84311), without adding anything else, since we're looking at the lower bound and the z-value is made for the standard normal, and find...

pnorm(-4.483628) = 3.669227e-06

Another long shot, even further off than our estimate for the natural 1s, around 36 in 10 million. Though I'm not confident enough to talk about the general shape of the curse, Alice and Bob certainly have reason to believe in the effect.

In conclusion, if you have a prior belief in curses and hexes around or greater than the order of 10e-5, you have substantial reason to believe that the Wheaton dice curse, even from this single set of games in isolation, let alone a more general trend claimed by Wheaton. 

Oh, and if you've read through this post with a consistent sense of disgust that I'm even applying statistics to something so obviously ridiculous, I recommend you look into the natural control group that is parapsychological research as a similar touchstone. If a mediocre first year stats student can pull a 10e-5 order effect out of a tiny dataset relating to an absurd concept, you might consider being more skeptical about other random statistics you read about on the internet.

I hope you all enjoyed the post! I'm not 100% sure about all the numbers here, especially in the correspondence between the sigmas and the probabilities, but it's about right. If you have any questions, corrections, or want to see more like this in the future, be sure to write me down in the comments below! Until then, I'm in the middle of two family birthdays and have to get back to getting wasted.



pbinom(7, 54, .05, lower=F)
Returns 0.005180117
4.3/1.601562
Returns 2.684879 sigma

4 comments:

  1. Intriguing!
    I like how the rationalist/secularist "character" happens to be Cassandra, the cursed soothsayer :)

    ReplyDelete
  2. The curse is real and I have lived with it my whole board gaming life. It's not just the statistical outcome of the total rolls, but more importantly is the timing of the good and bad rolls. If I need lots of good rolls at one time it's non-existent, but when it's overkill and I only need one good die roll every die rolls well and evens out the statistical calculation. Experiencing chronic bad dice rolls is much like experiencing subtle racism. It's obviously there but you can't prove it, and people who don't experience it regularly have trouble believing it actually exists.

    ReplyDelete
    Replies
    1. I just found this and you make a really good comment.

      It's been 30 years since my stats class, but wouldn't a better test be total successes vs. fails. In D&D, even a roll of 2 could be a success depending on the difficulty of the test.

      What I saw in the Critical role episodes was a lot of failures, even with otherwise "good" rolls. Wheaton's character, a fighter dedicated to attacking, has a +10 to his attack. Yet, he failed to hit the majority of the time. IIRC, he made 11 or 12 attacks and missed with all but 2 of them. As an aside, one of the ones he hit with, he rolled a "1" for damage, but that wasn't a d20.

      I would submit that you would get an even more unlikely chance of those numbers being random considering it in that way.

      Delete
    2. That could be a better analysis, but it would also be more intensive. You'd have to record the success/failure of each roll as well as the probability of success on each one, which I don't think CRStats had. It would also end up being an analysis of many binomials, which would get tedious.

      Delete