# Chapter 15 Introduction to Bayesian statistics

In our reasonings concerning matter of fact, there are all imaginable degrees of assurance, from the highest certainty to the lowest species of moral evidence. A wise man, therefore, proportions his belief to the evidence.– David Hume^{79}.

The ideas so far in this book describe inferential statistics from the frequentist perspective. Almost every textbook given to undergraduate psychology students presents the opinions of the frequentist statistician as *the* theory of inferential statistics, the one true way to do things. The frequentist view of statistics dominated the academic field for most of the 20th century, and this dominance is even more extreme among applied scientists. It was and is the current practice among psychologists to use frequentist methods. Because frequentist methods are ubiquitous in scientific papers, every statistics student needs to understand them; otherwise, they will be unable to make sense of what those papers are saying! Unfortunately, the current practice in psychology is often misguided, and the reliance on frequentist methods is partly to blame. In this chapter, we will introduce Bayesian statistics, an approach that – some, including Danielle, think – is generally superior to the orthodox approach.

From a Bayesian perspective, statistical inference is about **belief revision**.

We start with a set of candidate hypotheses \(h\) about the world. We don’t know which of these hypotheses is true, but we do have some beliefs about which hypotheses are plausible and which are not. When we observe the data \(d\), we have to revise those beliefs. If the data are consistent with a hypothesis, our belief in that hypothesis is strengthened. Should the data be inconsistent with the hypothesis, our belief is weakened.

Consider the following reasoning problem:

Attila is carrying an umbrella. Do you think it will rain today?

In this problem, we have presented a single piece of data (\(d =\) Attila is carrying the umbrella), and we’re asking you to tell us your beliefs about whether it’s raining. You have two possible **hypotheses**, \(h\): either it rains today or it does not. How should you solve this problem?

## 15.1 Priors: what you believed before

Firstly, you must ignore everything about the umbrella and write down your pre-existing beliefs about rain. This is important: if you want to be honest about how your beliefs have been revised in the light of new evidence, then you *must* say something about what you believed before those data appeared!

So, what might you believe about whether it will rain today? You probably know that Attila lives in Budapest and that the weather here is mild and relatively dry. In fact, you might have decided to take a quick look on Wikipedia^{80} and discovered that Budapest gets an average of 7.3 days of rain across the 31 days of January, let’s say. Without knowing anything else, you might conclude that the probability of January rain in Budapest is about (\(\frac{7.3}{31}=\)) 23.5%, and the probability of a dry day is ($1 - 0.235 = $)76.5%. If this is what you believe about Budapest rainfall, then what we have written here is your **prior distribution**, written \(P(h)\):

Hypothesis \(h\) | Degree of Belief \(P(h)\) |
---|---|

Rainy day | 0.235 |

Dry day | 0.765 |

## 15.2 Likelihoods: theories about the data

To solve the reasoning problem, you need a theory about Attila’s behaviour. When does he carry an umbrella? You might guess that he only carries umbrellas on rainy days. But he’s also a busy professor. Might he be forgetful?

Let’s suppose^{81} that on rainy days he remembers his umbrella about 30% of the time. Let’s say that on dry days, he’s only about 5% likely to be carrying an umbrella. So you might write out a little table like this:

Hypothesis \(h\) | Umbrella | No umbrella |
---|---|---|

Rainy day | 0.30 | 0.70 |

Dry day | 0.05 | 0.95 |

It’s important to remember that each cell in this table describes your beliefs about what data \(d\) will be observed, *given* the truth of a particular hypothesis \(h\). This “conditional probability” is written \(P(d|h)\), which you can read as “the probability of \(d\) given \(h\)”. In Bayesian statistics, this is referred to as **likelihood** of data \(d\) given hypothesis \(h\).^{82}

## 15.3 The joint probability of data and hypothesis

At this point, all the elements are in place. Having written down the priors and the likelihood, you have all the information you need to do Bayesian reasoning. The question now becomes, *how* do we use this information?

Let’s start out with one of the rules of probability theory.

The rule talks about the probability that *two* things are true. In our example, you might want to calculate the probability that today is rainy (i.e. hypothesis \(h\) is true) *and* that Attila is carrying an umbrella (i.e. data \(d\) is observed). The **joint probability** of the hypothesis and the data is written \(P(d,h)\), and you can calculate it by multiplying the prior \(P(h)\) by the likelihood \(P(d|h)\). Mathematically, we say that:
\[
P(d,h) = P(d|h) P(h)
\]

So, what is the probability that today is a rainy day *and* that he carries an umbrella? The prior tells us that the probability of a rainy day is 23.5%, and the likelihood tells us that the probability of Attila carrying an umbrella on a rainy day is 30%. So the probability that both of these things are true is calculated by multiplying the two:

\[ \begin{array}{rcl} P(\mbox{rainy}, \mbox{umbrella}) & = & P(\mbox{umbrella} | \mbox{rainy}) \times P(\mbox{rainy}) \\ & = & 0.30 \times 0.235 \\ & = & 0.0705 \end{array} \]

In other words, before being told anything about what actually happened, you think there is a 7.1% probability that today will be a rainy day and that Attila will remember to bring an umbrella. However, there are, of course, *four* possible things that could happen, right? So let’s repeat the exercise for all four. If we do that, we end up with the following table:

Day | Umbrella | No umbrella | Total |
---|---|---|---|

Rainy | 0.0705 | 0.1645 | 0.2350 |

Dry | 0.0383 | 0.7268 | 0.7650 |

Total | 0.1088 | 0.8912 | 1.0000 |

This is a handy table, so it’s worth taking a moment to think about what all these numbers are telling us. First, notice that the row sums aren’t telling us anything new at all. For example, the first row tells us that if we ignore all this umbrella business, the chance that today will be a rainy day is 23.5%. That’s not surprising, of course: that’s our prior. The important thing isn’t the number itself: rather, the important thing is that it gives us some confidence that our calculations are sensible! Now take a look at the column sums, and notice that they tell us something that we haven’t explicitly stated yet.

In the same way that the row sums tell us the probability of rain, the column sums tell us the probability of Attila carrying an umbrella. Specifically, the first column tells us that the probability of him carrying an umbrella is, on average, 10.88%, regardless of the weather. Finally, notice that when we sum across all four logically-possible events, everything adds up to 1. In other words, what we have written down is a proper probability distribution defined over all possible combinations of data and hypothesis.

Now, because this table is so useful, let’s make sure you understand what all the elements correspond to and how they are written:

Day | Umbrella | No umbrella | Total |
---|---|---|---|

Rainy | \(P\)(Umbrella, Rainy) | \(P\)(No umbrella, Rainy) | \(P\)(Rainy) |

Dry | \(P\)(Umbrella, Dry) | \(P\)(No umbrella, Dry) | \(P\)(Dry) |

Total | \(P\)(Umbrella) | \(P\)(No umbrella) |

Finally, let’s use “proper” statistical notation. So we’ll let \(d_1\) refer to the possibility that you observe Attila carrying an umbrella, and \(d_2\) refers to you observing him not carrying one. Similarly, \(h_1\) is your hypothesis that today is rainy, and \(h_2\) is the hypothesis that it is not. Using this notation, the table looks like this:

\(d_1\) | \(d_2\) | ||
---|---|---|---|

\(h_1\) | \(P(h_1, d_1)\) | \(P(h_1, d_2)\) | \(P(h_1)\) |

\(h_2\) | \(P(h_2, d_1)\) | \(P(h_2, d_2)\) | \(P(h_2)\) |

\(P(d_1)\) | \(P(d_2)\) |

## 15.4 Updating beliefs using Bayes’ rule

The table we laid out in the last section is a powerful tool for solving the rainy day problem because it considers all four logical possibilities and states exactly how confident you are in each of them before being given any data. It is now time to consider what happens to our beliefs when we are actually given the data.

In the rainy day problem, you are told that Attila really *is* carrying an umbrella. This is something of a surprising event: according to our table, the probability of him carrying an umbrella is only 10.88%. But that makes sense, right? No matter how unlikely you thought it was, you must now adjust your beliefs to accommodate the fact that you now *know* that he has an umbrella. To reflect this new knowledge, our *revised* table must have the following numbers:

Umbrella | No umbrella | |
---|---|---|

Rainy day | 0 | |

Dry day | 0 | |

Total | 1 | 0 |

In other words, the facts have eliminated any possibility of “no umbrella”, so we have to put zeros into any cell in the table that implies that. Also, you know for a fact that he is carrying an umbrella, so the column sum on the left must be 1 to correctly describe the fact that \(P(\mbox{umbrella})=1\).

What two numbers should we put in the empty cells? Again, let’s not worry about the maths and instead use our intuitions. We worked out that the joint probability of “rain and umbrella” was 7.05%, and the joint probability of “dry and umbrella” was 3.83%. But notice that *both* of these possibilities are consistent with the fact that he actually is carrying an umbrella. So what we expect to see in our final table are some numbers that preserve the fact that “rain and umbrella” is more plausible than “dry and umbrella” while still ensuring that the numbers in the table add up. Something like this, perhaps?

Umbrella | No umbrella | |
---|---|---|

Rainy day | 0.648 | 0 |

Dry day | 0.352 | 0 |

Total | 1.000 | 0 |

This table infers that, after being told that Attila is carrying an umbrella, you believe that there is a 64.8% chance that today will be a rainy day and a 35.2% chance that it will not. That is the answer to our problem! The **posterior probability** of rain \(P(h|d)\) given that Attila is carrying an umbrella is 64.8%.

How did we calculate these numbers? You can probably guess: take the 0.0705 probability of “rain and umbrella” and divide it by the 0.1088 chance of “umbrella”. This produces a table that satisfies our need to have everything sum to 1 and our need not to interfere with the relative plausibility of the two events that are actually consistent with the data.

To say the same thing using fancy statistical jargon: divide the joint probability of the hypothesis and the data \(P(d,h)\) by the **marginal probability** of the data \(P(d)\), and this is what gives us the posterior probability of the hypothesis *given* that we know the data have been observed. To write this as an equation:^{83}
\[
P(h | d) = \frac{P(d,h)}{P(d)}
\]

Remember that the joint probability \(P(d,h)\) is calculated by multiplying the prior \(P(h)\) by the likelihood \(P(d|h)\). In real life, the things we actually know how to write down are the priors and the likelihood, so let us substitute those back into the equation. This gives us the following formula for the posterior probability: \[ P(h | d) = \frac{P(d|h) P(h)}{P(d)} \]

And this formula is known as **Bayes’ rule**. It describes how a learner starts out with prior beliefs about the plausibility of different hypotheses and tells you how those beliefs should be revised in the face of data. In the Bayesian paradigm, all statistical inference flows from this one simple rule.

I’m confident he is anything but forgetful. At least, that is how I know him. I hope he won’t mind this little easter egg of an example. – Robert↩︎

Some statisticians would object to using the word “likelihood” here. The problem is that the word “likelihood” has a very specific meaning in frequentist statistics, and it’s not quite the same as what it means in Bayesian statistics. Bayesians didn’t originally have any agreed-upon name for the likelihood, and so it became common practice for people to use the frequentist terminology. This wouldn’t have been a problem, except for the fact that the way that Bayesians use the word turns out to be quite different to the way frequentists do. This isn’t the place for yet another lengthy history lesson, but to put it crudely: when a Bayesian says “

*a*likelihood function” they’re usually referring to one of the*rows*of the table. When a frequentist says the same thing, they’re referring to the same table, but to them, “*a*likelihood function” almost always refers to one of the*columns*. This distinction matters in some contexts, but it’s not important for our purposes.↩︎You might notice that this equation is actually a restatement of the same basic rule from the start of the last section. If you multiply both sides of the equation by \(P(d)\), then you get \(P(d) P(h| d) = P(d,h)\), which is the rule for how joint probabilities are calculated.↩︎