# Chapter 9 Hypothesis testing

In Chapter 8, we discussed the ideas behind estimation, which is one of the two “big ideas” in inferential statistics. It’s now time to turn our attention to the other big idea, which is **hypothesis testing**. In its most abstract form, hypothesis testing is really a very simple idea: the researcher has some theory about the world and wants to determine whether or not the data actually support that theory. However, the details are messy, and most people find the theory of hypothesis testing to be the most frustrating part of statistics.

The structure of the chapter is as follows. Firstly, we’ll talk about how hypothesis testing works in a fair amount of detail, using a simple running example to show you how a hypothesis test is “built”. We’ll focus on the underlying logic of the testing procedure rather than being too dogmatic about it. Afterwards, we’ll spend a bit of time talking about the dogmas, rules and heresies surrounding the hypothesis testing theory.

## 9.1 A menagerie of hypotheses

Eventually, we all succumb to madness. Let’s suppose that this glorious day has come, and we indulge in a most thoroughly unproductive line of psychological research: the search for extrasensory perception (ESP).^{35} Our first study is a simple one in which we seek to test whether clairvoyance exists. Each participant sits down at a table and is shown a card by an experimenter. The card is black on one side and white on the other. The experimenter takes the card away and places it on a table in an adjacent room. The card is placed black side up or white side up entirely at random, with the randomisation occurring after the experimenter has left the room with the participant. A second experimenter comes in and asks the participant which side of the card is now facing upwards. It’s purely a one-shot experiment. Each person sees only one card and gives only one answer. At no stage is the participant in contact with someone who knows the correct answer.

The data set, therefore, is very simple. We have asked the question of, say, $N = 100$ people and $X = 62$ of these people have given the correct response. It’s a surprisingly large number, sure, but is it large enough for us to feel safe in claiming we’ve found evidence for ESP? This is the situation where hypothesis testing comes in useful. However, before we talk about how to *test* hypotheses, we need to be clear about what we mean by hypotheses.

### 9.1.1 Research hypotheses versus statistical hypotheses

The first distinction that you need to keep clear in your mind is between research hypotheses and statistical hypotheses. In our ESP study, the overall scientific goal is to demonstrate that clairvoyance exists. In this situation, we have a clear research goal: we are hoping to discover evidence for ESP. In other cases, we might be a lot more neutral than that, so we might say our goal is to determine whether or not clairvoyance exists. Regardless of how we want to portray it, the basic point that we’re trying to convey here is that a research hypothesis involves making a substantive, testable scientific claim. If you are a psychologist, your research hypotheses are fundamentally *about* psychological constructs. Any of the following would count as **research hypotheses**:

- *Listening to music reduces your ability to pay attention to other things.* This is a claim about the causal relationship between two psychologically meaningful concepts (listening to music and paying attention to things), so it’s a perfectly reasonable research hypothesis.
- *Intelligence is related to personality*. Like the last one, this is a relational claim about two psychological constructs (intelligence and personality), but the claim is weaker: correlational, not causal.
- *Intelligence is the speed of information processing*. This hypothesis has quite a different character: it’s not a relational claim at all. It’s an ontological claim about the fundamental character of intelligence. It’s worth expanding on this one actually: It’s usually easier to think about how to construct experiments to test research hypotheses of the form “does X affect Y?” than it is to address claims like “what is X?” And in practice, what usually happens is that you find ways of testing relational claims that follow from your ontological ones. For instance, if we believe that intelligence *is* the speed of information processing in the brain, our experiments will often involve looking for relationships between measures of intelligence and measures of speed. Consequently, most everyday research questions tend to be relational in nature, but they’re almost always motivated by deeper ontological questions about the state of nature.

Notice that in practice, our research hypotheses could overlap a lot. The ultimate goal in the ESP experiment might be to test an ontological claim like “ESP exists”. But we might operationally restrict ourselves to a narrower hypothesis like “Some people can ‘see’ objects in a clairvoyant fashion”. That said, some things really don’t count as proper research hypotheses in any meaningful sense:

- *Love is a battlefield*. This is too vague to be testable. While it’s okay for a research hypothesis to have a degree of vagueness to it, it has to be possible to operationalise your theoretical ideas. If this cannot be converted into any concrete research design, then this isn’t a scientific research hypothesis: it’s a pop song.
- *The first rule of the tautology club is the first rule of the tautology club*. This is not a substantive claim of any kind. It’s true by definition. No conceivable state of nature could possibly be inconsistent with this claim. As such, we say this is an *unfalsifiable* hypothesis, and it is therefore outside the domain of science. Whatever else you do in science, your claims must have the possibility of being wrong.
- *More people in my experiment will say “yes” than “no”*. This one fails as a research hypothesis because it’s a claim about the data set, not about psychology (unless, of course, your actual research question is whether people have some kind of “yes” bias!). As we’ll see shortly, this hypothesis is starting to sound more like a statistical hypothesis than a research hypothesis.

As you can see, research hypotheses can be somewhat messy at times; and ultimately, they are *scientific* claims. **Statistical hypotheses** are neither of these two things. They must be mathematically precise and correspond to specific claims about the characteristics of the “population”. Even so, the intent is that statistical hypotheses clearly relate to the substantive research hypotheses you care about! For instance, in our ESP study, the research hypothesis is that some people are able to see through walls or whatever. We want to “map” this onto a statement about how the data were generated. So let’s think about what that statement would be. The quantity that we’d be interested in within the experiment is $P(\mbox{"correct"})$, the true-but-unknown probability with which the participants in my experiment answer the question correctly. Let’s use the Greek letter $\theta$ (theta) to refer to this probability. Here are four different statistical hypotheses:

- If ESP doesn’t exist and if the experiment is well designed, then the participants are just guessing. So we should expect them to get it right half of the time, and so the statistical hypothesis is that the true probability of choosing correctly is $\theta = 0.5$.
- Alternatively, suppose ESP does exist, and participants can see the card. If that’s true, people will perform better than chance. The statistical hypothesis would be that $\theta > 0.5$.
- A third possibility is that ESP does exist, but the colours are all reversed, and people don’t realise it. If that’s how it works, then you’d expect people’s performance to be *below* chance. This would correspond to a statistical hypothesis that $\theta < 0.5$.
- Finally, suppose ESP exists, but we have no idea whether people are seeing the right colour or the wrong one. In that case, the only claim to be made about the data would be that the probability of making the correct answer is *not* equal to 0.5. This corresponds to the statistical hypothesis that $\theta \neq 0.5$.

These are legitimate examples of statistical hypotheses because they are statements about a population parameter and are meaningfully related to my experiment. What this discussion hopefully makes clear is that when attempting to construct a statistical hypothesis test, the researcher has two quite distinct hypotheses to consider. They have a *research hypothesis* (a claim about psychology) corresponding to a *statistical hypothesis* (a claim about the data generating population).

| Research hypothesis | Statistical hypothesis |
|---|---|
| ESP exists | $\theta \neq 0.5$ |

And the critical thing to recognise is this: *a statistical hypothesis test is a test of the statistical hypothesis, not the research hypothesis*. If your study is poorly designed, the link between your research hypothesis and your statistical hypothesis is broken. To give a silly example, suppose that the ESP study was conducted in a situation where the participant can actually see the card reflected in a window; if that happens, we would be able to find robust evidence that $\theta \neq 0.5$, but this would tell us nothing about whether “ESP exists”.

### 9.1.2 Null hypotheses and alternative hypotheses

We have a research hypothesis that corresponds to the question we want to ask about the world, and we can map it onto a statistical hypothesis. Our statistical hypothesis makes a claim, and this claim becomes our “alternative” hypothesis, $H_1$. In contrast, the “null” hypothesis, $H_0$, corresponds to the exact opposite of that claim. From then on, we focus exclusively on the null hypothesis.

In our ESP example, the null hypothesis is that $\theta = 0.5$, since that’s what we’d expect if ESP *didn’t* exist. The hope, of course, is that ESP is totally real, and so the *alternative* to this null hypothesis is $\theta \neq 0.5$. In essence, what we’re doing here is dividing up the possible values of $\theta$ into two groups: those values that we really hope aren’t true (the null) and those values that we’d be happy with if they turn out to be correct (the alternative). Having done so, the important thing to recognise is that the goal of a hypothesis test is *not* to show that the alternative hypothesis is (probably) true; the goal is to show that the null hypothesis is (probably) false. Most people find this pretty weird.

According to Danielle, the best way to think about it is to imagine that a hypothesis test is a criminal trial: *the trial of the null hypothesis*. The null hypothesis is the defendant, the researcher is the prosecutor, and the statistical test itself is the judge. Just like a criminal trial, there is a presumption of innocence: the null hypothesis is *deemed* to be true unless you, the researcher, can prove beyond a reasonable doubt that it is false. You are free to design your experiment however you like (within reason, obviously!), and your goal when doing so is to maximise the chance that the data will yield a conviction for the crime of being false. The catch is that the statistical test sets the rules of the trial, which are designed to protect the null hypothesis – specifically to ensure that if the null hypothesis is true, the chances of a false conviction are guaranteed to be low. This is pretty important: after all, the null hypothesis doesn’t get a lawyer. And given that the researcher is trying desperately to prove it to be false, *someone* has to protect it.

## 9.2 Two types of errors

Ideally, we would like to construct our test so that we never make any errors. Unfortunately, this is never possible. Sometimes you’re just really unlucky: for instance, suppose you flip a coin 10 times in a row and it comes up heads all 10 times. That feels like powerful evidence that the coin is biased (and it is!), but of course, there’s a 1 in 1024 chance that this would happen even if the coin was totally fair. In other words, in real life, we *always* have to accept that there’s a chance that we did the wrong thing. Consequently, statistical hypothesis testing aims not to *eliminate* errors but to *minimise* them.
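The “1 in 1024” figure is just $(1/2)^{10}$: each flip of a fair coin comes up heads with probability one half, and the flips are independent. As a quick check, here is a small Python sketch using exact fractions (the function name `prob_all_heads` is made up for this illustration):

```python
from fractions import Fraction


def prob_all_heads(n_flips: int) -> Fraction:
    """Probability that a fair coin lands heads on every one of n_flips independent flips."""
    return Fraction(1, 2) ** n_flips


p = prob_all_heads(10)
print(p)         # the exact probability as a fraction
print(float(p))  # the same probability as a decimal
```

So ten heads in a row is rare under a fair coin, but it is certainly not impossible.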

At this point, we need to be more precise about what we mean by “errors”. Firstly, let’s state the obvious: it is either the case that the null hypothesis is true, or it is false. And our test will either reject the null hypothesis or retain it. So, as the table below illustrates, after we run the test and make our choice, one of four things might have happened:

| | Retain $H_0$ | Reject $H_0$ |
|---|---|---|
| $H_0$ is true | Correct decision | Error (type I) |
| $H_0$ is false | Error (type II) | Correct decision |

As a consequence, there are actually *two* different types of error here. If we reject a null hypothesis that is actually true, then we have made a **type I error**. On the other hand, if we retain the null hypothesis when it is, in fact, false, then we have made a **type II error**.

A criminal trial requires that you establish “beyond a reasonable doubt” that the defendant did it. The trial is designed to protect the rights of a defendant. In other words, a criminal trial doesn’t treat the two types of error in the same way: punishing the innocent is deemed to be much worse than letting the guilty go free. A statistical test is pretty much the same: the single most important design principle of the test is to *control* the probability of a type I error to keep it below some fixed probability. This probability, which is denoted $\alpha$, is called the **significance level** of the test (or sometimes, the *size* of the test). A hypothesis test is said to have a significance level $\alpha$ if the type I error rate is no larger than $\alpha$.

So, what about the type II error rate? Well, we’d like to keep that under control too, and we denote this probability by $\beta$. However, it’s much more common to refer to the **power** of the test, which is the probability with which we reject a null hypothesis when it is false, that is, $1-\beta$. To help keep this straight, here’s the same table again, but with the relevant numbers added:

| | Retain $H_0$ | Reject $H_0$ |
|---|---|---|
| $H_0$ is true | $1-\alpha$ (probability of correct retention) | $\alpha$ (type I error rate) |
| $H_0$ is false | $\beta$ (type II error rate) | $1-\beta$ (power of the test) |

A “powerful” hypothesis test has a small value of $\beta$ while still keeping $\alpha$ fixed at some (small) desired level. By convention, scientists make use of three different $\alpha$ levels: $.05$, $.01$ and $.001$. Notice the asymmetry here: the tests are designed to *ensure* that the $\alpha$ level is kept small, but there’s no corresponding guarantee regarding $\beta$. We’d certainly *like* the type II error rate to be small, and we try to design tests that keep it small, but this is very much secondary to the overwhelming need to control the type I error rate. Paraphrasing Blackstone had he been a statistician: it is “better to retain ten false null hypotheses than to reject a single true one”.
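To make the two error rates a little more concrete, here is a minimal simulation sketch in Python. Everything specific in it is an assumption made for illustration: the decision rule (reject when 40 or fewer, or 60 or more, of 100 participants answer correctly), the alternative value $\theta = 0.6$, the seed, the number of repetitions, and the function names.

```python
import random


def run_experiment(theta: float, n: int, rng: random.Random) -> int:
    """Simulate one study: count how many of n participants answer correctly."""
    return sum(rng.random() < theta for _ in range(n))


def rejects_null(x: int) -> bool:
    """Hypothetical decision rule for n = 100: reject when X is far from 50."""
    return x <= 40 or x >= 60


def rejection_rate(theta: float, n: int = 100, reps: int = 5000, seed: int = 1) -> float:
    """Fraction of simulated studies in which the rule rejects the null."""
    rng = random.Random(seed)
    rejections = sum(rejects_null(run_experiment(theta, n, rng)) for _ in range(reps))
    return rejections / reps


# Null true (theta = 0.5): every rejection is a type I error.
type_one_rate = rejection_rate(0.5)

# Null false (theta = 0.6, an assumed value): every rejection is a correct decision.
power = rejection_rate(0.6)
type_two_rate = 1 - power

print(f"estimated type I error rate: {type_one_rate:.3f}")
print(f"estimated power:             {power:.3f}")
print(f"estimated type II rate:      {type_two_rate:.3f}")
```

Under these assumptions the estimated type I error rate should be small, while the type II error rate is much larger. That is the asymmetry described above: the rule is built to keep the first rate low, and the second is whatever it turns out to be.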

## 9.3 Test statistics and sampling distributions

At this point, we need to start talking specifics about how to construct a hypothesis test. To that end, let’s return to the ESP example. Let’s ignore the actual data we obtained, for the moment, and think about the structure of the experiment. Regardless of the actual numbers, the *form* of the data is that $X$ out of $N$ people correctly identified the colour of the hidden card. Moreover, let’s suppose for the moment that the null hypothesis really is true: ESP doesn’t exist, and the true probability that anyone picks the correct colour is exactly $\theta = 0.5$. What would we *expect* the data to look like? Well, obviously, we’d expect the proportion of people who make the correct response to be pretty close to 50%. Or, to phrase this in more mathematical terms, we’d say that $X/N$ is approximately $0.5$. Of course, we wouldn’t expect this fraction to be *exactly* 0.5: if, for example, we tested $N=100$ people, and $X = 53$ of them got the question right, we’d probably be forced to concede that the data are quite consistent with the null hypothesis. On the other hand, if $X = 99$ of our participants got the question right, then we’d feel pretty confident that the null hypothesis is wrong. Similarly, if only $X=3$ people got the answer right, we’d be similarly confident that the null was wrong. Let’s be a little more technical about this: we have a quantity $X$ that we can calculate by looking at our data; after looking at the value of $X$, we decide whether to believe that the null hypothesis is correct or to reject the null hypothesis in favour of the alternative. The name for this thing that we calculate to guide our choices is a **test statistic**.

Having chosen a test statistic, the next step is to state precisely which values of the test statistic would cause us to reject the null hypothesis and which values would cause us to keep it. To do so, we need to determine what the **sampling distribution of the test statistic** would be if the null hypothesis were true (we discussed sampling distributions earlier in Chapter 8.3.1). Why do we need this? Because this distribution tells us precisely what values of $X$ our null hypothesis would lead us to expect. And therefore, we can use this distribution to assess how closely the null hypothesis agrees with our data.

How do we determine the sampling distribution of the test statistic? Fortunately, our ESP example provides us with one of the most uncomplicated cases. Our population parameter $\theta$ is just the overall probability that people respond correctly when asked the question, and our test statistic $X$ is the *count* of the number of people who did so out of a sample size of $N$. We’ve seen a distribution like this in Chapter 7.4.1: that’s exactly what the binomial distribution describes! So, to use the notation and terminology introduced in that section, we would say that the null hypothesis predicts that $X$ is binomially distributed, which is written
$X \sim \mbox{Binomial}(\theta,N)$

Since the null hypothesis states that $\theta = 0.5$ and our experiment has $N=100$ people, we have the sampling distribution we need. This sampling distribution is plotted in Figure 9.1. No surprises, really: the null hypothesis says that $X=50$ is the most likely outcome, and it says that we’re almost certain to see somewhere between 40 and 60 correct responses.
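If you would like to check these claims numerically, the sampling distribution can be computed directly from the binomial formula. The following is a small sketch in plain Python; the helper name `binom_pmf` and the variable names are illustrative choices.

```python
from math import comb


def binom_pmf(k: int, n: int, theta: float) -> float:
    """P(X = k) when X ~ Binomial(theta, n), in the chapter's notation."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


N = 100
THETA_NULL = 0.5

# Sampling distribution of X under the null: one probability per possible count.
null_dist = [binom_pmf(k, N, THETA_NULL) for k in range(N + 1)]

# The count with the highest probability under the null.
most_likely = max(range(N + 1), key=lambda k: null_dist[k])

# Probability that X falls anywhere from 40 to 60 inclusive.
prob_40_to_60 = sum(null_dist[40:61])

print(most_likely)
print(round(prob_40_to_60, 3))
```

The most likely count should come out as 50, and the probability of landing between 40 and 60 should be a little over 96%, which is what “almost certain” means here.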

## 9.4 Making decisions

We’ve constructed a test statistic ($X$), and we chose it so that we’re pretty confident that if $X$ is close to $N/2$, then we should retain the null, and if not, we should reject it. The question remains: exactly which values of the test statistic should we associate with the null hypothesis, and which values go with the alternative hypothesis? In the ESP study, for example, we’ve observed a value of $X=62$. What decision should we make? Should we choose to believe the null hypothesis or the alternative hypothesis?

### 9.4.1 Critical regions and critical values

To answer this question, we need to introduce the concept of a **critical region** for the test statistic $X$. The critical region of the test corresponds to those values of $X$ that would lead us to reject the null hypothesis (which is why the critical region is also sometimes called the rejection region). How do we find this critical region? Well, let’s consider what we know:

- $X$ should be very big or very small to reject the null hypothesis.
- If the null hypothesis is true, the sampling distribution of $X$ is Binomial$(0.5, N)$.
- If $\alpha =.05$, the critical region must cover 5% of this sampling distribution.

You must understand this last point: the critical region corresponds to those values of $X$ for which we would reject the null hypothesis, and the sampling distribution in question describes the probability that we would obtain a particular value of $X$ if the null hypothesis were actually true.

Now, let’s suppose that we chose a critical region that covers 20% of the sampling distribution, and assume that the null hypothesis is actually true. What would be the probability of incorrectly rejecting the null? The answer is, of course, 20%. And therefore, we would have built a test that had an $\alpha$ level of $0.2$. If we want $\alpha = .05$, the critical region is only *allowed* to cover 5% of the sampling distribution of our test statistic.

Those three things uniquely solve the problem: our critical region consists of the most *extreme values*, known as the **tails** of the distribution (illustrated in Figure 9.2). As it turns out, if we want $\alpha = .05$, then our critical regions correspond to $X \leq 40$ and $X \geq 60$.^{36} If the number of people answering correctly is between 41 and 59, we should retain the null hypothesis. We should reject the null hypothesis if the number is between 0 and 40 or between 60 and 100. The numbers 40 and 60 are often referred to as the **critical values** since they define the edges of the critical region.
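Because $X$ can only take whole-number values, a critical region usually cannot cover *exactly* 5% of the sampling distribution. The sketch below computes how much of the null distribution is covered by the region given above, and by the next-narrower region with boundaries 39 and 61. The function names are made up for this illustration.

```python
from math import comb


def binom_pmf(k: int, n: int, theta: float) -> float:
    """P(X = k) when X ~ Binomial(theta, n)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def region_coverage(lower: int, upper: int, n: int = 100, theta: float = 0.5) -> float:
    """Probability that X <= lower or X >= upper when X ~ Binomial(theta, n)."""
    return sum(
        binom_pmf(k, n, theta)
        for k in range(n + 1)
        if k <= lower or k >= upper
    )


coverage_40_60 = region_coverage(40, 60)  # the critical region used in the text
coverage_39_61 = region_coverage(39, 61)  # one step narrower on each side

print(f"X <= 40 or X >= 60 covers {coverage_40_60:.4f} of the null distribution")
print(f"X <= 39 or X >= 61 covers {coverage_39_61:.4f} of the null distribution")
```

Under these assumptions the first region covers roughly 5.7% of the distribution and the second roughly 3.5%, so 40 and 60 are the boundaries that land closest to the 5% target.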

At this point, our hypothesis test is essentially complete:

- (1) we choose an $\alpha$ level (e.g. $\alpha = .05$),
- (2) we come up with some test statistic (e.g., $X$) that does a good job (in some meaningful sense) of comparing $H_0$ to $H_1$,
- (3) we figure out the sampling distribution of the test statistic on the assumption that the null hypothesis is true (in this case, binomial) and then
- (4) we calculate the critical region that produces an appropriate $\alpha$ level (0-40 and 60-100).

Now, we have to calculate the value of the test statistic for the real data (e.g., $X = 62$) and then compare it to the critical values to make our decision. Since 62 is greater than the critical value of 60, we would reject the null hypothesis. Or, to phrase it slightly differently, we say that the test has produced a **significant** result.

### 9.4.2 A note on statistical “significance”

A very brief digression is in order regarding the word “significant”. It is a misnomer. The concept of statistical significance is very simple, but has a very unfortunate name. If the data allow us to reject the null hypothesis, we say that “the result is *statistically significant*”, often shortened to “the result is significant”. This terminology dates back to a time when “significant” just meant something like “indicated” rather than its modern meaning, which is much closer to “important”. As a result, many modern readers get very confused when they start learning statistics because they think a “significant result” must be an important one. It doesn’t mean that at all. All that “statistically significant” means is that the data allowed us to reject a null hypothesis. Whether or not the result is actually important in the real world is a very different question and depends on all sorts of other things.

### 9.4.3 The difference between one sided and two sided tests

There’s one more thing to point out about the constructed hypothesis test. Let us take a moment to think about the statistical hypotheses: $\begin{array}{cc} H_0 : & \theta = .5 \\ H_1 : & \theta \neq .5 \end{array}$

We notice that the alternative hypothesis covers *both* the possibility that $\theta < .5$ and the possibility that $\theta > .5$. This makes sense if we think ESP could produce better-than-chance performance *or* worse-than-chance performance. This is an example of a **two-sided test** in statistical language. It’s called this because the alternative hypothesis covers the area on both “sides” of the null hypothesis. As a consequence, the critical region of the test covers both tails of the sampling distribution (2.5% on either side provided that $\alpha =.05$), as illustrated earlier in Figure 9.2.

However, that’s not the only possibility. For example, it might be the case that we’re only willing to believe in ESP if it produces better than chance performance. If so, then the alternative hypothesis would only cover the possibility that $\theta > .5$, and as a consequence, the null hypothesis now becomes $\theta \leq .5$: $\begin{array}{cc} H_0 : & \theta \leq .5 \\ H_1 : & \theta > .5 \end{array}$

When this happens, we have what’s called a **one-sided test**, and the critical region only covers one tail of the sampling distribution. This is illustrated in Figure 9.3.
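To see how a one-sided critical region could be found for the ESP example, here is a minimal sketch. It assumes a strict rule (the upper tail must cover *no more than* $\alpha$) and evaluates the null at its boundary value $\theta = .5$, since that is the value in $\theta \leq .5$ that makes large counts most likely. The function names are illustrative.

```python
from math import comb


def binom_pmf(k: int, n: int, theta: float) -> float:
    """P(X = k) when X ~ Binomial(theta, n)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def upper_tail(c: int, n: int, theta: float) -> float:
    """P(X >= c) when X ~ Binomial(theta, n)."""
    return sum(binom_pmf(k, n, theta) for k in range(c, n + 1))


def one_sided_critical_value(n: int = 100, theta_null: float = 0.5, alpha: float = 0.05) -> int:
    """Smallest c such that rejecting when X >= c has type I error rate <= alpha."""
    for c in range(n + 1):
        if upper_tail(c, n, theta_null) <= alpha:
            return c
    return n + 1  # no count is extreme enough; the test can never reject


critical = one_sided_critical_value()
print(critical)
print(round(upper_tail(critical, 100, 0.5), 4))
```

Notice that putting the whole of $\alpha$ into one tail moves the upper boundary closer to 50 than it was in the two-sided test: the one-sided test needs less extreme data to reject, but only in the direction it is looking.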

## 9.5 The $p$ value of a test

In one sense, our hypothesis test is complete; we’ve constructed a test statistic, figured out its sampling distribution if the null hypothesis is true, and then constructed the critical region for the test. Nevertheless, we’ve actually omitted the most important number of all: **the $p$ value**. It is to this topic that we now turn.

There are two somewhat different ways of interpreting a $p$ value, one proposed by Sir Ronald Fisher and the other by Jerzy Neyman. Both versions are legitimate, though they reflect very different ways of thinking about hypothesis tests. Most introductory textbooks only give Fisher’s version, but that’s a bit of a shame. Danielle believes Neyman’s version is cleaner and that it better reflects the logic of the null hypothesis test. You might disagree, though, so both are included. We’ll start with Neyman’s version.

### 9.5.1 A softer view of decision making

One problem with the hypothesis testing procedure is that it makes no distinction between a “barely significant” and a “highly significant” result. For instance, in the ESP study, the data only just fell inside the critical region - so we did get a significant effect, but it was a pretty near thing. In contrast, suppose we run a study in which $X=97$ out of the $N=100$ participants got the answer right. This would obviously be significant, too, but by a much larger margin. There is no real ambiguity about this at all. The procedure makes no distinction between the two. If we adopt the standard convention of allowing $\alpha = .05$ as an acceptable Type I error rate, then both are significant results.

This is where the $p$ value comes in handy. To understand how it works, let’s suppose that we ran many hypothesis tests on the same data set: but with a different value of $\alpha$ in each case. When we do that for our ESP data, what we’d get is something like this:

| Value of $\alpha$ | Reject the null? |
|---|---|
| 0.05 | Yes |
| 0.04 | Yes |
| 0.03 | Yes |
| 0.02 | No |
| 0.01 | No |

When we test ESP data ($X=62$ successes out of $N=100$ observations) using $\alpha$ levels of .03 and above, we always reject the null hypothesis. We always retain the null hypothesis for $\alpha$ levels of .02 and below. Therefore, somewhere between .02 and .03, there must be the smallest value of $\alpha$ that would allow us to reject the null hypothesis for this data. This is the $p$ value; as it turns out the ESP data has $p = .021$. In short:

$p$ is defined to be the smallest Type I error rate ($\alpha$) that you have to be willing to tolerate if you want to reject the null hypothesis.

If it turns out that $p$ describes an error rate that you find intolerable, then you must retain the null. If you’re comfortable with an error rate equal to $p$, then it’s okay to reject the null hypothesis in favour of your preferred alternative.

In effect, $p$ is a summary of all the possible hypothesis tests you could have run, taken across all possible $\alpha$ values. And as a consequence, it has the effect of “softening” our decision process. For those tests in which $p \leq \alpha$, you would have rejected the null hypothesis, whereas for those tests in which $p > \alpha$, you would have retained the null.

In our ESP study, we obtained $X=62$, and as a consequence we’ve ended up with $p = .021$. So the error rate we have to tolerate is 2.1%. In contrast, suppose the experiment had yielded $X=97$. What happens to the $p$ value now? This time it’s shrunk to $p = 1.36 \times 10^{-25}$, which is a tiny, tiny^{37} Type I error rate. For this second case, we would be able to reject the null hypothesis with a lot more confidence because we only have to be “willing” to tolerate a type I error rate of about 1 in 10 trillion trillion to justify our decision to reject.
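The “many tests at different $\alpha$ levels” idea can be sketched in a few lines of Python. The sketch assumes one common convention for a two-sided binomial test, namely doubling the smaller of the two tail probabilities; the function names are made up for this illustration.

```python
from math import comb


def binom_pmf(k: int, n: int, theta: float) -> float:
    """P(X = k) when X ~ Binomial(theta, n)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def two_sided_p(x: int, n: int = 100, theta_null: float = 0.5) -> float:
    """Two-sided p value, assuming the 'double the smaller tail' convention."""
    lower = sum(binom_pmf(k, n, theta_null) for k in range(0, x + 1))  # P(X <= x)
    upper = sum(binom_pmf(k, n, theta_null) for k in range(x, n + 1))  # P(X >= x)
    return min(1.0, 2 * min(lower, upper))


p_esp = two_sided_p(62)
print(f"p = {p_esp:.3f}")

# Re-run the "same" test at several alpha levels: reject whenever p <= alpha.
alphas = (0.05, 0.04, 0.03, 0.02, 0.01)
decisions = [p_esp <= alpha for alpha in alphas]
for alpha, reject in zip(alphas, decisions):
    print(f"alpha = {alpha:.2f}: {'reject' if reject else 'retain'} the null")
```

Under this convention the $p$ value for $X = 62$ comes out close to .021, and the pattern of decisions matches the table above: the test switches from “reject” to “retain” exactly where $\alpha$ drops below $p$.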

### 9.5.2 The probability of extreme data

The second definition of the $p$-value comes from Sir Ronald Fisher, and it is the one you tend to see in most introductory statistics textbooks. Notice how, when we constructed the critical region, it corresponded to the *tails* (i.e. extreme values) of the sampling distribution? That’s not a coincidence: almost all “good” tests have this characteristic (good in the sense of minimising our type II error rate, $\beta$). The reason for that is that a good critical region almost always corresponds to those values of the test statistic that are least likely to be observed if the null hypothesis is true. If this rule is true, then we can define the $p$-value as the probability that we would have observed a test statistic at least as extreme as the one we actually did get, if the null hypothesis were true.

In other words, if the data are extremely implausible according to the null hypothesis, then the null hypothesis is probably wrong.
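One way to make “at least as extreme” concrete, assumed here purely for illustration, is to count every outcome that the null hypothesis considers no more probable than the one we observed. A minimal Python sketch (function names are hypothetical, and the small tolerance only guards against floating-point ties):

```python
from math import comb


def binom_pmf(k: int, n: int, theta: float) -> float:
    """P(X = k) when X ~ Binomial(theta, n)."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def p_value_extreme(x_observed: int, n: int = 100, theta_null: float = 0.5) -> float:
    """Total null probability of all outcomes no more likely than the observed one."""
    probs = [binom_pmf(k, n, theta_null) for k in range(n + 1)]
    cutoff = probs[x_observed] * (1 + 1e-7)  # tolerance for floating-point ties
    return sum(p for p in probs if p <= cutoff)


p_esp = p_value_extreme(62)
print(f"p = {p_esp:.3f}")
```

For the ESP data ($X = 62$ out of $N = 100$) this should land close to the $p = .021$ reported earlier. With $\theta = .5$ the null distribution is symmetric, so the outcomes “at least as extreme” as 62 are exactly the counts of 62 or more together with the counts of 38 or fewer.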

### 9.5.3 A common mistake

You can see that there are two somewhat different but legitimate ways to interpret the $p$ value, one based on Neyman’s approach to hypothesis testing and the other based on Fisher’s. Unfortunately, there is a third explanation that people sometimes give, especially when they’re first learning statistics, and it is *absolutely and completely wrong*. This mistaken approach is to refer to the $p$ value as “the probability that the null hypothesis is true”. It’s an intuitively appealing way to think, but it’s wrong in two key respects:

- null hypothesis testing is a frequentist tool, and the frequentist approach to probability does *not* allow you to assign probabilities to the null hypothesis. According to this view of probability, the null hypothesis is either true or not, but it cannot have a “5% chance” of being true.
- even within the Bayesian approach, which does let you assign probabilities to hypotheses, the $p$ value would not correspond to the probability that the null is true; this interpretation is entirely inconsistent with the mathematics of how the $p$ value is calculated.

Put bluntly, despite the intuitive appeal of thinking this way, there is *no* justification for interpreting a $p$ value this way. Never do it.

## 9.6 Reporting the results of a hypothesis test

When writing up the results of a hypothesis test, there are usually several pieces of information that you need to report, but it varies a fair bit from test to test. In the chapters discussing each statistical tool, we’ll spend a little time talking about how to report the results correctly. However, regardless of what test you’re doing, the one thing that you always have to do is say something about the $p$ value and whether or not the outcome was significant.

The fact that you have to do this is unsurprising: it’s the whole point of doing the test. It might be surprising, though, that there is some contention over exactly how you’re supposed to do it. Leaving aside those people who completely disagree with the entire framework underpinning null hypothesis testing, there’s a certain amount of tension that exists regarding whether or not to report the exact $p$ value that you obtained, or if you should state only that $p < \alpha$ for a significance level that you chose in advance (e.g., $p<.05$).

### 9.6.1 The issue

To see why this is an issue, the key thing to recognise is that $p$ values are *terribly* convenient. In practice, the fact that we can compute a $p$ value means that we don’t have to specify any $\alpha$ level at all to run the test. Instead, what you can do is calculate your $p$ value and interpret it directly: if you get $p = .062$, then it means that you’d have to be willing to tolerate a Type I error rate of 6.2% to justify rejecting the null. If you personally find 6.2% intolerable, then you retain the null.

Therefore, the argument goes, why don’t we just report the actual $p$ value and let the reader make up their own minds about what an acceptable Type I error rate is? This approach has the big advantage of “softening” the decision-making process – in fact, if you accept the Neyman definition of the $p$ value, that’s the whole point of the $p$ value. We no longer have a fixed significance level of $\alpha = .05$ as a bright line separating “accept” from “reject” decisions; and this removes the rather pathological problem of being forced to treat $p = .051$ in a fundamentally different way to $p = .049$.

This flexibility is both the advantage and the disadvantage to the $p$ value. Many people don’t like the idea of reporting an exact $p$ value because it gives the researcher a bit *too much* freedom. In particular, it lets you change your mind about what error tolerance you’re willing to put up with *after* you look at the data.

For instance, consider the ESP experiment. Suppose we ran the test and ended up with a $p$ value of .09. Should we accept or reject? Now, we haven’t yet bothered to think about what level of Type I error we’re “really” willing to accept. But we *do* have an opinion about whether or not ESP exists. Regardless, we could always decide that a 9% error rate isn’t so bad, especially compared to how annoying it would be to admit to the world that the experiment has failed. So, to avoid looking like we just made it up after the fact, we now say that our $\alpha$ is .1: a 10% Type I error rate isn’t too bad, and at that level, our test is significant! We win.

In other words, the worry here is that we might have the best of intentions and be the most honest of people, but the temptation to just “shade” things a little bit here and there is really, really strong. Anyone who has ever run an experiment can attest that it’s a long and difficult process, and you often get *very* attached to your hypotheses. It’s hard to let go and admit the experiment didn’t find what you wanted it to find. And that’s the danger here. If we use the “raw” $p$-value, people will start interpreting the data in terms of what they *want* to believe, not what the data are actually saying. And if we allow that, well, why are we bothering to do science at all? Why not let everyone believe whatever they like about anything, regardless of what the facts are? Okay, that’s a bit extreme, but that’s where the worry comes from. According to this view, you really *must* specify your $\alpha$ value in advance and then only report whether the test was significant or not. It’s the only way to keep ourselves honest.

### 9.6.2 Two proposed solutions

In practice, it’s pretty rare for a researcher to specify a single $\alpha$ level ahead of time. Instead, the convention is that scientists rely on three standard significance levels: .05, .01 and .001. When reporting your results, you indicate which (if any) of these significance levels allow you to reject the null hypothesis. This is summarised in Table 9.2. This allows us to soften the decision rule a little bit since $p<.01$ implies that the data meet a stronger evidentiary standard than $p<.05$ would. Nevertheless, since these levels are fixed in advance by convention, it does prevent people from choosing their $\alpha$ level after looking at the data.

| Usual notation | Signif. stars | Meaning | The null is… |
|---|---|---|---|
| $p>.05$ | | The test wasn’t significant | Retained |
| $p<.05$ | * | The test was significant at $\alpha = .05$ but not at $\alpha = .01$ or $\alpha = .001$ | Rejected |
| $p<.01$ | ** | The test was significant at $\alpha = .05$ and $\alpha = .01$ but not at $\alpha = .001$ | Rejected |
| $p<.001$ | *** | The test was significant at all levels | Rejected |
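The tiered convention in Table 9.2 is mechanical enough to write down as a short function. The sketch below is only an illustration in Python: the function name `significance_stars` is made up for this example and is not part of CogStat or any other package. It returns the notation and the stars that the table assigns to a given $p$ value.

```python
def significance_stars(p):
    """Map a p value to the tiered notation of Table 9.2.

    Hypothetical helper for illustration only. It checks the strictest
    conventional level first (.001), then .01, then .05.
    """
    if not 0 <= p <= 1:
        raise ValueError("a p value must lie between 0 and 1")
    if p < 0.001:
        return "p<.001", "***"
    if p < 0.01:
        return "p<.01", "**"
    if p < 0.05:
        return "p<.05", "*"
    return "p>.05", ""  # not significant: the null is retained


# A few p values, including the p = .062 discussed earlier in this section
for p in (0.062, 0.03, 0.004, 0.0000000001):
    notation, stars = significance_stars(p)
    print(f"p = {p:<12} reported as {notation} {stars}")
```

Because the thresholds are hard-coded, there is no way to adjust them after seeing the data, which is exactly the property the tiered approach is meant to have.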

Nevertheless, many people still prefer to report exact $p$ values. To many people, the advantage of allowing the reader to make up their own mind about how to interpret $p = .06$ outweighs any disadvantages. In practice, however, even among those researchers who prefer exact $p$ values, it is quite common to just write $p<.001$ instead of reporting an exact value for small $p$. This is in part because a lot of software doesn’t actually print out the $p$ value when it’s that small (e.g., SPSS just writes $p = .000$ whenever $p<.001$), and in part because a very small $p$ value can be kind of misleading. The human mind sees a number like .0000000001, and it’s hard to suppress the gut feeling that the evidence in favour of the alternative hypothesis is a near certainty. In practice, however, this is usually wrong. Every statistical test ever invented relies on simplifications, approximations and assumptions. As a consequence, it’s probably not reasonable to walk away from *any* statistical analysis with a feeling of confidence stronger than $p<.001$ implies. In other words, $p<.001$ is really code for “as far as *this test* is concerned, the evidence is overwhelming.”

In light of all this, you might wonder what exactly you should do. There’s a fair bit of contradictory advice on the topic, with some people arguing that you should report the exact $p$ value and other people arguing that you should use the tiered approach illustrated in Table 9.2. As a result, the best advice is that you look at papers/reports written in your field and see what the convention seems to be. If there doesn’t seem to be any consistent pattern, then use whichever method you prefer.

For any hypothesis test that CogStat will run for you, whether that be a t-test, ANOVA, regression, etc., the output will include a $p$ value. If you want to report the $p$ value, you can just copy and paste it from the output.

## 9.7 Effect size, sample size and power

In previous sections, we emphasised the fact that the major design principle behind statistical hypothesis testing is that we try to control our Type I error rate. When we fix $\alpha = .05$, we are attempting to ensure that only 5% of true null hypotheses are incorrectly rejected. However, this doesn’t mean that we don’t care about Type II errors. In fact, from the researcher’s perspective, the error of failing to reject the null when it is actually false is an extremely annoying one. With that in mind, a secondary goal of hypothesis testing is to try to minimise $\beta$, the Type II error rate, although we don’t usually *talk* in terms of minimising Type II errors. Instead, we talk about maximising the *power* of the test. Since power is defined as $1-\beta$, this is the same thing.

### 9.7.1 The power function

Let’s take a moment to think about what a Type II error actually is. A Type II error occurs when the alternative hypothesis is true, but we are nevertheless unable to reject the null hypothesis. Ideally, we’d be able to calculate a single number $\beta$ that tells us the Type II error rate, similarly to how we can set $\alpha = .05$ for the Type I error rate. Unfortunately, this is a lot trickier to do.

To see this, notice that in the ESP study, the alternative hypothesis actually corresponds to lots of possible values of $\theta$. In fact, the alternative hypothesis corresponds to every value of $\theta$ *except* 0.5. Let’s suppose that the true probability of someone choosing the correct response is 55% (i.e., $\theta = .55$). If so, then the *true* sampling distribution for $X$ is not the same one that the null hypothesis predicts: the most likely value for $X$ is now 55 out of 100. Not only that, the whole sampling distribution has now shifted, as shown in Figure 9.4. The critical regions, of course, do not change: by definition, the critical regions are based on what the null hypothesis predicts. What we’re seeing in this figure is the fact that when the null hypothesis is wrong, a much larger proportion of the sampling distribution falls in the critical region. And, of course, that’s what should happen: the probability of rejecting the null hypothesis is larger when the null hypothesis is actually false! However, $\theta = .55$ is not the only possibility that is consistent with the alternative hypothesis.

Let’s instead suppose that the true value of $\theta$ is 0.7. What happens to the sampling distribution when this occurs? The answer, shown in Figure 9.5, is that almost the entirety of the sampling distribution has now moved into the critical region. Therefore, if $\theta = 0.7$, the probability of us correctly rejecting the null hypothesis (i.e. the power of the test) is much larger than if $\theta = 0.55$. In short, while $\theta = .55$ and $\theta = .70$ are both part of the alternative hypothesis, the Type II error rate is different.

This means that the power of a test (i.e., $1-\beta$) depends on the true value of $\theta$. To illustrate this, we’ve calculated the expected probability of rejecting the null hypothesis for all values of $\theta$ and plotted it in Figure 9.6. This plot describes what is usually called the **power function** of the test. It’s a nice summary of how good the test is because it actually tells you the power ($1-\beta$) for all possible values of $\theta$. As you can see, when the true value of $\theta$ is very close to 0.5, the power of the test drops very sharply, but when it is further away, the power is large.
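The calculation behind a power function like this can be sketched in a few lines of Python. The sketch rests on two assumptions that are not stated in this section: that the study has $N = 100$ participants, and that the critical region is $X \leq 40$ or $X \geq 60$ (these boundaries are inferred from the footnote reporting $\alpha = .057$ for the test). The function names are hypothetical.

```python
from math import comb

N = 100                # assumed number of participants in the ESP study
LOWER, UPPER = 40, 60  # assumed critical region: reject if X <= 40 or X >= 60


def binom_pmf(k, n, theta):
    """Probability of exactly k correct responses out of n when P(correct) = theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def rejection_probability(theta, n=N, lower=LOWER, upper=UPPER):
    """Probability that X falls in the critical region when the true value is theta.

    The critical region is fixed by the null hypothesis, so only the
    sampling distribution changes as theta changes.
    """
    in_lower = sum(binom_pmf(k, n, theta) for k in range(0, lower + 1))
    in_upper = sum(binom_pmf(k, n, theta) for k in range(upper, n + 1))
    return in_lower + in_upper


# Evaluate the power function at a handful of true values of theta
for theta in (0.50, 0.55, 0.60, 0.70, 0.80):
    print(f"theta = {theta:.2f}   P(reject null) = {rejection_probability(theta):.3f}")
```

At $\theta = 0.5$ the rejection probability is just the Type I error rate of the test; for every other value of $\theta$ it is the power, $1-\beta$.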

### 9.7.2 Effect size

The plot shown in Figure 9.6 captures a basic point about hypothesis testing. If the true state of the world is very different from what the null hypothesis predicts, then your power will be very high; but if the true state of the world is similar to the null (but not identical), then the power of the test is going to be very low. Therefore, it’s useful to be able to have some way of quantifying how “similar” the true state of the world is to the null hypothesis. A statistic that does this is called a measure of **effect size** (e.g. Cohen, 1988; Ellis, 2010).

Effect size is defined slightly differently in different contexts, but the qualitative idea that it tries to capture is always the same: how big is the difference between the *true* population parameters and the parameter values that are assumed by the null hypothesis.

In our ESP example, if we let $\theta_0 = 0.5$ denote the value assumed by the null hypothesis, and let $\theta$ denote the true value, then a simple measure of effect size could be something like the difference between the true value and null (i.e., $\theta - \theta_0$), or possibly just the magnitude of this difference, $|\theta - \theta_0|$.

| | Big effect size | Small effect size |
|---|---|---|
| Significant result | difference is real, and of practical importance | difference is real, but might not be interesting |
| Non-significant result | no effect observed | no effect observed |

Why calculate effect size? Let’s assume that you’ve run your experiment, collected the data, and gotten a significant effect when you ran your hypothesis test. Isn’t it enough just to say that you’ve gotten a significant effect? Surely that’s the *point* of hypothesis testing? Well, sort of. Yes, the point of doing a hypothesis test is to try to demonstrate that the null hypothesis is wrong, but that’s hardly the only thing we’re interested in.

If the null hypothesis claimed that $\theta = .5$, and we show that it’s wrong, we’ve only really told half of the story. Rejecting the null hypothesis implies that we believe that $\theta \neq .5$, but there’s a big difference between $\theta = .51$ and $\theta = .8$. If we find that $\theta = .8$, then not only have we found that the null hypothesis is wrong, it appears to be *very* wrong.

On the other hand, suppose we’ve successfully rejected the null hypothesis, but it looks like the true value of $\theta$ is only .51 (this would only be possible with a large study). Sure, the null hypothesis is wrong, but it’s not at all clear that we actually *care*, because the effect size is so small.

In the context of the ESP study, we might still care, since any demonstration of real psychic powers would actually be pretty cool^{38}, but in other contexts, a 1% difference isn’t very interesting, even if it is a real difference.

For instance, suppose we’re looking at differences in high school exam scores between males and females, and it turns out that the female scores are 1% higher on average than the males. If I’ve got data from thousands of students, then this difference will almost certainly be *statistically significant*, but regardless of how small the $p$ value is, it’s just not very interesting. You’d hardly want to go around proclaiming a crisis in boys’ education based on such a tiny difference, would you? For this reason, it is becoming more standard (slowly but surely) to report some kind of standard measure of effect size along with the results of the hypothesis test. The hypothesis test itself tells you whether you should believe that the effect you have observed is real (i.e., not just due to chance); the effect size tells you whether or not you should care.

### 9.7.3 Increasing the power of your study

Not surprisingly, scientists are fairly obsessed with maximising the power of their experiments. We want our experiments to work, so we want to maximise the chance of rejecting the null hypothesis if it is false (and we usually want to believe it is false!).

As we’ve seen, one factor that influences power is the *effect size*. So the first thing you can do to increase your power is to increase the effect size. In practice, this means that you want to design your study so that the effect size gets magnified. For instance, in the ESP study, we might believe that psychic powers work best in a quiet, darkened room; with fewer distractions to cloud the mind. Therefore, we would try to conduct the experiments in such an environment. If we can strengthen people’s ESP abilities somehow, then the true value of $\theta$ will go up^{39} and, therefore, our effect size will be larger. In short, clever experimental design is one way to boost power; because it can alter the effect size.

Unfortunately, it’s often the case that even with the best experimental designs, you may have only a small effect. Perhaps, for example, ESP really does exist, but even under the best of conditions, it’s very very weak. Under those circumstances, your best bet for increasing power is to increase the sample size. Generally, the more observations you have available, the more likely you can discriminate between two hypotheses.

If we had 10 participants, and 7 of them correctly guessed the colour of the hidden card, you wouldn’t be terribly impressed. But if we had 10,000 participants and 7,000 of them got the answer right, you would likely think we had discovered something. In other words, power increases with the sample size. This is illustrated in Figure 9.7, which shows the power of the test for a true parameter of $\theta = 0.7$, for all sample sizes $N$ from 1 to 100, where I’m assuming that the null hypothesis predicts that $\theta_0 = 0.5$.
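A rough way to see how power depends on sample size is to recompute the critical region for each $N$ and then ask how much of the $\theta = 0.7$ sampling distribution falls inside it. The Python sketch below does this under one assumption of its own: for each $N$ it uses the largest symmetric critical region whose Type I error rate does not exceed $\alpha = .05$. That rule is slightly stricter than the $\alpha = .057$ tolerated in the running example, so the numbers need not match the figure exactly. All function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, theta):
    """Probability of exactly k successes out of n when P(success) = theta."""
    return comb(n, k) * theta**k * (1 - theta) ** (n - k)


def upper_tail(k, n, theta):
    """P(X >= k)."""
    return sum(binom_pmf(i, n, theta) for i in range(k, n + 1))


def lower_tail(k, n, theta):
    """P(X <= k)."""
    return sum(binom_pmf(i, n, theta) for i in range(0, k + 1))


def critical_cutoff(n, alpha=0.05):
    """Smallest k such that rejecting when X >= k or X <= n - k keeps the
    Type I error rate at or below alpha under theta_0 = 0.5.

    Returns n + 1 when no such region exists (very small samples).
    """
    for k in range(n // 2 + 1, n + 1):
        if 2 * upper_tail(k, n, 0.5) <= alpha:
            return k
    return n + 1


def power(n, theta, alpha=0.05):
    """Probability of rejecting the null when the true parameter is theta."""
    k = critical_cutoff(n, alpha)
    if k > n:
        return 0.0  # no outcome is extreme enough to reject the null
    return upper_tail(k, n, theta) + lower_tail(n - k, n, theta)


for n in (5, 10, 20, 50, 100):
    print(f"N = {n:3d}   power at theta = 0.7: {power(n, 0.7):.3f}")
```

Because $X$ can only take whole-number values, the power computed this way does not rise perfectly smoothly with $N$, but the overall trend is upward.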

Because power is important, it would be pretty useful to know how much power you’re likely to have when you’re contemplating running an experiment. It’s never possible to know for sure since you can’t possibly know what your effect size is. However, it’s sometimes possible to guess how big it should be. If so, you can guess what sample size you need!

This idea is called **power analysis**. If it’s feasible to do it, it’s very helpful since it can tell you whether you have enough time or money to run the experiment successfully. It’s increasingly common to see people arguing that power analysis should be a required part of experimental design, so it’s worth knowing about, but we won’t be discussing it in detail in this book.

## 9.8 Some issues to consider

What we’ve discussed in this chapter is the orthodox framework for null hypothesis significance testing (NHST). Understanding how NHST works is an absolute necessity since it has been the dominant approach to inferential statistics since it rose to prominence in the early 20th century. It’s what the vast majority of working scientists rely on for their data analysis, so even if you hate it, you need to know it. However, the approach is not without problems. There are several quirks in the framework, historical oddities in how it came to be, theoretical disputes over whether or not the framework is proper, and many practical traps for the unwary. Without going into much detail on this topic, it’s worth briefly discussing a few of these issues.

### 9.8.1 Neyman versus Fisher

The first thing you should be aware of is that orthodox NHST is a mash-up of two somewhat different approaches to hypothesis testing, one proposed by Sir Ronald Fisher and the other proposed by Jerzy Neyman (for a historical summary, see Lehmann, 2011). The history is messy because Fisher and Neyman were real people whose opinions changed over time, and at no point did either of them offer “the definitive statement” of how we should interpret their work many decades later. That said, here’s a quick summary of these two approaches.

First, let’s talk about Fisher’s approach. Fisher assumed that you only had one hypothesis (the null), and what you want to do is find out if the null hypothesis is inconsistent with the data. From his perspective, what you should do is check to see if the data are “sufficiently unlikely” according to the null. In fact, if you remember back to our earlier discussion, that’s how Fisher defines the $p$-value. According to Fisher, if the null hypothesis provided a very poor account of the data, you could safely reject it. But, since you don’t have any other hypotheses to compare it to, there’s no way of “accepting the alternative” because you don’t necessarily have an explicitly stated alternative. That’s more or less all that there was to it.

In contrast, Neyman thought that the point of hypothesis testing was as a guide to action, and his approach was somewhat more formal than Fisher’s. His view was that there are multiple things you could *do* (accept the null or accept the alternative), and the point of the test was to tell you which one the data support. From this perspective, it is critical to specify your alternative hypothesis properly. If you don’t know the alternative hypothesis, then you don’t know how powerful the test is or which action makes sense. His framework genuinely requires competition between different hypotheses. For Neyman, the $p$ value didn’t directly measure the probability of the data (or data more extreme) under the null; it was more of an abstract description about which “possible tests” were telling you to accept the null and which “possible tests” were telling you to accept the alternative.

As you can see, what we have today is an odd mishmash of the two. We talk about having both a null hypothesis and an alternative (Neyman), but we usually define the $p$ value in terms of extreme data (Fisher), and yet we still have $\alpha$ values (Neyman). Some statistical tests have explicitly specified alternatives (Neyman), but others are quite vague about it (Fisher). And, according to some people, we’re not allowed to talk about accepting the alternative (Fisher). It’s a mess.

### 9.8.2 Bayesians versus frequentists

Earlier in this chapter, we emphasised how you *cannot* interpret the $p$ value as the probability that the null hypothesis is true. NHST is fundamentally a frequentist tool (see Chapter 7), and as such, it does not allow you to assign probabilities to hypotheses: the null hypothesis is either true or not. The Bayesian approach to statistics interprets probability as a degree of belief, so it’s totally okay to say that there is a 10% chance that the null hypothesis is true: that’s just a reflection of your degree of confidence in this hypothesis. You aren’t allowed to do this within the frequentist approach. Remember, if you’re a frequentist, a probability can only be defined in terms of what happens after a large number of independent replications (i.e. a long-run frequency). If this is your interpretation of probability, talking about the “probability” that the null hypothesis is true is complete gibberish.

Most importantly, this *isn’t* purely an ideological matter. If you decide that you are a Bayesian and are okay with making probability statements about hypotheses, you *must* follow the Bayesian rules for calculating those probabilities. We’ll talk more about this in Chapter 15, but for now, understand that the $p$ value is a *terrible* approximation to the probability that $H_0$ is true. If what you want to know is the probability of the null, then the $p$ value is not what you’re looking for!

### 9.8.3 Traps

As you can see, the theory behind hypothesis testing is a mess, and even now, there are arguments in statistics about how it “should” work. However, disagreements among statisticians are not our real concern here. Our real concern is practical data analysis. And while the “orthodox” approach to null hypothesis significance testing has many drawbacks, even an unrepentant Bayesian would agree that these tests can be useful if used responsibly. They usually give sensible answers, and you can use them to learn interesting things. Setting aside the various ideologies and historical confusions that we’ve discussed, the fact remains that the biggest danger in all of statistics is *thoughtlessness*. Not stupidity but thoughtlessness. The rush to interpret a result without thinking through what each test says about the data and checking whether that’s consistent with how you’ve interpreted it. That’s where the biggest trap lies.

To give an example, consider the following (see Gelman & Stern, 2006). Suppose we run the ESP study, and we’ve decided to analyse the data separately for the male and female participants. Of the male participants, 33 out of 50 guessed the card’s colour correctly. This is a significant effect ($p = .03$). Of the female participants, 29 out of 50 guessed correctly. This is not a significant effect ($p = .32$). Upon observing this, it is extremely tempting for people to start wondering why there is a difference between males and females regarding their psychic abilities. However, this doesn’t seem right. If you think about it, we haven’t *actually* run a test that explicitly compares males to females. All we have done is compare males to chance (the binomial test was significant) and compare females to chance (the binomial test was non-significant). If we want to argue that there is a real difference between the males and the females, we should probably run a test of the null hypothesis that there is no difference! We can do that using a different hypothesis test, but when we do that, it turns out that we have no evidence that males and females are significantly different ($p = .54$). *Now* do you think that there’s anything fundamentally different between the two groups? Of course not. What’s happened here is that the data from both groups (male and female) are pretty borderline: by pure chance, one of them happened to end up on the magic side of the $p = .05$ line, and the other one didn’t. That doesn’t imply that males and females are different. This mistake is so common that you should always be wary of it: the difference between significant and not-significant is *not* evidence of a real difference – if you want to say that there’s a difference between two groups, then you have to test for that difference!
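The three $p$ values in this example can be checked with a short Python sketch that uses only the standard library. The two binomial tests follow directly from the description above. The chapter does not say which test was used to compare the groups, so the sketch assumes a chi-square test of independence with Yates’ continuity correction on the 2×2 table of correct and incorrect guesses; this is a guess, although it does give a value close to the one quoted. The function names are hypothetical.

```python
from math import comb, erfc, sqrt


def binomial_two_sided_p(x, n):
    """Exact two-sided binomial test against theta_0 = 0.5.

    With theta_0 = 0.5 the null distribution is symmetric, so the
    two-sided p value is twice the smaller tail probability (capped at 1).
    """
    upper = sum(comb(n, k) for k in range(x, n + 1)) / 2**n
    lower = sum(comb(n, k) for k in range(0, x + 1)) / 2**n
    return min(1.0, 2 * min(upper, lower))


def chi_square_2x2_p(a, b, c, d):
    """p value for a chi-square test of independence on the table
    [[a, b], [c, d]], with Yates' continuity correction (an assumed choice).

    With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += max(0.0, abs(observed[i][j] - expected) - 0.5) ** 2 / expected
    return erfc(sqrt(stat / 2))


p_male = binomial_two_sided_p(33, 50)            # males versus chance
p_female = binomial_two_sided_p(29, 50)          # females versus chance
p_difference = chi_square_2x2_p(33, 17, 29, 21)  # males versus females

print(f"males vs chance:   p = {p_male:.3f}")
print(f"females vs chance: p = {p_female:.3f}")
print(f"males vs females:  p = {p_difference:.3f}")
```

Only the third test addresses the question of whether the two groups differ; the first two each compare a single group to chance.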

Think about *what* it is you want to test, *why* you want to test it, and whether or not the answers that your test gives could possibly make any sense in the real world.

## 9.9 Summary

Null hypothesis testing is one of the most ubiquitous elements of statistical theory. The vast majority of scientific papers report the results of some hypothesis test or another. As a consequence, it is almost impossible to get by in science without having at least a cursory understanding of what a $p$-value means, making this one of the most important chapters in the book. Here is a quick recap of the key ideas that we’ve talked about:

- Research hypotheses and statistical hypotheses; null and alternative hypotheses (Section 9.1)
- Type I and Type II errors (Section 9.2)
- Test statistics and sampling distributions (Section 9.3)
- Hypothesis testing as a decision-making process (Section 9.4)
- $p$-values as “soft” decisions (Section 9.5)
- Writing up the results of a hypothesis test (Section 9.6)
- Effect size and power (Section 9.7)
- A few issues to consider regarding hypothesis testing (Section 9.8)

Later in the book, in Chapter 15, we’ll revisit the theory of null hypothesis tests from a Bayesian perspective and introduce a number of new tools that you can use if you aren’t particularly fond of the orthodox approach. For now, though, we’re done with the abstract statistical theory, and we can start discussing specific data analysis tools.

### References

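Cohen, J. (1988). *Statistical power analysis for the behavioral sciences* (2nd ed.). Lawrence Erlbaum.

Ellis, P. D. (2010). *The essential guide to effect sizes: Statistical power, meta-analysis, and the interpretation of research results*. Cambridge University Press.

Gelman, A., & Stern, H. (2006). The difference between “significant” and “not significant” is not itself statistically significant. *The American Statistician*, *60*, 328–331.

Lehmann, E. L. (2011). *Fisher, Neyman, and the creation of classical statistics*. Springer.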

^{35} My apologies to anyone who actually believes in this stuff, but on my reading of the literature on ESP, it’s just not reasonable to think this is real. To be fair, though, some of the studies are rigorously designed; so it’s actually an interesting area for thinking about psychological research design. And of course, it’s a free country, so you can spend your own time and effort proving me wrong if you like, but I wouldn’t think that’s a terribly practical use of your intellect. – Danielle

^{36} Strictly speaking, the test has $\alpha = .057$, which is a bit too generous. However, if we’d chosen 39 and 61 as the boundaries for the critical region, then the critical region only covers 3.5% of the distribution. For the sake of the example, we’re willing to tolerate a 5.7% Type I error rate since that’s as close as we can get to a value of $\alpha = .05$.

^{37} That’s $p = .000000000000000000000000136$ for folks that don’t like or know scientific notation!

^{38} Although, in practice, a very small effect size is worrying because even very minor methodological flaws might be responsible for the effect. And in practice, no experiment is perfect, so there are always methodological issues to worry about.

^{39} Notice that the true population parameter $\theta$ doesn’t necessarily correspond to an immutable fact of nature. In this context, $\theta$ is just the true probability that people would correctly guess the colour of the card in the other room. As such, the population parameter can be influenced by all sorts of things. Of course, this is all on the assumption that ESP actually exists!