Chapter 10 Categorical data analysis
Now that we’ve got the basic theory behind hypothesis testing, it’s time to start looking at specific tests commonly used in psychology. We’ll start with “$\chi^2$ tests” (pronounced “chi-square”) in this chapter and “$t$-tests” (Chapter 11) in the next one. Both of these tools are very frequently used in scientific practice for comparing groups. While they’re not as powerful as “analysis of variance” (Chapter 12) and “regression” (Chapter 14), they’re much easier to understand.
“Categorical data” is the term preferred by data analysis people, but it’s just another name for “nominal scale data”. To refresh your memory on data types, please revisit our introductory chapter on scales of measurement and types of variables (see Chapter 2.2).
In any case, categorical data analysis refers to a collection of tools that you can use when your data are nominal scale. There are many tools for categorical data analysis, but this chapter covers the ones used by CogStat along with some of the more common ones.
10.1 The $\chi^2$ goodness-of-fit test
The $\chi^2$ goodness-of-fit test is one of the oldest hypothesis tests around: it was invented by Karl Pearson around the turn of the twentieth century (Pearson, 1900), with some corrections made later by Sir Ronald Fisher (Fisher, 1922a). Let’s start with some psychology to introduce the statistical problem it addresses.
Over the years, there have been a lot of studies showing that humans have a lot of difficulties in simulating randomness. Try as we might to “act” random, we think in terms of patterns and structure, and so when asked to “do something at random”, what people do is anything but random. Consequently, the study of human randomness (or non-randomness, as the case may be) opens up a lot of deep psychological questions about how we think about the world. With this in mind, let’s consider a very simple study. Suppose we asked people to imagine a shuffled deck of cards and mentally pick one card from this imaginary deck “at random”. After they’ve chosen one card, we ask them to select a second one mentally. For both choices, we’re going to look at the suit (hearts, clubs, spades or diamonds) that people chose. After asking, say, $N=200$ people to do this, we’d like to look at the data and figure out whether or not the cards that people pretended to select were random. The data are contained in the cards.csv file, which we will load into CogStat. For the moment, let’s just focus on the first choice that people made (choice_1).
Important note: CogStat currently doesn’t support single-variable hypothesis testing for nominal scale data. However, this chapter will still be useful for you to understand the tools used in hypothesis testing, and you can use them as described here in other software packages.
We can see that the data are nominal scale, so we’ll use the $\chi^2$ goodness-of-fit test to analyze them. We’ll also use the “Fisher’s exact test” option, which is a more powerful version of the $\chi^2$ test that is appropriate when the sample size is small (less than 40). We’ll also use the “Bonferroni correction” option, which is a way of correcting for multiple comparisons. For now, let’s just run the analysis.
That little frequency table in Figure 10.1 is quite helpful. Looking at it, there’s a bit of a hint that people might be more likely to select hearts than clubs, but it’s not completely obvious just from looking at it whether that’s really true, or if this is just due to chance. So we’ll probably have to do some kind of statistical analysis to find out, which is what we’re going to talk about in the next section.
A quick sidenote here: the mathematical notation for an observation (i.e. an element in the data set) is $O_i$, where $O$ stands for observation (but could very well be the traditional $X$ or $Y$ etc.) and $i$ is the index of the observation. So $O_1$ is the first observation, $O_2$ is the second observation, and so on.
10.1.1 The null hypothesis and the alternative hypothesis
Our research hypothesis is that “people don’t choose cards randomly”. What we’re going to want to do now is translate this into some statistical hypotheses, and construct a statistical test of those hypotheses. The test is Pearson’s $\chi^2$ goodness of fit test.
As is so often the case, we have to begin by carefully constructing our null hypothesis. In this case, it’s pretty easy. First, let’s state the null hypothesis in words.
Null hypothesis ($H_0$): All four suits are chosen with equal probability.
Now, because this is statistics, we have to be able to say the same thing mathematically. Let’s use the notation $P_j$ to refer to the true probability that the $j$th suit is chosen. If the null hypothesis is true, then each of the four suits has a 25% chance of being selected: in other words, our null hypothesis claims that $P_1 = 0.25$, $P_2 = 0.25$, $P_3 = 0.25$ and finally that $P_4 = 0.25$. We can use $P$ to refer to the probabilities corresponding to our null hypothesis. So if we let the vector $P = (P_1, P_2, P_3, P_4)$ refer to the collection of probabilities that describe our null hypothesis, then we have
$H_0: {P} = (0.25, 0.25, 0.25, 0.25)$
If the experimental task were for people to imagine they were drawing from a deck that had twice as many clubs as any other suit, then the null hypothesis would correspond to something like $P = (0.4, 0.2, 0.2, 0.2)$. As long as the probabilities are all positive numbers, and they all sum to 1, then it’s a perfectly legitimate choice for the null hypothesis. However, the most common use of the goodness of fit test is to test a null hypothesis that all categories are equally likely, so we’ll stick to that for our example.
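To see where a vector such as $(0.4, 0.2, 0.2, 0.2)$ comes from, here is a minimal Python sketch that turns relative suit weights into null probabilities and enforces the two conditions just mentioned (all positive, summing to 1). The function name `null_probabilities` and the weight lists are illustrative assumptions, not part of CogStat.

```python
def null_probabilities(weights):
    """Normalise positive relative weights so that they sum to 1."""
    if any(w <= 0 for w in weights):
        raise ValueError("all weights must be positive")
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical deck with twice as many clubs as any other suit
# (order assumed here: clubs, diamonds, hearts, spades).
p_clubs_heavy = null_probabilities([2, 1, 1, 1])

# The usual "all suits equally likely" null hypothesis.
p_uniform = null_probabilities([1, 1, 1, 1])

print(p_clubs_heavy)
print(p_uniform)
```

Any other set of positive weights would give an equally legitimate null hypothesis; the uniform case is simply the one used in the rest of this section.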
What about our alternative hypothesis, $H_1$? We’re interested in demonstrating that the probabilities involved aren’t all identical (that is, people’s choices weren’t entirely random). As a consequence, the “human-friendly” versions of our hypotheses look like this:
Null hypothesis ($H_0$): All four suits are chosen with equal probability.
Alternative hypothesis ($H_1$): At least one of the suit-choice probabilities isn’t 0.25.
and the “mathematician friendly” version is
| $H_0$ | $H_1$ |
|---|---|
| $P = (0.25, 0.25, 0.25, 0.25)$ | $P \neq (0.25, 0.25, 0.25, 0.25)$ |
10.1.2 The “goodness of fit” test statistic
What we now want to do is construct a test of the null hypothesis. As always, if we want to test $H_0$ against $H_1$, we will need a test statistic. The basic trick that a goodness of fit test uses is to construct a test statistic that measures how “close” the data are to the null hypothesis. If the data don’t resemble what you’d “expect” to see if the null hypothesis were true, then it probably isn’t true.
So, what would we expect to see if the null hypothesis were true? Or, to use the correct terminology, what are the expected frequencies?
There are $N=200$ observations, and (if the null is true) the probability of any one of them choosing a heart is $P_3 = 0.25$, so we’re expecting $200 \times 0.25 = 50$ hearts, right? Or, more specifically, if we let $E_i$ refer to “the number of category $i$ responses that we’re expecting if the null is true”, then $E_i = N \times P_i$
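As a quick check of $E_i = N \times P_i$, the following Python sketch computes the expected frequencies for the cards example. The variable names and the suit order are assumptions made for illustration; the numbers come from the text.

```python
N = 200                                    # sample size from the text
suits = ["clubs", "diamonds", "hearts", "spades"]
null_p = [0.25, 0.25, 0.25, 0.25]          # H0: every suit equally likely

# E_i = N * P_i for each category
expected = {suit: N * p for suit, p in zip(suits, null_p)}

print(expected)
```

Under this null hypothesis every suit has the same expected frequency, but the same line of code works for any valid vector of null probabilities.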
Clearly, what we want to do is compare the expected number of observations in each category ($E_i$) with the observed number of observations in that category ($O_i$). And on the basis of this comparison, we ought to be able to come up with a good test statistic. To start with, let’s calculate the difference between what the null hypothesis expected us to find and what we actually did find. That is, we calculate the “observed minus expected” difference score, $O_i - E_i$. This is illustrated in the following table.
| | | $\clubsuit$ | $\diamondsuit$ | $\heartsuit$ | $\spadesuit$ |
|---|---|---|---|---|---|
| Expected frequency | $E_i$ | 50 | 50 | 50 | 50 |
| Observed frequency | $O_i$ | 35 | 51 | 64 | 50 |
| Difference score | $O_i - E_i$ | -15 | 1 | 14 | 0 |
It’s clear that people chose more hearts and fewer clubs than the null hypothesis predicted. However, a moment’s thought suggests that these raw differences aren’t quite what we’re looking for. Intuitively, it feels like it’s just as bad when the null hypothesis predicts too few observations (which is what happened with hearts) as it is when it predicts too many (which is what happened with clubs). So it’s a bit weird that we have a negative number for clubs and a positive number for hearts.
One easy way to fix this is to square everything so that we now calculate the squared differences, $(O_i - E_i)^2$.
| | | $\clubsuit$ | $\diamondsuit$ | $\heartsuit$ | $\spadesuit$ |
|---|---|---|---|---|---|
| Expected frequency | $E_i$ | 50 | 50 | 50 | 50 |
| Observed frequency | $O_i$ | 35 | 51 | 64 | 50 |
| Difference score | $O_i - E_i$ | -15 | 1 | 14 | 0 |
| Squared differences | $\left(O_i - E_i\right)^2$ | 225 | 1 | 196 | 0 |
Now we’re making progress. Now, we’ve got a collection of numbers that are big whenever the null hypothesis makes a lousy prediction (clubs and hearts) but small whenever it makes a good one (diamonds and spades).
Next, let’s also divide all these numbers by the expected frequency $E_i$, so we’re calculating $\frac{(O_i - E_i)^2}{E_i}$. Since $E_i = 50$ for all categories in our example, it’s not a very interesting calculation, but let’s do it anyway.
| | | $\clubsuit$ | $\diamondsuit$ | $\heartsuit$ | $\spadesuit$ |
|---|---|---|---|---|---|
| Expected frequency | $E_i$ | 50 | 50 | 50 | 50 |
| Observed frequency | $O_i$ | 35 | 51 | 64 | 50 |
| Difference score | $O_i - E_i$ | -15 | 1 | 14 | 0 |
| Squared differences | $\left(O_i - E_i\right)^2$ | 225 | 1 | 196 | 0 |
| Squared differences divided by expected frequency | $\frac{\left(O_i - E_i\right)^2}{E_i}$ | 4.5 | 0.02 | 3.92 | 0 |
In effect, what we’ve got here are four different “error” scores, each one telling us how big a “mistake” the null hypothesis made when we tried to use it to predict our observed frequencies. So, in order to convert this into a useful test statistic, one thing we could do is just add these numbers up. We get $X^2 = 4.5 + 0.02 + 3.92 + 0 = 8.44$.
The result is called the goodness of fit statistic, conventionally referred to either as $X^2$ or GOF. If we let $k$ refer to the total number of categories (i.e. $k=4$ for our cards data), then the $X^2$ statistic is given by the following formula: $X^2 = \sum_{i=1}^k \frac{(O_i - E_i)^2}{E_i}$
Intuitively, it’s clear that if $X^2$ is small, then the observed data $O_i$ are very close to what the null hypothesis predicted $E_i$, so we’re going to need a large $X^2$ statistic in order to reject the null. As we’ve seen from our calculations, we’ve got a value of $X^2 = 8.44$ in our cards data set. So now the question becomes, is this a big enough value to reject the null?
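The calculation in the tables above can be reproduced in a few lines. This is a minimal Python sketch using the observed counts from the text; the function name `gof_statistic` is an illustrative assumption.

```python
observed = {"clubs": 35, "diamonds": 51, "hearts": 64, "spades": 50}
N = sum(observed.values())
expected = {suit: N * 0.25 for suit in observed}   # H0: all suits equally likely

def gof_statistic(observed, expected):
    """Sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)

x2 = gof_statistic(observed, expected)
print(round(x2, 2))  # 8.44
```

Because each term is squared before it is summed, over-predictions and under-predictions both push the statistic upwards, which is exactly the behaviour argued for above.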
10.1.3 The sampling distribution of the GOF statistic (advanced)
To determine whether or not a particular value of $X^2$ is large enough to justify rejecting the null hypothesis, we will need to figure out what the sampling distribution for $X^2$ would be if the null hypothesis were true. If you want to cut to the chase and are willing to take it on faith that the sampling distribution is a chi-squared ($\chi^2$) distribution with $k-1$ degrees of freedom, you can skip the rest of this section. However, if you want to understand why the goodness of fit test works the way it does, read on.
Let’s suppose that the null hypothesis is true. If so, then the true probability that an observation falls in the $i$th category is $P_i$. After all, that’s the definition of our null hypothesis. If you think about it, this is kind of like saying that “nature” decides whether or not the observation ends up in category $i$ by flipping a weighted coin (i.e. one where the probability of getting a head is $P_i$). Therefore, we can think of our observed frequency $O_i$ by imagining that nature flipped $N$ of these coins (one for each observation in the data set), and exactly $O_i$ of them came up heads. Obviously, this is a pretty weird way to think about the experiment. But it should remind you that we’ve seen this scenario before: it’s exactly the same setup that gave rise to the binomial distribution in Chapter 7.4.1. In other words, if the null hypothesis is true, then it follows that our observed frequencies were generated by sampling from a binomial distribution: $O_i \sim \mbox{Binomial}(P_i, N)$. Now, if you remember from our discussion of the central limit theorem (Section 8.3.3), the binomial distribution starts to look pretty much identical to the normal distribution, especially when $N$ is large and when $P_i$ isn’t too close to 0 or 1.
In other words, as long as $N \times P_i$ is large enough (or, to put it another way, when the expected frequency $E_i$ is large enough), the theoretical distribution of $O_i$ is approximately normal. Better yet, if $O_i$ is normally distributed, then so is $(O_i - E_i)/\sqrt{E_i}$: since $E_i$ is a fixed value, subtracting off $E_i$ and dividing by $\sqrt{E_i}$ changes only the mean and standard deviation of the normal distribution, not its shape.
Okay, so now let’s have a look at what our goodness of fit statistic actually is. What we’re doing is taking a bunch of things that are normally distributed, squaring them, and adding them up. As we discussed in Chapter 7.4.4, when you take a bunch of things that have a standard normal distribution (i.e. mean 0 and standard deviation 1), square them, then add them up, then the resulting quantity has a chi-square distribution. So now we know that the null hypothesis predicts that the sampling distribution of the goodness of fit statistic is a chi-square distribution.
There’s one last detail to talk about, namely the degrees of freedom. If you remember back to Chapter 7.4.4, if the number of things you’re adding up is $k$, then the degrees of freedom for the resulting chi-square distribution is $k$. Yet, at the start of this section, we said that the actual degrees of freedom for the chi-square goodness of fit test is $k-1$. What’s up with that? The answer here is that what we’re supposed to be looking at is the number of genuinely independent things that are getting added together. And, even though there are $k$ things that we’re adding, only $k-1$ of them are truly independent; and so the degrees of freedom are actually only $k-1$.
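One informal way to convince yourself of this result is to simulate it. The Python sketch below repeatedly generates $N = 200$ card choices under the null hypothesis, computes $X^2$ each time, and compares the simulated values against two properties of a chi-square distribution with $k - 1 = 3$ degrees of freedom: its mean (which equals its degrees of freedom) and its 95th percentile (7.814728, the critical value used later in this chapter). The seed, the number of replications and the function name `simulate_x2` are arbitrary assumptions made for illustration.

```python
import random

def simulate_x2(n_obs=200, k=4, rng=random):
    """Draw one data set under H0 (all k categories equally likely) and return X^2."""
    counts = [0] * k
    for _ in range(n_obs):
        counts[rng.randrange(k)] += 1      # each observation picks a category uniformly
    expected = n_obs / k
    return sum((o - expected) ** 2 / expected for o in counts)

rng = random.Random(2024)                  # fixed seed, chosen arbitrarily
sims = [simulate_x2(rng=rng) for _ in range(2000)]

mean_x2 = sum(sims) / len(sims)
share_above_crit = sum(x > 7.814728 for x in sims) / len(sims)

print(f"mean of simulated X^2: {mean_x2:.2f} (a chi-square with 3 df has mean 3)")
print(f"share above 7.81:      {share_above_crit:.3f} (should be near 0.05)")
```

If the four terms were truly independent, the mean would be close to 4 rather than 3; the simulation is only a sanity check, not a proof.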
10.1.4 Degrees of freedom
When discussing the chi-square distribution in Chapter 7.4.4, we didn’t elaborate on what “degrees of freedom” actually mean. Looking at Figure 10.2, you can see that if we change the degrees of freedom, then the chi-square distribution changes shape substantially. But what exactly is it? It’s the number of “normally distributed variables” that we are squaring and adding together. But, for most people, that’s kind of abstract and not entirely helpful. What we really need to do is try to understand degrees of freedom in terms of our data. So here goes.
The basic idea behind degrees of freedom is quite simple: you calculate it by counting up the number of distinct “quantities” that are used to describe your data; and then subtracting off all of the “constraints” that those data must satisfy.^{40} This is a bit vague, so let’s use our cards.csv data as a concrete example.
We describe our data using four numbers, $O_1$, $O_2$, $O_3$ and $O_4$ corresponding to the observed frequencies of the four different categories (hearts, clubs, diamonds, spades). These four numbers are the random outcomes of our experiment. But, the experiment has a fixed constraint built into it: the sample size $N$.^{41} That is, if we know how many people chose hearts, how many chose diamonds and how many chose clubs, then we’d be able to figure out exactly how many chose spades. In other words, although our data are described using four numbers, they only actually correspond to $4 - 1 = 3$ degrees of freedom. A slightly different way of thinking about it is to notice that there are four probabilities that we’re interested in (again, corresponding to the four different categories), but these probabilities must sum to one, which imposes a constraint. Therefore, the degrees of freedom is $4 - 1 = 3$. Regardless of whether you want to think about it in terms of the observed frequencies or in terms of the probabilities, the answer is the same. In general, when running the chi-square goodness of fit test for an experiment involving $k$ groups, the degrees of freedom will be $k - 1$.
10.1.5 Testing the null hypothesis
The final step in constructing our hypothesis test is to figure out what the rejection region is. That is, what values of $X^2$ would lead us to reject the null hypothesis? As we saw earlier, large values of $X^2$ imply that the null hypothesis has done a poor job of predicting the data from our experiment, whereas small values of $X^2$ imply that it’s actually done pretty well. Therefore, a pretty sensible strategy would be to say there is some critical value, such that if $X^2$ is bigger than the critical value, we reject the null; but if $X^2$ is smaller than this value, we retain the null.
In other words, to use the language we introduced in Chapter 9, the chi-squared goodness of fit test is always a one-sided test. If we want our test to have a significance level of $\alpha = .05$ (that is, we are willing to tolerate a Type I error rate of 5%), then we have to choose our critical value so that there is only a 5% chance that $X^2$ could get to be that big if the null hypothesis is true. This means that we want the 95th percentile of the sampling distribution, as illustrated in Figure 10.3. So if our $X^2$ statistic is bigger than 7.814728, then we can reject the null hypothesis. Since the value we calculated earlier ($X^2 = 8.44$) is larger than this, we can reject the null.
The corresponding $p$-value is 0.03774185. This is the probability of getting a value of $X^2$ as big as 8.44, or bigger, if the null hypothesis is true. Since this is less than our significance level of $\alpha = .05$, we can reject the null hypothesis.
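Both numbers can be checked without special software. For exactly 3 degrees of freedom, the upper-tail probability of the chi-square distribution has a closed form, $P(X^2 \geq x) = \mathrm{erfc}(\sqrt{x/2}) + \sqrt{2x/\pi}\, e^{-x/2}$, which the Python sketch below uses. The function names are illustrative assumptions, and the formula is specific to $df = 3$; for other degrees of freedom a statistics library (for example `scipy.stats.chi2`) would be the usual choice.

```python
import math

def chi2_sf_df3(x):
    """P(X^2 >= x) for a chi-square distribution with exactly 3 degrees of freedom."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

def chi2_critical_df3(alpha, lo=0.0, hi=100.0):
    """Find x with P(X^2 >= x) = alpha by bisection (the tail probability decreases in x)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_sf_df3(mid) > alpha:
            lo = mid       # tail still too heavy: move right
        else:
            hi = mid
    return (lo + hi) / 2

p_value = chi2_sf_df3(8.44)
critical = chi2_critical_df3(0.05)

print(f"p = {p_value:.4f}, critical value = {critical:.4f}")
```

The two printed values should agree with the $p$-value and critical value quoted in the text, up to rounding.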
And that’s it, basically. You now know Pearson’s $\chi^2$ test for the goodness of fit.
10.1.6 How to report the results of the test
If we wanted to write this result up for a paper or something, the conventional way to report this would be to write something like this:
Of the 200 participants in the experiment, 64 selected hearts for their first choice, 51 selected diamonds, 50 selected spades, and 35 selected clubs. A chi-square goodness of fit test was conducted to test whether the choice probabilities were identical for all four suits. The results were significant ($\chi^2(3) = 8.44, p<.05$), suggesting that people did not select suits purely at random.
This is pretty straightforward, and hopefully it seems pretty unremarkable. There are a few things that you should note about this description:
- The statistical test is preceded by descriptive statistics. That is, we told the reader something about what the data looked like before going on to do the test. In general, this is good practice: remember that your reader doesn’t know your data anywhere near as well as you do. So unless you describe it to them adequately, the statistical tests won’t make sense to them.
- The description tells you what the null hypothesis being tested is. Writers don’t always do this, but it’s often a good idea in those situations where some ambiguity exists; or when you can’t rely on your readership being intimately familiar with the statistical tools you’re using. Quite often, the reader might not know (or remember) all the details of the test that you’re using, so it’s a kind of politeness to “remind” them! As far as the goodness of fit test goes, you can usually rely on a scientific audience knowing how it works (since it’s covered in most intro stats classes). However, it’s still a good idea to explicitly state the null hypothesis (briefly!) because the null hypothesis can differ depending on your test. For instance, in the cards example our null hypothesis was that all four suit probabilities were identical (i.e. $P_1 = P_2 = P_3 = P_4 = 0.25$), but there’s nothing special about that hypothesis. We could just as easily have tested the null hypothesis that $P_1 = 0.7$ and $P_2 = P_3 = P_4 = 0.1$ using a goodness of fit test. So it’s helpful to the reader to explain your null hypothesis to them. Also, we described the null hypothesis in words, not in maths. That’s perfectly acceptable. You can describe it in maths if you like, but since most readers find words easier to read than symbols, most writers tend to describe the null using words if they can.
- A “stat block” is included. When reporting the results of the test itself, we didn’t just say that the result was significant; we included a “stat block” (i.e. the dense mathematical-looking part in the parentheses), which reports all the “raw” statistical data. For the chi-square goodness of fit test, the information that gets reported is the test statistic (that the goodness of fit statistic was 8.44), the information about the distribution used in the test ($\chi^2$ with 3 degrees of freedom, which is usually shortened to $\chi^2(3)$), and then the information about whether the result was significant (in this case $p<.05$). The particular information that needs to go into the stat block is different for every test, and so each time we introduce a new test, we’ll show you what the stat block should look like.
- The results are interpreted. In addition to indicating that the result was significant, we provided an interpretation of the result (i.e. that people didn’t choose randomly). This is also a kindness to the reader because it tells them what they should believe about your data. If you don’t include something like this, it’s tough for your reader to understand what’s going on.^{42}
As with everything else, your overriding concern should be that you explain things to your reader.
10.1.7 A comment on statistical notation (advanced)
If you’ve been reading very closely, there is one thing about how we wrote up the chi-square test in the last section that might be bugging you a little bit. There’s something that feels a bit wrong with writing “$\chi^2(3) = 8.44$”, you might be thinking. After all, it’s the goodness of fit statistic that is equal to 8.44, so shouldn’t we have written $X^2 = 8.44$ or maybe GOF$=8.44$? This seems to be conflating the sampling distribution (i.e. $\chi^2$ with $df = 3$) with the test statistic (i.e. $X^2$). You may have figured it was a typo since $\chi$ and $X$ look pretty similar. Oddly, it’s not. Writing $\chi^2(3) = 8.44$ is essentially a highly condensed way of writing “the sampling distribution of the test statistic is $\chi^2(3)$, and the value of the test statistic is 8.44”.
In one sense, this is kind of stupid. There are lots of different test statistics out there that have a chi-square sampling distribution: the $X^2$ statistic that we’ve used for our goodness of fit test is only one of many (albeit one of the most commonly encountered ones). In a sensible, perfectly organised world, we’d always have a separate name for the test statistic and the sampling distribution: that way, the stat block itself would tell you precisely what it was that the researcher had calculated. Sometimes this happens.
For instance, the test statistic used in the Pearson goodness of fit test is written $X^2$; but there’s a closely related test known as the $G$-test^{43} (Sokal & Rohlf, 1994), in which the test statistic is written as $G$. As it happens, the Pearson goodness of fit test and the $G$-test both test the same null hypothesis; and the sampling distribution is exactly the same (i.e. chi-square with $k-1$ degrees of freedom). If we’d done a $G$-test for the cards data rather than a goodness of fit test, then we’d have ended up with a test statistic of $G = 8.65$, which is slightly different from the $X^2 = 8.44$; and produces a slightly smaller $p$-value of $p = .034$. Suppose that the convention was to report the test statistic, then the sampling distribution, and then the $p$-value. If that were true, then these two situations would produce different stat blocks: the original result would be written $X^2 = 8.44, \chi^2(3), p = .038$, whereas the new version using the $G$-test would be written as $G = 8.65, \chi^2(3), p = .034$. However, using the condensed reporting standard, the original result is written $\chi^2(3) = 8.44, p = .038$, and the new one is written $\chi^2(3) = 8.65, p = .034$, and so it’s unclear which test was actually run.
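To make the difference between the two statistics concrete, the Python sketch below computes both for the cards data. The text does not spell out the formula for $G$; the sketch assumes the standard likelihood-ratio definition, $G = 2 \sum_i O_i \ln(O_i / E_i)$, and the variable names are illustrative.

```python
import math

observed = [35, 51, 64, 50]   # clubs, diamonds, hearts, spades
expected = [50, 50, 50, 50]   # H0: all suits equally likely, N = 200

# Pearson's statistic, as defined earlier in the chapter
x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Likelihood-ratio statistic (assumed standard definition of the G-test);
# categories with a zero count contribute nothing to the sum
g = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

print(f"X^2 = {x2:.2f}, G = {g:.2f}")
```

Both statistics are referred to the same chi-square distribution with 3 degrees of freedom, which is why a condensed stat block alone cannot tell them apart.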
So why don’t we live in a world where the stat block’s contents uniquely specify what tests were run? Any test statistic that follows a $\chi^2$ distribution is commonly called a “chi-square statistic”; anything that follows a $t$-distribution is called a “$t$-statistic” and so on. But, as the $X^2$ versus $G$ example illustrates, two different things with the same sampling distribution are still, well, different. Consequently, it’s sometimes a good idea to be clear about what the actual test was that you ran, especially if you’re doing something unusual. If you just say “chi-square test”, it’s unclear what test you’re talking about. Although, since the two most common chi-square tests are the goodness of fit test and the independence test (Section 10.2), most readers with stats training can probably guess. Nevertheless, it’s something to be aware of.
10.2 The $\chi^2$ test of independence (or association)
The other day Danielle was watching an animated documentary examining the quaint customs of the natives of the planet Chapek 9. Apparently, in order to gain access to their capital city, a visitor must prove that they’re a robot, not a human. In order to determine whether or not the visitor is human, they ask whether the visitor prefers puppies, flowers or large, properly formatted data files. But what if humans and robots have the same preferences? That probably wouldn’t be a very good test then, would it? In order to determine whether or not a visitor is human, the natives of Chapek 9 need to know whether or not the visitor’s preferences are independent of their species. In other words, they need to know whether or not the visitor’s preferences are associated with their species.
In total, there are 180 entries in the data frame, one for each person (counting both robots and humans as “people”) who was asked to make a choice. Specifically, there are 93 humans and 87 robots.
What we want to do is look at the choices broken down by species. That is, we need to cross-tabulate the data. We cannot use the Pivot table option in CogStat for strings, but we can use the Compare groups option instead. We’ll use the species variable as the grouping variable and the choices variable as the variable to compare.
The overwhelmingly preferred choice is the data file. You can see a visual representation of this in Figure 10.6.
Scrolling down, you can see the descriptives for the groups in the Sample properties section:
Let’s put these results in a table for our discussion on the $\chi^2$ test of independence.
| | Robot | Human | Total |
|---|---|---|---|
| Puppy | 13 | 15 | 28 |
| Flower | 30 | 13 | 43 |
| Data file | 44 | 65 | 109 |
| Total | 87 | 93 | 180 |
It’s quite clear that most humans chose the data file, whereas the robots tended to be a lot more even in their preferences. Leaving aside the question of why humans might be more likely to choose the data file for the moment, first, we must determine if the discrepancy between human choices and robot choices in the data set is statistically significant.
10.2.1 Constructing our hypothesis test
How do we analyse this data manually? Specifically, since our research hypothesis is that “humans and robots answer the question in different ways”, how can we construct a test of the null hypothesis that “humans and robots answer the question the same way”? As before, we begin by establishing some notation to describe the data:
| | Robot | Human | Total |
|---|---|---|---|
| Puppy | $O_{11}$ | $O_{12}$ | $R_{1}$ |
| Flower | $O_{21}$ | $O_{22}$ | $R_{2}$ |
| Data file | $O_{31}$ | $O_{32}$ | $R_{3}$ |
| Total | $C_{1}$ | $C_{2}$ | $N$ |
In this notation, we say that $O_{ij}$ is a count (observed frequency) of the number of respondents that are of species $j$ (robot or human) who answered $i$ (puppy, flower or data) when asked to make a choice. The total number of observations is written $N$, as usual. Finally, $R_i$ denotes the row totals (e.g. $R_1$ is the total number of people who chose the puppy), and $C_j$ denotes the column totals (e.g. $C_1$ is the total number of robots). To use the terminology from another mathematical statistics textbook (Hogg et al., 2005), we should technically refer to this situation as a chi-square test of homogeneity; and reserve the term chi-square test of independence for the situation where both the row and column totals are random outcomes of the experiment.
So now, let’s think about what the null hypothesis says. If robots and humans are responding in the same way to the question, it means that the probability that “a robot says puppy” is the same as the probability that “a human says puppy”, and so on for the other two possibilities. So, if we use $P_{ij}$ to denote “the probability that a member of species $j$ gives response $i$”, then our null hypothesis is that:
| $H_0$: | All of the following are true: | |
|---|---|---|
| | $P_{11} = P_{12}$ | same probability of saying puppy |
| | $P_{21} = P_{22}$ | same probability of saying flower |
| | $P_{31} = P_{32}$ | same probability of saying data file |
Since the null hypothesis claims that the true choice probabilities don’t depend on the species of the person making the choice, we can let $P_i$ refer to this probability: e.g. $P_1$ is the true probability of choosing the puppy.
Next, in much the same way we did with the goodness of fit test, we need to calculate the expected frequencies. For each of the observed counts $O_{ij}$, we need to figure out what the null hypothesis would tell us to expect. Let’s denote this expected frequency by $E_{ij}$. This time, it’s a little bit trickier. If there are a total of $C_j$ people that belong to species $j$, and the true probability of anyone (regardless of species) choosing option $i$ is $P_i$, then the expected frequency is just: $E_{ij} = C_j \times P_i$
This is all very well and good, but we have a problem. Unlike the situation we had with the goodness of fit test, the null hypothesis doesn’t specify a particular value for $P_i$. It’s something we have to estimate (Chapter 8) from the data! Fortunately, this is pretty easy to do. If 28 out of 180 people selected the puppy, then a natural estimate for the probability of choosing the puppy is $28/180$, which is approximately $.16$. If we phrase this in mathematical terms, what we’re saying is that our estimate for the probability of choosing option $i$ is just the row total divided by the total sample size: $\hat{P}_i = \frac{R_i}{N}$
Therefore, our expected frequency can be written as the product (i.e. multiplication) of the row total and the column total, divided by the total number of observations:^{44} $E_{ij} = \frac{R_i \times C_j}{N}$
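The formula $E_{ij} = R_i \times C_j / N$ can be applied to the Chapek 9 table directly. The Python sketch below is one way to do it; the nested-dictionary layout and variable names are assumptions made for illustration, while the counts are those from the contingency table above.

```python
observed = {
    "puppy":     {"robot": 13, "human": 15},
    "flower":    {"robot": 30, "human": 13},
    "data file": {"robot": 44, "human": 65},
}

# Row totals R_i (one per choice) and column totals C_j (one per species)
row_totals = {choice: sum(cells.values()) for choice, cells in observed.items()}
col_totals = {}
for cells in observed.values():
    for species, count in cells.items():
        col_totals[species] = col_totals.get(species, 0) + count
n = sum(row_totals.values())

# E_ij = R_i * C_j / N
expected = {
    choice: {species: row_totals[choice] * col_totals[species] / n
             for species in col_totals}
    for choice in observed
}

for choice, cells in expected.items():
    print(choice, {species: round(e, 2) for species, e in cells.items()})
```

Note that the expected frequencies need not be whole numbers, and that they add up to the same row and column totals as the observed counts.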
Now that we’ve figured out how to calculate the expected frequencies, it’s straightforward to define a test statistic following the same strategy we used in the goodness of fit test. It’s pretty much the same statistic. For a contingency table with $r$ rows and $c$ columns, the equation that defines our $X^2$ statistic is $X^2 = \sum_{i=1}^r \sum_{j=1}^c \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$. The only difference is that we have to include two summation signs (i.e. $\sum$) to indicate that we’re summing over both rows and columns. As before, large values of $X^2$ suggest that the null hypothesis provides a poor description of the data, whereas small values of $X^2$ indicate that it does a good job of accounting for the data. Therefore, just like last time, we want to reject the null hypothesis if $X^2$ is too large.
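As a sketch of the double summation, the following Python function computes $X^2$ for any table of observed counts, estimating the expected frequencies from the row and column totals as described above. The function name `independence_statistic` and the list-of-rows layout are illustrative assumptions; the counts are the Chapek 9 data.

```python
observed = [
    [13, 15],   # puppy:     robot, human
    [30, 13],   # flower:    robot, human
    [44, 65],   # data file: robot, human
]

def independence_statistic(table):
    """Sum (O_ij - E_ij)^2 / E_ij over all cells, with E_ij = R_i * C_j / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):          # outer sum: rows
        for j, o in enumerate(row):          # inner sum: columns
            e = row_totals[i] * col_totals[j] / n
            x2 += (o - e) ** 2 / e
    return x2

x2 = independence_statistic(observed)
print(f"X^2 = {x2:.2f}")
```

Most of the total comes from the flower row, where robots chose flowers far more often, and humans far less often, than the pooled estimate predicts.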
Not surprisingly, this statistic is $\chi^2$ distributed. All we need to do is figure out how many degrees of freedom are involved, which actually isn’t too hard. You can think of the degrees of freedom as equal to the number of data points you’re analysing minus the number of constraints. A contingency table with $r$ rows and $c$ columns contains a total of $r \times c$ observed frequencies, so that’s the total number of observations.
What about the constraints? Here, it’s slightly trickier. The answer is always the same: $df = (r-1)(c-1)$
But the explanation for why the degrees of freedom take this value is different depending on the experimental design. For the sake of argument, let’s suppose that we had honestly intended to survey exactly 87 robots and 93 humans (column totals fixed by the experimenter) but left the row totals free to vary (row totals are random variables). Let’s think about the constraints that apply here. Well, since we deliberately fixed the column totals, we have $c$ constraints right there. There’s more to it than that. Remember how our null hypothesis had some free parameters (i.e. we had to estimate the $P_i$ values)? Those matter too.
Every free parameter in the null hypothesis is rather like an additional constraint. So, how many of those are there? Well, since these probabilities have to sum to 1, there are only $r-1$ of these. So our total degrees of freedom are: $\begin{array}{rcl} df &=& \mbox{(number of observations)} - \mbox{(number of constraints)} \\ &=& (rc) - (c + (r-1)) \\ &=& rc - c - r + 1 \\ &=& (r - 1)(c - 1) \end{array}$
Alternatively, suppose that the only thing that the experimenter fixed was the total sample size $N$. That is, we quizzed the first 180 people that we saw, and it just turned out that 87 were robots and 93 were humans. This time around, our reasoning would be slightly different but would still lead us to the same answer. Our null hypothesis still has $r-1$ free parameters corresponding to the choice probabilities, but it now also has $c-1$ free parameters corresponding to the species probabilities because we’d also have to estimate the probability that a randomly sampled person turns out to be a robot.^{45} Finally, since we did fix the total number of observations $N$, that’s one more constraint. So now we have $rc$ observations and $(c-1) + (r-1) + 1$ constraints. What does that give? $\begin{array}{rcl} df &=& \mbox{(number of observations)} - \mbox{(number of constraints)} \\ &=& rc - ( (c-1) + (r-1) + 1) \\ &=& rc - c - r + 1 \\ &=& (r - 1)(c - 1) \end{array}$ Amazing.
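If you’d like to convince yourself that both counting arguments really do land on the same answer, a small Python check along these lines may help. The function names are hypothetical and exist only for this illustration.

```python
def df_fixed_columns(r, c):
    """Column totals fixed by design: c constraints plus (r - 1) estimated P_i."""
    return r * c - (c + (r - 1))

def df_fixed_total(r, c):
    """Only N fixed: (c - 1) + (r - 1) estimated probabilities plus 1 for N."""
    return r * c - ((c - 1) + (r - 1) + 1)

# Compare both arguments with (r - 1)(c - 1) over a range of table sizes.
for r in range(2, 7):
    for c in range(2, 7):
        assert df_fixed_columns(r, c) == df_fixed_total(r, c) == (r - 1) * (c - 1)

print(df_fixed_columns(3, 2))  # a 3 x 2 table
```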
10.2.2 The test results in CogStat
The test is automatically done in CogStat using the Compare groups feature. The result set will contain information about the sample and its properties, as seen in Figure 10.7. Further scrolling down, you’ll see the effect size (which we will cover in a short while in Chapter 10.4). The last part of the result set is the hypothesis test itself (see Figure 10.8).
Let us go through the Hypothesis tests section line by line.
Hypothesis tests
Testing if the distributions are the same.
One grouping variable. Two groups. Nominal variable. >> Running chi-squared test.
Sensitivity power analysis. Minimal effect size to reach 95% power with the present sample size for the present hypothesis test. Minimal effect size in w: 0.29.
Result of the Pearson's chi-squared test: χ2(2, N = 180) = 10.72, p = .005
 “Testing if the distributions are the same.”: In plain English, this tells us that the null hypothesis being tested is that the distribution of answers (i.e. the choice probabilities) is the same in every group. In essence, it does not differ from the $H_0$ we described more eloquently in Table 10.2.
 “One grouping variable.”: This says we are looking at only one variable by which we have split our data: species.
 “Two groups.”: This tells us that we have two groups, robot and human.
 “Nominal variable.”: This tells us that the variable we are looking at is categorical.
 “Running chi-squared test.”: Based on all the above, CogStat has decided to run a $\chi^2$ test.
Let us ignore the details about the 95% confidence interval, minimal effect size w, and Cramér’s V (Figure 10.8) for now. We will come back to them in Chapter 10.4.
Result of the Pearson's chi-squared test:
$\chi^2(2, N = 180) = 10.72, p = 0.005$
The test statistic is 10.72 with 2 degrees of freedom and 180 observations, and the $p$-value is 0.005. Since this is well below the conventional $\alpha = .05$ level (and below $.01$, for that matter), we reject the null hypothesis.
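For the special case of 2 degrees of freedom, the upper tail of the $\chi^2$ distribution has a simple closed form, $P(\chi^2 > x) = e^{-x/2}$, so the reported $p$-value can be checked by hand. The sketch below uses only the Python standard library; the function name `chi2_sf_df2` is hypothetical, and the shortcut applies only when $df = 2$.

```python
import math

def chi2_sf_df2(x2):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-x2 / 2)

p_value = chi2_sf_df2(10.72)
print(round(p_value, 3))
```

For other degrees of freedom you would normally rely on statistical software rather than a hand formula.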
This output gives us enough information to write up the result:
Pearson’s $\chi^2$ revealed a significant association between species and choice ($\chi^2(2, N = 180) = 10.72, p < .01$): robots appeared to be more likely to say that they prefer flowers, but the humans were more likely to say they prefer data.
Notice that, once again, we provided a little bit of interpretation to help the human reader understand what’s going on with the data. This is a good habit to get into. It’s also a good idea to report the effect size, which we will cover in Chapter 10.4.
10.3 Yates correction for 1 degree of freedom
Time for a little bit of a digression. You need to make a tiny change to your calculations whenever you only have 1 degree of freedom. It’s called the continuity correction, or sometimes the Yates correction.
The $\chi^2$ test is based on an approximation, specifically on the assumption that the binomial distribution starts to look like a normal distribution for large $N$. One problem with this is that it often doesn’t quite work, especially when you’ve only got 1 degree of freedom (e.g. when you’re doing a test of independence on a $2 \times 2$ contingency table). The main reason for this is that the true sampling distribution for the $X^2$ statistic is actually discrete (because you’re dealing with categorical data!), but the $\chi^2$ distribution is continuous. This can introduce systematic problems. Specifically, when $N$ is small and when $df=1$, the goodness of fit statistic tends to be “too big”, meaning that you actually have a bigger $\alpha$ value than you think (or, equivalently, the $p$ values are a bit too small). Yates (1934) suggested a simple fix, in which you redefine the goodness of fit statistic as: $X^2 = \sum_{i} \frac{(|E_i - O_i| - 0.5)^2}{E_i}$ In other words, he shrinks the absolute size of every deviation by 0.5 before squaring it. The correction is essentially a hack. It’s not derived from any principled theory: rather, it’s based on an examination of the behaviour of the test and observing that the corrected version seems to work better.
CogStat (like many other software packages) applies this correction, so it’s useful to know what it is about. You won’t know when it happens because the CogStat output doesn’t explicitly say that it has used a “continuity correction” or “Yates’ correction”.^{46}
Let us overwrite all the puppy answers in our chapek9 data frame to look at 1 degree of freedom (Figure 10.9). Let’s use the chapek9two.csv data set for this.
The result as calculated by default with the Yates correction is: $\chi^2(1, N = 180) = 6.24, p = 0.013$
However, had we not applied the Yates correction, the results would have been $\chi^2(1, N = 180) = 7.02, p = 0.008$, which is a bit different. The difference is not huge, but it is there. The Yates correction is a good thing to know about, but it’s not something you need to worry about too much. It’s just a little bit of a hack that makes the test work better in this specific case.
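To see what the correction does numerically, here is a minimal Python sketch that computes the statistic with and without it. The $2 \times 2$ counts are an assumption: a hypothetical table chosen because it reproduces the two values quoted above, not counts read from chapek9two.csv. The function names are also hypothetical. For $df = 1$ the upper tail of the $\chi^2$ distribution can be written with the complementary error function, which keeps the example within the standard library.

```python
import math

def x2_2x2(observed, correction):
    """Pearson's X^2 for a table of counts, optionally with Yates' correction."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, o_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n
            deviation = abs(e_ij - o_ij)
            if correction:
                # Shrink each deviation by 0.5, but never below zero.
                deviation = max(deviation - 0.5, 0.0)
            x2 += deviation ** 2 / e_ij
    return x2

def chi2_sf_df1(x2):
    """Upper-tail probability of a chi-square variable with exactly 1 df."""
    return math.erfc(math.sqrt(x2 / 2))

# Hypothetical counts (assumed): rows = choice, columns = species.
observed = [[43, 28], [44, 65]]

x2_yates = x2_2x2(observed, correction=True)
x2_plain = x2_2x2(observed, correction=False)
p_yates = chi2_sf_df1(x2_yates)
p_plain = chi2_sf_df1(x2_plain)
print(round(x2_yates, 2), round(x2_plain, 2))
```

The corrected statistic is always the smaller of the two, so the corrected test is the more conservative one.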
10.4 Effect size (Cramér’s $V$)
As we discussed earlier in Chapter 9.7, it’s becoming commonplace to ask researchers to report some measure of effect size. So, suppose that you’ve run your chi-square test, which turns out to be significant. So you now know that there is some association between your variables (independence test) or some deviation from the specified probabilities (goodness of fit test). Now you want to report a measure of effect size. That is, given that there is an association/deviation, how strong is it?
There are several different measures you can choose to report and several different tools you can use to calculate them. By default, the two measures that people tend to report most frequently are the $\phi$ (pronounced: “phi”) statistic and the somewhat superior version, known as Cramér’s $V$. While CogStat gives you only Cramér’s $V$, we need to start with $\phi$ because they are related.
Mathematically, they’re both very simple. To calculate the $\phi$ statistic, you just divide your $X^2$ value by the sample size and take the square root: $\phi = \sqrt{\frac{X^2}{N}}$
The idea here is that the $\phi$ statistic is supposed to range between 0 (no association at all) and 1 (perfect association). However, it doesn’t always do this when your contingency table is bigger than $2 \times 2$ (like in our original chapek9 data set), which is a total pain. So, to correct this, people usually prefer to report the $V$ statistic proposed by Cramér (1946). It’s a pretty simple adjustment to $\phi$. If you’ve got a contingency table with $r$ rows and $c$ columns, then define $k = \min(r,c)$ to be the smaller of the two values. Cramér’s $V$ statistic (written $\phi_c$ in the CogStat output) is then $V = \phi_c = \sqrt{\frac{X^2}{N(k-1)}}$ And you’re done. This seems to be a reasonably popular measure, presumably because it’s easy to calculate, and it gives answers that aren’t completely silly: $V$ really does range from 0 (no association at all) to 1 (perfect association).
Calculating $V$ is automatic in CogStat, as you’ve seen in the result sets earlier in both the original chapek9 data set (Figure 10.8) and the modified one (Figure 10.9). Now let’s look at the original chapek9 effect size.
Standardized effect sizes
| | Value |
|---|---|
| Cramér's V measure of association | $\phi_c = 0.244$ |
A Cramér’s $V$ of 0.244 tells us that there is a moderate association between the two variables. The usual guidance is that anything below 0.2 is a weak association, 0.2 to 0.6 is a moderate association, and anything above 0.6 is a strong association. However, you must always look at the context of your data when interpreting the effect size.
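Both effect sizes are easy to compute by hand. The Python sketch below plugs in the values reported for the original chapek9 analysis ($X^2 = 10.72$, $N = 180$, a table with 3 choices and 2 species); the function names are hypothetical.

```python
import math

def phi_statistic(x2, n):
    """phi = sqrt(X^2 / N)."""
    return math.sqrt(x2 / n)

def cramers_v(x2, n, r, c):
    """Cramer's V = sqrt(X^2 / (N * (k - 1))), where k = min(r, c)."""
    k = min(r, c)
    return math.sqrt(x2 / (n * (k - 1)))

v = cramers_v(10.72, 180, r=3, c=2)
phi = phi_statistic(10.72, 180)
print(round(v, 3))
```

Because one dimension of this table has only two categories, $k - 1 = 1$ and $V$ coincides with $\phi$ here; the two differ only when both dimensions have more than two categories.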
10.5 Assumptions of the test(s)
All statistical tests make assumptions, and it’s usually a good idea to check that those assumptions are met. For the chi-square tests discussed so far in this chapter, the assumptions are:
 Expected frequencies are sufficiently large. Remember how in the previous section, we saw that the $\chi^2$ sampling distribution emerges because the binomial distribution is similar to a normal distribution? Well, as we discussed in Chapter 7, this is only true when the number of observations is sufficiently large. What that means in practice is that all of the expected frequencies need to be reasonably big. How big is reasonably big? Opinions differ, but the default assumption seems to be that you generally would like to see all your expected frequencies larger than about 5, though for larger tables, you would probably be okay if at least 80% of the expected frequencies are above 5 and none of them are below 1. However, these seem to have been proposed as rough guidelines, not hard and fast rules; and they seem somewhat conservative (Larntz, 1978).
 Data are independent of one another. One somewhat hidden assumption of the chi-square test is that you have to believe that the observations are genuinely independent. Suppose we are interested in the proportion of babies born at a particular hospital that are boys. We walk around the maternity wards and observe 20 girls and only 10 boys. Seems like a pretty convincing difference, right? But later on, it turns out that we’d actually walked into the same ward 10 times, and in fact, we’d only seen 2 girls and 1 boy. Not as convincing, is it? Our original 30 observations were massively non-independent, and were in fact equivalent to only 3 independent observations. Obviously, this is an extreme(ly silly) example, but it illustrates the fundamental issue. Non-independence “stuffs things up”. Sometimes it causes you to falsely reject the null, as the silly hospital example illustrates, but it can go the other way too. Let’s consider what would happen if we’d done the cards experiment slightly differently: instead of asking 200 people to try to imagine sampling one card at random, suppose we asked 50 people to select 4 cards. One possibility would be that everyone selects one heart, one club, one diamond and one spade (in keeping with the “representativeness heuristic”; Tversky & Kahneman 1974). This is highly non-random behaviour from people, but in this case, we would get an observed frequency of 50 for all four suits. For this example, the fact that the observations are non-independent (because the four cards that you pick will be related to each other) actually leads to the opposite effect, falsely retaining the null.
If you find yourself in a situation where independence is violated, it may be possible to use the McNemar test. Similarly, if your expected cell counts are too small, check out the Fisher exact test.
10.6 The Fisher exact test
What should you do if your cell counts are too small, but you’d still like to test the null hypothesis that the two variables are independent? One answer would be “collect more data”, but that’s far too glib: there are a lot of situations in which it would be either infeasible or unethical to do. If so, statisticians are morally obligated to provide scientists with better tests. In this instance, Fisher (1922a) kindly provided the right answer to the question. To illustrate the basic idea, let’s suppose we’re analysing data from a field experiment, looking at the emotional status of people accused of witchcraft. Some of them are currently being burned at the stake.^{47} Unfortunately for the scientist (but rather fortunately for the general populace), it’s quite hard to find people in the process of being set on fire, so the cell counts are microscopic. The salem.csv file illustrates the point.
Looking at this data, you’d be hard pressed not to suspect that people not on fire are more likely to be happy than people on fire. However, the chi-square test (even with the Yates correction for the $2 \times 2$ data) makes this very hard to test because of the small sample size.
We’d really like to be able to get a better answer than this, if only because we really don’t want to be on fire. This is where Fisher’s exact test (Fisher, 1922a) comes in very handy. The Fisher exact test works somewhat differently to the chi-square test (or in fact any of the other hypothesis tests in this book) insofar as it doesn’t have a test statistic; it calculates the $p$-value “directly”.
Let’s have some notation:
| | Happy | Sad | Total |
|---|---|---|---|
| Set on fire | $O_{11}$ | $O_{12}$ | $R_{1}$ |
| Not set on fire | $O_{21}$ | $O_{22}$ | $R_{2}$ |
| Total | $C_{1}$ | $C_{2}$ | $N$ |
In order to construct the test, Fisher treats both the row and column totals ($R_1$, $R_2$, $C_1$ and $C_2$) as known, fixed quantities, and then calculates the probability that we would have obtained the observed frequencies that we did ($O_{11}$, $O_{12}$, $O_{21}$ and $O_{22}$) given those totals. In the notation that we developed in Chapter 7 this is written: $P(O_{11}, O_{12}, O_{21}, O_{22} \ | \ R_1, R_2, C_1, C_2)$ As you might imagine, it’s a slightly tricky exercise to figure out what this probability is, but it turns out that it is described by a distribution known as the hypergeometric distribution. Now that we know this, what we have to do to calculate our $p$-value is calculate the probability of observing this particular table or a table that is “more extreme”.^{48} Back in the 1920s, computing this sum was daunting even in the simplest of situations, but these days it’s pretty easy as long as the tables aren’t too big and the sample size isn’t too large. The conceptually tricky issue is to figure out what it means to say that one contingency table is more “extreme” than another. The easiest solution is to say that the table with the lowest probability is the most extreme. For the salem data, this gives us a $p$-value of $0.03571$.
The implementation of the test in CogStat is not yet available. The main thing we’re interested in here is the $p$-value, which in this case is small enough ($p=.036$) to justify rejecting the null hypothesis that people on fire are just as happy as people not on fire.
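Since CogStat doesn’t run the test yet, here is a minimal Python sketch of the calculation described above, using only the standard library. The cell counts are an assumption: they are not listed in this section, and were chosen because they reproduce the $p$-value quoted above. The function names are hypothetical, and the “more extreme” rule is the simple one from the text (any table with the same margins that is no more probable than the observed one).

```python
from math import comb

def table_probability(k, r1, c1, n):
    """Hypergeometric probability that the top-left cell equals k,
    given the first row total r1, first column total c1 and sample size n."""
    return comb(c1, k) * comb(n - c1, r1 - k) / comb(n, r1)

def fisher_exact_p(o11, o12, o21, o22):
    """Two-sided Fisher exact p-value for a 2x2 table of counts."""
    r1, c1 = o11 + o12, o11 + o21
    n = o11 + o12 + o21 + o22
    p_observed = table_probability(o11, r1, c1, n)
    k_min, k_max = max(0, r1 + c1 - n), min(r1, c1)
    probabilities = [table_probability(k, r1, c1, n) for k in range(k_min, k_max + 1)]
    # Add up every table that is at most as probable as the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(p for p in probabilities if p <= p_observed * (1 + 1e-9))

# Assumed counts: set on fire (0 happy, 3 sad), not set on fire (10 happy, 3 sad).
p_value = fisher_exact_p(0, 3, 10, 3)
print(round(p_value, 5))
```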
10.7 The McNemar test
Suppose you’ve been hired to work for the Australian Generic Political Party (AGPP), and part of your job is to find out how effective the AGPP political advertisements are. So, what you do, is you put together a sample of $N=100$ people and ask them to watch the AGPP ads. Before they see anything, you ask them if they intend to vote for the AGPP; after showing the ads, you ask them again to see if anyone has changed their minds. One way to describe your data is via the following contingency table:
| | Before | After | Total |
|---|---|---|---|
| Yes | 30 | 10 | 40 |
| No | 70 | 90 | 160 |
| Total | 100 | 100 | 200 |
At first pass, you might think that this situation lends itself to the Pearson $\chi^2$ test of independence (as per Chapter 10.2). However, we’ve got a problem: we have 100 participants but 200 observations. This is because each person has given us an answer in both the before and after columns. What this means is that the 200 observations aren’t independent of each other: if voter A says “yes” the first time and voter B says “no”, then you’d expect that voter A is more likely to say “yes” the second time than voter B! The consequence of this is that the usual $\chi^2$ test won’t give trustworthy answers due to the violation of the independence assumption. Now, if this were a really uncommon situation, I wouldn’t be bothering to waste your time talking about it. But it’s not uncommon at all: this is a standard repeated measures design, and none of the tests we’ve considered so far can handle it.
The solution to the problem was published by McNemar (1947). The trick is to start by tabulating your data in a slightly different way:
| | Before: Yes | Before: No | Total |
|---|---|---|---|
| After: Yes | 5 | 5 | 10 |
| After: No | 25 | 65 | 90 |
| Total | 30 | 70 | 100 |
This is exactly the same data, but it’s been rewritten so that each of our 100 participants appears in only one cell. Because we’ve written our data this way, the independence assumption is now satisfied, and this is a contingency table that we can use to construct an $X^2$ goodness of fit statistic. However, as we’ll see, we need to do it in a slightly nonstandard way. To see what’s going on, it helps to label the entries in our table a little differently:
| | Before: Yes | Before: No | Total |
|---|---|---|---|
| After: Yes | $a$ | $b$ | $a+b$ |
| After: No | $c$ | $d$ | $c+d$ |
| Total | $a+c$ | $b+d$ | $n$ |
Next, let’s think about what our null hypothesis is: it’s that the “before” test and the “after” test have the same proportion of people saying, “Yes, I will vote for AGPP”. Because of the way we have rewritten the data, it means that we’re now testing the hypothesis that the row totals and column totals come from the same distribution. Thus, the null hypothesis in McNemar’s test is that we have “marginal homogeneity”. That is, the row totals and column totals have the same distribution: $P_a + P_b = P_a + P_c$, and similarly that $P_c + P_d = P_b + P_d$. Notice that this means that the null hypothesis actually simplifies to $P_b = P_c$.
In other words, as far as the McNemar test is concerned, it’s only the off-diagonal entries in this table (i.e. $b$ and $c$) that matter! After noticing this, the McNemar test of marginal homogeneity is no different to a usual $\chi^2$ test. After (automatically) applying a continuity correction, our test statistic becomes: $X^2 = \frac{(|b-c| - 1)^2}{b+c}$ or, to revert to the notation that we used earlier in this chapter: $X^2 = \frac{(|O_{12}-O_{21}| - 1)^2}{O_{12} + O_{21}}$ and this statistic has an (approximately) $\chi^2$ distribution with $df=1$. For the table above, $b = 5$ and $c = 25$, so $X^2 = (20-1)^2/30 \approx 12.03$. However, remember that, just like the other $\chi^2$ tests, it’s only an approximation, so you need to have reasonably large expected cell counts for it to work.
Now that you know what the McNemar test is all about, let’s actually run one. The agpp.csv file contains the raw data. It contains three variables: an id variable that labels each participant in the data set, a responseBefore variable that records the person’s answer when they were asked the question the first time, and a responseAfter variable that shows the answer that they gave when asked the same question a second time.
Let us think about what we want to do here. We have the same participants giving us two answers, and we want to test whether the proportion of “yes” answers is the same before and after. In other words, we want to compare repeated measures. In CogStat, we need to select Compare repeated measures variables and add the responseBefore and responseAfter variables to the Selected variables box to run an analysis (Figure 10.11). The results are shown in Figure 10.12.
Hypothesis tests
Testing if the distributions are the same.
Two variables. Nominal dichotomous variables. >> Running McNemar test.
Result of the McNemar test: χ2(1, N = 100) = 12.03, p < .001
And we’re done. We’ve just run a McNemar’s test automatically, since our data set was identified by CogStat as categorical data.
The results would tell us something like this:
The test was significant ($\chi^2(1) = 12.03, p<.001$), suggesting that people were not just as likely to vote AGPP after the ads as they were beforehand. In fact, the ads had a negative effect: people were less likely to vote AGPP after seeing the ads.
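The reported statistic can be reproduced by hand from the two off-diagonal counts ($b = 5$, $c = 25$). This Python sketch uses a continuity correction of 1, i.e. $(|b-c| - 1)^2/(b+c)$, which is the form that yields 12.03 from these counts; the function name is hypothetical, and the $p$-value uses the $df = 1$ shortcut via the complementary error function.

```python
import math

def mcnemar_x2(b, c, correction=1.0):
    """Continuity-corrected McNemar statistic from the off-diagonal counts."""
    return (abs(b - c) - correction) ** 2 / (b + c)

b, c = 5, 25  # After: Yes / Before: No, and After: No / Before: Yes
x2 = mcnemar_x2(b, c)

# Upper tail of a chi-square distribution with 1 degree of freedom.
p_value = math.erfc(math.sqrt(x2 / 2))
print(round(x2, 2), p_value < 0.001)
```

Notice that the 70 people on the diagonal (who gave the same answer both times) play no role in the calculation.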
10.8 What’s the difference between McNemar and independence?
Let’s go back to the beginning of the chapter and look at the cards data set again. If you recall, the experimental design described involved people making two choices. Because we have information about the first choice and the second choice that everyone made, we can construct the following contingency table that crosstabulates the first choice against the second choice.
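| | clubs | diamonds | hearts | spades | Total |
|---|---|---|---|---|---|
| clubs | 10 | 9 | 10 | 6 | 35 |
| diamonds | 20 | 4 | 13 | 14 | 51 |
| hearts | 20 | 18 | 3 | 23 | 64 |
| spades | 18 | 13 | 15 | 4 | 50 |
| Total | 68 | 44 | 41 | 47 | 200 |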
First, we wanted to know whether the choice you make the second time is dependent on the choice you made the first time (for this, we’ll run the Explore relation of variable pair analysis). This is where a test of independence is useful, and what we’re trying to do is see if there’s some relationship between the rows and columns of this table.
Second, we wanted to know if, on average, the frequencies of suit choices were different the second time than the first time. In that situation, we’re trying to see if the row totals in cardChoices (i.e. the frequencies for choice_1) are different from the column totals (i.e. the frequencies for choice_2). That’s when we’d use the McNemar test. However, when running the Compare repeated measures variables analysis, we get an error, as the function for non-dichotomous nominal data is not implemented yet.
Here’s the result if we run the Explore relation of variable pair analysis in CogStat:
Hypothesis tests
Testing if variables are independent.
Nominal variables. >> Running Cramér's V.
Sensitivity power analysis. Minimal effect size to reach 95% power with the present sample size for the present hypothesis test. Minimal effect size in w: 0.34.
Result of the Pearson's chi-squared test: χ2(9, N = 200) = 29.24, p < .001
For the second case, running the McNemar test, the answer would be McNemar’s chi-squared = $16.03$, df = $6$, $p$-value = $0.014$. This is a significant result, suggesting that the frequencies of suit choices were different the second time than the first time.
Notice that the results are different! These aren’t the same test.
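The McNemar figures quoted above (16.03 with 6 degrees of freedom) are consistent with what is usually called Bowker’s symmetry test, a generalisation of the McNemar test to square tables larger than $2 \times 2$; that identification is an inference here, since the text doesn’t name the variant. A minimal Python sketch using the card table above follows, with hypothetical function names. The $p$-value helper relies on a closed form that holds only for even degrees of freedom.

```python
import math

# The 4x4 table of card choices from above (without the totals).
table = [
    [10, 9, 10, 6],
    [20, 4, 13, 14],
    [20, 18, 3, 23],
    [18, 13, 15, 4],
]

def bowker_x2(table):
    """Sum of (n_ij - n_ji)^2 / (n_ij + n_ji) over all pairs i < j."""
    k = len(table)
    x2, df = 0.0, 0
    for i in range(k):
        for j in range(i + 1, k):
            pair_total = table[i][j] + table[j][i]
            if pair_total > 0:
                x2 += (table[i][j] - table[j][i]) ** 2 / pair_total
                df += 1
    return x2, df

def chi2_sf_even_df(x2, df):
    """Upper-tail chi-square probability; valid only when df is even."""
    half = x2 / 2
    return math.exp(-half) * sum(half ** m / math.factorial(m) for m in range(df // 2))

x2, df = bowker_x2(table)
p_value = chi2_sf_even_df(x2, df)
print(round(x2, 2), df, round(p_value, 3))
```

Only the mirrored pairs of off-diagonal cells enter the sum, which is why this answers a different question from the test of independence.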
10.9 Summary
The key ideas discussed in this chapter are:
 The chi-square goodness of fit test (Section 10.1) is used when you have a table of observed frequencies of different categories; the null hypothesis gives you a set of “known” probabilities to compare them to.
 The chi-square test of independence (Section 10.2) is used when you have a contingency table (crosstabulation) of two categorical variables. The null hypothesis is that there is no relationship/association between the variables.
 Effect size for a contingency table can be measured in several ways (Section 10.4). In particular, we noted the Cramér’s $V$ statistic.
 Both versions of the Pearson test rely on two assumptions: that the expected frequencies are sufficiently large and that the observations are independent (Section 10.5). The Fisher exact test (Section 10.6) can be used when the expected frequencies are small. The McNemar test (Section 10.7) can be used for some kinds of violations of independence.
If you’re interested in learning more about categorical data analysis, an excellent first choice would be Agresti (1996), which, as the title suggests, provides an Introduction to Categorical Data Analysis. If the introductory book isn’t enough for you (or you can’t solve the problem you’re working on), you could consider Agresti (2002), Categorical Data Analysis. The latter is a more advanced text, so it’s probably not wise to jump straight from this book to that one.
References
This, again, is an oversimplification. It works nicely for quite a few situations, but every now and then, we’ll come across degrees of freedom values that aren’t whole numbers. Don’t let this worry you too much – when you come across this, just remind yourself that “degrees of freedom” is actually a bit of a messy concept. For an introductory class, it’s usually best to stick to the simple story.↩︎
In practice, the sample size isn’t always fixed… e.g. we might run the experiment over a fixed period of time, and the number of people participating depends on how many people show up. That doesn’t matter for the current purposes.↩︎
To some people, this advice might sound odd or at least in conflict with the “usual” advice on how to write a technical report. Students are typically told that the “results” section of a report is for describing the data and reporting statistical analysis, and the “discussion” section provides interpretation. That’s true as far as it goes, but people often interpret it way too literally. Provide a quick and simple interpretation of the data in the results section so that the reader understands what the data are telling us. Then, in the discussion, try to tell a bigger story about how your results fit the rest of the scientific literature. In short, don’t let the “interpretation goes in the discussion” advice turn your results section into incomprehensible garbage. Being understood by your reader is much more important.↩︎
Complicating matters, the $G$test is a special case of a whole class of tests that are known as likelihood ratio tests.↩︎
Technically, $E_{ij}$ here is an estimate, so we should probably write it $\hat{E}_{ij}$.↩︎
A problem many of us worry about in real life.↩︎
Technically, CogStat uses the chi2_contingency function from scipy without specifying the correction parameter, which defaults to True.↩︎
This example is based on a joke article published in the Journal of Irreproducible Results.↩︎
Not surprisingly, the Fisher exact test is motivated by Fisher’s interpretation of a $p$-value, not Neyman’s!↩︎