Chapter 4 A brief introduction to research design

In this chapter, we’ll start thinking about the basic ideas for designing a study, collecting data, checking whether your data collection works, and so on. It won’t give you enough information to design studies of your own, but it will provide you with a lot of the essential tools you need to assess the studies done by other people. Since this book focuses more on data analysis than data collection, this only gives a very brief overview. This chapter relies heavily on Campbell & Stanley (1963) for discussing study design and Stevens (1946) for discussing scales of measurement.

4.1 Introduction to psychological measurement

First, data collection can be thought of as a kind of measurement. That is, what we’re trying to do here is measure something about human behaviour or the human mind. What do we mean by “measurement”?

4.1.1 Some thoughts about psychological measurement

Measurement itself is a subtle concept, but it comes down to finding some way of assigning numbers, labels, or other well-defined descriptions to “stuff”. So, any of the following would count as a psychological measurement:

  • Danielle’s age is 33 years.
  • She does not like anchovies.
  • Her chromosomal gender is male.
  • Her self-identified gender is female.

In the short list above, the bolded part is “the thing to be measured”, and the italicised part is “the measurement itself”. We can expand on this a little bit by thinking about the set of possible measurements that could have arisen in each case:

  • Age (in years) could have been 0, 1, 2, 3 …, etc. The upper bound on what the age could be is a bit fuzzy, but in practice, you’d be safe in saying that the largest possible age is 150 since no human has ever lived that long.
  • When asked if someone likes anchovies, they might say I do, I do not, I have no opinion, or I sometimes do.
  • Chromosomal gender is almost certainly going to be male (XY) or female (XX), but there are a few other possibilities. With Klinefelter’s syndrome (XXY), it is more similar to male than to female. And there are other possibilities, too.
  • Self-identified gender is also very likely to be male or female, transgender, nonbinary, queer etc.

As you can see, for some things (like age), it seems pretty apparent what the set of possible measurements should be, whereas, for other things, it gets a bit tricky. But even regarding someone’s age, it’s much more subtle than this. For instance, if you’re a developmental psychologist, measuring in years is way too crude, and so you often measure age in years and months (if a child is 2 years and 11 months, this is usually written as “2;11”). If you’re interested in newborns, you might want to measure age in days since birth, maybe even hours since birth.

Looking at this a bit more closely, you might also realise that the concept of “age” isn’t all that precise. Generally, when we say “age”, we implicitly mean “the length of time since birth”. But that’s not always the right way to do it. Suppose you’re interested in how newborn babies control their eye movements. If you’re interested in kids that young, you might also start to worry that “birth” is not the only significant point in time to care about. If Baby Alice is born 3 weeks premature and Baby Bianca is born 1 week late, would it really make sense to say that they are the “same age” if we encountered them “2 hours after birth”? In a sense, yes. By social convention, we use birth as our reference point for talking about age in everyday life since it defines the amount of time the person has been operating as an independent entity in the world. But from a scientific perspective, that’s not the only thing we care about. When we think about the biology of human beings, it’s often helpful to think of ourselves as organisms that have been growing and maturing since conception. From that perspective, Alice and Bianca aren’t the same age at all. So you might want to define the concept of “age” in two different ways: the length of time since conception and the length of time since birth. It won’t make much difference when dealing with adults, but when dealing with newborns, it might.

In other words, how you specify the allowable measurement values is important.

Still, there’s the question of methodology. What specific “measurement method” will you use to find out someone’s age? As before, there are lots of different possibilities:

  • You could just ask people, “how old are you?” The method of self-report is fast, cheap and easy, but it only works with people old enough to understand the question, and some people lie about their age.
  • You could ask an authority (e.g. a parent), “how old is your child?” This method is fast, and it’s not all that hard when dealing with kids since the parent is almost always around. It doesn’t work as well if you want to know “age since conception” since a lot of parents can’t say for sure when conception took place. You might need a different authority (e.g. an obstetrician).
  • You could look up official records, like birth certificates. This is time-consuming and annoying, but it has its uses (e.g. if the person is now dead).

4.1.2 Operationalisation: defining your measurement

All of the ideas discussed in the previous section relate to the concept of operationalisation. To be a bit more precise about the idea, operationalisation is the process by which we take a meaningful but somewhat vague concept and turn it into an accurate measurement. The method of operationalisation can involve several different things:

  • Being precise about what you are trying to measure. For instance, does “age” mean “time since birth” or “time since conception” in the context of your research?
  • Determining what method you will use to measure it. Will you use self-report to measure age, ask a parent, or look up an official record? If you’re using self-report, how will you phrase the question?
  • Defining the set of the allowable values that the measurement can take. Note that these values don’t always have to be numerical, though they often are. When measuring age, the values are numerical, but we still need to think carefully about what numbers are allowed. Do we want age in years, years and months, days, or hours? The values aren’t numerical for other types of measurements (e.g. gender). But, as before, we need to consider what values are allowed. If we’re asking people to self-report their gender, what options do we allow them to choose from? Is it enough to allow only “male” or “female”? Do you need an “other” option? Or should we not give people any specific options and let them answer in their own words? And if you open up the set of possible values to include all verbal responses, how will you interpret their answers?

Operationalisation is tricky, and there’s no “one, true way” to do it. How you operationalise the informal concept of “age” or “gender” into a formal measurement depends on what you need to use the measurement for. You’ll often find that the community of scientists who work in your area have some well-established ideas for how to go about it. In other words, operationalisation needs to be thought through case-by-case. Nevertheless, while there are a lot of issues that are specific to each individual research project, there are some aspects to it that are pretty general.

Before moving on, let’s take a moment to clear up our terminology and, in the process, introduce one more term. Here are four different things that are closely related to each other:

  • A theoretical construct. This is the thing that you’re trying to measure, like “age”, “gender”, or an “opinion”. A theoretical construct can’t be directly observed, and often they’re a bit vague.
  • A measure. The measure refers to the method or tool used to make your observations. A question in a survey, a behavioural observation or a brain scan could all count as a measure.
  • An operationalisation. The term “operationalisation” refers to the logical connection between the measure and the theoretical construct or the process by which we try to derive a measure from a theoretical construct.
  • A variable. Finally, a new term. A variable is what we end up with when we apply our measure to something in the world. That is, variables are the actual “data” we end up with in our data sets.

In practice, even scientists tend to blur the distinction between these things, but it’s very helpful to understand the differences.

4.2 Scales of measurement

As the previous section indicates, the outcome of a psychological measurement is called a variable. But not all variables are of the same qualitative type, and it’s handy to understand what types there are. A very useful concept for distinguishing between different types of variables is what’s known as scales of measurement.

4.2.1 Nominal scale

A nominal scale variable (also referred to as a categorical variable) is one in which there is no particular relationship between the different possibilities: for these kinds of variables, it doesn’t make any sense to say that one of them is “bigger’ or”better” than any other one, and it doesn’t make any sense to average them. The classic example for this is “eye colour”. Eyes can be blue, green and brown, among other possibilities, but none of them is any “better” than any other one. As a result, it would feel bizarre to talk about an “average eye colour”. Similarly, gender is nominal too: male isn’t better or worse than female, neither does it make sense to try to talk about an “average gender”. In short, nominal scale variables are those for which the only thing you can say about the different possibilities is that they are different. That’s it.

Let’s take a slightly closer look at this. Suppose we were researching how people commute to and from work. One variable we would have to measure would be what kind of transportation people use to get to work. This “transport type” variable could have quite a few possible values, including: “train”, “bus”, “car”, “bicycle”, etc. For now, let’s suppose that these four are the only possibilities, and suppose that when we ask 100 people how they got to work today, we get this:

Transportation Number of people
(1) Train 12
(2) Bus 30
(3) Car 48
(4) Bicycle 10

So, what’s the average transportation type? Obviously, the answer here is that there isn’t one. It’s a silly question to ask. You can say that travel by car is the most popular method, and travel by train is the least popular method, but that’s about all. Similarly, notice that the order in which the options are listed isn’t very exciting. We could have chosen to display the data like this:

Transportation Number of people
(3) Car 48
(1) Train 12
(4) Bicycle 10
(2) Bus 30

– and nothing really changes.

4.2.2 Ordinal scale

Ordinal scale variables have a bit more structure than nominal scale variables, but not by a lot. An ordinal scale variable is one in which there is a natural, meaningful way to order the different possibilities, but you can’t do anything else. The usual example of an ordinal variable is “finishing position in a race”. You can say that the person who finished first was faster than the person who finished second, but you don’t know how much faster. Consequently, we know that 1st > 2nd, and 2nd > 3rd, but the difference between 1st and 2nd might be much larger than the difference between 2nd and 3rd.

Here’s a more psychologically exciting example. Suppose we’re interested in people’s attitudes to climate change, and we ask them to pick one of these four statements that most closely matches their beliefs:

  1. Temperatures are rising because of human activity
  2. Temperatures are rising, but we don’t know why
  3. Temperatures are rising, but not because of humans
  4. Temperatures are not rising

Notice that these four statements actually do have a natural ordering in terms of “the extent to which they agree with the current science”. Statement 1 is a close match, statement 2 is a suitable match, statement 3 isn’t a perfect match, and statement 4 strongly opposes science. So, in terms of the thing we’re interested in (the extent to which people endorse the science), we can order the items as 1 > 2 > 3 > 4. Since this ordering exists, it would be peculiar to list the options like this:

  1. Temperatures are rising, but not because of humans
  2. Temperatures are rising because of human activity
  3. Temperatures are not rising
  4. Temperatures are rising, but we don’t know why – because it seems to violate the natural “structure” of the question.

So, let’s suppose I asked 100 people these questions and got the following answers:

Response Number
(1) Temperatures are rising because of human activity 51
(2) Temperatures are rising, but we don’t know why 20
(3) Temperatures are rising, but not because of humans 10
(4) Temperatures are not rising 19

When analysing these data, it seems quite reasonable to try to group (1), (2) and (3) together and say that 81 of 100 people were willing to at least partially endorse the science. And it’s also quite reasonable to group (2), (3) and (4) together and say that 49 of 100 people registered at least some disagreement with the dominant scientific view. However, it would be entirely bizarre to try to group (1), (2) and (4) together and say that 90 of 100 people said what? There’s nothing sensible that allows you to group those responses together at all.

That said, notice that while we can use the natural ordering of these items to construct sensible groupings, what we can’t do is average them. For instance, in our simple example here, the “average” response to the question is 1.97. We would love to know if someone can tell us what that means.

4.2.3 Interval scale

In contrast to nominal and ordinal scale variables, interval scale and ratio scale variables are variables for which the numerical value is genuinely meaningful. In the case of interval scale variables, the differences between the numbers are interpretable, but the variable doesn’t have a “natural” zero value. A good example of an interval scale variable is measuring temperature in degrees Celsius. For instance, if it was 15\(^\circ\) yesterday and 18\(^\circ\) today, then the 3\(^\circ\) difference between the two is genuinely meaningful. Moreover, that 3\(^\circ\) difference is exactly the same as the 3\(^\circ\) difference between 7\(^\circ\) and 10\(^\circ\). In short, addition and subtraction are meaningful for interval scale variables.5

However, notice that the 0\(^\circ\) does not mean “no temperature at all”: it means “the temperature at which water freezes”, which is pretty arbitrary. As a consequence, it becomes pointless to try to multiply and divide temperatures. It is wrong to say that \(20^\circ\) is twice as hot as 10\(^\circ\), just as it is weird and meaningless to claim that 20\(^\circ\) is negative two times as hot as -10\(^\circ\).

Again, let us look at a more psychological example. Suppose we’re interested in looking at how the attitudes of first-year university students have changed over time. We will want to record the year in which each student started. This is an interval scale variable. A student who started in 2003 did arrive 5 years before a student who started in 2008. However, it would be completely insane for me to divide 2008 by 2003 and say that the second student started “1.0024 times later” than the first one. That doesn’t make any sense at all.

4.2.4 Ratio scale

The fourth and final type of variable to consider is a ratio scale variable, in which zero really means zero, and it’s okay to multiply and divide. A good psychological example of a ratio scale variable is response time (RT). In many tasks, it’s very common to record the amount of time somebody takes to solve a problem or answer a question because it’s an indicator of how difficult the task is. Suppose that Alan takes 2.3 seconds to respond to a question, whereas Ben takes 3.1 seconds. As with an interval scale variable, addition and subtraction are both meaningful here. Ben really did take 3.1 - 2.3 = 0.8 seconds longer than Alan did. However, notice that multiplication and division also make sense here: Ben took 3.1 / 2.3 = 1.35 times as long as Alan did to answer the question. And you can do this because, for a ratio scale variable such as RT, “zero seconds” really means “no time at all”.

4.2.5 Continuous versus discrete variables

There’s a second kind of distinction that you need to be aware of regarding what types of variables you can run into. This is the distinction between continuous variables and discrete variables. The difference between these is as follows:

  • A continuous variable is one in which, for any two values you can think of, it’s always logically possible to have another value in between.
  • A discrete variable is, in effect, a variable that isn’t continuous. For a discrete variable, it’s sometimes the case that there’s nothing in the middle.

These definitions probably seem a bit abstract, but they’re pretty simple once you see some examples. For instance, response time is continuous. If Alan takes 3.1 seconds and Ben takes 2.3 seconds to respond to a question, then it’s possible for Cameron’s response time to lie in-between by taking 3.0 seconds. And, of course, it would also be possible for David to take 3.031 seconds to respond, meaning that his RT would lie in between Cameron’s and Alan’s. And while in practice, it might be impossible to measure RT that precisely, it’s certainly possible in principle. Because we can always find a new value for RT in between any two other ones, we say that RT is continuous.

Discrete variables occur when this rule is violated. For example, nominal scale variables are always discrete: there isn’t a type of transportation that falls “in-between” trains and bicycles, not in the strict mathematical way that 2.3 falls in between 2 and 3. So transportation type is discrete. Similarly, ordinal scale variables are always discrete: although “2nd place” does fall between “1st place” and “3rd place”, there’s nothing that can logically fall in between “1st place” and “2nd place”. Interval scale and ratio scale variables can go either way. As we saw above, response time (a ratio scale variable) is continuous. Temperature in degrees Celsius (an interval scale variable) is also continuous. However, the year you went to school (an interval scale variable) is discrete. There’s no year in between 2002 and 2003. The number of questions you get right on a true-or-false test (a ratio scale variable) is also discrete: since a true-or-false question doesn’t allow you to be “partially correct”, there’s nothing in between 5/10 and 6/10. Table 4.1 summarises the relationship between the measurement scales and the discrete/continuity distinction. Cells with a tick mark correspond to things that are possible. Some textbooks get this wrong, and people often say things like “discrete variable” when they mean “nominal scale variable”. It’s very unfortunate.

Table 4.1: The relationship between the measurement scales and the discrete/continuity distinction. Cells with a tick mark correspond to things that are possible.
continuous discrete
nominal \(\checkmark\)
ordinal \(\checkmark\)
interval \(\checkmark\) \(\checkmark\)
ratio \(\checkmark\) \(\checkmark\)

4.2.6 Some complexities: the Likert scale

Okay, I know you’ll be shocked to hear this, but the real world is much messier than this little classification scheme suggests. Very few variables in real life fall into these neat categories, so you need to be careful not to treat the scales of measurement as if they were hard and fast rules. It doesn’t work like that: they’re guidelines intended to help you think about the situations in which you should treat different variables differently. Nothing more.

So let’s take a classic example, maybe the classic example, of a psychological measurement tool: the Likert scale. The humble Likert scale is all survey designs’ bread and butter tool. You have filled out hundreds, maybe thousands of them, and odds are you’ve even used one yourself. Suppose we have a survey question that looks like this:

Which of the following best describes your opinion of the statement that “all pirates are freaking awesome” …

and then, the options presented to the participant are these:

  1. Strongly disagree
  2. Disagree
  3. Neither agree nor disagree
  4. Agree
  5. Strongly agree

This set of items is an example of a 5-point Likert scale: people are asked to choose among one of several (in this case, 5) clearly ordered possibilities, generally with a verbal descriptor given in each case. However, it’s not necessary that all items be explicitly described. This is a perfect example of a 5-point Likert scale too:

  1. Strongly disagree
  2. Strongly agree

Likert scales are convenient, if somewhat limited, tools. The question is, what kind of variable are they? They’re obviously discrete since you can’t give a response of 2.5. They’re obviously not nominal scale since the items are ordered, and they’re not ratio scale either since there’s no natural zero.

But are they ordinal scale or interval scale? One argument says that we can’t prove that the difference between “strongly agree” and “agree” is of the same size as the difference between “agree” and “neither agree nor disagree”. In fact, in everyday life, it’s pretty apparent they’re not the same. So this suggests that we ought to treat Likert scales as ordinal variables. On the other hand, in practice, most participants do seem to take the whole “on a scale from 1 to 5” part fairly seriously, and they tend to act as if the differences between the five response options were fairly similar to one another. As a consequence, a lot of researchers treat Likert scale data as if it were interval scale. It’s not interval scale, but in practice, it’s close enough that we usually think of it as being quasi-interval scale.

4.3 Assessing the reliability of a measurement

At this point, we’ve thought a little bit about how to operationalise a theoretical construct and thereby create a psychological measure. We’ve seen that by applying psychological measures we end up with variables, which can come in many different types. At this point, we should start discussing the obvious question: is the measurement any good? We’ll do this in terms of two related ideas: reliability and validity. Put simply, the reliability of a measure tells you how precisely you are measuring something, whereas the validity of a measure tells you how accurate the measure is.

Reliability is actually a very simple concept: it refers to the repeatability or consistency of your measurement. The measurement of our weight by means of a “bathroom scale” is very reliable: if we step on and off the scales repeatedly, it’ll keep giving us the same answer. Measuring our intelligence by means of “asking our mum” is very unreliable. Notice that this concept of reliability is different to the question of whether the measurements are correct (the correctness of a measurement relates to its validity). If we’re holding a sack of potatoes when we step on and off bathroom scales, the measurement will still be reliable: it will always give us the same answer. However, this highly reliable answer doesn’t match our true weight at all; therefore, it’s wrong. In technical terms, this is a reliable but invalid measurement. Similarly, while our mum’s estimate of our intelligence is a bit unreliable, she might be right. So that would be an unreliable but valid measure. To some extent, a very unreliable measure tends to end up being invalid for practical purposes, so much so that many would say that reliability is necessary (but not sufficient) to ensure validity.

Okay, now that we’re clear on the distinction between reliability and validity, let’s think about the different ways in which we might measure reliability:

  • Test-retest reliability. This relates to consistency over time: if we repeat the measurement at a later date, do we get the same answer?
  • Inter-rater reliability. This relates to consistency across people: if someone else repeats the measurement (e.g. someone else rates our intelligence), will they produce the same answer?
  • Parallel forms reliability. This relates to consistency across theoretically-equivalent measurements: if we use a different set of bathroom scales to measure our weight, does it give the same answer?
  • Internal consistency reliability. Suppose a measurement is constructed from many different parts that perform similar functions (e.g. a personality questionnaire result is added up across several questions). Do the individual parts tend to give similar answers?

Not all measurements need to possess all forms of reliability. For instance, educational assessment can be thought of as a form of measurement. One of the subjects that Danielle teaches, Computational Cognitive Science, has an assessment structure that has a research component and an exam component (plus other things). The exam component is intended to measure something different from the research component, so the assessment as a whole has low internal consistency. However, within the exam, several questions are intended to (approximately) measure the same things, and those tend to produce similar outcomes, so the exam on its own has a relatively high internal consistency, which is as it should be. You should only demand reliability when you want to measure the same thing!

4.4 The “role” of variables: predictors and outcomes

We’ve got one last piece of terminology that needs to be explained before moving away from variables. Usually, when we do some research, we end up with lots of different variables. Then, when we analyse our data, we typically try to explain some of the variables in terms of the other variables. It’s essential to keep the two roles, “thing doing the explaining” and “thing being explained”, distinct. So let’s be clear about this now. Firstly, we might as well get used to the idea of using mathematical symbols to describe variables since it’s going to happen repeatedly. Let’s denote the “to be explained” variable \(Y\), and the variables “doing the explaining” as \(X_1\), \(X_2\), etc.

Now, when we are doing analysis, we have different names for \(X\) and \(Y\), since they play different roles. The classical names for these roles are independent variable (IV) and dependent variable (DV). The IV is the variable you use to explain (i.e., \(X\)) and the DV is the variable being explained (i.e., \(Y\)). The logic behind these names goes like this: if there is a relationship between \(X\) and \(Y\), then we can say that \(Y\) depends on \(X\), and if we have designed our study “properly”, then \(X\) isn’t dependent on anything else. However, those names are horrible: they’re hard to remember, and they’re highly misleading because (a) the IV is never actually “independent of everything else” and (b) if there’s no relationship, then the DV doesn’t actually depend on the IV. And, because we’re not the only people who think that IV and DV are just awful names, there are several alternatives that some find more appealing. The terms used in these notes are predictors and outcomes. The idea here is that you’re trying to use \(X\) (the predictors) to make guesses about \(Y\) (the outcomes). This is summarised in Table 4.2.

Table 4.2: The terminology used to distinguish between different roles that a variable can play when analysing a data set. Note that this book will tend to avoid the classical terminology in favour of the newer names.
role of the variable classical name modern name
to be explained dependent variable (DV) outcome
to do the explaining independent variable (IV) predictor

4.5 Experimental and non-experimental research

One of the big distinctions that you should be aware of is the distinction between “experimental research” and “non-experimental research”. When we make this distinction, what we’re really talking about is the degree of control that the researcher exercises over the people and events in the study.

4.5.1 Experimental research

The key feature of experimental research is that the researcher controls all aspects of the study, especially what participants experience during the study. In particular, the researcher manipulates or varies the predictor variables (IVs) and then allows the outcome variable (DV) to vary naturally. The idea here is to deliberately vary the predictors (IVs) to see if they have any causal effects on the outcomes. Moreover, in order to ensure that there’s no chance that something other than the predictor variables is causing the outcomes, everything else is kept constant or is in some other way “balanced” to ensure that they have no effect on the results. In practice, it’s almost impossible to think of everything else that might have an influence on the outcome of an experiment, much less keep it constant. The standard solution to this is randomisation: that is, we randomly assign people to different groups and then give each group a different treatment (i.e., assign them different values of the predictor variables). We’ll talk more about randomisation later in this course. Still, for now, it’s enough to say that what randomisation does is minimise (but not eliminate) the chances that there is any systematic difference between groups.

Let’s consider a very simple, completely unrealistic and grossly unethical example. Suppose you wanted to find out if smoking causes lung cancer. One way to do this would be to find people who smoke and people who don’t smoke and look to see if smokers have a higher rate of lung cancer. This is not a proper experiment since the researcher doesn’t have a lot of control over who is and isn’t a smoker. And this really matters: for instance, it might be that people who choose to smoke cigarettes also tend to have poor diets, or maybe they tend to work in asbestos mines, or whatever. The point here is that the groups (smokers and non-smokers) actually differ on lots of things, not just smoking. So it might be that the higher incidence of lung cancer among smokers is caused by something else, not by smoking per se. In technical terms, these other things (e.g. diet) are called “confounds”, and we’ll talk about those in just a moment.

In the meantime, let’s now consider what a good experiment might look like. Recall that our concern was that smokers and non-smokers might differ in lots of ways. The solution, as long as you have no ethics, is to control who smokes and who doesn’t. Specifically, suppose we randomly divide participants into two groups and force half of them to become smokers. In that case, it’s doubtful that the groups will differ in any respect other than half of them smoke. That way, if our smoking group gets cancer at a higher rate than the non-smoking group, we can feel pretty confident that (a) smoking does cause cancer and (b) we’re murderers.

4.5.2 Non-experimental research

Non-experimental research is a broad term that covers “any study in which the researcher doesn’t have quite as much control as they do in an experiment”. Control is something that scientists like to have, but as the previous example illustrates, there are many situations in which you can’t or shouldn’t try to obtain that control. Since it’s grossly unethical (and almost undoubtedly criminal) to force people to smoke to find out if they get cancer, this is an excellent example of a situation where you shouldn’t try to obtain experimental control. But there are other reasons too. Even leaving aside the ethical issues, our “smoking experiment” does have a few other issues. For instance, when we suggested that we “force” half of the people to become smokers, we must have been talking about starting with a sample of non-smokers and then forcing them to become smokers. While this sounds like the kind of solid, evil experimental design that a mad scientist would love, it might not be a very sound way of investigating the effect in the real world. For instance, suppose that smoking only causes lung cancer when people have poor diets and also suppose that people who usually smoke do tend to have poor diets. However, since the “smokers” in our experiment aren’t “natural” smokers (i.e. we forced non-smokers to become smokers; they didn’t take on all of the other normal, real-life characteristics that smokers might tend to possess), they probably have better diets. As such, in this silly example, they wouldn’t get lung cancer, and our experiment will fail because it violates the structure of the “natural” world (the technical name for this is an “artefactual” result; see later).

One distinction worth making between two types of non-experimental research is the difference between quasi-experimental research and case studies. The example from earlier – in which we wanted to examine the incidence of lung cancer among smokers and non-smokers without trying to control who smokes and who doesn’t – is a quasi-experimental design. That is, it’s the same as an experiment, but we don’t control the predictors (IVs). We can still use statistics to analyse the results – it’s just that we have to be a lot more careful.

The alternative approach, case studies, aims to provide a very detailed description of one or a few instances. In general, you can’t use statistics to analyse the results of case studies, and it’s usually very hard to draw any general conclusions about “people in general” from a few isolated examples. However, case studies are very useful in some situations. Firstly, there are situations where you don’t have any alternative: neuropsychology has this issue a lot. Sometimes, you just can’t find a lot of people with brain damage in a specific area, so the only thing you can do is describe those cases that you do have in as much detail and with as much care as you can. However, there are also some genuine advantages to case studies: because you don’t have as many people to study, you have the ability to invest lots of time and effort trying to understand the specific factors at play in each case. This is a very valuable thing to do. As a consequence, case studies can complement the more statistically-oriented approaches that you see in experimental and quasi-experimental designs. We won’t talk much about case studies in these lectures, but they are nevertheless very valuable tools!

4.6 Assessing the validity of a study

More than any other thing, a scientist wants their research to be “valid”. The conceptual idea behind validity is very simple: can you trust the results of your study? If not, the study is invalid. However, while it’s easy to state, in practice, it’s much harder to check validity than to check reliability. And in all honesty, there’s no precise, clearly agreed-upon notion of what validity actually is. In fact, there are lots of different kinds of validity, each of which raises its own issues, and not all forms of validity are relevant to all studies. Let’s talk about five different types:

  • Internal validity
  • External validity
  • Construct validity
  • Face validity
  • Ecological validity

To give you a quick guide as to what matters here:

  1. Internal and external validity are the most important since they tie directly to the fundamental question of whether your study really works.

  2. Construct validity asks whether you’re actually measuring what you think you are.

  3. Face validity isn’t terribly important except insofar as you care about “appearances”.

  4. Ecological validity is a special case of face validity that corresponds to a kind of appearance that you might care about a lot.

4.6.1 Internal validity

Internal validity refers to the extent to which you are able to draw the correct conclusions about the causal relationships between variables. It’s called “internal” because it refers to the relationships between things “inside” the study. Let’s illustrate the concept with a simple example. Suppose you’re interested in finding out whether a university education makes you write better. To do so, you get a group of first-year students, ask them to write a 1000-word essay, and count the number of spelling and grammatical errors they make. Then you find some third-year students, who obviously have had more university education than the first-years, and repeat the exercise. And let’s suppose it turns out that the third-year students produce fewer errors. And so you conclude that a university education improves writing skills. Right? Except – the big problem that you have with this experiment is that the third-year students are older, and they’ve had more experience with writing things. So it’s hard to know for sure what the causal relationship is: Do older people write better? Or people who have had more writing experience? Or people who have had more education? Which of the above is the true cause of the superior performance of the third-years? Age? Experience? Education? You can’t tell. This is an example of a failure of internal validity because your study doesn’t properly tease apart the causal relationships between the different variables.

4.6.2 External validity

External validity relates to the generalisability of your findings. That is, to what extent do you expect to see the same pattern of results in “real life” as you saw in your study? To put it a bit more precisely, any study that you do in psychology will involve a fairly specific set of questions or tasks, will occur in a specific environment, and will involve participants that are drawn from a particular subgroup. So, if it turns out that the results don’t actually generalise to people and situations beyond the ones that you studied, then what you’ve got is a lack of external validity.

The classic example of this issue is the fact that a very large proportion of studies in psychology will use undergraduate psychology students as participants. However, the researchers don’t care only about psychology students; they care about people in general. Given that, a study that uses only psych students as participants always risks lacking external validity. That is, if there’s something “special” about psychology students that makes them different to the general populace in some relevant respect, then we may start worrying about a lack of external validity.

That said, it is absolutely critical to realise that a study that uses only psychology students does not necessarily have a problem with external validity. The choice of population threatens the external validity: if (a) the population from which you sample your participants is very narrow (e.g. psych students), and (b) the narrow population that you sampled from is systematically different from the general population, in some respect that is relevant to the psychological phenomenon that you intend to study. The italicised part is the bit that many people forget: psychology undergraduates indeed differ from the general population in many ways, so a study that uses only psych students may have problems with external validity. However, if those differences aren’t very relevant to the phenomenon you’re studying, there’s nothing to worry about. To make this a bit more concrete, here are two extreme examples:

  • You want to measure “attitudes of the general public towards psychotherapy”, but all of your participants are psychology students. This study would almost certainly have a problem with external validity.
  • You want to measure the effectiveness of a visual illusion, and your participants are all psychology students. This study is very unlikely to have a problem with external validity

Having just spent the last couple of paragraphs focusing on the choice of participants (since that’s the big issue that everyone tends to worry most about), it’s worth remembering that external validity is a broader concept. The following are also examples of things that might pose a threat to external validity, depending on what kind of study you’re doing:

  • People might answer a “psychology questionnaire” in a manner that doesn’t reflect what they would do in real life.
  • Your lab experiment on (say) “human learning” has a different structure to the learning problems people face in real life.

4.6.3 Construct validity

Construct validity is a question of whether you’re measuring what you want to be measuring. A measurement has good construct validity if it is actually measuring the correct theoretical construct and inadequate construct validity if it doesn’t. To give a very simple (if ridiculous) example, suppose we’re trying to investigate the rates with which university students cheat on their exams. And the way we attempt to measure it is by asking the cheating students to stand up in the lecture theatre so that we can count them. When we do this with a class of 300 students, 0 people claim to be cheaters. So we, therefore, conclude that the proportion of cheaters is 0%. Clearly, this is a bit ridiculous. But the point here is not that this is a very deep methodological example, but rather to explain construct validity. The problem with the measure is that while we’re trying to measure “the proportion of people who cheat”, we’re actually measuring “the proportion of people stupid enough to own up to cheating, or bloody-minded enough to pretend that they do”. Obviously, these aren’t the same thing! So our study has gone wrong because our measurement has very poor construct validity.

4.6.4 Face validity

Face validity simply refers to whether or not a measure “looks like” it’s doing what it’s supposed to, nothing more. If you design a test of intelligence, and people look at it and say, “no, that test doesn’t measure intelligence”, then the measure lacks face validity. It’s as simple as that. Obviously, face validity isn’t very important from a purely scientific perspective. After all, what we care about is whether or not the measure actually does what it’s supposed to do, not whether it looks like it does what it’s supposed to do. Consequently, we generally don’t care much about face validity. That said, the concept of face validity serves three useful pragmatic purposes:

  • Sometimes, an experienced scientist will have a “hunch” that a particular measure won’t work. While these hunches have no strict evidentiary value, it’s often worth paying attention to them. Because oftentimes people know they can’t quite verbalise, there might be something to worry about even if you can’t quite say why. In other words, when someone you trust criticises the face validity of your study, it’s worth taking the time to think more carefully about your design to see if you can think of reasons why it might go awry. If you don’t find any reason for concern, you should probably not worry: after all, face validity doesn’t matter much.
  • (Very) often, completely uninformed people will also have a “hunch” that your research is bollocks. And they’ll criticise it on the internet or something. On close inspection, you’ll often notice that these criticisms are focused entirely on how the study “looks”, but not on anything more profound. The concept of face validity is useful for gently explaining to people that they need to substantiate their arguments further.
  • Expanding on the last point, if the beliefs of untrained people are critical (e.g. this is often the case for applied research where you actually want to convince policymakers of something or other), then you have to care about face validity. Simply because – whether you like it or not – a lot of people will use face validity as a proxy for real validity. If you want the government to change a law on scientific, psychological grounds, then it won’t matter how good your studies “really” are. If they lack face validity, you’ll find that politicians ignore you. Of course, it’s somewhat unfair that policy often depends more on appearance than fact, but that’s how things go.

4.6.5 Ecological validity

Ecological validity is a different notion of validity, similar to external validity but less important. The idea is that, to be ecologically valid, the entire study set-up should closely approximate the real-world scenario being investigated. In a sense, ecological validity is a kind of face validity – it relates mostly to whether the study “looks” right, but with a bit more rigour to it. To be ecologically valid, the study has to look right in a fairly specific way. The idea behind it is the intuition that a study that is ecologically valid is more likely to be externally valid. It’s no guarantee, of course. But the nice thing about ecological validity is that it’s much easier to check whether a study is ecologically valid than it is to check whether a study is externally valid. A simple example would be eyewitness identification studies. Most of these studies tend to be done in a university setting, often with a fairly simple array of faces to look at rather than a lineup. The length of time between seeing the “criminal” and being asked to identify the suspect in the “line up” is usually shorter. The “crime” isn’t real, so there’s no chance that the witness is scared, and there are no police officers present, so there’s not as much chance of feeling pressured. These things all mean that the study definitely lacks ecological validity. They might (but might not) mean that it also lacks external validity.

4.7 Confounds, artefacts and other threats to validity

If we look at the issue of validity in the most general fashion, the two biggest worries that we have are confounds and artefacts. These two terms are defined in the following way:

  • Confound: A confound is an additional, often unmeasured variable6 that turns out to be related to both the predictors and the outcomes. The existence of confounds threatens the internal validity of the study because you can’t tell whether the predictor causes the outcome, if the confounding variable causes it, etc.
  • Artefact: A result is said to be “artefactual” if it only holds in the special situation you tested in your study. The possibility that your result is an artefact describes a threat to your external validity, because it raises the possibility that you can’t generalise your results to the actual population you care about.

As a general rule, confounds are a more significant concern for non-experimental studies precisely because they’re not proper experiments. By definition, you’re leaving lots of things uncontrolled, so there’s a lot of scope for confounds working their way into your study. Experimental research tends to be much less vulnerable to confounds: the more control you have over what happens during the study, the more you can prevent confounds from appearing.

However, there are always swings and roundabouts, and when we start thinking about artefacts rather than confounds, the shoe is very firmly on the other foot. For the most part, artefactual results tend to be a concern for experimental studies than for non-experimental studies. To see this, it helps to realise that the reason that a lot of studies are non-experimental is precisely because what the researcher is trying to do is examine human behaviour in a more naturalistic context. By working in a more real-world context, you lose experimental control (making yourself vulnerable to confounds). Still, because you tend to be studying human psychology “in the wild”, you reduce the chances of getting an artefactual result. Or, to put it another way, when you take psychology out of the wild and bring it into the lab (which we usually have to do to gain our experimental control), you always run the risk of accidentally studying something different than you wanted to study: which is more or less the definition of an artefact.

Be warned though: the above is a rough guide only. It’s absolutely possible to have confounds in an experiment, and to get artefactual results with non-experimental studies. This can happen for all sorts of reasons, not least of which is researcher error. In practice, it’s really hard to think everything through ahead of time, and even very good researchers make mistakes. But other times it’s unavoidable, simply because the researcher has ethics (e.g. see 4.7.5).

Okay. There’s a sense in which almost any threat to validity can be characterised as a confound or an artefact: they’re pretty vague concepts. So let’s have a look at some of the most common examples…

4.7.1 History effects

History effects refers to the possibility that specific events may occur during the study itself that might influence the outcomes. For instance, something might happen between a pre-test and a post-test. Or in between testing participant 23 and participant 24. Alternatively, it might be that you’re looking at an older study, which was perfectly valid for its time, but the world has changed enough since then that the conclusions are no longer trustworthy. Examples of things that would count as history effects:

  • You’re interested in how people think about risk and uncertainty. You started your data collection in December 2010. But finding participants and collecting data takes time, so you were still finding new people in February 2011. Unfortunately for you (and even more unfortunately for others), the Queensland floods occurred in January 2011, causing billions of dollars of damage and killing many people. Not surprisingly, the people tested in February 2011 expressed quite different beliefs about handling risk than the people tested in December 2010. Which (if any) of these reflects the “true” beliefs of participants? I think the answer is probably both: the Queensland floods genuinely changed the beliefs of the Australian public, though possibly only temporarily. The key thing here is that the “history” of the people tested in February is quite different to people tested in December.
  • You’re testing the psychological effects of a new anti-anxiety drug. So what you do is measure anxiety before administering the drug (e.g. by self-report, and taking physiological measures, let’s say), then you administer the drug, and then you take the same measures afterwards. In the middle, however, because your labs are in Los Angeles, there’s an earthquake, which increases the anxiety of the participants.

4.7.2 Maturation effects

As with history effects, maturational effects are fundamentally about change over time. However, maturation effects aren’t in response to specific events. Instead, they relate to how people change on their own over time: we get older, we get tired, we get bored, etc. Some examples of maturation effects:

  • When doing developmental psychology research, you need to be aware that children grow up quite rapidly. So, suppose that you want to find out whether some educational trick helps with vocabulary size among 3-year-olds. One thing you need to be aware of is that the vocabulary size of children that age is growing at an incredible rate (multiple words per day), all on its own. If you design your study without taking this maturational effect into account, you won’t be able to tell if your educational trick works.
  • When running a very long experiment in the lab (say, something that goes for 3 hours), it’s very likely that people will begin to get bored and tired and that this maturational effect will cause performance to decline, regardless of anything else going on in the experiment

4.7.3 Repeated testing effects

An important type of history effect is the effect of repeated testing. Suppose I want to take two measurements of some psychological construct (e.g. anxiety). One thing I might be worried about is if the first measurement has an effect on the second measurement. In other words, this is a history effect in which the “event” that influences the second measurement is the first measurement itself! This is not at all uncommon. Examples of this include:

  • Learning and practice: e.g. “intelligence” at time 2 might appear to go up relative to time 1 because participants learned the general rules of how to solve “intelligence-test-style” questions during the first testing session.
  • Familiarity with the testing situation: e.g. if people are nervous at time 1, this might make the performance go down; after sitting through the first testing situation, they might calm down a lot precisely because they’ve seen what the testing looks like.
  • Auxiliary changes caused by testing: e.g. if a questionnaire assessing mood is boring, then mood at measurement at time 2 is more likely to become “bored”, precisely because of the boring measurement made at time 1.

4.7.4 Selection bias

Selection bias is a pretty broad term. Suppose you’re running an experiment with two groups of participants, where each group gets a different “treatment”, and you want to see if the different treatments lead to different outcomes. However, suppose that, despite your best efforts, you’ve ended up with a gender imbalance across groups (say, group A has 80% females and group B has 50% females). It might sound like this could never happen, but trust me, it can. This is an example of a selection bias, in which the people “selected into” the two groups have different characteristics. If any of those characteristics turn out to be relevant (say, your treatment works better on females than males), then you’re in a lot of trouble.

4.7.5 Differential attrition

One quite subtle danger to be aware of is called differential attrition, which is a kind of selection bias that is caused by the study itself. Suppose that, for the first time ever in the history of psychology, we manage to find a perfectly balanced and representative sample of people. We start running “our incredibly long and tedious experiment” on our perfect sample, but then, because our study is incredibly long and tedious, lots of people start dropping out. We can’t stop this: as we’ll discuss later in the chapter on research ethics, participants absolutely have the right to stop doing any experiment, any time, for whatever reason they feel like, and as researchers, we are morally (and professionally) obliged to remind people that they do have this right. So, suppose that “our incredibly long and tedious experiment” has a very high dropout rate. What do you suppose the odds are that this dropout is random? Answer: zero. Almost certainly, the people who remain are more conscientious, more tolerant of boredom etc., than those that leave. To the extent that (say) conscientiousness is relevant to the psychological phenomenon that we care about, this attrition can decrease the validity of our results.

When thinking about the effects of differential attrition, it is sometimes helpful to distinguish between two different types. The first is homogeneous attrition, in which the attrition effect is the same for all groups, treatments or conditions. In the example above, the differential attrition would be homogeneous if (and only if) the easily bored participants are dropping out of all the conditions in our experiment at about the same rate. The main effect of homogeneous attrition is likely to be that it makes your sample unrepresentative. As such, the biggest worry you’ll have is that the generalisability of the results decreases: in other words, you lose external validity.

The second type of differential attrition is heterogeneous attrition, in which the attrition effect is different for different groups. This is a much bigger problem: not only do you have to worry about your external validity, but you also have to worry about your internal validity too. To see why this is the case, let’s consider a very dumb study in which I want to see if insulting people makes them act more obediently. So, we design our experiment with two conditions. In the “treatment” condition, the experimenter insults the participant and then gives them a questionnaire designed to measure obedience. In the “control” condition, the experimenter engages in a bit of pointless chitchat and then gives them the questionnaire. Leaving aside the questionable scientific merits and dubious ethics of such a study, let’s have a think about what might go wrong here. As a general rule, when someone insults me to my face, I tend to get much less cooperative. So, there’s a pretty good chance that a lot more people are going to drop out of the treatment condition than the control condition. And this dropout isn’t going to be random. The people most likely to drop out would probably be those who don’t care all that much about the importance of obediently sitting through the experiment. Since the most bloody-minded and disobedient people all left the treatment group but not the control group, we’ve introduced a confound: the people who actually took the questionnaire in the treatment group were already more likely to be dutiful and obedient than the people in the control group. In short, in this study, insulting people doesn’t make them more obedient: it makes the more disobedient people leave the experiment! The internal validity of this experiment is completely shot.

4.7.6 Non-response bias

Non-response bias is closely related to selection bias and to differential attrition. The simplest version of the problem goes like this. You mail out a survey to 1000 people, and only 300 reply. The 300 people who replied are almost certainly not a random subsample. People who respond to surveys are systematically different to people who don’t. This introduces a problem when trying to generalise from those 300 people who replied, to the population at large; since you now have a very non-random sample. The issue of non-response bias is more general than this, though. Among the (say) 300 people that did respond to the survey, you might find that not everyone answered every question. If (say) 80 people chose not to answer one of your questions, does this introduce problems? As always, the answer is maybe. If the question that wasn’t answered was on the last page of the questionnaire, and those 80 surveys were returned with the last page missing, there’s a good chance that the missing data isn’t a big deal: probably the pages just fell off. However, if the question that 80 people didn’t answer was the most confrontational or invasive personal question in the questionnaire, then almost certainly you’ve got a problem. In essence, what you’re dealing with here is what’s called the problem of missing data. If the data that is missing was “lost” randomly, then it’s not a big problem. If it’s missing systematically, then it can be a big problem.

4.7.7 Regression to the mean

Regression to the mean is a curious variation on selection bias. It refers to any situation where you select data based on an extreme value on some measure. Because the measure has natural variation, it almost certainly means that when you take a subsequent measurement, that later measurement will be less extreme than the first one, purely by chance.

Here’s an example. Say we’re interested in whether a psychology education has an adverse effect on very smart kids. To do this, I find the 20 psych I students with the best high school grades and look at how well they’re doing at university. It turns out that they’re doing a lot better than average, but they’re not topping the class at university, even though they did top their classes at high school. What’s going on? The natural first thought is that this must mean that the psychology classes must be having an adverse effect on those students. However, while that might very well be the explanation, it’s more likely that what you’re seeing is an example of “regression to the mean”. To see how it works, let’s take a moment to think about what is required to get the best mark in a class, regardless of whether that class be at high school or at university. When you’ve got a big class, there are going to be lots of very smart people enrolled. To get the best mark, you have to be very smart, work very hard, and be a bit lucky. The exam has to ask just the right questions for your unique skills, and you have to not make any dumb mistakes (we all do that sometimes) when answering them. And that’s the thing: intelligence and hard work are transferrable from one class to the next. Luck isn’t. The people who got lucky in high school won’t be the same as the people who get lucky at university. That’s the very definition of “luck”. The consequence of this is that when you select people at the very extreme values of one measurement (the top 20 students), you’re selecting for hard work, skill and luck. But because the luck doesn’t transfer to the second measurement (only the skill and work), these people will all be expected to drop a little bit when you measure them a second time (at university). So their scores fall back a little bit, back towards everyone else. This is regression to the mean.

Regression to the mean is surprisingly common. For instance, if two very tall people have kids, their children will tend to be taller than average but not as tall as the parents. The reverse happens with very short parents: two very short parents will tend to have short children, but nevertheless, those kids will tend to be taller than the parents. It can also be extremely subtle. For instance, there have been studies done that suggest that people learn better from negative feedback than from positive feedback. However, the way that people tried to show this was to give people positive reinforcement whenever they did good and negative reinforcement when they did bad. And what you see is that after the positive reinforcement, people tended to do worse, but after the negative reinforcement, they tended to do better. But! Notice that there’s a selection bias here: when people do very well, you’re selecting for “high” values, and so you should expect (because of regression to the mean) that performance on the next trial should be worse, regardless of whether reinforcement is given. Similarly, after a bad trial, people will tend to improve all on their own. The apparent superiority of negative feedback is an artefact caused by regression to the mean (see Kahneman & Tversky, 1973 for discussion).

4.7.8 Experimenter bias

Experimenter bias can come in multiple forms. The basic idea is that the experimenter, despite the best of intentions, can accidentally influence the experiment results by subtly communicating the “right answer” or the “desired behaviour” to the participants. Typically, this occurs because the experimenter has special knowledge that the participant does not – either the right answer to the questions being asked or knowledge of the expected performance pattern for the condition the participant is in, and so on. The classic example of this happening is the case study of “Clever Hans”, which dates back to 1907 (Hothersall, 2004; Pfungst, 1911). Clever Hans was a horse that apparently was able to read and count and perform other human-like feats of intelligence. After Clever Hans became famous, psychologists started examining his behaviour more closely. It turned out that – not surprisingly – Hans didn’t know how to do maths. Rather, Hans was responding to the human observers around him. Because they did know how to count, and the horse had learned to change its behaviour when people changed theirs.

The general solution to the problem of experimenter bias is to engage in double-blind studies, where neither the experimenter nor the participant knows which condition the participant is in or knows what the desired behaviour is. This provides a good solution to the problem, but it’s essential to recognise that it’s not quite ideal and hard to pull off perfectly. For instance, the obvious way that we could try to construct a double-blind study is to have one of the PhD students (one who doesn’t know anything about the experiment) run the study. That feels like it should be enough. The only person (us) who knows all the details (e.g. correct answers to the questions, assignments of participants to conditions) has no interaction with the participants, and the person who does all the talking to people (the PhD student) doesn’t know anything. Except, that last part is improbable. For the PhD student to run the study effectively, they need to have been briefed by us, the researcher. And as it happens, the PhD student also knows a bit about our general beliefs about people and psychology. As a result of all this, it’s almost impossible for the experimenter to avoid learning a little bit about what expectations we have. And even a little knowledge can have an effect: suppose the experimenter accidentally conveys that the participants are expected to do well in this task. Well, there’s a thing called the “Pygmalion effect”: if you expect great things of people, they’ll rise to the occasion; but if you expect them to fail, they’ll do that too. In other words, the expectations become a self-fulfilling prophecy.

4.7.9 Demand effects and reactivity

When talking about experimenter bias, the worry is that the experimenter’s knowledge or desires for the experiment are communicated to the participants and that these affect people’s behaviour (Rosenthal, 1966). However, even if you manage to stop this from happening, it’s almost impossible to stop people from knowing that they’re part of a psychological study. And the mere fact of knowing that someone is watching/studying you can have a pretty big effect on behaviour. This is generally referred to as reactivity or demand effects. The Hawthorne effect captures the idea that people alter their performance because of the attention that the study focuses on them. The effect takes its name from the “Hawthorne Works” factory outside of Chicago (see Adair, 1984). A study done in the 1920s looking at the effects of lighting on worker productivity at the factory turned out to be an effect of the fact that the workers knew they were being studied rather than the lighting.

To get a bit more specific about how the mere fact of being in a study can change how people behave, it helps to think like a social psychologist and look at some of the roles that people might adopt during an experiment. Still, it might not adopt if the corresponding events were occurring in the real world:

  • The good participant tries to be too helpful to the researcher: he or she seeks to figure out the experimenter’s hypotheses and confirm them.
  • The negative participant does the exact opposite of the good participant: he or she seeks to break or destroy the study or the hypothesis in some way.
  • The faithful participant is unnaturally obedient: he or she seeks to follow instructions perfectly, regardless of what might have happened in a more realistic setting.
  • The apprehensive participant gets nervous about being tested or studied, so much so that his or her behaviour becomes highly unnatural or overly socially desirable.

4.7.10 Placebo effects

The placebo effect is a specific type of demand effect that we worry a lot about. It refers to the situation where the mere fact of being treated causes an improvement in outcomes. The classic example comes from clinical trials: if you give people a completely chemically inert drug and tell them that it’s a cure for a disease, they will tend to get better faster than people who aren’t treated at all. In other words, it is people believing that they are being treated that causes the improved outcomes, not the drug.

4.7.11 Situation, measurement and subpopulation effects

In some respects, these terms are a catch-all term for “all other threats to external validity”. They refer to the fact that the choice of the subpopulation from which you draw your participants, the location, timing and manner in which you run your study (including who collects the data) and the tools that you use to make your measurements might all be influencing the results. Specifically, the worry is that these things might be influencing the results in such a way that the results won’t generalise to a wider array of people, places and measures.

4.7.12 Fraud, deception and self-deception

Textbooks assessing the validity of a study often seem to make the assumption that the researcher is honest. I find this hilarious. While the vast majority of scientists are honest, in my experience at least, some are not.7 Not only that, but scientists are not immune to belief bias – it’s easy for a researcher to end up deceiving themselves into believing the wrong thing, which can lead them to conduct subtly flawed research, and then hide those flaws when they write it up. So you need to consider not only the (probably unlikely) possibility of outright fraud, but also the (probably quite common) possibility that the research is unintentionally “slanted”. Here’s a list of a few ways in which these issues can arise:

  • Data fabrication. Sometimes, people just make up the data. This is occasionally done with “good” intentions. For instance, the researcher believes that the fabricated data do reflect the truth and may actually reflect “slightly cleaned up” versions of actual data. On other occasions, the fraud is deliberate and malicious. Some high-profile examples where data fabrication has been alleged or shown include Cyril Burt (a psychologist who is thought to have fabricated some of his data), Andrew Wakefield (who has been accused of fabricating his data connecting the MMR vaccine to autism) and Hwang Woo-suk (who falsified a lot of his data on stem cell research).
  • Hoaxes. Hoaxes share many similarities with data fabrication, but they differ in the intended purpose. A hoax is often a joke, and many of them are intended to be (eventually) discovered. Often, the point of a hoax is to discredit someone or some field. There are quite a few well-known scientific hoaxes that have occurred over the years (e.g. Piltdown man). Some were deliberate attempts to discredit particular fields of research (e.g. the Sokal affair).
  • Data misrepresentation. While fraud gets most of the headlines, it’s much more common in my experience to see data misrepresented. Often the data doesn’t say what the researchers think it says. This isn’t the result of deliberate dishonesty, but it’s due to a lack of sophistication in the data analysis. For instance, think back to the example of Simpson’s paradox. It’s very common to see people present “aggregated” data of some kind, and sometimes, when you dig deeper and find the raw data yourself, you find that the aggregated data tell a different story to the disaggregated data. Alternatively, you might find that some aspect of the data is being hidden because it tells an inconvenient story (e.g. the researcher might choose not to refer to a particular variable). There are a lot of variants on this, many of which are very hard to detect.
  • Study “misdesign”. Okay, this one is subtle. The issue here is that a researcher designs a study with built-in flaws, which are never reported in the paper. The data reported are authentic and are correctly analysed, but they are produced by a study that is actually quite wrongly put together. The researcher really wants to find a particular effect, and so the study is set up in such a way as to make it “easy” to (artefactually) observe that effect. One sneaky way to do this – in case you’re feeling like dabbling in a bit of fraud yourself – is to design an experiment in which it’s obvious to the participants what they’re “supposed” to be doing and then let reactivity work its magic for you. If you want, you can add all the trappings of double-blind experimentation etc. It won’t make a difference since the study materials are subtly telling people what you want them to do. When you write up the results, the fraud won’t be obvious to the reader: what’s obvious to the participant when they’re in the experimental context isn’t always obvious to the person reading the paper. Of course, the way we’ve described this makes it sound like it’s always fraud: probably there are cases where this is done deliberately, but the bigger concern has been unintentional misdesign. The researcher believes. And so the study just happens to end up with a built-in flaw, and that flaw then magically erases itself when the study is written up for publication.
  • Data mining & post hoc hypothesising. Another way in which the authors of a study can more or less lie about what they found is by engaging in what’s referred to as “data mining”. As we’ll discuss later in the class, if you keep trying to analyse your data in lots of different ways, you’ll eventually find something that “looks” like a real effect but isn’t. This is referred to as “data mining”. It used to be quite rare because data analysis used to take weeks, but now that everyone has very powerful statistical software on their computers, it’s becoming very common. Data mining per se isn’t “wrong”, but the more that you do it, the bigger the risk you’re taking. The wrong thing is unacknowledged data mining. That is, the researcher runs every possible analysis known to humanity, finds the one that works, and then pretends that this was the only analysis that they ever conducted. Worse yet, they often “invent” a hypothesis after looking at the data to cover up the data mining. To be clear: it’s not wrong to change your beliefs after looking at the data and to reanalyse it using your new “post hoc” hypotheses. What is wrong is failing to acknowledge that you’ve done so. If you acknowledge that you did it, then other researchers are able to take your behaviour into account. If you don’t, then they can’t. And that makes your behaviour deceptive.
  • Publication bias & self-censoring. Finally, a pervasive bias is the “non-reporting” of negative results. This is almost impossible to prevent. Journals don’t publish every article submitted to them: they prefer to publish articles that find “something”. So, if 20 people run an experiment looking at whether reading Finnegans Wake causes insanity in humans, and 19 of them find that it doesn’t, which one do you think is going to get published? Obviously, it’s the one study that did find that Finnegans Wake causes insanity 8. This is an example of a publication bias: since no one ever published the 19 studies that didn’t find an effect, a naive reader would never know that they existed. Worse yet, most researchers “internalise” this bias and end up self-censoring their research. Knowing that negative results aren’t going to be accepted for publication, they never even try to report them. As a friend of Danielle’s says, “for every experiment that you get published, you also have ten failures”. And she’s right. The catch is, while some (maybe most) of those studies are failures for boring reasons (e.g. you stuffed something up), others might be genuine “null” results that you ought to acknowledge when you write up the “good” experiment – and telling which is which, is often hard to do. A good place to start is a paper by Ioannidis (2005) with the depressing title “Why most published research findings are false”. We’d also suggest taking a look at work by Kühberger et al. (2014) presenting statistical evidence that this actually happens in psychology.

There’s probably a lot more issues like this to think about, but that’ll do to start with. It’s the obvious truth that real world science is conducted by actual humans, and only the most gullible of people automatically assumes that everyone else is honest and impartial. Actual scientists aren’t usually that naive, but for some reason, the world likes to pretend that we are, and the textbooks we usually write seem to reinforce that stereotype.

4.8 Summary

This chapter isn’t really meant to provide a comprehensive discussion of psychological research methods: it would require another volume just as long as this one does justice to the topic. However, in real life, statistics and study design are tightly intertwined, so discussing some key topics is convenient. In this chapter, we’ve briefly discussed the following:

  • Introduction to psychological measurement. What does it mean to operationalise a theoretical construct? What does it mean to have variables and take measurements?
  • Scales of measurement and types of variables. Remember that there are two different distinctions here: there’s the difference between discrete and continuous data, and there’s the difference between the four different scale types (nominal, ordinal, interval and ratio).
  • Reliability of a measurement. If I measure the “same” thing twice, should I expect the same result? Only if my measure is reliable. But what does it mean to talk about doing the “same” thing? Well, that’s why we have different types of reliability. Make sure you remember what they are.
  • Terminology: predictors and outcomes. What roles do variables play in an analysis? Can you remember the difference between predictors and outcomes? Dependent and independent variables? Etc.
  • Experimental and non-experimental research designs. What makes an experiment an experiment? Is it a nice white lab coat, or does it have something to do with researcher control over variables?
  • Validity and its threats. Does your study measure what you want it to? How might things go wrong?

All this should make clear to you that study design is a critical part of research methodology. This chapter is built from the classic little book by Campbell & Stanley (1963), but there are, of course, a large number of textbooks out there on research design. Spend a few minutes with your favourite search engine, and you’ll find dozens.


  1. Actually, temperature isn’t strictly an interval scale, in the sense that the amount of energy required to heat something up by 3\(^\circ\) depends on its current temperature. So in the sense that physicists care about, temperature isn’t actually an interval scale.↩︎

  2. The reason why we say that it’s unmeasured is that if you have measured it, then you can use some fancy statistical tricks to deal with the confound. Because of the existence of these statistical solutions to the problem of confounds, we often refer to a confound that we have measured and dealt with as a covariate. Dealing with covariates is a topic for a more advanced course, but it’s comforting to know that it exists.↩︎

  3. Some people might argue that if you’re not honest then you’re not a real scientist. That does have some truth, but that’s disingenuous (google the “No true Scotsman” fallacy). The fact is that there are lots of people who are employed ostensibly as scientists, and whose work has all of the trappings of science, but who are outright fraudulent. Pretending that they don’t exist by saying that they’re not scientists is just childish.↩︎

  4. Clearly, the real effect is that only insane people would even try to read Finnegans Wake.↩︎