Chapter 2 Basic concepts

In this chapter, we’ll discuss basic concepts and terminology related to measurement and study design. Since this book focuses on data analysis, it won’t give you enough information to design studies of your own. To do that, you’ll need to take a course in research design and methods. There are a couple of good books on the topic.

This chapter relies on Howitt & Cramer (2020) and Stevens (1946) for discussing scales of measurement, and on Boncz (2015), Michalos (2014), and Marks & Yardley (2004) for discussing reliability and validity.

2.1 Introduction to psychological measurement

First, data collection can be thought of as a kind of measurement. That is, what we’re trying to do here is measure something about human behaviour or the human mind. What do we mean by “measurement”?

Measurement as a concept comes down to finding some way of assigning numbers, labels, or other well-defined descriptions to phenomena. So, any of the following would count as a psychological measurement:

  • Danielle’s age is 33 years.
  • She does not like anchovies.
  • Her chromosomal gender is male.
  • Her self-identified gender is female.

In the short list above, each statement names the thing to be measured (age, liking of anchovies, chromosomal gender, self-identified gender) and gives the measurement itself (33 years, does not like, male, female). We can expand on this a little bit by thinking about the set of possible measurements that could have arisen in each case:

  • Age (in years) could have been 1, 2, 3 … years, etc. The upper bound on what the age could be is a bit fuzzy, but in practice, you’d be safe in saying that the largest possible age is 150 since no human has ever lived that long. And age doesn’t really have a true zero, but that is not important just yet.
  • When asked if someone likes anchovies, they might say I do, I do not, I have no opinion, or I sometimes do.
  • Chromosomal gender is almost certainly going to be male (XY) or female (XX), but there are a few other possibilities: Klinefelter’s syndrome (XXY), for instance, is more similar to male than to female. And there are other possibilities, too.
  • Self-identified gender is also very likely to be male or female, but the set of possible answers is wider: transgender, nonbinary, queer, etc.

As you can see, for some things, like age, it seems pretty apparent what the set of possible measurements should be, whereas, for other things, it gets a bit tricky. But even regarding someone’s age, it’s a bit more subtle. For instance, if you’re a developmental psychologist, measuring in years is way too crude, and so you often measure age in years and months (if a child is 2 years and 11 months, this is usually written as “2;11”). If you’re interested in newborns, you might want to measure age in days since birth, or maybe even hours since birth.

Looking at this a bit more closely, you might also realise that the concept of “age” isn’t all that precise. Generally, when we say “age”, we implicitly mean “the length of time since birth”. But that’s not always the right way to do it.

Suppose you’re interested in how newborn babies control their eye movements. If Baby Alice is born 3 weeks premature and Baby Bianca is born 1 week late, would it really make sense to say that they are the “same age” if we encountered them “2 hours after birth”? In a sense, yes. By social convention, we use birth as our reference point for talking about age in everyday life since it defines the amount of time the person has been operating as an independent entity in the world. But from a scientific perspective, that’s not the only thing we care about. You might want to measure age from conception, and voilà, you’re already thinking about all the potential problems with analysis and comparisons. When we think about the biology of human beings, it’s often helpful to think of ourselves as organisms that have been growing and maturing since conception. From that perspective, Alice and Bianca aren’t the same age at all. So you might want to define the concept of “age” in two different ways: the length of time since conception and the length of time since birth. It won’t make much difference when dealing with adults, but when dealing with newborns, it might. In other words, how you specify the allowable measurement values is important.

Still, there’s the question of methodology. What specific measurement method will you use to find out someone’s age? There are several options:

  • You could just ask people, “how old are you?” The method of self-report is fast, cheap and easy, but it only works with people old enough to understand the question, and some people lie about their age.
  • You could ask an authority (e.g. a parent), “how old is your child?” This method is fast, and it’s not all that hard when dealing with kids since the parent is almost always around. It doesn’t work as well if you want to know “age since conception” since a lot of parents can’t say for sure when conception took place. You might need a different authority (e.g. an obstetrician).
  • You could look up official records, like birth certificates. This is time-consuming and annoying, but it has its uses (e.g. if the person is now dead).

All of the ideas discussed in the previous section relate to the concept of operationalisation. To be a bit more precise about the idea, operationalisation is the process by which we take a meaningful but somewhat vague concept and turn it into an accurate measurement. The method of operationalisation can involve several different things:

  • Being precise about the definition: for instance, does “age” mean “time since birth” or “time since conception” in the context of your research?
  • Determining the measurement method: will you use self-report to measure age, ask a parent, or look up an official record? If you’re using self-report, how will you phrase the question?
  • Defining the set of allowable values that the measurement can take. Note that these values don’t always have to be numerical, though they often are. When measuring age, the values are numerical, but we still need to think carefully about what numbers are allowed. Do we want age in years, years and months, days, or hours? The values aren’t numerical for other types of measurements (e.g. gender). But, as before, we need to consider what values are allowed. If we’re asking people to self-report their gender, what options do we allow them to choose from? Is it enough to allow only “male” or “female”? Do you need an “other” option? Or should we not give people any specific options and let them answer in their own words? And if you open up the set of possible values to include all verbal responses, how will you interpret their answers?

Operationalisation is tricky, and there’s no “one, true way” to do it. How you operationalise the informal concept of “age” or “gender” into a formal measurement depends on what you need to use the measurement for. You’ll often find that the community of scientists who work in your area have some well-established ideas for how to go about it. Tip: when working on your dissertation, consult your supervisor and the literature to see what the community has settled on.

In other words, operationalisation needs to be thought through case-by-case. Nevertheless, while there are a lot of issues that are specific to each individual research project, there are some aspects to it that are pretty general.

Let’s take a moment to clear up our terminology and, in the process, introduce one more term. Here are four different things that are closely related to each other:

  • A theoretical construct. This is the concept that you’re trying to measure, like “age”, “gender”, or an “opinion”.
  • A measure. The measure refers to the method or instrument used to make your observations. A question in a survey, a behavioural observation or a brain scan could all count as a measure.
  • Operationalisation. This is the logical connection between the measure and the theoretical construct: a definition of how you assign value to the concept (calculation, range, level of measurement etc.).
  • A variable. Finally, a new term. A variable is the outcome of the measurement process.

Example 2.1 (Method) Here’s an example:

  • Construct: attitude towards Prince Harry the Duke of Sussex (the underlying concept)
  • Measure: 5-point Likert scale in a self-report survey (instrument)
  • Operationalisation: survey a sample of UK online newsreaders and ask them to rate their level of agreement on a 5-point Likert scale with the following statement: “I personally find Prince Harry the Duke of Sussex an agreeable person”6.
  • Variable: the numerical value of the response to the question, e.g., 5: “strongly agree”…

2.2 Measurement levels

As the previous section indicates, the outcome of a psychological measurement is called a variable. But not all variables are of the same qualitative type, and it’s handy to understand what types there are. A very useful concept for distinguishing between different types of variables is what’s known as scales of measurement, or in CogStat terminology, measurement levels.

2.2.1 Nominal categories

Definition 2.1 (Nominal scale variable) A nominal scale variable (also referred to as a categorical variable) is the result of a qualitative measurement, where the number of possible values is limited and fixed; the values represent a name, a label, a classification, or non-overlapping categories; and there is no order, hierarchy or ranking between the categories. E.g. eye colour, place of residence, gender.

For these kinds of variables, it doesn’t make any sense to say that one of them is “bigger” or “better” than any other one, and it doesn’t make any sense to average them. The classic example for this is “eye colour”. Eyes can be blue, green and brown, among other possibilities, but none of them is any “better” than any other one. As a result, it would feel bizarre to talk about an “average eye colour”. Similarly, gender is nominal too: male isn’t better or worse than female, nor does it make sense to try to talk about an “average gender”. In short, nominal scale variables are those for which the only thing you can say about the different possibilities is that they are different. That’s it.

Note that sometimes nominal variables will have numbers assigned to them, usually for technical reasons (e.g., in survey data outputs): 1: male, 2: female, 3: nonbinary… This is a common practice, but the variable is still nominal: the numbers are just a way of representing the categories, and you should never, ever, ever (!) analyse them as if they were numerical/score measurements.
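
To make the warning concrete, here is a minimal pandas sketch (the codes and labels are invented for illustration): storing the codes as a categorical type makes the nominal level explicit and guards against treating them as scores.

```python
import pandas as pd

# Hypothetical survey export: 1 = male, 2 = female, 3 = nonbinary
raw_codes = pd.Series([1, 2, 2, 3, 1, 2])

# Treated as numbers, nonsense "statistics" are easy to produce:
print(raw_codes.mean())  # 1.83... -- meaningless for a nominal variable

# Mapping codes to labels and declaring the variable categorical
# blocks numeric operations and keeps only valid summaries available.
gender = raw_codes.map({1: "male", 2: "female", 3: "nonbinary"}).astype("category")
print(gender.value_counts())  # frequencies are the sensible summary here
```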

Example 2.2 (Nominal scale variable example) Suppose we were researching how people commute to and from work. One variable we would have to measure would be what kind of transportation people use to get to work. This “transport type” variable could have quite a few possible values, including: “train”, “bus”, “car”, “bicycle”, etc. For now, let’s suppose that these four are the only possibilities, and suppose that when we ask 100 people how they got to work today, we get this:

Transportation Number of people
Train 12
Bus 30
Car 48
Bicycle 10

So, what’s the average transportation type? Obviously, the answer here is that there isn’t one. You can say that travel by car is the most popular method and travel by bicycle is the least popular method, but that’s about all; such statements are based on the frequency of occurrence (i.e., counts). Similarly, notice that the order in which the options are listed isn’t very exciting. We could have chosen to display the data like this:

Transportation Number of people
Car 48
Train 12
Bicycle 10
Bus 30

– and nothing really changes.
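
Here is a quick sketch, assuming pandas, of the summaries that do make sense for these counts (frequencies and the mode); this is an illustration, not CogStat output.

```python
import pandas as pd

# The 100 responses from the tables above, already tallied
transport = pd.Series({"Train": 12, "Bus": 30, "Car": 48, "Bicycle": 10})

print(transport.idxmax())           # 'Car' -- the mode, i.e. the most popular
print(transport.idxmin())           # 'Bicycle' -- the least popular
print(transport / transport.sum())  # relative frequencies; row order is arbitrary
```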

2.2.2 Ordinal scale and rank

Definition 2.2 (Ordinal scale variable) An ordinal scale variable or rank is the result of a measurement where there is a natural, meaningful way to order or rank the different outcome possibilities. E.g. finishing position in a race, education status.

The quantitative difference between the outcomes might be unknown or uneven. E.g. education status: finishing elementary school takes 8 years in Europe, while an undergraduate degree usually takes 3 years. Or in a race: you can say that the person who finished first was faster than the person who finished second, but you don’t know the exact difference (unless you have “finished at” timestamp data, which in turn would no longer be an ordinal variable). The important thing is that there is a natural, meaningful way to order the outcomes, but we don’t quantify the difference between them.

Note that you can have a numeric code assigned to an ordinal variable. However, do not process these numbers as if they were meaningful beyond representing an order. E.g., you might have a variable that measures level of education and code it as 1: elementary school, 2: high school, 3: undergraduate degree, 4: graduate degree. The variable still only represents an order (i.e., a level): you cannot meaningfully add, subtract, divide or multiply these numbers.
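
Here is a minimal sketch of how such a coding might be handled in pandas so that the order is preserved while arithmetic stays off-limits (the labels are the ones from the example above):

```python
import pandas as pd

levels = ["elementary school", "high school", "undergraduate degree", "graduate degree"]
education = pd.Series(
    ["high school", "graduate degree", "elementary school", "high school"],
    dtype=pd.CategoricalDtype(categories=levels, ordered=True),
)

# Order-based operations are meaningful for an ordinal variable:
print(education > "high school")         # is each response above high school?
print(education.min(), education.max())  # respects the declared order

# Arithmetic is not meaningful, and pandas will refuse it:
# education + 1  # raises TypeError -- the codes only represent an order
```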

Example 2.3 (Ordinal scale variable example) Suppose we’re interested in people’s attitudes to climate change, and we ask them to pick one of these four statements that most closely matches their beliefs:

1 Temperatures are rising because of human activity
2 Temperatures are rising, but we don’t know why
3 Temperatures are rising, but not because of humans
4 Temperatures are not rising

Notice that these four statements actually do have a natural ordering in terms of “the extent to which they agree with the current science”. Statement 1 is a close match, statement 2 is a reasonable match, statement 3 is a poor match, and statement 4 is in strong opposition to the science. So, in terms of the thing we’re interested in (the extent to which people endorse the science), we can order the items as 1 > 2 > 3 > 4. Since this ordering exists, it would be peculiar to list the options like this:

3 Temperatures are rising, but not because of humans
1 Temperatures are rising because of human activity
4 Temperatures are not rising
2 Temperatures are rising, but we don’t know why

– because it seems to violate the natural “structure” of the question.

So, let’s suppose we asked 100 people this question and got the following answers:

Response Number
1 Temperatures are rising because of human activity 51
2 Temperatures are rising, but we don’t know why 20
3 Temperatures are rising, but not because of humans 10
4 Temperatures are not rising 19

When analysing these data, it seems quite reasonable to try to group (1), (2) and (3) together and say that 81 of 100 people were willing to at least partially endorse the science. And it’s also quite reasonable to group (2), (3) and (4) together and say that 49 of 100 people registered at least some disagreement with the dominant scientific view. However, it would be entirely bizarre to try to group (1), (2) and (4) together and say that 90 of 100 people said what? There’s nothing sensible that allows you to group those responses together at all.

That said, notice that while we can use the natural ordering of these items to construct sensible groupings, what we can’t do is average them. For instance, in our simple example here, the “average” response to the question is 1.97. We would love to know if someone can tell us what that means.
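
To make this concrete, here is a minimal pandas sketch that reproduces the sensible cumulative groupings and the meaningless “average” from the table above:

```python
import pandas as pd

counts = pd.Series({1: 51, 2: 20, 3: 10, 4: 19})  # response code -> frequency

# Sensible: groupings that respect the natural ordering
print(counts[[1, 2, 3]].sum())  # 81 -- at least partially endorse the science
print(counts[[2, 3, 4]].sum())  # 49 -- register at least some disagreement

# Not sensible: averaging the response codes
mean_code = sum(code * n for code, n in counts.items()) / counts.sum()
print(mean_code)  # 1.97 -- but what would a response of "1.97" mean?
```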

2.2.3 Interval scale

Definition 2.3 (Interval scale variable) An (equal-)interval scale variable is the result of a quantitative measurement, where the difference between the outcomes is meaningful, but no true zero value can be assigned to our variable. E.g. temperature in Celsius or Fahrenheit.

In contrast to nominal and ordinal scale variables, the differences between the numbers are interpretable: addition and subtraction make sense for interval scale variables. The intervals are same-sized, but a measurement value of 0 does not mean “nothing”/“none at all” on the Celsius scale: 0° means “the temperature at which water freezes”; it’s a useful but arbitrary label, not a true zero. As a consequence, it is pointless to try to multiply and divide temperatures: it is wrong to claim that 20° is negative two times as hot as −10°.

Example 2.4 (Interval scale variable example) Suppose we’re interested in looking at how the attitudes of first-year university students have changed over time, and we need to capture the year they started. This is an interval scale variable. A student who started in 2003 arrived 5 years earlier than a student who started in 2008. However, it would be completely insane to divide 2008 by 2003 and say that the second student started “1.0024 times later” than the first one.
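
As a small numeric sketch of what an interval scale licenses: differences are fine, ratios are not, and for temperature the standard remedy is converting to Kelvin, which has a true zero.

```python
start_a, start_b = 2003, 2008

print(start_b - start_a)  # 5 -- meaningful: student B started 5 years later
print(start_b / start_a)  # 1.0024... -- meaningless: year 0 is arbitrary

# The same trap with temperature in Celsius:
c_cold, c_warm = 10.0, 20.0
print(c_warm / c_cold)                        # 2.0 -- "twice as hot"? No.
print((c_warm + 273.15) / (c_cold + 273.15))  # ~1.035 -- the ratio on the Kelvin scale
```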

2.2.4 Ratio scale

Definition 2.4 (Ratio scale variable) A ratio scale variable is the result of a quantitative measurement, where both the difference between the outcomes and the ratio of the outcomes are meaningful, and the variable has a true zero value. E.g. distance in meters, heart rate, mass, temperature on the Kelvin scale.

The fourth and final type of variable to consider is a ratio scale variable, in which zero really means zero, and it’s okay to multiply and divide on top of addition and subtraction. You can have a heart rate of zero, or in other words, “no heart rate at all” (an absolute zero), but that sadly means that you are likely dead. Temperature in Kelvin is ratio-scaled because it has a true zero (absolute zero), and 100 K really means twice as much thermal energy as 50 K.

Example 2.5 (Ratio scale variable example) A psychological example would be the result of a short-term working memory capacity test7, where we ask respondents to remember a set of 5-letter words and recall them. Let’s make our variable the number of words that they successfully recall. Person A is able to recall 12 words, Person B can recall 6 words, Person C cannot recall a single word (i.e., 0 words), and Person D can recall 7 words. We can set an order between them: Person A > Person D > Person B > Person C. There is equal distance between the possible units, so subtraction is meaningful (interval), and there is a true zero (Person C). It therefore also makes sense to say that Person A recalled twice as many words as Person B.

2.2.5 The special case of the Likert scale

The humble Likert scale is the bread-and-butter tool of all survey design. You have filled out hundreds, maybe thousands of them, and odds are you’ve even used one yourself. Suppose we have a survey question that looks like this:

Which of the following best describes your opinion of the statement that “all pirates are awesome” …

and then, the options presented to the participant are these:

  1. Strongly disagree
  2. Disagree
  3. Neither agree nor disagree
  4. Agree
  5. Strongly agree

This set of items is an example of a 5-point Likert scale: people are asked to choose among one of several (in this case, 5) clearly ordered possibilities, generally with a verbal descriptor given in each case. However, it’s not necessary for all items to be explicitly described. This is a perfectly good example of a 5-point Likert scale too, with only the endpoints labelled:

  1. Strongly disagree
  2.
  3.
  4.
  5. Strongly agree

Likert scales are convenient, if somewhat limited, tools. The question is, what kind of variable are they? They’re obviously discrete since you can’t give a response of 2.5. They’re obviously not nominal scale since the items are ordered, and they’re not ratio scale either since there’s no natural zero8.

But are they ordinal scale or interval scale? One argument says that we can’t prove that the difference between “strongly agree” and “agree” is of the same size as the difference between “agree” and “neither agree nor disagree”; in everyday life, it’s pretty apparent they’re not. This suggests that we ought to treat Likert scales as ordinal variables. On the other hand, many participants do take the whole “on a scale from 1 to 5” part seriously and tend to act as if the differences between the five response options were equidistant. So, while theoretically it is not an interval scale, researchers often treat the Likert scale as a quasi-interval scale.

2.2.6 Continuous versus discrete variables

There’s a second kind of distinction that you need to be aware of regarding what types of variables you can run into. This is the distinction between continuous and discrete data types.

Definition 2.5 (Continuous variables) A continuous variable can take on any value on a spectrum: for any two values, it is always logically possible for a third value to lie in between.

Example 2.6 (Continuous variable example) Response time is continuous. If Alan takes 3.1 seconds and Ben takes 2.3 seconds to respond to a question, then Cameron’s response time can lie in between, at 3.0 seconds. And, of course, it would also be possible for David to take 3.031 seconds to respond, meaning that his RT would lie in between Cameron’s and Alan’s. And while in practice it might be impossible to measure RT that precisely, it’s certainly possible in principle. Because we can always find a new value for RT in between any two other ones, we say that RT is continuous.

Definition 2.6 (Discrete variables) A discrete variable can take on only certain distinct values; nothing can lie in between two adjacent values.

Example 2.7 (Discrete variable example) Nominal scale variables are always discrete: there isn’t a type of transportation that falls “in-between” trains and bicycles, not in the strict mathematical way that 2.3 falls in between 2 and 3. So transportation type is discrete.

Similarly, ordinal scale variables are always discrete: although “2nd place” does fall between “1st place” and “3rd place”, there’s nothing that can logically fall in between “1st place” and “2nd place”.

Interval scale and ratio scale variables can go either way. Temperature in degrees Celsius (an interval scale variable) is also continuous; however, the year you went to school (an interval scale variable) is discrete. There’s no year between 2002 and 2003. The number of questions you get right on a true-or-false test (a ratio scale variable) is also discrete: since a true-or-false question doesn’t allow you to be “partially correct”, there’s nothing in between 5/10 and 6/10.

Note that some people might say “discrete variable” when they mean “nominal scale variable”. While all nominal scale variables are discrete, not all discrete variables are nominal.

2.2.7 A summary guide for levels of measurement

|                              | Nominal                | Ordinal                           | Interval                            | Ratio                                      |
|------------------------------|------------------------|-----------------------------------|-------------------------------------|--------------------------------------------|
| Data types                   |                        |                                   |                                     |                                            |
| Discrete                     | ✓ (gender, birthplace) | ✓ (education, finishing position) | ✓ (year of enrolment to university) | ✓ (number of questions answered correctly) |
| Continuous                   |                        |                                   | ✓ (attitude)                        | ✓ (height, weight, heart rate, distance)   |
| Properties                   |                        |                                   |                                     |                                            |
| Can be ordered or ranked     |                        | ✓                                 | ✓                                   | ✓                                          |
| Equidistant units            |                        |                                   | ✓                                   | ✓                                          |
| Has a meaningful, true zero  |                        |                                   |                                     | ✓                                          |
| Valid operations             |                        |                                   |                                     |                                            |
| + addition, − subtraction    |                        |                                   | ✓                                   | ✓                                          |
| × multiplication, ÷ division |                        |                                   |                                     | ✓                                          |
| Central tendency measures    |                        |                                   |                                     |                                            |
| Mode                         | ✓                      | ✓                                 | ✓                                   | ✓                                          |
| Median                       |                        | ✓                                 | ✓                                   | ✓                                          |
| Mean                         |                        |                                   | ✓                                   | ✓                                          |
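
One way to internalise the table is as a lookup from measurement level to the statistics it licenses. The sketch below is purely illustrative; the function and dictionary are assumptions for this book, not part of CogStat’s API.

```python
# Illustrative restatement of the summary table above -- not CogStat's API
VALID_CENTRAL_TENDENCY = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean"],
    "ratio": ["mode", "median", "mean"],
}

def valid_central_tendency(level: str) -> list[str]:
    """Return the central tendency measures licensed by a measurement level."""
    return VALID_CENTRAL_TENDENCY[level.lower()]

print(valid_central_tendency("ordinal"))  # ['mode', 'median']
```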

2.3 Independent and dependent variables

Usually, when we do some research, we end up with lots of different variables. Then, when we analyse our data, we often try to explain some of the variables in terms of the other variables. It’s essential to keep the two roles, “thing doing the explaining” and “thing being explained”, distinct. So let’s be clear about this now. Firstly, we might as well get used to the idea of using mathematical symbols to describe variables since it’s going to happen repeatedly. Let’s denote the “to be explained” variable Y, and the variables “doing the explaining” as X₁, X₂, etc.

Now, when we are doing analysis, we have different names for X and Y, since they play different roles. The classical names for these roles are independent variable (IV) and dependent variable (DV). The IV is the variable you use to explain (i.e., X) and the DV is the variable being explained (i.e., Y). The logic behind these names goes like this: if there is a relationship between X and Y, then we can say that Y depends on X, and if we have designed our study “properly”, then X isn’t dependent on anything else. However, those names are horrible: they’re hard to remember, and they’re highly misleading because (a) the IV is never actually “independent of everything else” and (b) if there’s no relationship, then the DV doesn’t actually depend on the IV. And, because we’re not the only people who think that IV and DV are just awful names, there are several alternatives that some find more appealing. The terms used in these notes are predictors and outcomes. The idea here is that you’re trying to use X (the predictors) to make guesses about Y (the outcomes). This is summarised in Table 2.1.

Table 2.1: The terminology used to distinguish between different roles that a variable can play when analysing a data set

| role of the variable | classical name            | modern name |
|----------------------|---------------------------|-------------|
| to be explained      | dependent variable (DV)   | outcome     |
| to do the explaining | independent variable (IV) | predictor   |

2.4 Reliability

By applying psychological measures, we end up with variables, which can come in many different types. But the inevitable question arises: is the measurement any good? We’ll answer this in terms of two related ideas: reliability and validity.

Definition 2.7 (Reliability) The reliability of a measure is the extent to which it is dependably, consistently, and stably giving the same result when measuring the same observation.

Example 2.8 (Reliability example A) You are designing a health psychology experiment to test the effect of psilocybin on mood disorders9. You are trying to find the right bodyweight-adjusted dose. You have a scale in the lab in Room A. You ask a participant to step on it, and you read 65 kg. For some reason, within a minute (in which they didn’t drink, eat or go to the bathroom), you have to ask them to step back on, but this time, the scale shows 74 kg. Disbelief creeps in, rightly so. Confused, you decide to ask them to step on the scale again. This time, the scale shows 65 kg. Can you trust this 65 kg reading?

You borrow a scale from Room B and ask the same participant to step on it – again, within a minute or two, so no weight shift should occur. This time, the scale shows 68 kg after three consecutive readings. But can you trust this 68 kg reading? You have your doubts. So you go into Room C, which has a different type of scale. The participant steps on the scale three times, and it consistently shows 68 kg.

Example 2.9 (Reliability example B) You have a patient admitted to a ward, and you and your colleague are asked to evaluate them. You both use the same version of Questionnaire X at the same time of day, but you end up with different scores and different diagnoses. Can you trust either of your scores?

A senior colleague provides you with SCID-5, a structured clinical interview for diagnosing mental disorders based on DSM-5. You use it to evaluate the patient. You both get a diagnosis of major depressive disorder.

The examples already hint at different ways of thinking about reliability, but let’s summarise what the different types of reliability might be:

  • Test-retest or temporal reliability. This relates to consistency over time: if we repeat the measurement on the same thing at a later date, do we get the same answer? In Example 2.8, the scale in Room A was not reliable, but the scales in Rooms B and C were.
  • Interrater reliability. This relates to consistency across people: if someone else repeats the measurement, will they produce the same answer? In Example 2.9, the two raters gave different answers with the same Questionnaire X, so clearly it was not reliable in this sense; however, the SCID-5 was10. If interrater reliability fails, the instrument is subjective.
  • Internal consistency reliability or homogeneity. Suppose a measurement is constructed from many different parts that perform similar functions. Inventories and scales use multiple questions to measure a single concept. If the questions are all measuring the same thing, then they should be consistent with each other. Ideally, items relating to the same concept should be highly correlated with each other, i.e., they should be consistent with each other11.
  • Parallel/alternative forms reliability. This relates to consistency across theoretically-equivalent measurements. In Example 2.8, the scales in both Room B and Room C gave the same reading. Two different “versions” of a measurement instrument were used to measure the same concept (human body weight), and they gave the same answer12.
  • Split-half reliability. This is a very specific concept widely used in survey design: divide a test into two halves, and see if the two halves are consistent with each other by calculating the reliability coefficient.

Not all measurements need to possess all forms of reliability. Nevertheless, it is important to be aware of reliability as a concept, along with its forms. CogStat will help you with analysing internal consistency and interrater reliability as of version 2.4+, but we will cover the tutorial once we understand how to compare two groups of data (Chapter 6).
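
In the meantime, here is a minimal NumPy sketch, on made-up data, of two coefficients mentioned above: a split-half estimate with the Spearman-Brown correction, and Cronbach’s alpha for internal consistency.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up data: 50 respondents, 10 items that all tap one latent trait
trait = rng.normal(0, 1, size=(50, 1))
scores = trait + rng.normal(0, 1, size=(50, 10))

# Split-half reliability: correlate odd-item and even-item half scores,
# then apply the Spearman-Brown correction to estimate full-test reliability.
half_a = scores[:, 0::2].sum(axis=1)
half_b = scores[:, 1::2].sum(axis=1)
r_half = np.corrcoef(half_a, half_b)[0, 1]
split_half = 2 * r_half / (1 + r_half)

# Cronbach's alpha: the standard internal-consistency coefficient.
k = scores.shape[1]
item_variances = scores.var(axis=0, ddof=1)
total_variance = scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(f"split-half: {split_half:.3f}, alpha: {alpha:.3f}")
```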

2.5 Validity

More than any other thing, a scientist wants their research to be “valid”. The conceptual idea behind validity is simple: can you trust the results of your study? In practice, there are different kinds of validity, each of which raises its own issues, and not all forms of validity are relevant to all studies.

While validity is a research methodology subject and not a statistical one, some forms of validity are measured using statistical methods. So let’s talk about different kinds of validity briefly, in a somewhat particular order.

Definition 2.8 (Content validity) The content validity is the extent to which an instrument measures the desired concept or construct comprehensively (with no gaps or missing aspects) and accurately (with no irrelevant or misleading domains).

Some typical questions would be:

  • Are all important domains of the concept we want to measure covered?
  • Does the selected instrument measure the same concept that we are trying to measure in all its aspects?
  • Are there any items in the instrument that are not relevant to the concept we are trying to measure?
  • What biases might be present in the instrument?
  • Is the instrument culturally sensitive? Is the instrument reliable and consistent across different subgroups of the population?

A subjective judgement of content validity is called face validity.

Definition 2.9 (Face validity) The face validity of a study is the extent to which the measurement of a variable “looks like” it’s measuring the correct theoretical construct, but it is a subjective judgement.

Types of questions that you might ask:

  • Does the instrument look like it’s measuring the right thing?
  • Will the participant recognise the construct that we are trying to measure? (This might lead to cases where the participant fakes the response.)

Definition 2.10 (Construct validity) The construct validity is the extent to which the measurement of a variable is consistent with the theoretical construct that it is supposed to measure.

Typical questions you might consider:

  • Is the construct defined in a way that is clear and unambiguous?
  • Is the construct defined too broadly or too narrowly?
  • Is the sample and method appropriate for the construct?
  • How does this instrument relate to other instruments designed to measure the same construct?
  • Does the instrument produce the same expected results as other instruments designed to measure the same construct?
  • Does the instrument predict other variables that are theoretically related to the construct?

To test for construct validity, we can test for discriminant validity13 through correlation and factor analysis.

Definition 2.11 (Discriminant validity) The discriminant validity assesses the extent to which a measure is capable of accurately distinguishing the construct variable it is intended to measure from unrelated construct variables. It is the opposite of convergent validity.

Question:

  • What is the correlation between the variable we are measuring and other variables that should not be related to it? → discriminant validity coefficient

Definition 2.12 (Convergent validity) The convergent validity is the extent to which the measurement of a variable is consistent with other variables that should be theoretically related to it.

Types of questions you might consider:

  • What is the correlation between the variable we are measuring and other variables that should be related to it? → correlation coefficient

Example 2.10 (Constructs and validity) You are researching the relationship between socio-economic status and health outcomes. The health outcome is the construct of interest; it is the dependent variable. Socio-economic status is the independent variable. You are interested in the relationship between the two, and you want to know whether socio-economic status is a good predictor of health outcomes.

There are a number of related variables that you could use to measure socio-economic status, such as: personal income, household income, education, occupation, and so on. You could use any of these variables to measure socio-economic status, but you want to know which one is the best predictor of health outcomes. You could test for the relationship between each of these variables and health outcomes, and then choose the one that has the highest correlation coefficient. This is a form of convergent validity.

There are unrelated variables that have an effect on health outcomes, such as: environment, gender, disability and impairments, minority status, family history of diseases, and so on. If the unrelated variables demonstrate a weak correlation with the socio-economic status variables, then this is a form of discriminant validity.

You’ll notice that in this particular example, gender14, disability, and minority status would likely show more than a weak correlation with socio-economic status. These are called confounding variables: variables that are related to both the independent and dependent variables and can affect the results of the study. You’ll need to control for these variables in your study design.
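
Here is a sketch of the correlation check described above, on simulated data; every variable name and effect size below is invented for illustration (shoe size stands in for a theoretically unrelated variable).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

ses = rng.normal(0, 1, n)                 # latent socio-economic status
income = ses + rng.normal(0, 0.5, n)      # theoretically related to SES
education = ses + rng.normal(0, 0.8, n)   # theoretically related to SES
shoe_size = rng.normal(0, 1, n)           # theoretically unrelated to SES
health = 0.6 * ses + rng.normal(0, 1, n)  # the outcome construct

df = pd.DataFrame({"income": income, "education": education,
                   "shoe_size": shoe_size, "health": health})

# Convergent evidence: the related measures should correlate with the outcome;
# discriminant evidence: the unrelated variable should not.
print(df.corr()["health"].round(2))
```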

Definition 2.13 (Internal validity) The internal validity is the extent to which the results of the study can be attributed to the cause-and-effect relationships between the variables studied, rather than other factors. I.e., any change in the dependent variable (outcome) is a result of the manipulation of the independent variable (predictor), and not due to other factors.

Typical questions would be:

  • Is the relationship between the independent and dependent variables causal?
  • Is this cause-and-effect relationship the only one that could explain the results?
  • Are there any confounding variables that could affect the results?
  • Was the design fit for establishing causality?

Definition 2.14 (External validity) The external validity of a study is the extent to which the results of the study can be generalised to other people, other situations, and other times.

Questions to consider:

  • To what extent can the results of the study be generalised to other people, other situations, and other times?
  • Is the sample representative of the population?
  • Is the sample too narrow? (e.g. only psychology students)
  • Is this study replicable across different settings?
  • Are there any respondent biases that could affect the results? (Hawthorne effect, demand characteristics, etc.)

Definition 2.15 (Ecological validity) The ecological validity of a study is the extent to which the entire study set-up closely approximates the real-world scenario being investigated.

Some typical questions would be:

  • Is the lab-based study set-up similar to the real-world scenario?
  • Would the study scenario occur naturally in the real world or in other non-controlled environment?
  • Are there any environmental or systemic factors that could affect the replicability of results in the real world?

Definition 2.16 (Criterion validity) The criterion validity of an instrument is the extent to which the score results are consistent with other measures of the same construct.

Typical questions would be:

  • Does the instrument produce the same expected results as other instruments designed to measure the same construct?
  • Is the correlation between the instrument and other instruments designed to measure the same construct high enough?

Definition 2.17 (Concurrent validity) The concurrent validity of an instrument is the extent to which the score results are consistent with other criterion measures (where the criterion is measured at the same time as the instrument is administered).

Question:

  • What is the correlation between the instrument and other instruments designed to measure the same construct at the same time on the same sample? → concurrent validity coefficient

Definition 2.18 (Predictive validity) The predictive validity of an instrument is the extent to which the results can be used to make inferences about future criterion outcomes (where the criterion is measured after the instrument is administered).

Definition 2.19 (Postdictive validity) The postdictive validity or retrospective validity of an instrument is the extent to which the results can be used to make inferences about past criterion outcomes (where the criterion is measured before the instrument is administered).

Modern Validity Theory Terminology

The American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME) prepared their Standards for Educational and Psychological Testing (2014). This standard uses different terminology to describe the different types of validity.

It applies validity not to the research instrument but to the interpretation of its results, and it treats validity as a continuum rather than a dichotomy (“valid” or “not valid”) (Edwards et al., 2018). So in this mindset, we no longer have types of validity but rather types of validity evidence:

  • Evidence based on test content
  • Evidence based on response processes
  • Evidence based on internal structure
  • Evidence based on relations to other variables

However, many statistics and research methodology textbooks in use today, including the ones on which this chapter draws, still use the historical terminology.

Summary

This chapter isn’t really meant to provide a comprehensive discussion of psychological research methods: it would take another volume just as long as this one to do justice to the topic. However, in real life, statistics and study design are tightly intertwined, so it is worth discussing some key topics. In this chapter, we’ve briefly discussed the following:

  • Measurement. What does it mean to operationalise a theoretical construct? What does it mean to have variables and take measurements?
  • Scales of measurement and variable types. Remember that there are two different distinctions here: there’s the difference between discrete and continuous data, and there’s the difference between the four different scale types (nominal, ordinal, interval and ratio).
  • Terminology: predictors and outcomes. What roles do variables play in an analysis? Can you remember the difference between predictors and outcomes? Dependent and independent variables? Etc.
  • Reliability. Can you trust your results? How do you know that your measurements are consistent?
  • Validity. Does your study measure what you want it to?

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for Educational and Psychological Testing. American Educational Research Association.
Boncz, I. (2015). Research methodology basics [Kutatásmódszertani alapismeretek]. Pécsi Tudományegyetem.
Brodey, B. B., First, M., Linthicum, J., Haman, K., Sasiela, J. W., & Ayer, D. (2016). Validation of the NetSCID: An automated web-based adaptive version of the SCID. Comprehensive Psychiatry, 66, 67–70. https://doi.org/10.1016/j.comppsych.2015.10.005
Cowan, N. (2015). George Miller’s magical number of immediate memory in retrospect: Observations on the faltering progression of science. Psychological Review, 122(3), 536–541. https://doi.org/10.1037/a0039035
Edwards, M. C., Slagle, A., Rubright, J. D., & Wirth, R. J. (2018). Fit for purpose and modern validity theory in clinical outcomes assessment. Quality of Life Research, 27(7), 1711–1720. https://doi.org/10.1007/s11136-017-1644-z
Garcia-Romeu, A., Barrett, F. S., Carbonaro, T. M., Johnson, M. W., & Griffiths, R. R. (2021). Optimal dosing for psilocybin pharmacotherapy: Considering weight-adjusted and fixed dosing approaches. Journal of Psychopharmacology, 35(4), 353–361. https://doi.org/10.1177/0269881121991822
Howitt, D., & Cramer, D. (2020). Understanding statistics in psychology with SPSS (Eighth edition). Pearson.
Hubley, A. M. (2014). Divergent Validity. In A. C. Michalos (Ed.), Encyclopedia of Quality of Life and Well-Being Research (pp. 1675–1676). Springer Netherlands. https://doi.org/10.1007/978-94-007-0753-5_766
Marks, D., & Yardley, L. (Eds.). (2004). Research methods for clinical and health psychology. SAGE.
Michalos, A. C. (Ed.). (2014). Encyclopedia of quality of life and well-being research. Springer.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81–97. https://doi.org/10.1037/h0043158
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103, 677–680.
Taylor, S. E., & Stanton, A. L. (2021). Health Psychology (Eleventh). McGraw-Hill.

  6. Is this the most relevant statement to accurately represent the concept you want to measure?↩︎

  7. One of the most famous ones is the digit span test (Miller, 1956), but the original article is not without its faults (Cowan, 2015).↩︎

  8. And you can’t cheat. You can code your Likert scale from 0 to 4 instead of 1 to 5, but these labels are arbitrary, just like deciding that the freezing temperature is labelled as 0.↩︎

  9. Read up on actual research on dosing if interested. At the time of writing, this is an excellent starting point: Garcia-Romeu et al. (2021)↩︎

  10. SCID-5 is a gold standard, but it is not fully void of inter-rater reliability issues, particularly among recently trained clinicians (Brodey et al., 2016); however, there are always new tools or versions to help tackle this.↩︎

  11. Of course, if some items are too highly correlated, one or two of them might be redundant and could be dropped while keeping the overall consistency high. However, we won’t cover survey design in detail in this book.↩︎

  12. Making different versions of the same test, e.g. by switching up the order of items, is one possible way to test for this kind of reliability.↩︎

  13. Sometimes referred to as divergent validity. This, however, is not universally accepted terminology (Hubley, 2014).↩︎

  14. E.g. research suggests that there are a number of reasons why women tend to live longer than men beyond socio-economic status (with which the relationship is negative), like biological protective factors and differences in risk aversion between men and women (Taylor & Stanton, 2021).↩︎