Dr. Michael Passer, Psychology 209
U. of Washington, Winter 2008
Answers to: Knowledge Check Questions
Reliability and Validity
- Weighing yourself on a scale 3 times and getting the following readings: 150 lbs., 157 lbs., 153 lbs.
This example primarily illustrates low reliability: the scale yields inconsistent readings (a 7-pound range) simply from getting on and off it three times. Measures with low reliability always have low validity as well. Although “weight” is a perfectly meaningful construct (i.e., concept), this scale could not provide a valid measure of weight because it doesn’t even yield consistent measurements in the first place.
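The 7-pound figure is just the range of the three readings (157 - 150). A minimal Python sketch of that arithmetic, which also reports the sample standard deviation as a second index of inconsistency; the function name `spread` is a hypothetical illustration, not part of the course materials.

```python
import statistics

# The three readings from the example, in pounds.
readings = [150, 157, 153]

def spread(values):
    """Return (range, sample standard deviation) of repeated readings.

    For a perfectly reliable scale weighing the same person repeatedly,
    both numbers would be close to zero.
    """
    value_range = max(values) - min(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator)
    return value_range, sd

value_range, sd = spread(readings)
print(f"range = {value_range} lbs, SD = {sd:.2f} lbs")
```

Both numbers describe consistency only; neither says anything about whether the scale is accurate, which is the separate question of validity.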
- Administering a job skills test to 100 job applicants, hiring the 50 best scorers, and then finding out that even among these 50 new employees, those who scored higher on the job skills test tend to perform better on the job.
Criterion validity focuses on how well a measure predicts future behavior or meaningfully relates to some other criterion. In this example, the job skills test has high criterion validity because it predicts future job performance: higher scores on the test were predictive of better on-the-job performance. (Ultimately, if supported by other findings, this high criterion validity will help to establish the construct validity of this job skills test as well.) Note that in terms of establishing criterion validity, it would have been ideal if the company had hired all 100 job applicants – even those who scored poorly on the skills test – and then examined how well the test predicted actual job performance. In the real world, however, companies are not likely to do this.
- Students who score in the top 10% on the ACTs (a college aptitude test) tend to score in about the same percentile on the SATs (a different college aptitude test).
Criterion validity focuses on how well a measure predicts future behavior or relates to some other criterion, such as (in this case) another measure of the same construct. In this example, "college aptitude" is the underlying construct (concept) that we are interested in, and we have two measures of it: ACT scores and SAT scores. Most directly, this example illustrates high criterion validity, because the ACT and SAT are both supposed to measure college aptitude and therefore should yield similar results. Ultimately, if supported by other findings, this high criterion validity will help to establish the construct validity of these aptitude tests as well.
- After many administrations, researchers administering a polygraph test begin to worry that the machine is actually measuring anxiety and not dishonest responses.
This illustrates a concern about potentially low construct validity, because the concern is that the instrument (the polygraph) does not appear to be measuring the desired construct (dishonesty), but is instead measuring a different theoretical construct (anxiety).
- A personality test that helps to predict the development of schizophrenia consists entirely of items such as “What is your favorite color?” and “Are red apples better than green apples?”
In this hypothetical example, the personality test has low face validity, because the items on the test seem to be unrelated to the construct (schizophrenia) being measured. What on Earth do favorite colors and “red versus green apples” have to do with schizophrenia? But even though the items might look silly or irrelevant, the more important issue (in terms of developing psychological tests) is that the test has high criterion validity: based on the information provided in the example, the test helps to predict the development of schizophrenia.
- Individuals who score high on a questionnaire measuring racism on Tuesday morning are likely to score high on the same questionnaire one week later.
This illustrates high reliability, because multiple administrations of the questionnaire are yielding similar results. Note that the questionnaire’s high reliability does not indicate anything about its validity. The questionnaire might have low or high validity – we need more information to determine this.
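Test-retest reliability of this kind is commonly summarized as the correlation between the two administrations. A minimal Python sketch of a Pearson correlation, assuming made-up scores: the five pairs below are HYPOTHETICAL and are not data from the example.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# HYPOTHETICAL questionnaire scores for the same five people (not real data).
tuesday = [12, 25, 31, 18, 40]
one_week_later = [14, 23, 33, 17, 38]

r = pearson_r(tuesday, one_week_later)
print(f"test-retest r = {r:.2f}")  # near +1 means people keep their rank order
```

A value near +1 shows only that the questionnaire ranks people consistently over time; as the answer above notes, it cannot tell us whether the questionnaire measures racism.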
- An observational coding system for marital conflict and unhappiness correctly predicts which couples will get divorced 95% of the time.
This coding system has high criterion validity, because it is highly successful in predicting future divorce. (Ultimately, if supported by other types of evidence, this high criterion validity will help to establish the general construct validity of this coding system. In other words, it will help to establish that this coding system really is measuring marital conflict and unhappiness.)
- Several years of research have consistently shown that scores on Dr. Smith’s self-report scale measuring selfishness are positively correlated with scores on other types of measures of selfishness (e.g., observational measures) and also with psychological tests that measure “egocentrism”. Scientists conclude that Dr. Smith’s scale is a good one for assessing selfishness.
Two types of validity are supported here. First and most directly, Dr. Smith’s selfishness scale has high criterion validity because it is positively correlated with other criterion measures (i.e., the other types of measures of selfishness, such as observational measures). Along with other supportive evidence, this high criterion validity eventually will help to support the construct validity of Dr. Smith's selfishness measure. Second, based on existing psychological theories, suppose we hypothesize that egocentrism and selfishness are not identical traits, but that they are constructs that should be related to one another. In this case, the fact that Dr. Smith's selfishness scale correlates with psychological tests of egocentrism directly supports the construct validity of the selfishness scale.
- People who score high on a new test of shyness (indicating they are very shy) also score high on a personality test of extraversion (indicating they are socially outgoing) that has already been well-validated.
The new shyness test has low construct validity because it does not yield results that fit with those from an already existing, well-validated measure (the personality test of extraversion) of a related construct. In other words, based on psychological theory, we would expect that people who have higher scores on the new shyness test should generally have lower scores on extraversion: they should be less socially outgoing. Thus, we would expect a negative correlation between shyness and extraversion, but this isn’t what happened: the correlation was positive. If other studies yield similarly discouraging findings, this low construct validity suggests that this new test really is not measuring shyness.
- Students in Professor Jones' Geography 215 class are assigned to read Chapters 1, 2, 3, and 4 for the first exam. All the chapters are similar in length and amount of material, and Professor Jones gives 3 lectures on the topics in each chapter. Students are told to study all chapters and lecture notes for the first exam. On the exam, however, 90% of the questions are based on the material in Chapters 3 and 4, and only 10% are based on the material in Chapters 1 and 2.
Most directly, this example illustrates that Professor Jones' exam has low content validity. The sample of questions on the exam poorly represents the domain of material that students were asked to read and that they learned about in lecture. Roughly 25% of the material covered in class and in the text focused on concepts from Chapter 1, and another 25% on Chapter 2, yet only 10% of the exam questions focused on topics from these two chapters combined. In addition, the poor content validity will likely cause many students to feel that this was not a "fair exam." If so, poor content validity would lead to poor face validity.
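The mismatch can be made explicit by comparing each chapter pair's share of the taught material with its share of the exam. A minimal Python sketch, assuming (as the example suggests) that equal lecture counts mean equal shares of material; `coverage_gap` is a hypothetical helper, and the per-chapter split within each pair is left out because the example does not give it.

```python
# Share of taught material, from the example: four chapters of similar
# length with 3 lectures each, so each pair of chapters is 6 of 12 lectures.
lectures = {"Ch1": 3, "Ch2": 3, "Ch3": 3, "Ch4": 3}
total = sum(lectures.values())
taught = {
    "Ch1+Ch2": (lectures["Ch1"] + lectures["Ch2"]) / total,
    "Ch3+Ch4": (lectures["Ch3"] + lectures["Ch4"]) / total,
}

# Share of exam questions, from the example.
exam = {"Ch1+Ch2": 0.10, "Ch3+Ch4": 0.90}

def coverage_gap(taught, exam):
    """Exam share minus taught share for each topic group.

    Negative = under-sampled on the exam; positive = over-sampled.
    An exam that samples the domain proportionally keeps every gap near zero.
    """
    return {group: exam[group] - taught[group] for group in taught}

for group, gap in coverage_gap(taught, exam).items():
    print(f"{group}: taught {taught[group]:.0%}, exam {exam[group]:.0%}, gap {gap:+.0%}")
```

This is only a rough illustration of the proportional-sampling idea behind content validity, not a formal index of it.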