This Document Contains Chapters 5 to 7 Chapter 5 Reliability THE CONCEPT OF RELIABILITY Sources of Error Variance Test construction Test administration Test scoring and interpretation Other sources of error variance RELIABILITY ESTIMATES Test-Retest Reliability Estimates Parallel-Forms and Alternate-Forms Reliability Estimates Split-Half Reliability Estimates The Spearman-Brown formula Other Methods of Estimating Internal Consistency The Kuder-Richardson formulas Coefficient alpha Average Proportional Distance Measures of Inter-Scorer Reliability USING AND INTERPRETING A COEFFICIENT OF RELIABILITY The Purpose of the Reliability Coefficient The Nature of the Test Homogeneity versus heterogeneity of items Dynamic versus static characteristics Restriction or inflation of range Speed test versus power test Criterion-referenced tests The True Score Model of Measurement and Alternatives to It Domain sampling theory and generalizability theory Item response theory (IRT) RELIABILITY AND INDIVIDUAL SCORES The Standard Error of Measurement The Standard Error of the Difference Between Two Scores Close-up: Item Response Theory (IRT) Everyday Psychometrics: The Reliability Defense and the Breathalyzer Test Meet an Assessment Professional: Meet Dr. Bryce B. Reeve Self-Assessment TERM TO LEARN Reliability The extent to which measurements are consistent or repeatable Some relevant reference citations: Green, C.E., Chen, C. E., Helms, J. E., & Henze, K. T. (2011). Recent reliability reporting practice in Psychological Assessment: Recognizing the people behind the data. Psychological Assessment, 23(3), 656-669. Kieffer, K. M., & MacDonald, G. (2011). Exploring factors that affect score reliability and variability in the Ways of Coping Questionnaire reliability coefficients: A meta-analytic reliability generalization study. Journal of Individual Differences, 32(1), 26-38. Markon, K. E., Chmielewski, M., Miller, C. J. (2011). The reliability and validity of discrete and continuous measures of psychopathology: A quantitative review. Psychological Bulletin, 137(5), 856-879. For class consideration: What does it mean when a tool of assessment is characterized as being reliable? Under what conditions might we expect an otherwise useful tool of assessment to be unreliable? CLASS DISCUSSION QUESTIONS Here is a list of questions that may be used to stimulate class discussion, as well as critical and generative thinking, with regard to some of the material presented in this chapter of the text. 1. Ask the class to draw parallels between a reliable person and a reliable test. What areas of similarity exist? What differences? Answer: In drawing parallels between a reliable person and a reliable test, students can explore similarities such as consistency and predictability. Just as a reliable person demonstrates consistency in their actions and behaviors, a reliable test produces consistent results when administered repeatedly. Both reliability in a person and a test engender trust and confidence in their respective outcomes. However, differences arise in the nature of their reliability. While a reliable person's consistency stems from internal attributes like integrity and dependability, a reliable test's consistency is rooted in its standardized procedures and psychometric properties. Additionally, a reliable person's judgment may be influenced by subjective factors, whereas a reliable test aims for objectivity and uniformity in measurement. 
Understanding these parallels and distinctions can deepen students' grasp of the concept of reliability in psychological assessment. 2. Present a hypothetical situation with hypothetical test results for the class for discussion. For example: The developers of a new test called The Willingness to Demonstrate Vulnerability Inventory (WDVI) are eager to explore the test-retest reliability of their test. They approach several large corporations with a proposal to conduct a reliability study, and it is finally accepted by one. The developers have permission to administer the test to a convenience sample of 3,000 Starbucks store franchisees in Southern California. The test is administered in early November. Scores on the WDVI could range from a low of 0 (no willingness to demonstrate vulnerability) to 200 (an inordinately high willingness to demonstrate vulnerability). The mean score for the sample is 150 with a standard deviation of 10. About two months later, in early January, the exact same test is readministered to the same 3,000 people; this time, the mean WDVI score is 90 with a standard deviation of 15. How do students account for the results? What do they suggest to the WDVI test developers as “next steps”? Answer: Students might analyze the hypothetical situation by first noting the significant decrease in mean WDVI scores from November to January, along with the increase in standard deviation, indicating variability in individual responses over time. They could discuss potential factors contributing to this change, such as seasonal influences, workplace dynamics, or external events. In suggesting "next steps" to the WDVI developers, students may recommend conducting further investigations to understand the reasons behind the observed fluctuations, such as conducting qualitative interviews to explore participants' experiences or implementing a longitudinal study to track changes in vulnerability over time. Additionally, they might advise the developers to assess the test's reliability using alternative methods, such as split-half reliability or inter-rater reliability, to corroborate the findings from the test-retest study. Overall, students could emphasize the importance of thorough analysis and validation procedures to ensure the robustness and validity of the WDVI as a measure of willingness to demonstrate vulnerability. 3. Another hypothetical situation for discussion: One student’s measured IQ on a test of intelligence is 100. Another student’s measured IQ is 110 on a different test of intelligence. What can one say about the two students and their scores on these tests? How might the standard error of the difference score be involved in determining whether there exists a difference between the two scores? What factors affect the magnitude of the standard error of the difference score? Answer: In comparing the two students' IQ scores, it's important to consider that IQ tests may vary in content, format, and psychometric properties, influencing individual scores. A difference of 10 points between the two scores suggests a potential disparity in their cognitive abilities as measured by the respective tests. The standard error of the difference score would indicate the extent to which the observed difference between their scores could be attributed to measurement error rather than true differences in intelligence. 
Factors affecting the magnitude of the standard error of the difference score include the reliability of the tests, the variability of scores within the population, and the correlation between the two tests. A larger standard error suggests greater uncertainty in the interpretation of the difference between scores, highlighting the need for caution in drawing conclusions about the students' relative intelligence levels based solely on their IQ scores. 4. Create a list of various types of tests and measurements on the chalkboard. Ask students to respond with the corresponding type(s) of reliability estimates that would be appropriate. You may also ask them to name the statistic of choice to be calculated. Here is a sample list: (a) typing test (timed) (b) color blindness (c) weight from month to month of a 6-week-old infant and a 21-year-old (d) reading from 1 week to another (e) intelligence (f) mood (g) weight of melting ice cubes (h) test of reaction time (i) multiple choice exams in this course for midterm and final during (two different exams) (j) test of test anxiety (k) essay exam in an English literature course (l) presidential preference poll (m) art aptitude test, which includes judging the quality of a clay sculpture Answer: For each type of test or measurement listed, students would need to identify the appropriate reliability estimate(s) and statistic(s) of choice. For example: (a) Typing test (timed): Test-retest reliability would be appropriate, with the correlation coefficient (e.g., Pearson's r) calculated to assess consistency in typing speed over time. (b) Color blindness: Split-half reliability could be used, with the Spearman-Brown prophecy formula applied to estimate the reliability coefficient. (c) Weight from month to month of a 6-week-old infant and a 21-year-old: Test-retest reliability would be suitable, with the intraclass correlation coefficient (ICC) used to assess the consistency of weight measurements over time. (d) Reading from 1 week to another: Test-retest reliability would be applicable, and the coefficient of stability could be calculated to determine the consistency of reading abilities over time. (e) Intelligence: Split-half reliability or internal consistency reliability (e.g., Cronbach's alpha) could be used to assess the consistency of scores across different parts of the intelligence test. (f) Mood: Test-retest reliability would be appropriate, and the correlation coefficient could be calculated to assess the stability of mood measurements over time. (g) Weight of melting ice cubes: Test-retest reliability would be relevant, and the ICC could be used to determine the consistency of weight measurements across different trials. (h) Test of reaction time: Test-retest reliability would be suitable, with the correlation coefficient used to assess the stability of reaction time measurements over time. (i) Multiple choice exams in this course for midterm and final during two different exams: Parallel-forms reliability would be appropriate, and the coefficient of equivalence could be calculated to determine the consistency of scores between the two exams. (j) Test of test anxiety: Split-half reliability or internal consistency reliability could be used, with Cronbach's alpha calculated to assess the consistency of scores across different items measuring test anxiety. 
(k) Essay exam in an English literature course: Inter-rater reliability would be relevant, and the Cohen's kappa coefficient could be calculated to assess the agreement between different graders' scoring of the essays. (l) Presidential preference poll: Test-retest reliability would be applicable, with the ICC used to assess the consistency of responses over time. (m) Art aptitude test, which includes judging the quality of a clay sculpture: Inter-rater reliability would be necessary, and Cohen's kappa coefficient could be calculated to assess the agreement between different raters' evaluations of the sculpture's quality. 5. Why is there a need for different methods of estimating reliability for norm-referenced and criterion-referenced tests? Answer: Different methods of estimating reliability are necessary for norm-referenced and criterion-referenced tests due to their distinct purposes and characteristics. Norm-referenced tests compare an individual's performance to that of a group, focusing on relative standing, while criterion-referenced tests evaluate performance against predetermined criteria or standards, focusing on absolute mastery of content or skills. Reliability estimates for norm-referenced tests typically assess consistency in rank order or relative performance, such as test-retest reliability or split-half reliability. In contrast, reliability estimates for criterion-referenced tests focus on consistency in absolute scores or judgments, such as inter-rater reliability or equivalence reliability. Utilizing appropriate reliability methods ensures the accuracy and validity of interpretations and decisions based on test scores, aligning with the specific purposes and contexts of norm-referenced and criterion-referenced assessments. 6. Why does an observed score always represent the combination of the true score plus the error score? Answer: An observed score always represents the combination of the true score and the error score because measurement involves inherent imperfections and variability. The true score reflects the individual's actual level of the attribute being measured, free from measurement error. However, no measurement process is entirely free from error, which can arise from various sources such as test construction, administration, or scoring. Thus, the observed score includes both the true score, representing the individual's genuine performance or attribute level, and the error score, representing the discrepancy between the observed score and the true score. Understanding this combination is crucial for interpreting test results accurately and recognizing the limitations and uncertainties inherent in measurement. It underscores the importance of employing reliable and valid assessment methods to minimize error and maximize the accuracy of observed scores. Additionally, considering the true score and error score separately allows researchers and practitioners to assess the reliability and validity of the measurement process and make informed decisions based on test results. 7. What factors affect the test-retest reliability of tests designed to measure the developing cognitive and motor skills of infants? Answer: Several factors influence the test-retest reliability of assessments designed to measure the developing cognitive and motor skills of infants. These include the age and developmental stage of the infant, as skills develop rapidly during infancy, leading to potential fluctuations in performance over short periods. 
The environment in which the testing occurs, such as noise levels or distractions, can also impact reliability by influencing the infant's attention and engagement. Additionally, the familiarity and rapport between the infant and the examiner may affect test-retest reliability, as comfort and trust can enhance cooperation and performance consistency. Variability in infant temperament or mood across testing sessions can introduce inconsistency in responses, impacting reliability. The choice of assessment tools and methods, such as standardized tests versus observational measures, can also influence reliability, with some measures being more sensitive to developmental changes than others. The duration of the test-retest interval is critical, as shorter intervals may capture transient fluctuations, while longer intervals risk capturing developmental changes rather than true stability of skills. Standardization of testing procedures and training of examiners contribute to reliability by minimizing variability in administration and scoring across sessions. Lastly, factors such as cultural differences or socioeconomic status may influence the reliability of infant assessments, highlighting the importance of considering contextual factors in interpretation. Overall, understanding and controlling for these factors are essential for ensuring the validity and reliability of tests designed to measure the developing cognitive and motor skills of infants. IN-CLASS DEMONSTRATIONS 1. Bring something to class (a) Bring in test manuals of tests that do and do not employ a true score model. Bring in manuals for tests that do and do not employ a true score model of measurement, and discuss the similarities and differences between the information in the manuals. (b) Bring in the manual of a test that has alternate forms. To supplement discussion of alternate-forms reliability, bring to class copies of test manuals for tests in which alternate forms have been developed (for example, the Peabody Picture Vocabulary Test, the KeyMath, or the Woodcock Diagnostic Reading Battery). Using information from the test’s technical manual, summarize how the alternate forms were developed. (c) Bring in quiz data for analysis. Bring in data from two past quizzes or examinations using the same students (could be from the measurement class or another class) and analyze the data for test-retest reliability. (d) Bring in a copy of the Standards. Bring in a copy of the most recent edition of the Standards and discuss with the class the material dealing with reliability and errors of measurement. (e) Create two forms of a test of basic arithmetic. Create two forms of a “Basic Arithmetic” test (“Form A” and “Form B”) that could reasonably be administered to your class in 5 minutes. The test should tap knowledge of basic arithmetic operations such as addition, subtraction, multiplication, and division. Administer Form A with a 5-minute time limit, under regular test-taking conditions. Collect the exam papers. Now, administer Form B, also with a 5-minute time limit. During the administration of Form B, however, create adverse testing conditions such as by playing loud, distracting music, flashing the lights on and off, and so on. Collect the papers. Now, redistribute all of the papers so that different students are marking different students’ tests, and no student is marking his/her own paper. Using the group data, calculate a rank-order (Spearman’s rho) coefficient between Form A and Form B scores; strictly speaking, this is an alternate-forms reliability estimate gathered under two different sets of administration conditions. Is it lower than expected due to the various sources of error variance introduced?
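A minimal computational sketch of the rank-order coefficient called for in the demonstration above; the scores and the use of scipy are illustrative assumptions, not part of the original exercise:

```python
# Hypothetical scores (items correct out of 20) for ten students on the two forms.
from scipy.stats import spearmanr

form_a = [18, 15, 20, 12, 16, 19, 14, 17, 11, 13]  # standard conditions
form_b = [14, 10, 17, 11,  9, 16, 12, 15,  6,  8]  # adverse conditions

rho, p_value = spearmanr(form_a, form_b)
print(f"Spearman rho between Form A and Form B: {rho:.2f} (p = {p_value:.3f})")
# If the adverse conditions mostly added random error, rho should be noticeably
# lower than what two comparable administrations would be expected to produce.
```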
(f) Create a simple speed test. Create a simple speed test in which the test-taker’s task is to correctly alphabetize a list of words. Administer the test to the class. As a class, calculate the split-half reliability of the resulting data. Then, discuss why such an approach to estimating reliability is inappropriate for use in determining the reliability of a speeded test. (A computational sketch of the split-half calculation and the Spearman-Brown correction appears at the end of this section, just before the Suggested Assignments.) 2. Bring someone to class. Invite a guest speaker to class. The guest speaker could be: (a) a faculty member Invite a faculty member (from your university or a neighbouring one) who is an expert in the area of reliability, item response theory, or classical test theory to elaborate on the material presented in this chapter; (b) a local test user Invite a local user of psychological tests from any setting who can elaborate on principles of reliability as used in everyday work; or (c) a law enforcement officer knowledgeable about the reliability of the Breathalyzer. To supplement the material presented in this chapter’s Everyday Psychometrics feature, invite a member of the highway unit of the local or state police who is involved in administering Breathalyzer (or similar) tests. This guest speaker should be prepared to speak on the specifics of how a field sobriety test is administered and statistics on the reliability of the examination. Have students prepare questions in advance dealing with the issue of the different types of reliability that are important to consider for this measure. IN-CLASS ROLE-PLAY AND DEBATE EXERCISES 1. Role-Play: The Willingness to Demonstrate Vulnerability Inventory (WDVI) Divide the class into three groups: (1) WDVI test developers, (2) Investors, and (3) Advisers to Investors. Group 1 plays the role of the developers of the test, and it is their job to make a well-informed pitch to Group 2 for investment funds to develop their test. Group 2, also well informed about many issues related to assessment, particularly reliability issues, questions Group 1 about their test. After the exchange between Groups 1 and 2 has run its course, Group 3 plays the role of adviser to Group 2, advising them whether or not the WDVI appears to be a good investment. 2. Debate: CTT versus IRT Should all psychological tests developed from the present day forward rely on classical test theory (CTT) or item response theory (IRT)? Students will research the relevant issues and come to class prepared to debate them. Half the students in the class will be assigned to the “CTT” team, and the other half will be assigned to the “IRT” team. The use of home-made tee shirt “uniforms” with the appropriate lettering is encouraged on debate day. OUT-OF-CLASS LEARNING EXPERIENCES 1. Take a field trip. Arrange a trip as a class to: (a) a corporate Human Resources department Arrange a visit to the Human Resources (HR) department of a local business or large corporation that employs psychological tests to see how they are used in practice, with a specific focus on reliability issues. (b) a local law enforcement agency Arrange a visit to a local or state police facility for a firsthand look at how field sobriety tests are conducted and the reliability of those examinations. (c) a local consumer research firm Arrange a visit to a local consumer research firm that employs statistics and/or statistical methods for a discussion of the testing that is conducted, and what measures are in place to ensure that the data obtained are reliable.
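For instructors who want to walk the class through the arithmetic behind the split-half demonstration above, here is a minimal sketch; the 0/1 item data are fabricated for illustration, and the odd-even split is only one of several defensible ways to halve a test:

```python
import numpy as np

# Hypothetical 0/1 item scores: rows are test-takers, columns are items.
scores = np.array([
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 1, 0, 1, 1, 1],
])

odd_half = scores[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
even_half = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...

r_half = np.corrcoef(odd_half, even_half)[0, 1]   # correlation between half-test scores
r_full = (2 * r_half) / (1 + r_half)              # Spearman-Brown correction to full length
print(f"Half-test r = {r_half:.2f}; Spearman-Brown corrected estimate = {r_full:.2f}")

# Discussion point for the speed-test demonstration: on a pure speed test nearly
# every attempted item is answered correctly, so odd and even half-scores are
# almost perfectly correlated by construction and the estimate is spuriously high.
```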
SUGGESTED ASSIGNMENTS 1. Critical Thinking Exercise: Error in Ability Tests Critically evaluate any existing ability test with regard to all of the possible sources of error that may be inherent in measurement. Explain why some sources of error are likely to be greater than others in magnitude for this particular test. 2. Generative Thinking Exercise: Appropriate Measures of Reliability Create a table with two headings: Appropriate and Inappropriate. Then, under these two headings, list at least one type of test for which different methods of estimating reliability would or would not be appropriate. Your listing should include, for example, one type of test for which the test-retest method of estimating reliability would be appropriate, and one type of test for which the test-retest method of estimating reliability would be inappropriate. Continue to do this for estimates of interitem consistency, interscorer reliability, and alternate-forms reliability. Explain why you have listed each type of test in the Appropriate or Inappropriate column. 3. Read-then-Discuss Exercises (a) Generalizability Theory versus Classical Test Theory Do some independent reading about generalizability theory and come to class prepared to discuss how it differs from classical test theory. (b) Item Response Theory versus Classical Test Theory Do some independent reading about item response theory and come to class prepared to discuss how it differs from true score theory. (c) Reliability Reporting in a Test Manual Students are instructed to check out and review a test manual from their college/university test library. They then report on the types of reliability estimates reported for their test of choice. 4. Other Exercises and Assignments (a) Demonstrating Measurement Error Buck (1991) provides suggestions for a demonstration of measurement error and reliability. Note that Allen (1992) objected to aspects of this demonstration. Buck (1992) responded to the criticisms. The demonstration, as well as the criticisms and the rebuttal, may all make for a lively class exercise. (b) Moore on More Assignments for Teaching Concepts of Reliability Moore (1981) provided suggestions for teaching related reliability concepts such as true score, true variance, and standard error of measurement using measurements of lines. MEDIA RESOURCES On the Web Understanding the Reliability of a Test http://www.ehd.org/science_technology_testresults.php Applied examples from fields other than psychology are presented and explained. Reliability and Validity http://www.youtube.com/watch?v=LolwQXYjuh8&feature=related Validity and Reliability (3 segments) http://www.youtube.com/watch?v=DS8Hw0Ort4w&feature=related Aspects of reliability and validity explained in this and several other related videos from Miami University. Validity and Reliability http://www.youtube.com/watch?v=56jYpFkdqW8&feature=related IRT http://edres.org/irt/ Conference on Applications of IRT for Health Outcomes Measurement http://outcomes.cancer.gov/conference/irt/ IRT Model Fit Software: http://outcomes.cancer.gov/areas/measurement/irt_model_fit.html REFERENCES Allen, M. J. (1992). Comments on “A demonstration of measurement error and reliability.” Teaching of Psychology, 19, 111. Buck, J. (1991). A demonstration of measurement error and reliability. Teaching of Psychology, 18, 46–47. Buck, J. (1992). When true scores equal zero: A reply to Allen. Teaching of Psychology, 19, 111–112. Moore, M. (1981). An empirical investigation and a classroom demonstration of reliability concepts. Teaching of Psychology, 8, 163–164.
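Before turning to Chapter 6, instructors may want a compact worked example to accompany Class Discussion Question 3 above (the two IQ scores of 100 and 110); the reliability and standard deviation values below are hypothetical, chosen only to make the arithmetic concrete:

```latex
% Assume each IQ test has a standard deviation of 15 and a reliability of .90.
\begin{align*}
\sigma_{\text{meas}} &= \sigma\sqrt{1 - r_{xx}} = 15\sqrt{1 - .90} \approx 4.7\\[4pt]
\sigma_{\text{diff}} &= \sqrt{\sigma_{\text{meas}_1}^{2} + \sigma_{\text{meas}_2}^{2}}
  = \sigma\sqrt{2 - r_{xx_1} - r_{xx_2}} = 15\sqrt{2 - .90 - .90} \approx 6.7
\end{align*}
% A 10-point difference is therefore only about 1.5 standard errors of the
% difference, so it would be risky to conclude that the two students truly
% differ on the attribute being measured.
```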
Chapter 6 Validity THE CONCEPT OF VALIDITY Face Validity Content Validity The quantification of content validity Culture and the relativity of content validity CRITERION-RELATED VALIDITY What Is a Criterion? Characteristics of a criterion Concurrent Validity Predictive Validity The validity coefficient Incremental validity Expectancy data Decision theory and test utility CONSTRUCT VALIDITY Evidence of Construct Validity Evidence of homogeneity Evidence of changes with age Evidence of pretest–posttest changes Evidence from distinct groups Convergent evidence Discriminant evidence Factor analysis VALIDITY, BIAS, AND FAIRNESS Test Bias Rating error Test Fairness Close-up: Base Rates and Predictive Validity Everyday Psychometrics: Adjustment of Test Scores by Group Membership: Fairness in Testing or Foul Play? Meet an Assessment Professional: Meet Dr. Adam Shoemaker Self-Assessment TERM TO LEARN Validity In general, a term referring to a judgment regarding how well a test or other measurement tool measures what it purports to measure. Some relevant reference citations: Bornstein, R. F. (2011). Toward a process-focused model of test score validity: Improving psychological assessment in science and practice. Psychological Assessment, 23(2), 532-544. Cizek, G. J. (2012). Defining and distinguishing validity: Interpretations of score meaning and justifications of test use. Psychological Methods, 17(1), 31-43. Fossati, A., Borroni, S., Marchione, D., & Maffei, C. (2011). The Big Five Inventory (BFI): Reliability and validity of its Italian translation in three independent nonclinical samples. European Journal of Psychological Assessment, 27(1), 50-58. For class consideration: What is actually meant by a statement that a test is valid? How absolute or relative is test validity? Can a test be valid at one time and place with one population of test-takers, and invalid at another time, in another place, or with another group of test-takers? CLASS DISCUSSION QUESTIONS Here is a list of questions that may be used to stimulate class discussion, as well as critical and generative thinking, with regard to some of the material presented in this chapter of the text. 1. If your use of our WDVI (Willingness to Demonstrate Vulnerability Inventory) example in the Chapter 5 class discussion was a "hit," you may wish to pursue it in this lecture as well. (Alternatively, if the WDVI example went over like a lead balloon, skip down to the additional class discussion questions that follow this one.) To stimulate thought as to what an adequate criterion is for demonstrating criterion-related validity, ask students to think of criteria that might be used to validate a test such as the WDVI. Challenge the students to think about the types of convergent and divergent evidence that might be used to establish the construct validity of the test. Answer: In exploring criterion-related validity for the WDVI, students could consider various criteria that align with the construct of willingness to demonstrate vulnerability, such as self-disclosure in interpersonal relationships or seeking help during times of need.
Convergent evidence might involve correlating WDVI scores with related constructs, such as empathy or emotional intelligence, to demonstrate consistency in measurement. Divergent evidence could involve assessing the WDVI's discriminant validity by demonstrating weak correlations with constructs unrelated to vulnerability, such as physical strength or mathematical ability. Additionally, students might propose using behavioral observations or self-report measures of vulnerability in real-life situations as criteria for validation. Overall, integrating multiple sources of evidence would strengthen the case for the WDVI's construct validity, providing a comprehensive understanding of its measurement properties and relevance in assessing willingness to demonstrate vulnerability. 2. Open for class discussion the prospect of class members themselves creating a test blueprint for a quiz on the content of Chapter 6 in the text. The discussion should lead into the subject of how blueprinting relates to content validity. Answer: Opening the floor for class members to collaboratively create a test blueprint for a quiz on Chapter 6 content fosters engagement and active participation. Through this exercise, students can collectively outline the key topics, concepts, and skills they deem essential for assessment. By discussing the distribution of questions across different content areas or cognitive levels, students gain insight into the process of test blueprinting and its role in ensuring content validity. Blueprinting involves systematically aligning assessment items with the learning objectives and content coverage of the curriculum, thereby enhancing the relevance and representativeness of the assessment. This discussion enables students to understand how blueprinting helps ensure that assessments adequately sample the content domain, providing a valid measure of students' mastery of the material. By actively participating in blueprinting, students develop a deeper appreciation for the importance of aligning assessments with instructional goals and content standards. Overall, this collaborative exercise empowers students to take ownership of their learning assessment process while reinforcing the principles of content validity in educational assessment. 3. To stimulate discussion on the subjects of test bias and test fairness, present the following scenario: A college uses a particular admissions test, which has well documented predictive validity. However, members of a particular minority group tend to score low on this admission test. Some students who have been denied admission based on their test scores are criticizing the school for using a biased test. What steps need to be taken prior to making the conclusion the test is "biased" in the psychometric sense? How can a determination be made regarding whether or not the test is being used in a fair and equitable manner? Answer: Before concluding that the admissions test is biased in the psychometric sense, several steps should be taken. Firstly, thorough analysis of the test's content and administration procedures should be conducted to identify any potential sources of bias. This includes reviewing test items for cultural or linguistic biases and examining the testing environment for factors that may disadvantage certain groups. Additionally, the differential impact of the test on different demographic groups should be investigated through statistical analyses to assess for subgroup differences in performance. 
Furthermore, qualitative research methods such as focus groups or interviews can provide insights into test-takers' perceptions and experiences, helping to uncover any systemic inequities in test administration or interpretation. To determine whether the test is being used in a fair and equitable manner, the school should assess its admissions policies and practices holistically. This involves examining how test scores are weighed alongside other factors such as GPA, extracurricular activities, and personal essays in the admissions decision-making process. Additionally, the school should consider whether alternative measures of academic potential or achievement are available and accessible to all applicants. Regular monitoring and evaluation of admissions data for patterns of bias or inequity can inform ongoing revisions to admissions criteria and procedures to promote fairness and diversity. Finally, engaging stakeholders, including students, faculty, and community members, in discussions about fairness and equity in admissions can help ensure transparency and accountability in the process. Overall, a comprehensive approach that combines psychometric analysis, policy review, and stakeholder engagement is essential for addressing concerns about test bias and promoting fairness in admissions practices. 4. Building on this chapter’s Everyday Psychometrics, solicit student opinions regarding what constitutes a fair use of employment tests. Are measurement tools neither the cause of, nor the cure for, racial inequalities in employment settings? Solicit student opinions about the use of procedures to adjust test scores on the basis of group membership. Is this legal? Is it ethical? Do students agree with Section 106 of the Civil Rights Act of 1991? Why? Answer: In soliciting student opinions on fair use of employment tests, it's crucial to explore whether such tools contribute to or alleviate racial inequalities in employment settings. Students may debate the legality and ethics of adjusting test scores based on group membership, considering whether such practices violate anti-discrimination laws or perpetuate systemic biases. Opinions may vary on the fairness of Section 106 of the Civil Rights Act of 1991, which allows for adjusting test scores to address adverse impact, balancing concerns about equal opportunity with merit-based selection. Ultimately, students may grapple with the tension between promoting diversity and inclusion while ensuring meritocracy in employment practices, recognizing the complexities of addressing societal inequalities through psychometric tools and legal frameworks. IN-CLASS DEMONSTRATIONS 1. Bring something to class (a) Bring a bottle of ink and some paper to class. To stimulate a discussion of face validity, bring a bottle of ink to class for the purpose of creating a “homemade” inkblot. Then provide a quick demonstration of how inkblots may be used to study personality. Follow up with a discussion of face validity. Although a test developer cannot use this type of validity as evidence that the test is valid, why is it important? How might it influence test-takers? Why is it that face validity has not traditionally been considered a “genuine” aspect of validity? (b) Bring a copy of the Standards to class. Review and discuss what this authoritative manual has to say about the similarities and differences between the concepts of test bias and test fairness. The manual may also be used to expand upon other validity-related material presented in this chapter. 
(c) Bring an intelligence test manual to class. Bring to class a copy of the test manual for a widely used test of intelligence. Select a few sample items to share with the class, who will act as expert judges, expressing an opinion about whether the content of each item would or would not be easier for respondents who are members of particular cultural groups. (d) Bring a personality test or test manual to class. So, for example, present Zick Rubin’s Love Scale (Rubin, 1970) to the class to stimulate validity-related discussion regarding a construct well known to college students. Administer this self-report questionnaire, and have all students score it. After the students have scored their questionnaires, they may begin to question what the scale is actually measuring; is it, indeed, a valid measure of “love”? Students are then divided into small groups to decide how they might establish the validity of the questionnaire. The focus of discussion in these groups is the criteria that may be used to validate the instrument, and the different types of evidence that can be gathered to support the construct validity of a measure. All of the small groups then come together for a sharing of findings and thoughts. The instructor shares the evidence for construct validity provided by the test developers in the manual, and comparisons are made with what the class developed. The class comes to a conclusion about whether or not the Love Scale really does measure romantic love. (A brief simulated-data illustration of convergent and discriminant correlations that can help structure this discussion appears at the end of this section, just before the Out-of-Class Learning Experiences.) 2. Bring someone to class. Invite a guest speaker to class. The guest speaker could be: (a) a faculty member Invite a faculty member (from your university or a neighbouring one) in the mathematics department to elaborate on the concept of test validity from a strictly statistical point of view. (b) a corporate representative who uses psychological tests. Invite a local user of psychological tests from a corporate setting who can elaborate on the issues of test bias and test fairness based on “real-life” experience with personnel tests. (c) a clinician Invite a psychologist who conducts child abuse evaluations. Ask the psychologist to discuss the types of tests and other procedures used as part of such assessments. Based on the material presented in the chapter, the class will pose questions regarding the validity of the methodology employed. IN-CLASS ROLE-PLAY AND DEBATE EXERCISES 1. Role-play: Test Publisher and Test Purchaser A student or team of students plays the role of a test publisher who has published tests in a variety of areas. The rest of the class role-plays a school Budget Committee that may purchase tests from the test publisher. The test publisher must present one and only one type of validity evidence for each of the five types of tests presented in the list that follows. Which type of evidence will the test publisher choose? What questions does the Budget Committee have for the publisher? Which tests will the Budget Committee select and reject on the basis of the test publisher’s presentation? Five types of tests published by the test publisher —mathematics test —intelligence test —vocational interest inventory —music aptitude test —attitude toward school inventory 2. Debate: Quotas in Employment Settings and in Academic Admissions Divide the class into three groups: one to take the “Pro quota” position, one to take the “Con quota” position, and one to act as an impartial judging panel. As preparation for this debate, all students will read articles on this subject published both in academic journals and the popular press. Some sources students may wish to consult: Hunter & Schmidt (1976) and pages 397 through 412 of Jensen (1980). Also as part of their preparation, students may contact local academic institutions and businesses to explore what, if any, consideration has been given to the use of quota systems.
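To make the convergent and discriminant evidence discussed in the Love Scale exercise (demonstration 1(d) above) concrete, here is a minimal sketch using simulated data; the variable names and correlational pattern are assumptions for illustration, not findings about Rubin's scales:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Simulate an underlying construct and three observed measures.
construct = rng.normal(size=n)
love_scale = construct + rng.normal(scale=0.6, size=n)    # focal measure
liking_scale = construct + rng.normal(scale=0.8, size=n)  # related construct (convergent evidence)
math_ability = rng.normal(size=n)                         # unrelated construct (discriminant evidence)

r_convergent = np.corrcoef(love_scale, liking_scale)[0, 1]
r_discriminant = np.corrcoef(love_scale, math_ability)[0, 1]

print(f"Convergent correlation (love scale with liking scale): {r_convergent:.2f}")
print(f"Discriminant correlation (love scale with math ability): {r_discriminant:.2f}")
# Construct validity is supported when the convergent correlation is substantial
# and the discriminant correlation is near zero.
```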
OUT-OF-CLASS LEARNING EXPERIENCES 1. Take a field trip. Arrange a trip as a class to the human resources offices of a large corporation for the purpose of discussing with the personnel officer the use of test validation strategies (including any efforts at local validation of nationally used instruments) as well as related issues (such as issues of test bias and test fairness). SUGGESTED ASSIGNMENTS 1. Critical Thinking Exercises (a) Evaluation of Test Validation Strategies Using test manuals (if available from the university or departmental library), have students critically evaluate the test validation strategies used by developers of (1) a test of intelligence, (2) a personality test, (3) a neuropsychological test, or (4) any other test. (b) Evaluation of Test Fairness An organization called “Fair Test” has claimed that the SAT is unfair and invalid. Have students read this organization’s posting on the Net at: http://www.fairtest.org/facts/satvalidity.html. What do students think? Have them write an essay that critically evaluates the arguments presented. Allow students the latitude to research the issues and draw on other Web postings. 2. Generative Thinking Exercise: Everyday Predictive Validity Students are directed to select one of the professionals listed below and discuss why test validity-related considerations--particularly issues related to predictive validity--might be very important in their daily work. —personnel manager for a large corporation —high school principal —college admissions officer —prison warden —guidance counselor 3. Read-then-Discuss Exercises (a) An Article Purporting to Validate an Instrument Assign students the task of identifying a journal article in a recent issue of Psychological Assessment that deals with the validation of an instrument. Students will first read the article and summarize the test developer’s validation strategies. Then, students write their own critical review of the test developer’s approach to validation. (b) Test Blueprinting Assign students the task of learning more about test blueprinting by reading articles such as those written by McLaughlin et al. (2005) and Howell (2005). Students should then come to class prepared to discuss what they have learned. 4. Research-then-Report Exercises (a) Test Blueprint for a Particular Test Assign students the task of locating and reading original source material that deals with the test blueprint for a particular test. They then write a report briefly summarizing the test author’s thinking that led to a particular test’s blueprint. The report should conclude with a brief, critical evaluation of that test’s blueprint. Any appropriate source material may be used. Two possible choices are VanTassel-Baska et al.’s (2007) article about a new classroom observation tool and Needelman et al.’s (2006) discussion of how the test blueprint for the WISC-IV is different from that of its predecessor, the WISC-III. Note: Choosing the latter option will require examination and comparison of the WISC-IV versus the WISC-III manuals.
(b) Decision Theory in Employment-Related Decision-Making Using articles published in academic journals as well as articles in the popular press, assign students the task of writing a report entitled “The value of decision theory in employment-related decision making.” 5. Other Exercises and Assignments: An Exercise in Test Creation Students, individually or as a team, first conceive of a new test that needs to be created. They then (a) describe, in general, how they would demonstrate the validity of this new test; (b) more specifically, how would they demonstrate the content, construct, and criterion-related validity of their test? (c) what about issues related to the face validity of this new test? (d) what issues regarding test bias and test fairness might attend the use of this new test? This exercise could take the form of a written report or an oral presentation by groups to the rest of the class. MEDIA RESOURCES On the Web Reliability and Validity http://www.youtube.com/watch?v=LolwQXYjuh8&feature=related Reliability and Validity (3 segments) http://www.youtube.com/watch?v=DS8Hw0Ort4w&feature=related http://www.youtube.com/watch?v=56jYpFkdqW8&feature=related Aspects of reliability and validity explained in this and several other related videos from Miami University. www.rasch.org/rmt/rmt111j.htm The posting on this site raises the question: “Is content validity valid?” REFERENCES Howell, S. L. (1995). The effects of using test blueprints as a test preparation tool. Dissertation Abstracts International Section A: Humanities and Social Sciences, 55(12-A), 3822. Hunter, J. E., & Schmidt, F. L. (1976). A critical analysis of the statistical and ethical implications of various definitions of "test bias." Psychological Bulletin, 83, 1053–1071. Jensen, A. R. (1980). Bias in mental testing. New York: The Free Press. McLaughlin, K., Coderre, S., Woloschuk, W., & Mandin, H. (2005). Does blueprint publication affect students’ perception of validity of the evaluation process? Advances in Health Sciences Education, 10(1), 15–22. Needelman, H., Schnoes, C. J., & Ellis, C. R. (2006). The new WISC-IV. Journal of Developmental & Behavioral Pediatrics, 27(2), 127–128. Rubin, Z. (1970). Measurement of romantic love. Journal of Personality and Social Psychology, 16, 265–273. VanTassel-Baska, J., Quek, C., & Feng, A. X. (2007). The development and use of a structured teacher observation scale to assess differentiated best practice. Roeper Review, 29 (Winter, 2), 84–92. Chapter 7 Utility WHAT IS UTILITY? Factors That Affect a Test’s Utility Psychometric soundness Costs Benefits UTILITY ANALYSIS What Is a Utility Analysis? How Is a Utility Analysis Conducted? Expectancy data The Brogden-Cronbach-Gleser formula Some Practical Considerations The pool of job applicants The complexity of the job The cut score in use METHODS FOR SETTING CUT SCORES The Angoff Method The Known Groups Method IRT-Based Methods Other Methods Close-up: Utility Analysis: An Illustration Everyday Psychometrics: Rethinking the “Costs” of Testing—and of Not Testing Meet an Assessment Professional: Meet Dr. Erik Viirre Self-Assessment TERM TO LEARN Utility In the context of psychological testing and assessment, a reference to how useful a test or other tool of assessment is for a particular purpose. Some relevant reference citations: Ehreke, L., Luck, T., Luppa, M., et al. (2011). Clock drawing test: Screening utility for mild cognitive impairment according to different scoring systems.
International Psychogeriatrics, 23(10), 1592-1601. O'Meara, A., Davies, J., & Hammond, S. (2011). The psychometric properties and utility of the Short Sadistic Impulse Scale (SSIS). Psychological Assessment, 23(2), 523-531. Reel, K. H., Lecavalier, L., Butter, E., & Mulick, J. A. (2012). Diagnostic utility of the Pervasive Developmental Disorder Behavior Inventory. Research in Autism Spectrum Disorders, 6(1), 458-465. For class consideration: Is the concept of utility as it relates to tools of assessment absolute or relative? Can the same test be considered of great utility in one situation, but of little or no utility in another situation? CLASS DISCUSSION QUESTIONS Here is a list of questions that may be used to stimulate class discussion, as well as critical and generative thinking, with regard to some of the material presented in this chapter of the text. 1. Introduce the topic of test utility to the class by stimulating a discussion of the costs (monetary and otherwise) and benefits (monetary and otherwise) of various things familiar to students. What are the costs and benefits (in the broadest sense), for example, of: a. owning a credit card? b. obtaining an academic degree? c. military service? d. owning an automobile? e. investing in a business? Answer: Stimulating a discussion on test utility through familiar examples encourages students to consider costs and benefits in various contexts. For instance, owning a credit card incurs monetary costs like interest fees but offers benefits such as convenience and building credit history. Obtaining an academic degree involves costs like tuition and time but offers potential benefits including increased earning potential and career opportunities. Military service entails risks to personal safety and potential emotional tolls but offers benefits such as education benefits, job training, and a sense of duty and pride. Owning an automobile involves expenses like maintenance and insurance but provides mobility and convenience for work and leisure activities. Investing in a business carries financial risks but can yield profits and opportunities for entrepreneurship and wealth accumulation. Overall, these examples illustrate the trade-offs inherent in decision-making and highlight the importance of weighing costs and benefits when assessing the utility of tests in various contexts. 2. Just as a “valid test” is actually valid for a particular purpose, so a useful test has utility for a specific purpose. To amplify this point, present to the class a scenario wherein a particular test of cognitive ability has been found to have great utility in selecting members of a high school debate team. Now, pose the question of how much utility this same test might have in various other selection situations. For each situation, students should respond with an opinion in the form of “Yes” or “No” or “Maybe.” Students then explain the reasoning on which that opinion was based. Will this test of cognitive ability be useful or not in selecting applicants for: a. law school? b. art school? c. a police hostage negotiation unit? d. a middle school gifted program? e. executive level positions in a labor union? f. actors in a theme park who spend their work day dressed in a character costume? Answer: The point of this exercise is that utility, like validity, is specific to a purpose, a population, and a context. Students might answer “Yes” or “Maybe” for settings in which success depends heavily on the kinds of verbal reasoning and rapid analysis that a debate-oriented cognitive test samples (law school, a middle school gifted program), and “No” or “Maybe” for settings in which other attributes carry most of the weight: artistic aptitude for art school, interpersonal skill and composure under stress for a hostage negotiation unit, leadership and negotiation experience for labor union executives, and stamina and temperament for costumed theme park performers. Whatever positions students take, their reasoning should return to the same questions: what the criterion of success actually is in each setting, whether the test can reasonably be expected to predict that criterion, and whether the benefits of using the test outweigh its costs in that particular context.
3. As noted in the chapter, it is usually not possible to have it “all ways” with a particular test. The coexistence of lowest selection costs, highest hit rate, and lowest miss rate is usually not in the cards. In many instances, test users must make a judgment, for example, about the respective desirability of different types of “misses.” Is a possible false negative preferable to a false positive? Is a possible false positive preferable to a false negative? In each of the scenarios described here, students should assume that the test mentioned has been shown to be valid for predicting success on the criterion of interest. The students’ task is to express an opinion about the relative desirability of a false negative and a false positive, and then explain the reasoning that led to that opinion. So, for each of the following situations, students respond to the question, Is a false negative preferable to a false positive? The response should also address the question of whether a false positive is preferable to a false negative. (a) Using a test of basic math skills to select grocery store cashiers. (b) Using an integrity test to select bank teller trainees. (c) Using the SAT or ACT to select college students. (d) Using letters of recommendation to select students for a scholarship. (e) Using the MMPI to determine whether an individual needs psychological treatment. (f) Using a test of physical ability to select firefighters. (g) Using a drug test in the selection of airline pilots. Answer: In considering the relative desirability of false negatives and false positives, students may express varying opinions based on the specific context and consequences of each scenario. For example, in selecting grocery store cashiers using a math skills test, a false positive (a candidate who lacks basic math skills being hired) could lead to errors in transactions and customer dissatisfaction, potentially impacting the store's reputation and revenue. In this case, students may argue that false negatives (rejecting otherwise qualified candidates) are preferable, as they minimize the risk of detrimental outcomes. Conversely, when selecting bank teller trainees using an integrity test, a false positive (labeling a candidate as dishonest when they are not) could result in unjustly denying employment opportunities. Students might argue that false negatives (hiring dishonest individuals who slip past the screen) are more concerning, as they pose a risk to the bank's security and integrity.
In the context of college admissions using standardized tests like the SAT or ACT, false positives (admitting students who may struggle academically) may result in students facing academic challenges and potentially dropping out. Conversely, false negatives (rejecting capable students) may deprive deserving individuals of educational opportunities. Similarly, in selecting scholarship recipients based on letters of recommendation, false positives (awarding scholarships to undeserving candidates) may undermine the scholarship's purpose and fairness. However, false negatives (overlooking qualified candidates) may deprive deserving students of financial support. When using the MMPI for psychological treatment decisions, false negatives (failing to identify individuals in need of treatment) may lead to untreated mental health issues worsening over time. In contrast, false positives (identifying individuals as needing treatment when they do not) may result in unnecessary interventions and stigma. For selecting firefighters based on physical ability tests, false negatives (rejecting physically capable candidates) may compromise the fire department's operational effectiveness during emergencies. Conversely, false positives (hiring candidates who are not physically fit) could endanger both firefighters and the public. Finally, in using drug tests for airline pilot selection, false negatives (failing to detect substance abuse in pilots) pose significant safety risks to passengers and crew. Conversely, false positives (incorrectly identifying pilots as drug users) could harm pilots' reputations and careers unfairly. In summary, students' opinions on the relative desirability of false negatives and false positives may vary depending on the potential consequences and priorities inherent in each scenario. (A simple numerical illustration of base rates, hit rates, and the two kinds of misses appears just before the In-Class Role-Play and Debate Exercises below.) IN-CLASS DEMONSTRATIONS 1. Bring something to class Bring (or assign students to bring) to class newspaper articles, scholarly articles, trade journal articles, news clips, or Web references that discuss the utility of using a particular test or intervention. For example, on February 10, 2008, http://www.msnbc.com featured a story about a new test designed to detect 10 genes in men who are at highest risk for developing prostate cancer. The article included a statement by a corporate chief executive that "this is a test with significant clinical utility for improving and personalizing the screening and treatment of one of the most common cancers." Have students discuss aspects of this test’s utility. 2. Bring someone to class. Invite a guest speaker to class. The guest speaker could be: (a) a psychologist who uses tests in practice Invite a clinical psychologist, counseling psychologist, or school psychologist who can discuss how tests can be useful and cost-effective means to facilitate diagnosis of a broad spectrum of psychological and educational problems. An added bonus, of course, would be a speaker who can discuss tests using the language of utility analysis. (b) a corporate representative who uses test data in practice Invite a local consumer of psychological test data from a corporate organization who can discuss test utility in terms of organizational decision-making. (c) an academician who is an expert in the area of test utility Invite a local expert on test utility from academia who can provide the class with illustrations of how considerations of test utility have had “real-life” consequences.
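As a bridge between Question 3 above and the consultant role-play that follows, here is a minimal numerical sketch of the base rate, selection ratio, and hit/miss framework; all counts are hypothetical:

```python
# Cross a selection decision (select / reject) with later success or failure.
true_positives = 30    # selected and later successful (hits)
false_positives = 10   # selected but later unsuccessful (one kind of miss)
false_negatives = 15   # rejected but would have succeeded (the other kind of miss)
true_negatives = 45    # rejected and would have failed (hits)

total = true_positives + false_positives + false_negatives + true_negatives

base_rate = (true_positives + false_negatives) / total        # proportion who would succeed
selection_ratio = (true_positives + false_positives) / total  # proportion selected
hit_rate = (true_positives + true_negatives) / total          # proportion of correct decisions

print(f"Base rate: {base_rate:.2f}   Selection ratio: {selection_ratio:.2f}   Hit rate: {hit_rate:.2f}")
print(f"False positives among those selected: {false_positives / (true_positives + false_positives):.2f}")
print(f"False negatives among those rejected: {false_negatives / (false_negatives + true_negatives):.2f}")
# Moving the cut score trades one kind of miss for the other, which is why the
# relative costs of false positives and false negatives matter in each scenario.
```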
IN-CLASS ROLE-PLAY AND DEBATE EXERCISES 1. Role-play exercises (a) Students Play the Role of Subject Matter Experts to Set Cut Scores. Instructors sometimes need to make decisions regarding whether a student should receive credit for a given course when they transfer from another university. Such decisions are usually made based on judgments regarding how similar one course is to the other. With this as background: (1) Stimulate a class dialogue regarding what basic facts about utility a student transferring into your tests and measurement class would have to know in order to be given credit for knowing the utility chapter. (2) Have students create a 10-item test to measure knowledge of the chapter. (3) Have students set a cut score to indicate what score on the test shall be deemed satisfactory. In this latter task, students are divided into separate groups and advised to employ the Angoff method to develop a cut score. (A brief computational illustration of the Angoff procedure appears at the end of the Suggested Assignments for this chapter.) (4) Compare the various cut scores that the groups of students have independently set. How similar are they? What is the lowest cut score? What is the highest cut score? The members of the various groups can then debate the issue until they reach some sort of consensus. (b) Students Play the Role of Consultants Advocating for the Use of a Test. Have students or teams of students identify a tool of assessment that could be used to (1) select individuals for jobs, (2) select individuals for educational programs, (3) diagnose individuals with a particular disorder, or (4) treat individuals with a particular disorder. The Buros Institute of Mental Measurements’ Mental Measurements Yearbook or Tests in Print are good sources to find tests. Next, role-playing a consultant who will make a presentation to sell assessment services to a corporate client, students (1) establish a specific context for which the testing, assessment, or intervention would be done (e.g., selecting computer programmers for a software development company); (2) research and estimate a realistic base rate and selection ratio for their particular context; and (3) prepare a presentation that includes recommendations for the cost-effective use of this test for selection purposes. Recommendations should include whether the test should be used alone or in conjunction with other tests. Recommendations should also include whether a compensatory, multiple hurdle, or some kind of hybrid strategy should be employed. Finally, students should provide estimates of utility, including projected false positive rates, false negative rates, and return on investment. (c) Students Play the Role of Parties in a Selection-Related Lawsuit Identify a test that could be used to select individuals for a position, whether a job or an educational position. The Plaintiff claims that evidence regarding the validity of the test scores for making the selection decision is scanty and that the reported validity coefficient is low (0.25). The Defendant claims that even scores that demonstrate low validity coefficients can have utility and that procedures such as validity generalization are applicable to the current situation. One-third of the class will role-play the Plaintiff in the dispute, one-third of the class will role-play the Defendant in the dispute, and the remaining one-third of the class will play the role of Judge and Jury. Give the students the details (e.g., test name, type of position for which the selection is being made) in advance so that all students can thoroughly research the issues and the concepts involved with regard to the dispute.
A trial will be held in class, focusing on the concepts involved. Students in the Plaintiff or Defendant groups may elect to present evidence (e.g., research results obtained from reputable sources, live expert witnesses). Judge and Jury members will be responsible for summarizing research results and doing any necessary research after the case is presented, so that they can deliberate and render a judgment to be declared at the beginning of the next class.

2. Debate: Is Top-Down Selection Really "Tops"? A number of references are cited in the text to support the argument that top-down selection policies can carry with them consequences of adverse impact (Cascio et al., 1995; De Corte & Lievens, 2005; McKinney & Collins, 1991; Zedeck et al., 1996). On the other side of the coin, top-down selection policies may also carry real benefits for the organization that uses them. The task here is to prepare for a debate on the pros and cons of top-down selection and to critically examine the adverse impact argument. The class will be divided into two groups: Group 1 is "Pro Top-Down," and Group 2 is "Con Top-Down." Specific examples of situations wherein a top-down selection policy would or would not be desirable should be cited.
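Two quantitative ideas from the role-play and debate exercises above lend themselves to a quick numerical illustration: how individual Angoff judgments are pooled into a cut score (exercise 1a), and how a validity coefficient, selection ratio, and testing cost can be turned into a dollar estimate of utility (exercises 1b and 1c, and the debate). The sketch below is a minimal illustration only: the judges' ratings, salary standard deviation, tenure, and cost figures are all invented, and the utility estimate follows the general Brogden-Cronbach-Gleser formulation rather than reproducing any specific worked example from the text.

```python
# Minimal sketch (hypothetical numbers throughout) of an Angoff-style cut score
# and a Brogden-Cronbach-Gleser (BCG) utility estimate.
from statistics import NormalDist


def angoff_cut_score(ratings_by_judge):
    """Each judge rates, for every item, the probability that a minimally
    competent test taker would answer it correctly. Item ratings are averaged
    across judges and summed to yield a recommended cut score in raw points."""
    n_items = len(ratings_by_judge[0])
    item_means = [
        sum(judge[i] for judge in ratings_by_judge) / len(ratings_by_judge)
        for i in range(n_items)
    ]
    return sum(item_means)


def bcg_utility(n_selected, tenure_years, sd_y, validity, selection_ratio,
                n_applicants, cost_per_applicant):
    """BCG estimate of the dollar gain from using the test for selection.
    Assuming top-down selection from a normal distribution of predictor scores,
    the mean standard score of those selected equals the normal ordinate at the
    cut point divided by the selection ratio."""
    z_cut = NormalDist().inv_cdf(1 - selection_ratio)
    mean_z_selected = NormalDist().pdf(z_cut) / selection_ratio
    gain = n_selected * tenure_years * sd_y * validity * mean_z_selected
    testing_cost = n_applicants * cost_per_applicant
    return gain - testing_cost


# --- Angoff demo: 3 judges rate a hypothetical 10-item chapter quiz ----------
judges = [
    [.9, .8, .7, .6, .9, .5, .8, .7, .6, .9],
    [.8, .7, .6, .7, .8, .6, .9, .6, .5, .8],
    [.9, .9, .8, .5, .7, .6, .7, .8, .6, .7],
]
print(f"Recommended cut score: {angoff_cut_score(judges):.1f} of 10 items")

# --- BCG demo: even a modest validity coefficient (.25) can yield utility ----
utility = bcg_utility(n_selected=10, tenure_years=2, sd_y=12_000,
                      validity=.25, selection_ratio=.20,
                      n_applicants=50, cost_per_applicant=100)
print(f"Estimated utility gain: ${utility:,.0f}")
```

Note that even with a validity coefficient of only .25, the estimate here comes out positive; varying the selection ratio, SDy, and cost figures shows how quickly that conclusion can change, which is useful fodder for both the lawsuit role-play and the top-down selection debate.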
OUT-OF-CLASS LEARNING EXPERIENCES

1. Take a field trip. Arrange a trip as a class to:
(a) a corporate H.R. department. Visit the human resources department of a local or large organization that uses psychological tests and other assessment instruments for employee selection, placement, or promotion decisions. Arrange with a representative to elaborate on how tests and assessment methods add to the organization's viability by helping to make cost-effective decisions.
(b) a university office. Visit the office at a local university or other educational institution that uses testing to help make selection or advancement-related decisions regarding students. Arrange with a representative from the institution to elaborate on how tests and assessment methods help the institution make better decisions.
(c) the offices of local consultants. Visit a local consulting firm that conducts organizational training, team-building, or other organizational development exercises. Arrange with a representative to speak on topics such as (1) how these exercises, programs, and related interventions benefit the organization, both in dollars-and-cents terms and in other ways, and (2) how the firm assesses or estimates the utility or return on investment of these interventions.

SUGGESTED ASSIGNMENTS

1. Critical Thinking Exercise: Psychological Testing Compared to Medical Testing. A review article by Daw (2001), available on the American Psychological Association's Web site at http://www.apa.org/monitor/julaug01/psychassess.html, summarizes a study conducted by Meyer et al. (2001) that appeared in American Psychologist. The article made the case that psychological tests are about as accurate as medical tests. Students are assigned the task of reading the article and critically evaluating its conclusions, citing, of course, utility-related issues.

2. Generative Thinking Exercise: When Testing Isn't Worth It. The Close-up in this chapter provided an example of a case in which the cost of testing applicants is substantially outweighed by the benefits of the testing. For class discussion, have students generate a list of hypothetical situations in which use of test scores would have negative utility; that is, situations in which the cost of the testing would far outweigh the benefits of using the test scores to make decisions.

3. Read-then-Discuss Exercises
(a) Current Events with an Eye toward Utility. Have each student review the daily newspaper, watch the news, or review news-related Web sites, looking for news articles that refer to cut scores or utility analyses. Students will bring in the articles or write a short summary of the news stories and discuss them during the next class session. Ideally, student presentations will include mention of whether cut-score, multiple-hurdle, or top-down selection approaches were presented. Students should also opine on whether or not the most appropriate methods were employed. If not, how might the estimated utility of the measure change with a change in method?
(b) Standardized Tests for Employment Selection. Assign an article by Phelps (1999) located at the following website: http://www.siop.org/tip/backissues/Tipapr99/4Phelps.aspx. After everyone has read it, conduct a class discussion on the pros and cons of using a standardized test for employment selection decisions.

4. Research-then-Report Exercises
(a) Potential Consequences of Misclassification. Many tests available to consumers are used to make dichotomous decisions (e.g., pregnancy tests, drug tests, nicotine tests, tests for urinary tract infections). The students' task is to research and report on the utility-related aspects of any one of these tests. The report should include, for example, the false positive and false negative rates for the test, as well as some discussion of the implications of these types of errors for the consumer (a worked sketch relating such rates to base rates follows these suggested assignments).
(b) Summarize Reports in the Scholarly Literature. Students select a utility-related topic from the scholarly literature and write a report on it. Here are some sample topics:
—Report on the procedures used to determine an optimal cutoff score for a particular test. One example from the scholarly literature concerns a test used to gauge alcohol use; the cutoff score was used to identify women who participated in drinking games. See Zamboanga et al. (2007). Another example involved a test used to assess delirium; the cutoff score was used to indicate the presence or absence of delirium after cardiac surgery. See Kazmierski et al. (2008).
—Report on hit and miss rates in diagnosing dementia using various instruments. See Fisher and Larner (2007) and Smith et al. (2007).
—Report on the utility of adding additional test results to make a diagnosis. See, for example, Tomanik et al. (2007).
—Report on whether tests or self-report had the most utility in learning about ADHD in adult patients. See Kooij et al. (2008).
—Report on different procedures used to diagnose a particular disorder (in this case, pedophilia), and the consequential utility of the diagnosis. See, for example, Kingston et al. (2007).
—Report on economic and social returns of educational choices. See Jaeger (2007).

5. Other Exercises and Assignments
(a) Interview Test Users. Rather than bringing test users or developers into class, you can assign students to interview test users or developers.
The interviews should address how these test users determine whether use of scores on a test or use of a particular intervention might have utility, how they estimate the utility of test scores for making different decisions, how they determine the appropriate cut scores to use, and how they estimate and minimize false positive and false negative rates.
(b) Add an Entry to Wikipedia. Have students add a new entry or update an existing entry in Wikipedia (http://www.wikipedia.com). Students should turn in or present the previous entry (if there was one) and indicate their addition, revision, or new entry. Entries should be about a concept or concepts relevant to the chapter. Students should be sure that entries are in their own words rather than copied from the textbook.
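For Research-then-Report exercise 4(a) above, once students have located published sensitivity and specificity figures for a consumer test, Bayes' theorem shows how the base rate of the condition determines the probability that a positive result is actually correct. The sketch below uses made-up figures for a purely hypothetical screening test; the online calculators listed under Media Resources below perform the same arithmetic.

```python
# Minimal sketch (hypothetical figures): how base rate, sensitivity, and
# specificity combine, via Bayes' theorem, into the probability that a positive
# result is a true positive (PPV) and a negative result a true negative (NPV).

def predictive_values(base_rate, sensitivity, specificity):
    true_pos = base_rate * sensitivity
    false_pos = (1 - base_rate) * (1 - specificity)
    false_neg = base_rate * (1 - sensitivity)
    true_neg = (1 - base_rate) * specificity
    ppv = true_pos / (true_pos + false_pos)   # P(condition | positive result)
    npv = true_neg / (true_neg + false_neg)   # P(no condition | negative result)
    return ppv, npv

# A hypothetical screening test with respectable accuracy...
sensitivity, specificity = 0.95, 0.95
# ...applied at two different base rates of the condition being screened for.
for base_rate in (0.50, 0.02):
    ppv, npv = predictive_values(base_rate, sensitivity, specificity)
    print(f"base rate {base_rate:.0%}: PPV = {ppv:.2f}, NPV = {npv:.3f}")
```

The sharp drop in the positive predictive value at the low base rate is exactly the kind of "implication for the consumer" that exercise 4(a) asks students to discuss.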
MEDIA RESOURCES

On the Web
A noncomprehensive sampling of some of the material available on the World Wide Web.

1. Buros Institute of Mental Measurements
Includes a list of tests for which reviews are available as well as links to other test-related sites: http://www.unl.edu/buros/ Your institution's library likely has copies of various editions of the Mental Measurements Yearbook, or online access, so students may have free access to reviews rather than having to purchase reviews individually online.

2. Online tutorials
For an online tutorial by David Lane that discusses the "usefulness" of adding a test or other predictor in multiple regression (i.e., the incremental validity of the predictor), click on: http://davidmlane.com/hyperstat/prediction.html
For an online tutorial by Stefan Waner and Steven Costenoble that shows how to calculate false positive rates using Bayes' Theorem, click on: http://people.hofstra.edu/Stefan_Waner/tutorialsf3/unit6_6.html
For an online clinical decision-making calculator by Rob Hamm, Ph.D., that will calculate sensitivity, specificity, false positive and false negative rates, and other factors related to utility, click on: http://www.fammed.ouhsc.edu/robhamm/cdmcalc.htm
An online tutorial entitled "Utility and Decision Making" is available from the Web Interface for Statistics Education (WISE) site maintained by Claremont Graduate University. To use this tutorial, your browser must be enabled with Java and JavaScript. Access WISE tutorials at CGU's Web site: http://wise.cgu.edu
For a review of what Taylor-Russell tables are, including sample applications, see: http://luna.cas.usf.edu/~mbrannic/files/tnm/taylor.htm

3. Other relevant sites
Employee selection: www.siop.org, www.shrm.org
Educational testing: www.ets.org
Use and abuse of statistics in the media: www.stats.org

REFERENCES
Cascio, W. F., Outtz, J., Zedeck, S., & Goldstein, I. L. (1995). Statistical implications of six methods of test score use in personnel selection. Human Performance, 8(3), 133–164.
Daw, J. (2001). Psychological assessments shown to be as valid as medical tests. APA Monitor, 32(7), 46–47.
De Corte, W., & Lievens, F. (2005). The risk of adverse impact in selections based on a test with known effect size. Educational and Psychological Measurement, 65(5), 643–664.
Fisher, C. A. H., & Larner, A. J. (2007). Frequency and diagnostic utility of cognitive test instrument use by GPs prior to memory clinic referral. Family Practice, 24(5), 495–497.
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality psychology in Europe (Vol. 7, pp. 7–28). Tilburg, The Netherlands: Tilburg University Press.
International Personality Item Pool: A scientific collaboratory for the development of advanced measures of personality traits and other individual differences. Web site: http://ipip.ori.org/
Jæger, M. M. (2007). Economic and social returns to educational choices: Extending the utility function. Rationality and Society, 19(4), 451–483.
Kazmierski, J., Kowman, M., Banach, M., et al. (2008). Clinical utility and use of DSM-IV and ICD-10 criteria and the Memorial Delirium Assessment Scale in establishing a diagnosis of delirium after cardiac surgery. Psychosomatics: Journal of Consultation Liaison Psychiatry, 49(1), 73–76.
Kingston, D. A., Firestone, P., Moulden, H. M., & Bradford, J. M. (2007). The utility of the diagnosis of pedophilia: A comparison of various classification procedures. Archives of Sexual Behavior, 36(3), 423–436.
Kooij, J., Boonstra, A. M., Swinkels, S. H., et al. (2008). Reliability, validity, and utility of instruments for self-report and informant report concerning symptoms of ADHD in adult patients. Journal of Attention Disorders, 11(4), 445–458.
McKinney, W. R., & Collins, J. R. (1991). The impact on utility, race, and gender using three standard methods of scoring selection examinations. Public Personnel Management, 20(2), 145–169.
Meyer, G. J., Finn, S. E., Eyde, L. D., Kay, G. G., Moreland, K. L., Dies, R. R., Eisman, E. J., Kubiszyn, T. W., & Reed, G. M. (2001). Psychological testing and psychological assessment: A review of evidence and issues. American Psychologist, 56(2), 128–165.
Phelps, R. P. (1999). Education establishment bias? A look at the National Research Council's critique of test utility studies. The Industrial-Organizational Psychologist (TIP), 36(4), 37–49.
Smith, T., Gildeh, N., & Holmes, C. (2007). The Montreal Cognitive Assessment: Validity and utility in a memory clinic setting. The Canadian Journal of Psychiatry / La Revue canadienne de psychiatrie, 52(5), 329–332.
Ten new prostate cancer genes found: Findings could help identify men at high risk for disease. (2008, February 10). Reuters. Retrieved February 21, 2008, from http://www.msnbc.msn.com/id/23100472/
Tomanik, S. S., Pearson, D. A., Loveland, K. A., et al. (2007). Improving the reliability of autism diagnoses: Examining the utility of adaptive behavior. Journal of Autism and Developmental Disorders, 37(5), 921–928.
Zamboanga, B. L., Horton, N. J., Tyler, K. M. B., et al. (2007). The utility of the AUDIT in screening for drinking game involvement among female college students. Journal of Adolescent Health, 40(4), 359–361.
Zedeck, S., Cascio, W. F., Goldstein, I. L., & Outtz, J. (1996). Sliding bands: An alternative to top-down selection. In R. S. Barrett (Ed.), Fair employment strategies in human resource management (pp. 222–234). Westport, CT: Quorum Books/Greenwood Publishing.