Norm/Criterion/Self Referenced Scoring

NERD ALERT!!! This is an old paper I’ve posted because I am studying for a licensure exam 🙂

After administering a psychological assessment, it is necessary to compare raw scores against a defined standard in order to obtain a derived score. Hays (2013) points out that this is essential, since raw scores are meaningless by themselves. The derived score provides a meaningful interpretation of the raw score against that standard. Three main types of derived scores exist, based on the type of standard utilized. Norm-referenced scores compare an individual’s results against group scores in order to determine the person’s standing along a bell curve (Hays, 2013; Norm-Referenced Scoring, 2009). IQ tests provide just one example of a norm-referenced score. Criterion-referenced scores compare an individual’s performance on an assessment to an established criterion (Hays, 2013; Smith & Stovall, 2002). This scoring method is often utilized in the educational system to assess a student’s academic achievement (Hays, 2013; Smith & Stovall, 2002). Finally, self-referenced scores utilize an individual’s previous scores on the same test in order to assess change over time. Self-referenced scores are frequently utilized in personality tests and interest inventories.

Norm-Referenced Scores

Description & Interpretation

Tests utilizing norm-referenced scores allow the comparison of an individual’s results with a normative sample (Hays, 2013). In order to interpret norm-referenced tests, an individual’s raw score is converted into a derived score. This allows test users to compare an individual’s score to the established norms for the test (Cohen et al., 1996). In order to create these norms, a version of the test is administered to a representative sample that reflects its target population. The characteristics that comprise this sample are carefully considered, since they greatly influence the generalizability of results (Cohen et al., 1996).

Norm-referenced scores have various formats (Mertler, 2007). “A percentile rank…indicates the percentage of the norm group that scored below a given raw score” (Mertler, 2007, p. 114). A developmental scaled score allows educators to compare a student’s performance with children in his or her age group and/or grade level (Hays, 2013). Finally, standardized scores “have a mean at 100 and deviations that occur in equal intervals” (Norm-Referenced Scoring, 2009, p. 645). IQ tests utilize a standardized norm-referenced scoring system (Hays, 2013). In order to create a standardized test, the raw scores of a normative sample are transformed so that they reflect a bell-shaped curve (Mertler, 2007). The standard deviations are all equal, and the measures of central tendency fall in the exact center (Mertler, 2007).
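As a minimal sketch of the percentile-rank idea described above, the snippet below counts what share of a norm group scored below a given raw score. The scores, and the function name, are hypothetical illustrations only, not values from any published test.

```python
# Hypothetical norm-group raw scores (made up for illustration only).
norm_group = [12, 15, 15, 18, 20, 22, 22, 25, 27, 30]

def percentile_rank(raw, norms):
    """Percentage of the norm group that scored below `raw` (Mertler's definition)."""
    below = sum(1 for s in norms if s < raw)
    return 100 * below / len(norms)

print(percentile_rank(22, norm_group))  # 50.0: half the norm group scored below 22
```

A raw score of 22 thus corresponds to the 50th percentile in this made-up sample, since five of the ten norm-group scores fall below it.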

Three Examples

Examples of normative tests include the Graduate Record Examination (GRE), the Wechsler intelligence tests, and TerraNova testing. These tests represent three unique scenarios in which this scoring method is utilized. They are each briefly reviewed below.

Graduate Record Examination (GRE).

The GRE is “the most widely accepted graduate admissions test worldwide” (Hays, 2013, p. 192). GRE scores are utilized as admission criteria for entry into graduate school. The standards utilized to assess an individual’s results vary in accordance with the needs of a program (Hays, 2013). The GRE General Test contains a verbal, a quantitative, and a written portion. The verbal portion of the test examines an individual’s ability to understand and use written material. The quantitative portion examines an individual’s quantitative reasoning and understanding of mathematical concepts. The verbal and quantitative subtests are each scored on a 130-170 point scale, with a mean around 150 and a standard deviation of 10 points (Dulan et al., 2013). The written section of the GRE General Test is scored separately on a scale of 0-6, with a mean of about 3 and a standard deviation of just under 1 (Dulan et al., 2013). Scores are reported in one-point increments and converted into percentiles in comparison to a norm-referenced sample (Dulan et al., 2013). This revised scoring system, introduced in August 2011, replaces a 40-year-old norm-referenced sample. According to ETS, the revised GRE scoring system provides greater precision in assessing variations in performance between test takers (Dulan et al., 2013).

Wechsler Intelligence Tests.

David Wechsler defined intelligence as “the aggregate or global capacity of the individual to act purposefully, to think rationally and to deal effectively with his environment” (Wechsler, 1944, p. 3). The adult version of the Wechsler intelligence test (the WAIS) was originally developed in 1939. It has undergone several revisions, with the latest version coming out in 2008 (Hays, 2013). The adult scale contains 15 subtests that are combined into a verbal score alongside a performance score (Hays, 2013). The children’s version (the WISC) assesses four main areas: “Verbal comprehension, perceptual reasoning, working memory, and processing speed” (Hays, 2013, p. 173). The adult and child versions of the Wechsler tests are each based on samples of about 2,000 individuals (Hays, 2013). Both use a standardized norm-referenced scoring system with a mean of 100 and a standard deviation of 15 (Hays, 2013).
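The mean-100, SD-15 scale mentioned above can be sketched as a simple transformation: standardize a raw score against the norm sample, then rescale. The numbers below are hypothetical, not taken from an actual Wechsler norm table.

```python
# Minimal sketch of a deviation-IQ transformation (mean 100, SD 15), assuming
# the norm sample's raw-score mean and SD are already known. All values are
# hypothetical illustrations.
def deviation_iq(raw, norm_mean, norm_sd):
    z = (raw - norm_mean) / norm_sd  # standardize against the norm sample
    return round(100 + 15 * z)       # rescale to the IQ metric

print(deviation_iq(raw=60, norm_mean=50, norm_sd=10))  # 115: one SD above the mean
```

A raw score one standard deviation above the norm-sample mean thus maps to a standard score of 115, regardless of the underlying raw-score units.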

TerraNova Testing.

The TerraNova test assesses language, reading, math, science, and social studies for students from kindergarten through grade school (Hays, 2013). The TerraNova test is a byproduct of the No Child Left Behind (NCLB) Act (Hays, 2013). It is used to monitor student progress and assess the quality of curriculum and instruction provided. According to McGraw Hill, the TerraNova test norms are based on a sample of 200,000 children who were administered the test in 2011 (McGraw Hill, 2013). Both criterion-referenced and norm-referenced standards are utilized to assess TerraNova scores. The norm-referenced scores are provided in the form of a national percentile score in the above-mentioned subject areas (McGraw Hill, 2013). An objective performance index compares a student’s scores against predetermined criteria (McGraw Hill, 2013).

Criterion-Referenced Scores

Description & Interpretation

Criterion-referenced tests use established criteria as a means of assessing an individual’s score (Hays, 2013). Criterion-referenced scores describe a person’s level of knowledge and skill mastery (Cohen et al., 1996). In comparison, norm-referenced scores show how a person has done relative to others (Cohen et al., 1996). The types of scores provided by these tests can include measures of speed, quality, or precision of performance (Mertler, 2007). Scores such as pass/fail, or below average/average/above average, are not uncommon in these scoring methods (Mertler, 2007).

In order to assess an individual’s level of knowledge against a predetermined standard, criteria are utilized as a reference point. These criteria provide cutoff scores along a continuum as a means of interpreting an individual’s raw score (Mertler, 2007). The process of determining these criteria is known as “standard setting” (Mertler, 2007, p. 101). The criteria can vary greatly, from a continuum-based model to a dichotomous perspective (Cohen et al., 1996). Nonetheless, it is important to note that the criteria are based on expert judgment; it is therefore difficult to know exactly what goes into the process without consulting the experts themselves.

Much controversy exists regarding the criteria-setting process as it relates to high-stakes testing (Hays, 2013; Mertler, 2007). High-stakes testing is a practice that involves assessing students regularly to evaluate the curriculum and instruction they receive (Hays, 2013). While well-intentioned, the practice is criticized for limiting teacher creativity and ignoring critical components of a student’s achievements (Mertler, 2007).

Three Examples

A key benefit of criterion-referenced tests is their ability to assess an individual against a predetermined standard. The utility of this form of measure can be found within a wide variety of contexts, three of which are discussed below.

Substance Abuse Assessment.

Alcohol abuse can be thought of as the consumption of alcohol despite negative consequences (Hays, 2013). Alcohol dependence includes the symptoms of alcohol abuse alongside the presence of tolerance and withdrawal (Hays, 2013). The Michigan Alcoholism Screening Test (MAST) is a 24-item screening inventory comprised of yes-or-no questions (McLellan, 2001). It can be completed in less than 15 minutes and assesses for symptoms of alcoholism (McLellan, 2001). “Scores of 5 or more indicate alcoholism, scores of 4 suggest the possibility of alcoholism, and scores of 3 or less indicate the absence of alcoholism” (Hays, 2013, p. 145).
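The cutoff rule quoted above is a clean example of criterion-referenced interpretation, and can be written out directly. This is a sketch for illustration only, not a clinical tool.

```python
# Direct translation of the MAST cutoffs quoted above (Hays, 2013):
# 5 or more -> alcoholism; exactly 4 -> possible alcoholism; 3 or less -> absence.
def interpret_mast(score):
    if score >= 5:
        return "indicative of alcoholism"
    if score == 4:
        return "possible alcoholism"
    return "absence of alcoholism"

print(interpret_mast(6))  # indicative of alcoholism
print(interpret_mast(2))  # absence of alcoholism
```

Note that no norm group appears anywhere: the raw score is compared only against the fixed criterion, which is the defining feature of this scoring method.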

Assessment of Early Reading Difficulty.

The Dynamic Indicators of Basic Early Literacy Skills (DIBELS) test assesses early reading development in students from kindergarten through sixth grade (Mertler, 2007). It assesses phonemic awareness, reading comprehension, fluency, and basic phonics (Mertler, 2007). The criterion-referenced scores provide a benchmark assessment that identifies students with early reading difficulties who are in need of additional instruction (Mertler, 2007). It is also utilized to monitor students’ reading progress against grade-level standards (Mertler, 2007). DIBELS score reports include a table of criterion levels showing how an individual’s results compare against these benchmarks (Mertler, 2007).

College Entrance Exams.

The Scholastic Aptitude Test (SAT) utilizes criterion scoring to assess an individual in three areas: (1) critical reading ability, (2) mathematics, and (3) writing (Hays, 2013). The scores in each section can range from 200 to 800, with a combined possible score of 600 to 2400. The purpose of this test is to assess a college applicant’s “academic ability and intellectual skills” (Hays, 2013, p. 185). It is useful to higher-education institutions as a predictor of applicants’ future academic performance.

Self-Referenced Scores

Description & Interpretation

Self-referenced scores utilize a test-taker’s previous performance as a point of comparison (Hays, 2013). Without testing norms or pre-established criteria, the results of self-referenced tests provide intra-individual comparisons in order to assess growth or change (Brown, 1996; McDermott et al., 1992). Another term for self-referenced scoring is ipsative testing (Ipsative, 2009).

A key criticism of this form of assessment is that it fails to adhere to the principles of psychometrics (Brown, 1996). McDermott et al. (1992) define this form of scoring as a “personal-relative metric” (p. 505). Brown (1996) describes self-referenced tests as ordinal measures. Hays (2013) defines ordinal scales as “rank or nominal categories…in which the relative size among intervals are difficult to know” (p. 88). From a mathematical perspective, this means self-referenced scoring methods do not have equal intervals or an absolute zero point (Hays, 2013). It is therefore impossible to subtract, multiply, or divide these scores. Additionally, statistical concepts such as measures of central tendency and standard deviation are meaningless (McDermott et al., 1992).
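The ordinal nature of these scores can be illustrated with a small, hypothetical example: the same person rank-orders four interest areas at two administrations. Only the ordering within each administration is meaningful; averaging or subtracting ranks would be the kind of arithmetic the critics above warn against.

```python
# Hypothetical ipsative data: the same person ranks four interest areas
# at two time points (1 = strongest preference). Data are made up.
time1 = {"art": 1, "science": 2, "business": 3, "outdoors": 4}
time2 = {"science": 1, "art": 2, "business": 3, "outdoors": 4}

# A valid intra-individual comparison: which areas rose in the ordering?
moved_up = [area for area in time1 if time2[area] < time1[area]]
print(moved_up)  # ['science']: the only area ranked higher at time 2
```

Nothing here compares this person to anyone else, which is exactly why concepts such as a group mean or standard deviation have no purchase on the scores.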

As a result of these unique characteristics, there is little agreement on how best to treat self-referenced scores statistically (Martinussen et al., 2001). While some individuals remain highly critical of ipsative measures (McDermott et al., 1992), others remain optimistic and state that the distortion in ipsative testing is minimal (Hughes, 2011; Martinussen et al., 2001). In my research for this paper, I found three examples of situations in which this scoring method is useful, despite these limitations.

Three Examples

Strong Interest Inventory.

The Strong Interest Inventory is a self-referenced assessment utilized for purposes of educational and career planning. E. K. Strong first developed it in 1927, and its latest version was published in 2004 (Hays, 2013). “This widely researched test contains six sections: occupations, subject areas, activities, leisure activities, people and your characteristics” (Hays, 2013, p. 228). Individuals respond to questions using a Likert-type scale. The Personal Style Scale is a measure that can indicate elements of personality a test taker wishes to express in a career (Prince, 1998). Other scales, such as the Occupational Scale, help determine whether one’s interests match up with a particular career or field (Prince, 1998).

Myers-Briggs Type Indicator.

The Myers-Briggs Type Indicator (MBTI) was developed in the 1920s by Katharine Briggs and her daughter Isabel Myers (Hays, 2013). It utilizes ipsative measures and is comprised of a series of forced-choice questions based on Jungian theory (Hays, 2013). Four personality dimensions are measured in order to determine an individual’s personality type: (1) introversion versus extroversion, (2) intuition versus sensing, (3) feeling versus thinking, and (4) perceiving versus judging (Hays, 2013). The MBTI has received criticism due to a lack of research supporting its theoretical foundations (Hays, 2013). Additionally, others question the validity and reliability of ipsative personality assessments that utilize forced-choice questionnaires (Martinussen et al., 2001). Nonetheless, this established personality assessment is used frequently in the context of vocational and relationship counseling.

Ipsative Assessment in Higher Education.

An interesting article I found for this assignment discusses the potential benefits of ipsative assessment in education (Hughes, 2011). Assessment plays a central role in education, primarily as a means to maintain standards (Hays, 2013; Hughes, 2011). Since educational assessment utilizes criterion- and norm-referenced scoring methods, the feedback it provides has an externalized focus (Hughes, 2011). These methods exclude a valuable opportunity to use assessment as an integral part of the learning process (Hughes, 2011). Ipsative measures can provide a unique counterbalance to externalized criterion measures by tying a learner’s performance to personal goals. Such assessments would be highly motivating as a measure of one’s own progress. Despite these benefits, Hughes (2011) does admit it is unrealistic for these forms of assessment to be utilized alone. However, they can be a valuable component when taken alongside norm-referenced and criterion-referenced assessment methods (Hughes, 2011).



Brown, H. (1996). Strength and limitations of ipsative measurement. Journal of Occupational and Organizational Psychology, 69, 49-56.
Cohen, R., Swerdlik, M., & Phillips, S. (1996). Psychological testing and assessment: An introduction to tests and measurement. Mountain View, CA: Mayfield Publishing Company.
Dulan, S. W., & Advantage Education (Firm). (2013). McGraw-Hill’s GRE: Graduate Record Examination general test. New York: McGraw-Hill.
Hays, D. G. (2013). Assessment in counseling: A guide to the use of psychological assessment procedures (5th ed.). Belmont, CA: Brooks/Cole, Cengage Learning.
Hughes, G. (2011). Towards a personal best: A case for introducing ipsative assessment in higher education. Studies in Higher Education, 36(3), 353-367.
Ipsative. (2009). Oxford University Press.
Johnson, C., Wood, R., & Blinkhorn, S. (1988). Spuriouser and spuriouser: The use of ipsative personality tests. Journal of Occupational Psychology, 61, 153-162.
Martinussen, M., Richardsen, A. M., & Vårum, H. W. (2001). Validation of an ipsative personality measure (DISCUS). Scandinavian Journal of Psychology, 42(5), 411-416. doi:10.1111/1467-9450.00253
McDermott, P. A., Fantuzzo, J. W., Glutting, J. J., Watkins, M. W., & Baggaley, A. R. (1992). Illusions of meaning in the ipsative assessment of children’s ability. The Journal of Special Education, 25(4), 504-526. doi:10.1177/002246699202500407
McGraw Hill. (2013). TerraNova Common Core.
McLellan, A. T. (2001). Michigan Alcoholism Screening Test (MAST). In R. Carson-DeWitt (Ed.), Encyclopedia of Drugs, Alcohol & Addictive Behavior (2nd ed., Vol. 2, pp. 728-729). New York: Macmillan Reference USA.
Mertler, C. (2007). Interpreting standardized testing scores: Strategies for data-driven instructional decision making. Thousand Oaks, CA: SAGE Publications. doi:10.4135/9781452232317.n6
Norm-Referenced Scoring. (2009). In E. M. Anderman & L. H. Anderman (Eds.), Psychology of Classroom Learning (Vol. 2, pp. 643-645). Detroit: Macmillan Reference USA.
Norm-Referenced Testing. (2009). In E. M. Anderman & L. H. Anderman (Eds.), Psychology of Classroom Learning (Vol. 2, pp. 645-648). Detroit: Macmillan Reference USA.
Prince, J. R. (1998). Interpreting the Strong Interest Inventory: A case study. The Career Development Quarterly, 46(4), 339.
Smith, D. K., & Stovall, D. L. (2002). Individual norm-referenced ability testing. In R. B. Ekstrom & D. K. Smith (Eds.), Assessing individuals with disabilities in educational, employment, and counseling settings (pp. 147-171). American Psychological Association.
Wechsler, D. (1944). The measurement of adult intelligence (3rd ed.). Baltimore, MD: Williams & Wilkins.
