David R Heise
Department of Sociology
Bloomington, IN 47405
March 30, 2000
For Methodology section of the
International Encyclopedia of the Social and Behavioral Sciences
Social measurements translate observed characteristics of individuals, events, relationships, organizations, societies, etc. into symbolic classifications that enable verbal, logical, or mathematical reasoning. Qualitative research and censuses together define one realm of measurement theory, concerned with assignment of entities to classification categories embedded within taxonomies and typologies. Another topic in measurement theory involves scaling discrete items of information, such as answers to questions, so as to produce quantitative measurements for mathematical analyses. A third issue is allowing for fallibility in observations (measurement errors) so that theoretical models can account for social phenomena, even while not fitting the details of every observation.
1. Qualitative Assessments
Classification assimilates perceived phenomena into symbolically labeled categories. Anthropological studies of folk classification systems (D’Andrade 1995) have advanced understanding of scientific classification systems, though scientific usages involve criteria that folk systems may not meet entirely.
Two areas of social science employ classification systems centrally. Qualitative analyses such as ethnographies, histories, case studies, etc. offer classifications—sometimes newly invented—for translating experiences in unfamiliar cultures or minds into familiar terms. Censuses of individuals, of occurrences, or of aggregate social units apply classifications—usually traditional—in order to count entities and their variations. Both types of work depend on theoretical constructions defining how classification categories are linked.
1.1 Taxonomies
Every classification category is located within a taxonomy. Some more general categorization, Y, determines which entities are in the domain for the focal categorization, X; so an X always must be a kind of Y; and "X is a kind of Y" is the linguistic frame for specifying taxonomies. Concepts constituting a taxonomy form a logic tree, with subordinate elements implying superordinate items.
Taxonomic enclosure of a classification category is a social construction that may have both theoretical and practical consequences. For example, if only violent crimes are subject to classification as homicides, then "homicide" is a kind of "violent crime," and deaths caused by executive directives to release deadly pollution could not be homicides.
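The "X is a kind of Y" frame can be sketched as a parent lookup in Python. This is only an illustration: the dictionary `kind_of`, the helper names, and the root category "crime" are assumptions introduced here, not part of any cited classification.

```python
# Each category points to its superordinate category ("X is a kind of Y").
# The root "crime" is an assumed placeholder for illustration.
kind_of = {
    "homicide": "violent crime",
    "violent crime": "crime",
}

def ancestors(category):
    """Return the chain of superordinate categories, nearest first."""
    chain = []
    while category in kind_of:
        category = kind_of[category]
        chain.append(category)
    return chain

def is_a(x, y):
    """True if x is (directly or indirectly) a kind of y."""
    return y in ancestors(x)

print(ancestors("homicide"))
print(is_a("homicide", "violent crime"))
```

Because a subordinate element implies its superordinates, placing "homicide" under "violent crime" in such a tree is exactly what excludes non-violent deaths from the category.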
1.2 Typologies
A typology differentiates entities at a particular level of a taxonomy in terms of one or more of their properties. The differentiating property (sometimes called a feature or attribute) essentially acts as a modifier of entities at that taxonomic level. For example, in the U.S. kinship system, siblings are distinguished in terms of whether they are male or female; in Japan, by comparison, siblings are schematized in terms of whether they are older as well as whether they are male or female.
A scientific typology differentiates entities into types that are exclusive and exhaustive: every entity at the relevant taxonomic level is of one defined type only, and every entity is of some defined type. A division into two types is a dichotomy, into three types a trichotomy, and into more than three types a polytomy.
Polytomous typologies often are constructed by crossing multiple properties, forming a table in which each cell is a theoretical type. (The crossed properties might be referred to as variables, dimensions, or factors in the typology.) For example, members of a multiplex society have been characterized according to whether they do or do not accept the society’s goals on the one hand, and whether they do or do not accept the society’s means of achieving goals on the other hand; then crossing acceptance of goals and means produces a four-fold table defining conformists and three types of deviants.
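Crossing properties to form a typology can be sketched in a few lines of Python. The property names and the generic labels for the deviant cells are placeholders assumed for illustration; the sketch only shows that crossing two dichotomies yields four exclusive and exhaustive types.

```python
from itertools import product

# Two dichotomous properties (names are illustrative assumptions).
properties = {
    "accepts_goals": (True, False),
    "accepts_means": (True, False),
}

# Crossing the properties gives one cell (theoretical type) per combination.
cells = list(product(*properties.values()))

def classify(accepts_goals, accepts_means):
    """Assign an entity to exactly one cell of the four-fold table."""
    if accepts_goals and accepts_means:
        return "conformist"
    # The remaining three cells are the deviant types; labels are placeholders.
    return f"deviant (goals={accepts_goals}, means={accepts_means})"

types = {cell: classify(*cell) for cell in cells}
for cell, label in types.items():
    print(cell, "->", label)
```

Every combination of property values maps to exactly one label, which is the exclusive-and-exhaustive requirement stated for scientific typologies.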
Etic-emic analysis involves defining a typology with properties of scientific interest (the etic system) and then discovering ethnographically which types and combinations of types are recognized in folk meanings (the emic system). Latent structure analysis statistically processes observed properties of a sample of entities in order to confirm the existence of hypothesized types and to define the types operationally.
1.3 Aggregate Entities
Aggregate social entities such as organizations, communities, and cultures may be studied as unique cases, where measurements identify and order internal characteristics of the entity rather than relate one aggregate entity to another.
A seeming enigma in social measurement is how aggregate social entities can be described satisfactorily on the basis of the reports of relatively few informants, even though statistical theory calls for substantial samples of respondents to survey populations. The key is that informants all report on the same thing (a single culture, community, or organization), whereas respondents in a social survey typically report on diverse things (their own personal characteristics, beliefs, or experiences). Thus reports from informants serve as multiple indicators of a single state, and the number needed depends on how many observations are needed to define a point reliably, rather than how many respondents are needed to describe a population’s diversity reliably. As few as seven expert informants can yield reliable descriptions of aggregate social entities, though more are needed as informants’ expertise declines (Romney et al. 1986). Informant expertise correlates with greater intelligence and experience (D’Andrade 1995) and with having a high level of social integration (Thomas and Heise 1995).
Case grammar in linguistics defines events and relationships in terms of an actor, action, object, and perhaps instrumentation, setting, products, and other factors as well. Mapping sentences (Shye, Elizur, and Hoffman 1994) apply the case grammar idea with relatively small lists of entities in order to analyze and represent relational phenomena within social aggregates. For example, interpersonal relations in a group can be specified by pairing group members with actions such as loves, admires, annoys, befriends, angers. Mapping sentences defining the relations between individuals or among social organizations constitute a measurement model for social network research.
2. Quantitative Measures
Quantitative measurements differentiate entities at a given taxonomic level—serving like typological classifications, but obtaining greater logical and mathematical power by ordering the classification categories.
An influential conceptualization (Stevens 1951) posited four levels of quantification in terms of how numbers relate to classification categories. Nominal numeration involves assigning numbers arbitrarily simply to give categories unique names, such as "batch 243." An ordinal scale’s categories are ordered monotonically in terms of greater-than and less-than, and numbering corresponds to the rank of each category. Numerical ranking of an individual’s preferences for different foods is an example of ordinal measurement. Differences between categories can be compared in an interval scale, and numbers applied to categories reflect degrees of differences. Calendar dates are an example of interval measurements—we know from their birth years that William Shakespeare was closer in time to Geoffrey Chaucer than Albert Einstein was to Isaac Newton. In a ratio scale categories have magnitudes that are whole or fractional multiples of one another, and numbers assigned to the categories represent these magnitudes. Population sizes are an example of ratio measurements—knowing the populations of both nations, we can say that Japan is at least 35 times bigger than Jamaica.
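The difference between interval and ratio comparisons can be checked with simple arithmetic. The birth years and round population figures below are approximate values assumed for illustration (Chaucer's birth year in particular is uncertain); they are not data from the cited sources.

```python
# Approximate birth years, assumed for illustration.
birth_year = {"Chaucer": 1343, "Shakespeare": 1564, "Newton": 1643, "Einstein": 1879}

# Interval scale: differences between dates are meaningful,
# but ratios of the raw year numbers are not.
gap_writers = birth_year["Shakespeare"] - birth_year["Chaucer"]
gap_physicists = birth_year["Einstein"] - birth_year["Newton"]
print(gap_writers < gap_physicists)

# Ratio scale: a true zero point makes ratios meaningful.
# Round population figures around the year 2000, assumed for illustration.
population = {"Japan": 127_000_000, "Jamaica": 2_600_000}
ratio = population["Japan"] / population["Jamaica"]
print(ratio >= 35)
```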
A key methodological concern in psychometrics (Hopkins 1998) has been: How do you measure entities on an interval scale given merely nominal or ordinal information?
2.1 Scaling Dichotomous Items
Nominal data often are dichotomous yes-no answers to questions, a judge’s presence-absence judgments about the features of entities, an expert’s claims about truth-falsity of propositions, etc. Answers of yes, present, true, etc. typically are coded as "one" and no, absent, false, etc. as "zero." The goal, then, is to translate zero-one answers for each case in a sample of entities into a number representing the case’s position on an interval scale of measurement.
The first step requires identifying how items relate to the interval scale of measurement in terms of a graph of the items’ characteristic curves. The horizontal axis of such a graph is the interval scale of measurement, confined to the practical range of variation of entities actually being observed. The vertical axis indicates probability that a specific dichotomous item has the value one for an entity with a given position on the interval scale of measurement. An item’s characteristic curve traces the changing probability of the item having the value one as an entity moves from having a minimal value on the interval scale to having the maximal value on the interval scale.
Item characteristic curves have essentially three different shapes, corresponding to three different formulations about how items combine into a scale.
Spanning items have characteristic curves that essentially are straight lines stretching across the range of entity variation. A spanning item’s line may start as a low probability value and rise to a high probability value, or fall from a high value to a low value. A rising line means that the item is unlikely to have a value of one with entities having a low score on the interval scale; the item is likely to have a score of one for entities having a high score on the interval scale; and the probability of the item being valued at one increases regularly for entities between the low and high positions on the scale.
Knowing an entity’s value on any one spanning item does not permit assessing the entity’s position along the interval scale. However, knowing the entity’s values on multiple spanning items does allow an estimate of positioning to be made. Suppose heuristically that we are working with a large number of equivalent spanning items, each having an item characteristic curve that starts at probability 0.00 at the minimal point of the interval scale, and rises in a straight line to probability 1.00 at the maximal point of the interval scale. The probability of an item being valued at one can be estimated from the observed proportion of all these items that are valued at one—which is simply the mean item score when items are scored zero-one. Then we can use the characteristic curve for the items to find the point on the interval scale where the entity must be positioned in order to have the estimated item probability. This is the basic scheme involved in construction of composite scales, where an averaged or summated score on multiple items is used to estimate an entity’s interval-scale value on a dimension of interest (Lord and Novick 1968).
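The heuristic case can be sketched as follows, assuming equivalent spanning items whose characteristic curve rises in a straight line from probability 0.00 at the scale minimum to 1.00 at the scale maximum. The function name, the scale endpoints, and the answer pattern are hypothetical.

```python
def composite_estimate(item_scores, scale_min=0.0, scale_max=1.0):
    """Estimate scale position from zero-one scores on equivalent spanning items.

    Assumes the idealized linear characteristic curve, so the inverse of the
    curve maps an estimated probability p to scale_min + p * (scale_max - scale_min).
    """
    if not item_scores:
        raise ValueError("at least one item score is required")
    p_hat = sum(item_scores) / len(item_scores)  # mean item score
    return scale_min + p_hat * (scale_max - scale_min)

# Hypothetical answers of one entity to ten equivalent items.
answers = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
position = composite_estimate(answers, scale_min=0, scale_max=100)
print(position)
```

With real items the characteristic curves are neither identical nor perfectly linear, so an averaged score of this kind is an approximation whose quality depends on the items.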
The more items that are averaged, the better the estimate of an entity’s position on the interval scale. The upper bound on number of items is pragmatic, determined by how much precision is needed and how much it costs to collect data with more items. The quality of the estimate also depends on how ideal the items are in terms of having straight-line characteristic curves terminating at the extremes of probability. Irrelevant items with a flat characteristic curve would not yield an estimate of scale position no matter how many of them are averaged, because a flat curve means that the probability of the item having a value of one is uncorrelated with the entity’s position on the interval scale. Inferences are possible with scales that include relevant but imperfect items, but more items are required to achieve a given level of precision, and greater weight needs to be given to the more perfect items.
Declivitous items have characteristic curves that rise sharply at a particular point on the horizontal axis. Idealized, the probability of the item having a value of one increases from 0.00 to the left of the inflection point to 1.00 to the right of the inflection point; or alternatively the probability declines from 1.00 to 0.00 in passing the inflection point. Realistically, the characteristic curve of a declivitous item is S-shaped with a steep rise in the middle and graduated approaches to 0.00 at the bottom and to 1.00 at the top.
The value of a single declivitous item tells little about an entity’s position along the interval scale. However, an inference about an entity’s scale position can be made from a set of declivitous items with different inflection points, or difficulties, that form a cumulative scale. Suppose heuristically that each item increases stepwise at its inflection point. Then for a given entity within the range of variation, items at the left end of the scale all will have the value of one, items at the right end of the scale all will have the value zero, and the entity’s value on the interval scale is between the items with a score of one and the items with a score of zero.
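A minimal sketch of this bracketing logic, assuming idealized step-shaped items with known inflection points. The function name and the example difficulties are hypothetical.

```python
def bracket_position(difficulties, responses):
    """Bracket an entity's scale position from idealized declivitous items.

    difficulties: inflection point of each item on the interval scale
    responses:    zero-one value of each item, in the same order
    Returns (lower, upper) bounds, or None if the response pattern is not
    cumulative (some zero lies to the left of some one).
    """
    pairs = sorted(zip(difficulties, responses))
    ordered = [r for _, r in pairs]
    n_ones = sum(ordered)
    # A cumulative pattern has all ones to the left of all zeros.
    if ordered != [1] * n_ones + [0] * (len(ordered) - n_ones):
        return None
    lower = pairs[n_ones - 1][0] if n_ones > 0 else float("-inf")
    upper = pairs[n_ones][0] if n_ones < len(pairs) else float("inf")
    return lower, upper

print(bracket_position([10, 20, 30, 40], [1, 1, 0, 0]))
print(bracket_position([10, 20, 30, 40], [1, 0, 1, 0]))
```

The second call returns None because the pattern violates the cumulative ordering, which is the kind of deviation that realistic S-shaped curves produce.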
If the items’ inflection points are evenly distributed along the interval scale, then the sum of items’ zero-one scores for an entity constitutes an estimate of where the entity is positioned along the interval scale. That is, few of the items have a value of one if the entity is on the lower end of the interval scale, and many of the items are valued at one if the entity is at the upper end of the interval scale. This is the basic scheme involved in Guttman scalogram analysis (e.g., see Shye 1978, Part 5). On the other hand, we might use empirical data to estimate the position of each item’s inflection point on the interval scale, while simultaneously estimating entity scores that take account of the item difficulties. This is the basic scheme involved in scaling with Rasch models (e.g., Andrich 1988).
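The logistic form of the Rasch model can illustrate how entity scores take item difficulties into account. The sketch below is a simplification: it treats the item difficulties as already known and estimates only one entity's position by maximum likelihood, whereas full Rasch scaling estimates items and entities together. Function names and difficulty values are hypothetical.

```python
import math

def rasch_probability(theta, difficulty):
    """Probability of a one for an entity at theta on an item of given difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

def estimate_ability(responses, difficulties, iterations=50):
    """Maximum-likelihood estimate of theta by Newton-Raphson, difficulties fixed."""
    raw = sum(responses)
    if raw == 0 or raw == len(responses):
        raise ValueError("no finite estimate for all-zero or all-one patterns")
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in difficulties]
        gradient = raw - sum(probs)                   # observed minus expected score
        information = sum(p * (1 - p) for p in probs)
        step = gradient / information
        theta += step
        if abs(step) < 1e-10:
            break
    return theta

# Hypothetical item difficulties on a logit scale.
theta_hat = estimate_ability([1, 1, 0], [-1.0, 0.0, 1.0])
print(theta_hat)
```

At the estimate, the expected number of ones equals the observed raw score, which is why the raw sum carries the information about position in this model.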
Entities’ positions on the interval scale can be pinned down as closely as desired through the use of more declivitous items with inflection points spaced closer and closer together. However, adding items to achieve more measurement precision at the low end of the interval scale does not help at the middle or at the high end of the interval scale. Thus, obtaining high precision over the entire range of variation requires a large number of items, and it could be costly to obtain so much data. Alternatively, one can seek items whose characteristic curves rise gradually over a range of the interval scale such that sequential items on the scale have overlapping characteristic curves, whereby an entity’s position along the interval scale is pinpointed by several items.
Regional items have characteristic curves that rise and fall within a limited range of the interval scale. That is, moving an entity up the interval scale increases the probability of a particular item having a value of one for a while, and then decreases the probability after the entity passes the characteristic curve’s maximum value. For example, in a scale measuring prejudice toward a particular ethnic group, the probability of agreeing with the item "they require equal but separate facilities" increases as a person moves away from an apartheid position, and then decreases as the person moves further up the scale toward a non-discriminatory position. A regional item’s characteristic curve is approximately bell-shaped if its maximum is at the middle of the interval scale, but characteristic curves at the ends of the scale are subject to floor and ceiling clipping, making them look like declivitous items.
If an entity has a value of one on a regional item, then the entity’s position along the interval scale is known approximately, since the entity must be positioned in the part of the scale where that item has a non-zero probability of being valued at one. However, a value of zero on the same item can result from a variety of positions along the interval scale and reveals little about the entity’s position. Thus multiple regional items have to be used to assess positions along the whole range of the scale. The items have to be sequenced relatively closely along the scale with overlapping characteristic curves so that no entity will end up in the non-informative state of having a zero value on all items.
One could ask judges to rate the scale position of each item, and average across judges to get an item score; then, later, respondents can be scored with the average of the items they accept. This simplistic approach to regional items was employed in some early attempts to measure social attitudes. Another approach is statistical unfolding of respondents’ choices of items on either side of their own positions on the interval scale in order to estimate scale values for items and respondents simultaneously (Coombs 1964).
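The judge-based procedure can be sketched as two averaging steps. The item names, the judges' ratings, and the 1 to 11 rating range are hypothetical values chosen for illustration.

```python
from statistics import mean

# Hypothetical judge ratings of each item's scale position (1 to 11).
judge_ratings = {
    "item_a": [2, 3, 1],
    "item_b": [6, 5, 7],
    "item_c": [10, 9, 11],
}

# Step 1: average across judges to get each item's scale value.
item_scale_values = {item: mean(r) for item, r in judge_ratings.items()}

def score_respondent(accepted_items):
    """Step 2: score a respondent as the mean scale value of the accepted items."""
    if not accepted_items:
        return None  # no item accepted: position cannot be inferred
    return mean(item_scale_values[item] for item in accepted_items)

print(score_respondent(["item_b", "item_c"]))
print(score_respondent([]))
```

The None result for a respondent who accepts no items corresponds to the non-informative all-zero state that closely spaced, overlapping regional items are meant to prevent.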
Item analysis is a routine aspect of scale construction with spanning items, declivitous items, or regional items. One typically starts with a notion of what one wants to measure, assembles items that should relate to the dimension, and tests the items in order to select the items that work best. Since a criterion measurement that can be used for assessing item quality typically is lacking, the items as a group are assumed to measure what they are supposed to measure, and scores based on this assumption are used to evaluate individual items.
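One common way to evaluate items against the group of items is the correlation of each item with the sum of the remaining items. The sketch below shows that approach for zero-one items; it is one of several possible item statistics, and the data matrix is hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def item_rest_correlations(data):
    """data: one row per entity, each row a list of zero-one item scores."""
    n_items = len(data[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in data]
        rest = [sum(row) - row[j] for row in data]  # score without item j
        result.append(pearson(item, rest))
    return result

# Hypothetical answers of six entities to four items.
data = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
correlations = item_rest_correlations(data)
print(correlations)
```

Items with low or negative correlations are candidates for removal, subject to the caveat that the criterion here is the item set itself rather than an independent measurement.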
Items in a scale are presumed to measure a single dimension rather than multiple dimensions. Examining the dimensionality assumption brings in additional technology, such as component analysis or factor analysis in the case of spanning items, multidimensional scalogram analysis in the case of declivitous items, and non-metric multidimensional scaling in the case of regional items. These statistical methods help in refining the conception of the focal dimension and in selecting the best items for measuring that dimension.
2.2 Ordered Assessments
Ordinal data, the starting point for a three-volume mathematical treatise on measurement theory (Krantz et al. 1971; Suppes et al. 1989; Luce et al. 1990), may arise from individuals’ preferences, gradings of agreement with opinion statements, reckonings of similarity between stimuli, etc. Also, analyses of dichotomous items sometimes are enhanced by incorporating ordinal data, such as rankings of how close items seem to one’s own position. Conjoint analysis (Luce and Tukey 1964; Michell 1990) offers a general mathematical model for analyzing such data. According to the conjoint theory of measurement, positions on any viable quantitative dimension are predictable from positions on two other quantitative dimensions, and this assumption leads to tests of a dimension’s usefulness given just information of an ordinal nature. For example, societies might be ranked in terms of their socioeconomic development and also arrayed in terms of the extents of their patrifocal technologies (like herding) and matrifocal technologies (like horticulture), each of which contributes additively to socioeconomic development. Conjoint analyses could be conducted to test the meaningfulness of these dimensions, preliminary to developing interval scales of socioeconomic development and of patrifocal and matrifocal technologies. Specific scaling methodologies, like Rasch scaling and non-metric multidimensional scaling, can be interpreted within the conjoint analysis framework.
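One ordinal test associated with conjoint measurement is the independence (single cancellation) condition: the ordering of levels on one factor must be the same at every level of the other factor. The sketch below checks only this condition, which is necessary but not sufficient for an additive representation. The table layout (rows for levels of one factor, columns for levels of the other, cells holding ordinal ranks of the outcome) and the example ranks are hypothetical.

```python
def satisfies_independence(table):
    """Check the independence (single cancellation) condition on an ordinal table."""
    def consistent(lines):
        for a in lines:
            for b in lines:
                # Sign of each cellwise comparison: -1, 0, or +1.
                signs = {(x > y) - (x < y) for x, y in zip(a, b)}
                if len(signs) != 1:
                    return False  # the order of a and b changes across levels
        return True

    columns = [list(col) for col in zip(*table)]
    return consistent(table) and consistent(columns)

# Hypothetical ranks of an outcome by levels of two factors.
additive_like = [[1, 2, 3], [2, 3, 4], [3, 4, 5]]
crossing = [[1, 4], [3, 2]]
print(satisfies_independence(additive_like))
print(satisfies_independence(crossing))
```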
Magnitude estimations involve comparisons to an anchor, for example: "Here is a reference sound. … How loud is this next sound relative to the reference sound." Trained judges using such a procedure can assess intensities of sensations and of a variety of social opinions on ratio scales (Stevens 1951; Lodge 1981). Comparing magnitude estimation in social surveys to the more common procedure of obtaining ratings on category scales with a fixed number of options, Lodge (1981) found that magnitude estimations are more costly but more accurate, especially in registering extreme positions.
Rating scales with bipolar adjective anchors like good-bad often are used to assess affective meanings of perceptions, individuals, events, etc. Such scales traditionally provided seven or nine answer positions between the opposing poles with the middle position defined as neutral. Computerized presentations of such scales with hundreds of rating positions along a graphic line yield greater precision by incorporating some aspects of magnitude estimation. Cross-cultural and cross-linguistic research in dozens of societies has demonstrated that bipolar rating scales align with three dimensions—evaluation, potency, and activity (Osgood et al. 1975). An implication is that research employing bipolar rating scales should include scales measuring the standard three dimensions in order to identify contributions of these dimensions to rating variance on other scales of more substantive interest.
3. Errors in Measurements
Theories set expectations about what should be observed, in contrast to what is observed, and deviation from theoretical expectations can be interpreted as measurement error. "It is only through knowledge of the physical world, either in the form of common-sense knowledge or in the form of knowledge of the laws of physics that we know that there are any errors at all in our measurements" (Kyburg 1992, p.78).
3.1 Reliability and Validity
Items in a composite scale can be viewed as mutually dependent on a latent variable, with each item additionally influenced by random factors varying from one item to another. An item correlates with the latent variable because the latent variable partially determines that item’s values. Items correlate with one another since all are mutually dependent on the latent variable. Items correlate with the composite score created from them since the items are the determinants of the score. And the composite score correlates with the latent variable because the composite score is determined by the items, which in turn are partially determined by the latent variable. Correlations of items with the latent variable cannot be computed directly, but they can be estimated from the empirically observable inter-item correlations. The correlation of the composite score with the latent variable also cannot be computed directly, but can be estimated from the inter-item statistics (Heise and Bohrnstedt 1970).
A reliability coefficient estimates the extent to which scores obtained with a scale are reproducible. A reliability coefficient may be based on the internal consistency of items in the scale, extrapolating from the inter-item correlations to an estimated correlation of scores from one administration of the scale to another hypothetical administration (Heise and Bohrnstedt 1970). A reliability coefficient also may be obtained by repeatedly administering the scale and computing test-retest correlations, though then one has to adjust for changes that occurred in the time between the two measurements (Heise 1969). Beyond estimating the extent of reproducibility, reliability analyses may examine how measurement errors correlate with different testing occasions, forms, administrators, etc. (Traub 1994).
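As an illustration of an internal-consistency estimate, the sketch below computes Cronbach's alpha, a widely used coefficient. It is swapped in here for illustration and is not the specific coefficient developed in the cited work; the data matrix is hypothetical.

```python
from statistics import variance

def cronbach_alpha(data):
    """Cronbach's alpha for rows of item scores (one row per entity)."""
    k = len(data[0])
    if k < 2:
        raise ValueError("at least two items are required")
    item_vars = [variance([row[j] for row in data]) for j in range(k)]
    total_var = variance([sum(row) for row in data])
    if total_var == 0:
        raise ValueError("composite scores have no variance")
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical zero-one answers of five entities to three items.
data = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]
alpha = cronbach_alpha(data)
print(alpha)
```

The coefficient rises as items covary more strongly relative to their individual variances, which reflects the extrapolation from inter-item consistency to reproducibility of the composite score.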
Measurement error attenuates the size of correlations, making them less extreme than they would be if measurements were perfect. A correlation coefficient can be corrected for attenuation if reliabilities have been obtained for the variables involved in the correlation. The strategy of correcting for attenuation is essentially what is involved when employing "measurement models" in structural equation analyses of systems of variables. In effect, the measurement model defines a latent variable’s reliability of measurement, and this reliability information allows estimating the variable’s relations with other variables at the levels that would be found if measurements were perfect. Enticing as it seems, the strategy of using reliability information to correct for attenuation in complex systems yields dependable outcomes only when measurement errors are relatively minor (Heise 1986).
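The classical correction for attenuation divides the observed correlation by the square root of the product of the two reliabilities. A minimal sketch, with a hypothetical function name and input values:

```python
import math

def correct_for_attenuation(r_xy, reliability_x, reliability_y):
    """Estimate the correlation that would be found with perfect measurement."""
    for rel in (reliability_x, reliability_y):
        if not 0 < rel <= 1:
            raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(reliability_x * reliability_y)

# Hypothetical observed correlation and reliabilities.
corrected = correct_for_attenuation(0.36, 0.64, 0.81)
print(corrected)
```

When reliabilities are low, the denominator is small and the corrected value can even exceed 1.0, which is one way to see why the correction becomes undependable as measurement error grows.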
Validity concerns the extent to which scores measure a dimension of interest. While an unreliable measure cannot be a valid measure of anything, a reliable measure is not a valid measure unless it correlates with the dimension of interest and does not correlate with irrelevant factors (Heise and Bohrnstedt 1970). Lacking a criterion measure to use as a standard for correlations, validity has to be assessed in terms of judgments about whether item content obviously measures the dimension of interest, or in terms of how well measurements work in predicting outcomes that are theoretically expected.
Bias is a form of invalidity in which measurements are too high or too low for a distinct group of measured entities. Social psychologists have examined some factors that contribute to bias in social research. For example, wording and ordering of questions can foster or suppress particular answers and thereby contribute to biases (Schuman and Presser 1981). Informants may deceive purposefully in some cases (Goffman 1974, Chapter 4), and their fabrications could lead to biased measurements of aggregate entities.
3.2 Scale Errors
Ranks assigned to the answer categories of multiple-choice "Likert" items (e.g., agree strongly, agree, disagree, disagree strongly) often are analyzed as "assumed" intervals. Stevens condemned this procedure in his writings on levels of quantification (e.g., Stevens 1951), along with any procedure that employs lower level measurements in analyses requiring a higher level scale type. Additionally, conjoint measurement theorists (e.g., Michell 1990) argue that empirical tests assuring that a dimension really is quantitative and measurable should precede construction of any measurement scale, since measurements on tested scales are the only way to obtain scientific theories.
In opposition, Borgatta (1992) argued that social dimensions are intrinsically quantitative at the theoretical level, that the failure of actual scales to capture the theoretical dimensions perfectly is a form of measurement error, and that such measurement errors can be ameliorated by using more items and more discriminating items in the measurement scales. Purist arguments against standard research practices in quantitative social science are hyperbolic, according to Borgatta, and should not encumber most research activities.
One resolution of the controversy is this. Rigor is essential when social measurements affect the lives of individuals. For instance, an achievement test that is used to decide individual career paths should have been validated in the framework of conjoint measurement theory, the test should be scaled to provide interval measurements, and the test scores should have high reliability. High quality measurements also are required for accurate parameterization of mature theoretical models, especially those that are complex and nonlinear. On the other hand, exploratory research attempting to build the foundations of social and behavioral science pragmatically has to use measurement procedures that are available and affordable in order to determine whether a particular concept or social dimension is worth further theoretical and methodological investment.
4. Measurement and Theory
Theories and measurements are bound inextricably.
In the first place, taxonomies and typologies, which are theoretical constructions even when rooted in folk classification systems, are entailed in defining which entities are to be measured, and so are part of all measurements.
Second, scientists routinely assume that any particular measurement, even one based on a scale, is wrong to some degree, and that combining multiple measurements improves measurement precision. The underlying premise is that a true value exists for that which is being measured, as opposed to observed values, and that theories apply to true values, not ephemeral observed values. A notion of measurement error as deviation from theoretical expectations is widely applicable, even in qualitative research (McCullagh 1984; Heise 1989).
Third, the conjoint theory of measurement underlying many current measurement technologies requires theoretical specification of relations between variables before the variables' viability can be tested or meaningful scales constructed. This approach is in creative tension with traditional deductive science, wherein variable measurements are gathered in order to determine if relations among variables exist and theories are correct. In fact, a frequent theme in social science during the late 20th century was that measurement technology had to improve in order to foster the growth of more powerful theories. It is clear now that the dependence between measurements and theories is more bi-directional than was supposed before the development of conjoint theory.
References
Andrich, David 1988 Rasch Models for Measurement Sage, Newbury Park, CA
Borgatta, Edgar F 1992 Measurement. In: Borgatta, E F, and Borgatta, Marie L (eds) Encyclopedia of Sociology 3: 1226-36 Macmillan, New York
Coombs, Clyde H 1964 A Theory of Data John Wiley & Sons, New York
D'Andrade, Roy 1995 The Development of Cognitive Anthropology. Cambridge University Press, New York
Goffman, Erving 1974 Frame Analysis. Harper Colophon, New York
Heise, David 1969 Separating reliability and stability in test-retest correlation. American Sociological Review 34: 93-101
Heise, David 1986 Estimating nonlinear models: Correcting for measurement error. Sociological Methods & Research 14: 447-472
Heise, David 1989 Modeling event structures. Journal of Mathematical Sociology 14: 139-169
Heise, David, and Bohrnstedt, George 1970 Validity, invalidity, and reliability. In: Borgatta, Edgar, and Bohrnstedt, G (eds.) Sociological Methodology: 1970. 104-129 Jossey-Bass, San Francisco
Hopkins, Kenneth D 1998 Educational and Psychological Measurement and Evaluation, 8th edn. Allyn and Bacon, Boston
Krantz, David H, Luce, R D, Suppes, P, and Tversky, A 1971 Foundations of Measurement. Volume 1: Additive and polynomial representations. Academic Press, New York
Kyburg, Henry E Jr 1992 Measuring errors of measurement. In: Savage, C Wade, and Ehrlich, Philip (eds.) Philosophical and Foundational Issues in Measurement. 75-91 Lawrence Erlbaum Associates, Hillsdale NJ
Lodge, Milton 1981 Magnitude Scaling: Quantitative Measurement of Opinions. Sage, Newbury Park, CA
Luce, R D, Krantz, David H, Suppes, P, and Tversky, A 1990 Foundations of Measurement. Volume 3: Representation, axiomatization, and invariance. Academic Press, New York
Luce, R Duncan, and Tukey, John W 1964 Simultaneous conjoint measurement: A new type of fundamental measurement. Journal of Mathematical Psychology 1: 1-27
McCullagh, C Behan 1984 Justifying Historical Descriptions. Cambridge University Press, New York
Michell, Joel 1990 An Introduction to the Logic of Psychological Measurement. Lawrence Erlbaum Associates, Hillsdale NJ
Osgood, Charles, May, W H, and Miron, M S 1975 Cross-Cultural Universals of Affective Meaning. University of Illinois Press, Urbana, IL
Romney, A Kimball, Weller, Susan C, and Batchelder, William H 1986 Culture as Consensus: A Theory of Culture and Informant Accuracy. American Anthropologist 88: 313-338
Shye, Samuel (ed.) 1978 Theory Construction and Data Analysis in the Behavioral Sciences. Jossey-Bass, San Francisco
Schuman, Howard, and Presser, Stanley 1981 Questions and Answers in Attitude Surveys: Experiments on Question Form, Wording, and Context. Academic Press, New York
Shye, Samuel, Elizur, Dov, and Hoffman, Michael 1994 Introduction to Facet Theory: Content Design and Intrinsic Data Analysis in Behavioral Research. Sage, Thousand Oaks, CA
Stevens, S S 1951 Mathematics, measurement, and psychophysics. In: Stevens, S S (ed.) Handbook of Experimental Psychology. 1-49 John Wiley and Sons, New York
Suppes, P, Krantz, David H, Luce, R D, and Tversky, A 1989 Foundations of Measurement. Volume 2: Geometrical, threshold, and probabilistic representations. Academic Press, New York
Thomas, Lisa, and Heise, David R 1995 Mining Error Variance and Hitting Pay-Dirt: Discovering Systematic Variation in Social Sentiments. The Sociological Quarterly 36: 425-439
Traub, Ross E 1994 Reliability for the Social Sciences: Theory and Applications. Sage, Thousand Oaks CA
David R. Heise, Indiana University