"Measurement theory" and its mis-application to attack Range Voting

(Or: are voting methods allowed to use numbers?)

Warren D. Smith, May 2013.

Genesis

The British Association for the Advancement of Science in 1932 tasked a committee to report on "quantitative measurement of sensory events." It produced its final report in 1940. This was stimulated by the "sone scale of loudness," which purported to provide an "objective scale of [subjective] auditory sensation."

Encyclopaedia Britannica, "Sone": Loudness is a subjective characteristic of a sound (as opposed to the sound-pressure level in decibels, which is objective and directly measurable). Consequently, the sone scale of loudness is based on data obtained from subjects who were asked to judge the loudness of pure tones and noise. One sone is arbitrarily set equal to the loudness of a 1,000-hertz tone at a sound level of 40 decibels above the standard reference level (i.e., the minimum audible threshold). A sound with a loudness of four sones is one that listeners perceive to be four times as loud as the reference sound.

Such scales are crucially important for purposes such as telephony, computer speech synthesis, audio compression, etc. But one member of the BAAS committee claimed any such quantitative scale "is not merely false but in fact meaningless unless and until a meaning can be given to the concept of addition as applied to sensation." (Final report p.245.) Meanwhile other members had extremely opposite views!

You can already tell that that quote was hogwash. A counterexample is "temperature," an apparently obscure and little known concept unfamiliar to eminent members of the British Association for the Advancement of Science. Temperature is meaningful and measurable, despite the fact that the "sum of two temperatures" seems meaningless and you never do that ("what is the total temperature of this apple and this cup of coffee?").

S.S.Stevens, who'd co-invented the Sone scale, tried to resolve this controversy in his influential paper

On the Theory of Scales of Measurement, SCIENCE magazine 103,2684 (June 1946) 677-680

by proposing four "scale" notions, which he named 'nominal', 'ordinal', 'interval' and 'ratio'. [Interval and Ratio scales automatically are Ordinal also, which in turn automatically are Nominal also.] Stevens claimed that "the statistical manipulations that can be [meaningfully] applied to empirical data depend on the type of scale" – and some kinds of scales make sense in certain situations while others do not and would be meaningless. Here is Stevens' table, but we have added two additional rows to it at bottom.

| Scale name | Basic empirical operations | Mathematical group | Some "permissible" statistics |
|---|---|---|---|
| Nominal | Equality testing only | Permutation group of 1-to-1 bijections x'=f(x) among scale elements | Number of cases. Mode. Contingency correlation. |
| Ordinal | also: x<y and x>y testing | "Isotonic" group, x'=f(x) where f monotonic increasing | Median. Quantiles. Disallowed: determining the midpoint (x+y)/2 of two scale values x,y. |
| Interval | also: comparing differences | "General linear" group x'=ax+b, [a>0, b both real] | Ordinal stuff, and also: Mean. Standard deviation. Rank-order correlation. Product-moment correlation. |
| Ratio | "ordinal" operations and also: comparing ratios | "Similarity" group x'=ax, [a>0 real] | Ordinal stuff, and also: Coefficient of variation. |
| Angular (new) | Comparing pair-differences | "Modular additive" group x'=x+a mod 360, where a is any real | Ordinal stuff, and also: mean squared difference. Disallowed: multiplying by constants. |
| Range (new) | "interval" operations and also computing and comparing convex linear combinations αx+(1-α)y where 0≤α≤1; see text for more | | Ordinal stuff, and also: Weighted and unweighted means. Disallowed in general: multiplying by constants>1. |

Stevens in later work introduced a 5th kind of scale, the 'log-interval' scale, which becomes an interval scale upon taking logarithms and hence is not really an additional scale type. Its invariantive group consists of the maps x'=bx^a where a,b are positive reals.
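To make the "permissible statistics" idea concrete, here is a minimal Python sketch. The data values and the two transformation functions are arbitrary illustrative choices of ours, not anything from Stevens. It checks that the median commutes with an increasing relabeling of the scale, while the mean only commutes with maps of the form x'=ax+b.

```python
import statistics

data = [1.0, 2.0, 4.0, 8.0, 16.0]   # arbitrary illustrative scale values

def monotone(x):
    # An arbitrary strictly increasing map (a member of the "isotonic" group).
    return x ** 3

def affine(x):
    # An arbitrary member of the "general linear" group x' = a*x + b with a > 0.
    return 1.8 * x + 32.0

# Median commutes with an increasing map, which is the sense in which it is
# "permissible" on an ordinal scale.
median_commutes = (
    statistics.median([monotone(x) for x in data]) == monotone(statistics.median(data))
)

# The mean does not commute with this increasing map...
mean_commutes_monotone = (
    statistics.mean([monotone(x) for x in data]) == monotone(statistics.mean(data))
)

# ...but it does commute with affine maps (up to floating-point rounding).
mean_commutes_affine = (
    abs(statistics.mean([affine(x) for x in data]) - affine(statistics.mean(data))) < 1e-9
)

print(median_commutes, mean_commutes_monotone, mean_commutes_affine)
```

A single data set proves nothing in general, of course; the sketch only illustrates what "invariance under the group" is supposed to mean.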

A Criticism

We here want to argue that Stevens missed at least one kind of scale which we have christened 'range' and added to the bottom of Stevens' table. (Actually, he also missed 'angular' scale, which I adjoined just to make the point that range was not the only one he missed.) All we need to do to show this is to name some quantity (such as "temperature" above) which is meaningful and measurable, but does not fit into Stevens' 4 categories, however does fit into our new category.

To do this, consider "the probability P that some bit-outputting randomized procedure A, outputs '0'." Obviously, for any given A, we can measure P(A) to arbitrary accuracy. So it is measurable and meaningful. Also obviously, 0≤P≤1 always. This tells us that P is not an interval or ratio scale because those necessarily involve all real numbers and all positive real numbers as scale elements, respectively. Now further, a convex combination S=αQ+(1-α)R of two probabilities Q and R (where 0≤α≤1) has a natural meaning, since if Q=P(A) and R=P(B) then

S = P(use A with probability=α and B with probability=1-α).

So it is simply wrong to contend that Stevens' four scales are the only possibilities.

Means, and more generally weighted means, have meaning for Range scales (since they can be computed using convex combinations only).
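The convex-combination claim can be checked numerically. The following is only a sketch under assumptions of ours: the helper names (`biased_bit`, `mixture`, `estimate_p0`), the particular values of Q, R and α, the seed, and the trial count are all hypothetical choices for illustration.

```python
import random

def estimate_p0(procedure, trials, rng):
    """Estimate P(procedure outputs 0) by repeated sampling."""
    zeros = sum(1 for _ in range(trials) if procedure(rng) == 0)
    return zeros / trials

def biased_bit(p0):
    """Hypothetical bit-outputting procedure: outputs 0 with probability p0."""
    return lambda rng: 0 if rng.random() < p0 else 1

def mixture(proc_a, proc_b, alpha):
    """Use proc_a with probability alpha, otherwise use proc_b."""
    return lambda rng: proc_a(rng) if rng.random() < alpha else proc_b(rng)

rng = random.Random(2013)            # fixed seed so the run is repeatable
Q, R, alpha = 0.2, 0.7, 0.25         # arbitrary illustrative values
A, B = biased_bit(Q), biased_bit(R)

S_estimate = estimate_p0(mixture(A, B, alpha), 200_000, rng)
S_expected = alpha * Q + (1 - alpha) * R

print(S_estimate, S_expected)
```

The estimate should land close to αQ+(1-α)R, with the usual sampling error of a Monte Carlo run.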

Range scales based on the real interval [0,1] – since any finite interval can be regarded as [0,1] after a renormalization, we might as well only consider it – could permit even more natural/meaningful operations. For example, consider the operation x'=1-x. That arises naturally via P(B)=1-P(A) where B≡"use A except complement its output bit." If this is permissible then Stevens' "invariantive group" would include x'=x and x'=1-x.

For another example, consider the squaring operation x'=x^2. In our probabilistic algorithm example, replace method A with "run A twice (with independent randomness) and output the logical OR of the two outputs." The probability that this outputs 0 is P(A)^2. More generally we can multiply two scale elements: consider C≡"run both A and B (with independent randomness) and output the logical OR of the two outputs." Then P(C)=P(A)·P(B).
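These complement and product identities can be verified exactly for small toy procedures by enumerating every random-bit string. This is a sketch only: `proc_a` and `proc_b` are hypothetical procedures that we made up, each reading a few fair coin flips.

```python
from fractions import Fraction
from itertools import product

def p0(procedure, n_bits):
    """Exact P(output == 0) for a procedure that consumes n_bits fair coin flips."""
    outcomes = [procedure(bits) for bits in product((0, 1), repeat=n_bits)]
    return Fraction(outcomes.count(0), len(outcomes))

def proc_a(bits):        # hypothetical A: uses 2 bits, outputs 0 only if both are 0
    return bits[0] | bits[1]

def proc_b(bits):        # hypothetical B: uses 1 bit and simply outputs it
    return bits[0]

def complement_a(bits):  # "use A except complement its output bit"
    return 1 - proc_a(bits)

def a_twice_or(bits):    # run A twice on independent bits, OR the two outputs
    return proc_a(bits[:2]) | proc_a(bits[2:4])

def a_or_b(bits):        # run A and B on independent bits, OR the two outputs
    return proc_a(bits[:2]) | proc_b(bits[2:3])

P_A, P_B = p0(proc_a, 2), p0(proc_b, 1)

print(p0(complement_a, 2) == 1 - P_A)    # x' = 1 - x
print(p0(a_twice_or, 4) == P_A ** 2)     # x' = x^2
print(p0(a_or_b, 3) == P_A * P_B)        # product of two scale elements
```

Exact fractions are used so that the equalities are tested without floating-point tolerance.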

Subsequent – altogether too much – philosophical mumbo jumbo and pseudoscience

David H. Krantz, R. Duncan Luce, Amos Tversky, and Patrick Suppes in their book

Foundations of Measurement volume I (Academic Press 1971, 577 pages; at least 2 more volumes came out later)

set themselves the goal of developing a mathematical theory of this crud. Which sometimes could be very complicated. But still later Louis Narens claimed that their theory "has never been satisfactorily justified" hence proposed in his own book Measurement Theory (MIT Press 1985) to redo the whole area himself axiomatically. But Narens' entire book, as far as I can see, is 100% divorced from experiment. It never mentions even a single experiment. It is entirely pure mathematics resting on foundations of set theory. The Krantz et al book is also pretty divorced from experiment, but not 100%. More like 97%. It includes this obnoxious quote on p.419:

In summary... expected-utility theory is neither entirely adequate nor entirely inadequate... it does not provide a complete description of individual decision making under uncertainty [apparently they here are referring to experiments in which confused human subjects made choices which are clearly not utility-optimal for any utility function]... but cannot be discounted as a descriptive model... the conscious attempt to behave rationally is... a significant fact about man.

Even later, Jose A. Diez produced a 2-part paper

Jose A. Diez: A Hundred Years of Numbers. An Historical Introduction to Measurement Theory 1887-1990, part I and part II, Stud. History Philosophy of Science 28,1 (1997) 167-185 and 28,2 (1997) 237-265

purporting to review the whole subject, since he claims it is now a "mature theory."

I have a low opinion of those Narens and Krantz tomes and this entire field. It involves a tremendous amount of formalism (over 1000 pages worth) actually accomplishing nearly nothing, combined with an absolutely appalling ignorance and apathy about science generally and the most important developments in the real world related to measurement.

The "range scale" is certainly not the only omission from Stevens' list. For example, apparently he had never heard of Einstein, whose (apparently little-known and largely irrelevant) contributions "special and general relativity" introduced new kinds of "invariantive groups" (as Stevens calls them) into science. (Einstein is never mentioned by Stevens, never mentioned by Diez, and only mentioned once in FoM and that only in passing. Even more incredibly, essentially the same is true about quantum mechanics!) Thus, e.g., "speed" is another kind of measurable meaningful quantity that is bounded (by the speed of light). And there are still more things in heaven and earth that are undreamt of in their philosophy. For example it is nowadays well known that "mass density" is upper-bounded by the Planck density, while "length" is unmeasurable below the Planck length. ("Planck" is yet another scientist whose "minor" contributions to measurement are unmentioned in Diez's review!)

And apparently none of these works proposes our "range scale" concept. Indeed Narens, with Alper, even attempted to prove it does not exist! Specifically, Narens later produced yet more books

Theories of Meaningfulness, Lawrence Erlbaum Associates, 2002
Introduction to the Theories of Measurement and Meaningfulness and the Use of Invariance in Science, Lawrence Erlbaum Associates, 2007

and in Narens' 2002 book he remarked on pages 53-55 (we quote, with slight editing)

It is natural to ask what kinds of scales there are up to conjugation [i.e. equivalence] and whether the ones listed by Stevens are the only interesting ones. Narens (1981a, 1981b) developed a classification system to investigate this issue and his results together with those of Alper(1983, 1987) indicate that Stevens' classification covers most – but not all – of the theoretically interesting... scale types. Their main result is...
Theorem 2.3.1: Suppose S is a homogeneous, finitely point unique, ordered scale with real image. Then S is equivalent to a "super-ratio scale," i.e. whose invariantive group G obeys R⊆G⊆L where L is the invariantive group for the log-interval scales while R is the invariantive group for the ratio scales.

Note that neither our "range scales" nor our "angular scales" are permitted by the Narens-Alper classification theorem – Narens and Alper cleverly created enough artificial conditions to exclude them. Oops.

Narens then goes on to discuss (pp. 60-68) Luce's "possible psychophysical laws" [R.D.Luce: Psychological Review 66,2 (March 1959) 81-95] which again (we see from Narens' tables p.62) fail to admit range scales!

...but Narens thankfully comes to the conclusion Luce had over-reached – e.g. "Rozeboom's criticism" p.62 pointed out the slight problem that the law of radioactive decay would according to Luce be "impossible." We might also mention that Luce apparently would outlaw trigonometry. On pp.81-86 Narens discusses the alleged improvement/update of Luce's "possible laws" by Roberts & Rosenbaum 1986, but according to Narens' tables p.82 and 85 these too disallow range scales, special relativistic velocity, etc. Finally, pp.68-81 discusses Falmagne & Narens' "meaningful quantitative laws." I suspect range scales would not be permitted in these either, and in any case Narens' discussion does not mention them.

We disembark this runaway train here.

What does this have to do with Range voting?

The present essay was stimulated by an insanely wrong-headed attack on range voting – actually, incredibly, an attack on every voting method that uses numbers!! – by M.Balinski & R.Laraki. They prefer an alternative and more complicated, but related voting system – which they invented and called majority judgment (MJ) – based not on "greatest average score wins" but rather on "greatest median score wins, with an additional tie-breaking scheme." MJ also uses, not a numerical score-set, but rather a set of 6 verbal scores

Excellent / Bien / Assez Bien / Passable / Insuffisant / a Rejeter.

We quote the attack from Balinski & Laraki's paper

Election by Majority Judgment: Experimental Evidence, pages 13-54 in Bernard Dolez, Bernard Grofman, Annie Laurent: Studies In Public Choice: In Situ and Laboratory Experiments On Electoral Law Reform: French Presidential Elections, Springer 2011.

We have numbered their paragraphs for later reference (verbatim quotes, but sometimes with comments by me in [square brackets]):

1. Is it reasonable to use numerical scales in voting? The answer is a resounding no, for several reasons:

2. The numbers mean nothing unless they are defined: proposals to use weights give them no definition. Their only real "meaning" is found in their strategic use. This induces comparisons, which immediately leads to Arrow's paradox... E.g. with these actual ballot instructions

Give a grade to each of the twelve candidates: either 0, or 1, or 2 (2 the best grade, 0 the worst). To do so, place a cross in the corresponding box etc. The candidate elected with [this] method is the one who receives the highest number of points.

3. nothing is said concerning the meaning of 0, 1, or 2. The numbers induce relative, so strategic, behavior. Other numbers could have been given. For example, with {-1,0 +1} mathematically there is no difference, but were these numbers used the behavior of the voters would almost surely have been different. [In fact, this experiment later was tried and voter behavior was significantly different.]

4. When numbers are used, they may well not be used in the same way at all: when a 0-100 scale is used, some voters may view 80 to be an excellent grade, others may see it as merely middling.

5. Even if the numbers did provide a common language, they will almost certainly not be a proper interval measure [in the sense of Stevens – it is here that Balinski & Laraki invoke "measurement theory"] – that depends on who the candidates are and how the voters give their grades. For example, the 0-20 scale used in France is a common language, but an 18, 19, or 20 is unheard of in philosophy or literature, so the scale is not an interval measure. Once the distribution of the grades is known – after many elections (or many examinations) – it is possible to determine whether the scale is an interval measure and, if not, to correct it (as did the Danes). But then it is too late, since the weights must be announced ahead of time.

6. Even if it turned out that the scale did approximate an interval measure, the procedure depends on irrelevant alternatives, [hence] is subject to Arrow's paradox: for if one or several candidates drop out, the distribution of the remaining grades will almost certainly be different, so the scale is no longer an interval measure. [For example, in the French 2007 presidential election, the counts of the number of times each of their 6 verbal scores was used, changed considerably when all scores for the 8 "unimportant" among the 12 candidates were removed.]

Balinski and Laraki also attack approval voting. After noting that in their MJ experiment in Orsay France 2007, approval voting would have elected Bayrou if scores≥"assez bien" were approved, but Sarkozy if only scores≥"bien" were, they complained:

7. Approval voting is extremely sensitive to the question posed. Imagine what would have happened if the threshold had been either higher or lower. This shows that approval voting's two-word language is insufficient and arbitrary.

OK, we've had enough. It is time to respond to these attacks.

1a. It is rather strange to see the mathematician Balinski rejecting the use of "numbers" as "meaningless" since "undefined." Numbers are certainly more concise, and I would think better defined (especially for voters some of whom might be poor French speakers or come from different cultures), than essentially all adjectives such as "excellent," "passable," and "assez bien." Indeed, if it were me voting, I would have thought "assez bien" and "passable" meant exactly the same thing, even though Balinski & Laraki think the first is clearly superior! No such difficulty happens with "3>2"; that is agreed by everybody from every culture.

Several French→English dictionaries (e.g. 1982 Harrap's) say "assez bien" means "good enough" and "passable"→"passable." The 2013 Merriam-Webster English→English dictionary says "passable"="good enough."
We also warn the reader that "bien assez" has a third meaning ("quite enough") different from "assez bien." After consulting several French speakers I agree with Balinski & Laraki that "assez bien" is commonly regarded as superior to "passable"; I am just pointing out its non-obviousness to a non-native speaker like me, even with aid from dictionaries.

Indeed, the question of quantifying the strength of adjectives has been examined by the psychometricians Jones & Thurstone 1955. They examined the semantic meanings, to respondents, of 51 scale-point descriptors using numerical scales, and subsequently presented a listing of words and phrases ranging from those expressing "greatest like" to those conveying "greatest dislike." That is, they succeeded in constructing a "continuum of meaning" ranging between the end points "best of all" to its extreme opposite "despise" (p.33), providing the mean scale value and standard deviation of each of the tested words and phrases. Myers & Warner 1968 and later teams such as Bartram & Yelding 1973, Vidali 1975, Wildt & Mazis 1978, and Braunsberger & Gates 2009 all redid the same sort of study over again. They declared a high degree of success in the sense that their numbers were "surprisingly consistent among very diverse groups of people." However, they were not completely consistent. Thus Myers & Warner 1968 showed there are statistically significant differences between the quantified meanings of various words, as perceived by housewives versus students: "slightly poor" was rated 8.48±1.83 (mean±standard deviation) by 25 undergraduate students but 5.92±1.96 by 25 housewives, which is a 4.77σ discrepancy, i.e. taken by itself would have 99.9999% confidence of being genuine. But actually since this was "cherrypicked" as one of the largest among about 100-200 such discrepancies, the confidence really should be only about 99.98% that it is real.

P.Bartram & D.Yelding: The development of an empirical method of selecting phrases used in verbal rating scales, Journal of the Market Research Society 15 (1973) 151-156.
Karin Braunsberger & Roger Gates: Developing inventories for satisfaction and Likert scales in a service environment, J. of Services Marketing 23,4 (2009) 219-225.
L.V.Jones & L.L.Thurstone: The psychophysics of semantics: an experimental investigation, Journal of Applied Psychology 39,1 (1955) 31-36.
J.H.Myers & W.Gregory Warner: Semantic properties of selected evaluation adjectives, J. Marketing Research 5,4 (November 1968) 409-412.
J.J.Vidali: Context effects on scaled evaluatory adjective meaning, J. Market Research Society 17,1 (1975) 21-25.
A.R.Wildt & M.B.Mazis: Determinants of scale response: label versus position, J. Marketing Research 15,2 (May 1978) 261-267.
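The significance figures quoted above for the students-versus-housewives discrepancy can be reproduced with a few lines of Python. This is a sketch under assumptions that we state explicitly: a normal approximation for the difference of two independent sample means, and (for the "cherrypicking" correction) independence among the k discrepancies from which the largest was selected. The function names are ours.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z-statistic for the difference of two independent sample means."""
    standard_error = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    return (mean1 - mean2) / standard_error

def two_sided_p(z):
    """Two-sided tail probability under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

def p_after_selection(p, k):
    """Chance that at least one of k independent comparisons looks this extreme."""
    return 1 - (1 - p) ** k

# "slightly poor": 25 students at 8.48±1.83 versus 25 housewives at 5.92±1.96
z = two_sample_z(8.48, 1.83, 25, 5.92, 1.96, 25)
p_single = two_sided_p(z)

print(f"z = {z:.2f} sigma")
print(f"confidence taken by itself: {100 * (1 - p_single):.4f}%")
for k in (100, 200):
    print(f"confidence after picking the largest of {k}: "
          f"{100 * (1 - p_after_selection(p_single, k)):.2f}%")
```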

Is it democratically fair to partially disenfranchise (say) housewives because they interpret some word differently? Or is it more fair to provide a numerical scale whose meaning is unambiguously defined by the rules of the voting system?

1b. Why must vote-scores have some sort of Balinski-approved "meaning" at all? Why must they be "measurable" according to Stevens' notions at all? Voting systems input "votes" (which are information packets, e.g. bitstrings) and output a winner. Full stop.

Voting is an exercise of power, not a sentiment. The votes need not be "measured" à la Stevens, they merely must be "transmitted." The only true meaning of a vote, in this general setting, is defined – and wholly, completely, and mathematically defined – by the rules the voting system uses to deduce the winner from the votes.

1c. However, there could be additional (untrue) "meaning" carried around by humans as psychological baggage. For example, within the MJ voting system, the true meaning of "assez bien" is defined solely by the MJ winner-determining-algorithm. The untrue meaning that "assez bien"="passable" is carried around by me, and by various dictionaries, as psychological baggage. Naturally, we wish that the 'baggage' and 'true' meanings of votes should correspond as closely as possible for as many people as possible! I also contend (and apparently Balinski & Laraki agree, since they approvingly use Stevensian "measurement theory") that more humans would be happier the more Stevens-like properties votes have. I.e. while it is not necessary that votes viewed as "quality measures" be addable, multipliable, </=/>-comparable, or whatever, the more of that kind of stuff approximately-works, the better a lot of humans will probably feel about it, and the better the voting system will probably work in practice.

1d. For standard greatest-average-wins range voting using numerical scores, the baggage/true meanings-correspondence seems close, and the "range" Stevensian properties exactly work. In contrast, Balinski & Laraki contend that their 6 verbal MJ scores merely form an "ordinal scale," enjoying a strictly smaller set of Stevensian properties. I.e. with MJ only "=", ">" and "<" have "meaning." With MJ, taking the midpoint (x+y)/2 of two scores "has no meaning" unlike in standard range voting (using the real interval [0,1] as the score-set) in which the meaning of m=(x+y)/2 is precisely this: two votes m are equivalent to one x and one y vote.

Now Balinski & Laraki's contention that MJ scores form an "ordinal scale" is substantially justified by the fact that the MJ voting system is monotonic, i.e. a voter increasing her score for Sarkozy can help, but cannot hurt, Sarkozy's winning chances. Thus ">" has its naively expected meaning. Standard range voting also is monotonic, hence also forms an ordinal scale, and my stronger contention that range voting's scores form a range scale is justified by the "true meaning" of convex combination for any rational α with 0≤α≤1: if α=a/(a+b) then m=αx+[1-α]y has the "meaning" that a+b votes m are equivalent to a votes x and b votes y. In a nonmonotonic voting system like instant runoff voting (IRV), this kind of "ordinal meaning" of ">" is more dubious and arguably (which I think Stevens would have contended) absent.
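The "true meaning" of a convex combination under sum-based (equivalently average-based) range voting can be checked with exact arithmetic. In this sketch the scores x and y, the counts a and b, and the list of other ballots are arbitrary values chosen by us. We take α=a/(a+b), so that a votes of x together with b votes of y should contribute exactly as much to a candidate's total as a+b votes of m.

```python
from fractions import Fraction

def total_score(scores):
    """Sum of one candidate's scores; range voting ranks candidates by this total."""
    return sum(scores)

x, y = Fraction(1, 5), Fraction(9, 10)   # two arbitrary scores in [0, 1]
a, b = 3, 2                              # a votes of x and b votes of y
alpha = Fraction(a, a + b)
m = alpha * x + (1 - alpha) * y          # the convex combination

others = [Fraction(1, 2), Fraction(0), Fraction(1)]  # hypothetical other ballots

with_mixed = total_score(others + [x] * a + [y] * b)
with_m_only = total_score(others + [m] * (a + b))
print(with_mixed == with_m_only)

# Special case alpha = 1/2: two votes at the midpoint equal one x plus one y.
midpoint = (x + y) / 2
print(total_score([midpoint, midpoint]) == total_score([x, y]))
```

Because the candidate's total is identical either way, no election outcome can distinguish the two sets of ballots.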

1e. But in order for a vote to have "meaning" in the eyes of a voter it also would seem desirable for the "participation property" to hold: you, by casting an honest vote, cannot worsen the election winner (with your view of "worsen") versus if you had not voted at all. Without this property, your vote could be "less meaningful" than nothing! And MJ fails the participation property while average-based range voting obeys it! (IRV also fails it.)
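Here is a small constructed example of a participation failure under a median rule. It is only a sketch: numeric grades 0-5 stand in for ordered verbal grades, the lower median is used when the number of ballots is even, and the five ballots are hypothetical ones we invented. No ties arise, so MJ's additional tie-breaking scheme is not modeled.

```python
import statistics

def median_winner(ballots):
    """Winner by highest lower-median score; each ballot maps candidate -> score."""
    medians = {c: statistics.median_low([b[c] for b in ballots]) for c in ballots[0]}
    return max(medians, key=medians.get), medians

def mean_winner(ballots):
    """Winner by highest average score (range voting)."""
    means = {c: statistics.mean([b[c] for b in ballots]) for c in ballots[0]}
    return max(means, key=means.get), means

# Hypothetical ballots on a 0..5 scale.
before = [
    {"X": 2, "Y": 1},
    {"X": 2, "Y": 1},
    {"X": 2, "Y": 4},
    {"X": 5, "Y": 4},
]
newcomer = {"X": 4, "Y": 3}   # honestly prefers X to Y
after = before + [newcomer]

print("median rule:", median_winner(before)[0], "->", median_winner(after)[0])
print("mean rule:  ", mean_winner(before)[0], "->", mean_winner(after)[0])
```

In this example X wins under the median rule if the newcomer stays home, but Y wins if she casts her honest X-over-Y ballot. Under the mean rule X wins either way. One example shows that the median rule can fail participation; it does not by itself prove that the mean rule never does.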

1-SUMMARY: At this point it is clear that standard range voting scores have strictly more meaning than MJ scores both according to Balinski & Laraki's very own preferred bludgeon – measurement theory – as well as the participation criterion. In short, their whole attack was exactly wrong, is refuted, shows the opposite of what they thought, and lies completely in ruins.

But wait, there is more.

Contradiction 2 versus 3: Balinski & Laraki began their attack by immediately contradicting themselves. They assert the only true meaning of numerical voting scores lies in their strategic use, i.e. as defined by the rules the voting system uses to elect winners. (I devoutly agree, for general voting systems and whether or not the votes are numerical.) And they then correctly claim voter behavior empirically differs when {-1, 0, +1} and {0, 1, 2} score sets are employed. This contradicts their original assertion. (The resolution of this paradox is, as we said, the presence of psychological baggage distorting "true meaning," but Balinski & Laraki simply left it there as an unresolved self-contradiction.)

Non-logical 4: This exact "argument" can be used against Balinski/Laraki MJ, not for it. So it has no logical impact at all. E.g. I could equally well have said (using their exact same sentence but with a few items swapped) "When adjectives are used, they may well not be used in the same way at all: when an 'Excellent'↔'A Rejeter' scale is used, some voters may view "Bien" to be a 90 while others may see it as merely 60."

Just wrong 5: Balinski & Laraki here implicitly assume (or act as though) the only possible Stevensian scales are the four originally proposed by Stevens. As we have shown there is at least one further such notion, the "range" type scale. (Also the question of whether "18 is unheard of in philosophy or literature" has absolutely nothing to do with it – Stevens in his paper nowhere mentions philosophical or literature use as a criterion – plus we anyhow are rather surprised to hear that 18 has never been used in literature.)

Absurd garbage 6: Both standard range voting and MJ evade Arrow's "impossibility theorem." But the crux of the matter is "Arrow's IIA condition" which says, essentially, that removing candidates should leave the voting system's output-ordering of the remaining candidates unaffected. If, after we remove some candidates we allow voters to change their ballots to make them more strategic, then the winner could change. In that case, we still get "Arrow's paradox" with both MJ and average-based range voting. For example, in a 3-candidate race where a voter scores Sarkozy="A Rejeter", Bayrou="Insuffisant", and Royal="Excellent", after Royal drops out, that voter, if allowed to do so, might modify her vote to still score Sarkozy="A Rejeter" but Bayrou="Excellent." Balinski & Laraki seem to think that due to the magic-meaning property of their magic words, no voter would ever do that, whereas, if we instead were using average-based range voting with score set {0,1,2,3,4,5} with Sarkozy=0, Bayrou=1, Royal=5, then voters would change to Bayrou=5. That contention is absurd garbage.
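The distinction between "ballots held fixed" and "ballots re-strategized" can be illustrated for range voting on the {0,...,5} score set. This is a sketch with three hypothetical ballots that we invented. The `restretch` function is only one assumed model of strategic re-scoring, in which each voter stretches her remaining scores to span the full range.

```python
def range_ranking(ballots, candidates):
    """Candidates sorted by total score, best first (ties broken by name)."""
    totals = {c: sum(b[c] for b in ballots) for c in candidates}
    return sorted(candidates, key=lambda c: (-totals[c], c))

def restretch(ballot, candidates, top=5):
    """Assumed strategic re-scoring: stretch the remaining scores to span 0..top."""
    lo = min(ballot[c] for c in candidates)
    hi = max(ballot[c] for c in candidates)
    if hi == lo:
        return {c: ballot[c] for c in candidates}
    return {c: top * (ballot[c] - lo) / (hi - lo) for c in candidates}

# Hypothetical ballots, not real election data.
ballots = [
    {"Sarkozy": 0, "Bayrou": 3, "Royal": 5},
    {"Sarkozy": 2, "Bayrou": 1, "Royal": 5},
    {"Sarkozy": 2, "Bayrou": 1, "Royal": 5},
]
everyone = ["Bayrou", "Royal", "Sarkozy"]
remaining = ["Bayrou", "Sarkozy"]        # Royal drops out

full = range_ranking(ballots, everyone)
fixed = range_ranking(ballots, remaining)
restrategized = range_ranking([restretch(b, remaining) for b in ballots], remaining)

print(full)            # ordering with all three candidates
print(fixed)           # Royal removed, ballots unchanged
print(restrategized)   # Royal removed, voters re-stretch their scores
```

With the ballots held fixed, removing Royal leaves Bayrou ahead of Sarkozy, exactly as before. Once voters are allowed to re-score, the order of the remaining two flips. The same re-scoring freedom exists whether the grades are numbers or words.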

Strange 7: It is quite odd that Balinski & Laraki attack approval voting but not MJ, because approval voting is the special case of MJ voting when there only are two allowed scores "approve" and "disapprove"! (It also coincides with standard range voting using score set {0,1}.) They complain that the election results could be sensitive to the baggage-meaning of the word "approve." Well, of course! And in MJ, the election results would similarly be sensitive to the baggage-meaning of any and all the 6 Balinski-approved magic words. (For example, if "assez bien", "passable", "insuffisant" and "a rejeter" all meant the same, then the winner in their Orsay MJ study would change.) So what? Apparently, Balinski & Laraki believe that some words, such as "assez bien," have magic meanings, while others, such as "approve" and "3 on a 0-5 scale," do not.

Oh.

I would say that unfortunately humans can and do attach strange time-varying culture-varying baggage meanings to words, albeit I would guess that "3" will probably remain comparatively unambiguous across all cultures at all future times. The question of which verbal or numerical scales work the best in the presence of such stresses (and how badly they are hurt) is an experimental question, which simply is not answerable by abstract mathematical "theories of measurement" such as Narens 2002, and not by Balinski and Laraki by Proclamation either.

For example, {-1, 0, +1} score voting was used for centuries to elect the Venetian doge. It seems to have worked well, and we are unaware of any complaints that their votes all were meaningless and undefined. For another example, the "meaningless" {0,1,2} score voting and approval voting systems elected the president France wanted – Bayrou – in 2007 (as did {0,1,2,3,4,5} score voting using Balinski & Laraki's own vote data), while the MJ system extrapolated from Balinski & Laraki's data to all France (here I am using their own extrapolation) instead elected Sarkozy. How can it be that two meaningless voting systems outperformed their MJ system in their own experiment?!?!

A tremendous amount of experimental evidence and literature exists about different verbal and numerical scales (e.g. Balinski & Laraki's "too late" claim in their 5 is also false), which Balinski & Laraki almost entirely ignored. We shall discuss it, but on a different web page.

Summary

Stevensian "measurement theory" is clearly incomplete, wasn't needed, and to the extent we do complete and apply it, shows precisely the opposite of Balinski & Laraki's contentions. Balinski and Laraki's entire attack on all voting methods that use numbers (e.g. average-based range voting) has been completely busted from start to finish.

