Intelligence: The History of Psychometrics
Sebastian A.B.
2014-10-31 00:00:00

According to an authoritative and highly cited (1839 times since 1996) decree from a task force assembled by the American Psychologist: 1




Intelligence is a very general mental capability that, among other things, involves the ability to reason, plan, solve problems, think abstractly, comprehend complex ideas, learn quickly, and learn from experience. It is not merely book-learning, a narrow academic skill, or test-taking smarts. Rather, it reflects a broader and deeper capability for comprehending our surroundings – "catching on," "making sense" of things, or "figuring out" what to do.




The early speculative metaphysical sense of intelligence (thinkers of classical antiquity had the notion of nous, an essential ‘good sense’ derived from the cosmos) came under the microscope, like many psychosocial phenomena, in the late 19th century. A crude proto-positivism (Francis Galton) was followed by polished positivism (Charles Spearman and Karl Pearson), then the heyday of eugenics in the early 20th century, followed by a reaction in the 1960s. Shortly thereafter, the American Journal of Psychology (AJP) lost enthusiasm for psychometrics and other journals evolved to fill the niche. Today, the field is dominated by neurophysiology, with genome-wide association studies (GWAS) poised to join it. Specialists expect neuroscience and genomics to generate strong evidence one way or the other, but until then one can safely assume that the true degree of malleability or heritability of intelligence will remain controversial.

The apparent consensus among “mainstream” psychometricians is that IQ (understood as a measure of Spearman’s g factor) is predictive of academic performance, health, divorce probability, and happiness, and is stable over long periods of time ("reliable"). 2 Critics contend that measures of g are culturally influenced and not as predictive as the psychometricians claim. Neurophysiologists like Adrian Owen argue that the most popular tests (Raven’s Progressive Matrices and the Wechsler Adult Intelligence Scale) wrongly condense multiple, distinct mental aptitudes into one IQ score (his team found three distinct neural systems for short-term memory, reasoning, and verbal ability). 3 In this essay, however, the author will be less concerned with the fact of the matter and more interested in the time course of shifting sentiments within the field.



The Positive Foundation

Statistician, anthropologist, eugenicist, and cousin of Darwin, Francis Galton was the forefather of the scientific study of intellect. First a prodigy and later a polymath, Galton was particularly skilled with numbers – always counting and measuring. He even derived a series of equations for the perfect cup of tea, based on the temperature of the water and volume of tea. He also calculated whether all of the world’s gold would fit in his living room (it would). It was odd, then, that Hereditary Genius, published in 1869, was more like anthropological kinship algebra than a Comtean (later Durkheimian) statistical positivism.

Granted, there was some use of the Laplace-Gauss distribution (or what would come to be called the Normal distribution) in analyzing 200 scores on the Cambridge Mathematical Tripos and 72 civil service exams – finding a bell-shaped curve. But since twin studies did not exist back then, Galton hermeneutically delved into the biographical records of “eminent men” in Britain. He speculated a fair amount, including on widely held stereotypes like:




There is the fact that men who leave their mark on the world are very often those who, being gifted and full of nervous power, are at the same time haunted and driven by a dominant idea, and are therefore within a measurable distance of insanity. This weakness will probably betray itself occasionally in disadvantageous forms among their descendants. Some of these will be eccentric, others feeble-minded, others nervous, and some may be downright lunatics. 4




Most notably, Galton found that eminence declined as relations became more distant – the grandson of a great scholar was much less likely to achieve a similar rank than his father or uncle. Although Galton formally invented the notion of “regression to the mean,” the stochasticity of social tumult seems to have escaped him – as did the undoubtedly critical socioeconomic component of social ranking (i.e., 19th century Britain was not exactly a pure, perfectly competitive Walrasian meritocracy). Galton did, in this work, invent the forerunner to statistical hypothesis-testing and paved the way for an apparently more scientific methodology.





Charles Spearman

The two most highly cited articles in the American Journal of Psychology over the last three years have been General Intelligence Objectively Determined and Measured and The Proof and Measurement of Association Between Two Things, both published in 1904 by one Charles Spearman, working under famed ‘frog leg-fascinated German’ Wilhelm Wundt in the Leipzig lab. The articles would have a huge impact on statistics and psychometrics. Spearman claimed he could isolate a measure of raw intelligence ("the general factor") using his new method of factor analysis. This kernel of wit would later be called Spearman's g or fluid intelligence in the Cattell-Horn-Carroll theory.

Spearman sought to determine the degree to which a "hidden underlying cause" or general factor accounts for the shared variance between two observed quantities. Spearman's own explanation is actually less opaque than even more recent definitions:




"Another-theoretically far more valuable [than Galton's r squared correlation]-property may conceivably attach to one among the possible systems of values expressing the correlation; this is, that a measure might be afforded of the hidden underlying cause of the variations. Suppose, for example, that A and B both derive their money from variable dividends and each gets 1/x of his total from some source common to both of them. […]



Evidently, A and B need not necessarily derive exactly the same proportion of their incomes from the common source; A might get his 0.20 while B got some totally different share; in which case, it will be found that the correlation is always the geometrical mean between the two shares. Let B be induced to put all his income into the common fund, then A need only put in 0.20² = 0.04, to maintain the same correlation as before; since the geometrical mean between 0.04 and 1 is equal to 0.20.” (Proof and Measurement, 74)




In other words, the correlation between two observable variables reflects the degree to which they move together as a result of some hidden variable – Spearman shows it equals the geometric mean of the shares each variable draws from the common source. This is now known as latent variable analysis – of which factor analysis is only the first iteration (along with principal component analysis). In the final section of Proof, Spearman demonstrates a masterclass in how to make enemies, criticizing all of the prior models, including Karl Pearson’s – and Pearson responded with a blistering rebuttal the same year. Pearson, as a statistician, was in another league entirely. 5 Regardless, the mathematical significance of Spearman’s work cannot be overstated – the conclusions with regard to intelligence are more tenuous.
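To make the arithmetic concrete, here is a minimal simulation sketch (the shares and sample size are made up for illustration, not taken from Spearman): when two observed variables each draw part of their variance from a common hidden source, their correlation comes out near the geometric mean of the two shares.

    import numpy as np

    # Illustrative sketch of Spearman's income example: A and B each draw a
    # fraction ("share") of their variance from a common hidden source, with
    # the remainder being independent noise.
    rng = np.random.default_rng(42)
    n = 200_000
    share_a, share_b = 0.20, 0.50

    common = rng.normal(size=n)  # the "hidden underlying cause"
    a = np.sqrt(share_a) * common + np.sqrt(1 - share_a) * rng.normal(size=n)
    b = np.sqrt(share_b) * common + np.sqrt(1 - share_b) * rng.normal(size=n)

    observed_r = np.corrcoef(a, b)[0, 1]
    print(observed_r)                    # close to 0.316
    print(np.sqrt(share_a * share_b))    # sqrt(0.20 * 0.50) ≈ 0.316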

Spearman speculated that g tapped into a "mental energy" which stemmed from a neurophysiological source. He experimentally determined how well various cognitive abilities correlated with one another. In the example of the positive manifold below, the subject that correlates most strongly with all of the others is Classics, followed by French, etc. and is considered the most “g loaded.”

 





(Adapted from General Intelligence, 275)

Spearman’s methods have been adapted in contemporary studies to specific cognitive tasks. Below are subtest intercorrelations in a sample of Scottish subjects who completed the Wechsler battery. The subtests are Vocabulary, Similarities, Information, Comprehension, Picture arrangement, Block design, Arithmetic, Picture completion, Digit span, Object assembly, and Digit symbol. The bottom row shows the g loadings of each subtest. Adapted from Chabris 2007. 6
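As a rough sketch of how a g loading falls out of such a matrix, the snippet below takes the first principal component of a small, made-up correlation matrix (values loosely in the spirit of Spearman's school-subject table, not the actual Wechsler data) and reads off each variable's loading on the dominant factor.

    import numpy as np

    # Made-up correlation matrix illustrating a positive manifold.
    subjects = ["Classics", "French", "English", "Math", "Music"]
    R = np.array([
        [1.00, 0.83, 0.78, 0.70, 0.63],
        [0.83, 1.00, 0.67, 0.67, 0.57],
        [0.78, 0.67, 1.00, 0.64, 0.51],
        [0.70, 0.67, 0.64, 1.00, 0.51],
        [0.63, 0.57, 0.51, 0.51, 1.00],
    ])

    # First principal component: each variable's loading on the dominant
    # factor serves as a crude stand-in for its "g loading".
    eigvals, eigvecs = np.linalg.eigh(R)           # eigenvalues in ascending order
    loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])
    loadings *= np.sign(loadings.sum())            # fix the arbitrary sign

    for name, loading in zip(subjects, loadings):
        print(f"{name:10s} {loading:.2f}")

The subject with the largest loading (here Classics, by construction) is the one that correlates most strongly with all the others – the sense in which a test or subject is said to be "g loaded."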





Early Sentiments

According to Susanne Langer, in her Philosophy in a New Key, certain ideas overwhelm the public mind with such force that the view is proffered as the ultimate answer to all remaining questions. Positivism was just such an ideology at the turn of the 20th century. Psychologists were just beginning to lament the meagerness of their statistical results, with Spearman writing:




When we without bias consider the whole actual fruit so far gathered from this science -- which at the outset seemed to promise an almost unlimited harvest-- we can scarcely avoid a feeling of great disappointment. […]



It must reluctantly be confessed that most of Wundt's disciples have failed to carry forward the work in at all the positive spirit of their master. For while the simpler psychoses of the Laboratory have been investigated with great zeal and success, their identification with the more complex psychoses of Life has still continued to be almost exclusively ascertained by the older method of introspection.



This pouring of new wine into old bottles has not been to the benefit of either, but rather has created a yawning gulf between the Science and the Reality. The results of all good experimental work will live, but as yet most of them are like hieroglyphics awaiting their deciphering Rosetta stone. (General Intelligence, 203-4).




A major question in early psychometrics was this: to what extent do mental chronometry (reaction time) and accuracy of sensory discrimination (such as distinguishing between two similar audio tones, or degrees of heat, or whether two pin pricks are closer or farther apart on the skin) correlate with performance on the cognitive tasks associated with intellect? 7

The early investigator Guy M. Whipple at Cornell detailed the extent to which the test-taking environment affected reaction-time performance, including where the observer directed attention and whether the subject had consumed stimulants. Whipple also found that reaction times for schoolchildren were not dependable predictors of cognitive performance, and that a reaction time closest to the median was most predictive of academic performance, stating that “the most intelligent children, as indicated by class-standing, are most able to follow instructions, and therefore to approximate the norm, while the less attentive children are erratic and prone to yield either premature or delayed reactions” (493). 8

Whipple stressed the importance of laboratory procedures, holding variables constant, and not conducting the tests in an uncontrolled environment (like a school). He also was the first to highlight the tautological instruction problem:




The outcome of the reaction-time test (and, indeed, of any psychophysical test) upon school children will, furthermore, depend not only upon the objective conditions of the test, upon the nature of the instructions given, etc., but also to an appreciable extent upon the ability of each child to understand and carry out these instructions. When, therefore, a test is affected in this way, any assumed correlation between the quantitative results and the general intelligence of the group of children tested is, in reality, but a correlation of general intelligence with itself.




This problem would later be addressed in tests that require little to no instruction, such as Raven’s Progressive Matrices (example below). 9,10



In the early 20th century, psychometrics continued building itself a positivist foundation with more rigorous tests and procedures. Six years after Galton’s death, however, in an example of retrospective pseudopositivism, Lewis Terman assigned Galton an IQ score based on the voluminous biography Life, Letters and Labours of Francis Galton, written by Galton’s protégé, Karl Pearson. Terman led a major project at Stanford involving a gifted cohort that would later be known as “Terman’s Termites,” who significantly outperformed the average, though much of this achievement may have been attributable to socioeconomic status (SES). Terman strongly believed in the predictive power of IQ, even at a young age (as determined by the Stanford-Binet test).

Reviewing Galton’s childhood, Terman was astonished to find that his subject allegedly knew the alphabet by eighteen months, could read at two and a half, could write coherently at four, and by six was reading the Iliad and Odyssey. Just before his fifth birthday, Galton wrote the following note to his sister and devoted tutor:






MY DEAR ADELE,

I am 4 years old and I can read any English book. I can say all the Latin Substantives and Adjectives and active verbs besides 52 lines of Latin poetry. I can cast up any sum in addition and can multiply by 2, 3, 4, 5, 6, 7, 8, 9, 10. I can also say the pence table. I read French a little and I know the clock.

FRANCIS GALTON,

Febuary (sic) 15, 1827. 11






Galton did misspell February, though, granted, it is not a phonetic word. Further, at age eight Galton was placed in a class with fifteen-year-olds, and at fifteen he was admitted to medical school. A passage from Terman shows his willingness to offer speculation as consensus:






It is well known that, in general, a high correlation obtains between favorable mental traits of all kinds; that, for example, children superior in intelligence also tend to be superior in moral qualities. Francis Galton was no exception to this rule, as indicated by the following letter written by his mother when the boy was only eight years old:



‘Francis from his earliest age showed highly honorable feelings. His temper, although hasty, brought no resentment, and his little irritations were soon calmed. His open-minded disposition, with great good nature and kindness to those boys younger than himself, made him beloved by all his school fellows. He was very affectionate and even sentimental in his manners.



His activity of body could only be equalled by the activity of his mind. He was a boy never known to be idle. His habit was always to be doing something. He showed no vanity at his superiority over other boys, but said it was a shame that their education should have been so neglected.’




It is ironic that at this age Galton attributed his own outperformance to education rather than an innate aptitude. Ultimately, Terman assigned Galton an IQ score of 200 – that is, his mental age was twice his chronological age. (A score of 100 is defined as the median, and one standard deviation is ± 15). Given the rising importance of ostensibly rigorous testing around this period, it is surprising that Terman would get away with such a speculative assessment.
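For reference, Terman’s figure reflects Stern’s ratio definition of IQ, which the early Stanford-Binet used:

    \mathrm{IQ} = 100 \times \frac{\text{mental age}}{\text{chronological age}}

A child performing like a typical ten-year-old at a chronological age of five would thus score 100 × 10/5 = 200, which is how Terman arrived at Galton’s figure.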



Mid-Century Developments

Psychometrics had picked up steam in concert with eugenics, which was going strong in the United States prior to World War II (thanks in large part to the Rockefeller Foundation and the Cold Spring Harbor Laboratory), a war during which Hitler gave the field a bad reputation. Before then, in the thirties, psychometrics enjoyed augmented credibility and funding.

As a result, sample sizes grew, and college students came to be tested rather than schoolchildren. A seminal 1937 study at Vassar investigated the relationship between academic performance and distinct cognitive skills (mathematical vs. verbal reasoning). The researchers found the skills correlated, but that everyone (even those good at math) had a harder time with verbal reasoning. This might have been due to the novelty of the task and the risk that subjective associations could distract from the logic of the syllogism-like word problems themselves.

One of the conclusions was: “Intellectual ability, as represented by high academic standing, and reasoning ability, as represented by the above test, are related”; 12 however, high academic standing might be more indicative of work ethic than of g. Indeed, the authors did not recognize this possibility explicitly but certainly hinted at it:




Students of low academic standing took less time with the test than those of high academic standing, and also had the greater number of errors in reasoning. Within the group of high academic standing, however, time and errors were not related. The inference is, therefore, suggested that the better students gave the test more serious consideration.




The late forties and early fifties saw much larger samples as a result of data collected during the war. Studies were also carried out over longer periods, enabling the determination of retest reliability. One study found test-retest reliability for Raven’s Progressive Matrices of 0.83 to 0.93 across sessions, and a 0.86 correlation with the Stanford-Binet. The investigators also generated an early curve describing the rate at which cognitive performance declines with age. Postwar America was an authoritarian, hierarchical place, and one goal of all this research was to enable institutional decision-makers to better sort people for work.




We are now clearly in a position to compare people of similar intellectual capacity one with another and can proceed to study individual differences determining their suitability for scholastic pursuits and skillful work. 13 (246)






Theories of innateness were too similar to Nazi eugenics for the comfort of the American psychological community, and John Watson’s form of behaviorism was gaining currency. B.F. Skinner’s writing would later augment this trend. The idea of environmental factors determining behavior rather than genetic propensities may have been carried too far (particularly with the “empty organism” language that let the psychologist treat inner cognitive processes as a black box to be ignored), but behaviorism would be the antidote to the macabre Freudian psychoanalysis dominating the collective consciousness at the time. 14



The nature versus nurture debate would take the shape of scientific racism and the Just-World Hypothesis of socioeconomic stratification. Despite the image of mid-century psychologists as hardcore racists or nativists, many were willing to concede that test scores reflected a combination of nature and nurture – but they expressed certainty that the scores themselves were meaningful tools of social policy.




Despite the difficulties associated with the term 'race' it is the best designation we have for groups who differ radically in appearance, culture, and philosophy. Granting that all human beings are similar in basic characteristics, there are still obvious differences that must be reckoned with in communities where different racial groups are represented in large numbers.



One approach to the problem of such differences is performance on so-called intelligence tests. It is immaterial in this connection whether these tests really measure innate mental ability or reflect environmental differentials or represent a mixture of both (which is probably the case), the manifest differences are real and must be allowed for in educational, vocational, and civil and social activities of a community. (90)




Finding slight deficiencies in ethnic minorities in a medium-sized study (2,139 high-schoolers) in Hawai’i, the author stated:




These differences, of course, apply only to the Hawaiian situation and have no reference to these races as a whole. The selective effects of immigration make it impossible to judge a race from the limited representations in any one place outside the native country. (95) 15




The empirical problem (setting aside for a moment the ethical or social concerns) of analyzing populations with a shared genetic profile is that these groups have intermingled extensively throughout human history. The current, arbitrary division of races (Caucasian, African, Asian, Middle Eastern, etc.) is too broad to be meaningful. One is likely to find greater genetic variance between two individual Africans than between the prototypical Caucasian, Asian, and African groups, so referring to any genotype as “African,” juxtaposed to any other racial grouping, is unconstructive. It would be more accurate to speak of Bantu, Wolof, Xhosa, Yoruba, Zulu, etc. and lump everyone else into “Non-African,” if we were really interested in characterizing genetic diversity. 16,17

Still, many psychometricians believe there are meaningful, genetically determined advantages conferred on certain groups. Today, according to researchers like Arthur Jensen and many signers of the “mainstream” psychometrician editorial letter in the Wall Street Journal, Ashkenazi Jews and East Asians are the uncontested high-achieving ethnic groups. Some data suggest that Ashkenazi (but not Sephardi) Jews possess a median IQ that is one standard deviation above the population average. Suffice it to say that this issue gets more than its share of attention, 18 and it may even be desirable for these groups to be wrong about their innate talent – when groups see themselves as superior, the tendency is for them to oppress other groups in a Spencerian hierarchy that figures like Rudyard Kipling would take to its imperialist conclusion.

No such Ashkenazi twin study has been conducted, but the results aren’t likely to convince many entrenched believers on either side that the advantage is innate – “maybe it was the gefilte fish that their mother ate in the womb. They must have gotten more docosahexaenoic acid.” In any case, Zionists will be dismayed to find that Ashkenazim likely descended not from the ancient people of Judea but from the Khazars, a Turkic people centered in the Caucasus who converted to Judaism in the 8th century and migrated to Eastern Europe circa the 12th century. 19

 



Contemporary Sentiments

The AJP hasn’t published much research on intelligence since the 1970s. At least three factors might explain this trend: first, the subject is impolitic and it’s unclear what social benefit could come of the knowledge; second, neuroscience has taken over; and third, two new dedicated journals have emerged (Intelligence, in 1977, and Personality and Individual Differences, in 1980).

21st century psychometricians are a bit more nuanced than the last generation. They stand by the position that intelligence can be accurately measured and that it is indeed predictive – but that we shouldn’t put intellect “on a pedestal”:




Although this chapter has made the case for a Law of General Intelligence, it will end with a caution against putting intelligence, IQ, or g on a pedestal above the many other dimensions along which individual human beings differ, such as creativity, personality, confidence, patience, ethicality, and the like.



Intelligence may be the single best predictor of many life outcomes, but those of us who study intelligence should be especially vigilant against the tendency to associate it with moral worth or to exalt it as the only important human trait. Rather than rename other mental abilities like social skill [or athletic ability] as “intelligences” and pit them against general intelligence ([Howard] Gardner, 1993; [Daniel] Goleman, 1995), we should study each for its own value in understanding the diversity of human behavior.




Philosophy becomes science when the questions become answerable thanks to the systematic application of new tools (methodological, like new standards of rigor in publication, and technical – like fMRI). Questions that were “left” to psychology have been poached by neuroscience, which promises to offer more definitive biophysical answers for human behavior. William James would be thrilled, as the new method of studying cognitive function is the epitome of cerebralism.

Neurophysiologists are interested in the heritability of intellect but also how environment changes the neural architecture. Molecular research has yielded evidence for the involvement of certain proteins like brain-derived neurotrophic factor (BDNF) and microcephalin, while brain imaging has correlated cognitive performance with discrete brain regions, rates of glucose metabolism, relative brain size and the degree of myelin insulation (enabling faster action potential propagation down the neuron). Jung and Haier developed the standard model, the parieto-frontal integration theory of intelligence (P-FIT). The involved regions are diagrammed below. The prefrontal cortex and hippocampus have long been associated with attention, memory and learning. Interestingly, the Neural-Efficiency Hypothesis seeks to explain data showing that those with high IQ test scores have significantly lower rates of glucose metabolism in these regions. 20



The burgeoning field of genomics has been using huge samples (hundreds of thousands) to identify sequence differences associated with intelligence. The effect sizes for each individual gene are quite small – intelligence, like height or weight, is multifactorial. 21 Many genes contribute small effects, and the contributing regions are known as quantitative trait loci. Certain genes stand out, like single nucleotide polymorphism variants of apolipoprotein E, which is also predictive of Alzheimer's and general longevity.
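As an illustration of how such tiny per-locus effects aggregate, here is a minimal sketch of a polygenic score; all genotypes and effect sizes below are randomly generated for demonstration, not real GWAS estimates.

    import numpy as np

    rng = np.random.default_rng(0)
    n_people, n_snps = 5, 10_000

    # Genotypes coded as 0, 1, or 2 copies of the effect allele at each locus.
    genotypes = rng.integers(0, 3, size=(n_people, n_snps))
    # Per-locus effect sizes: each is tiny, as GWAS of cognitive traits find.
    effects = rng.normal(0.0, 0.01, size=n_snps)

    # A polygenic score is simply the effect-weighted sum of allele counts.
    scores = genotypes @ effects
    print(scores)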

The largest, multi-billion-dollar genomics institution in the world, the Beijing Genomics Institute, has begun a large genome-wide association study (GWAS) by collecting the DNA of those who have achieved a high degree of academic attainment and/or have high standardized test scores. The researchers behind one “small” study (N = 3511) claimed a “lower bound” of 0.5 for the heritability of performance on Raven’s. 22 Paul Thompson and his team at UCLA used a new form of diffusion tensor analysis known as high angular resolution diffusion imaging (HARDI) to estimate the heritability of neural density and myelination in the prefrontal cortex by comparing monozygotic and dizygotic twins. 23

The effort to distil Spearman’s g from distinct neurophysiological processes may be an echo of the positivist mathematization that has characterized the discipline since William James. From the late 19th to the mid 20th century, psychologists, like economists and sociologists, sought to emulate physics – desperately wishing to isolate natural laws.

Take, for example, Weber’s law, or the difference threshold. It is defined as the “minimum amount by which stimulus intensity must be changed in order to produce a noticeable variation in sensory experience.”
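In modern notation, the law states that the just-noticeable change ΔI is a constant fraction k of the baseline stimulus intensity I:

    \frac{\Delta I}{I} = k

In other words, a heavier weight or a louder tone must change by proportionally more before the difference is detected.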