The Titan test was also developed, and is scored, by Dr. Hoeflin. Like the Mega, on which it is closely modeled, it is a 48-item take-at-home test.
Certainly, we would have liked to provide an item analysis, or at least to review norming data, for the Titan. Even without that analysis and data, some of us would have been comfortable continuing to use the Titan for admissions at present, were it not for known compromises, based on the following considerations.
There are matched-pair data giving the scores of 114 subjects on both the Mega and the Titan. See figure 15 below. The mean of the Titan raw scores in this set is 20.1 and the mean of the Mega raw scores is 22.3. The difference between means was highly significant (p<0.001) according to a t-test. So across the full range of scores, the Titan is, perhaps, two problems tougher. The correlation between tests was 0.82.
Examining the raw scores of the subjects with combined Mega and Titan raw scores of 48 (n=46) -- people near the Prometheus Society membership criteria interest range -- reveals that the means of the two tests for that group were Mega = 31.4, Titan = 31.3. The difference between means is statistically insignificant, as one might expect.
Figure 15 shows a correlation between scores of individuals taking both the Mega and Titan. Using score pairing equipercentile equating methods for calibration, the fourteenth Titan test score was a 36 and the fourteenth Mega score was a 35. See figure 16. The 46th Titan score was a 24 and 46th Mega score was also a 24 -- a fairly close pairing.
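The rank pairing described above can be illustrated with a minimal sketch. It assumes that "score pairing" means matching the k-th highest score on one test with the k-th highest score on the other; the function name and the eight scores per test are invented for illustration and are not the committee's data.

```python
def rank_paired_scores(scores_a, scores_b, ranks):
    """Pair the k-th highest score on test A with the k-th highest on test B.

    This is a simple rank-matching form of equipercentile equating and
    assumes both lists come from the same subjects (equal length).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("rank pairing here assumes equal-sized samples")
    a = sorted(scores_a, reverse=True)
    b = sorted(scores_b, reverse=True)
    return {k: (a[k - 1], b[k - 1]) for k in ranks}


# Hypothetical raw scores for eight subjects (NOT the report's data).
titan = [24, 40, 15, 33, 36, 20, 27, 30]
mega = [31, 24, 41, 18, 35, 22, 34, 26]

pairs = rank_paired_scores(titan, mega, ranks=[2, 6])
for rank, (titan_score, mega_score) in pairs.items():
    print(f"rank {rank}: Titan {titan_score} <-> Mega {mega_score}")
```

Note that the pairing ignores which subject earned which score; it compares the two score distributions rank by rank, which is why it can differ from the subject-by-subject difference of means.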
The consensus of those committee members who have taken both tests is that the Titan is 2 to 3 problems harder than the Mega. The statistical evidence, however, seems to indicate that the Mega may be a bit more difficult, but at the higher ranges we are trying to measure, they are almost identical. It is interesting that Ron Hoeflin has also characterized the Titan as more difficult at the lower range, and equivalent at the upper end.


The Titan appears to be less compromised at this point in time than the Mega -- our impression is that most people who examine both tests opt to use the Mega because the Titan appears more difficult at first glance and, perhaps, "less fun". Answers to the Titan problems have on occasion appeared on the Internet over the last couple of years. A serious problem in this regard is that we cannot perform item response (IRT) or other analyses necessary to develop a sub-test. We do not even have enough data to effectively check its characteristics.
According to data supplied by the membership officer, very few people have been admitted to Prometheus by the Titan, so evidently people aren't "leaking in" due to this test being too easy or answer leakage being too severe as of yet.
We feel that it is most unfortunate to have to recommend suspension of this test from our qualification list at this time and hope that sufficient data will be provided in the near future so that the test can again be certified for use by the Society. Ron has assured us that he will provide the data so that we will be able to add an addendum to our recommendation if the data warrant the Titan's retention in some form. However, as of this time there is insufficient data to work around the known compromises to this test and we must stop the leak.
8.4.4 LAIT (scored before Dec. 31, 1993)
The norming data on the LAIT has not been made available to this committee by the test developer. However, since the LAIT is no longer being scored, having been retired some time ago when its answers were published, we are not concerned about continued Prometheus Society criteria erosion vulnerabilities due to this test. Many members have been accepted into the Society based on scores on this test in the past and members of record at two dates in the past have been assured entry to the Society so it seems reasonable to retain LAIT scores obtained prior to Dec. 31, 1993 as satisfying entry criteria.
There have been legal problems and some controversy with regard to the legitimacy of this test, but we do not believe that these are of much concern since the test is no longer being scored.

Cursory review of Kevin Langdon's 2nd norming of the LAIT, together with more recent data relating LAIT scores to Mega scores as shown in the following figure 17, has persuaded us that it is reasonable to retain a LAIT-IQ score of 164 as satisfying the 1-in-30,000 of the general population criterion, though it would have been nice to have had more data.
g loading of the LAIT:
The following excerpts are from Grady Towers's "Letters to Kevin Langdon" (Noesis 131 -- Special Issue on Psychometric Issues, 11, September 1998). Grady discussed LAIT/Mega analyses in the "3rd" letter dated 4/28/98, and factor analysis in his "4th" and "5th" letters dated 7/27/98 and 8/24/98. He wrote:
There are two kinds of factor analyses extant in psychometrics: Principal Components Analysis and Common Factor Analysis. Common factor analysis is the preferred method.
What I did was to factor analyze the correlations between the LAIT and 24 Verbal items on the Mega Test, with 12 Spatial items, and 12 Numerical items. I found two important factors: the first column represents g loadings, and the second is a verbal/non-verbal bifactor.
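[Table: g loadings (first column) and verbal/non-verbal bifactor loadings (second column) for the LAIT and the Mega Verbal, Spatial, and Numerical item groups; the values are not preserved in the source.]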
Kevin's reply is an article entitled "Reply to Grady Towers" (Noesis 131 -- Special Issue on Psychometric Issues, 16, September 1998).
A couple of caveats are in order. First of all, the SAT has changed fairly substantially over the years. The analyses that we have performed, and the use to which the SAT has been put in norming other tests in this report, involve exclusively what we call the "old" SAT. To distinguish the versions, it is essential to note that the "new" SAT has been deployed since April 1, 1995; the "old" SAT was administered prior to that date.
The maximum score of 1600 on the new SAT V+M appears to map to the score range of 1510 to 1600 on the old SAT. Given the shape of the score frequency distribution in general, we believe that most 1600's on the new SAT would fall below 1560 on the old SAT. For example, 453 out of 1,127,021 students who actually took the test in 1996-7 (probably representing some 3.5 million total 17 year olds) scored 1600 on this new SAT. This is about 1 out of 7,726 that would correspond to about a 158 IQ. We have yet to see sufficient statistically reliable data on the numbers of participants receiving these high scores from one year to the next on the new SAT, but until and unless these reveal something other than we anticipate from what we have seen, the new SAT is definitely not suitable for our purposes.
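The conversion from rarity to IQ used above can be reproduced with a short sketch. It assumes a normal distribution with mean 100 and a standard deviation of 16; the report does not state the SD here, but SD = 16 reproduces both the "about 158" figure above and the LAIT-IQ of 164 quoted for the 1-in-30,000 criterion. The function name is our own.

```python
from statistics import NormalDist


def rarity_to_iq(one_in_n, sd=16.0, mean=100.0):
    """IQ at which only 1 person in `one_in_n` scores higher.

    Assumes normally distributed IQ; sd=16 is an assumption chosen
    because it matches the figures quoted in the report.
    """
    z = NormalDist().inv_cdf(1.0 - 1.0 / one_in_n)
    return mean + sd * z


perfect_scores = 453        # 1600s on the new SAT, 1996-7
cohort = 3_500_000          # approximate number of 17-year-olds
one_in_n = cohort / perfect_scores

iq_new_sat_1600 = rarity_to_iq(one_in_n)
iq_cutoff = rarity_to_iq(30_000)

print(f"1 in {one_in_n:,.0f} -> IQ {iq_new_sat_1600:.1f}")
print(f"1 in 30,000 -> IQ {iq_cutoff:.1f}")
```

Under these assumptions a perfect score on the new SAT falls several IQ points short of the 1-in-30,000 level, which is the basis of the conclusion above.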
8.5.2 The SAT data correlations with IQ
The SAT does correlate highly with g. This is discussed by Arthur Jensen in The g Factor. Jensen reports on pages 559-560 that data obtained from 339 college students support the notion that much of the variance in SAT scores can be attributed to g (it is unclear from the text whether pre- or post-recentered SAT scores were used). College students are a somewhat restricted sample, so it would be expected that if the sample were the entire population, the correlations could be even higher. The g-loading of the SAT-M is shown as .698, and the g-loading of the SAT-V is .804. The g-loading of most IQ tests is around .80. Another source, Nicholas Lemann, estimates in an article, "The Great Sorting" (Atlantic Monthly, Sept. 1995), that the correlation between the verbal score and IQ is .60 to .80.
8.5.3 Cautionary notes and considerations
There are cautionary notes to be added, though: g-loading is a function of both the test involved and the population being measured. Jensen's data were obtained from a small sample of college students (it is reasonable to view this as a controlled condition because the population consists entirely of college students -- this could provide a control for other significant factors that affect SAT scores). The size of the population used in the ETS data has not been specified. According to Thomas J. Bouchard (a widely recognized researcher in the U.S. at the University of Minnesota studying IQ correlations between monozygotic twins), research correlating IQ with SAT scores has been inconsistent. The Stanford-Binet and SAT have been found to correlate anywhere between .445 and .8. The WAIS and SAT correlations fall in about the same range according to Bouchard. The SAT and other college admissions tests may be adequate measures of g only for small homogeneous populations, e.g., a group of native-English-speaking US students who have had an almost identical academic background that would include learning vocabulary lists and four years of high school math (the test uses no higher than 9th grade math), and who also have had similar lifestyles and academic motivations. These limitations clearly preclude the SAT from ever becoming the sole test from which to select members worldwide.
While most cognitive abilities tests are influenced by education and cultural factors, SAT tests, because of their more specific academic focus, are probably less effective in measuring "g" for people who fall into categories that one finds in more diverse populations (e.g., unsuitable education, lack of motivation to learn required subjects -- verbal/mathematical, or those suffering from math phobia, attention deficit disorder (ADD), depression, dyslexia, adverse effects of exam pressure, young children, foreign examinees, etc.). However, these conditions probably also significantly reduce the possibility of interest in membership in Prometheus.
Finally, it is possible that scores can be increased without a corresponding increase in g through long-term study undertaken with the specific goal of raising test scores (as yet there is insufficient data on this). Individuals may be able to put in extra study and practice relative to the normal comparable population and considerably improve their mathematical and verbal aptitudes. In this regard, long-term coaching should be distinguished from short-term coaching; research on the latter by the College Board indicates that short-term coaching produces score changes that are within the standard error of the test. See http://www.collegeboard.org/press/html9899/html/981123a.html. It is also worth noting that some minimal study and coaching are fairly typical of SAT participation, so that such preparation may be the norm which is already taken into account in the general population distribution.
The discussion by Messick and Jungblut in "Time and method in coaching for the SAT" (Psychological Bulletin, Vol. 89, 1981) provides an argument against the efficacy of coaching to obtain uncharacteristically high scores. Discussion of the issue on pages 400-402 in The Bell Curve cites this paper; there is an excellent graph on p. 401 showing score increments for the SAT-V and SAT-M plotted in separate curves vs. hours of study.
Some facts from the text and the graph:
| hours of study | Verbal | Math | Total |
| 30 | +16 | +25 | +41 |
| 100 | +24 | +39 | +63 |
300 hours of study might be expected to reap a 70 point increment on the combined score, 600 hours 85 points.
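The diminishing return on study time implied by these figures can be made explicit with a small calculation. The sketch below uses only the four combined-score figures quoted above; the starting point of zero hours and zero gain is our own assumption, and the function name is hypothetical.

```python
# (hours of study, combined V+M gain) from the table and sentence above.
# The (0, 0) origin is an assumption: no study, no gain.
checkpoints = [(0, 0), (30, 41), (100, 63), (300, 70), (600, 85)]


def marginal_rates(points):
    """Average points gained per extra hour within each interval."""
    rates = []
    for (h0, g0), (h1, g1) in zip(points, points[1:]):
        rates.append(((h0, h1), (g1 - g0) / (h1 - h0)))
    return rates


for (start, end), rate in marginal_rates(checkpoints):
    print(f"{start:>3}-{end:<3} hours: {rate:.3f} points per hour")
```

On these figures the first 30 hours yield more than a point per hour, while every interval beyond 100 hours yields well under a tenth of a point per hour.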
The cited article is a review of all studies done to that date on this issue. These documented improvements involve the average increments at all levels and are therefore weighted for differences occurring at the average level; increments at the high end of the scale must certainly be less. One would do well to remember that coaching for the SAT is a profitable mini-industry in the U.S. Extravagant claims are to be expected on a routine basis from this industry (as for any other).
Rebuttals to this study are available from sources such as The Princeton Review, which claims to provide unbiased studies that prove significant improvement is possible (well over 200 points). These studies are intra-institutional, like the studies by ETS; information about them can be obtained by contacting The Princeton Review directly or found in books published by Princeton Review. Other material exploring this issue is available from Samuel J. Messick in "Effectiveness of Coaching for the SAT" and "Individuality in Learning". Criticisms similar to those leveled at claims of extravagant gains have been made about the claims put forward by Herrnstein and Murray. See, for example, Measured Lies: The Bell Curve Examined; Cracks in the Bell Curve; Intelligence, Genes, and Success: Scientists Respond to the Bell Curve 'Statistics for Social Science and Public Policy'; Inequality by Design: Cracking the Bell Curve Myth; The Bell Curve Debate: History, Documents, Opinions; The Bell Curve Wars. Also, ETS has sometimes been accused of biased statistical approaches that may significantly influence the conclusions obtained. See, for example, Stephen Levy's "ETS and the Coaching Cover-up," in the March 1979 issue of New Jersey Monthly.
While all members of the Membership Committee acknowledge that there are valid criticisms of the SAT, we are in general agreement that these criticisms are insufficient to preclude its use for our purposes.
8.5.4 Intelligence filter operative in selection of SAT participants
It is well known that the SAT is administered selectively to high school age students in the US. On page 35 of The Bell Curve it is stated that, "By 1960, a student who was really smart -- at or near the 100th percentile in IQ -- had a chance of going to college of nearly 100%." There is a graph on the same page showing three curves for percentile IQ vs. percent of college attendance. The curves are for the 1920s, early 1960s and early 1980s. From the graph, it appears that in the 1980s and in the 1960s, a student at the 96th percentile in IQ had about a 92% chance of attending college (and, by implication, of taking the SAT).
From the notes in The Bell Curve on page 692, note 7: "...from top quartile [of PSAT scores], 79% went to college; of those in the top 5%, more than 95% went to college." The data in the first example used IQ scores, not SAT scores.
There is another graph on p. 37 showing two curves, one for students entering college, one for completing the B.A. as a percentage vs. percentile IQ. Quote from p. 36: "...Meanwhile about 70% of the top decile of ability were completing a B.A."
For the graph on p. 35 of The Bell Curve, the curve for the 1980s is drawn from data from the National Longitudinal Survey of Youth. This study, the backbone of much data in The Bell Curve, used IQ not SAT for its cognitive ability estimate.
As the curves in these graphs show no signs of "bending over" at the higher IQ ranges, this ought to allay fears about appreciable numbers of people at the top not taking the test. See for example, figures 19 & 20 below.
We have examined the effects of selective intelligence filtering to assess the extent to which participants differ from the general population. Only about one in three seventeen to eighteen year-olds in the US take this test although virtually all "college bound" students do take it. Filter assessment has been assisted by the availability of the National High School (NHS) survey that assessed the distribution of all students independent of whether they would have taken the SAT otherwise.
Figure 18 shows the frequency distribution of college bound students for a given year.
The scores are again quite obviously not normally distributed, although the skewing is less than for the Mega. There are again many more nominally high scores than a normal distribution would predict. In figure 19, which is described in more detail in the selective filter methodology description of section X, the effective filter is shown on an enlarged scale as the roughly diagonal curve indicating progressively intense selection based on intelligence. The deviation at the bottom is obviously because students with excessively low IQs do not even attend high school and therefore were not even included in random samples. See Kjeld Hvatum's table presented in section 8.3.3, where the range of retardation is shown to extend well into the score levels on the SAT which are effectively missing.
The degree to which this composite filter fits the SAT data is shown particularly well in the plot on a log scale shown in figure 20. The similarity in form of this filter and that which is evident in the Mega data suggests that many of the same type of pressures must exist and again, that individuals are capable of very accurate assessments of their own cognitive abilities.


It is interesting that Kjeld Hvatum in his "Letter to Ron Hoeflin" (In-Genius, Vol. 15, August 1990) says,

This is essentially what we have found, but one cannot just assume that the top 1/3 of the overall US high school population takes the SAT as shown in the figure above -- it is more complicated, and the filtering more effective, than that.
8.5.5 The ability of the SAT to discriminate at the high end of its scale
The graphs in figures 21 and 22 below show that the SAT has the ability to discriminate throughout its complete range of raw scores. Figure 21 shows a slight non-linearity between raw and scaled scores starting near a total score of 1540. On other administrations of the test (see figure 22) the questions are evidently more difficult and the raw vs. scaled graph is linear all the way to the top, suggesting that the test is indeed discriminating through its complete range.


The difference between 1600 and 1560 is typically 2 to 4 problems on the "old" (pre-recentered) SAT. However, when figuring percentile equivalents for the SAT, it should be remembered that they are based upon a sample of approximately 1 million actual test takers selectively sampled from a general population in excess of 3 million. It isn't unreasonable to assume that the general population percentiles that we assign to the SAT at the top end (for which selection is the highest) are accurate for the test group as a whole. In fact, however, in a population of 3 million there should be over 100 individuals scoring at the 1-in-30,000 level. In any given year fewer than ten individuals obtained a perfect score on the old SAT, with on the order of 100 or fewer scoring 1560 or more; it is therefore safe to say that the 1-in-30,000 level is achieved by these individuals.
8.5.6 Establishing a credible 1-in-30,000 of the general population raw score cutoff
As indicated throughout this report, we have chosen not to accept theoretical positions on what the distributions of test scores will be at the high end of the psychometric range nor even if it is intelligence that is being discriminated at the extreme tails of distributions, preferring actual data to accepted notions and legitimate claims of rarity to unverified claims of "super intelligence." In keeping with this philosophy, we note that of three million people in the general population for which a single SAT applies, 100 would satisfy the rarity condition. Therefore, for a given year, looking down the top 100 scores, we find for example for 1984 combined V+M for College-Bound Seniors:
| Score | Number |
| 1600 | 5 |
| 1590 | 0 |
| 1580 | 27 |
| 1570 | 19 |
| 1560 | 39 |
| 1550 | 75 |
| 1540 | 96 |
| 1530 | 108 |
| 1520 | 188 |
| 1510 | 217 |
| 1500 | 278 |
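The "looking down the top 100 scores" procedure can be sketched as a running count from the top of the 1984 single-score table above. The counts are copied from that table; the function names are our own, and the sketch only applies the counting rule described in the text rather than reproducing any norming step not shown here.

```python
# 1984 combined V+M counts for College-Bound Seniors, from the table above.
counts_1984 = [
    (1600, 5), (1590, 0), (1580, 27), (1570, 19), (1560, 39), (1550, 75),
    (1540, 96), (1530, 108), (1520, 188), (1510, 217), (1500, 278),
]


def cumulative_from_top(counts):
    """Return (score, number scoring at or above it), highest score first."""
    running = 0
    out = []
    for score, n in sorted(counts, reverse=True):
        running += n
        out.append((score, running))
    return out


def lowest_score_within(counts, target):
    """Lowest score whose at-or-above count does not exceed `target`."""
    best = None
    for score, at_or_above in cumulative_from_top(counts):
        if at_or_above <= target:
            best = score
    return best


population = 3_000_000
target = population // 30_000   # 100 people at the 1-in-30,000 level

for score, at_or_above in cumulative_from_top(counts_1984):
    print(f"{score}: {at_or_above} at or above")
print("lowest score within the top", target, "->",
      lowest_score_within(counts_1984, target))
```

For this year the running count stays within 100 down to 1560 and exceeds it at 1550, which agrees with the earlier remark that on the order of 100 or fewer test takers score 1560 or more.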
| Score Range | Number |
| 1591-1600 | 35 |
| 1581-1590 | 8 |
| 1571-1580 | 149 |
| 1561-1570 | 71 |
8.6.1 Mensa testing approaches
Because of much greater membership, Mensa can afford quite extensive testing programs. Facilities and psychometric instruments are available throughout the world. In much the way that this committee is attempting to assist the Prometheus Society in establishing tests that it can warrant with credibility, Mensa accepts scores on various tests -- which change from time to time.
It is understood in this regard that Mensa's discrimination problems are much less demanding than ours because of their considerably lower qualifying standard. They do provide a paradigm, however, and if it were possible to tap into their resources and global support, it would have considerable merit. Greg Scott addressed this possibility in his article, "For Acceptance of Mensa Supervised Tests" (Gift of Fire, Issue 99, September 1998). We have, therefore, considered tests whereby individuals may be qualified for entry to Mensa. We have also considered counter arguments as put forth by Kevin Langdon in his article "Mensa Tests and Other Standard Tests" (Gift of Fire, Issue 81, January 1997) that was in response to Greg Scott's article as well as other issues that we have encountered.
You will see these various lines of reasoning pursued in the following sections.
8.6.2 Cattell Culture Fair III
Cattell Culture Fair III (A+B) has a history of use since the early 1920s, but the present edition is dated 1960 and was revised in 1963. Mensa used this test prior to its latest adoption of the Raven Advanced (both tests are still used by Mensa in the UK although now dropped in the US).
The features of this test are as follows:
88 for IQ 165
89 for IQ 167
90 for IQ 168
91 for IQ 169
93 for IQ 173
95 for IQ 176
97 for IQ 179
99 for IQ 183
100 for IQ 187 (extrapolated)
The following are features of the test:
8.6.3 Raven's Advanced Progressive Matrices (RAPM)
Raven's Advanced Progressive Matrices is one of a series of nonverbal tests of intelligence developed by J.C. Raven (1962). Following Spearman's theory of intelligence, it was designed to measure the ability to educe relations and correlates among abstract pictorial forms, and it is widely regarded as one of the best available measures of Spearman's g, or of general intelligence (e.g., Jensen, 1980; Anastasi, 1982). As its name suggests, and of particular significance to the Prometheus Society, it was developed primarily for use with persons of advanced or above average intellectual ability.
Like the other Raven's matrices tests, the APM is composed of a series of perceptual analytic reasoning problems, each in the form of a matrix. The problems involve both horizontal and vertical transformations: Figures may increase or decrease in size, and elements may be added or subtracted, flipped, rotated, or show other progressive changes in the pattern. In each case, the lower right corner of the matrix is missing and the subject's task is to determine which of eight possible alternatives fits into the missing space such that row and column rules are satisfied. The APM battery consists of two separate groups of problems. Set I consists of 12 problems that cover the full range of difficulty sampled from the Standard Progressive Matrices test. Standard timing for Set I is 5 minutes. This set is generally used only as a practice test for those who will be completing Set II. Set II consists of 36 problems with a greater average difficulty than those in Set I. Set II can be administered in one of two ways: either with or without a time limit of 40 minutes. Administering Set II without a time limit is said specifically to assess a person's capacity for clear thinking, whereas imposing a time limit is said to produce an assessment of intellectual efficiency (Raven, Court, & Raven, 1988).
Phillip A. Vernon, in his review of the APM (Test Critiques, 1984) writes that "the quality of the APM as a test is offset by the totally inadequate manual which accompanies it. For interpretive purposes, the manual provides 'estimated norms' for the 1962 APM which allow raw scores to be converted into percentiles (but only 50, 75, 90, and 95) and another table for converting percentiles into IQ scores." John Johansen, a graduate student at the University of Minnesota and former regular poster to the Brain Board, came into possession of the 1962 version of the test for use in his research (this form is no longer used for testing) along with 27 pages of written text about the implementation, scoring and standardization of the test. In a post to the Brain Board at (http://www.brain.com/bboard/read/iq-archive3/1599), he provided the following information applicable to the untimed 1962 version of the test:
Untimed intraday (go until you give up) 1962 distribution for 20 year olds, 30 year olds and 40 year olds. Scores balanced for guessing.
[Table of raw-score norms by age group not preserved in the source.]
Ignoring the above caveat about inaccurate norms above the 99.9th percentile, the above data indicates that there is about a 4 point raw score difference between 2 and 3 sigma on this test. If this difference carries on to the next "sigma," this would give associated scores of:
[Table of extrapolated scores not preserved in the source.]
Bors and Stokes administered the timed version of the APM to 506 students (326 women, 180 men) from the Introduction to Psychology course at the University of Toronto at Scarborough. Subjects ranged in age from 17 to 30 years, with a mean of 19.96 (standard deviation=1.83). Enrollment in the Introduction to Psychology course was considered roughly representative of first-year students at this university. The scores on Set II for the 506 students ranged from 6 to 35 with a mean of 22.17 (standard deviation=5.60). This performance is somewhat higher than that of the Raven's 1962 normative group but considerably lower than Paul's 1985 University of California, Berkeley sample.
Additional data supporting the conclusion that the RAPM (either timed or untimed) does not discriminate at the 1/30,000 level is taken from Spreen & Strauss (Compendium of Neuropsychological Tests, 2nd Edition, 1998), and shown in the tables below.
A middle-of-the-road approach would be to use the recent University of Toronto at Scarborough data and to assume that the mean of the test group corresponds to about 1 SD above the mean of the general population, and to further assume that the SD of the general population would be about the same as the standard deviation of the test group. Finally assuming a normal distribution in the test group, the 1-in-30,000 level would correspond to 22.17 + 3 * (5.60) = 39, which is 3 raw points above the test's ceiling of 36.
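The arithmetic behind this estimate can be checked with a brief sketch. It restates the assumptions of the paragraph above (test-group mean one SD above the general-population mean, equal SDs, normality) and adds one of our own for illustration: that the 1-in-30,000 level is taken as roughly 4 SD above the general mean, so that 3 SD above the group mean remain.

```python
from statistics import NormalDist

group_mean = 22.17   # Set II mean, University of Toronto at Scarborough sample
group_sd = 5.60      # Set II standard deviation, same sample
ceiling = 36         # number of Set II problems

# z-score of the 1-in-30,000 level under a normal model (close to 4).
z_cutoff = NormalDist().inv_cdf(1.0 - 1.0 / 30_000)

# Assumption from the text: the group mean sits 1 SD above the general mean,
# so the cutoff lies about (4 - 1) = 3 SD above the group mean.
sds_above_group_mean = 3
required_raw = group_mean + sds_above_group_mean * group_sd
shortfall = round(required_raw) - ceiling

print(f"z at 1-in-30,000: {z_cutoff:.2f}")
print(f"required raw score: {required_raw:.2f}")
print(f"points above the ceiling: {shortfall}")
```

Because the required raw score exceeds the number of problems on Set II, no attainable score reaches the 1-in-30,000 level under these assumptions.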
| Percentile | untmd. | 40 min | 40 min | 40 min | 40 min | 40 min | 40 min | 40 min | 40 min | 40 min |
| | (n=71) | (n=195) | (n=104) | (n=104) | (n=157) | (n=49) | (n=52) | (n=104) | (n=61) | (n=34) |
| 95 | 33 | 29 | 34 | 30 | 34 | 28 | 32 | 34 | 30 | 33 |
| 90 | 31 | 27 | 32 | 28 | 32 | 26 | 31 | 32 | 28 | 31 |
| 75 | 27 | 23 | 29 | 25 | 30 | 22 | 28 | 30 | 25 | 28 |
| 50 | 22 | 18 | 25 | 22 | 27 | 19 | 25 | 27 | 21 | 24 |
| 25 | 17 | 13 | 21 | 19 | 25 | 15 | 23 | 25 | 17 | 21 |
| 10 | 12 | 10 | 18 | 16 | 22 | 12 | 20 | 22 | 13 | 18 |
| 5 | 9 | 8 | 16 | 14 | 21 | 10 | 19 | 21 | 11 | 16 |
The data above do bring up the issue of age variation in IQ data, which is not typically addressed by other instruments that we've used for Prometheus Society entry requirements and which is perhaps something that should be considered. (In the case of the SAT and GRE tests, there is not typically much variation in the ages of those taking the test, and no such data was used in norming any of the take-at-home tests we've used.) Spreen and Strauss have provided the information for the table below:
| Percentile | (n=28) | (n=53) | (n=72) | (n=77) | (n=121) | (n=69) | (n=33) | (n=36) | (n=27) | (n=33) | (n=54) |
| 95 | 32 | 32 | 32 | 32 | 32 | 32 | 31 | 30 | 29 | 27 | 25 |
| 90 | 30 | 30 | 30 | 30 | 30 | 30 | 29 | 28 | 27 | 25 | 23 |
| 75 | 27 | 27 | 27 | 26 | 26 | 26 | 26 | 25 | 24 | 22 | 19 |
| 50 | 20 | 20 | 20 | 19 | 19 | 19 | 19 | 18 | 16 | 14 | 12 |
| 25 | 15 | 15 | 15 | 15 | 15 | 14 | 14 | 13 | 12 | 10 | 8 |
| 10 | 10 | 10 | 10 | 10 | 10 | 10 | 9 | 8 | 7 | 6 | 4 |
| 5 | 7 | 7 | 7 | 7 | 7 | 7 | 6 | 5 | 4 | 3 | 2 |
Tests completed at leisure. Source: J. Raven (1994)
Curiously, American Mensa does not list the RAPM among its currently accepted tests, although UK Mensa does. Perhaps this is a more "international" test than others we have reviewed and considering its quality, we should probably continue to consider its possible use, especially as an "auxiliary" test to be submitted in conjunction with other tests that are deemed capable of discriminating at the 1-in-30,000 level.
8.6.4 California Test of Mental Maturity (CTMM)
The reliability coefficients are said by Bert Goldman, Dean of Academic Advising at the University of North Carolina, in reviewing the "California Short-Form Test of Mental Maturity, 1963 Revision" in The Seventh Mental Measurements Yearbook, to indicate adequate reliability. He says further that:
"No rationale is given for using eight school levels with the Short Form and only six school levels with the Long Form. Further, five factors are included in the Long Form and only four in the Short Form. No reason is given for eliminating the Spatial Relationships factor from the Short Form. However, earlier in this review it was pointed out that among the five factors this one provided the poorest reliability coefficients.
In sum, as far as group tests of intelligence are concerned, the CTMM appears to rate among the best. Its format is clear and easy to follow, its material appears durable, the norms appear representative, and its reliability while being weaker at the lower levels generally seems satisfactory. Data on validity are lacking, but if its shorter version is comparable, then considerable evidence suggests that the Long Form is valid. This leads to a question that has long stood in this reviewer's mind. Why both tests? Why not just the CTMM-SF? The Short Form takes less time to administer than the Long Form, research is available concerning its validity, and in terms of reliability it does not contain the Long Form's weakest factor (Spatial Relationships)."
There are several interesting pieces of data that would seem to suggest the CTMM may be an appropriate test for inclusion on our list. For example, the following score pair data is available on Darryl Miyaguchi's web site for the "OMNI Sample":
LAIT vs. CTMM: 5 cases -- CTMM substantially lower score in every case. Average difference = 12.8 IQ points.
Cattell vs. CTMM: 24 cases -- CTMM substantially lower score in every case. Average difference = 12.6 IQ points.
In neither of the situations described above did the difference seem to be IQ (Mega raw score) dependent! In fact in the data included for that norming, roughly the same number of individuals reported LAIT, Cattell and CTMM scores as follows:
CTMM high scores: 179, 162, 154, 154, 150...total of 30 scores
Cattell high scores: 191, 178, 172, 169, 164...total of 35 scores
LAIT high scores: 171, 170, 169, 167, 166...total of 35 scores
It is noted that in "Mensa Tests and Other Standard Tests" (Gift of Fire, Issue 81, January 1997), Langdon has suggested that the CTMM is inappropriate for admission to our Society because it has "a ceiling of 3.5 sigma," which is in accord with Grove's mention of a ceiling of 158. In no case was a 4-sigma LAIT or Mega score confirmed in the OMNI Sample by a CTMM score. The CTMM scores tend in general to be much lower than the other two as can be seen in figure 4 above. This impression is further confirmed by inspection of figure 6 above where, if CTMM scores were used for norming the Mega, standard scores on the Mega would have to be dropped (as against raised!) by as much as ten points since the CTMM score of 155 corresponds to the Mega cutoff score of 36! Clearly, if anything, the CTMM seems to underestimate IQ at these high scores. However, we have to reject the CTMM because its ceiling of 158 is too low for our entry criterion.