January 27, 2004

NAEP scores and Simpson's Paradox

Jay Mathews is back in the WaPo with an article on what the media are missing when reporting test scores, and a good explanation of Simpson's paradox:

Mention Gerald W. Bracey's name in any assemblage of educational pundits and you will often hear an awkward silence...Bracey has often offended self-appointed experts like me by exposing us to the truth, and he is rarely invited to any of our parties.

His article [in the February issue of the American School Board Journal], "Simpson's Paradox and Other Statistical Mysteries," exposes a great gap in our coverage of test score results. With great regularity, mainstream newspapers like mine, as well as popular magazines and the big networks, report on the lack of improvement in our public schools. We use words like "stagnant" or "sluggish" or "static" or "flat" to describe the achievement levels as measured by the National Assessment of Educational Progress (NAEP)...

But here comes Bracey to explain that we are being deceived by Simpson's Paradox. A statistician named Edward Hugh Simpson came up with this a half century ago. It works on all kinds of phenomena. Bracey defined it for me this way: "Simpson's Paradox occurs when the aggregate group score shows one pattern but subgroups show a different pattern."

When you break down the NAEP and SAT data into ethnic subgroups, for instance, you find that minorities have improved their averages markedly, which is exactly what our increased spending on schools had been designed to achieve. On the NAEP reading test, for instance, non-Hispanic white 17-year-olds had only a small improvement. They went from 291 points to 295 points, while the overall average went from 285 to 288 points. But African Americans in that same period jumped 26 points, from 238 to 264, and Hispanics increased 19 points, from 252 to 271.

The same thing happened with the SAT. Non-Hispanic whites showed a modest increase of 8 points, from 519 in 1981 to 527 in 2002, while African Americans were up 19 points, from 412 to 431, Puerto Rican Americans were up 18 points from 437 to 455 and Mexican Americans up 8 points from 438 to 446. Asian Americans increased 27 points, from 474 to 501.

To the math-challenged among us, this makes no sense. How could almost every ethnic group increase significantly while the overall average went up barely, or not at all?

As Bracey explains, we are overlooking two important factors: (1) minorities make up a much larger portion of the total testing population than they did before, and (2) although they have shown significant improvement, their averages are still relatively low. When you add more low scorers, even if they improve over time, you are not going to see much improvement in the overall average.

In other words, when you're lumping together all test-takers and you see only a tiny rise in the overall average, that doesn't mean there's only a tiny rise within every subgroup. If there are big differences in both sample sizes and performance across subgroups, as there are on most standardized tests, then the disaggregated data will tell a much different story - sometimes, the opposite story - than the dataset as a whole (which is why, in statistical analysis, you always disaggregate your data).
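To see how this plays out with numbers, here's a minimal Python sketch. The group averages are the NAEP reading figures quoted above. The shares of test-takers are invented purely for illustration (the article doesn't give them), so the overall averages it computes won't match the reported 285 and 288.

```python
# Group averages quoted in the article (NAEP reading, 17-year-olds).
before_scores = {"white": 291, "black": 238, "hispanic": 252}
after_scores = {"white": 295, "black": 264, "hispanic": 271}

# HYPOTHETICAL shares of test-takers, made up for this sketch only.
# The one property that matters: the lower-scoring groups grow as a
# fraction of everyone tested.
before_shares = {"white": 0.85, "black": 0.10, "hispanic": 0.05}
after_shares = {"white": 0.70, "black": 0.15, "hispanic": 0.15}


def overall_mean(scores, shares):
    """Weighted average of group means; weights are shares of test-takers."""
    total = sum(shares.values())
    return sum(scores[g] * shares[g] for g in scores) / total


overall_before = overall_mean(before_scores, before_shares)
overall_after = overall_mean(after_scores, after_shares)
overall_gain = overall_after - overall_before
smallest_group_gain = min(after_scores[g] - before_scores[g] for g in before_scores)

for g in before_scores:
    print(f"{g:9s} gain: {after_scores[g] - before_scores[g]:+d}")
print(f"overall   gain: {overall_gain:+.2f}")
```

With these made-up shares, every group gains at least 4 points, yet the overall average gains only about 3, because more of the weight has shifted onto the groups whose averages are still lower.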

Howard Wainer, a well-known, well-respected, and prolific psychometrician, has published a great deal on statistical paradoxes, including Simpson's. Here's an article from 1994 in which he examines NAEP results for black and white test takers:

[For the 1992 NAEP 8th grade math assessment] Nebraska's average score was 277. New Jersey's average score was 271. On the face of it, it appears that 8th grade students in Nebraska do better in mathematics than their counterparts in New Jersey. We note further however that when we examine [mean] performance by ethnic group we find:


Nebraska: White = 281, Black = 236
New Jersey: White = 283, Black = 242

How can this be? Even though Nebraska does better overall than New Jersey, New Jersey's students in both of the major ethnic groups outperform their Nebraska counterparts. This is an example of what statisticians have long called Simpson's Paradox (Wainer, 1986c; Yule, 1903). It is caused by the differences in the ethnic distributions in the two states.


Nebraska: White = 87%, Black = 5%
New Jersey: White = 61%, Black = 16%

Each state's mean score is a product of the mean score within each ethnic group and its proportional representation in the population. Thus Nebraska's mean is composed of the White mean weighted by 87% and the much lower black mean weighted by only 5%. In New Jersey Whites represent a much smaller segment of the population and so are given a smaller weight in the calculation of the overall mean.
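The weighting can be checked with a short Python sketch using only the figures quoted above. One loud assumption: the excerpt gives shares for just two groups (they sum to 92% in Nebraska and 77% in New Jersey), so this sketch ignores everyone else and rescales the two shares to sum to 1. It therefore won't reproduce the reported 277 and 271, but it shows the same reversal.

```python
# Figures quoted from Wainer: (mean score, share of students) per group.
states = {
    "Nebraska": {"White": (281, 0.87), "Black": (236, 0.05)},
    "New Jersey": {"White": (283, 0.61), "Black": (242, 0.16)},
}


def two_group_mean(groups):
    """Weighted mean over the quoted groups only.

    ASSUMPTION: groups not listed in the excerpt are left out, and the
    quoted shares are rescaled so they sum to 1.
    """
    total_share = sum(share for _, share in groups.values())
    return sum(mean * share for mean, share in groups.values()) / total_share


means = {state: two_group_mean(groups) for state, groups in states.items()}

for state, value in means.items():
    print(f"{state}: {value:.1f}")
```

Under that assumption Nebraska comes out at roughly 278.6 and New Jersey at roughly 274.5: Nebraska is ahead overall even though New Jersey is ahead in both groups, purely because of how the weights differ.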

If we standardize all states to a common demographic mixture, say the demographics of the United States as a whole, we find that New Jersey's standardized mean is 274 and Nebraska's is 270. Which is the right number?...To answer this we have to know what is the question that the number will be answering.

If the question is of the sort, "I want to open a business in either New Jersey or Nebraska. Which state will provide me with a population of potential employees whose knowledge of mathematics is, on average, higher?" The unadjusted mean scores provide the proper answer.

If the question is, "I want to place my child in school in either New Jersey or Nebraska. In which state is my child likely to learn more mathematics?" The standardized scores give the right answer...If your child is White, he/she is likely to do better in New Jersey. If he/she is Black, he/she is likely to do better in New Jersey...Presenting the data in a disaggregated way allows these sorts of questions to be answered specifically, but if a single, overall number is needed to summarize the performance of a state's children, for questions like this, one must standardize.
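Standardizing, as described in the excerpt, just means applying the same demographic weights to both states. A rough Python sketch follows. The excerpt doesn't give the national mix Wainer used, so `common_mix` below is a hypothetical two-group mix chosen for illustration, and the results won't match his 274 and 270.

```python
# Group means quoted from Wainer (1992 NAEP 8th grade math).
group_means = {
    "Nebraska": {"White": 281, "Black": 236},
    "New Jersey": {"White": 283, "Black": 242},
}


def standardized_mean(means_by_group, mix):
    """Weight each group's mean by the SAME mix, whatever the state."""
    total = sum(mix.values())
    return sum(means_by_group[g] * w for g, w in mix.items()) / total


# HYPOTHETICAL common demographic mix (not the real U.S. figures).
common_mix = {"White": 0.80, "Black": 0.20}

standardized = {
    state: standardized_mean(by_group, common_mix)
    for state, by_group in group_means.items()
}

for state, value in standardized.items():
    print(f"{state}: {value:.1f}")
```

Whatever common mix you plug in, New Jersey comes out ahead, since it leads in both groups. The choice of mix only changes the size of the gap.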

Anyway, back to Jay's takeaway argument:

You can argue that the failure of the white students to improve significantly is a matter of concern, but it is also clear that we have been obscuring the good news about minority score improvements by focusing so much on lack of change in the aggregate scores.

Posted by kswygert at January 27, 2004 11:04 PM