October 06, 2003

Another whack at the dead horse

Devoted reader and darn good blogger John Rosenberg forwards me the text of an article from the Chronicle of Higher Education about that favorite dead horse of the anti-testing crowd, the ostensible racial bias of SAT items. I don't have a subscription, but I do have John's email, so I can post the "highlights" and comment on them:

Before a question appears on the SAT, it is carefully pretested by the Educational Testing Service, which administers the college-entrance exam. As a result, test designers know in advance how members of different groups fare on each and every item. On some questions, white test takers consistently pick the right answer, while black and Hispanic students trip up, even though the questions appear to be similar in content and level of difficulty. On other pretested items, however, minority test takers perform consistently better than white test takers.

But because of test-construction rules used by the ETS, which do not take race into account but which are aimed at maximizing the test's reliability, the only questions that end up on the SAT are "white-preference questions," says Jay Rosner, executive director of the Princeton Review Foundation, a nonprofit organization that helps underprivileged students prepare for standardized tests.

On the October 2000 SAT, he says, every one of the 138 questions on the test favored white students, and none of the pretested questions that had favored minority students were included. He says he has found the same bias in the tests in other years, based on extensive data purchased from the testing service.

Horse puckey. Only three paragraphs in, and both Mr. Rosner and the Chronicle are already demonstrating an appalling lack of understanding of how tests are constructed, the definition of bias, and how items are analyzed for differential functioning, or DIF.

Items are removed from the test if they show DIF in pretesting. An item that shows DIF is one on which ability-matched examinees from different subgroups have different probabilities of answering it correctly. The key is the matching, and this is the point that no testing critic or racial bias agitator wants to address.

If, for an item, high-scoring white students have a significantly different probability of answering the item right than similarly high-scoring black/Hispanic/Asian students during the pretesting, that item will not be assembled into the test. It does not matter which group the item favors; the item will be removed from the pool. This is how psychometricians define "bias," and by this definition, the SAT is not biased, because SAT items go through several layers of this review process.

Anti-testing activists insist on moving the goalposts and fudging the definitions. Their definition of bias is: did blacks as a whole do worse on this item than whites as a whole? Consider a test of academic achievement in a country in which black students are more likely to come from poor homes, more likely to attend poor schools, and (thanks to subversive cultural elements) less likely to value academic achievement. By this definition, its items will be biased simply because blacks as a whole ARE doing worse on the items than whites. The test score will have differential impact on blacks and whites, but this is no more inherently unfair than the fact that a driver's license exam will have differential impact on those who know how to drive vs. those who don't.

That nonsense above about "every one of the 138 questions on the test favored white students" is, in fact, nonsense because Mr. Rosner is comparing all black students to all white students. Because white students tend to come from more academically rigorous schools, a larger proportion of white students answer each item correctly. He's simply looking at group mean differences, with no matching on ability.
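To make the difference between a group mean gap and DIF concrete, here is a minimal Python sketch. It uses the Mantel-Haenszel common odds ratio, a widely used matched-group DIF statistic. The article doesn't say which statistic the College Board's review relies on, so treat that choice, the function names, and the invented counts as assumptions for illustration only.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Pooled odds ratio (reference vs. focal) across matched score levels.

    records: iterable of (group, score_level, correct) tuples, where group
    is "ref" or "focal" and correct is 1 or 0. A value near 1.0 means that
    matched examinees have about the same odds of answering correctly.
    Assumes at least one level has a nonzero wrong/right cross product.
    """
    # Per score level: [ref right, ref wrong, focal right, focal wrong]
    strata = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in records:
        idx = (0 if group == "ref" else 2) + (0 if correct else 1)
        strata[score][idx] += 1

    num = den = 0.0
    for a, b, c, d in strata.values():
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return num / den


def proportion_correct(records, group):
    """Unmatched proportion correct for one group (the 'group mean')."""
    answers = [correct for g, _, correct in records if g == group]
    return sum(answers) / len(answers)


def cell(group, score, right, wrong):
    """Expand hypothetical counts into individual response records."""
    return [(group, score, 1)] * right + [(group, score, 0)] * wrong


# Invented data: within each score level both groups succeed at the same
# rate (80% at "high", 40% at "low"), but most of the reference group sits
# at the high level and most of the focal group at the low level.
records = (
    cell("ref", "high", 80, 20) + cell("focal", "high", 16, 4)
    + cell("ref", "low", 8, 12) + cell("focal", "low", 32, 48)
)

raw_gap = proportion_correct(records, "ref") - proportion_correct(records, "focal")
mh = mantel_haenszel_odds_ratio(records)
print(f"raw gap: {raw_gap:.3f}, matched odds ratio: {mh:.3f}")
```

With these made-up counts, the reference group answers the item correctly about 73% of the time versus 48% for the focal group, yet the matched odds ratio is 1.0. The whole raw gap comes from how the two groups are spread across score levels, not from the item treating matched examinees differently.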

The problem is that true bias has little to do with group mean differences, and everything to do with the measurement of extraneous constructs. Here's a retro example of such constructs. If a group of male and female students answer a math item which asks about how many yards of tulle and jacquard fabric are needed to make a set of curtains, we might expect to see male-female DIF. We might see that high-scoring males have less chance of answering the item correctly than high-scoring females.

We then remove the item from the test - not because we think it's "unfair" for women to have the advantage, but because the item is measuring something that it wasn't intended to. We don't care if boys or girls know what "jacquard" means; all we care is whether they can calculate square yardage. "Fairness" is not the issue; validity is. An item measuring a nuisance construct is an invalid item.
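The curtains example can be put in numbers. Below is a small Python sketch of a standardized p-difference, which averages the male-minus-female gap in proportion correct across matched score levels, weighting each level by the number of males in it. All counts are invented, and the function name and data layout are assumptions made for this illustration, not anything taken from an actual test.

```python
def standardized_p_difference(strata):
    """Focal-weighted average of (focal p - reference p) over score levels.

    strata: list of dicts, one per matched score level, with the keys
    focal_n, focal_right, ref_n, ref_right (counts for a single item).
    """
    weighted_sum = 0.0
    total_weight = 0
    for level in strata:
        p_focal = level["focal_right"] / level["focal_n"]
        p_ref = level["ref_right"] / level["ref_n"]
        weighted_sum += level["focal_n"] * (p_focal - p_ref)
        total_weight += level["focal_n"]
    return weighted_sum / total_weight


# Hypothetical counts; focal = male, reference = female, matched on math score.
curtains_item = [
    {"focal_n": 50, "focal_right": 15, "ref_n": 50, "ref_right": 20},    # low
    {"focal_n": 100, "focal_right": 50, "ref_n": 100, "ref_right": 62},  # mid
    {"focal_n": 50, "focal_right": 40, "ref_n": 50, "ref_right": 46},    # high
]
plain_yardage_item = [
    {"focal_n": 50, "focal_right": 20, "ref_n": 50, "ref_right": 20},    # low
    {"focal_n": 100, "focal_right": 60, "ref_n": 100, "ref_right": 60},  # mid
    {"focal_n": 50, "focal_right": 45, "ref_n": 50, "ref_right": 45},    # high
]

curtains_dif = standardized_p_difference(curtains_item)
plain_dif = standardized_p_difference(plain_yardage_item)
print(f"curtains item: {curtains_dif:+.3f}, plain item: {plain_dif:+.3f}")
```

On these made-up numbers the curtains item comes out at about -0.115, meaning matched males are roughly 11.5 percentage points less likely to get it right, while the plain yardage item comes out at 0. The first item is picking up fabric vocabulary along with arithmetic; the second is not.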

This is what testing critics refuse to address. On the one hand, they castigate the shortcomings of the public school system for minorities, as well they should. But then they turn around and castigate the black-white score gap on the SAT, which is one of the biggest pieces of evidence proving that minority students do get shafted in the public school system. The SAT test score gap supports a call for K-12 reform. It supports the NAACP claim that black students don't get challenged enough in school, or are too quickly shoved into special education programs. It supports the claim that black students get shortchanged. Why, then, attack the test?

There is no indication in this article that Mr. Rosner has shown that the lower black scores are invalid; he's only shown that they exist. He did not show that the items measure extraneous concepts that are ineffable for even high-scoring black students, which is the only way the items could be biased, despite the insistence of testing critics on using whatever definition of "bias" suits their fancy.

So what exactly is his evidence?

Mr. Rosner is quick to point out that the bias he charges is not intentional on the part of the College Board. "They're not racists," he says. "The test company uses a completely neutral, colorblind system for picking questions. However, that system predictably, consistently, and reliably yields questions that favor whites dramatically over other subgroups."

How does that happen? Mr. Rosner says that the explanation comes from the way that designers of the SAT -- and of many other standardized tests -- define reliability. A reliable question that is intended to be difficult, for instance, is one on which those who score highest overall consistently do well, and on which those with low scores consistently do poorly, he says. Therefore, the highest achievers among test takers set the standard, and questions on which they do well look more statistically reliable than those on which they do poorly, even though some minority students may score higher on those questions.

"It's entirely internally cyclical and self-reinforcing," says Mr. Rosner.

This doesn't sound like any definition of reliability with which I am familiar, and I have no idea where Mr. Rosner is getting this information. To start with, difficulty and discrimination are two different properties of an item, and while they can be related, they don't have to be. A question can be both reliable and easy, which means that the high-scorers aren't "setting the standard." The only way high-scorers could set the standard would be on a test on which all the items are difficult, or all the easy items are completely unreliable, and neither of these applies to the SAT.
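A toy Python sketch shows how difficulty and discrimination come apart. It uses the classical definitions: difficulty as the proportion answering correctly, and discrimination as the point-biserial correlation between the item and the total score. The ten examinees and their responses are invented, and for simplicity the total score is taken as given rather than recomputed without the item, which a real analysis would usually do.

```python
from statistics import mean, pstdev


def item_difficulty(item_scores):
    """Proportion of examinees answering correctly (higher = easier)."""
    return mean(item_scores)


def item_discrimination(item_scores, total_scores):
    """Point-biserial correlation between a 0/1 item and the total score."""
    item_mean = mean(item_scores)
    total_mean = mean(total_scores)
    covariance = mean(
        [(x - item_mean) * (y - total_mean)
         for x, y in zip(item_scores, total_scores)]
    )
    return covariance / (pstdev(item_scores) * pstdev(total_scores))


# Hypothetical examinees, sorted from lowest to highest total score.
totals = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
easy_item = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]  # only the two lowest scorers miss it
hard_item = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # only the two highest scorers get it

for name, item in (("easy", easy_item), ("hard", hard_item)):
    print(
        f"{name}: difficulty={item_difficulty(item):.2f}, "
        f"discrimination={item_discrimination(item, totals):.2f}"
    )
```

In this made-up data the easy item (80% correct) and the hard item (20% correct) have the same discrimination, about 0.70. The easy item does its work by separating the lowest scorers from everyone else, so a well-behaved item does not need the top scorers to define it.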

And isn't Mr. Rosner saying here that the College Board should use a non-color-blind method to assemble the test, and use items that are of lower reliability (by his definition)? Like the hapless Mr. Freedle, also mentioned in this article, Mr. Rosner doesn't want bias eradicated; he just wants more items that favor blacks and a less reliable test so that the score gap will go away without effort. That way, we can all ignore the miserable job our K-12 system is doing of educating minority youth.

Wayne Camara of the College Board has quite a few good replies to both researchers:

Wayne J. Camara, vice president of research and development at the College Board, calls Mr. Rosner's analysis "fairly superficial and unsophisticated."

"We spend millions of dollars each year ensuring the fairness and validity of the SAT," he says. "It's a very intricate test-assembly process that we go through," he adds, noting that each question goes through at least six reviews, including a "fairness review" using widely accepted statistical methods.

"It's not just ethnicity and race, or male and female," Mr. Camara says. "We look at persons with disabilities and people with different religious preferences as well."

Mr. Camara agrees that some types of questions show performance differences among different groups, but he describes those as "very small differences between students at that ability level." [Emphasis mine]

And he says those differences can be explained by different ability levels. [Emphasis mine, again] For instance, he says, "on geometry items, males do substantially better than females." But Mr. Camara says that such differences should not lead the testing service to stop testing geometry in the name of fairness. [In other words, group mean differences do not mean bias exists.]

"It's not practical, and it's not psychometrically defensible to privilege ethnic difference over the content," he says.

What about Mr. Rosner's offer to build a fairer test by picking different questions?

"He's using racial- and ethnic-group difference as the primary criterion for assembling a test," says Mr. Camara. "It's not difficult to do that, but the test you end up with is not measuring what you want to measure."

Exactly. Both Mr. Rosner and Mr. Freedle envision a world in which tests are assembled so that both the black and the white test-takers will do exactly the same overall, and if this means that the items are genuinely biased, or unreliable, or that the test measures a lot of noise along with reading comprehension and math skills, so be it. This is what such critics consider to be "fair," in fact, and it highlights just how far away the concept of "fair" has moved from the concepts of "valid" and "color-blind" for the anti-testing agitators in this era of racial over-sensitivity.

Posted by kswygert at October 6, 2003 03:26 PM