May 16, 2003

Racial bias on the SAT?

John Rosenberg of the always-excellent Discriminations sent along a link to a review in today's Chronicle of Higher Education. I don't have a subscription, but John was kind enough to reprint the review in full in his email to me:

The SAT is not making the grade, says Roy O. Freedle, a former senior research psychologist at Educational Testing Service. He writes that the test is biased against minority students and needs to be reformed to more accurately represent their achievement and potential.

Mr. Freedle compared the performances of black students and white students on what are considered the easy questions and hard questions on the test. Among students who had received the same overall score, he says, the black students had consistently scored a little better on the hard questions and a little worse on the easy ones. Mr. Freedle hypothesizes that the easy questions, in both the verbal and math sections of the test, use a more common vocabulary, which is open to a wider variety of interpretations and associations based on one's cultural background. However, the hard questions, he says, use a rarer vocabulary that has fewer meanings and is more likely to be encountered only in an academic setting.

His proposed solution is a simple one: score only the hard questions. Mr. Freedle calls this method of scoring the test "the Revised-SAT, or R-SAT." He suggests sending colleges an R-SAT score along with the regular one, reasoning that it would result in more black students' being admitted to prestigious institutions.

Subscribers to Harvard Educational Review can read the article online, and others can obtain information about the journal at http://www.gse.harvard.edu/~hepg/her.html

Okay. That's an... interesting theory. Without having read the report, it appears Mr. Freedle is defining bias solely as the differential probability, for different subgroups of equal ability, of answering items correctly. That's not a bad definition of bias, but most psychometricians would look for more evidence.

Before stating definitively that the SAT items were biased one way or another, a researcher might want to examine the factor analytic structure of the test, to see if it differs for the different subgroups (meaning, roughly, that the items seem to be related to one another in different ways). Internal differences can also point to bias - does the rank ordering of item difficulties differ for different subgroups? And finally, if there's a difference in the slope of the regression lines - if college GPA, for example, is significantly less predictable from SAT scores for blacks than for whites - that's also evidence of test bias. The SAT is known to overpredict first-year GPA for black students (more so for males than females), but in that case the intercepts of the lines are different for different groups, not the slopes.
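For readers who want to see the slope-versus-intercept distinction in action, here's a rough sketch. The numbers are made up purely for illustration (they are not SAT or GPA data), and the group labels and the fit_line helper are my own hypothetical names. It fits a separate regression line for each group, shows the slopes match while the intercepts differ, and then shows how a single pooled line overpredicts for the group with the lower intercept.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: both groups gain the same GPA per SAT point,
# but group B sits 0.2 GPA points lower at every score.
sat = [400, 500, 600, 700]
gpa_a = [2.0, 2.5, 3.0, 3.5]
gpa_b = [1.8, 2.3, 2.8, 3.3]

intercept_a, slope_a = fit_line(sat, gpa_a)
intercept_b, slope_b = fit_line(sat, gpa_b)

# One line fitted to everyone, ignoring group membership.
pooled_intercept, pooled_slope = fit_line(sat + sat, gpa_a + gpa_b)

# How far the pooled line overshoots group B's actual GPA at SAT = 600.
predicted_at_600 = pooled_intercept + pooled_slope * 600
overprediction_b = predicted_at_600 - gpa_b[sat.index(600)]

print("slopes:", round(slope_a, 4), round(slope_b, 4))
print("intercepts:", round(intercept_a, 2), round(intercept_b, 2))
print("pooled overprediction for group B:", round(overprediction_b, 2))
```

In this toy setup the test is equally predictive for both groups (same slope), yet the common line still runs high for one of them. A real analysis would also test whether the differences are statistically significant, which this sketch doesn't attempt.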

So it seems Mr. Freedle is talking here only about bias due to differential item functioning (DIF) - when members of different groups who have the same abilities have different probabilities of answering an item correctly. Even this one method is not universally accepted - there is plenty of controversy about which DIF statistic to use, or what matching score to use for the subgroups - but let's assume for the moment his calculations of DIF are correct.
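To make the idea of DIF concrete, here's a minimal sketch of one simple way to compute it for a single item: match test takers on total score, compare the proportion answering correctly in each group at each score level, and average those differences weighted by the size of the focal group. This is in the spirit of the standardization approach to DIF, but I have no idea which statistic Mr. Freedle actually used, and the function names and response data below are hypothetical.

```python
from collections import defaultdict


def standardized_p_difference(records):
    """Focal-weighted mean of (focal - reference) proportion correct.

    records: iterable of (group, total_score, correct), where group is
    "focal" or "reference" and correct is 1 or 0 for the studied item.
    """
    # Per score level and group: [number correct, number of test takers].
    cells = defaultdict(lambda: {"focal": [0, 0], "reference": [0, 0]})
    for group, score, correct in records:
        cell = cells[score][group]
        cell[0] += correct
        cell[1] += 1

    weighted_diff = 0.0
    total_weight = 0
    for cell in cells.values():
        f_right, f_n = cell["focal"]
        r_right, r_n = cell["reference"]
        if f_n == 0 or r_n == 0:
            continue  # no matched comparison possible at this score level
        weighted_diff += f_n * (f_right / f_n - r_right / r_n)
        total_weight += f_n
    if total_weight == 0:
        raise ValueError("no score level contains both groups")
    return weighted_diff / total_weight


def make_records(group, score, n_right, n_total):
    """Build made-up responses: n_right correct out of n_total."""
    return [(group, score, 1)] * n_right + [(group, score, 0)] * (n_total - n_right)


# Hypothetical item: at each matched score, the focal group answers
# correctly less often than the reference group.
records = (
    make_records("focal", 500, 2, 4) + make_records("reference", 500, 3, 4)
    + make_records("focal", 600, 3, 4) + make_records("reference", 600, 4, 4)
)
dif = standardized_p_difference(records)
print(dif)  # negative means the item is harder for the focal group
```

A value near zero means matched test takers do about equally well on the item regardless of group. Note that the answer depends on what you match on, which is exactly where the controversy over the matching score comes in.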

However, there's still a problem here. Mr. Freedle is describing the SAT items as biased, in this case meaning exhibiting DIF. But he isn't suggesting that we try to rid the SAT of DIF. He is suggesting that we supplement a test that is partly biased against high-scoring blacks, and partly biased against high-scoring whites, with the addition of a test that is solely biased against high-scoring whites. I mean, you can try to gloss over this fact by talking about how it benefits black students, but unlike many other things in life, item bias is a zero-sum game. If high-scoring black students have a better chance of answering an item correctly than high-scoring white students, then the item is measuring something other than what it is intended to measure, and white students are going to be disadvantaged by a recounting of those items.

Mr. Freedle is suggesting that SAT items, in addition to measuring their primary dimensions of verbal and math skills, are measuring a second dimension - this "academic vocabulary." By suggesting that we emphasize it, he is suggesting that it is an important dimension, and not just "noise." But if the items - especially math items - are not intended to measure this vocabulary, then this dimension is noise, and there's no justification for enhancing the noise by over-emphasizing the biased items.

Mr. Freedle obviously has an ideological end in mind. He believes more black students should be admitted to top-tier universities. But to suggest that we support this endeavor by emphasizing biased items - again, he's the one who has defined the items as "biased" - is not psychometrically sound.

The more I thought about this, the more unsure I was of the soundness of his arguments, and whether there were data that would refute them. So, I decided to visit the College Board's online research library (yeah, on a Friday night, I know - I lead such an exciting life), and whaddaya know, there's already a rebuttal to Mr. Freedle's study posted on the site. It assumes the reader has read the report, but it also provides a great deal of explanation for just where Mr. Freedle went wrong:

Roy O. Freedle's recent article in Harvard Educational Review, entitled "Correcting the SAT's Ethnic and Social-Class Bias: A Method for Reestimating SAT Scores," is based on small differences between white students' responses and the responses of students from other ethnic groups to test items that were discussed by a number of researchers...Although any study that purports to reduce group differences must be looked at seriously, Freedle's study is so flawed that its conclusions are misleading.

There are myriad technical problems with the report, including misuse of regression and differential item functioning (DIF), and even a misunderstanding of how scores on the SAT are calculated. But one need not be a psychometrician to understand the fundamental problem with the study. The reduction in group differences is not the result of more sensitive or appropriate measurement, but rather, it is because the proposed measure relies mostly on students' guessing the answers to test questions.

To probe a little deeper, let us examine more closely Freedle's argument around DIF. Researchers have found that, on average, African-American, Hispanic, and Asian-American students tend to choose the correct response on easy test questions slightly less often than white students with an equal total test score. In contrast, they choose the correct response on difficult test questions slightly more often than white students with an equal total test score. Noting that this phenomenon occurs with SAT vocabulary questions but not with critical reading questions, Freedle suggests that the College Board should dispense with SAT critical reading questions, as well as the easier half of all vocabulary questions to improve the scores of ethnic minority test-takers.

The suggestion that critical reading be dropped or de-emphasized on the SAT, given its importance for success in college, would not be educationally or psychometrically sound even if it were based on a credible analysis...Freedle himself notes that the critical reading items lack what he calls "the familiar pattern of bias."

To summarize so far - Mr. Freedle is suggesting dropping items that show no bias, according to his own results. The College Board alleges that he doesn't even correctly grasp the scoring method of the SAT, much less calculate DIF in the proper fashion. Doesn't look good for Mr. Freedle, does it?

Let us look briefly at the data for the so-called SAT-R Section that Freedle recommends. On the difficult items that are included in the SAT-R, African-American candidates receive an average score of 22 percent out of a perfect score of 100 percent. Since there are five answer options for each question, 22 percent is only slightly above what would be expected from random guessing, namely 20 percent. White candidates do somewhat better, achieving an average score of 31 percent. [I'm assuming this gap is smaller than for the SAT overall.] The results indicate that this test is too hard for either group and would be a frustrating experience for most students. There are simply too many questions that are geared to those with a much higher level of knowledge and skill than is required of college freshmen. Extending Freedle's argument, we could substantially reduce all group differences if the test were made significantly more difficult so that all examinees would have to guess the answers to nearly all of the questions. We could then predict that each subgroup would have an average of 20 percent of their answers correct, based on chance...

In brief, Freedle's suggestions boils down to capitalizing on chance performance. This kind of performance may represent either random guesses, or unconnected bits of knowledge that are not sufficiently organized to be of any use in college studies.

Very interesting. I hadn't even considered the guessing argument, but then, I wasn't aware of just how difficult the difficult items were. The College Board is claiming that the proposed revised SAT would not be a true measure of anyone's ability, because it would be so difficult a test that most test takers would be guessing the answers. If black students at high ability levels guess better than white students, that is most certainly not a valid measure of ability.
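The arithmetic behind the guessing argument is easy to check. Here's a quick sketch under a deliberately crude assumption of my own (not the College Board's): a test taker either knows the answer to a question or guesses uniformly at random among the five options. Under that assumption, the observed percent correct can be backed out into a rough "percent actually known." The function name is hypothetical.

```python
def proportion_known(observed, n_options=5):
    """Share of items known, if every unknown item is a blind guess.

    Solves observed = known + (1 - known) / n_options for known.
    """
    chance = 1 / n_options
    return (observed - chance) / (1 - chance)


# Averages quoted in the College Board rebuttal for the hard items.
known_black = proportion_known(0.22)
known_white = proportion_known(0.31)

print(round(known_black, 4))
print(round(known_white, 4))
```

By this rough reckoning, an average of 22 percent correct works out to knowing only about 2.5 percent of the hard items, and 31 percent correct to about 14 percent. Real test takers can often eliminate an option or two, so treat these as ballpark figures, but they do show how little of the hard-item score reflects anything other than chance.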

As the College Board puts it, "Freedle's suggestions boils down to capitalizing on chance performance." For those of you not in the field of psychometric research, the statement that one is "capitalizing on chance" is synonymous with saying, "You started with the end result in mind, and now you're trying to prove that the data show more than they actually do, and if you collect another set of data, you'll get a different answer, because your results aren't going to generalize." It's an important and fundamental criticism to make against a research study.

The rebuttal also emphatically denies that the mathematics questions measure any sort of secondary vocabulary dimension, which removes any justification whatsoever for creating a revised SAT for difficult math items. Overall, the rebuttal feels pretty definitive to me, but it won't surprise me if reporters pick up on Mr. Freedle's article without mentioning the rebuttal. The buzzwords of "racial bias" and "SAT" will be just too tempting for some to ignore, and chances are they won't look further to assess the validity of Mr. Freedle's claims.

Posted by kswygert at May 16, 2003 09:55 PM