January 03, 2003

"Doubts" have been "raised" about high-stakes testing

Got a couple of emails about this article in the NYT, and Mark Kleiman's comments on it. Let me say right up front that I am not at all surprised to see articles that raise "doubts" about testing in the New York Times, because the NYT tends to print anything that is negative about testing, and very little that is positive. The article reports almost in passing that the study, which allegedly shows that high-stakes testing leads to poorer academic performance and increased dropout rates, is financed by teachers' unions, which is enough to make me very skeptical of the "unbiased" approach of the study right off the bat.

So why should we be doubtful about high-stakes tests?

...after adopting [high-stakes state] exams, twice as many states slipped against the national average on the SAT and the ACT as gained on it. The same held true for elementary-school math scores on the National Assessment of Educational Progress...Trends on Advanced Placement tests were also worse than the national average in 57 percent of those states, while movement in elementary-school reading scores was evenly split — better than the national average in half the states, worse in the other half. The only category in which most of the states gained ground was middle-school math, with 63 percent of them bettering the national trend.

The study's author, Ms. Amrein, concludes that "Teachers are focusing so intently on the high-stakes tests that they are neglecting other things that are ultimately more important." Well, perhaps this is true, if we can conclude that correlation implies causation, and we can't. It's possible that in some states, the high-stakes exams in place are constructed to measure educational achievement that is not the same as what is measured on the SAT, ACT, and NAEP, and that teachers are neglecting those skills to focus on the state exams. That isn't the only possible explanation, of course, but even if it were, why are the SAT and ACT being used as benchmarks here, when state tests and college entrance exams don't necessarily measure the same things? I also find it amusing that the oft-demonized SAT is described here as a measure of what's "ultimately more important," because such a definition allows the study's authors to criticize state exams. I never knew there was such support for the SAT among state testing critics, did you?

And hey, how 'bout those math scores? Can we get some confetti over here, given that the math scores improved more often than the AP tests declined?

Perhaps most controversial, the study found that once states tie standardized tests to graduation, fewer students tend to get diplomas. After adopting such mandatory exit exams, twice as many states had a graduation rate that fell faster than the national average as those with a rate that fell slower. Not surprisingly, then, dropout rates worsened in 62 percent of the states

And this is no surprise; at least, it shouldn't be, to anyone who has taken a moment to think about the purpose of exit exams. These kinds of exams cannot replace a sound school curriculum and they will not "fix" poor schools. This kind of graduation-rate change is not just a reason to criticize the exit exams, though. It's a wake-up call. Exit exams are notoriously easy to pass and students are often given multiple chances to pass them. Why can't students pass these tests?

The reason for [the increased dropout rate] is not solely that struggling students grow frustrated and ultimately quit, the study concluded. In an echo of the findings of other researchers, the authors asserted that administrators, held responsible for raising test scores at a school or in an entire district, occasionally pressure failing students to drop out. In lawsuits, educators have testified that students were held back rather than promoted to a grade in which high-stakes tests were administered, and that others were expelled en masse shortly before testing days. But neither those witnesses nor this study has been able to quantify that circumstance nationally.

I'm not surprised to hear that there are isolated cases of fudging the numbers - in this case, the students - to make the results look better, but this is not happening nationwide. It's not happening on a large scale. Students are dropping out because the exit exams that were designed are fairly reasonable, but not perfect - and they're not subject to grade inflation, social promotions, and all those other elements that the teachers' unions who funded this study have a hand in. I'm not a fan of exit exams, but we can't drop them without taking a close look at what it means when, as in the case of Santa Ana, California, more than half the 11th-graders cannot pass, on multiple tries, a test designed at a 6th-to-10th-grade skill level.

The study has drawn its share of detractors, in no small part because one of its authors, David Berliner, has been a critic of school vouchers and other education proposals often championed by conservatives

And it was commissioned by a group "which has opposed using any one test to determine when students graduate, schools get more money and teachers are replaced." The commission came from an anti-testing group, the funding came from the teachers' union, and one of the authors opposes school vouchers, and presumably the accountability that comes with them. Thus, it's not unreasonable to be skeptical about the results from this study.

Soundness of the data aside, some of Mr. Berliner's critics question whether such tests are to blame for the poor showings. "You almost never have a pure cause-and-effect relationship," said Chester E. Finn, assistant secretary of education in the Reagan administration. "Yes, you're introducing high-stakes tests, but maybe you're also changing the way you license teachers, or extending the school day, or changing textbooks. There's always a lot of things going on concurrently, so you really cannot peg everything to the high-stakes tests."

That would be the "correlation does not imply causation" concept that I mentioned earlier. I'm not saying that I know an easy way to track school progress, because it's a moving target. But it's apparent to me that testing critics who rush to criticize the high-stakes decisions based on a small sample of test scores are here rushing to draw a lot of conclusions from, well, a small sample of test scores. Let me see if I have this straight - if SAT scores go down over a short period of time, that's cause for alarm, but if state testing scores go down over the same period of time, we should ignore that, and not hold the schools accountable for it?

A larger question raised by the study is what effect, if any, it will have on the public debate over high-stakes testing. While many educators will most likely hold it up as proof that such exams are flawed, largely because they appear to offer inadvertent encouragement to schools to constrain the curriculum and squeeze out underachievers, others see the issue as more open-ended. "Should we just make better tests," asked Anthony G. Rud Jr., associate professor of education at Purdue University, or "is there something fundamentally wrong with testing in this matter?"

Great question! Let me answer that - there is nothing fundamentally wrong with testing in this matter. Testing can be done in an awful fashion, just like anything else, but research such as this, if in fact this is valid research, can be useful as a springboard for making better tests. It is in no way a powerful indictment of tests themselves.

And on that note, let me segue to Mark Kleiman's generally approving comments on the article and the study in question.

Mr. Kleiman believes that high-stakes testing fails because the presence of high stakes encourages cheating. By that reasoning, all grades should be eliminated, because certainly students cheat in class, and all taxes should be eliminated, because taxpayers certainly like to cheat on their 1040s. Indeed, by this logic, any strict set of rules that has ever been broken, or any set of standards that has ever been circumvented, can be tossed out the window.

Mr. Kleiman believes that high-stakes tests fail because they cost too much and only measure a subset of what we want students to know. This completely ignores the research on computerized testing in schools, which is much cheaper once implemented and can enhance the diagnostic aspect of testing (Joanne Jacobs has a great Tech Central Station article on this topic here). And tests always measure only a subset of the construct domain of interest, but that isn't a shortcoming of tests, just a definition. This doesn't stop teachers from giving kids subject tests in the classroom, nor does it stop the government from requiring driver's licence exams. The fact that you can't measure everything about an educational ability or domain with one test doesn't matter. What matters is whether the constructs that the test is measuring are a useful subsample of the important constructs, and whether the test is a reliable and valid enough measure of the general educational domain to allow for generalizability from the scores to overall educational achievement.

Mr. Kleiman believes that measurement error for these tests is so large that it's hard to make judgments based on year-to-year fluctuations. I agree with him on this point, but it is not a given that a high-stakes test will have a large measurement error. What's more, the authors of this study have no problem with making a negative judgment about high-stakes testing based on year-to-year fluctuations on the SAT, ACT and NAEP. So even testing critics must agree that some tests do measure educational constructs validly and reliably, and there's no reason that some high-stakes decisions can't be based on good tests.

And here's Mr. Kleiman's last point:

The critiques of the study by the proponents of testing, including Chester Finn, are pretty pitiful; that suggests that, despite its funding by a coalition of teachers' unions, the study must be methodologically sound. (Finn is reduced to suggesting that the other educational "reforms" in the state-level packages that included high-stakes testing must be at fault.)

This one comment contains so many errors that I don't know where to begin.

1. Dr. Finn's comments are not "pitiful", unless you consider it pitiful that he has to point out to a published researcher that you can't assume causation from correlation.

2. As for the relative dearth of comments from the pro-testing crowd in this article, I'm amazed that the NYT put any such comments in at all. It's well-known in the psychometric field that few journalists have the names of many psychometricians on their Rolodexes, and fewer still bother to present both sides of the story. The unbalanced, anti-testing attitude in this article is by no means representative of the general psychometric consensus or research consensus on this topic, nor of the attitudes of parents and educators. But the NYT rarely sees fit to ask any of us for our opinions.

3. Dr. Finn is not "reduced" to suggesting that other variables might be confounding the results; it's as valid and strong (and, unfortunately, common) a criticism as you can make of such studies.

4. Finally, I don't know what to make of the scare quotes that Mr. Kleiman places around the word "reforms." Apparently he doesn't think it feasible that any of the budget cuts, teaching staff restructuring, new curricula, and other massive educational changes of the last few years could have any sort of effect on educational performance. He's talking through his hat here.

But then, I agree completely with his final conclusion: "But there is now no case whatever for continuing to combine high stakes with low measurement quality. Been there, done that, got the T-shirt. Stinks. Next case, please." Too bad the study authors seem to have studiously avoided drawing any conclusions from their data that could help to improve high-stakes testing, rather than eliminate it.

Bottom line? Well, I haven't read the study, so I can't judge it any more carefully. Doesn't it seem, though, that the anti-testing critics are just plain mad about the fact that high-stakes testing scores have risen in some states, and they can't wait to find some reason why this just can't be a good thing?

(Hat tip to Franco B. and SCFoth, both of whom sent the article and Mr. Kleiman's comments to me.)

Posted by kswygert at January 3, 2003 06:21 PM