June 18, 2003

More test scoring errors?

A new study released by the National Board for Educational Testing and Public Policy claims that the number of human scoring errors reported by testing companies has risen to an indefensibly high number. The report, not surprisingly, cautions against the use of exam scores in high-stakes decisions, and bemoans the lack of an overseeing US agency to audit and regulate testing processes and products.

The authors point out that all test scores contain some form of error, if only random measurement error; that both random and human error should be considered when using test scores for high-stakes decisions; and that forcing companies to produce high-stakes tests without sufficient development time is asking for disaster. These points are correct. However, I don't feel the authors really offer any useful solutions.

For one, the question of what to use for high-stakes assessments, if not tests, is left unanswered. The usual suspects - grades, performance assessments, holistic judgements, and the like - are not only at least as error-prone as test scores, but are also harder to assess for error in the first place. Using them to compare schools within states, or states within the country, is not feasible. Schools may decide to use a combination of assessments for exit exam purposes, but the standardized tests were often chosen specifically because, due to their objective nature, they do reduce the random measurement error.

Does it improve matters if a school tries to balance out a standardized test, which is objectively scored but may have human error, with a performance assessment, which is guaranteed to contain more random measurement error than the standardized test, and perhaps more human scoring error as well? The authors warn against using a single score to make a decision - but what if the other possible scores for that decision are less reliable and more subjective? It might be appropriate to use these scores in conjunction with the objective test, but that's by no means a given.
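
To put rough numbers on that question, here is a minimal sketch of the standard formula for the reliability of a weighted sum of two standardized scores, assuming their measurement errors are uncorrelated. The function name and every value below are hypothetical illustrations, not figures from the report.

```python
def composite_reliability(w1, w2, rel1, rel2, corr):
    """Reliability of w1*X1 + w2*X2 for standardized scores X1 and X2.

    rel1, rel2: reliabilities of the two scores.
    corr: observed correlation between the two scores.
    Assumes the measurement errors of X1 and X2 are uncorrelated.
    """
    shared = 2 * w1 * w2 * corr                        # covariance term (all true-score variance)
    total_var = w1**2 + w2**2 + shared                 # observed variance of the composite
    true_var = w1**2 * rel1 + w2**2 * rel2 + shared    # true-score variance of the composite
    return true_var / total_var


# Hypothetical values, chosen only for illustration.
objective_test = 0.90  # reliability of the standardized test alone

# Paired with a much less reliable performance assessment.
weak = composite_reliability(1, 1, objective_test, 0.60, 0.50)

# Paired with a second score that is nearly as reliable.
strong = composite_reliability(1, 1, objective_test, 0.85, 0.70)

print(round(weak, 3), round(strong, 3))
```

With these made-up inputs, the first composite comes out less reliable than the objective test alone (about 0.83 versus 0.90), while the second comes out slightly more reliable (about 0.93). Whether adding a second score helps depends on how reliable that score is and how it is weighted.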

One of the testing errors listed in this study (page 36, error (19)1999) is an example of this. This error is essentially a training error for human scorers - an error that is all too common in performance assessments, and one that could have been avoided with a more standardized, objective judgment.

It bothers me as well that the authors don't distinguish between degrees of errors. For example, of the 78 errors listed in Appendices A and B, 17 of them are situations that involve only one or two miskeyed, typographically-incorrect, or otherwise-flawed items on a test. Errors, yes; indications that the testing company is going to hell in a handbasket, no. Lumping these smaller, well-nigh-inevitable flubs in with the more substantial equating and scoring problems dilutes the impact of the bigger problems.

The authors also fudge the numbers a bit by counting some errors more than once if they affected more than one state. For example, TerraNova's massive miscalculation of percentile scores is counted four times in the Appendix (p. 37). Is an error more than one error if it affects more than one set of tests? In one sense, perhaps, but in another sense, if it's all one root cause, then it shouldn't be counted as multiple errors.
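
The two ways of counting can be made concrete with a small sketch. The records below are hypothetical placeholders (the state names and the second root cause are invented for illustration, not copied from the Appendix); the only point is how far the tallies diverge when one root cause touches several states.

```python
# Hypothetical (root_cause, state) records, invented for illustration only.
entries = [
    ("percentile miscalculation", "State A"),
    ("percentile miscalculation", "State B"),
    ("percentile miscalculation", "State C"),
    ("percentile miscalculation", "State D"),
    ("miskeyed item", "State E"),
]

# Counting every affected state as a separate error.
per_state_count = len(entries)

# Counting each distinct root cause once.
root_cause_count = len({cause for cause, _ in entries})

print(per_state_count, root_cause_count)
```

Both tallies are defensible answers to different questions: the first measures how many testing programs were affected, the second how many distinct mistakes were made.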

And in two cases, the authors have labeled something that was not under the control of the testing company as an error. On page 44, error (3)1980, we see that in 1980 ETS informed 163 students that tests were lost. This could have been ETS's error; more likely, the person at the other end who was responsible for shipping the tests did not follow directions - or the tests were lost (or stolen) through the mail. I can personally vouch for the fact that this has happened at companies other than ETS.

And speaking of stolen tests, check out page 47, error (19)2001. In this situation, someone stole a test form, and ETS, after suspecting cheating, demanded a retest at that high school. That's NOT a testing error. People steal test forms all the time, and test companies are forced to declare the forms and items missing. They may choose to do a retest, and if the scores soar upwards, it's perfectly legitimate for them to suspect cheating. If they catch the thieves, they can slap them with theft and copyright violation charges.

Can the amazing proliferation of testing in the late '90s explain some of these errors? It certainly can, in more ways than one. For starters, more tests mean more errors. It's also possible that testing errors which would not have been discovered before are being discovered now, because of the increasingly high-stakes nature of many of these new exams. With the big, established testing companies, more quality assurance checks are in place now, and so more errors are being caught. This doesn't necessarily mean more errors are occurring. I think the numbers from 1976 are impossibly low, and that there were some errors back then that went undiscovered.

It's probably true, though, that not only are there more tests, but some of them have been rushed into production and administration, at speeds that would not have previously been acceptable. New testing companies have also sprung up to meet the need, and not all of these companies follow good quality assurance guidelines.

Some of the errors reported in this study are indeed problematic, and could have been avoided with better quality assurance systems (and more time for test development). The states and school districts have their roles to play as well, in choosing testing companies carefully and following standard procedures. As for the suggestion of a US agency to oversee testing, well, I'm skeptical about the potential of a federal agency to streamline the situation. Is the testing industry really analogous to the airline industry, in which a much-needed reduction in fatalities was achieved through federal intervention? Or would further meddling from Washington DC only exacerbate the problem?

Posted by kswygert at June 18, 2003 11:55 AM