March 23, 2006

A little oops with a big impact, part four

The deluge continues:

An additional 393 students who took the SAT in October earned higher scores than were initially reported to them, bringing the total number of test takers who received incorrectly low scores to 4,411, the College Board announced on Wednesday.

In a written statement, the College Board said it learned last weekend that its test-scanning contractor, Pearson Educational Measurement, had not "fully evaluated" 27,000 of the total 495,000 exams from October. Two weeks ago, the board said it already had rescanned nearly all of the tests and found approximately 4,000 students whose actual scores were higher than those first reported. After reviewing the 27,000 tests that had not yet been "fully evaluated" this week, Pearson found another 375 students who had received falsely low scores.

Last week, the College Board announced that it had failed to include 1,600 exams in its initial review. After those answer sheets were rescanned, the board found an additional 18 students with falsely low scores, according to its statement on Wednesday. The board did not disclose the number of students who had received falsely high scores.

Problem is, admissions officers may be just as interested in falsely high scores as in falsely low ones. And again, why was any public statement at all made before the entire pool of suspect exams had been fully evaluated?
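For anyone trying to keep the running totals straight, here is a minimal sketch in Python that reconciles the figures quoted above. The variable names are mine, and the "implied first count" is an inference from the arithmetic, not a number the board stated.

```python
# Figures quoted in the College Board statement above.
total_low_scores = 4411          # running total announced on Wednesday
from_unevaluated_27000 = 375     # found among the 27,000 exams not "fully evaluated"
from_overlooked_1600 = 18        # found among the 1,600 overlooked exams
october_examinees = 495_000

newly_found = from_unevaluated_27000 + from_overlooked_1600
implied_first_count = total_low_scores - newly_found  # inferred, not stated by the board

print(newly_found)                                   # 393
print(implied_first_count)                           # 4018
print(f"{total_low_scores / october_examinees:.2%}") # 0.89%
```

So the "approximately 4,000" from two weeks ago works out to 4,018 if the new total is right.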

Posted by kswygert at 11:28 AM | Comments (3) | TrackBack

March 20, 2006

A testing blunder with federal dollars at stake

Sigh...

A scoring error by a standardized testing company changed accountability reports for 14 Alabama schools, putting four on probation when they actually met their goals for reading and math - while another 10 got passing marks they didn't earn, education officials said Monday.

The mistake by Harcourt Assessment Inc. affected nearly 2,500 students in grades 3-8 in 589 schools. An independent firm rescored the tests and found that 14 schools were put in the wrong category.

Four had been placed on a statewide "needs improvement" list when they had actually made Adequate Yearly Progress - a major part of proving accountability under the federal No Child Left Behind act.

Posted by kswygert at 06:55 PM | Comments (6) | TrackBack

March 15, 2006

A little oops with a big impact, part three

That little oops continues to make negative press:

The College Board disclosed a new problem yesterday in its efforts to assess and correct mistakes in the scoring of its October SAT test: an overlooked batch of 1,600 exams that have not been checked for errors.

The admission that there were still unchecked tests came a week after the board began notifying colleges that it was raising the SAT scores of 4,000 students whose tests had been graded incorrectly because of processing problems at a Texas scanning facility.

The revelation meant that colleges were likely to face a second scramble to reassess additional applicants just as the admissions season was drawing to a close.

This is just ugly. So ugly, in fact, that I'm wondering why this story was released last week before the full picture was known internally. It's very hard to understand why a testing organization would report an error before accounting for all possible test forms that could have been affected.

The error was first uncovered January 31st, when exams were hand-scored at the request of students. It's only March 15th. Six weeks is not a lot of time to investigate a scoring problem on an exam given to half a million students. Did the College Board feel pressure to rush to the press with the announcement of the mistake? It's good that they now appear to be disclosing everything they find, but this sort of trickling information makes the public wonder what it hasn't been told and, in fact, whether the College Board even realizes the extent of the problem.

It also causes news organizations to start using words like "scandal," rather than error.

Posted by kswygert at 12:05 PM | Comments (12) | TrackBack

March 09, 2006

A little oops with a big impact, part two

Remember how I said yesterday that it's difficult enough to report a testing error, but it's really embarrassing to report an error too early and then have to amend that report?

Uh-huh.

A day after the College Board notified colleges that it had misreported the scores of 4,000 students who took the SAT exam in October, an official of the testing organization disclosed that some of the errors were far larger than initially suggested. With college counselors and admissions officials scrambling to take a second look at student scores in the final weeks before they mail out acceptances and rejections, Chiara Coletti, the College Board's vice president for public affairs, said that 16 students out of the 495,000 who took the October exam had scores that should have been more than 200 points higher.

"There were no changes at all that were more than 400 points," Ms. Coletti said. She did not say how many students had errors that big...On Tuesday, Ms. Coletti characterized the largest errors as in the 80- to 100-point range.

As I said before, I don't see a problem with the time it took to investigate the scoring error. It's not wonderful that the error was uncovered by students requesting score checks, instead of via a QC process, but that happens in the real world. What could have been avoided, though, was the two-part error reporting here, where the public discovers on the second day that the initial error report understated (or otherwise mischaracterized) the severity of the problem. The cause might have been simple - incorrect information provided to the public affairs VP - but it can have a really ugly PR effect. Things like this convince the public that the testing company is not providing the full picture.

Posted by kswygert at 11:19 AM | Comments (2) | TrackBack

March 08, 2006

A little oops with a big impact

Bad press for the SAT as "technical errors" are reported:

About 4,000 students who took the SAT last October received test scores that were lower than they should have been — some by as much as 100 points — because of technical problems in the scoring process, the College Board said yesterday.

The College Board, which administers the SAT, said it had begun to notify college admissions offices, high school counselors and affected students this week in letters and in e-mail messages, and expected to complete the process by Thursday. It also said that it planned to return registration fees and charges for sending test scores to colleges to the students whose scores were in error.

It should be noted that this is less than 1% of the examinees who took the SAT reasoning test in October of 2005, but 4,000 unhappy examinees is still a lot of examinees. Snotty comments from FairTest spokesmen aside, it's just plain silly to propose that testing companies admit any error before they've had a chance to (a) fix it and (b) figure out what caused it. Rushing to notify the public before doing a complete investigation could lead to further embarrassments (like having to amend the first notification as complexities are uncovered). When half a million test forms are involved, a four-month period of error investigation doesn't seem overly long to me.

Of course, someone could make the argument that SAT scores should be held for four months following exam administration, so that any errors could be corrected in time, but my guess is colleges and examinees - and FairTest - won't go for that.

Posted by kswygert at 12:34 PM | Comments (10) | TrackBack

February 22, 2006

Bubble, bubble, toil and trouble

Saw this snippet on the Salt Lake Tribune's website:

Palm Beach County, Fla., created the controversial ''butterfly ballot'' in the 2000 presidential election that reportedly confused more than a thousand Gore-Lieberman voters such that they wound up marking their ballots for a minor-party candidate.

In February 2006, local education officials told the Palm Beach Post that too many of the county's high school students apparently knew answers on the statewide comprehensive test but were incorrectly marking the answer sheets. The multiple choice questions require only one circle to be darkened on the sheet, but other questions require darkening digits of an actual numerical answer, apparently bewildering students into darkening too many or too few circles.

So the FCAT answer sheet is the actual problem, hmm? My question is - how do the local education officials know when students would have gotten the questions right, when their answers were marked wrong? Did they ask them? Did they just flag forms with too many or too few digits marked? Or did they actually find lots of answer sheets with the correct answer bubbled in in the wrong place? If the student didn't have the right answer, then the answer sheet confusion isn't the real problem here.

The PBP has additional info:

The FCAT math wizards, it seems, haven't quite figured out that 7.5 and 7 1/2 are the same number. That might not seem like such a big deal. Not everybody can be a math genius. Lord knows I'm not. But this is a problem because Florida students who do know that 7.5 and 7 1/2 are the same number could lose FCAT points if they rely on that mathematical certainty.

Post reporter Nirvi Shah explained the glitch in a story on Monday. The state wants to grade the math tests by machine. Shouldn't be a problem, right? Machines have graded multiple-choice tests for decades. Students just "bubble in" the correct circle on an answer form with a No. 2 pencil. But the state's standardized test gurus had a problem with that...the student could guess.

...So, on many math questions, they require what is known as a "gridded response." The student works the math problem, then fills in the correct numbers or symbols of a "grid" that consists of five columns, each containing the numbers 0 through 9 and a decimal point. The middle three columns also offer a "/" symbol used to express fractions, if the student chooses to do so.

As Ms. Shah's story explained, if 7.5 is the correct answer to a problem, students get credit for gridding the 7, a decimal point and then the 5. But students who gridded the 7 in the first column, followed by the 1 in the second column, the / in the third column and a 2 in the fourth column — which works out to 7 1/2 — would be marked wrong.

Even though they were right.

Oh, yeah, that's a problem.
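Here is one plausible way a scoring machine could end up rejecting the mixed number, sketched in Python. If the grid is simply read left to right as a single number, then 7, 1, /, 2 is the fraction 71/2, not 7 1/2. This is my assumption about how such a reader might work, not anything documented for the FCAT, and `read_grid` is a hypothetical helper.

```python
from fractions import Fraction

def read_grid(columns):
    """Read gridded columns left to right as one number (hypothetical reader).

    Blank columns are skipped; a "/" makes the entry a fraction.
    """
    text = "".join(c for c in columns if c.strip())
    if "/" in text:
        numerator, denominator = text.split("/")
        return Fraction(numerator) / Fraction(denominator)
    return Fraction(text)

key = Fraction("7.5")  # the keyed answer from the article's example

print(read_grid(["7", ".", "5", "", ""]) == key)   # True
print(read_grid(["7", "1", "/", "2", ""]))         # 71/2
print(read_grid(["7", "1", "/", "2", ""]) == key)  # False
```

Under this assumed reader, a student who meant 7 1/2 has actually gridded thirty-five and a half, which no key is going to accept.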

Update: I should clarify my terse comment on this issue. The directions for the exam are indeed clear, and I should have linked to this earlier. It's quite possible that the newspaper coverage was less a sign of a sudden increase in misgrids and more a harbinger of a slow news day.

However, when I originally posted, I was thinking that, if in fact a lot of students are misgridding the answers - especially if there's a recent increase in that - and it's deemed worthy of newspaper coverage, there's a problem. Regardless of whether we assume instructions to be crystal clear, if the same wrong answer crops up again and again (when it's not intended to be an attractive distractor), there may be a problem with the key. It sounds like the problem may be, as one commenter alluded to, that not only do students not know math, they can't follow instructions either. However, it's reasonable to ask if there are in fact a lot of students out there who understand the concept of fractions and don't grid that properly on the exam (I suppose one could argue that their knowledge doesn't count unless they understand the concept of decimals as well).

I didn't originally say much, but I probably shouldn't have posted anything until I had the chance to explore the issue, and the media coverage, further to see if there was anything floating about suggesting that the number of misgrids had actually increased recently (the FCAT site, for one thing, says nothing on this issue).

One last thing, though - if you're going to leave a sarcastic comment about how you know where I work? Not only do you end up looking, well, silly, for getting that completely wrong, but I don't appreciate any public comments on the matter at all. I'll state for the record that I don't work for any "large testing company," but I'm also not opening the question up for public discussion. I figure I provide enough disclaimers and notification to ensure that no one confuses N2P with the official statements of any testing organization, and I don't want to have to add more.

Posted by kswygert at 10:27 AM | Comments (6) | TrackBack

February 07, 2006

If the Jacks have swords, doesn't that make this armed robbery?

Didn't "Cops and Robbers" used to be an innocuous child's game? Now, in today's politicized environment, it appears unwise to bring up the topic of burglary when children are involved:

An elementary school worksheet that tells the story of four people who get away with robbing a house and describes how to do a card trick has drawn criticism from a Baltimore mother who sees it as promoting criminal activity. The worksheet, called "The Four Robbers," is part of a booklet designed to prepare children for Maryland's standardized tests in March. It is intended to teach fourth-graders about sequence of events.

But Kenyona J. Moore, whose 9-year-old brought the worksheet home last week, said it promotes criminal activity to youngsters. "This is being given out to inner-city children," she told The (Baltimore) Sun. "The assumption is they can relate to this, and that's wrong."

The worksheet describes a card trick with four jacks, instructing the person doing the trick to say, "Imagine that the four jacks are robbers. They're going to rob a house." The first card, slipped into the bottom of the deck, represents the first robber, going into the first story of the house. The second and third cards are the robbers on the second and third stories. The fourth card, on top of the deck, is the robber on the roof looking out for police.

The person doing the trick is supposed to say: "Just then, the wail of a siren is heard. The robber on the roof says, 'Cops! Let's get out of here!'" The person peels off the top cards in the deck, showing that "the robber-jacks have magically migrated to the top of the deck!"

I'm finding myself more offended by the fact that Maryland schools are teaching their students card tricks as preparation for the exam. I can't find anything online about what the fourth-grade items might be, so I'm not sure what "sequence of events" means (logically? chronologically? spatially?) and how this trick could be useful.

Posted by kswygert at 03:36 PM | Comments (44) | TrackBack

December 13, 2005

An awfully big testing boo-boo in Ohio

People who criticize exit exams on the basis of the potential for error unfortunately have a point:

A testing company faces a fine after it mistakenly failed hundreds of students on Ohio's new graduation test, state education officials said Monday.

Measurement Inc. graded 1,599 tests and failed 890 students after accidentally converting raw test data to passing and failing grades, the state Education Department said. The error was made on tests given last summer to students entering their junior and senior years, as well as students who were in 12th grade last year but haven't graduated.

Whether the test was the only thing keeping any students from graduating -- and whether anyone might have wrongly been sent back to school this fall -- wasn't immediately clear.

The scores have since been corrected; the state and the company planned to notify 272 school districts this week whose students were affected. The corrected test scores still were not enough to pass for 543 students.

So let's do the math here. 1599 students across two grades. Most exit exams, though given to 11th- and 12th-graders, are set at a level below that. Thus, one would hope that the pass rate on such an exam would be around 70-80%, at least.

Measurement Inc comes up with a pass rate of only 45%. Even given how much we bemoan the current state of public school affairs, that's astonishingly low. The true rate - if we assume the corrected scores to reflect reality - is more like 66%, which is not great, but it does include a variety of school districts and seniors who hadn't managed to graduate yet.
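For the record, here is that arithmetic as a minimal Python sketch, using only the numbers in the article and assuming, as I did above, that 1,599 tests means 1,599 students. The variable names are mine.

```python
graded = 1599            # tests graded by Measurement Inc.
failed_initially = 890   # students failed under the erroneous scoring
failed_corrected = 543   # students still failing after the correction

initial_pass_rate = (graded - failed_initially) / graded
corrected_pass_rate = (graded - failed_corrected) / graded
wrongly_failed = failed_initially - failed_corrected

print(f"{initial_pass_rate:.1%}")    # 44.3%
print(f"{corrected_pass_rate:.1%}")  # 66.0%
print(wrongly_failed)                # 347
```

The exact figures land close to the rounded ones I'm using, and they put the number of students who were wrongly failed at 347.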

Someone should have caught this - specifically, Measurement Inc. should have. From the description given here, it sounds like raw scores were directly scaled onto the final score scale, and the equating was left out. That's a pretty damn big part to leave out. What's more, the horrendous 55% fail rate for a group of juniors and seniors should have been a major red flag. There's no way that equating was done properly if that were the case, not unless the state of Ohio had recently changed to using much harder test items, or just happened to not require students to actually show up for class last year.

I would not care to defend exit exams to someone who's been disadvantaged by this. When we attempt to convince the public that these tests can be advantageous in pointing out the importance of basic skills and the need for schools to teach them, we have to be willing to guarantee that we, the psychometricians, are not going to screw up on our basic equating and scoring skills. Errors like this are perhaps not unexpected as we gear up for The 21st Century aka The Century of Testing Anyone Who Moves, but errors like this are still unacceptable.

Posted by kswygert at 05:09 PM | Comments (3) | TrackBack

September 29, 2005

How not to design an exam

"Irregularities," indeed.

In a rare move, the State Personnel Board threw out a promotional exam for one of the highest ranks at the California Highway Patrol, ruling that the oral test for deputy chief was so riddled with irregularities that there was no way to tell which of the 17 candidates should have passed. The board's recent audit of the June 2004 exam found that the CHP could not produce the notes taken by two panel members conducting the exam, including former Commissioner Dwight O. "Spike" Helmick, contrary to personnel guidelines.

The exam panel had three members. One, who wasn't named, challenged some competitors' responses, prompted others and tossed the exam materials across a desk in a way that one applicant considered hostile, the audit said.

Completed last month, the report found that the exam panel showed possible bias against one applicant, Hubert A. Acevedo.

I'm always amused (in an annoyed sort of way) when testing critics insist that the innocent little objective multiple-choice item is biased and discriminatory, while "performance-based" exams that use "alternate" methods of assessment must naturally be more fair and accurate. This theory, of course, is assuming that the people involved in rating the examinees bring no biases whatsoever to the table, and that the lack of standardization involved in something like this has no effect on examinee performance. As demonstrated here, the beauty of multiple-choice items that are vetted through many developers and pre-tested on many examinees is that those items can be stripped of bias as much as possible before going live. In an oral exam where an examiner does something inappropriate, those safeguards for the examinee just aren't there.

(Hat tip: Darren.)

Posted by kswygert at 06:00 PM | Comments (4) | TrackBack

April 26, 2005

The perils of bad test items

The Education Wonks link to a NYT article summarizing serious problems with standardized test items:

Beware the perils of ambiguity. It is a mantra that is increasingly pertinent to tests in mathematics and science. The two fields might seem immune from imprecision. But in mathematics, for example, today's tests assess more than a student's ability to do "naked computation," as Cathy Seeley, president of the National Council of Teachers of Mathematics, puts it. In many places, calculators have rendered meaningless the testing of basic computational tasks. Instead, more questions test students' comprehension in real-world contexts. A triangle is a corner garden bed. A rectangular object intersected by a line is a juice box, with a straw. A sloped line on a graph represents a year's worth of payments to the power company.

With these scenarios come variables, and mathematicians and scientists from British Columbia to Boston spend much time picking apart the questions, particularly in online discussion groups. If students are asked how many seeds can be planted in the surface area of a triangular garden, do you put seeds in the corners where there isn't room for plants to take root? What about relevant considerations like seasonality of utility bills or position of the planets? Multiple-choice questions, with no place to show your work and thinking, make such realities more vexing.

These realities should vex everyone who thinks that a lengthy word problem is always more suitable than a simple computation. Word problems certainly have more face validity, and their champions claim they engage students in a way that straightforward computational items do not. But there's a lot more squiggle room in talking about a triangular flower bed than in talking about a triangle.

Field testing, of the type described below, is crucial:

Once questions are written, they are typically reviewed by multiple groups that include test writers, teachers, editors, statisticians and content specialists. And then most developers test the questions on real students in real exam settings. In field testing, statisticians may discover that most top-scoring students selected answer "d" when answer "c" was deemed correct. What made "d" so appealing to the advanced students? Could a flaw in the question have led them to arrive at an equally correct answer? In most cases, the incongruity is a red flag, prompting developers to discard the question.

Any organization that goes live without field testing, especially one that uses innovative items, is asking for trouble. But even field testing doesn't catch everything.
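The kind of check described in the excerpt can be sketched in a few lines of Python: rank examinees by total score, take the top group, and flag any item where that group's most popular answer is not the keyed one. This is a toy illustration, not any vendor's actual procedure; `flag_items`, the data layout, and the 25% cutoff are all my own assumptions.

```python
from collections import Counter

def flag_items(responses, answer_key, top_fraction=0.25):
    """Return items where the top scorers' most common answer is not the keyed one.

    responses:    one dict per examinee, mapping item id -> chosen option
    answer_key:   dict mapping item id -> keyed (supposedly correct) option
    top_fraction: share of examinees treated as "top-scoring" (arbitrary default)
    """
    def score(r):
        return sum(r.get(item) == ans for item, ans in answer_key.items())

    ranked = sorted(responses, key=score, reverse=True)
    top = ranked[:max(1, round(len(ranked) * top_fraction))]

    flagged = []
    for item, ans in answer_key.items():
        counts = Counter(r[item] for r in top if item in r)
        if counts and counts.most_common(1)[0][0] != ans:
            flagged.append(item)
    return flagged

# Made-up field-test data: the strongest examinees all pick "d" on q2.
answer_key = {"q1": "a", "q2": "c", "q3": "b"}
responses = [
    {"q1": "a", "q2": "d", "q3": "b"},
    {"q1": "a", "q2": "d", "q3": "b"},
    {"q1": "a", "q2": "d", "q3": "b"},
    {"q1": "b", "q2": "c", "q3": "a"},
    {"q1": "c", "q2": "a", "q3": "d"},
    {"q1": "d", "q2": "b", "q3": "c"},
]
print(flag_items(responses, answer_key, top_fraction=0.5))  # ['q2']
```

A flag like this doesn't say what is wrong with the item, only that someone should look at it before it goes live.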

I disagree, though, that the situation is always better when multiple-choice items are removed. Yes, multiple-choice items that are poorly written can confuse students if there is more than one right answer, or no right answer. But all answers must be scored, and it's quite possible that the time and expense needed to create a scoring rubric to account for all possible answers on an open-ended item is more than what's needed to create and field-test a decent-sized pool of MCQs.

There's a nice list of pros and cons for various item types here. Note that for every type except multiple-choice, scoring becomes more time-consuming, more challenging, or both.

If you're really interested in writing some good multiple-choice items, you can't do better than this set of guidelines, "Constructing Written Test Questions for the Basic and Clinical Sciences," by Case and Swanson. The booklet is tailored to medical science items, but the techniques could be easily adapted to other fields.

Posted by kswygert at 04:54 PM | Comments (10) | TrackBack

January 11, 2005

Sounds like a testing problem

Things like this don't help the public's perceptions of standardized tests:

Students at the Maryland School for the Deaf were asked on a standardized test to match words containing similar sounds, and state education officials promised to adjust the scores after acknowledging the problem.

The state Department of Education also will ensure that questions in this year's version of the Maryland School Assessment are appropriate for hearing-impaired students, spokesman Bill Reinhard said Monday.

The changes follow complaints by James E. Tucker, superintendent of the Maryland School for the Deaf, that the reading section of the test asked third- and fourth-grade students to match pairs of words with similar sounds, such as the vowel sound in "castle" and "manner."

"As a deaf person, I'm not familiar with sounds," Tucker told The Frederick News-Post. "I have a problem answering these questions myself, and I'm an education man."

More grist for the one-size-doesn't-fit-all mill, I suppose. Makes me wonder, too, how many of those I'd've gotten wrong as a kid, with my pronounced Southern accent. I still don't rhyme "route" with "boot", for example, although I suppose the test developers were canny enough to bypass those words with extreme regional variance in pronunciation.

Posted by kswygert at 04:43 PM | Comments (2) | TrackBack

August 31, 2004

Data Recognition Corp isn't recognizing its errors

Great headline on this testing oopsie - "PSSA report spaced out".

A line-spacing error that threw off the standardized test scores reported last week by the Pennsylvania Department of Education has school officials wondering if they can rely on the numbers. The error prompted the department to remove a school-by-school report on the Pennsylvania System of School Assessment test scores from its Web site Thursday, less than 24 hours after posting it.

"There were no problems with the scores themselves," said Stephanie Suran, deputy director of communication for the department. The spacing error put the wrong numbers into the wrong columns...Suran said the error occurred on the department of education's end while they were working with a large computer file from Data Recognition Corp., or DRC...

The data error made a considerable difference in the reported scores for Yough School District.

Superintendent Larry Nemec said that, based on previous newspaper reports, his district was listed as scoring 1,370 in math and 1,420 in reading for results from the 11th grade. The corrected scores listed the district at 1,310 for math and 1,300 for reading, much closer to the minimum proficiency requirements.

Nemec said the state should "work the bugs out" before releasing information.

"If they want to hold us accountable, maybe they ought to be a little bit more accountable on how they send out their scores," he said.

Granted, this isn't a true scoring error, but a reporting error. On the other hand, people are touchy, and anxious, about the whole deal. Releasing wrong scores via the web is a bad situation, no matter how easy the rectification.

Posted by kswygert at 04:32 PM | Comments (0) | TrackBack

July 23, 2004

AP exams disappearing in transit

It's not as exciting as Sandy Berger stuffing top-secret documents down his pants, but here's another tale of documents inadvertently going missing:

For the second time in two years, the company that administers advanced placement exams at Walter Johnson High School reported that a group of answer sheets are missing. As a result, halfway into their summer vacations, 44 students may be forced to retake exams they took back in May...

On July 9, [25] students received letters from ETS (Educational Testing Service) informing them that the multiple-choice portions of their AP Psychology tests were missing and unlikely to be found. The letter offered the students two options: Take a new version of the multiple-choice section at no charge or cancel the grade and receive a refund. According to the letter, the students have until Friday to respond.

Another 19 exams were reported missing last week for a total of 44.

Out of 3 million exams given, 44 isn't a huge number, but missing tests are like homicides. Ideally, there'd be none, and for each one, there's an anguished victim or set of survivors, not to mention a lot of news coverage. The school claims to have followed ETS's mailing instructions "to the letter." And other schools in the same county have suffered similarly within the past couple of years. No one seems happy with ETS's offer to refund money or assign students to a retake, but it's understandable why ETS isn't comfortable with projected scores (due to validity and reliability issues) or with just giving students credit for all the lost items (validity and reliability issues and the potential for abuse by unscrupulous schools).

Is there a way out of this standoff? Not likely, not with the mailing back and forth of 3 million packages every year. Even if the tests were all on computer, that kind of data can vanish, too.

Posted by kswygert at 02:14 PM | Comments (6) | TrackBack

July 19, 2004

The pitfalls of insufficient pretesting

Oregon's educators got caught between a budget crunch and a bad pretest sample, and have decided to toss the results of this year's 10th-grade math exam problem-solving section. Probably not a bad decision, given that 80% of the state's sophomores failed that portion:

The 33,000 students who failed the standardized test - about 80 percent of the state's sophomores - will have to retake the test as juniors. In previous years, about 50 percent of students have failed the test, which requires students to solve one complex math problem and show their work.

State testing officials say they didn't adequately evaluate the difficulty of this year's problem-solving questions before giving the test because they were trying to save money...Legislators had cut back their testing budget, so instead [of doing more pretesting] they used questions that had been tested on students in 2002, which they thought would be just a hair more difficult than questions used in previous years. But when posed to all students, the questions proved to be twice as hard as believed.

To avoid repeating that mistake, the state now will test potential problem-solving questions on a larger, more representative sample of students, and will be requiring schools to take part in the trials, said Cathy Brown, the state's math testing specialist.

I say they release the item, too, so we can see what it is that stumped so many of Oregon's sophomores.

Posted by kswygert at 09:10 AM | Comments (2) | TrackBack

July 13, 2004

A blunder, indeed

Several Devoted Readers sent me the news of this big "ooops!":

Mistakes in the scoring of an examination that 18 states use in licensing teachers caused more than 4,000 people who should have passed it to fail instead, the Educational Testing Service said yesterday. The errors may have prevented many from getting full-time jobs as teachers in the last year.

Robert A. Schaeffer, public education director of the National Center for Fair and Open Testing, which looks skeptically on standardized testing, said the grading errors were only the latest instance of quality-control problems in the industry at a time when testing was growing sharply.

Wow, they managed to wait until the second paragraph before bringing in the critics to claim that this is a harbinger of doom, rather than an isolated error. Given that it's the NYT, that's restraint. Also, notice how they don't mention here that any field that is "growing sharply" is almost always going to produce more errors than one that is stagnant.

Of course, it was the Washington Times that reported Schaeffer's ridiculous quote about how there's no guarantee that anyone in my field is "highly qualified." Funny, but all the psychometricians and test developers I know have Ph.D.s. The demand for testing does mean that we need more qualified people, but it's absurd to insist that any testing error is evidence in and of itself that the people involved were not qualified (all humans, even qualified ones, can make mistakes).

The dog food industry analogy? Rude, bogus, and a cheap shot. Name me one industry that involves the ingestion of any substance by any critter that is not more highly regulated than any psychological testing or assessment industry. For those testing critics too biased to get the picture, let me explain - tests don't kill people. And that's why we don't have something like the FDA overseeing us.

The errors occurred from January 2003 to April 2004. During that time, the test - the Praxis Principles of Learning and Teaching for Grades 7 to 12, called the Praxis P.L.T. 7-12 - was given eight times, to a total of about 40,000 people.

The testing service began notifying state education departments last Friday afternoon that many of those scored as failing had in fact passed, and started calling the candidates themselves on Saturday.

It said it would reimburse the candidates the $115 it cost each to take the test and would also pay them for materials they used to prepare. The cost of test reimbursement alone will be close to half a million dollars.

Tom Ewing, a spokesman for the Educational Testing Service, said that it had noticed lower scores than usual on two administrations of the test, but that "we thought there were valid explanations for why the scores were lower."

"But when we investigated further," Mr. Ewing said, "we discovered that the short-essay questions were being graded more stringently than normal"...

Besides calling state officials and test takers, the testing service has a toll-free phone line (800-205-2626) for more information. A recording at that number yesterday said that the company was "very sorry that this has occurred" and that it was "committed to addressing any concerns this issue may raise for you."

ETS blundered. ETS found the problem. ETS admitted the error. ETS is trying to rectify the mistake, in both the financial and career-impact domains. In my mind, these are signs of an industry that is functioning in a normal, healthy fashion. Glad to see the article close with a quote that is complimentary to ETS. And note, too, that the error was in the "performance assessment" portion of the exam; testing critics often call for such performance-based items due to an irrational hatred of the more reliable and cheaper multiple-choice items.

I see enough complaints about Praxis on the web, though, that I expect this to bring out many, many responses of how unfair the test is, and how this error must prove...something.

Posted by kswygert at 03:22 PM | Comments (12) | TrackBack

June 15, 2004

Guess there wasn't much of a black market for SATs

Students in New York breathe a sigh of relief, as their missing SATs turn up in the possession of a man who inadvertently took them home with him.

The students, all from the over-achieving Ardsley public schools, were already freaking about having to retake:

Officials of the Ardsley public schools broke the bad news to 123 high school students on Thursday that the SAT exams they took on Saturday had disappeared from the district office. The answer sheets from the exam were in an eight-pound Federal Express envelope on a counter, awaiting pickup on Monday morning. But when the FedEx employee arrived at the office, no one could find it, said Richard E. Maurer, the superintendent of schools...

"Right now I'm very frustrated because I don't know how it left the building," Dr. Maurer said. "I'm very upset and I've apologized to the students and the parents." School officials contacted the Ardsley Police Department about the missing tests, but Dr. Maurer said he did not suspect any mischief...

Between class periods on Thursday, guidance counselors quietly summoned the students whose answer sheets had been lost to tell them about the disappearance. Among them was Derek Weingarten, a 17-year-old junior, who took the SAT I on Saturday, after having just taken it in May, scoring a respectable 1,230.

Mr. Weingarten said he wanted to raise his overall score on the June 5 exam so that he could have a carefree senior year. "I was hoping for a 1,300, and when I came out of the exam I felt I had reached my goal, if not more," he said. "For the test to be gone now is just very disappointing."

Guess what? Mr. Weingarten may still have that carefree senior year, thanks to the fact that the absentminded SAT "thief" did the right thing and returned the still-sealed exams:

The SAT exams that mysteriously disappeared from the public schools here earlier this week were found on Friday morning when a man called the schools superintendent to say, a bit sheepishly, that he had inadvertently taken them home...

In a telephone interview, Dr. Maurer said the package was completely intact. He would not identify the man who mistakenly took the package, other than to say that he was not a resident of Ardsley. As the mistake was explained by Dr. Maurer, the man had gone to the school district offices on Monday to pick up some other materials, had put down his own papers on a secretary's desk and grabbed the eight-pound SAT package, which was awaiting pickup by FedEx, when he went to retrieve his papers. When the man learned from news reports that the exams were missing, he discovered that he had them.

The guy didn't notice he was carrying an extra eight pounds? Sheesh.

At 10:45 a.m., the high school principal, Dr. James Haubner, made an announcement over the public address system that the SAT's had been found - a piece of news that elicited loud cheers among students...

Tom Ewing, a spokesman for the Educational Testing Service in Lawrenceville, N.J., said that because of the prompt return of the package, there "shouldn't be any delay whatsoever" in grading the exams and reporting the scores.

Posted by kswygert at 10:08 AM | Comments (1) | TrackBack

May 24, 2004

A real tale of testing woes

This puts our high-stakes tests - and occasional testing errors - in perspective:

An Indian teenager killed herself after receiving a mobile phone text message saying she had failed her school leaving exams, although she had actually passed, a report said today. The 17-year-old girl hanged herself yesterday morning after getting the SMS giving her the wrong information, the Hindustan Times reported...

Results of the school-leaving exams of over 250,000 students began being announced yesterday, with cellphone companies for a small fee offering to provide results via SMS to those students giving their roll numbers.

It was uncertain whether the company was at fault for sending the incorrect message or whether the girl had made a mistake in typing down her roll number, the report said.

Pressure from parents and peers on students to score high marks in the exams is immense and each year dozens across the country kill themselves when they find they have failed.

The Board of Education has finally set up an ecounseling hotline. Overdue, I'd say.

Posted by kswygert at 09:15 PM | Comments (2) | TrackBack

May 21, 2004

More blunders for NYC

After all the ruckus over the original NYC third-grade reading exams, you'd have expected quality control on the makeup exam to be especially tight.

Oops.

For the second time in as many months, city educators botched the standardized reading exam; this time distributing a test where the questions failed to match the answer key.

The blunder comes on the heels of last week’s discovery that thousands of students in grades three, five, six and seven unknowingly studied for the original English Language Arts exam utilizing last year’s exam. Department of Education officials said a 20-question passage from the 2003 test was repeated this year, providing certain students with an unfair advantage.

Those students, including about 85 third graders from PS 174 in Rego Park, were told they had to either retake the test or accept a grade scored without the 20 questions—an option approximately 650 students accepted.

A total of 2,400 students took the makeup exam last Wednesday, including 1,300 third graders, whose promotions rest upon a passing grade. However, moments into the exam, instructors noticed that questions did not correspond with the answer booklets...

Despite the confusion, administrators continued with the test, instructing students to circle the answers directly on the test booklet...education officials said they do not expect to invalidate the scores.

Harcourt Assessment, which isn't having the best year, quality-wise, is taking responsibility for the errors. The critics are now screaming for the test results to be invalidated and for students to be assessed only on classroom performance. I don't blame the critics for being upset, but classroom grades aren't exactly standardized and unbiased (nor can they be assumed to be error-free), and grades aren't a useful measure for putting every NYC third-grader on the same reading continuum.

I'm just really, really happy that I don't work for either Harcourt or the NY DOE right now.

Posted by kswygert at 01:57 PM | Comments (4) | TrackBack

May 11, 2004

Booboos in Honolulu

Harcourt Assessment, Inc., is in the news again - and not in a good way:

Specialists at the [Hawai'i] Department of Education are combing through a battery of standardized tests looking for more errors after test coordinators, teachers and students spotted numerous mistakes this spring.

The errors raise questions about the high-stakes tests, which are taken by thousands of Hawai'i students and used to determine whether schools meet annual goals under the federal No Child Left Behind law, with schools that fall short facing consequences.

The state has documented errors in the instructions, samples and the actual tests. After the review is complete, the DOE may either throw out incorrect test questions, give students credit or partial credit for some questions or, as a last resort, have students retake portions of the tests.

The tests were prepared by Harcourt Assessment Inc., a San Antonio, Texas-based company that has a five-year, $20 million contract with the DOE...

Harcourt's president apologized to state officials in Oklahoma last month after errors were found on sample questions on student tests. In the past several years, according to press reports, Harcourt has also been involved in test errors in a handful of other states, including Nevada, where it paid a $425,000 fine after mistakes led to failing scores for more than 700 Nevada high school students.

Posted by kswygert at 09:48 AM | Comments (0) | TrackBack

April 26, 2004

Testing worries in Oregon

Devoted Reader John L. sent me this tale of "testing goofiness in Oregon." The state's 10th-graders are turfing on the math exam, and the test design might be to blame:

State officials are racing to answer why: Was the test too hard, or did schools fail to teach this class to write clear, mathematically sound answers to elaborate math problems?

Last year, half the state's sophomores passed the problem-solving test. So far this year, 82 percent have failed. Another 20,000 sophomores will take a different problem-solving test starting Monday through mid-May...

The state has given the problem-solving test since the early 1990s, part of a decision to go beyond multiple-choice questions when measuring math skills. Students choose one of three multistep math problems, then write an answer that typically runs a page or two. They must show how they solved the problem and how they checked their work; communication counts as much as the right answer...

Every version of the test gives students a choice of a probability question, a geometry question and an algebra question. This year's winter test gave students the chance to prove themselves figuring the odds in a dice game, the dimensions of a hand-made quilt or the speed and mileage of a daughter and her slow-driving dad. By comparing results from this winter's test with results from a year earlier, state officials have determined there wasn't one particularly difficult question on this year's test, they say. All three questions tripped up more students than last year.

Thanks to this, some in Oregon have become highly critical of performance assessment items:

Rob Kremer, a longtime critic of Oregon's test system who ran unsuccessfully for state schools superintendent in 2002, said the wild swing in results proves that the state-developed test is unreliable.

"Faddish assessments such as Oregon's math problem-solving tests are not suited for use as large-scale, high-stakes tests," he said.

I don't know if I'd call problem-solving tests in math "faddish," and such items are not automatically unsuitable for high-stakes testing. When the state's employers claim they need more citizens with solid problem-solving skills, they're right, and one way to test those skills is with this type of item.

But such items are more difficult to develop properly, and they may very well test a narrow area of the domain, making it hard to generalize the results to the overall math construct. What's more, that one item counts the same as the multiple-choice exam, so if none of the three options are appealing, an examinee is at a real disadvantage. There's research to suggest that examinees, when given a choice of topics, don't always do a good job of knowing what they're good at.

My reader wanted to know how the following could be possible:

It's fairly easy for test makers to create a new multiple-choice test that is as difficult as the previous year's test, said Edward Haertel, a Stanford professor who is past president of the National Council on Measurement in Education. But when creating tests that require long answers, it is harder to match the difficulty level from year to year...

Haertel said a statistical adjustment, such as the one Oregon testing officials are considering, may be the best step for the state to take.

Although I don't know for sure what Haertel is suggesting, one possibility is to assume the distribution of examinees this year is similar to last year's, and essentially shift the score scale up to match. That's similar to what is done on large-scale standardized tests like the LSAT, which is why a certain number right out of 101 items can translate to a different scaled score from form to form. Obviously, though, it may be unsafe to assume the student ability distribution is the same from year to year; if the quality of teaching declined dramatically, it won't be.
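To make that first possibility concrete, here is a minimal sketch of one textbook way to "shift the score scale": linear (mean and standard deviation) equating. It is an illustration only, not what Oregon or any testing program necessarily does; the function name and the scores are made up, and it leans entirely on the assumption that both cohorts have the same ability distribution.

```python
from statistics import mean, pstdev

def linear_equate(this_year, last_year):
    """Rescale this year's scores so their mean and spread match last year's.

    Assumes (hypothetically) that the two cohorts are equally able, so any
    difference in the score distributions reflects test difficulty alone.
    """
    m_new, s_new = mean(this_year), pstdev(this_year)
    m_old, s_old = mean(last_year), pstdev(last_year)
    slope = s_old / s_new  # requires some spread in this year's scores
    return [m_old + slope * (x - m_new) for x in this_year]

# Made-up scores on a 1-6 rubric, for illustration only.
last_year = [2, 3, 4, 4, 5, 6]
this_year = [1, 2, 2, 3, 3, 4]
adjusted = linear_equate(this_year, last_year)
```

The adjustment keeps every student's rank order and only moves the scale. If the cohorts really do differ in ability, it hides that difference instead of correcting for test difficulty.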

A second possibility is to "borrow information," and examine what the historical correlation is between the MCQ's and the performance-assessment items, and use that to adjust scores. If, in the past, students who did really well on the MCQ's also did well on problem-solving, then you'd expect the same to be true now. If it's not, the PA score can be adjusted. However, oftentimes MCQ's and PA items do not correlate highly (if they did, they could be measuring the same thing, and both types might not be needed).
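A rough sketch of the "borrow information" idea might look like the following. Everything here is assumed for illustration: the helper names, the made-up scores, and the choice of a simple least-squares line fitted to historical data, with this year's performance-assessment (PA) scores shifted by the average gap between what the line predicts and what was observed.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def adjust_pa(hist_mcq, hist_pa, mcq_now, pa_now):
    """Shift this year's PA scores by the mean gap from the historical line."""
    slope, intercept = fit_line(hist_mcq, hist_pa)
    expected = [intercept + slope * m for m in mcq_now]
    shift = mean(expected) - mean(pa_now)
    return [p + shift for p in pa_now]

# Made-up data: historically PA = 0.5 * MCQ + 1; this year PA runs 2 points low.
hist_mcq = [10, 20, 30, 40]
hist_pa = [6, 11, 16, 21]
mcq_now = [10, 20, 30]
pa_now = [4, 9, 14]
adjusted_pa = adjust_pa(hist_mcq, hist_pa, mcq_now, pa_now)
```

With a weak historical correlation, the fitted line predicts little, and the shift becomes correspondingly hard to defend.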

A third option at this point is to re-weight the test sections, giving more weight to the more reliable part, the MCQ's. And then there's the "scorched-earth" option:

The U.S. Department of Education would have to approve any move by the state to cancel the results, which would spare schools the consequences of the poor scores, said Ron Tomalis, counselor to the U.S. secretary of education.

Posted by kswygert at 03:35 PM | Comments (0) | TrackBack

April 15, 2004

Not OK in Oklahoma

An unacceptable error: The teacher's manuals for the Oklahoma state standardized exams contain several wrong answers to sample questions.

The testing company, Harcourt Educational Measurement, realized April 8 that some of the answers to sample questions were wrong. The company then notified schools by e-mail and fax after 5 p.m. Friday.

On Monday morning -- just before tests started -- many schools were scrambling to replace pages in the test administration manuals with new pages that had the correct answers to sample questions.

"I received numerous calls from teachers who questioned the credibility of the actual tests based on the number of incorrect answers in the sample questions," said Kathy Dodd, director of student achievement for Union Public Schools.

As well they should.

Posted by kswygert at 03:52 PM | Comments (3) | TrackBack

April 07, 2004

Update on the testing error study

Last June, I commented on the National Board for Educational Testing and Public Policy study on testing errors. I'm delighted to see that one of the study's authors, Kathy Rhoades, commented on that post, and I'm reprinting her comment here in its entirety:

Nice web site.

I'm one of the authors of the National Board study and wanted to correct a few of your observations:

1. Regarding the 1980 ETS error, when ETS informs customers that it loses tests, it is their error. In fact, this type of error has occurred often for ETS, and is likely the result of poor test delivery practices -- in other words, it is up to ETS to design a test-delivery system that ensures tests will not be lost.

2. Similary, the 2001 error is a security error -- it is up to the contractor or test administrator to ensure that tests are not stolen. Test security cannot be taken lightly and errors resulting from poor test security are very serious.

I agree entirely that companies like ETS should do everything possible to ensure test security and timely test delivery. Given the determination of some test-takers, though, I still question whether stolen tests should always count as errors. I am aware of one late-1990's LSAT test booklet that was stolen after the exam at knifepoint from the proctor. Other than arming proctors, what could LSAC have put in place to prevent that, and why should it be considered an error on LSAC's part, especially considering that those involved were arrested and prosecuted?

3. Miskeyed items are, arguably, among the most serious errors. Since problems such as these can be spotted easily in the item statistics. If they are not spotted by the contractor, then it is an indication that the contractor is not conducting even basic item-level analyses from which measures of internal test consistency are also established.

A very good point. I had argued that one or two items being miskeyed should not be considered a large error, but Ms. Rhoades is right to say that any such error, no matter how small, should, if it makes it onto a live exam, raise suspicion that the basic item-level analysis process is lacking.

4. The CTB TerraNova error calculation occurred separately for each of the states -- and each state was notified and had their results corrected at separate times.

Makes sense to me.

As for your suggestion regarding including consideration of the seriousness of errors, I think it is a good one. Was considering updating the report with that information alongside new errors.

I'm delighted to hear it, and eager to see how the seriousness of errors will be quantified or categorized in the update. I'm also delighted in general to see that the author of this report discovered Number 2 Pencil, hopefully not through an email from a colleague which said, "Look at what this idiot had to say about your study."

Posted by kswygert at 04:00 PM | Comments (0) | TrackBack

March 10, 2004

Testing results "too good to be true"?

Hope this isn't injurious to their self-esteem; Minnesota's youngsters aren't quite as good as they thought:

Because of "an error in judgment," the Department of Education inflated the percentage of elementary school students who passed the 2003 Minnesota Comprehensive Assessments in math and reading, Education Commissioner Cheri Pierson Yecke said Monday.

The percentage of students labeled "proficient" or better is really 3 percent to 6 percent lower, meaning that dozens more Minnesota schools probably would have been deemed underperforming under the federal No Child Left Behind law, state officials said.

Although Yecke said she was first alerted to the mistake in November and immediately began the process to correct it, the error apparently was brought to the attention of the state's testing director months earlier -- and before the state released the inaccurate results.

Mmm, not good. Errors can happen to anyone; psychometricians are people, too. That doesn't justify the deliberate release of bad data, although the issue here isn't with the tests themselves, but with how the standards were set:

There is no problem with the actual tests...[and] the students' raw scores are accurate. But the "cut score" -- the number of questions students must get right to pass -- was set too low. The reason: A committee of teachers formed last May to try to align the test results with proposed new standards in reading and math lowered the bar too far, Yecke and Olson said.

That committee never should have been called, Yecke said, because the tests didn't cover material in the new standards. When committee members saw how low the scores would be under the new standards, Olson said, they lowered the "passing" definition. The result was dramatically higher proficiency marks in 2003 than what kids had scored in 2002.

Okay, that's a pretty bad mistake. Why on earth would a standard-setting group have been assembled to judge test scores based on standards that weren't congruent with the tests? Yecke is right to say that group should never have been assembled, and their standard should not have been the one used.
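To see why the cut score matters so much even when the raw scores are accurate, consider a tiny sketch with made-up numbers. The scores, the two cut scores, and the function name are all hypothetical, not Minnesota's.

```python
def percent_proficient(raw_scores, cut_score):
    """Percentage of students whose raw score meets or exceeds the cut score."""
    passed = sum(1 for s in raw_scores if s >= cut_score)
    return 100 * passed / len(raw_scores)

# Made-up raw scores for ten students; the scores never change below.
raw_scores = [18, 22, 25, 27, 28, 29, 31, 33, 36, 40]
old_cut, lowered_cut = 29, 27

rate_old = percent_proficient(raw_scores, old_cut)          # 50.0
rate_lowered = percent_proficient(raw_scores, lowered_cut)  # 70.0
```

The same ten answer sheets yield a 50 percent or a 70 percent proficiency rate depending only on where the bar is set.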

The real scores, arrived at by using the previous standards, showed improvement -- just not the knock-your-socks-off type of improvement. For instance, the number of third-graders scoring proficient in math rose from 65 percent to 72 percent -- but not to 75 percent as was reported last July.

Yecke said that as many as 50 additional schools would have joined the list of 143 schools that were deemed "not making adequate yearly progress" if the correct scores were used. The mistake didn't put any schools on the list, Yecke, Olson and Davison said.

So the schools that would have otherwise been deemed inadequate receive a "Get Off the List Free" card this year. And the revised lower proficiency rates will be used for comparison next year, which means that any future improvements will help schools even more. Also, a completely new test will be developed to cover the new standards, so that the old test will not have to be aligned to the new standards.

Minnesota's DOE is reacting to this appropriately, but this is a pretty big horse to let out of the barn. The standard-setting protocol should have been one of the more rigidly-defined and QC'ed parts of the process, and the news that the now-former state testing director Reg Allen released the scores after the problem was discovered should give further pause:

Although Yecke said she was first alerted to the mistake in November and immediately began the process to correct it, the error apparently was brought to the attention of the state's testing director months earlier -- and before the state released the inaccurate results.

Reg Allen, who resigned from the Education Department two weeks ago, made the decision to release the inflated scores despite having the accurate results in hand, said Mark Davison, head of the University of Minnesota's Office of Educational Accountability. But, Davison said, he couldn't persuade Allen to release the accurate results.

"He said he was trying to adjust for a transition to the new reading and math standards," Davison said of Allen's argument at the time...

"We had a disagreement over how we ought to do this," Davison said. "I mean, I viewed it as his call. After he made the call, he put together a document describing what the process had been. I signed off on that."

He added: "The commissioner also saw that. Whether the commissioner actually realized whether this process would have yielded scores that different from prior years? Probably not."

The press release is here. The language is exquisitely euphemistic - the tests are fine, but "changes do need to be made in how the scores are interpreted." I'll say.

Posted by kswygert at 11:36 AM | Comments (1) | TrackBack

December 17, 2003

ETS errs in reporting NJ scores

Scores for New Jersey's third- and fourth-graders on the state's standardized exams were due in September, but are now expected in January. The company producing the tests, ETS, apologizes for the delay:

Princeton-based Educational Testing Service, best known for the SATs and other national exams, was initially due to file the results and scoring analysis with the state in September. However, computer problems and other issues have delayed the work, and the data is now expected to be turned in by next month.

"We apologize for these delays and we are working nonstop with districts and schools to correct data to ensure that educators have reports that accurately reflect their student populations," ETS President and CEO Kurt Landgraf said Tuesday. "We're taking steps to avoid such occurrences next year, but there is no excuse for this current situation."

State Education Commissioner William Librera said he was "confident" that ETS would remedy the problems.

The company, which is working under a four-year, $35 million contract to develop and score the tests, was criticized earlier this year after some districts received a rough run of students and their scores.

Several errors were found in demographic and other student background data, and some schools said that caused them to receive warnings that they were not meeting federal standards under the No Child Left Behind Act.

Ooops. The New Jersey Star-Ledger has more:

State officials and executives of the Educational Testing Service yesterday acknowledged a rash of errors and time delays involved with the NJ ASK exams, leaving districts without results for students beginning to prepare for the next tests.

In one case, officials said the loss of dozens of test booklets led to the state's mislabeling of two elementary schools as "underperforming."

The sum of the problems yesterday brought an extraordinary public apology from ETS President Kurt Landgraf, as the Lawrence-based firm began returning the last of the scores. The tests in reading and math were given to 210,000 third- and fourth-graders last May.

Part of the issue was the time crunch:

The state was under time pressures to get the tests up and running by spring and chose Princeton-based ETS over several other nationally recognized firms, even at a far higher cost.

The administration and scoring of the tests were without incident, officials said, but the problems began as ETS returned scores for the fourth-grade test this fall and discovered errant codes for schools or students in nearly 150 districts.

Posted by kswygert at 04:34 PM | Comments (1) | TrackBack

October 15, 2003

More MEAP madness

The delay in getting those MEAP scores out, and the missing results, make the test appear that much more expendable to Michigan educators. The test now costs $15 million; some suggest replacing the test would cost a third as much.

Those educators who are looking at MEAP scores are worried about the score gap between white students and almost everyone else. Refreshingly, none of the educators quoted cited "test bias" as the reason for the gap, nor did they claim that minority kids were doomed to do poorly on the MEAP. Instead, the teachers and administrators all seem determined to root out the true causes of the score gap - lowered goals, deprived educational backgrounds - and correct those negative influences. A few schools are already succeeding in this:

The MEAP test results were not all bad news. Minorities outscored whites in some schools.

For example, black fourth-graders did better than white classmates at Port Huron's Cleveland Elementary School in the math and language arts tests. In the Anchor Bay district, black students did better overall than white students in math.

Posted by kswygert at 09:21 AM | Comments (0) | TrackBack

October 03, 2003

Missing MEAPs mysteriously materialize

The long-awaited MEAP scores are finally available in Michigan:

After months of delays getting state standardized test scores, state education officials will begin to start putting together schools' yearly progress data and long-anticipated report cards...

The education department needed the MEAP data to calculate a school's yearly progress, required by federal education law, and finish the report cards.

A number of factors contributed to the score delay, including duplicate barcodes and schools that were late in sending in their tests to be scored. A state Senate committee took several hours of testimony from the contractors and state departments to figure out the cause of the problem.

Some schools aren't happy, because some scores are still missing:

Portage Public Schools officials are upset that despite their pleas, they've been told long-delayed state standardized-test results are being released today even though some of the district's scores are incomplete because hundreds of tests are missing...

Portage officials said they made the state aware of mistakes, including the hundreds of missing scores, but they say that as far as they know the mistakes haven't been fixed...

In one example in Portage, 201 students took the seventh-grade reading and English language-arts test at North Middle School, but the state has the results for only one test. Consequently, the reported average is based on a single student's performance.

"It's based on one child, based on one test. We're saying that's inappropriate," said Tom Vance, Portage's communications director.

Yes, indeedy, it's inappropriate. I mean, the sample sizes should be considered when any statistics are reported, or interpreted - but I certainly hope the school can highlight this, and will not get scored based on such a small sample size.
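One common safeguard is to suppress a reported average when too few scores are behind it. The sketch below is hypothetical: the minimum of 10 scores and the scale score of 512 are made up, while the 201 students tested and the single result come from the Portage example above.

```python
from statistics import mean

def summarize(scores, n_tested, min_n=10):
    """Report a mean only when at least min_n scores are available.

    min_n is a made-up reporting threshold, not any state's actual rule.
    """
    n = len(scores)
    return {
        "n": n,
        "coverage": n / n_tested,  # share of tested students with a result
        "mean": mean(scores) if n >= min_n else None,
    }

# Portage's North Middle School: 201 students tested, one result on file.
portage = summarize([512], n_tested=201)
```

Here `portage["mean"]` is `None`, and the coverage figure (about half a percent) makes it obvious why.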

Posted by kswygert at 02:09 PM | Comments (0)

September 22, 2003

Mistakes were made - but not just by us

The database company that was involved in the missing MEAP problem a while back admits culpability - but says that it wasn't the only one at fault:

The company that set up a student database for standardized test scores said Wednesday it made mistakes, but it isn't the only one responsible for the delayed test scores.

The state Senate Education Committee took testimony Wednesday as part its ongoing effort to determine why Michigan Educational Assessment Program test scores were delayed by several months. The MEAP scores didn't go out to schools until late August, and many educators were upset they didn't have the results sooner...

Some state officials have pointed to Enterprises Computing Services Inc., which developed the database that links an individual student with his or her test information, as the cause of the delay...

"It is not right and it is not appropriate to hang the whole thing on us," Hari Iyer, chief executive of Woodstock, Ga.-based Enterprises, told the committee. "I'll admit we made a mistake. We corrected it at our expense," Iyer said...

Enterprises said it expected to receive test scores in March, but didn't until June 12. The tests were scored by a different company, Durham, N.C.-based Measurement Inc. Officials from Measurement Inc. said they didn't receive the last batch of tests from schools until late April. Iyer and Kevin Ireland, national sales manager for Enterprises, also suggested that the state's MEAP office is understaffed.

So the schools were late in sending the tests, the scoring company was late in sending the scores, and the database company had a glitch in their program. A cavalcade of errors, it seems.

Posted by kswygert at 11:15 AM | Comments (0)

Is something screwy in East St. Louie?

Something seems to be very, very wrong with the standardized test scores for children in East St. Louis (Illinois), and an audit is being considered to further examine scores which are going up and down like a rollercoaster ride:

Something is wrong, possibly very wrong, with standardized test scores for East St. Louis school children. That's the conclusion of Richard Mark, chairman of District 189's state-appointed oversight panel, after nearly a decade of watching test scores climb and fall like a rollercoaster, often in the same school buildings and only a few years apart.

Case in point: Brown Elementary School was named a federal Blue Ribbon School winner. That's because in the 2001-02 school year, 82 percent of its third-graders met benchmarks for reading and math on the Illinois Standards Achievement Test. Yet, preliminary figures for the 2002-03 test scores show a stunning 32 percent drop in test scores for Brown's third-graders.

And at seven other District 189 schools, variances of 12 percentage points or more were found over a four-year period, ending with the 2001-02 school year.

Emphasis mine. A similar problem with high variability in test scores uncovered a cheating scandal in Chicago recently, and that's what's feared here in East St. Louis. The words "red flag" come to mind for everyone who's examining these data.

Why does this variability suggest cheating? According to one study, classrooms that cheat with teacher participation will show unusually large score gains for one year, but these gains do not continue - scores level out or even drop the subsequent year. The key is in the size of the gain - individual students may show large fluctuations, but overall mean scores shouldn't show huge amounts of fluctuation year to year, especially fluctuations that change direction.
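
To make the idea concrete, here is a minimal Python sketch of a screening rule along these lines: flag any year in which a school's mean result makes a large move that is immediately followed by a large move in the opposite direction. The function name, the 12-point threshold (borrowed loosely from the 12-percentage-point variances quoted above), and the sample numbers are all illustrative assumptions. This is not the method used in the study or by the oversight panel.

```python
def flag_reversals(pct_meeting, threshold=12.0):
    """Return the indices of years where a large swing is immediately
    followed by a large swing in the opposite direction.

    pct_meeting: school-level percent meeting standards, one value per year.
    threshold: minimum swing size in percentage points (an assumed cutoff,
               not an official one).
    """
    # Year-over-year changes: changes[i] is year i+1 minus year i.
    changes = [b - a for a, b in zip(pct_meeting, pct_meeting[1:])]
    flagged = []
    for i in range(len(changes) - 1):
        first, second = changes[i], changes[i + 1]
        both_large = abs(first) >= threshold and abs(second) >= threshold
        reverses = first * second < 0
        if both_large and reverses:
            flagged.append(i + 1)  # the year sitting at the peak or trough
    return flagged


# Hypothetical school: a spike in the third year that vanishes the next year.
spiky = [48.0, 50.0, 82.0, 50.0]
# Hypothetical school: steady improvement with no reversal.
steady = [48.0, 55.0, 63.0, 70.0]

print(flag_reversals(spiky))   # [2]
print(flag_reversals(steady))  # []
```

A flag from a rule like this is only a reason to look closer. As noted below, a genuinely new program or a low-stakes test can produce the same pattern.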

Of course, a school that implements a radically-new process could show a large gain one year, with gains leveling off after that. Low- or no-stakes tests can also be more unstable, because kids may not try their best. And those who want to examine the scores face charges of racism, despite the fact that cheaters, whether they be teachers or students or both, aren't doing kids of any race a favor.

In a related article, a buddy of mine, testing and test cheating expert Professor Greg Cizek, is pontificating on the subject of "test tampering" - which seems to be on the rise:

Some educators call teacher cheating the inevitable result of the nation's test score obsession. And experts agree that test tampering - and suspicion of it - is on the rise.

"It's picked up," said Gregory J. Cizek, a nationally known testing and cheating expert at the University of North Carolina at Chapel Hill. "I am responding all the time to school districts who say, 'We have a problem with this teacher. Can you do an analysis?' " said Cizek, author of Cheating on Tests: How to Do It, Detect It, and Prevent It.

A great book, by the way - and reasonably priced, too! (I figure the more plugs I give his book, the more likely he'll pay for dinner next time he's in town)

But seriously folks, in an age when real estate agents are allegedly showing school report cards to prospective home buyers in an attempt to cash in on good test scores, it's not surprising that some folks are trying to take the easy road:

The most noted test scandals have occurred in New York and Texas, but smaller-scale allegations have erupted in recent years from California to Chicago to Connecticut.

In December 1999, widespread cheating allegations in New York City public schools implicated scores of teachers said to have provided answers on reading and math tests.

Similar, but more isolated, test-tampering cases in Houston, Austin, Dallas and several other Texas districts helped lead to the formation of that state's Public Education Integrity Task Force in 1999.

Dr. Cizek's theory is that teachers who oppose testing may have decided that test-tampering is "justifiable civil disobedience." I can understand teachers not wanting their efforts to be judged wholly by test scores - but any teacher who helps children cheat is not doing them a favor, and the message being sent is not that the government is wrong, or that tests are wrong, but that children should be helped to cheat in order to protect the teacher's reputation and the school's standing. And that's very wrong.

Posted by kswygert at 10:07 AM | Comments (4)

September 02, 2003

Too much demand for testing

The NYT has the scoop on the spate of recent testing blunders, and wonders if the rising demand for tests should be met with a rising demand for accuracy and accountability from test developers:

Testing is the buzzword of education these days, with state legislatures and the federal government demanding more of it than ever before. Everything from high school graduation to eligibility for transfers, tutoring and federal aid is tied to the results. But educators and some testing industry experts are warning that the new demands are pushing the limits of the testing industry's ability to provide fair and accurate tests.

When President Bush signed the No Child Left Behind Act in January 2002, calling for increased annual testing in grades three through eight by the 2005-06 school year, the testing industry — dominated by a handful of companies — had just weathered the three most error-plagued years in its history. Researchers at Boston College recently found that last year was hardly better, with at least 18 problems reported, almost matching the total reported between 1976 and 1996.

This surge in testing errors is no joke, but it's also no surprise to those of us who have watched the industry expand at a much faster rate than psychometricians can be trained and standards can be perfected. Testing companies are notoriously close-mouthed about what goes on inside their doors, but part of the problem is that they are expected to provide tests that are "good, fast, and cheap" - and in practice they've had to "pick any two" of those qualities to get the job done. Errors often get caught when test forms are released, but that practice is prohibitively expensive for many states.

Some of the more recent testing criticisms lump big errors in with little ones, as I noted a while back. But even testing defenders concede that the haste in which they are asked to produce good material is the main cause of errors:

Several testing company executives said that the Boston College study reflected an "antitesting agenda" and that it did not distinguish between serious errors and trivial ones. But they agreed with the researchers that haste was the most common contributor to errors. Neal Kingston, the chief operating officer at Measured Progress, said his company had occasionally been asked to devise and deliver new statewide tests in three months — an utterly impossible task, he said.

Is industry regulation the answer?

Concern about this rising tide of testing errors is reviving the long-dormant issue of industry regulation. "We regulate our pet food, and we don't regulate the tests which are making major decisions about the lives of our kids," said Monty Neill, executive director of FairTest, an advocacy group in Boston.

Others have called for an independent oversight panel that could monitor for quality in testing. Professor Madaus, the co-author of the Boston College study, said he preferred that approach to letting the federal government regulate the industry because he feared that politics would taint the professionalism of test evaluation.

Even some testing executives see merit in at least compiling a national database to track testing errors. "Researchers have to hunt and peck where they can to find the mistakes and compile them," said Dr. Kingston of Measured Progress. "A lot of mistakes, quite possibly, don't even get caught."

An independent oversight panel, free from all political bias? A lovely thought, but does such a collection of psychometricians and educators exist?

Posted by kswygert at 10:55 AM | Comments (2)

August 29, 2003

Missing meaps makes michigan mad

Oopsie. The Michigan Educational Assessment Program (MEAP) scores for students in Grand Rapids, Michigan, who took the test early this year have been lost. More than 1,000 - or one-seventh - of the district's scores have gone AWOL:

The missing Michigan Educational Assessment Program scores appear to be limited to those attained by students tested in January and February at elementary and middle schools...

The state Department of Treasury administers the MEAP program and contracts out the testing to Durham, N.C.-based Measurement Inc. Spokesman Terry Stanton said Treasury staff is working with the company to figure out what went wrong.

The state needs the scores for several reasons, including score grades required under NCLB and determining qualification for Merit Award Scholarships.

"This is getting to the point of ridiculousness that it's almost funny," said Grand Rapids school board Vice President Amy McGlynn. "After all these months, the scores have pretty much lost their meaning anyway. We needed these scores months ago if we were going to do anything productive with them."

Hey, glad to see someone has a sense of humor about the whole thing. But one state representative, Michael Sak (D-Grand Rapids), isn't laughing. He's asked the state to take responsibility for the tests and refuse to work with Measurement, Inc. any more.

Posted by kswygert at 11:17 AM | Comments (0)

August 22, 2003

Harcourt's woes

Looks like the contractor hired by the state of Nevada to score standardized tests isn't too reliable:

For the second consecutive year, the private contractor hired by the state to calculate the scores on Nevada students' standardized tests didn't make the grade.

In 2002, miscalculations by Harcourt Educational Measurement led 736 Nevada students -- 550 of them from Clark County -- who had actually passed the mandated high school proficiency exam to believe they had failed the test...

This time around, Harcourt overstated the scores of thousands of third- and fifth-graders statewide on the skills test required by the federal No Child Left Behind Act. As a consequence, as many as 21,000 youngsters may receive scores that were calculated and reported inaccurately...

"I am very upset and very disappointed," state Board of Education member John Hawk said. Mr. Hawk suggested Harcourt would face additional fines ... or, perhaps, the company's $13.2 million contracts to score elementary and high school tests might finally be terminated.

Harcourt isn't some fly-by-night company experiencing startup problems. They're one of the largest for-profit testing companies in the nation. But this isn't the first state in which they've had problems. This NYT article from 2001 describes Californian fiascos that stretch back to 1998:

Case in point: California. On Oct. 9, 1997, Gov. Pete Wilson signed into law a bill that gave state education officials five weeks to choose and adopt a statewide achievement test, called the Standardized Testing and Reporting program. The law's "unrealistic" deadlines, state auditors said later, contributed to the numerous quality control problems that plagued the test contractor, Harcourt Educational Measurement, for the next two years...

Some test materials were delivered so late that students could not take the tests on schedule. It got worse. Pages in test booklets were duplicated, missing or out of order. One district's test booklets, more than two tons of paper, were dumped on the sidewalk outside the district offices at 5 p.m. on a Friday — in the rain. Test administrators were not adequately trained...

In 1998, nearly 700 of the state's 8,500 schools got inaccurate test results, and more than 750,000 students were not included in the statewide analysis of the test results. Then, in 1999, Harcourt made a mistake entering demographic data into its computer. The resulting scores made it appear that students with a limited command of English were performing better in English than they actually were, a politically charged statistic in a state that had voted a year earlier to eliminate bilingual education in favor of a one-year intensive class in English...

If Harcourt is one of the largest companies, testing the most students, then by the law of averages, it wouldn't be surprising for them to have a lot of errors. The issue is that, given two straight years' worth of problems in Nevada, it doesn't seem like Harcourt has learned from its earlier mistakes, and it doesn't seem like they have a QC process in place to prevent more errors from happening.

Posted by kswygert at 11:23 AM | Comments (2)

August 11, 2003

Situation normal - all TAKSed up - UPDATE

The debate over the misleading TAKS math item continues. In my previous post on the topic, I cited Bas Braams's explanation for why the item did seem to have two correct answers, and I agreed with him that this item should not have been approved for administration.

Then an email appeared in my inbox (edited for brevity):

Dear Ms. Swygert:

I am writing to you to discuss an item posted on your blog on August 7, 2003 relating to an alleged faulty question on the Texas Grade 10 math TAKS test...

I disagree entirely with the officials in Texas who concluded that the question had 2 different answers. I also respectfully disagree with the other experts you refer to.

I think that Mr. Braams may have been tricked into accepting an incorrect reading of the question. The question itself makes no reference to inscribed or circumscribed radii. The only fact that relates to the relevance of these radii is the explanation given by the Texas officials for accepting both answers.

The question contains only the following "facts:"

1. The shape is a regular octagon.
2. A triangle is drawn using one of the octagon's sides as its base.
3. The length of the sides of the triangle are defined to be 4.6 centimeters.
4. The altitude of the triangle is defined to be 4.0 cm.
5. The right angle symbol at the intersection of the base and the altitude along with the equal sides establishes that the triangle is an isosceles triangle (I think I have remembered the term correctly).
6. The drawing of the triangle APPEARS to show that the peak of the triangle is NEAR the center of the octagon.

Nowhere does the following fact appear:
1. The peak of the triangle is precisely at the center of the octagon. (Nor do any of the facts in combination allow a conclusion that the peak is the center of the octagon).

The alternate answer is based upon an APPARENT contradiction which in turn is based on a FACT that IS NOT PRESENTED as part of the question...

In other words, because the item doesn't state that the peak of the triangle in question is in fact the center of the octagon, the student shouldn't have assumed that it was. I appreciate that this reader took the time to write in with a lengthy explanation, because it might explain what the test developers were thinking.

I still say, however, that because the peak certainly appears to be near the center, this is a flawed item. These drawings are supposed to be precise, and it would not be surprising for a kid to make this assumption.

I also note that the instructions for the math section do not include any disclaimers such as, "Diagrams of shapes are not drawn to scale, and assumptions of placements of points, angles, etc. should not be made unless stated as fact." I have seen similar instructions for other exams, but I don't see that here.

Bas notes that the item, if read in this way, does indeed have one correct answer (see Addendum):

Please see the figure accompanying question 8 in the exam. The line segments that I described as inner and outer radii are not, in fact, identified as such in the figure or in the question. They meet at a point that certainly appears to be the center of the octagon, but that is not labelled either. There is, therefore, a reading of the question under which it has a single correct answer. Under that reading the given data are all correct, the special point is not meant to be the center of the octagon, and the figure is simply distorted in what happens to be a highly misleading way.

I still say that by including a triangle that appears to be placed at the center, with no qualifying statement as to what assumptions can be made, this is a bad item.

Bas also notes that quality control for TAKS does not seem to be very good:

The TEA (Texas Education Authority) put out an Additional Information Regarding Released Science Items for the spring 2003 testing cycle. Four controversial items are discussed.

Grade 5 Science, Item 13. Item 13 asked students which two planets are closest to Earth. Among the possible answers: Mercury and Venus, and Mars and Venus. The correct answer varies over time, and the question is plainly wrong or crazy...

Grade 10 Science, Item 50. Item 50 looks crazy to me - they seem to be testing in a most convoluted way that the student knows that the element symbol K stands for Potassium. The TEA discussion indicates that the item is factually wrong to boot, but they insist that it is valid just the same.

Grade 11 Science, Items 11 and 45. Question 11 asks for the force exerted by a jumping frog on a leaf. The force has two components: one due to the weight of the frog and the other due to its acceleration. These are to be added vectorially, but the direction of the jump is not given. The TEA insists that therefore the correct treatment of the question must ignore the weight of the frog. Obviously the question is wrong and the TEA is wrong to insist that it is correct...

And so on. Bas believes "it is too much to ask of the psychometric process that it correct for blunders of this kind," but in fact that's why field-tested items should be examined so closely. If an item is good, and it's related to the topic of the test, then it should have a large positive correlation with the total test score, and the smarter students should be more likely to choose the correct answer than any other. If there are no attractive distractors, then no incorrect answer should be preferred over any other. These methods can help flag some invalid items - not all, of course, but that's no reason not to closely examine the statistics.
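
As a rough illustration of the two checks just described, here is a small Python sketch. It computes the item-total (point-biserial) correlation for one item and counts which options the highest-scoring examinees chose. The function names, the answer key "B", and the response data are hypothetical. Operational item analysis typically also removes the item from the total score and uses far larger samples.

```python
from collections import Counter
from statistics import mean, pstdev


def item_total_correlation(item_correct, total_scores):
    """Pearson correlation between a 0/1 item score and the total score
    (the point-biserial correlation)."""
    mi, mt = mean(item_correct), mean(total_scores)
    si, st = pstdev(item_correct), pstdev(total_scores)
    if si == 0 or st == 0:
        return 0.0  # no variation, so the correlation is undefined; report 0
    cov = mean([(x - mi) * (y - mt) for x, y in zip(item_correct, total_scores)])
    return cov / (si * st)


def top_group_choices(choices, total_scores, fraction=0.5):
    """Count the answer options chosen by the highest-scoring examinees."""
    ranked = sorted(zip(total_scores, choices), reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    return Counter(choice for _, choice in ranked[:n_top])


# Hypothetical responses to one item whose keyed answer is "B".
choices = ["B", "B", "B", "C", "B", "A", "C", "D"]
totals = [38, 35, 33, 30, 24, 20, 18, 15]
item_correct = [1 if c == "B" else 0 for c in choices]

r = item_total_correlation(item_correct, totals)
top = top_group_choices(choices, totals)
print(round(r, 2))
print(top.most_common())
```

If an item really has two defensible answers, one sign would be the top group splitting its choices between the keyed answer and a single "wrong" option, along with a weak item-total correlation.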

Posted by kswygert at 11:22 AM | Comments (2)

August 08, 2003

Catching cheaters on the FCAT

Devoted Reader Nick sent along a followup to the story of the FCAT cheating allegations in Broward County. Don't remember that story? Neither do I, because I missed it when it first broke back in July. FCAT tampering was suspected at several schools because of unusual gains in the state-issued grades, which are dependent on FCAT performances:

Department of Education officials are looking into some results at Jackson Senior High. They have also requested more information from West Little River Elementary after the school's state-issued grade jumped from an F to an A following three years of Ds.

The problem at West Little River [Elementary school] appears most significant in the third grade, where 51 percent of the students were in the top two of five FCAT reading levels -- a highly unusual number for a school with such historically poor grades.

The large-score-gains method isn't the only one available for discovering potential cheaters. As with most high-stakes exams, the seating charts in classrooms were saved for further analysis, so that aberrant classrooms, or clumps of high-performers all seated next to one another, could be flagged.

What's more, students took a norm-referenced test one week after the FCAT, and those scores are normally correlated. If not, that could indicate cheating on either exam - but the Florida DOE doesn't actually have the software necessary to compare performances on these two exams.
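
The comparison itself need not be elaborate. The sketch below, in Python, fits a least-squares line predicting one test from the other and looks for the classroom that sits furthest above the line. The function name and the classroom means are hypothetical, and this is only one simple way such a check could be set up, not a description of any procedure the Florida DOE uses.

```python
from statistics import mean


def residuals_from_line(x, y):
    """Fit y = a + b*x by least squares and return each y minus its fitted value."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]


# Hypothetical classroom means: norm-referenced test (x) and FCAT (y).
nrt = [40.0, 45.0, 50.0, 55.0, 60.0, 48.0]
fcat = [41.0, 46.0, 50.0, 56.0, 61.0, 75.0]

res = residuals_from_line(nrt, fcat)
outlier = max(range(len(res)), key=lambda i: res[i])
print(outlier)  # 5
```

In this made-up data the last classroom scores about average on the norm-referenced test but far above the line on the FCAT, which is the kind of mismatch that would prompt a closer look.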

There's one elementary school in particular that has been the focus of investigation:

In Broward, the state is finishing an investigation at Park Ridge Elementary, the only case in that county in which widespread tampering is suspected. The Pompano Beach school had a D grade from 1999 to 2001, and then rocketed to an A in 2002.

The jump was so dramatic that Broward testing officials began reviewing the data, and the state quickly followed. The school's letter grade plummeted to an F this year, when the school was being watched closely. The fallout is still being felt. Grades are based in part on how much students improve, and it's hard to better a high score that wasn't truly earned in the previous year.

So, I missed this when it first happened, but thanks to Nick, I know the followup, which seems to be that no criminal charges will be filed against schools with suspicious scores, but administrators aren't yet off the hook:

No criminal charges will be filed in a case of possible FCAT cheating at Park Ridge Elementary in Pompano Beach, but teachers and administrators could still be disciplined...The Broward state attorney's office said there was statistical evidence of tampering by teachers and possibly administrators during the high-stakes test in 2002. But it was doubtful that a jury could be convinced of their guilt...

...the jump in letter grade [at Park Ridge] doesn't tell the complete story of just how improbable the accomplishment was.

In 2001, 1,528 of 1,728 schools in Florida had better FCAT scores than Park Ridge's third-graders. The next year, the Park Ridge students -- now in fourth grade -- bested all but 13 schools statewide.

''Clearly the scores were very, very high,'' [Superintendent Frank] Till said. ``Statistically they are beyond the range of real possibilities. It still doesn't mean it didn't happen. But the big disappointment is they didn't sustain themselves the next year.''

An e-mail tip to the Department of Education in the summer of 2002 prompted an investigation. This year, with the school under close watch, its grade plummeted to an F. Prosecutor Bernhard Hollar interviewed some teachers, principals, test proctors and some third-grade students.

• He found that one third-grade class had check marks next to most of the right answers on most of the test booklets.

• Some students said third-grade teachers Edward Peddell, Ealton McDuffie, Sheryol Daniels, and two teaching assistants helped them during the test.

However, Hollar wrote ''there is no conclusive supporting testimony to substantiate the claims.''

Hence the decision not to put this in front of a jury. And the check-mark thing - I know from experience that test-takers often mark their booklets as they're working. A check could mean that the test taker was sure of that answer and didn't need to come back to it. However, these are third-graders we're talking about, and the possibility that the students used it as a time-management strategy might be less likely than the possibility that teachers marked the booklets to indicate the correct answers.

The statistical evidence is, to me, damning (which may be why I rarely get chosen for jury duty). Park Ridge went from being in the bottom 200 of over 1700 schools to being in the top 14? That's an impossible jump. One administrator surmised that a combination of compassion and fear might lead teachers to help kids cheat, but teaching kids to cheat on tests is not by any means compassionate. It's immoral, and the only person who truly benefits from it is the teacher, because they will appear to have taught their students well. And fear should be an impetus to improve performance, not to sidestep indicators of performance.
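
For anyone who wants to check the arithmetic, here is the jump expressed as the percent of schools that did not outscore Park Ridge. The 1,528 and 1,728 figures come from the article quoted above. Using the same 1,728 total for the second year is an assumption, since the article gives only "all but 13 schools."

```python
def pct_not_better(n_better, n_total):
    """Percent of schools that did NOT score better than the given school
    (the school itself is included in the count)."""
    return 100.0 * (n_total - n_better) / n_total


# 2001: 1,528 of 1,728 schools scored better.
before = pct_not_better(1528, 1728)
# 2002: all but 13 schools were bested (total of 1,728 is assumed).
after = pct_not_better(13, 1728)

print(round(before, 1))  # 11.6
print(round(after, 1))   # 99.2
```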

Posted by kswygert at 10:53 AM | Comments (3)

July 22, 2003

Teaching that it's okay to cheat

Did three Bristol (PA) elementary school teachers help their students just a bit too much on the PSSA (Pennsylvania's standardized test)? PA's Dept. of Education thinks so, and the teachers may be employed at a school that was facing sanctions if test scores did not improve:

Bristol Township School District officials are waiting to hear whether the state determines three Buchanan Elementary teachers cheated on Pennsylvania's standardized assessment test...The investigation was prompted by a discussion between Erin and Justin Darr and their now 9-year-old son, Brandon, who was in third grade on test day during the spring. "My son said that he had spelled 'wolves' incorrectly on his paper and he turned it in to his teacher and she corrected it," Erin Darr said yesterday...

Buchanan Elementary was one of nine Bristol Township schools warned that it could face sanctions if scores did not improve. Darr thinks this pressure played a part in what her children told her.

"This was just absolutely appalling that they would teach the children to cheat to get a better score to keep their hind ends out of trouble," Darr said. "They're telling the children it's OK to cheat so you stay out of trouble," he said.

Exactly.

Posted by kswygert at 11:44 AM | Comments (1)

July 18, 2003

From bad to worse

The NY Regents Exam officials just can't catch a break. Now the Physics exam is being challenged, though officials are defending the test as "flawless":

Leonard Morochnick was so upset after 43 percent of his physics students at the New York City Lab School for Collaborative Studies failed the Regents physics exam last month that he sat down at his computer and banged out a lengthy analysis of the test...he and scores of other physics teachers across the state have continued to critique the test and to denounce what they perceive as injustices to anyone who will listen.

The latter group has not seemed to include state officials in Albany, who defend the exam as technically flawless...

There are no official statistics, but some teachers who have assembled test results from schools across the state estimate that more than 40 percent of the 40,000 students who took the exam on June 17 failed it. Many who failed had received good scores on the College Board's SAT II physics test, educators said.

Why are there no official statistics? Is it that the Regents Exam normally distributes the pass/fail rate but hasn't done so yet? Or is that information normally withheld?

The upshot is that, even if the reaction to this is not "snowballing" like the reaction to the obviously-flawed Math exam, schools are deciding not to use the Physics exam, and the New York State Association of School Superintendents sent a letter out to college admissions officers in an effort to convince them to disregard the Physics exam results.

Oh, wait, here are some "official statistics," but only for the June 2002 test:

Criticism of the test began immediately after it was introduced in June 2002. Commissioner Mills announced on the basis of a preliminary review that 33 percent of students had failed it. Data collected in a complete survey, announced later, showed a 39 percent failure rate. That was more than double the 17 percent failure rate on the previous Regents physics test...

The fail rate more than doubled and the state didn't think anything was wrong? Are they mad? Or just in denial? According to one SUNY-Buffalo professor who studied a series of answer sheets, the passing standard was changed substantially from 2001 to 2002 - to pass the June 2001 exam, students had to get 50 percent of the items right, whereas on the June 2002 and subsequent exams, students had to get 68 percent of the items right.

When the definition of passing changes this substantially, I don't see how the testing process can be considered consistent from year to year. Is there some documentation as to why the standard was changed so dramatically, and is there any theoretical basis as to why it should have been? Even if the state felt that a 17% failure rate was too low, it's ludicrous to raise the bar that high from one year to the next.

If an exam that is meant to be of equal difficulty from year to year turns out to be a little bit too easy, then the passing rate might fluctuate from year to year as well. But according to observers, the test contained more difficult items and required a larger number of items correct to pass. No wonder teachers report that physics class enrollments are down.

Posted by kswygert at 10:55 AM | Comments (0)

July 01, 2003

Regents Math flap continues!

The New York Assistant Education Commissioner, Roseanne DeFabio, has stepped down, amid a tornado of negative testing publicity surrounding the Math portion of the NY Regents Exam. She's actually taking an "early retirement," but there doesn't seem to be any controversy as to the precipitating factor:

Assistant Education Commissioner Roseanne DeFabio, 59, opted to take early retirement...Education Department spokesman Tom Dunn said Tuesday. Dunn said state Education Commissioner Richard Mills [who recently voided the June 2003 Math portion of the Regents exam] wanted the Office of Assessment, which develops the standardized tests, to directly report to Deputy Education Commissioner James Kadamus instead of DeFabio, but she refused to accept reassignment. Her resignation took effect immediately...

The shuffling was made so that the Office of Assessment "receives the resources and attention to ensure that the assessment system remains the cornerstone of the Regents' strategy...[in other words, so that blunders this big don't happen again...]

During DeFabio's tenure, the Education Department came under fire several times over allegations of faulty Regents tests, including the sanitizing of literary passages on the English Regents test last June...Last week, Mills gave schools the option of tossing out scores from the Math A Regents exam that is normally a prerequisite for high school graduation, admitting that the test was flawed...

To the dismay of students and parents, the Education Department refused to change the scoring of its June 2002 physics Regents test despite lower passing grades and complaints from some teachers that the new format of the test was too hard. The department did schedule a makeup test that gave students another chance...

I'm betting this will not be the only shakeup within the Education Department.

Posted by kswygert at 06:09 PM | Comments (2)

June 27, 2003

More on the NY Regents Math fiasco

Scores from the faulty June administration of the New York Regents exam Math A section were discarded because the test form was too hard, and this is causing untold headaches for NY's students. This description of the resulting chaos in the Kingston school district is but one example:

For seniors who took the exam, the remedy...is clear: If the student passed the Math A course, that student may substitute the course grade for the exam and then graduate. A total of 22 Kingston High School seniors took the Math A exam. According to [Kingston Assistant Superintendent for Curriculum Grere] Fischer, two seniors passed the exam and 18 will graduate today based on their passing course grade...

But how this will affect the grades of freshman, sophomores and juniors is not yet clear...[state Education Commissioner Richard] Mills originally said that juniors could substitute their course grades for the test as well, but recent communications to school districts stated that the state may re-score the exams...

For freshman and sophomores it gets a little more complicated...a sophomore can take the Math A exam at the next opportunity, which means that a sophomore who might be continuing with Math B this upcoming year may have to take the Math A exam in January after a whole semester away from the material...four years of a certain exam are developed at the same time. There could be three more years of faulty Math A exams to come...

Fischer said that the Math A fiasco could have also affected students grades on other standardized tests that week. She said that many teachers reported students crying in the middle of taking other exams after finding out that they "bombed" Math A.

Bas Braams has examined previous Math sections of the Regents exams and found them to be lacking in quality as well. If he is correct in his statement (stated in a previous comment on this page) that the test designers "clearly do not have an adequate background in mathematics, or even in precise and clear use of language," the future of the Regents Math exam is starting to appear rather bleak.

Posted by kswygert at 02:37 PM | Comments (2)

June 24, 2003

Test design blunder negates NY Regents scores

N2P Reader and humble genius Bas Braams has sent a couple of emails my way regarding the mathematics portion of the NY Regents Exam. For those of you unfamiliar with the exam, it's produced by the New York State Education Department to assess if students have met the New York State Learning Standards, and it's a requirement for graduation in New York State. The administrator's manual for the exam is here.

The first email that Bas sent covers the negative press the June 2003 exam had been receiving, and he concluded that perhaps the exam was okay, although the difficulty level may have been inconsistent with past exams.

However, he followed this up with an email in which he decided, after further inspection, that it appeared the January and June 2003 exams differed greatly in difficulty, and that some items were badly worded as well. The open-response section of the June exam appeared to him to be more difficult than the corresponding section in the January exam. Now, even when exams are well-constructed, some difficulty fluctuations may appear, but if care is taken when assembling the test forms, the fluctuations will be small, and can be corrected through equating (this is done on the SAT).
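
For readers who haven't seen equating before, here is a minimal sketch of the simplest version, linear equating, which rescales scores on one form so their mean and standard deviation match those of another form. It assumes the two forms were taken by comparable groups of students. The function name and the score lists are made up for illustration, and this is not a description of the procedure actually used for the Regents or the SAT.

```python
from statistics import mean, pstdev

def linear_equate(score_x, form_x_scores, form_y_scores):
    """Map a raw score on form X onto the scale of form Y.

    Assumes (hypothetically) that both forms were taken by comparable
    groups, so differences in mean and spread reflect form difficulty.
    """
    mean_x, mean_y = mean(form_x_scores), mean(form_y_scores)
    sd_x, sd_y = pstdev(form_x_scores), pstdev(form_y_scores)
    if sd_x == 0:
        raise ValueError("form X scores have no spread; cannot equate")
    # Match standardized positions: (x - mean_x) / sd_x == (y - mean_y) / sd_y
    return mean_y + (sd_y / sd_x) * (score_x - mean_x)

# Made-up raw scores: the June form runs about 10 points harder.
june_scores = [40, 50, 60]
january_scores = [50, 60, 70]

# A 50 on the harder June form corresponds to about a 60 on the January scale.
equated = linear_equate(50, june_scores, january_scores)
print(round(equated, 2))
```

A correction like this only works well when the difficulty gap between forms is modest; it can't rescue a form whose items are badly worded in the first place.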

Now, though, it's a moot point, because schools have been given the option of discarding the math portion of the exam. The test construction method must have failed at some point, because the statistics show the June exam was in fact far more difficult than the January form:

A high failure rate on the Regents exam had called into question the fairness of the test and imperiled the right of thousands of seniors to graduate from high school this week. Commissioner Richard Mills also ruled that the planned August administration of the math test would be suspended to give education officials more time to review the June test results and why so many students failed...

Last Tuesday, tens of thousands of high school students across the state took the test, known as the Math A Regents examination. The state immediately ordered a speedy review of the results because of initial reports by school districts and parents of unusually low passing rates...

Speediness now doesn't quite compensate for incompetence earlier. Incidents like these bother me a great deal, not only because I feel for the kids who had to take an exam that is essentially going to be thrown out, but because this also adds fuel to the fire for those who advocate removing high-stakes tests altogether.

Posted by kswygert at 08:26 PM | Comments (3)

June 18, 2003

More test scoring errors?

A new study released by the National Board for Educational Testing and Public Policy claims that the number of human scoring errors reported by testing companies has risen to an indefensibly high number. The report, not surprisingly, cautions against the use of exam scores in high-stakes decisions, and bemoans the lack of an overseeing US agency to audit and regulate testing processes and products.

The authors point out that all test scores contain some form of error, if only random measurement error; that both random and human error should be considered when using test scores for high-stakes decisions; and that forcing companies to produce high-stakes tests without sufficient development time is asking for disaster. These points are correct. However, I don't feel the authors really convey any useful solutions.

For one, the question of what to use for high-stakes assessments, if not tests, is left unanswered. The usual suspects - grades, performance assessments, holistic judgements, and the like - are not only no less likely to be error-prone than test scores, but are more difficult to assess for error in the first place. Using them to compare schools within states, or states within the country, is not feasible. Schools may decide to use a combination of assessments for exit exam purposes, but the standardized tests were often chosen specifically because, due to their objective nature, they do reduce the random measurement error.

Does it improve matters if a school tries to balance out a standardized test, which is objectively scored but may have human error, with a performance assessment, which is guaranteed to contain more random measurement error than the standardized test, and perhaps more human scoring error as well? The authors warn against using a single score to make a decision - but what if the other possible scores for that decision are less reliable and more subjective? It might be appropriate to use these scores in conjunction with the objective test, but that's by no means a given.

One of the testing errors listed in this study (page 36 (19-1999)) is an example of this. This error is essentially a training error for human scorers - an error that is all too common in performance assessments, and one that could have been avoided with a more standardized, objective judgment.

It bothers me as well that the authors don't distinguish between degrees of errors. For example, of the 78 errors listed in Appendices A and B, 17 of them are situations that involve only one or two miskeyed, typographically-incorrect, or otherwise-flawed items on a test. Errors, yes; indications that the testing company is going to hell in a handbasket, no. Lumping these smaller, well-nigh-inevitable flubs in with the more substantial equating and scoring problems dilutes the impact of the bigger problems.

The authors also fudge the numbers a bit by counting some errors more than once if they affected more than one state. For example, TerraNova's massive miscalculation of percentile scores is counted four times in the Appendix (p. 37). Is an error more than one error if it affects more than one set of tests? In one sense, perhaps, but in another sense, if it's all one root cause, then it shouldn't be counted as multiple errors.

And in two cases, the authors have labeled something that was not under the control of the testing company as an error. On page 44, error (3)1980, we see that in 1980 ETS informed 163 students that tests were lost. This could have been ETS's error; more likely, the person at the other end who was responsible for shipping the tests did not follow directions - or the tests were lost (or stolen) through the mail. I can personally vouch for the fact that this has happened at companies other than ETS.

And speaking of stolen tests, check out page 47, error (19)2001. In this situation, someone stole a test form, and ETS, after suspecting cheating, demanded a retest at that high school. That's NOT a testing error. People steal test forms all the time, and test companies are forced to declare the forms and items missing. They may choose to do a retest, and if the scores soar upwards, it's perfectly legitimate for them to suspect cheating. If they catch the thieves, they can slap them with theft and copyright violation charges.

Can the amazing proliferation of testing in the late 90's explain some of these errors? It certainly can, in more ways than one. For starters, more tests mean more errors. It's also possible that testing errors which would not have been discovered before are being discovered now, because of the increasingly-high-stakes nature of many of these new exams. With the big, established testing companies, more quality assurance checks are in place now, and so more errors are being caught. This doesn't necessarily mean more errors are occurring. I think the numbers from 1976 are impossibly low, and that there were some errors back then that went undiscovered.

It's probably true, though, that not only are there more tests, but some of them have been rushed into production and administration, at speeds that would not have previously been acceptable. New testing companies have also sprung up to meet the need, and not all of these companies follow good quality assurance guidelines.

Some of the errors reported in this study are indeed problematic, and could have been avoided with better quality assurance systems (and more time for test development). The states and school districts have their roles to play as well, in choosing testing companies carefully and following standard procedures. As for the suggestion of a US agency to oversee testing, well, I'm skeptical about the potential of a federal agency to streamline the situation. Is the testing industry really analogous to the airline industry, in which a much-needed reduction in fatalities was achieved through federal intervention? Or would further meddling from Washington DC only exacerbate the problem?

Posted by kswygert at 11:55 AM | Comments (5)

June 12, 2003

Department of Alarming Statistics

The malcontents at Fark have uncovered a SacBee education article containing a priceless little nugget of wisdom. It seems Sacramento City Councilwoman Lauren Hammond isn't happy with School Superintendent Jim Sweeney, despite the fact that, during his tenure, the number of low-performing Sacramento schools has dropped from 18 to 1. It seems, however, that there's this one little problem that he hasn't been able to fix:

"I don't doubt that Jim Sweeney loves children and had dedicated his life's career to improving education," [Hammond] said. "The school district has done some wonderful things ... but (on state tests) half the students are still below the 50th percentile. That's a problem."

Please tell me that at least the reporter understood the idiocy of this statement, and managed to stifle her laughter when Hammond made the comment. The Farksters certainly understood it, as one can tell from the label they slapped on the story.
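
For anyone who wants the arithmetic spelled out, here is a small sketch. It assumes the 50th percentile is computed within the same group of students being counted (if it were defined against some outside norm group, the share could differ). The function name and scores are made up for illustration.

```python
from statistics import median

def share_below_median(scores):
    """Fraction of scores strictly below the group's own median (50th percentile)."""
    cutoff = median(scores)
    return sum(score < cutoff for score in scores) / len(scores)

# Made-up scores for ten students, with no ties.
scores = list(range(1, 11))
print(share_below_median(scores))
```

With no tied scores, half the group lands below its own median no matter how high or low everyone scores, so the figure says nothing about whether the district is doing well.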

Posted by kswygert at 03:40 PM | Comments (7)