Every year, millions of students sit down for standardized tests that carry weighty consequences. National tests like the Graduate Record Examinations (GRE) serve as gatekeepers to higher education, while state assessments can determine everything from whether a student will graduate to federal funding for schools and teacher pay.
Traditional paper-and-pencil tests have given way to computerized versions. And increasingly, the grading process—even for written essays—has also been turned over to algorithms.
Natural language processing (NLP) artificial intelligence systems—often called automated essay scoring engines—are now either the primary or secondary grader on standardized tests in at least 21 states, according to a survey conducted by Motherboard. Three states didn’t respond to the questions.
Of those 21 states, three said every essay is also graded by a human. But in the remaining 18 states, only a small percentage of students’ essays (it varies between 5 and 20 percent) will be randomly selected for a human grader to double-check the machine’s work.
But research from psychometricians—professionals who study testing—and AI experts, as well as documents obtained by Motherboard, show that these tools are susceptible to a flaw that has repeatedly sprung up in the AI world: bias against certain demographic groups. And as a Motherboard experiment demonstrated, some of the systems can be fooled by nonsense essays with sophisticated vocabulary.
Essay-scoring engines don’t actually analyze the quality of writing. They’re trained on sets of hundreds of example essays to recognize patterns that correlate with higher or lower human-assigned grades. They then predict what score a human would assign an essay, based on those patterns.
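To make that pattern-matching idea concrete, here is a deliberately tiny sketch in Python. It is not any vendor's engine: the single feature (word count), the toy training pairs, and the function names are all assumptions chosen for illustration. It fits a straight line from essay length to human-assigned scores, then predicts a score for a new essay without ever considering what the essay says.

```python
from statistics import mean


def word_count(essay: str) -> int:
    """Surface feature: number of whitespace-separated tokens."""
    return len(essay.split())


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def predict_score(model, essay, low=0.0, high=6.0):
    """Predict the score a human would likely give, clamped to the scale."""
    slope, intercept = model
    raw = slope * word_count(essay) + intercept
    return max(low, min(high, raw))


# Hypothetical training set of (essay text, human score) pairs.
# Real engines are trained on hundreds of essays and many more features.
training = [("word " * n, score)
            for n, score in [(50, 2), (100, 3), (150, 4), (200, 5)]]
model = fit_line([word_count(e) for e, _ in training],
                 [s for _, s in training])

# A long string of gibberish scores well because only length is measured.
gibberish = "reprover " * 250
gibberish_score = predict_score(model, gibberish)
```

Because the model only sees a correlate of quality, anything that shifts that correlate, whether padding or a group's writing habits, shifts the predicted score too.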
“The problem is that bias is another kind of pattern, and so these machine learning systems are also going to pick it up,” said Emily M. Bender, a professor of computational linguistics at the University of Washington. “And not only will these machine learning programs pick up bias in the training sets, they’ll amplify it.”
An interactive map shows which U.S. states utilize automated essay scoring systems, according to a Motherboard investigation.
The education industry has long grappled with conscious and subconscious bias against students from certain language backgrounds, as demonstrated by efforts to ban the teaching of black English vernacular in several states.
AI has the potential to exacerbate discrimination, experts say. Training essay-scoring engines on datasets of human-scored answers can ingrain existing bias in the algorithms. But the engines also focus heavily on metrics like sentence length, vocabulary, spelling, and subject-verb agreement—the parts of writing that English language learners and other groups are more likely to do differently. The systems are also unable to judge more nuanced aspects of writing, like creativity.
Nevertheless, test administrators and some state education officials have embraced the technology. Traditionally, essays are scored jointly by two human examiners, but it is far cheaper to have a machine grade an essay, or serve as a back-up grader to a human.
Research is scarce on the issue of machine scoring bias, partly due to the secrecy of the companies that create these systems. Test scoring vendors closely guard their algorithms, and states are wary of drawing attention to the fact that algorithms, not humans, are grading students’ work. Only a handful of published studies have examined whether the engines treat students from different language backgrounds equally, but they back up some critics’ fears.
The nonprofit Educational Testing Service is one of the few vendors, if not the only one, to have published research on bias in machine scoring. Its “E-rater” engine is used to grade a number of statewide assessments, the GRE, and the Test of English as a Foreign Language (TOEFL), which foreign students must take before attending certain colleges in the U.S.
“This is a universal issue of concern, this is a universal issue of occurrence, from all the people I’ve spoken to in this area,” David Williamson, ETS’ vice president of new product development, told Motherboard. “It’s simply that we’ve been public about it.”
In studies from 1999, 2004, 2007, 2008, 2012, and 2018, ETS found that its engine gave higher scores to some students, particularly those from mainland China, than did expert human graders. Meanwhile, it tended to under-score African Americans and, at various points, Arabic, Spanish, and Hindi speakers, even after attempts to reconfigure the system to fix the problem.
“If we make an adjustment that could help one group in one country, it’s probably going to hurt another group in another country,” said Brent Bridgeman, a senior ETS researcher.
The December 2018 study delved into ETS’ algorithms to determine the cause of the disparities.
E-rater tended to give students from mainland China lower scores for grammar and mechanics, when compared to the GRE test-taking population as a whole. But the engine gave them above-average scores for essay length and sophisticated word choice, which resulted in their essays receiving higher overall grades than those assigned by expert human graders. That combination of results, Williamson and the other researchers wrote, suggested many students from mainland China were using significant chunks of pre-memorized shell text.
African Americans, meanwhile, tended to get low marks from E-rater for grammar, style, and organization—a metric closely correlated with essay length—and therefore received below-average scores. But when expert humans graded their papers, they often performed substantially better.
The bias can severely impact how students do on high-stakes tests. The GRE essays are scored on a six-point scale, where 0 is assigned only to incomplete or wildly off-topic essays. When the ETS researchers compared the average difference between expert human graders and E-rater, they found that the machine boosted students from China by an average of 1.3 points on the grading scale and under-scored African Americans by 0.81 points. Those are just the mean results; for some students, the differences were even more drastic.
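The comparison behind figures like these is, at its simplest, a per-group average of the machine score minus the human score. The Python sketch below shows that arithmetic only; the group labels and scores are made up for illustration, and it is not a reconstruction of ETS' actual analysis.

```python
from collections import defaultdict
from statistics import mean


def mean_gap_by_group(records):
    """Average (machine - human) score difference for each group.

    Positive values mean the machine scored the group higher than humans did.
    """
    gaps = defaultdict(list)
    for group, human, machine in records:
        gaps[group].append(machine - human)
    return {group: mean(values) for group, values in gaps.items()}


# Made-up records: (group label, human score, machine score).
records = [
    ("group_a", 3.0, 4.0),
    ("group_a", 4.0, 5.5),
    ("group_b", 4.0, 3.5),
    ("group_b", 5.0, 4.0),
]
gaps = mean_gap_by_group(records)
```

With this toy data, group_a comes out over-scored by 1.25 points on average and group_b under-scored by 0.75, even though no single essay in either group shows exactly that gap.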
All essays scored by E-rater are also graded by a human and discrepancies are sent to a second human for a final grade. Because of that system, ETS does not believe any students have been adversely affected by the bias detected in E-rater.
It is illegal under federal law to disclose students’ scores on the GRE and other tests without their written consent, so outside audits of systems like E-rater are nearly impossible.
One of the other rare studies of bias in machine scoring, published in 2012, was conducted at the New Jersey Institute of Technology, which was researching which tests best predicted whether first-year students should be placed in remedial, basic, or honors writing classes.
Norbert Elliot, the editor of the Journal of Writing Analytics who previously served on the GRE’s technical advisory committee, was an NJIT professor at the time and led the study. It found that ACCUPLACER, a machine-scored test owned by the College Board, failed to reliably predict female, Asian, Hispanic, and African American students’ eventual writing grades. NJIT determined it couldn’t legally defend its use of the test if it were challenged under Title VI or VII of the federal Civil Rights Act.
The ACCUPLACER test has since been updated, but lots of big questions remain about machine scoring in general, especially when no humans are in the loop.
Several years ago, Les Perelman, the former director of writing across the curriculum at MIT, and a group of students developed the Basic Automatic B.S. Essay Language (BABEL) Generator, a program that patched together strings of sophisticated words and sentences into meaningless gibberish essays. The nonsense essays consistently received high, sometimes perfect, scores when run through several different scoring engines.
Motherboard replicated the experiment. We submitted two BABEL-generated essays—one in the “issue” category, the other in the “argument” category—to the GRE’s online ScoreItNow! practice tool, which uses E-rater. Both received scores of 4 out of 6, indicating the essays displayed “competent examination of the argument and convey(ed) meaning with acceptable clarity.”
Here’s the first sentence from the essay addressing technology’s impact on humans’ ability to think for themselves: “Invention for precincts has not, and presumably never will be undeniable in the extent to which we inspect the reprover.”
“The BABEL Generator proved you can have complete incoherence, meaning one sentence had nothing to do with another,” and still receive a high mark, Perelman told Motherboard.
“Automated writing evaluation is simply a means of tagging elements in a student’s work. If we overemphasize written conventions, standard written English, then you can see that the formula that drives this is only going to value certain kinds of writing,” Elliot, the former NJIT professor, said. “Knowledge of conventions is simply one part of a student’s ability to write … There may be a way that a student is particularly keen and insightful, and a human rater is going to value that. Not so with a machine.”
Elliot is nonetheless a proponent of machine scoring essays—so long as each essay is also graded by a human for quality control—and using NLP to provide instant feedback to writers.
“I was critical of what happened at a particular university [but] … I want to be very open to the use of technology to advance students’ successes,” he said. “I certainly wouldn’t want to shut down this entire line of writing analytics because it has been found, in certain cases, to sort students into inappropriate groups.”
But the existence of bias in the algorithms calls into question even the benefits of automated scoring, such as instant feedback for students and teachers.
“If the immediate feedback you’re giving to a student is going to be biased, is that useful feedback? Or is that feedback that’s also going to perpetuate discrimination against certain communities?” Sarah Myers West, a postdoctoral researcher at the AI Now Institute, told Motherboard.
In most machine scoring states, any of the randomly selected essays with wide discrepancies between human and machine scores are referred to another human for review.
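As a rough illustration of that audit process, the Python sketch below draws a random sample of essays for a human read and flags any whose two scores are far apart. The 10 percent sampling rate sits inside the 5 to 20 percent range states reported, but it, the one-point `max_gap` threshold, and the function names are assumptions, not any state's documented procedure.

```python
import random


def audit_sample(essay_ids, fraction, seed=0):
    """Randomly pick a fraction of essays for a human second read."""
    k = round(len(essay_ids) * fraction)
    return set(random.Random(seed).sample(essay_ids, k))


def needs_review(machine_score, human_score, max_gap=1.0):
    """Flag essays whose two scores differ by more than max_gap points."""
    return abs(machine_score - human_score) > max_gap


# Hypothetical batch of 1,000 essays, with 10 percent sampled for audit.
essay_ids = list(range(1000))
sampled = audit_sample(essay_ids, fraction=0.10)
```

Under a scheme like this, the essays outside the sample are never seen by a human at all, so a discrepancy can only be caught for the fraction that happens to be drawn.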
Utah has been using AI as the primary scorer on its standardized tests for several years.
“It was a major cost to our state to hand score, in addition to very time consuming,” said Cydnee Carter, the state’s assessment development coordinator. The automated process also allowed the state to give immediate feedback to students and teachers, she said.
Through public records requests, Motherboard obtained annual technical reports prepared for the state of Utah by its longest-serving test provider, the nonprofit American Institutes for Research (AIR). The reports offer a glimpse into how providers do and don’t monitor their essay-scoring systems for fairness.
Each year, AIR field tests new questions during the statewide assessments. One of the things it monitors is whether female students or those from certain minority groups perform better or worse on particular questions than white or male students who scored similarly overall on the tests. The measurement is known as differential item functioning (DIF).
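One simple way to express that idea in code is a standardized mean difference: group students by overall score, compare the two groups' average score on a single question within each band, and weight the gaps by the size of the focal group. The Python sketch below uses that approach with made-up data. Whether AIR computes DIF this way is not stated in the reports described here, and the cutoffs separating mild from severe DIF are not given, so none are included.

```python
from collections import defaultdict
from statistics import mean


def standardized_mean_difference(records):
    """Focal-minus-reference gap on one question, matched on total score.

    records: iterable of (group, total_score, item_score), where group is
    "focal" or "reference". Score bands with only one group are skipped.
    """
    strata = defaultdict(lambda: {"focal": [], "reference": []})
    for group, total, item in records:
        strata[total][group].append(item)

    weighted_sum = 0.0
    weight = 0
    for scores in strata.values():
        focal, reference = scores["focal"], scores["reference"]
        if focal and reference:
            weighted_sum += len(focal) * (mean(focal) - mean(reference))
            weight += len(focal)
    return weighted_sum / weight if weight else 0.0


# Made-up records: (group, overall test score, score on the question).
records = [
    ("focal", 20, 2), ("focal", 20, 3),
    ("reference", 20, 3), ("reference", 20, 4),
    ("focal", 30, 4), ("reference", 30, 5),
    ("reference", 40, 6),  # no focal students in this band, so it is skipped
]
dif = standardized_mean_difference(records)
```

A negative result, as with this toy data, means focal-group students scored lower on the question than reference-group students with the same overall score.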
During the 2017-2018 school year in Utah, AIR flagged 348 English Language Arts questions that exhibited mild DIF against minority or female students in grades 3 through 8, compared to 40 that exhibited mild DIF against white or male students. It also flagged 3 ELA questions that demonstrated severe DIF against minorities or females.
Questions flagged for severe DIF go before AIR’s fairness and sensitivity committee for review.
It can be difficult to determine the cause of bias in these cases. It could be a result of the prompt’s wording, of a biased human grader, or of bias in the algorithms, said Susan Lottridge, the senior director of automated scoring at AIR.
“We don’t really know the source of DIF when it comes to these open-ended items,” she said. “I think it’s an area that’s really in the realm of research right now.”
Overall, AIR’s engine performs “reasonably similar across the (demographic) groups,” Lottridge said.
For some educators, that’s not enough. In 2018, Australia shelved its plan to implement machine scoring on its national standardized test due to an outcry from teachers and writing experts like Perelman. And across the amorphous AI industry, questions of bias are prompting companies to reconsider the value of these tools.
“It is a tremendously big issue in the broader field of AI,” West said. “That it remains a persistent challenge points to how complex and deeply rooted issues of discrimination are in the field … Just because a problem is difficult, doesn’t mean it’s something we don’t need to solve, especially when these tests are being used to decide people’s access to credentials they need to get a job.”