When ChatGPT was released to the public in November 2022, advocates and watchdogs warned of the risk of racial bias. The new large language model was created by harvesting 300 billion words from books, articles, and online writing, which includes racist falsehoods and reflects its authors' implicit biases. Biased training data is likely to generate biased advice, answers, and essays. Garbage in, garbage out.
Researchers are beginning to document how AI bias is showing up in unexpected ways. At the research and development arm of the giant testing agency ETS, which administers the SAT, two investigators pitted humans against machines in grading more than 13,000 essays written by students in grades 8 through 12. They found that the AI model that powers ChatGPT penalized Asian-American students more than students of other races and ethnicities when scoring the essays. This was purely a research exercise; the essays and the machine scores were not used in any of ETS's assessments. But the organization shared its analysis with me to warn schools and teachers about the risk of racial bias when using ChatGPT or other AI applications in the classroom.
AI and humans graded essays differently based on race and ethnicity
“You have to be careful and evaluate the results before you present them to students,” said Mo Zhang, one of the ETS researchers who led the analysis. “There are methods to do this, and you shouldn’t leave educational assessment specialists out of the equation.”
That may sound self-serving coming from an employee of a company that specializes in educational assessment. But Zhang's advice is worth considering amid the excitement over trying out new AI technology. There are potential pitfalls when teachers save time by handing the grading work over to a robot.
In the ETS analysis, Zhang and his colleague Matt Johnson fed the 13,121 essays into one of the latest versions of the AI model that powers ChatGPT, called GPT-4 Omni, or simply GPT-4o. (That version was added to ChatGPT in May 2024, but when the researchers ran this experiment, they accessed the latest AI model through a different portal.)
Some background on this large batch of essays: Students from across the country originally wrote these essays between 2015 and 2019 as part of standardized state exams or classroom assessments. Their task was to write an argumentative essay, such as “Should students be allowed to use cell phones in school?” The essays were collected to help scientists develop and test an automated writing assessment.
Each essay was scored by expert writing raters on a scale of 1 to 6 points, with 6 being the highest score. ETS asked GPT-4o to score the essays on the same six-point scale, using the same scoring guide the human raters used. Neither the humans nor the machine was told the students' race or ethnicity, but the researchers could see the demographic information in the datasets that accompany the essays.
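ETS has not published the exact prompt it used, but the setup it describes, a fixed scoring guide, the same six-point scale, and no demographic information, can be sketched in a few lines. The following is a minimal illustration using the OpenAI Python client; the rubric wording, the `score_essay` function, and the integer-only reply format are my assumptions, not ETS's method.

```python
# Minimal sketch of zero-shot essay scoring with GPT-4o via the OpenAI
# Python client. The rubric text and response parsing are placeholders;
# ETS has not published the prompt it actually used.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "Score the argumentative essay on a 1-6 scale, where 6 is the highest. "
    "Judge the claim, evidence, organization, and language use. "
    "Reply with the integer score only."
)

def score_essay(essay_text: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,  # make scoring as deterministic as possible
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": essay_text},
        ],
    )
    return int(response.choices[0].message.content.strip())
```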
On average, GPT-4o scored the essays almost a full point lower than the humans did: 2.8 versus 3.7 across all 13,121 essays. But essays by Asian Americans were docked an extra quarter-point. Human raters gave Asian Americans an average score of 4.3, while GPT-4o gave them only 3.2, a gap of about 1.1 points. By contrast, the gap between human and GPT-4o scores was only about 0.9 points for white, black, and Hispanic students. Imagine an ice cream truck that keeps shaving an extra quarter scoop off the cones of Asian-American children alone.
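To make the numbers concrete, the extra penalty is simply the difference between two human-minus-AI gaps: roughly 1.1 minus 0.9, or about 0.2 points, a quarter-point on a six-point scale. Here is a toy sketch of that per-group calculation; the file and column names are hypothetical, not ETS's data or analysis code.

```python
# Sketch of the gap calculation. Assumes a CSV with one row per essay and
# hypothetical columns "group", "human_score", and "gpt4o_score".
import pandas as pd

df = pd.read_csv("scored_essays.csv")               # hypothetical file
df["gap"] = df["human_score"] - df["gpt4o_score"]   # human minus AI score

# Mean human-AI gap per demographic group; in the ETS data this was
# roughly 1.1 points for Asian Americans vs. about 0.9 for other groups.
print(df.groupby("group")["gap"].mean().sort_values(ascending=False))
```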
“Clearly, this doesn’t seem fair,” Johnson and Zhang wrote in an unpublished report they shared with me. While the extra penalty for Asian Americans isn’t huge, they said, it’s too large to ignore.
The researchers aren’t sure why GPT-4o gave lower grades than humans and why it penalized Asian Americans more. Zhang and Johnson described the AI system as a “massive black box” of algorithms that operate in ways “not fully understood by their own developers.” This inability to explain a student’s grade on a written assignment makes the systems particularly frustrating to use in schools.
This study doesn’t prove that AI systematically under-scores essays or is biased against Asian Americans. Other versions of AI sometimes produce different results. A separate analysis of essay scoring by researchers at the University of California, Irvine, and Arizona State University found that AI essay scores were just as likely to be too high as too low. That study, which used ChatGPT version 3.5, didn’t break out results by race and ethnicity.
I wondered if the AI’s bias against Asian Americans was somehow related to their high academic achievement. Just as Asian Americans tend to score high on math and reading tests, Asian Americans were, on average, the best writers in this batch of 13,000 essays. Even with the penalty, Asian Americans still had the highest scores on the essays, well above those of white, black, Hispanic, Native American, or multiracial students.
In both the ETS and the UC-ASU essay studies, AI gave far fewer perfect scores than humans did. In the ETS study, for example, humans awarded 732 perfect 6s, while GPT-4o gave out a total of only three. GPT’s stinginess with perfect scores may have hit Asian Americans especially hard, since many of them received 6s from human raters.
The ETS researchers asked GPT-4o to score essays cold, without showing the chatbot any scored examples to calibrate its scores. It’s possible that a few sample essays or small tweaks to the scoring instructions, or prompts, given to ChatGPT could reduce or eliminate bias against Asian Americans. Perhaps the bot would be fairer to Asian Americans if it were explicitly asked to “give more perfect 6s.”
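As a sketch of what that calibration might look like, here is a hypothetical few-shot version of the scoring call: a handful of human-scored example essays are shown to the model before the essay to be graded. The example pairs, rubric wording, and function are assumptions for illustration; whether this would actually shrink the gap is exactly what has not been tested.

```python
# Sketch of few-shot calibration: show the model a few human-scored
# example essays before the essay to be scored. Placeholder rubric and
# examples; not ETS's method or data.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Score the argumentative essay on a 1-6 scale, 6 highest. "
    "Reply with the integer score only."
)

def score_essay_few_shot(essay_text, examples):
    """examples: list of (essay_text, human_score) pairs used as calibration."""
    messages = [{"role": "system", "content": RUBRIC}]
    for example_essay, human_score in examples:
        messages.append({"role": "user", "content": example_essay})
        messages.append({"role": "assistant", "content": str(human_score)})
    messages.append({"role": "user", "content": essay_text})
    response = client.chat.completions.create(
        model="gpt-4o", temperature=0, messages=messages
    )
    return int(response.choices[0].message.content.strip())
```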
ETS researchers told me that this isn’t the first time they’ve noticed Asian students being treated differently by a robotic grader. Earlier automated essay graders, which used different algorithms, have sometimes done the opposite, giving Asian students higher scores than human graders did. For example, an automated ETS grading system developed more than a decade ago, called e-rater, tended to inflate the scores of students from Korea, China, Taiwan, and Hong Kong on their essays for the Test of English as a Foreign Language (TOEFL), according to a study published in 2012. That may have been because some Asian students had memorized well-structured paragraphs, while human raters were quicker to notice when an essay was off-topic. (The ETS website states that it relies on the e-rater score alone only for practice tests, and uses it in conjunction with human scores for actual exams.)
Asian Americans also scored higher on an automated grading system created in a 2021 coding competition and powered by BERT, which had been among the most advanced language models before the current generation of large language models, such as GPT. The computer scientists ran their experimental bot-grader through a series of tests and found that it gave higher grades than humans did to Asian Americans’ open-ended responses on a reading comprehension test.
It’s also unclear why BERT sometimes treated Asian Americans differently, but the finding underscores how important it is to test these systems before deploying them in schools. Given the enthusiasm among teachers, though, I worry that this train has already left the station. In recent webinars, I’ve seen many teachers say in the chat window that they’re already using ChatGPT, Claude, and other AI-powered apps to grade assignments. That can save teachers time, but it can also hurt students.
This article about AI bias was written by Jill Barshay and produced by The Hechinger Report, an independent, nonprofit news organization focused on inequality and innovation in education.