Developing an e-rater Advisory to Detect Babel-generated Essays
Abstract
Background: It is important for developers of automated scoring systems to ensure that their systems are as fair and valid as possible. This commitment means evaluating the performance of these systems in light of construct-irrelevant response strategies. The enhancement of systems to detect and deal with these kinds of strategies is often an iterative process, whereby as new strategies come to light they need to be evaluated and effective mechanisms built into the automated scoring systems to handle them. In this paper, we focus on the Babel system, which automatically generates semantically incohesive essays. We expect that these essays may unfairly receive high scores from automated scoring engines despite essentially being nonsense.

Literature Review: We discuss literature related to gaming of automated scoring systems. One reason that Babel essays are so easy to identify as nonsense by human readers is that they lack any semantic cohesion. Therefore, we also discuss some literature related to cohesion and detecting semantic cohesion.

Research Questions: This study addressed three research questions:
(1) Can we automatically detect essays generated by the Babel system?
(2) Can we integrate the detection of Babel-generated essays into an operational automated essay scoring system while making sure not to flag valid student responses?
(3) Does a general approach for detecting semantically incohesive essays also detect Babel-generated essays?

Research Methodology: This article describes the creation of two corpora necessary to address the research questions: (1) a corpus of Babel-generated essays and (2) a corresponding corpus of good-faith essays. We built a classifier to distinguish Babel-generated essays from good-faith essays and investigated whether the classifier can be integrated into an automated scoring engine without adverse effects. We also developed a measure of lexical-semantic cohesion and examined its distribution in Babel and in good-faith essays.

Results: We found that the classifier built on Babel-generated essays and good-faith essays and using features from the automated scoring engine can distinguish the Babel-generated essays from the good-faith ones with 100% accuracy. We also found that if we integrated this classifier into the automated scoring engine it flagged very few responses that were submitted as part of operational submissions (76 of 434,656). The responses that were flagged had previously been assigned a score of Null (non-scorable) or a score of 1 by human experts. The measure of lexical-semantic cohesion shows promise in being able to distinguish Babel-generated essays from good-faith essays.

Conclusions: Our results show that it is possible to detect the kind of gaming strategy illustrated by the Babel system and add it to an automated scoring engine without adverse effects on essays seen during real high-stakes tests. We also show that a measure of lexical-semantic cohesion can separate Babel-generated essays from good-faith essays to a certain degree, depending on task. This points to future work that would generalize the capability to detect semantic incoherence in essays.

Directions for Further Research: Babel-generated essays can be identified and flagged by an automated scoring system without any adverse effects on a large set of good-faith essays. However, this is just one type of gaming strategy. It is important for developers of automated scoring systems to continue to be diligent about expanding the construct coverage of their systems in order to prevent weaknesses that can be exploited by tools such as Babel. It is also important to focus on the underlying linguistic reasons that lead to nonsense sentences. Successful identification of such nonsense would lead to improved automated scoring and feedback.
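
To put the operational flag count from the Results in perspective, the short calculation below converts "76 of 434,656" into a rate. It is only a worked-arithmetic sketch: the two counts come from the abstract, while the variable names are illustrative and not taken from the paper.

```python
# Counts reported in the abstract's Results section.
flagged = 76        # operational responses flagged by the Babel classifier
total = 434_656     # operational responses examined

flag_rate = flagged / total   # fraction of responses flagged
one_in = total / flagged      # responses per single flag, on average

print(f"flag rate: {flag_rate:.4%}")
print(f"about 1 in {one_in:,.0f} responses")
```

Under these counts, fewer than two in every ten thousand operational responses were flagged.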
- Journal: Journal of Writing Analytics
- Published: 2018-01-01
- DOI: 10.37514/jwa-j.2018.2.1.08
- Open Access: Gold (OA PDF)