Wesley Morris
2 articles
Abstract
The increasing adoption of automated essay scoring (AES) in high-stakes educational contexts necessitates careful examination of potential biases within these systems. This study investigates how the demographic composition of training data influences fairness in AES systems built from fine-tuned large language models (LLMs). Using the PERSUADE corpus of 26,000 student essays, we conducted a systematic analysis with demographically restricted training sets to isolate the impact of training data demographics on LLM-AES performance; each restricted training set comprised essays written by a single racial/ethnic group. Four variants of a Longformer-based AES were developed: one trained on demographically balanced data and three trained on demographically restricted datasets. An initial analysis of the human ratings indicated that demographic factors significantly predict human essay scores (marginal R² = 0.125), a pattern paralleled in national writing assessment data. LLM-AES systems trained on demographically restricted data exhibited small systematic biases (marginal R² = 0.043). However, the LLM trained on balanced data showed minimal demographic bias, suggesting that representative training data can effectively prevent amplification of demographic disparities beyond those present in human ratings. These results highlight both the importance and the limitations of training data diversity in achieving fair assessment outcomes.
Highlights:
• Demographics explained 12.5% of the variance in human essay ratings.
• Demographically restricted training sets isolate training-data bias.
• Balanced training data minimized LLM-AES bias across demographic groups.
• LLM-AES models trained on demographically restricted data showed more bias.
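The restricted and balanced training-set construction described in the abstract can be sketched as a simple partitioning step. This is a minimal illustration, not the study's actual pipeline: the record fields (`essay`, `score`, `group`) and group labels are assumptions standing in for the PERSUADE corpus schema.

```python
import random
from collections import defaultdict

def build_training_sets(records, groups, n_per_group, seed=0):
    """Partition scored-essay records into one demographically balanced set
    and one demographically restricted set per group.

    records: list of dicts with keys 'essay', 'score', 'group'
             (an assumed schema, for illustration only).
    Returns (balanced, restricted), where restricted maps each group label
    to the essays written by that group alone.
    """
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec["group"]].append(rec)

    # Restricted sets: every essay in a set comes from a single group.
    restricted = {g: list(by_group[g]) for g in groups}

    # Balanced set: sample the same number of essays from each group.
    balanced = []
    for g in groups:
        balanced.extend(rng.sample(by_group[g], n_per_group))
    rng.shuffle(balanced)
    return balanced, restricted
```

Each of the four resulting sets could then be used to fine-tune one variant of the scoring model, keeping everything else about training fixed so that any performance gap is attributable to the demographics of the training data.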
Abstract
Many linguistic studies of writing assume a single linear relationship between linguistic features in a text and human judgments of writing quality. However, writing quality may be better understood as a complex latent construct that can be realized in a number of different ways through distinct linguistic profiles of high-quality writing, as shown in Crossley et al. (2014). This study builds on the exploratory study reported by Crossley et al. by analyzing a representative corpus of 4,170 highly rated persuasive essays written by secondary-school students. The study uses natural language processing tools to derive quantitative representations of the linguistic features found in the texts. These linguistic features inform a k-means cluster analysis, which indicates that a four-cluster solution best fits the data. By examining the indices most and least distinctive of each cluster, the study identifies a structured writing style, a conversational writing style, a reportive writing style, and an academic writing style. The findings support the notion that writers can employ a variety of writing profiles to successfully write an argumentative essay.
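The clustering step can be illustrated with a minimal standard-library k-means sketch. The feature vectors and k = 4 below are placeholders for the NLP-derived linguistic indices the study actually uses; this is an illustrative implementation, not the authors' code.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: repeatedly assign each feature vector to its
    nearest centroid, then recompute centroids, until labels stabilize.

    points: list of equal-length numeric tuples (one per essay, e.g. the
    linguistic indices for that essay).  Returns (labels, centroids).
    """
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: index of the nearest centroid for each point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return labels, centroids
```

In practice one would also compare cluster counts (e.g. via silhouette or elbow criteria) before settling on a four-cluster solution, as the study does when selecting the best-fitting profile.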