Assessing fairness in finetuned scoring models with demographically restricted training data

Langdon Holmes; Wesley Morris; Scott Crossley; Joon Suh Choi

doi:10.1016/j.asw.2026.101032

Assessing Writing, Apr 2026, Open Access

Langdon Holmes (Vanderbilt University); Wesley Morris (Vanderbilt University); Scott Crossley; Joon Suh Choi (Vanderbilt University)

Abstract

The increasing adoption of automated essay scoring (AES) in high-stakes educational contexts necessitates careful examination of potential biases within the systems. This study investigates how the demographic composition of training data influences fairness in AES systems developed from finetuned large language models (LLMs). Using the PERSUADE corpus of 26,000 student essays, we conducted a systematic analysis using demographically restricted training sets to isolate the impact of training data demographics on LLM-AES performance. Each demographically restricted training set comprised essays written by one racial/ethnic group. Four variants of a Longformer-based AES were developed: one trained on demographically balanced data and three trained on demographically restricted datasets. An initial analysis of the human ratings indicated that demographic factors significantly predict human essay scores (marginal R² = 0.125), a pattern that is paralleled in national writing assessment data. LLM-AES systems trained on demographically restricted data exhibited small systematic biases (marginal R² = 0.043). However, the LLM trained on balanced data showed minimal demographic bias, suggesting that representative training data can effectively prevent amplification of demographic disparities beyond those present in human ratings. These results highlight both the importance and limitations of training data diversity in achieving fair assessment outcomes.

• 12.5% of variance in human essay ratings was explained by demographics.
• We construct demographically restricted training sets to isolate bias.
• Balanced training data minimized LLM-AES bias across demographic groups.
• LLM-AES trained on demographically restricted data showed more bias.
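
The demographically restricted design described in the abstract (one balanced training set plus one set per racial/ethnic group) can be illustrated with a minimal Python sketch. This is not the authors' code. The field name `race_ethnicity`, the function name `build_training_sets`, the equal set sizes, and the definition of "balanced" as an equal share from each group are all assumptions made for illustration; the abstract does not specify how the sets were sampled.

```python
import random
from collections import defaultdict


def build_training_sets(essays, group_key="race_ethnicity", n_per_set=None, seed=0):
    """Build one restricted training set per group plus one balanced set.

    Hypothetical helper: `essays` is a list of dicts, and `group_key` names
    the (assumed) field holding each writer's demographic group.
    """
    rng = random.Random(seed)

    # Partition essays by demographic group.
    by_group = defaultdict(list)
    for essay in essays:
        by_group[essay[group_key]].append(essay)

    groups = sorted(by_group)
    smallest = min(len(by_group[g]) for g in groups)

    # Assumption: every set has the same size, so that differences between
    # models reflect training demographics rather than training-set size.
    if n_per_set is None:
        n_per_set = smallest - smallest % len(groups)
    if n_per_set > smallest or n_per_set % len(groups) != 0:
        raise ValueError("n_per_set must fit the smallest group and divide evenly")

    # Restricted sets: essays written by a single group only.
    sets = {g: rng.sample(by_group[g], n_per_set) for g in groups}

    # Balanced set (assumed definition): an equal share from every group.
    per_group = n_per_set // len(groups)
    balanced = []
    for g in groups:
        balanced.extend(rng.sample(by_group[g], per_group))
    rng.shuffle(balanced)
    sets["balanced"] = balanced
    return sets
```

Each resulting set would then be used to finetune a separate scoring model, and the models' predictions compared across demographic groups on a shared held-out test set.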

Journal: Assessing Writing
Published: 2026-04-01
DOI: 10.1016/j.asw.2026.101032
Open Access: Hybrid
Topics: assessment; artificial intelligence; race and writing
