Assessing fairness in finetuned scoring models with demographically restricted training data

Langdon Holmes Vanderbilt University ; Wesley Morris Vanderbilt University ; Scott Crossley ; Joon Suh Choi Vanderbilt University

Abstract

The increasing adoption of automated essay scoring (AES) in high-stakes educational contexts necessitates careful examination of potential biases within the systems. This study investigates how the demographic composition of training data influences fairness in AES systems developed from finetuned large language models (LLMs). Using the PERSUADE corpus of 26,000 student essays, we conducted a systematic analysis using demographically restricted training sets to isolate the impact of training data demographics on LLM-AES performance. Each demographically restricted training set comprised essays written by one racial/ethnic group. Four variants of a Longformer-based AES were developed: one trained on demographically balanced data and three trained on demographically restricted datasets. An initial analysis of the human ratings indicated that demographic factors significantly predict human essay scores (marginal R² = 0.125), a pattern that is paralleled in national writing assessment data. LLM-AES systems trained on demographically restricted data exhibited small systematic biases (marginal R² = 0.043). However, the LLM trained on balanced data showed minimal demographic bias, suggesting that representative training data can effectively prevent amplification of demographic disparities beyond those present in human ratings. These results highlight both the importance and limitations of training data diversity in achieving fair assessment outcomes.

• 12.5% of variance in human essay ratings was explained by demographics.
• We construct demographically restricted training sets to isolate bias.
• Balanced training data minimized LLM-AES bias across demographic groups.
• LLM-AES trained on demographically restricted data showed more bias.
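The training-set design described in the abstract (one balanced set plus one restricted set per racial/ethnic group) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the study: the field name `race_ethnicity`, the helper names, the equal-size constraint, and the choice to balance by sampling the same number of essays from each group.

```python
import random
from collections import defaultdict


def split_by_group(essays, group_key="race_ethnicity"):
    """Group essay records (dicts) by a demographic field (hypothetical name)."""
    groups = defaultdict(list)
    for essay in essays:
        groups[essay[group_key]].append(essay)
    return dict(groups)


def build_training_sets(essays, size, group_key="race_ethnicity", seed=0):
    """Return one restricted set per group plus one balanced set.

    Assumption: each restricted set holds `size` essays from a single group,
    and the balanced set draws `size // n_groups` essays from every group.
    """
    rng = random.Random(seed)
    groups = split_by_group(essays, group_key)
    if any(len(members) < size for members in groups.values()):
        raise ValueError("every group needs at least `size` essays")

    per_group = size // len(groups)
    sets = {}
    balanced = []
    for name in sorted(groups):
        shuffled = groups[name][:]  # copy so the input order is left untouched
        rng.shuffle(shuffled)
        sets[f"restricted_{name}"] = shuffled[:size]
        balanced.extend(shuffled[:per_group])
    sets["balanced"] = balanced
    return sets
```

Holding the set size constant across variants, as this sketch does, keeps differences in downstream scoring behaviour attributable to demographic composition rather than to the amount of training data. Whether the study applied the same constraint is not stated in the abstract.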

Journal: Assessing Writing
Published: 2026-04-01
DOI: 10.1016/j.asw.2026.101032
Open Access: OA PDF, Hybrid
