GenAI and human assessments of L2 Chinese writing: Interrater reliability and rater bias

Yuan Lu; Xiaoying Liles; Xi Ma

doi:10.1016/j.asw.2025.100989

Assessing Writing Oct 2025

Yuan Lu Indiana University Bloomington ; Xiaoying Liles Indiana University Bloomington ; Xi Ma Asian University

Journal: Assessing Writing
Published: 2025-10-01
DOI: 10.1016/j.asw.2025.100989
CompPile
Open Access: Closed

Citation Context

Cited by in this index (0)

No articles in this index cite this work.

References (60) · 3 in this index

Andrich (2016)

Rasch rating-scale model

Handbook of item response theory: Volume one Models
Authors (2024)

Writing performance and discourse organization in L2 Chinese: A longitudinal case study

Journal of Second Language Writing
Akhtarshenas (2025)

ChatGPT or a silent everywhere helper: A survey of large language models

arXiv preprint arXiv:2503.17403
American Council on the Teaching of Foreign Languages (ACTFL). (2024). ACTFL Proficiency Guidelines 2024. Ret…
Baffour (2023)

Analyzing bias in large language model solutions for assisted writing feedback tools: Les…

Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023)

Barrot (2023)

Using ChatGPT for second language writing: Pitfalls and potentials

Assessing Writing
Bond (2020)

Applying the Rasch model: Fundamental measurement in the human sciences
Bouziane (2024)

AI versus human effectiveness in essay evaluation

Discover Education
Bridgeman (2012)

Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and …

Applied Measurement in Education
Bui, N. M., & Barrot, J. S. (2025). ChatGPT as an automated essay scoring tool in the writing classrooms: how it…
Chapelle (2008)

Building a validity argument for the test of English as a foreign language
Chen (2008)

Beyond the design of automated writing evaluation: Pedagogical practices and perceived le…

Language Learning & Technology
Eckes (2005)

Examining rater effects in TestDaF writing and speaking performance assessments: A many-f…

Language Assessment Quarterly
Eckes (2012)

Operational rater types in writing assessment: Linking rater cognition to rater behavior

Language Assessment Quarterly
Eckes (2019)

Many-facet Rasch measurement: Implications for rater-mediated language assessment

Quantitative data analysis for language assessment volume I
Geçkin (2023)

Assessing second-language academic writing: AI vs. human raters

Journal of Educational Technology and Online Learning
Hannah (2023)

Validity arguments for automated essay scoring of young students’ writing traits

Language Assessment Quarterly
Kane (2016)

Explicating validity

Assessment in Education: Principles, Policy & Practice
Karatay (2024)

Automated writing evaluation use in second language classrooms: A research synthesis

System
Kim (2024)

ChatGPT for writing evaluation: Examining the accuracy and reliability of AI-generated sc…

Exploring artificial intelligence in applied linguistics
Klobucar et al. (2013)

Automated scoring in context: Rapid assessment for placed students

Assessing Writing
Li (2004)

Topic chains in Chinese discourse

Discourse Processes
Li (2021)

Parataxis or hypotaxis? Choices of taxis in Chinese–English translation

Lingua
Li (2014)

The role of automated writing evaluation holistic scores in the ESL classroom

System
Lim (2011)

The development and maintenance of rating quality in performance writing assessment: A lo…

Language Testing
Linacre (2002)

What do infit and outfit, mean-square and standardized mean?

Rasch Measurement Transactions
Linacre (2023)

A user’s guide to facets Rasch-model computer programs

Program Manual 3.87.0
Liu (2024)

DeepSeek-V3 technical report

arXiv preprint arXiv:2412.19437
Liu (2016)

Investigating the application of automated writing evaluation to Chinese undergraduate En…

CALICO Journal
Mizumoto (2023)

Exploring the potential of using an AI language model for automated essay scoring

Research Methods in Applied Linguistics
Mizumoto (2024)

Testing the viability of ChatGPT as a companion in L2 writing accuracy assessment

Research Methods in Applied Linguistics
OpenAI (2025). Introducing GPT-4.5. [Released on February 27, 2025]. Retrieved from: 〈https://openai.com/rese…
Pack (2024)

Large language models and automated essay scoring of English language learner writing: In…

Computers and Education: Artificial Intelligence
Page (1966)

The imminence of grading essays by computer

Phi Delta Kappan
Pfau (2023)

Exploring the potential of ChatGPT in assessing L2 writing accuracy for research purposes

Research Methods in Applied Linguistics
Plonsky (2014)

How big is “big”? Interpreting effect sizes in L2 research

Language Learning
Qian (2020)

Evaluating China’s Automated Essay Scoring System iWrite

Journal of Educational Computing Research
Rahman (2025)

Comparative Analysis based on DeepSeek, ChatGPT, and Google Gemini: Features, techniques,…

arXiv preprint arXiv:2503.04783
Ramesh (2022)

An automated essay scoring systems: A systematic literature review

Artificial Intelligence Review
Ramineni (2018)

Understanding mean score differences between the e-rater® automated scoring engine and hu…

ETS Research Report Series
Richardson (2021)

Rise of the machines? The evolving role of Artificial Intelligence (AI) technologies in h…

London Review of Education
Schaefer (2008)

Rater bias patterns in an EFL writing assessment

Language Testing
Shermis (2024)

Introduction to automated essay evaluation

The Routledge international handbook of automated essay evaluation
Shi (2023)

A systematic review of automated writing evaluation systems

Education and Information Technologies
Shin (2024)

Exploratory study on the potential of ChatGPT as a rater of second language writing

Education and Information Technologies
Shin (2020)

More efficient processes for creating automated essay scoring frameworks: A demonstration…

Language Testing
Singh, S., Bansal, S., Saddik, A. E., & Saini, M. (2025). From ChatGPT to DeepSeek AI: A comprehensive analysis…
Sun (2006)

Chinese: A linguistic introduction
Tate (2024)

Can AI provide useful holistic essay scoring?

Computers and Education: Artificial Intelligence
Tseng (2023)

AI-writing tools in education: If you can’t beat them, join them

Journal of China Computer-Assisted Language Learning
Uyar (2025)

Artificial intelligence as an automated essay scoring tool: A focus on ChatGPT

International Journal of Assessment Tools in Education
Wilson et al. (2024)

Validity of automated essay scores for elementary-age English language learners…

Assessing Writing
Winke (2013)

Raters’ L2 background as a potential source of bias in rating oral performance

Language Testing
Xiao (2025)

Human-AI collaborative essay scoring: A dual-process framework with LLMs

Proceedings of the 15th International Learning Analytics and Knowledge Conference
Yamashita (2024)

An application of many-facet Rasch measurement to evaluate automated essay scoring: A cas…

Research Methods in Applied Linguistics
Yamashita (2025)

Exploring potential biases in GPT-4o’s ratings of English language learners’ essays

Language Testing
Yancey (2023)

Rating short L2 essays on the CEFR scale with GPT-4

Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications
Zhang (2016)

Testing writing in Chinese as a second language: An overview of research

Current trends in language testing in the Pacific Rim and the Middle East: Policies, analysis, and diagnosis
Zhou (2023)

Chinese intermediate English learners outdid ChatGPT in deep cohesion: Evidence from Engl…

System
Zhu (2025)

A systematic review of artificial intelligence in language education: Current status and …

Language Learning & Technology

CrossRef global citation count: 0