GenAI and human assessments of L2 Chinese writing: Interrater reliability and rater bias

Yuan Lu (Indiana University Bloomington); Xiaoying Liles (Indiana University Bloomington); Xi Ma (Asian University)
Journal: Assessing Writing
Published: 2025-10-01
DOI: 10.1016/j.asw.2025.100989
Open Access: Closed

Citation Context

Cited by in this index (0)

No articles in this index cite this work.

References (60) · 3 in this index

  1. Rasch rating-scale model
    Handbook of item response theory: Volume one Models
  2. Writing performance and discourse organization in L2 Chinese: A longitudinal case study
    Journal of Second Language Writing
  3. ChatGPT or a silent everywhere helper: A survey of large language models
    arXiv preprint arXiv:2503.17403
  4. American Council on the Teaching of Foreign Languages (ACTFL). (2024). ACTFL Proficiency Guidelines 2024. Ret…
  5. Analyzing bias in large language model solutions for assisted writing feedback tools: Les…
    Proceedings of the 18th workshop on innovative use of NLP for building educational applications (BEA 2023)
  1. Assessing Writing
  2. Applying the Rasch model: Fundamental measurement in the human sciences
  3. AI versus human effectiveness in essay evaluation
    Discover Education  
  4. Comparison of human and machine scoring of essays: Differences by gender, ethnicity, and …
    Applied Measurement in Education  
  5. Bui, N.M., & Barrot, J.S.(2025). ChatGPT as an automated essay scoring tool in the writing classrooms: how it…
  6. Building a validity argument for the test of English as a foreign language
  7. Beyond the design of automated writing evaluation: Pedagogical practices and perceived le…
    Language Learning & Technology
  8. Examining rater effects in TestDaF writing and speaking performance assessments: A many-f…
    Language Assessment Quarterly  
  9. Operational rater types in writing assessment: Linking rater cognition to rater behavior
    Language Assessment Quarterly  
  10. Many-facet Rasch measurement: Implications for rater-mediated language assessment
    Quantitative data analysis for language assessment volume I
  11. Assessing second-language academic writing: AI vs. human raters
    Journal of Educational Technology and Online Learning  
  12. Validity arguments for automated essay scoring of young students’ writing traits
    Language Assessment Quarterly  
  13. Explicating validity
    Assessment in Education: Principles, Policy & Practice
  14. Automated writing evaluation use in second language classrooms: A research synthesis
    System  
  15. ChatGPT for writing evaluation: Examining the accuracy and reliability of AI-generated sc…
    Exploring artificial intelligence in applied linguistics
  16. Assessing Writing
  17. Topic chains in Chinese discourse
    Discourse Processes  
  18. Parataxis or hypotaxis? Choices of taxis in Chinese–English translation
    Lingua  
  19. The role of automated writing evaluation holistic scores in the ESL classroom
    System  
  20. The development and maintenance of rating quality in performance writing assessment: A lo…
    Language Testing  
  21. What do infit and outfit, mean-square and standardized mean?
    Rasch Measurement Transactions
  22. A user’s guide to facets Rasch-model computer programs
    Program Manual 3.87.0
  23. DeepSeek-V3 technical report
    arXiv preprint arXiv:2412.19437
  24. Investigating the application of automated writing evaluation to Chinese undergraduate En…
    CALICO Journal  
  25. Exploring the potential of using an AI language model for automated essay scoring
    Research Methods in Applied Linguistics  
  26. Testing the viability of ChatGPT as a companion in L2 writing accuracy assessment
    Research Methods in Applied Linguistics  
  27. OpenAI (2025). Introducing GPT-4.5. [Released on February 27, 2025]. Retrieved from: 〈https://openai.com/rese…
  28. Large language models and automated essay scoring of English language learner writing: In…
    Computers and Education: Artificial Intelligence
  29. The imminence of grading essays by computer
    Phi Delta Kappan
  30. Exploring the potential of ChatGPT in assessing L2 writing accuracy for research purposes
    Research Methods in Applied Linguistics  
  31. How big is “big”? Interpreting effect sizes in L2 research
    Language Learning  
  32. Evaluating China’s Automated Essay Scoring System iWrite
    Journal of Educational Computing Research  
  33. Comparative Analysis based on DeepSeek, ChatGPT, and Google Gemini: Features, techniques,…
    arXiv preprint arXiv:2503.04783
  34. An automated essay scoring systems: A systematic literature review
    Artificial Intelligence Review  
  35. Understanding mean score differences between the e-rater® automated scoring engine and hu…
    ETS Research Report Series  
  36. Rise of the machines? The evolving role of Artificial Intelligence (AI) technologies in h…
    London Review of Education  
  37. Rater bias patterns in an EFL writing assessment
    Language Testing  
  38. Introduction to automated essay evaluation
    The Routledge international handbook of automated essay evaluation
  39. A systematic review of automated writing evaluation systems
    Education and Information Technologies  
  40. Exploratory study on the potential of ChatGPT as a rater of second language writing
    Education and Information Technologies  
  41. More efficient processes for creating automated essay scoring frameworks: A demonstration…
    Language Testing  
  42. Singh, S., Bansal, S., Saddik, A.E., & Saini, M.(2025). From ChatGPT to DeepSeek AI: A comprehensive analysis…
  43. Chinese: A linguistic introduction
  44. Can AI provide useful holistic essay scoring?
    Computers and Education: Artificial Intelligence
  45. AI-writing tools in education: If you can’t beat them, join them
    Journal of China Computer-Assisted Language Learning  
  46. Artificial intelligence as an automated essay scoring tool: A focus on ChatGPT
    International Journal of Assessment Tools in Education  
  47. Assessing Writing
  48. Raters’ L2 background as a potential source of bias in rating oral performance
    Language Testing  
  49. Human-AI collaborative essay scoring: A dual-process framework with LLMs
    Proceedings of the 15th International Learning Analytics and Knowledge Conference
  50. An application of many-facet Rasch measurement to evaluate automated essay scoring: A cas…
    Research Methods in Applied Linguistics  
  51. Exploring potential biases in GPT-4o’s ratings of English language learners’ essays
    Language Testing  
  52. Rating short L2 essays on the CEFR scale with GPT-4
    Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications
  53. Testing writing in Chinese as a second language: An overview of research
    Current trends in language testing in the Pacific Rim and the Middle East: Policies, analysis, and diagnosis
  54. Chinese intermediate English learners outdid ChatGPT in deep cohesion: Evidence from Engl…
    System  
  55. A systematic review of artificial intelligence in language education: Current status and …
    Language Learning & Technology