Assessing Writing

31 articles
Topic: artificial intelligence

July 2026

  1. Investigating the impact of ChatGPT-assisted self-assessment on college students' writing development: Insights from diverse linguistic backgrounds
    doi:10.1016/j.asw.2026.101061
  2. LAWE-CL2: Multi-agent LLM-based automated writing evaluation system integrating linguistic features with fine-tuning for Chinese L2 writing assessment
    doi:10.1016/j.asw.2026.101051
  3. Anchor is the key: Toward accessible automated essay scoring with large language model through prompting
    doi:10.1016/j.asw.2026.101053
  4. Educator perspectives on automated writing scoring and feedback for young language learners: Applying a fairness and justice lens
    doi:10.1016/j.asw.2026.101050
  5. Accuracy and fairness of generative AI in automated essay scoring: Comparing GPT-4o, feature-based models, and human raters
    doi:10.1016/j.asw.2026.101047

April 2026

  1. ChatGPT feedback and emotional engagement in L2 writing: A control-value theory perspective using Q-methodology
    doi:10.1016/j.asw.2026.101045
  2. Generative artificial intelligence for automated writing evaluation: A systematic review of trends, efficacy, and challenges
    doi:10.1016/j.asw.2026.101041
  3. Associations of adolescents’ argumentative writing scores and growth when evaluated by different human raters and artificial intelligence models
    doi:10.1016/j.asw.2026.101015
  4. Developing students’ feedback literacy in disciplinary academic writing through generative artificial intelligence
    doi:10.1016/j.asw.2026.101030
  5. Assessing fairness in finetuned scoring models with demographically restricted training data
    Abstract

    The increasing adoption of automated essay scoring (AES) in high-stakes educational contexts necessitates careful examination of potential biases within these systems. This study investigates how the demographic composition of training data influences fairness in AES systems developed from finetuned large language models (LLMs). Using the PERSUADE corpus of 26,000 student essays, we conducted a systematic analysis with demographically restricted training sets to isolate the impact of training data demographics on LLM-AES performance. Each demographically restricted training set comprised essays written by a single racial/ethnic group. Four variants of a Longformer-based AES system were developed: one trained on demographically balanced data and three trained on demographically restricted datasets. An initial analysis of the human ratings indicated that demographic factors significantly predict human essay scores (marginal R² = 0.125), a pattern that is paralleled in national writing assessment data. LLM-AES systems trained on demographically restricted data exhibited small systematic biases (marginal R² = 0.043). However, the LLM trained on balanced data showed minimal demographic bias, suggesting that representative training data can effectively prevent amplification of demographic disparities beyond those present in human ratings. These results highlight both the importance and the limitations of training data diversity in achieving fair assessment outcomes.
    • 12.5% of the variance in human essay ratings was explained by demographics.
    • We construct demographically restricted training sets to isolate bias.
    • Balanced training data minimized LLM-AES bias across demographic groups.
    • LLM-AES trained on demographically restricted data showed more bias.
    (An illustrative fine-tuning sketch for this kind of restricted-data scorer appears after this month's list.)

    doi:10.1016/j.asw.2026.101032
  6. The impact of ChatGPT’s feedback on L2 Chinese learners’ writing outcome, confidence, and emotions: A mixed-method quasi-experimental study
    doi:10.1016/j.asw.2026.101027
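
    The fairness study above (item 5) fine-tunes Longformer variants on demographically restricted subsets of the PERSUADE corpus. The abstract does not give the training setup, so the following is only a minimal sketch of how such a restricted-data regression scorer could be fine-tuned with the Hugging Face transformers library; the file name, column names, subset label, and hyperparameters are assumptions, not the authors' configuration.

      # Hedged sketch: fine-tune a Longformer regression head as an essay scorer
      # on one demographically restricted subset. Column names, file name, the
      # subset label, and all hyperparameters are assumptions for illustration.
      import pandas as pd
      import torch
      from torch.utils.data import Dataset
      from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                                Trainer, TrainingArguments)

      MODEL = "allenai/longformer-base-4096"

      class EssayDataset(Dataset):
          def __init__(self, texts, scores, tokenizer):
              self.enc = tokenizer(list(texts), truncation=True,
                                   max_length=4096, padding="max_length")
              self.scores = list(scores)
          def __len__(self):
              return len(self.scores)
          def __getitem__(self, i):
              item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
              # Float labels + num_labels=1 make transformers use an MSE (regression) loss.
              item["labels"] = torch.tensor(self.scores[i], dtype=torch.float)
              return item

      df = pd.read_csv("persuade_corpus.csv")                 # hypothetical export of the corpus
      subset = df[df["race_ethnicity"] == "Hispanic/Latino"]  # one demographically restricted training set

      tokenizer = AutoTokenizer.from_pretrained(MODEL)
      model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=1)

      train_ds = EssayDataset(subset["full_text"], subset["holistic_score"], tokenizer)
      args = TrainingArguments(output_dir="aes_restricted", num_train_epochs=3,
                               per_device_train_batch_size=2, learning_rate=2e-5)
      Trainer(model=model, args=args, train_dataset=train_ds).train()

    Comparing this variant's score errors across demographic groups with those of a model trained on balanced data is one way to reproduce the kind of bias contrast the abstract reports.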

January 2026

  1. Generative artificial intelligence for automated essay scoring: Exploring teacher agency through an ecological perspective
    Abstract

    Generative artificial intelligence (AI) is increasingly used in writing assessment, particularly for automated essay scoring (AES) and for generating formative feedback within automated writing evaluation (AWE). While AI-driven AES enhances efficiency and consistency, concerns regarding accuracy, bias, and ethical implications raise critical questions about its role in assessment. This paper examines the impact of generative AI on teacher agency through an ecological perspective, which considers agency as shaped by personal, institutional, and sociocultural factors. The analysis highlights the need for teachers to critically mediate AI-generated scores and feedback to align them with pedagogical goals, ensuring AI functions as an assistive tool rather than a determinant of assessment outcomes. Although AI can streamline assessment, over-reliance risks diminishing teachers’ evaluative expertise and reinforcing biases embedded in AI systems. Ethical concerns, including transparency, data privacy, and fairness, further complicate its adoption. To address these challenges, this paper proposes a framework for responsible AI integration that prioritizes bias mitigation, data security, and teacher-driven decision-making. The discussion concludes with pedagogical implications and directions for future research on AI-assisted writing assessment.
    • Teachers can actively mediate AI-generated scores to maintain agency.
    • Dependence on AES may weaken teachers’ evaluative skills.
    • Bias, data privacy, and AI opacity can undermine teachers’ decision-making.
    • AI literacy and hybrid assessment models can promote teacher autonomy.
    • A framework for protecting teacher agency in generative AI–based AWE is presented.

    doi:10.1016/j.asw.2025.100990
  2. Extracting interpretable writing traits from a large language model
    Abstract

    Large language models (LLMs) are increasingly used to support automated writing evaluation (AWE), both for scoring and for feedback. However, LLMs present challenges to interpretability, making it hard to evaluate the construct validity of scoring and feedback models. BIOT (best interpretable orthogonal transformations) is a new method of analysis that makes the dimensions of an embedding interpretable by aligning them with external predictors. It was originally developed to improve the interpretability of multidimensional scaling models. However, this paper shows that BIOT can be used to align LLM embeddings with an interpretable writing trait model, developed using multidimensional analysis of classical NLP features, that measures latent dimensions of writing style and writing quality. This makes it possible to determine whether an AWE model built using an LLM is aligned with known (and construct-relevant) dimensions of textual variation, supporting construct validity. Specifically, we examine the alignment between the hidden layers of DeBERTa, a small LLM that has been shown to be useful for a variety of natural language processing applications, and a writing trait model developed through factor analysis of classical features used in existing AWE models. Specific dimensions of the transformed DeBERTa layers are strongly correlated with these classical factors. When the transformation matrix derived using BIOT is applied to token vectors, it is also possible to visualize which tokens in the original text contributed to high or low scores on a specific dimension.
    • Large language models (LLMs) are increasingly used to support automated writing evaluation (AWE).
    • LLMs present challenges to interpretability, making it hard to evaluate the construct validity of scoring and feedback models.
    • BIOT is a new interpretation method that aligns embedding dimensions with external predictors.
    • Specifically, BIOT can be used to align LLM embeddings with classical NLP measures of aspects of style and writing quality.
    • This demonstrates a general method to determine whether an LLM latently represents construct-relevant dimensions.
    (An illustrative embedding-alignment sketch appears after this list.)

    doi:10.1016/j.asw.2025.101011
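
    The BIOT paper above (item 2) aligns DeBERTa hidden-layer embeddings with an externally defined writing-trait model. The abstract does not spell out the algorithm, so the following is only a rough stand-in: it mean-pools DeBERTa hidden states and rotates them toward hypothetical trait scores with a cross-covariance SVD (a PLS-style alignment), not the authors' BIOT procedure. The checkpoint, layer choice, placeholder essays, and trait matrix are all assumptions.

      # Hedged sketch: rotate LLM embedding dimensions toward external trait
      # predictors. This is a simple cross-covariance SVD, NOT BIOT itself.
      import numpy as np
      import torch
      from transformers import AutoTokenizer, AutoModel

      tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-small")
      model = AutoModel.from_pretrained("microsoft/deberta-v3-small",
                                        output_hidden_states=True)

      def essay_embedding(text, layer=-1):
          """Mean-pooled hidden states of one DeBERTa layer for a single essay."""
          inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
          with torch.no_grad():
              hidden = model(**inputs).hidden_states[layer]   # (1, tokens, dim)
          return hidden.mean(dim=1).squeeze(0).numpy()

      # Placeholder data: a real analysis needs many essays plus human trait scores.
      essays = ["First sample essay ...", "Second sample essay ..."]
      E = np.vstack([essay_embedding(t) for t in essays])     # (n_essays, dim)
      T = np.array([[0.3, -1.2], [1.1, 0.4]])                 # hypothetical trait scores (n_essays, n_traits)

      # Standardize, then find orthonormal directions in embedding space whose
      # projections covary maximally with the external trait scores.
      Ez = (E - E.mean(0)) / (E.std(0) + 1e-8)
      Tz = (T - T.mean(0)) / (T.std(0) + 1e-8)
      U, s, Vt = np.linalg.svd(Ez.T @ Tz, full_matrices=False)  # cross-covariance SVD
      rotated = Ez @ U                                           # candidate interpretable dimensions

      # Correlate each rotated embedding dimension with each external trait.
      for j in range(U.shape[1]):
          corrs = [np.corrcoef(rotated[:, j], Tz[:, k])[0, 1] for k in range(Tz.shape[1])]
          print(f"rotated dim {j}:", [f"{c:.2f}" for c in corrs])

    Because U has orthonormal columns, it plays the role of the orthogonal transformation in the abstract's description, and the same matrix could in principle be applied to token vectors to see which tokens load on a given dimension.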

October 2025

  1. Investigating a customized generative AI chatbot for automated essay scoring in a disciplinary writing task
    doi:10.1016/j.asw.2025.100959
  2. Can generative AI figure out figurative language? The influence of idioms on essay scoring by ChatGPT, Gemini, and Deepseek
    doi:10.1016/j.asw.2025.100981
  3. Comparing GPT-based approaches in automated writing evaluation
    doi:10.1016/j.asw.2025.100961
  4. Using ChatGPT to score essays and short-form constructed responses
    doi:10.1016/j.asw.2025.100988
  5. Integrating move analysis and sentence reconstruction in automated writing evaluation for L2 academic writers
    doi:10.1016/j.asw.2025.100984

July 2025

  1. Using ChatGPT to facilitate vocabulary learning in continuation writing assessment tasks
    doi:10.1016/j.asw.2025.100952
  2. The impact of self-revision, machine translation, and ChatGPT on L2 writing: Raters’ assessments, linguistic complexity, and error correction
    doi:10.1016/j.asw.2025.100950

April 2025

  1. How L2 student writers engage with automated feedback: A longitudinal perspective
    doi:10.1016/j.asw.2025.100919

October 2024

  1. The impact of task duration on the scoring of independent writing responses of adult L2-English writers
    Abstract

    In writing assessment, there is an inherent tension between authenticity and practicality: tasks with longer durations may more closely reflect real-life writing processes but are less feasible to administer and score. Moreover, given a fixed total testing time, there is necessarily a trade-off between task duration and the number of tasks. Traditionally, high-stakes assessments have managed this trade-off by administering one or two writing tasks per test, allowing 20–40 minutes per task. However, research on second language (L2) English writing has not found longer task durations to significantly improve score validity or reliability. Importantly, very few studies have compared much shorter writing-task durations to more traditional allotments. To explore this issue, we asked adult L2-English test takers to respond to two writing prompts with either 5-minute or 20-minute time limits. Responses were then evaluated by expert human raters and an automated writing evaluation tool. Regardless of scoring method, scores from the short-duration tasks showed test-retest reliability and criterion validity as high as those from the long-duration tasks. As expected, longer task duration yielded higher scores, but regardless of duration, test takers demonstrated the entire spectrum of writing proficiency. Implications for writing assessment are discussed in relation to scoring practices and task design.
    • Longer writing tasks do not have higher test-retest reliability than shorter ones.
    • Longer writing tasks do not have higher criterion validity than shorter ones.
    • The impact of task duration is not mediated by scoring method (human or machine).

    doi:10.1016/j.asw.2024.100895

April 2024

  1. Visualizing formative feedback in statistics writing: An exploratory study of student motivation using DocuScope Write & Audit
    Abstract

    Recently, formative feedback in writing instruction has been supported by technologies generally referred to as automated writing evaluation (AWE) tools. However, such tools are limited in their capacity to handle specific disciplinary genres, and they have shown mixed results in improving student writing. We explore how technology-enhanced writing interventions can positively affect student attitudes toward and beliefs about writing, both reinforcing content knowledge and increasing student motivation. Using a student-facing text-visualization tool called Write & Audit, we hosted revision workshops for students (n = 30) in an introductory-level statistics course at a large North American university. The tool is designed to be flexible: instructors of various courses can create expectations and predefine genre-specific topics. In this way, students are offered non-evaluative formative feedback that redirects them to field-specific strategies. To gauge the usefulness of Write & Audit, we used a previously validated survey instrument designed to measure the construct model of student motivation (Ling et al., 2021). Our results show significant increases in student self-efficacy and in beliefs about the importance of content in successful writing. We contextualize these findings with data from three student think-aloud interviews, which demonstrate students’ metacognitive awareness while using the tool. Ultimately, this exploratory study is non-experimental, but it contributes a novel approach to automated formative feedback and confirms the promising potential of Write & Audit.

    doi:10.1016/j.asw.2024.100830

July 2023

  1. Collaborating with ChatGPT in argumentative writing classrooms
    doi:10.1016/j.asw.2023.100752
  2. Using ChatGPT for second language writing: Pitfalls and potentials
    doi:10.1016/j.asw.2023.100745

April 2022

  1. Automated writing evaluation: Does spelling and grammar feedback support high-quality writing and revision?
    doi:10.1016/j.asw.2022.100608

April 2020

  1. eRevis(ing): Students’ revision of text evidence use in an automated writing evaluation system
    doi:10.1016/j.asw.2020.100449

January 2020

  1. Engaging with automated writing evaluation (AWE) feedback on L2 writing: Student perceptions and revisions
    doi:10.1016/j.asw.2019.100439

July 2019

  1. Affordances and limitations of the ACCUPLACER automated writing placement tool
    doi:10.1016/j.asw.2019.06.004

April 2018

  1. Student engagement with teacher and automated feedback on L2 writing
    doi:10.1016/j.asw.2018.02.004

October 2017

  1. Design and evaluation of automated writing evaluation models: Relationships with writing in naturalistic settings
    doi:10.1016/j.asw.2017.10.001