Pinakes — Rhetoric & Composition

July 2026

Jul 2026

Investigating the impact of ChatGPT-assisted self-assessment on college students' writing development: Insights from diverse linguistic backgrounds

Hamidreza Moeiniasl

assessment artificial intelligence

doi:10.1016/j.asw.2026.101061
Jul 2026

LAWE-CL2: Multi-agent LLM-based automated writing evaluation system integrating linguistic features with fine-tuning for Chinese L2 writing assessment

Xuelin Wang; Qihao Yang; Yuxin Hao; Zhijun Wang; Sijia Guo

assessment artificial intelligence multilingual writers

doi:10.1016/j.asw.2026.101051
Jul 2026

Anchor is the key: Toward accessible automated essay scoring with large language model through prompting

Jaeyoon Choi; Tamara Tate; Mark Warschauer

artificial intelligence

doi:10.1016/j.asw.2026.101053
Jul 2026

Educator perspectives on automated writing scoring and feedback for young language learners: Applying a fairness and justice lens

Jieun Kim; Mark Chapman; Lynn Shafer Willner; Jason A. Kemp; Ahyoung Alicia Kim

artificial intelligence

doi:10.1016/j.asw.2026.101050
Jul 2026

Accuracy and fairness of generative AI in automated essay scoring: Comparing GPT-4o, feature-based models, and human raters

Yue Huang; Corey Palermo; Joshua Wilson

artificial intelligence

doi:10.1016/j.asw.2026.101047

April 2026

Apr 2026

ChatGPT feedback and emotional engagement in L2 writing: A control-value theory perspective using Q-methodology

Guangyuan Yao; Zhaoxia Liu

artificial intelligence multilingual writers

doi:10.1016/j.asw.2026.101045
Apr 2026

Generative artificial intelligence for automated writing evaluation: A systematic review of trends, efficacy, and challenges

Shadi I. Abudalfa; Jessie S. Barrot

assessment artificial intelligence book reviews

doi:10.1016/j.asw.2026.101041
Apr 2026

Associations of adolescents’ argumentative writing scores and growth when evaluated by different human raters and artificial intelligence models

Deborah K. Reed; Sterett Mercer

argument artificial intelligence

doi:10.1016/j.asw.2026.101015
Apr 2026

Developing students’ feedback literacy in disciplinary academic writing through generative artificial intelligence

Jianda Liu; Zihao Shi; Wanqing Li

artificial intelligence literacy studies

doi:10.1016/j.asw.2026.101030
Apr 2026 OA PDF

Assessing fairness in finetuned scoring models with demographically restricted training data

Langdon Holmes; Wesley Morris; Scott Crossley; Joon Suh Choi

Abstract

The increasing adoption of automated essay scoring (AES) in high-stakes educational contexts necessitates careful examination of potential biases within the systems. This study investigates how the demographic composition of training data influences fairness in AES systems developed from finetuned large language models (LLMs). Using the PERSUADE corpus of 26,000 student essays, we conducted a systematic analysis using demographically restricted training sets to isolate the impact of training data demographics on LLM-AES performance. Each demographically restricted training set comprised essays written by one racial/ethnic group. Four variants of a Longformer-based AES were developed: one trained on demographically balanced data and three trained on demographically restricted datasets. An initial analysis of the human ratings indicated that demographic factors significantly predict human essay scores (marginal R² = 0.125), a pattern that is paralleled in national writing assessment data. LLM-AES systems trained on demographically restricted data exhibited small systematic biases (marginal R² = 0.043). However, the LLM trained on balanced data showed minimal demographic bias, suggesting that representative training data can effectively prevent amplification of demographic disparities beyond those present in human ratings. These results highlight both the importance and limitations of training data diversity in achieving fair assessment outcomes. • 12.5% of variance in human essay ratings was explained by demographics. • We construct demographically restricted training sets to isolate bias. • Balanced training data minimized LLM-AES bias across demographic groups. • LLM-AES trained on demographically restricted data showed more bias.

assessment artificial intelligence race and writing

doi:10.1016/j.asw.2026.101032
Apr 2026

The impact of ChatGPT’s feedback on L2 Chinese learners’ writing outcome, confidence, and emotions: A mixed-method quasi-experimental study

Xian Zhao; Danping Wang

artificial intelligence

doi:10.1016/j.asw.2026.101027

January 2026

Jan 2026 OA PDF

Generative artificial intelligence for automated essay scoring: Exploring teacher agency through an ecological perspective

Jessie S. Barrot

Abstract

Generative artificial intelligence (AI) is increasingly used in writing assessment, particularly for automated essay scoring (AES) and for generating formative feedback within automated writing evaluation (AWE). While AI-driven AES enhances efficiency and consistency, concerns regarding accuracy, bias, and ethical implications raise critical questions about its role in assessment. This paper examines the impact of generative AI on teacher agency through an ecological perspective, which considers agency as shaped by personal, institutional, and sociocultural factors. The analysis highlights the need for teachers to critically mediate AI-generated scores and feedback to align them with pedagogical goals, ensuring AI functions as an assistive tool rather than a determinant of assessment outcomes. Although AI can streamline assessment, over-reliance risks diminishing teachers’ evaluative expertise and reinforcing biases embedded in AI systems. Ethical concerns, including transparency, data privacy, and fairness, further complicate its adoption. To address these challenges, this paper proposes a framework for responsible AI integration that prioritizes bias mitigation, data security, and teacher-driven decision-making. The discussion concludes with pedagogical implications and directions for future research on AI-assisted writing assessment. • Teachers can actively mediate AI-generated scores to maintain agency. • Dependence on AES may weaken teachers’ evaluative skills. • Bias, data privacy, and AI opacity can undermine teachers’ decision-making. • AI literacy and hybrid assessment models can promote teacher autonomy. • A framework for protecting teacher agency in generative AI–based AWE is presented.

writing pedagogy teacher development assessment artificial intelligence literacy studies

doi:10.1016/j.asw.2025.100990
Jan 2026 OA PDF

Extracting interpretable writing traits from a large language model

Paul Deane; Andrew Hoang

Abstract

Large language models (LLMs) are increasingly used to support automated writing evaluation (AWE), both for purposes of scoring and feedback. However, LLMs present challenges to interpretability, making it hard to evaluate the construct validity of scoring and feedback models. BIOT (best interpretable orthogonal transformations) is a new method of analysis that makes dimensions of an embedding interpretable by aligning them with external predictors. It was originally developed to improve the interpretability of multidimensional scaling models. However, this paper shows that BIOT can be used to align LLM embeddings with an interpretable writing trait model developed using multidimensional analysis of classical NLP features to measure latent dimensions of writing style and writing quality. This makes it possible to determine whether an AWE model built using an LLM is aligned with known (and construct-relevant) dimensions of textual variation, supporting construct validity. Specifically, we examine the alignment between the hidden layers of deBERTA, a small LLM that has been shown to be useful for a variety of natural language processing applications, and a writing trait model developed through factor analysis of classical features used in existing AWE models. Specific dimensions of transformed deBERTA layers are strongly correlated with these classical factors. When the transformation matrix derived using BIOT is applied to token vectors, it is also possible to visualize which tokens in the original text contributed to high or low scores on a specific dimension. • Large language models (LLMs) are increasingly used to support automated writing evaluation (AWE). • LLMs present challenges to interpretability, making it hard to evaluate construct validity of scoring and feedback models. • BIOT is a new interpretation method that aligns embedding dimensions with external predictors. • Specifically, BIOT can be used to align LLM embeddings with classical NLP measures of aspects of style and writing quality. • This demonstrates a general method to determine whether an LLM latently represents construct-relevant dimensions.

assessment artificial intelligence

doi:10.1016/j.asw.2025.101011

October 2024

Oct 2024 OA PDF

The impact of task duration on the scoring of independent writing responses of adult L2-English writers

Ben Naismith; Yigal Attali; Geoffrey T. LaFlair

Abstract

In writing assessment, there is inherently a tension between authenticity and practicality: tasks with longer durations may more closely reflect real-life writing processes but are less feasible to administer and score. What is more, given total testing time, there is necessarily a trade-off between task duration and number of tasks. Traditionally, high-stakes assessments have managed this trade-off by administering one or two writing tasks each test, allowing 20–40 minutes per task. However, research on second language (L2) English writing has not found longer task durations to significantly improve score validity or reliability. Importantly, very few studies have compared much shorter durations for writing tasks to more traditional allotments. To explore this issue, we asked adult L2-English test takers to respond to two writing prompts with either 5-minute or 20-minute time limits. Responses were then evaluated by expert human raters and an automated writing evaluation tool. Regardless of scoring method, short duration scores evidenced equally high test-retest reliability and criterion validity as long duration scores. As expected, longer task duration yielded higher scores, but regardless of duration, test takers demonstrated the entire spectrum of writing proficiency. Implications for writing assessment are discussed in relation to scoring practices and task design. • Longer writing tasks do not have higher test-retest reliability than shorter ones. • Longer writing tasks do not have higher criterion validity than shorter ones. • The impact of task duration is not mediated by scoring method (human or machine).

assessment artificial intelligence

doi:10.1016/j.asw.2024.100895

April 2024

Apr 2024 OA PDF

Visualizing formative feedback in statistics writing: An exploratory study of student motivation using DocuScope Write & Audit

Michael Laudenbach; David West Brown; Zhiyu Guo; Suguru Ishizaki; Alex Reinhart; Gordon Weinberg

Abstract

Recently, formative feedback in writing instruction has been supported by technologies generally referred to as Automated Writing Evaluation tools. However, such tools are limited in their capacity to explore specific disciplinary genres, and they have shown mixed results in student writing improvement. We explore how technology-enhanced writing interventions can positively affect student attitudes toward and beliefs about writing, both reinforcing content knowledge and increasing student motivation. Using a student-facing text-visualization tool called Write & Audit, we hosted revision workshops for students (n = 30) in an introductory-level statistics course at a large North American University. The tool is designed to be flexible: instructors of various courses can create expectations and predefine topics that are genre-specific. In this way, students are offered non-evaluative formative feedback which redirects them to field-specific strategies. To gauge the usefulness of Write & Audit, we used a previously validated survey instrument designed to measure the construct model of student motivation (Ling et al. 2021). Our results show significant increases in student self-efficacy and beliefs about the importance of content in successful writing. We contextualize these findings with data from three student think-aloud interviews, which demonstrate metacognitive awareness while using the tool. Ultimately, this exploratory study is non-experimental, but it contributes a novel approach to automated formative feedback and confirms the promising potential of Write & Audit.

genre theory writing pedagogy revision assessment artificial intelligence affect and writing

doi:10.1016/j.asw.2024.100830