Assessing Writing

279 articles

July 2026

  1. Investigating the impact of ChatGPT-assisted self-assessment on college students' writing development: Insights from diverse linguistic backgrounds
    doi:10.1016/j.asw.2026.101061
  2. LAWE-CL2: Multi-agent LLM-based automated writing evaluation system integrating linguistic features with fine-tuning for Chinese L2 writing assessment
    doi:10.1016/j.asw.2026.101051

April 2026

  1. Generative artificial intelligence for automated writing evaluation: A systematic review of trends, efficacy, and challenges
    doi:10.1016/j.asw.2026.101041
  2. Pursuing fair writing assessment: Halo effects in primary school foreign language writing in grade six
    Abstract

    Assessing the writing competence of pupils learning English as a foreign language (EFL) at primary school is associated with specific challenges because of learners’ limited language resources. This study investigates the extent to which characteristics of their texts trigger so-called halo effects. Halo effects are an assessment bias where the quality of one feature unintentionally influences the evaluation of other aspects. The study examines halo effects across nine aspects of text quality (communicative effect, level of detail, coherence, cohesion, complexity of syntax and grammar, correctness of syntax and grammar, vocabulary, orthography, and punctuation), based on a random sample of narrative texts from a sixth-grade corpus. Two hundred pre-service teachers assessed four randomly assigned texts. Halo effects were calculated by comparison to expert ratings using multi-level regression analyses. Results show that orthography and vocabulary were the two main triggers of halo effects. Punctuation also triggered some halo effects, but to a smaller extent. The assessment of communicative effect, complexity and correctness of syntax and grammar was not determined by the corresponding text quality but dominated by other criteria. Results highlight the importance of being aware of halo effects when assessing young EFL learners’ texts and emphasise the need for suitable training measures.
    • Analysis of halo effects across nine aspects of text quality.
    • Random sample of narrative texts from a sixth-grade EFL corpus.
    • Orthography and vocabulary are the two main triggers of halo effects.
    • Punctuation also triggers halo effects but to a smaller extent.
    • Halo effects call for awareness and targeted training.
    (A minimal, hypothetical sketch of this kind of multi-level halo regression appears after this issue’s list.)

    doi:10.1016/j.asw.2026.101036
  3. From spelling to content: The influence of spelling quality on text assessment
    doi:10.1016/j.asw.2026.101014
  4. How do L2 writing subskills interact hierarchically? Insights from diagnostic classification models
    Abstract

    This study examined the hierarchical structure among second/foreign language (L2) writing subskills using a Hierarchical Diagnostic Classification Model (HDCM). A pool of 500 essays composed by English as a Foreign Language (EFL) students was assessed by four experienced EFL teachers using the Empirically-derived Descriptor-based Diagnostic (EDD) checklist. Based on a literature review and the expertise of three content experts, several models were developed to reflect various hierarchical interactions among L2 writing subskills, including linear, divergent, convergent, independent, unstructured, mixed, and higher-order. The comparison of the models showed the presence of an unstructured interaction among L2 writing subskills, indicating that content is the foundational subskill for the mastery of vocabulary, grammar, organization, and mechanics. Higher mastery classes were also associated with higher educational levels, greater frequency of English use, and longer exposure to L2. Understanding the hierarchical relationships among L2 writing subskills can improve targeted instructional strategies and assessment practices.
    • A constrained version of existing DCMs is represented by hierarchical DCMs.
    • Models were developed to show hierarchical interactions among L2 writing subskills.
    • An unstructured interaction among L2 writing subskills was identified.
    • Higher mastery classes were associated with higher educational levels.
    • The classes were associated with greater English use and longer L2 exposure.

    doi:10.1016/j.asw.2026.101029
  5. Assessing GenAI-assisted digital multimodal composing: Reconceptualizing a genre-based framework through self-assessment and peer assessment
    doi:10.1016/j.asw.2026.101017
  6. Assessing fairness in finetuned scoring models with demographically restricted training data
    Abstract

    The increasing adoption of automated essay scoring (AES) in high-stakes educational contexts necessitates careful examination of potential biases within these systems. This study investigates how the demographic composition of training data influences fairness in AES systems developed from finetuned large language models (LLMs). Using the PERSUADE corpus of 26,000 student essays, we conducted a systematic analysis using demographically restricted training sets to isolate the impact of training data demographics on LLM-AES performance. Each demographically restricted training set comprised essays written by one racial/ethnic group. Four variants of a Longformer-based AES were developed: one trained on demographically balanced data and three trained on demographically restricted datasets. An initial analysis of the human ratings indicated that demographic factors significantly predict human essay scores (marginal R² = 0.125), a pattern that is paralleled in national writing assessment data. LLM-AES systems trained on demographically restricted data exhibited small systematic biases (marginal R² = 0.043). However, the LLM trained on balanced data showed minimal demographic bias, suggesting that representative training data can effectively prevent amplification of demographic disparities beyond those present in human ratings. These results highlight both the importance and limitations of training data diversity in achieving fair assessment outcomes.
    • 12.5% of variance in human essay ratings was explained by demographics.
    • We construct demographically restricted training sets to isolate bias.
    • Balanced training data minimized LLM-AES bias across demographic groups.
    • LLM-AES trained on demographically restricted data showed more bias.
    (A rough, hypothetical sketch of how such a marginal R² is computed from a mixed model appears after this issue’s list.)

    doi:10.1016/j.asw.2026.101032
  7. Aligning ACTFL writing proficiency guidelines with CEFR descriptors: Insights from Chinese writing assessment
    doi:10.1016/j.asw.2026.101033
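
Illustrative sketch for item 2 above ("Pursuing fair writing assessment"): the abstract describes estimating halo effects with multi-level regression, comparing pre-service teachers' ratings of one criterion to expert ratings of that and other criteria. The snippet below is a minimal, hypothetical Python sketch of that kind of analysis using statsmodels; the file name, column names, and the specific criteria chosen are assumptions, not the authors' materials.

```python
# Hypothetical illustration only: file, column names and model specification are
# assumed, not taken from the study.
import pandas as pd
import statsmodels.formula.api as smf

# One row per (pre-service rater, text): the rater's score for "vocabulary" plus
# the expert scores for vocabulary and for other criteria on the same text.
ratings = pd.read_csv("preservice_ratings.csv")

# A halo effect shows up as a positive coefficient on an expert score for a
# *different* criterion (e.g. orthography) when predicting the teacher's
# vocabulary rating; random intercepts allow for rater-level severity.
model = smf.mixedlm(
    "teacher_vocabulary ~ expert_vocabulary + expert_orthography + expert_coherence",
    data=ratings,
    groups=ratings["rater_id"],
)
print(model.fit().summary())
```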
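
Illustrative sketch for item 6 above ("Assessing fairness in finetuned scoring models"): the abstract reports marginal R² values for demographic predictors of essay scores. Below is a rough, hypothetical sketch of how a Nakagawa-style marginal R² can be computed from a mixed model in Python; the variable names, grouping structure, and data file are assumptions, not the study's code.

```python
# Hypothetical sketch: compute a Nakagawa-style marginal R^2, i.e. the share of
# score variance attributable to demographic fixed effects, from a mixed model
# with a random intercept per prompt. Data and column names are assumed.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

essays = pd.read_csv("essay_scores.csv")  # columns: score, race_ethnicity, gender, ell_status, prompt_id

m = smf.mixedlm(
    "score ~ C(race_ethnicity) + C(gender) + C(ell_status)",
    data=essays,
    groups=essays["prompt_id"],
).fit()

# Marginal R^2 = var(fixed-effect predictions) / (fixed + random-intercept + residual variance)
var_fixed = np.var(np.asarray(m.model.exog) @ np.asarray(m.fe_params))
var_random = float(m.cov_re.iloc[0, 0])  # random-intercept variance
var_resid = float(m.scale)               # residual variance
marginal_r2 = var_fixed / (var_fixed + var_random + var_resid)
print(f"marginal R^2 of demographic predictors: {marginal_r2:.3f}")
```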

January 2026

  1. The effects of online resource use on L2 learners’ computer-mediated writing processes and written products
    Abstract

    While previous studies on online resource use in L2 writing have focused on the overall writing quality, limited attention has been paid to its effects on linguistic complexity and real-time writing processes. Addressing this gap, the present study explored how online resource use influences both the processes and products of L2 writing. Forty-nine intermediate L2 learners completed two computer-mediated argumentative writing tasks, either with or without the use of online resources. Writing behaviors were captured via keystroke logging and screen recording, and analyzed for search activity, fluency, pausing, and revision quantity. Cognitive processes were examined through stimulated recall interviews, and written products were evaluated for both quality and linguistic complexity. The results showed that participants spent an average of 14 % of task time using online resources, with considerable individual variation. Mixed-effects modeling revealed that resource use facilitated the production of more sophisticated words, with marginal influence on writing quality or syntactic complexity. Resource use was also associated with longer between-word pauses, fewer within-word pauses, and reduced revisions. These findings highlight the potential of online resource use to enhance the authenticity of L2 writing assessment tasks without compromising test validity, while encouraging the use of more advanced vocabulary in writing.
    • Learners spent 14 % of the total writing task time using online resources.
    • Online resource use had no significant impact on L2 writing quality.
    • Online resource use improved lexical sophistication, not syntactic complexity.
    • Online resource use reduced within-word pauses and aided spelling retrieval.
    • Online resource use led to fewer revisions but did not affect fluency.

    doi:10.1016/j.asw.2025.100994
  2. Verb-centric or balanced?: An NLP-based assessment of word class contributions to L2 writing proficiency
    doi:10.1016/j.asw.2025.100997
  3. Generative artificial intelligence for automated essay scoring: Exploring teacher agency through an ecological perspective
    Abstract

    Generative artificial intelligence (AI) is increasingly used in writing assessment, particularly for automated essay scoring (AES) and for generating formative feedback within automated writing evaluation (AWE). While AI-driven AES enhances efficiency and consistency, concerns regarding accuracy, bias, and ethical implications raise critical questions about its role in assessment. This paper examines the impact of generative AI on teacher agency through an ecological perspective, which considers agency as shaped by personal, institutional, and sociocultural factors. The analysis highlights the need for teachers to critically mediate AI-generated scores and feedback to align them with pedagogical goals, ensuring AI functions as an assistive tool rather than a determinant of assessment outcomes. Although AI can streamline assessment, over-reliance risks diminishing teachers’ evaluative expertise and reinforcing biases embedded in AI systems. Ethical concerns, including transparency, data privacy, and fairness, further complicate its adoption. To address these challenges, this paper proposes a framework for responsible AI integration that prioritizes bias mitigation, data security, and teacher-driven decision-making. The discussion concludes with pedagogical implications and directions for future research on AI-assisted writing assessment.
    • Teachers can actively mediate AI-generated scores to maintain agency.
    • Dependence on AES may weaken teachers’ evaluative skills.
    • Bias, data privacy, and AI opacity can undermine teachers’ decision-making.
    • AI literacy and hybrid assessment models can promote teacher autonomy.
    • A framework for protecting teacher agency in generative AI–based AWE is presented.

    doi:10.1016/j.asw.2025.100990
  4. The relation between linguistic accuracy and scoring of Swedish EFL students’ writing during a high-stakes exam
    Abstract

    This paper examines the effect of linguistic accuracy (i.e., the absence of form, grammatical, and lexical errors) on scoring during the high-stakes national test of English in Swedish upper secondary school. Teachers are expected to score their own students’ texts with the help of assessment instructions containing benchmark texts (i.e., texts representing different score bands). The assessment instructions and the score bands provided to guide scoring are not explicit about how accuracy should influence scores. Two research questions were answered: As measured by ordinal regression, to what extent does linguistic accuracy predict rater scores? Do the texts scored by teachers reflect the graded example texts in terms of how linguistic accuracy predicts scores? The results revealed, amongst other things, that the overall frequency of errors in texts significantly predicted scores, with the model explaining approximately 58 % of the variance in the outcome variable according to Nagelkerke’s pseudo R-squared. Accuracy also had a similar effect on scores in texts rated by teachers as in the benchmark texts. In relation to the findings, it was concluded that accuracy may have more of an impact on scores than constructs that are more explicit components of the score bands, such as lexical complexity.
    (A small, hypothetical sketch of how a Nagelkerke pseudo R-squared is obtained from an ordinal regression appears after this issue’s list.)

    doi:10.1016/j.asw.2025.100995
  5. Extracting interpretable writing traits from a large language model
    Abstract

    Large language models (LLMs) are increasingly used to support automated writing evaluation (AWE), both for purposes of scoring and feedback. However, LLMs present challenges to interpretability, making it hard to evaluate the construct validity of scoring and feedback models. BIOT (best interpretable orthogonal transformations) is a new method of analysis that makes dimensions of an embedding interpretable by aligning them with external predictors. It was originally developed to improve the interpretability of multidimensional scaling models. However, this paper shows that BIOT can be used to align LLM embeddings with an interpretable writing trait model developed using multidimensional analysis of classical NLP features to measure latent dimensions of writing style and writing quality. This makes it possible to determine whether an AWE model built using an LLM is aligned with known (and construct-relevant) dimensions of textual variation, supporting construct validity. Specifically, we examine the alignment between the hidden layers of DeBERTa, a small LLM that has been shown to be useful for a variety of natural language processing applications, and a writing trait model developed through factor analysis of classical features used in existing AWE models. Specific dimensions of transformed DeBERTa layers are strongly correlated with these classical factors. When the transformation matrix derived using BIOT is applied to token vectors, it is also possible to visualize which tokens in the original text contributed to high or low scores on a specific dimension.
    • Large language models (LLMs) are increasingly used to support automated writing evaluation (AWE).
    • LLMs present challenges to interpretability, making it hard to evaluate construct validity of scoring and feedback models.
    • BIOT is a new interpretation method that aligns embedding dimensions with external predictors.
    • Specifically, BIOT can be used to align LLM embeddings with classical NLP measures of aspects of style and writing quality.
    • This demonstrates a general method to determine whether an LLM latently represents construct-relevant dimensions.
    (A toy, hypothetical sketch of aligning embedding dimensions with external trait scores via an orthogonal rotation appears after this issue’s list.)

    doi:10.1016/j.asw.2025.101011
  6. How reliable and valid is peer evaluation in adolescents’ L2 argumentative writing?
    Abstract

    Peer evaluation is widely recognized for its educational benefits; however, its reliability and validity, particularly among adolescent second-language (L2) writers at the early stages of English language and literacy development, remain insufficiently explored. This explanatory sequential mixed-methods study investigated the reliability and validity of peer evaluation in English argumentative writing among 35 Grade 10 and 37 Grade 12 students from a public high school in Beijing, China. Twelve of the participating students (six at each grade) were interviewed about the validity, reliability, and value of peer evaluation. The findings indicated that peer evaluations demonstrated high levels of reliability and validity, with peer-assessed writing scores closely aligning with inter-teacher assessments. Notably, variations were observed among Grade 10 students, particularly in the evaluation of lower-order writing skills, such as grammar and vocabulary, which exhibited reduced validity. These results underscore the potential of peer evaluation in assessing higher-order content-level writing across varying levels of L2 English writing proficiency. The study also highlights areas where adolescent L2 writers may require additional support to enhance the effectiveness of peer evaluation practices in English argumentative writing. Implications for improving English argumentative writing instruction and refining peer evaluation strategies in high school L2 English classrooms are discussed.
    • Peer evaluation shows high reliability, similar to inter-teacher rating.
    • Peer evaluation works well for higher-order skills in L2 argumentative writing.
    • 10th graders struggled with evaluating lower-order skills like grammar.
    • 12th graders evaluate lower- and higher-order skills with greater validity than 10th graders.

    doi:10.1016/j.asw.2025.100992
  7. Assessing the effects of task complexity on cognitive demands in L2 writing
    Abstract

    The assessment of task-generated cognitive demands has been receiving increasing attention in task complexity research. However, scant attention has been paid to assessing cognitive demands when task complexity is manipulated along both resource-directing and resource-dispersing dimensions. To address this gap, the present study aimed to investigate the relative effects of reasoning demands and prior knowledge on cognitive demands in L2 writing. Eighty-eight EFL students completed two letter-writing tasks with varying reasoning demands under one of two conditions: either with or without prior knowledge available. Cognitive demands were assessed by the post-task questionnaire, the dual-task method and the open-ended questions. The results revealed that reasoning demands and prior knowledge were strong determinants of cognitive demands, which provided empirical evidence for Robinson’s Cognition Hypothesis. Moreover, the post-task questionnaire, the dual-task method and open-ended questions were found to assess distinct aspects of cognitive demands, which highlighted the importance of data triangulation in exploring task complexity effects. The study provides language teachers and assessors with implications for task design and implementation.
    • How reasoning demands and prior knowledge affect cognitive demands was underexplored.
    • Cognitive demands were assessed by both quantitative and qualitative methods.
    • Findings supported some assumptions underlying Robinson’s framework.
    • The independent measures assessed distinct aspects of cognitive demands.

    doi:10.1016/j.asw.2025.100998
  8. Assessing the effects of explicit coherence instruction on EFL students’ integrated writing performance
    Abstract

    As a key attribute of effective writing, coherence remains challenging to teach in language classrooms, with traditional writing instruction frequently overlooking coherence in favor of discrete, rule-based features. This mixed-methods study investigates the effectiveness of explicit coherence instruction on English-as-a-Foreign-Language (EFL) students’ performance on integrated writing tasks. The study employed a controlled experimental design with 64 upper-intermediate-level undergraduate students at a Chinese university, drawing on Hasan’s Cohesive Harmony theory as the theoretical framework. Half of the participants (n = 32) in the experimental group received explicit instruction on coherence with a focus on cohesive chains and cohesive devices in integrated writing, while the control group (n = 32) received standard paraphrasing instruction. Quantitative analysis revealed that the experimental group showed significant improvements in coherence scores and multiple cohesive chain measures. Qualitative discourse analysis of six students’ writing samples from the experimental group demonstrated varying levels of improvement in writing coherence, with high-performing students showing better use of identity chains and pronoun references. The findings revealed that explicit instruction on coherence significantly improved students’ performance in creating coherent integrated writing, particularly through the development of cohesive chains and appropriate use of cohesive devices. This study underscores the pedagogical value of teaching coherence to enhance writing quality and provides concrete strategies for developing more effective teaching approaches for integrated writing tasks in EFL contexts.
    • The study examined 64 Chinese EFL students using mixed-methods experimental design.
    • Cohesive Harmony theory served as the framework for assessing writing coherence.
    • Explicit instruction significantly improved coherence in integrated writing tasks.
    • High-performing students demonstrated superior identity chain development.

    doi:10.1016/j.asw.2026.101019
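
Illustrative sketch for item 4 above (linguistic accuracy and scoring of Swedish EFL writing): the abstract reports that an ordinal regression of rater scores on error frequency explained roughly 58 % of the variance by Nagelkerke’s pseudo R-squared. The sketch below shows, with hypothetical data and column names, how such a figure is typically obtained in Python; it is not the authors' analysis.

```python
# Hypothetical sketch: ordinal (proportional-odds) regression of score bands on
# error frequency, with Nagelkerke's pseudo R^2 computed from the fitted and
# intercept-only log-likelihoods. Data file and column names are assumed.
import numpy as np
import pandas as pd
from statsmodels.miscmodels.ordinal_model import OrderedModel

texts = pd.read_csv("exam_texts.csv")  # columns: score_band, errors_per_100_words
texts["score_band"] = pd.Categorical(texts["score_band"], ordered=True)

full = OrderedModel(
    texts["score_band"], texts[["errors_per_100_words"]], distr="logit"
).fit(method="bfgs", disp=False)

# For a thresholds-only ordinal model, the fitted category probabilities equal
# the observed proportions, so the null log-likelihood follows from the counts.
n = len(texts)
counts = texts["score_band"].value_counts()
counts = counts[counts > 0]
ll_null = float((counts * np.log(counts / n)).sum())

r2_cox_snell = 1 - np.exp(2 * (ll_null - full.llf) / n)
r2_nagelkerke = r2_cox_snell / (1 - np.exp(2 * ll_null / n))
print(f"Nagelkerke pseudo R^2: {r2_nagelkerke:.2f}")
```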
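
Illustrative sketch for item 5 above (extracting interpretable writing traits from an LLM): BIOT finds an orthogonal transformation that makes embedding dimensions line up with external predictors. The toy example below is not BIOT itself (which adds variable selection and regularisation); it only shows the core move with a plain orthogonal Procrustes rotation on synthetic data. All names and numbers are invented for illustration.

```python
# Toy illustration of rotating embedding axes toward external trait scores with
# an orthogonal transformation (a plain Procrustes rotation, standing in for the
# richer BIOT procedure). All data here are synthetic.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
n_essays, emb_dim, n_traits = 500, 64, 6

# Synthetic setup: trait scores live in the first few latent dimensions, and the
# "LLM embeddings" are an unknown rotation of that latent space.
latent = rng.standard_normal((n_essays, emb_dim))
hidden_rotation = np.linalg.qr(rng.standard_normal((emb_dim, emb_dim)))[0]
embeddings = latent @ hidden_rotation
traits = latent[:, :n_traits] + 0.3 * rng.standard_normal((n_essays, n_traits))

# Orthogonal matrix R minimising ||embeddings @ R - targets||_F, where the
# targets are the trait scores padded with zero columns to match dimensions.
targets = np.hstack([traits, np.zeros((n_essays, emb_dim - n_traits))])
R, _ = orthogonal_procrustes(embeddings, targets)
rotated = embeddings @ R

# After rotation, each of the first n_traits dimensions should track one trait.
for j in range(n_traits):
    r = np.corrcoef(rotated[:, j], traits[:, j])[0, 1]
    print(f"trait {j}: correlation with rotated dimension {j} = {r:.2f}")
```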

October 2025

  1. Assessing writing practices in higher education: Characterizing self-reported practices and identifying their determinants
    doi:10.1016/j.asw.2025.100976
  2. Response time for English learners on large-scale writing assessments
    doi:10.1016/j.asw.2025.100979
  3. Comparing GPT-based approaches in automated writing evaluation
    doi:10.1016/j.asw.2025.100961
  4. Assessing L2 writing formality using syntactic complexity indices: A fuzzy evaluation approach
    doi:10.1016/j.asw.2025.100973
  5. Judgment accuracy in primary school EFL writing assessment: Do text characteristics matter?
    Abstract

    Assessing the writing competence of pupils learning English as a foreign language (EFL) at primary school is challenging. This study aimed at examining a largely unexplored topic, namely the role of text characteristics in writing assessment, and analysed judgment accuracy differentiated by nine aspects of text quality (communicative effect, level of detail, coherence, cohesion, complexity of syntax and grammar, correctness of syntax and grammar, vocabulary, orthography, and punctuation). Two hundred pre-service teachers assessed four randomly assigned texts from learners in grade six. Their assessment was compared to the existing ratings of two experts from a previous study. We found a relative judgment accuracy between r = .34 and .60 for the nine assessment criteria, with vocabulary being assessed significantly more accurately than almost all other criteria. Orthography, punctuation, and the complexity and correctness of syntax and grammar were rated significantly more accurately than cohesion, level of detail, communicative effect and coherence. The pre-service teachers assessed most criteria more strictly and with higher variability than the experts. The results suggest that teacher education should offer pre-service teachers concrete opportunities to practise writing assessment, implement activities to strengthen the assessment of content- and structure-related criteria, and help them adjust their assessment rigour.
    • Judgment accuracy in the assessment of primary school EFL learners’ texts.
    • Relative judgment accuracy between r = .34 and .60 for the different criteria.
    • Significant differences in relative judgment accuracy between assessment criteria.
    • Linguistic text qualities are assessed with more accuracy than content- and structure-related aspects.
    • Pre-service teachers are more rigorous and heterogeneous in rating than experts.

    doi:10.1016/j.asw.2025.100957
  6. Exploring the scoring validity of holistic and dimension-based Comparative Judgements of young learners’ EFL writing
    Abstract

    Comparative Judgement (CJ) is a pairwise comparison evaluation method, typically conducted online. Multiple judges each compare the quality of a series of paired performances and, from their decisions, a rank order is constructed and scores calculated. Research across different educational contexts supports CJ’s reliability for evaluating written performances, permitting more precise scoring of scripts and for dimension-focused evaluation. However, scant insights are available about the basis of judges’ evaluations. This issue is important because argument-based approaches to validation (common in the field of language testing and adopted in this study) require evidence to support claims about how scores are appropriate for test purpose. Therefore, we investigate the scoring validity of CJ, both when used holistically (the standard application of CJ) and when evaluating scripts by individual criteria (termed dimensions in the research context). Twenty-seven judges evaluated 300 scripts addressing two writing task types in a national English as a Foreign Language examination for young learners in Austria. Judges reported via questionnaires what they had focused on while judging. Subsequently, eight judges provided think-aloud data while evaluating 157 scripts, offering further insight into the writing features they considered and their decision-making during CJ. Findings showed that while most judges adopted a decision-making process similar to traditional rating methods, some adapted their method to accommodate the nature of CJ evaluation. Furthermore, results indicated that the judges considered construct-relevant criteria when using CJ, both holistically and by dimension, thus offering support to an argument for the appropriateness of using CJ in this context.
    • Comparative Judgement can offer an alternative to analytic rating of EFL writing.
    • Judges with teaching or rating experience largely focus on relevant text features.
    • Some judges adopt a decision-making process that appears well suited to CJ.
    • Dimension-based CJ has the potential to provide richer feedback than holistic CJ.
    (A toy sketch of how pairwise Comparative Judgement decisions are typically converted into a score scale appears after this issue’s list.)

    doi:10.1016/j.asw.2025.100986
  7. Which gender provides more specific peer feedback? Gender and assessment training’s effects on peer feedback specificity and intrapersonal factors
    Abstract

    This study investigated the effects of assessor gender (male vs. female), fictitious assessee gender (male vs. female), and assessment training (with vs. without) on peer feedback specificity (i.e. localisation and focus) and intrapersonal factors (i.e. trust in the self as an assessor and discomfort). This study involved 240 undergraduate psychology students (120 men, 120 women), with half receiving assessment training and the other half receiving only the task instructions. Participants were divided into eight subgroups based on training condition and their self-reported gender to provide peer feedback to three writing samples (poor, average, excellent quality) by fictitious male or female peer assessees in Eduflow. A total of 3017 peer feedback segments were analysed, revealing that trained or untrained male and female assessors were comparable in most peer feedback specificity categories when assessing fictitious male or female assessees. Nonetheless, we also found that female assessors excelled in certain categories of peer feedback specificity, while male assessors also demonstrated competencies in other categories. Results also showed that assessors who received assessment training provided localised peer feedback in all the writing samples. Finally, gender and training did not affect participants’ trust in their abilities and (dis)comfort when providing peer feedback.

    doi:10.1016/j.asw.2025.100987
  8. Integrating move analysis and sentence reconstruction in automated writing evaluation for L2 academic writers
    doi:10.1016/j.asw.2025.100984
  9. Exploring the cross-lingual influence of linguistic complexity in second language writing assessment
    doi:10.1016/j.asw.2025.100951
  10. Predictive validity evidence for a no-stakes, untimed, machine-scored diagnostic writing assessment
    doi:10.1016/j.asw.2025.100978
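
Illustrative sketch for item 6 above (Comparative Judgement): CJ turns many pairwise "which script is better?" decisions into a score scale, typically with a Bradley-Terry-type model. The snippet below is a self-contained toy example with simulated judgements and a basic MM fitting loop; it is not the software used in the study, and all quantities are synthetic.

```python
# Toy illustration of how Comparative Judgement decisions become scores: fit a
# Bradley-Terry model to simulated pairwise judgements with the simple MM
# algorithm (Hunter, 2004). Synthetic data only.
import numpy as np

rng = np.random.default_rng(1)
n_scripts, n_judgements = 20, 600

true_quality = np.sort(rng.normal(size=n_scripts))
pairs = rng.choice(n_scripts, size=(n_judgements, 2))
pairs = pairs[pairs[:, 0] != pairs[:, 1]]  # drop accidental self-comparisons

# Judges pick a winner with Bradley-Terry probability based on true quality.
p_first = 1 / (1 + np.exp(-(true_quality[pairs[:, 0]] - true_quality[pairs[:, 1]])))
first_wins = rng.random(len(pairs)) < p_first
winners = np.where(first_wins, pairs[:, 0], pairs[:, 1])
losers = np.where(first_wins, pairs[:, 1], pairs[:, 0])

# Win counts (+0.1 smoothing so a script with no wins does not collapse to -inf)
wins = np.bincount(winners, minlength=n_scripts).astype(float) + 0.1
n_ij = np.zeros((n_scripts, n_scripts))
for w, l in zip(winners, losers):
    n_ij[w, l] += 1
    n_ij[l, w] += 1

# MM updates for the Bradley-Terry strengths pi_i
pi = np.ones(n_scripts)
for _ in range(200):
    denom = (n_ij / (pi[:, None] + pi[None, :])).sum(axis=1)
    pi = wins / denom
    pi /= pi.sum()

estimated = np.log(pi)  # logit-like script "quality" scores
print("correlation with true quality:", round(np.corrcoef(estimated, true_quality)[0, 1], 2))
```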

July 2025

  1. Making things happen: A study of grammatical metaphors in L2 writing scripts
    Abstract

    The notion of grammatical metaphor (GM; Halliday, 1985) refers, essentially, to a writer shifting an action or quality into a ‘thing’. As in most senses of metaphor, the goal is to “represent something as something else” (McGrath & Liardét, 2023, p.33). This study investigated the use of grammatical metaphor (GM) in Linguaskill writing exam responses across CEFR proficiency levels (below-B1 to C1 or above). Using a pre-existing GM list (see McGrath & Liardét, 2023), it explored GM frequency in L2 responses and its correlation with proficiency scores, and qualitatively examined how candidates used GMs in their responses. Results show a moderate positive correlation between proficiency and GM use, with a dominance of process-to-thing shifts (e.g., transform→transformation) and emergence of GM use from lower to higher proficiency levels. This underscores GM's significance in crafting academically valued meanings in L2 contexts, suggesting its potential for informing instructional and assessment practices.
    • Metaphorisation in writing is a useful metric for L2 writing assessment.
    • Evidence suggests GM frequency correlates with increased performance.
    • Learners progress from emergent arguments to presenting ideas more concisely.
    • The majority of GM shifts were to ‘things’.
    • The study provides further weight to arguments for meaning-based complexity.

    doi:10.1016/j.asw.2025.100939
  2. Trinka: Facilitating academic writing through an intelligent writing evaluation system
    doi:10.1016/j.asw.2025.100953
  3. Using ChatGPT to facilitate vocabulary learning in continuation writing assessment tasks
    doi:10.1016/j.asw.2025.100952
  4. Comparative judgment in L2 writing assessment: Reliability and validity across crowdsourced, community-driven, and trained rater groups of judges
    doi:10.1016/j.asw.2025.100937
  5. Editorial introduction, Assessing writing Tools & Tech Forum 2025
    doi:10.1016/j.asw.2025.100956

April 2025

  1. Designing a rating scale for an integrated reading-writing test: A needs-oriented approach
    Abstract

    In line with current trends in higher education, EAP programmes are held accountable for preparing students for higher education and assessing their readiness for it. Thus, multimodal tasks including integrated writing (IW) assessments have seen a resurgence because they arguably closely mirror academic writing. However, test practicality constraints and variability in the use and format of these assessments mean rating scales often fall short in substantiating the central claims of IW assessment. We developed an integrated reading-writing scale taking into account reading-writing requirements and empirical research on IW tests designed to assess readiness for first-year humanities and social science courses. We approached test development as part of the ongoing validation efforts, detailing the considerations involved in the scale development process. We argue that alignment with academic writing requirements should guide the development of IW tests, thereby acknowledging and comprehending nuances of academic writing. The paper demonstrates the considerations and decisions involved in scale design as part of validation from the start, a reminder that assessment is not just a quantitative exercise but a multifaceted process.
    • The design of a rating scale for first-year undergraduate academic writing is detailed.
    • Emphasis is placed on the role of reading in integrated writing scales.
    • Academic argumentation, rather than solely source-use mechanics, is considered.
    • Implications for construct operationalisation in academic evaluations are offered.

    doi:10.1016/j.asw.2025.100918
  2. Does student assessment literacy matter between motivational constructs and engagement in L2 writing? A survey of Chinese EFL undergraduates
    doi:10.1016/j.asw.2025.100916

January 2025

  1. A meta-analysis of relationships between syntactic features and writing performance and how the relationships vary by student characteristics and measurement features
    Abstract

    Students’ proficiency in constructing sentences impacts the writing process and writing products. Linguistic demands in writing differ in terms of both student characteristics and measurement features. To identify various syntactic demands considering these features, we conducted a meta-analysis examining the relationships between syntactic features (complexity and accuracy) and writing performance (quality, productivity, and fluency) and moderating effects of both student characteristics and measurement features. A total of 109 studies (871 effect sizes; 24,628 participants in total) met the inclusion criteria. Results showed that there was a weak relationship for syntactic accuracy (r = .25) and complexity (r = .16). Writer characteristics (grade level and language proficiency) and measurement features (writing genre, writing outcome, whether the writing task was text-based, and type of syntactic complexity measure) were significant moderators for certain syntactic features. The findings highlighted the importance of writer and measurement factors when considering the relationships between linguistic features in writing and writing performance. Implications were discussed regarding the selection of syntactic features in assessing language use in writing, gaps in the literature, and significance for writing instruction and assessment.
    • Aimed to depict the relationships between syntactic features and writing performance.
    • Found weak relationships between syntactic features and writing outcomes.
    • Relationships vary as a function of student characteristics and measurement features.
    • Noun phrase complexity might be more valid than some traditional syntactic complexity measures.
    • Findings have important implications for writing assessments.
    (A toy sketch of the kind of Fisher z random-effects pooling typically used to meta-analyse correlations appears after this issue’s list.)

    doi:10.1016/j.asw.2024.100909
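
Illustrative sketch for item 1 above (meta-analysis of syntactic features and writing performance): pooled correlations such as r = .25 and r = .16 are commonly obtained by Fisher z-transforming each study's correlation, pooling with a random-effects model, and back-transforming. The numbers below are toy values, not the paper's data, and the DerSimonian-Laird estimator is one standard choice rather than necessarily the one the authors used.

```python
# Toy sketch of a random-effects meta-analysis of correlations: Fisher z
# transform, DerSimonian-Laird between-study variance, pooled estimate, and
# back-transformation. All study values here are invented.
import numpy as np

r = np.array([0.25, 0.18, 0.30, 0.10, 0.22])  # per-study correlations (toy values)
n = np.array([120, 85, 200, 60, 150])          # per-study sample sizes (toy values)

z = np.arctanh(r)        # Fisher z
v = 1 / (n - 3)          # sampling variance of z
w = 1 / v

# DerSimonian-Laird estimate of between-study variance tau^2
z_fixed = np.sum(w * z) / np.sum(w)
Q = np.sum(w * (z - z_fixed) ** 2)
df = len(r) - 1
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - df) / c)

# Random-effects pooled estimate, back-transformed to the r metric
w_re = 1 / (v + tau2)
z_pooled = np.sum(w_re * z) / np.sum(w_re)
se = np.sqrt(1 / np.sum(w_re))
r_pooled = np.tanh(z_pooled)
ci = np.tanh([z_pooled - 1.96 * se, z_pooled + 1.96 * se])
print(f"pooled r = {r_pooled:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```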

October 2024

  1. Effects of a genre and topic knowledge activation device on a standardized writing test performance
    Abstract

    The aim of this article was twofold: first, to introduce a design for a writing test intended for application in large-scale assessments of writing, and second, to experimentally examine the effects of employing a device for activating prior knowledge of topic and genre as a means of controlling construct-irrelevant variance and enhancing validity. An authentic, situated writing task was devised, offering students a communicative purpose and a defined audience. Two devices were utilized for the cognitive activation of topic and genre knowledge: an infographic and a genre model. The participants in this study were 162 fifth-grade students from Santiago de Chile, with 78 students assigned to the experimental condition (with activation device) and 84 students assigned to the control condition (without activation device). The results demonstrate that the odds of presenting good writing ability are higher for students who were part of the experimental group, even when controlling for text transcription ability, considered a predictor of writing. These findings hold implications for the development of large-scale tests of writing guided by principles of educational and social justice.
    • Genre and topic knowledge are forms of prior knowledge relevant to writing.
    • Higher odds for better writing in students exposed to prior knowledge activation.
    • Results support use of prior knowledge activation in standardized assessment.

    doi:10.1016/j.asw.2024.100898
  2. Validating an integrated reading-into-writing scale with trained university students
    Abstract

    Integrated tasks are often used in higher education (HE) for diagnostic purposes, with increasing popularity in lingua franca contexts, such as German HE, where English-medium courses are gaining ground. In this context, we report the validation of a new rating scale for assessing reading-into-writing tasks. To examine scoring validity, we employed Weir’s (2005) socio-cognitive framework in an explanatory mixed-methods design. We collected 679 integrated performances on four summary and opinion tasks, which were rated by six trained student raters who were preparing to become writing tutors for first-year students. We utilized a many-facet Rasch model to investigate rater severity, reliability, consistency, and scale functioning. Using thematic analysis, we analyzed think-aloud protocols, retrospective and focus group interviews with the raters. Findings showed that the rating scale overall functions as intended and is perceived by the raters as a valid operationalization of the integrated construct. FACETS analyses revealed reasonable reliabilities, yet exposed local issues with certain criteria and band levels. This is corroborated by the challenges reported by the raters, which they mainly attributed to the complexities inherent in such an assessment. Applying Weir’s (2005) framework in a mixed-methods approach facilitated the interpretation of the quantitative findings and yielded insights into potential validity threats.
    • FACETS analyses show reasonable reliabilities and scale functioning.
    • Mixed-methods approach facilitates interpreting the quantitative findings.
    • Raters perceive rating scale as valid operationalization of integrated construct.
    • Applying Weir’s socio-cognitive framework reveals potential validity threats.
    • Raters attribute challenges to the complexities inherent in integrated writing.

    doi:10.1016/j.asw.2024.100894
  3. The impact of task duration on the scoring of independent writing responses of adult L2-English writers
    Abstract

    In writing assessment, there is inherently a tension between authenticity and practicality: tasks with longer durations may more closely reflect real-life writing processes but are less feasible to administer and score. Moreover, for a given total testing time, there is necessarily a trade-off between task duration and the number of tasks. Traditionally, high-stakes assessments have managed this trade-off by administering one or two writing tasks per test, allowing 20–40 minutes per task. However, research on second language (L2) English writing has not found longer task durations to significantly improve score validity or reliability. Importantly, very few studies have compared much shorter durations for writing tasks to more traditional allotments. To explore this issue, we asked adult L2-English test takers to respond to two writing prompts with either 5-minute or 20-minute time limits. Responses were then evaluated by expert human raters and an automated writing evaluation tool. Regardless of scoring method, short-duration scores evidenced test-retest reliability and criterion validity as high as those of long-duration scores. As expected, longer task duration yielded higher scores, but regardless of duration, test takers demonstrated the entire spectrum of writing proficiency. Implications for writing assessment are discussed in relation to scoring practices and task design.
    • Longer writing tasks do not have higher test-retest reliability than shorter ones.
    • Longer writing tasks do not have higher criterion validity than shorter ones.
    • The impact of task duration is not mediated by scoring method (human or machine).

    doi:10.1016/j.asw.2024.100895

July 2024

  1. Influence of prior educational contexts on directed self-placement of L2 writers
    Abstract

    Directed self-placement (DSP) allows for student agency in writing placement. DSP has been implemented in many composition programs, although it has not been used as widely for L2 writers in higher education. This study investigates the relationship between student placement decisions and students’ prior educational backgrounds, particularly in relation to whether they had attended an English-medium high school or an intensive English program (IEP). Actual placement results via an exam were compared to 804 students’ self-placement decisions and correlated with their prior educational backgrounds. Findings indicated that most students’ DSP decisions matched actual exam placement results. However, a substantial number of DSP decisions were higher or lower than the exam placement results. Additionally, the longer students studied at an English-medium instruction high school, the more likely they were to place themselves higher than their exam placement. We conclude that DSP can be used in L2 writing programs, but with careful attention to learners’ educational backgrounds, proficiency, and sense of identity.

    doi:10.1016/j.asw.2024.100870
  2. Construct representation and predictive validity of integrated writing tasks: A study on the writing component of the Duolingo English Test
    Abstract

    This study examined whether two integrated reading-to-write tasks could broaden the construct representation of the writing component of the Duolingo English Test (DET). It also examined whether they could enhance the DET’s predictive power for English academic writing in universities. The tasks were (1) writing a summary based on two source texts and (2) writing a reading-to-write essay based on five texts. Both were given to a sample (N = 204) of undergraduates from Hong Kong. Each participant also submitted an academic assignment written for the assessment of a disciplinary course. Three professional raters double-marked all writing samples against detailed analytical rubrics. Raw scores were first processed using Multi-Faceted Rasch Measurement to estimate inter- and intra-rater consistency and generate adjusted (fair) measures. Based on these measures, descriptive analyses, sequential multiple regression, and Structural Equation Modeling were conducted (in that order). The analyses verified the writing tasks’ underlying component constructs and assessed their relative contributions to the overall integrated writing scores. Both tasks were found to contribute to DET’s construct representation and add moderate predictive power to the domain performance. The findings, along with their practical implications, are discussed, especially regarding the complex relations between construct representation and predictive validity.
    • Studied the concepts of construct representation (CR) and predictive validity (PV) within the context of an AI-facilitated language test (the Duolingo English Test).
    • Revealed the complex relations between CR and PV.

    doi:10.1016/j.asw.2024.100846
  3. Corrigendum to “Assessing metacognition-based student feedback literacy for academic writing” [Assessing Writing 59 (2024) 100811]
    doi:10.1016/j.asw.2024.100869
  4. Navigating innovation and equity in writing assessment
    doi:10.1016/j.asw.2024.100873
  5. A teacher’s inquiry into diagnostic assessment in an EAP writing course
    doi:10.1016/j.asw.2024.100848
  6. Examining the direct and indirect impacts of verbatim source use on linguistic complexity in integrated argumentative writing assessment
    Abstract

    Verbatim source use (VSU) in integrated argumentative writing tasks may enhance linguistic complexity of writing performance. This assistance might present an unequal advantage for test-takers across levels of writing proficiency, engendering validity and fairness concerns. While previous research has mostly examined the relationships between source use characteristics and proficiency levels, the relationship between VSU and linguistic complexity remains underexplored. To further unpack these relationships, this study examined both the direct impact of VSU on linguistic complexity of writing performances and its indirect impact through interaction with writing proficiency. Using natural language processing tools and techniques, we examined 34 linguistic complexity features and three VSU features of 3250 argumentative writing performances on a university-level English Placement Test (EPT). We performed exploratory factor analysis to identify linguistic complexity dimensions and applied mixed-effect models to examine how VSU features and proficiency level impacted these dimensions. Post-hoc analyses suggested weak direct impacts of different VSU features on linguistic complexity, which might reflect different essay writing strategies. However, no meaningful indirect impact was found. The findings help unravel the impact of VSU on argumentative writing and provide empirical evidence for validity arguments for integrated writing assessments.

    doi:10.1016/j.asw.2024.100868
  7. Thirty years of writing assessment: A bibliometric analysis of research trends and future directions
    doi:10.1016/j.asw.2024.100862
  8. EvaluMate: Using AI to support students’ feedback provision in peer assessment for writing
    doi:10.1016/j.asw.2024.100864

April 2024

  1. Linguistic factors affecting L1 language evaluation in argumentative essays of students aged 16 to 18 attending secondary education in Greece
    doi:10.1016/j.asw.2024.100844
  2. Writing productivity development in elementary school: A systematic review
    Abstract

    The ability to produce fluent and coherent written text impacts learning and attainments. Valid and reliable assessments of writing are needed to monitor progression, develop goals for writing and identify struggling writers. In order to inform practice and research, a systematic review was conducted to investigate which writing productivity measures captured writing development and identified struggling writers in elementary school. Sixty-seven empirical studies were identified for inclusion, appraised, and their data extracted under the themes of writing genre, duration of writing task, use of priming of topic knowledge prior to the writing assessment, use of planning time, writing modality, gender, age of participants and learning difficulties. Total Number of Words and Correct Word Sequences were the most common means of measuring productivity. Productivity varied significantly between genres and durations of writing tasks and was higher in girls than boys. Students with learning difficulties scored significantly lower in writing productivity when compared to typically developing peers. Insufficient research was available to draw conclusions regarding the effects of priming of topic knowledge, planning and modality on writing productivity. Study limitations, links to the assessment of writing and recommended further research are discussed.

    doi:10.1016/j.asw.2024.100834
  3. Establishing analytic score profiles for large-scale L2 writing assessment: The case of the CET-4 writing test
    doi:10.1016/j.asw.2024.100826
  4. Visualizing formative feedback in statistics writing: An exploratory study of student motivation using DocuScope Write & Audit
    Abstract

    Recently, formative feedback in writing instruction has been supported by technologies generally referred to as Automated Writing Evaluation tools. However, such tools are limited in their capacity to explore specific disciplinary genres, and they have shown mixed results in student writing improvement. We explore how technology-enhanced writing interventions can positively affect student attitudes toward and beliefs about writing, both reinforcing content knowledge and increasing student motivation. Using a student-facing text-visualization tool called Write & Audit, we hosted revision workshops for students (n = 30) in an introductory-level statistics course at a large North American university. The tool is designed to be flexible: instructors of various courses can create expectations and predefine topics that are genre-specific. In this way, students are offered non-evaluative formative feedback which redirects them to field-specific strategies. To gauge the usefulness of Write & Audit, we used a previously validated survey instrument designed to measure the construct model of student motivation (Ling et al., 2021). Our results show significant increases in student self-efficacy and beliefs about the importance of content in successful writing. We contextualize these findings with data from three student think-aloud interviews, which demonstrate metacognitive awareness while using the tool. Ultimately, this exploratory study is non-experimental, but it contributes a novel approach to automated formative feedback and confirms the promising potential of Write & Audit.

    doi:10.1016/j.asw.2024.100830