Assessing Writing

1015 articles

July 2026

  1. Investigating the impact of ChatGPT-assisted self-assessment on college students' writing development: Insights from diverse linguistic backgrounds
    doi:10.1016/j.asw.2026.101061
  2. LAWE-CL2: Multi-agent LLM-based automated writing evaluation system integrating linguistic features with fine-tuning for Chinese L2 writing assessment
    doi:10.1016/j.asw.2026.101051
  3. Anchor is the key: Toward accessible automated essay scoring with large language model through prompting
    doi:10.1016/j.asw.2026.101053
  4. Developing a rating scale for written intralinguistic mediation in a local context
    doi:10.1016/j.asw.2026.101049
  5. Measuring vocabulary use in L2 English and L2 French writing: How methodological decisions shape the results
    doi:10.1016/j.asw.2026.101039
  6. Educator perspectives on automated writing scoring and feedback for young language learners: Applying a fairness and justice lens
    doi:10.1016/j.asw.2026.101050
  7. A multidimensional approach to the impact of achievement emotions on high school students' L2 writing performance
    doi:10.1016/j.asw.2026.101048
  8. Accuracy and fairness of generative AI in automated essay scoring: Comparing GPT-4o, feature-based models, and human raters
    doi:10.1016/j.asw.2026.101047

April 2026

  1. Cubic effects of autonomous and controlled motivation on L2 self-regulated writing strategies: A polynomial regression analysis
    doi:10.1016/j.asw.2026.101046
  2. ChatGPT feedback and emotional engagement in L2 writing: A control-value theory perspective using Q-methodology
    doi:10.1016/j.asw.2026.101045
  3. Exploring the roles of gender, linguistic, and cognitive variables in continuation writing task performance among learners of English
    doi:10.1016/j.asw.2026.101043
  4. Generative artificial intelligence for automated writing evaluation: A systematic review of trends, efficacy, and challenges
    doi:10.1016/j.asw.2026.101041
  5. Task complexity, collaborative writing, and learner engagement: Examining second language learners’ writing performance
    doi:10.1016/j.asw.2026.101042
  6. Pursuing fair writing assessment: Halo effects in primary school foreign language writing in grade six
    Abstract

    Assessing the writing competence of pupils learning English as a foreign language (EFL) at primary school poses specific challenges because of learners’ limited language resources. This study investigates the extent to which characteristics of their texts trigger so-called halo effects, an assessment bias in which the quality of one feature unintentionally influences the evaluation of other aspects. The study examines halo effects across nine aspects of text quality (communicative effect, level of detail, coherence, cohesion, complexity of syntax and grammar, correctness of syntax and grammar, vocabulary, orthography, and punctuation), based on a random sample of narrative texts from a sixth-grade corpus. 200 pre-service teachers assessed four randomly assigned texts. Halo effects were calculated by comparison to expert ratings using multi-level regression analyses. Results show that orthography and vocabulary were the two main triggers of halo effects; punctuation also triggered some halo effects, but to a smaller extent. The assessment of communicative effect and of the complexity and correctness of syntax and grammar was not determined by the corresponding text quality but was dominated by other criteria. The results highlight the importance of being aware of halo effects when assessing young EFL learners’ texts and emphasise the need for suitable training measures.
    • Analysis of halo effects across nine aspects of text quality.
    • Random sample of narrative texts from a sixth-grade EFL corpus.
    • Orthography and vocabulary are the two main triggers of halo effects.
    • Punctuation also triggers halo effects but to a smaller extent.
    • Halo effects call for awareness and targeted training.

    doi:10.1016/j.asw.2026.101036
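The halo-effect logic of the study above can be sketched in code. The example below is a deliberately simplified, single-level OLS analogue of the multi-level regressions the abstract describes, run on simulated data; the criterion names and effect sizes are invented for illustration.

```python
import numpy as np

def halo_weights(novice_ratings, expert_scores):
    # OLS regression of novice ratings for ONE criterion on expert
    # scores for ALL criteria; nonzero weights on the other criteria
    # are the halo-effect signature.
    X = np.column_stack([np.ones(len(expert_scores)), expert_scores])
    coef, *_ = np.linalg.lstsq(X, novice_ratings, rcond=None)
    return coef[1:]  # drop the intercept

rng = np.random.default_rng(0)
n = 400
# Columns: expert scores for [target criterion, orthography, vocabulary].
expert = rng.normal(size=(n, 3))
# Simulated halo: novice ratings of the target criterion also leak in
# orthography (weight 0.5) and vocabulary (weight 0.3) quality.
novice = (0.4 * expert[:, 0] + 0.5 * expert[:, 1] + 0.3 * expert[:, 2]
          + rng.normal(scale=0.2, size=n))
weights = halo_weights(novice, expert)
print(np.round(weights, 2))  # approximately [0.4, 0.5, 0.3]
```

A weight near zero on everything except the target criterion would indicate no halo; here the simulated leakage from orthography and vocabulary is recovered.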
  7. Can AI provide useful analytic essay scoring for different genres of writing with elementary grade students?
    doi:10.1016/j.asw.2026.101038
  8. From spelling to content: The influence of spelling quality on text assessment
    doi:10.1016/j.asw.2026.101014
  9. How do L2 writing subskills interact hierarchically? Insights from diagnostic classification models
    Abstract

    This study examined the hierarchical structure among second/foreign language (L2) writing subskills using a Hierarchical Diagnostic Classification Model (HDCM). A pool of 500 essays composed by English as a Foreign Language (EFL) students was assessed by four experienced EFL teachers using the Empirically-derived Descriptor-based Diagnostic (EDD) checklist. Based on a literature review and the expertise of three content experts, several models were developed to reflect various hierarchical interactions among L2 writing subskills, including linear, divergent, convergent, independent, unstructured, mixed, and higher-order. The comparison of the models showed the presence of an unstructured interaction among L2 writing subskills, indicating that content is the foundational subskill for the mastery of vocabulary, grammar, organization, and mechanics. Higher mastery classes were also associated with higher educational levels, greater frequency of English use, and longer exposure to the L2. Understanding the hierarchical relationships among L2 writing subskills can improve targeted instructional strategies and assessment practices.
    • Hierarchical DCMs represent a constrained version of existing DCMs.
    • Models were developed to show hierarchical interactions among L2 writing subskills.
    • An unstructured interaction among L2 writing subskills was identified.
    • Higher mastery classes were associated with higher educational levels.
    • The classes were also associated with greater English use and longer L2 exposure.

    doi:10.1016/j.asw.2026.101029
  10. Evaluating the consistency between human raters and three AI systems on the scoring of argumentative essays
    doi:10.1016/j.asw.2026.101031
  11. Contributions of working memory capacity and mental set shifting to second language (English) writing performance
    doi:10.1016/j.asw.2026.101035
  12. Assessing GenAI-assisted digital multimodal composing: Reconceptualizing a genre-based framework through self-assessment and peer assessment
    doi:10.1016/j.asw.2026.101017
  13. Associations of adolescents’ argumentative writing scores and growth when evaluated by different human raters and artificial intelligence models
    doi:10.1016/j.asw.2026.101015
  14. Developing students’ feedback literacy in disciplinary academic writing through generative artificial intelligence
    doi:10.1016/j.asw.2026.101030
  15. Conceptual, rhetorical and linguistic transformations: Assessing L2 literature review writing using simulated tasks
    doi:10.1016/j.asw.2026.101013
  16. Assessing fairness in finetuned scoring models with demographically restricted training data
    Abstract

    The increasing adoption of automated essay scoring (AES) in high-stakes educational contexts necessitates careful examination of potential biases within the systems. This study investigates how the demographic composition of training data influences fairness in AES systems developed from finetuned large language models (LLMs). Using the PERSUADE corpus of 26,000 student essays, we conducted a systematic analysis using demographically restricted training sets to isolate the impact of training data demographics on LLM-AES performance. Each demographically restricted training set comprised essays written by one racial/ethnic group. Four variants of a Longformer-based AES were developed: one trained on demographically balanced data and three trained on demographically restricted datasets. An initial analysis of the human ratings indicated that demographic factors significantly predict human essay scores (marginal R² = 0.125), a pattern that is paralleled in national writing assessment data. LLM-AES systems trained on demographically restricted data exhibited small systematic biases (marginal R² = 0.043). However, the LLM trained on balanced data showed minimal demographic bias, suggesting that representative training data can effectively prevent amplification of demographic disparities beyond those present in human ratings. These results highlight both the importance and limitations of training data diversity in achieving fair assessment outcomes.
    • 12.5% of variance in human essay ratings was explained by demographics.
    • We construct demographically restricted training sets to isolate bias.
    • Balanced training data minimized LLM-AES bias across demographic groups.
    • LLM-AES trained on demographically restricted data showed more bias.

    doi:10.1016/j.asw.2026.101032
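The marginal R² reported above for demographic predictors can be illustrated, in a much simpler setting, with eta-squared: the share of score variance attributable to group membership. This is a toy sketch on simulated data with invented group gaps, not the study's mixed-model analysis.

```python
import numpy as np

def group_variance_explained(scores, groups):
    # Eta-squared: between-group sum of squares over total sum of
    # squares, i.e. the share of score variance tied to group membership.
    scores = np.asarray(scores, dtype=float)
    grand_mean = scores.mean()
    ss_total = ((scores - grand_mean) ** 2).sum()
    ss_between = sum(
        (groups == k).sum() * (scores[groups == k].mean() - grand_mean) ** 2
        for k in np.unique(groups)
    )
    return ss_between / ss_total

rng = np.random.default_rng(1)
groups = rng.integers(0, 3, size=3000)            # three hypothetical groups
group_gap = np.array([0.0, 0.4, -0.4])[groups]    # invented group offsets
scores = 3.0 + group_gap + rng.normal(size=3000)  # invented essay scores
eta2 = group_variance_explained(scores, groups)   # roughly 0.10 here
```

A balanced-training intervention like the one the study describes aims to keep the model's scores from widening gaps like these beyond what is already present in the human ratings.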
  17. Unveiling the complex interactions of mindset, emotions, and self-regulated learning in EFL writing: A latent profile and network analysis
    doi:10.1016/j.asw.2026.101037
  18. An ecological approach to L2 learners’ engagement with written feedback
    doi:10.1016/j.asw.2026.101028
  19. Aligning ACTFL writing proficiency guidelines with CEFR descriptors: Insights from Chinese writing assessment
    doi:10.1016/j.asw.2026.101033
  20. How writing prompts influence analytic trait scores: A differential feature functioning analysis for English language learners
    doi:10.1016/j.asw.2026.101018
  21. L2 learners’ engagement with AI-generated feedback on writing
    doi:10.1016/j.asw.2026.101020
  22. The impact of ChatGPT’s feedback on L2 Chinese learners’ writing outcome, confidence, and emotions: A mixed-method quasi-experimental study
    doi:10.1016/j.asw.2026.101027

January 2026

  1. The effects of online resource use on L2 learners’ computer-mediated writing processes and written products
    Abstract

    While previous studies on online resource use in L2 writing have focused on the overall writing quality, limited attention has been paid to its effects on linguistic complexity and real-time writing processes. Addressing this gap, the present study explored how online resource use influences both the processes and products of L2 writing. Forty-nine intermediate L2 learners completed two computer-mediated argumentative writing tasks, either with or without the use of online resources. Writing behaviors were captured via keystroke logging and screen recording, and analyzed for search activity, fluency, pausing, and revision quantity. Cognitive processes were examined through stimulated recall interviews, and written products were evaluated for both quality and linguistic complexity. The results showed that participants spent an average of 14 % of task time using online resources, with considerable individual variation. Mixed-effects modeling revealed that resource use facilitated the production of more sophisticated words, with marginal influence on writing quality or syntactic complexity. Resource use was also associated with longer between-word pauses, fewer within-word pauses, and reduced revisions. These findings highlight the potential of online resource use to enhance the authenticity of L2 writing assessment tasks without compromising test validity, while encouraging the use of more advanced vocabulary in writing.
    • Learners spent 14 % of the total writing task time using online resources.
    • Online resource use had no significant impact on L2 writing quality.
    • Online resource use improved lexical sophistication, not syntactic complexity.
    • Online resource use reduced within-word pauses and aided spelling retrieval.
    • Online resource use led to fewer revisions but did not affect fluency.

    doi:10.1016/j.asw.2025.100994
  2. Verb-centric or balanced?: An NLP-based assessment of word class contributions to L2 writing proficiency
    doi:10.1016/j.asw.2025.100997
  3. Beyond the page: A multimodal self-efficacy framework for assessing L2 digital-academic writing
    doi:10.1016/j.asw.2025.101010
  4. Generative artificial intelligence for automated essay scoring: Exploring teacher agency through an ecological perspective
    Abstract

    Generative artificial intelligence (AI) is increasingly used in writing assessment, particularly for automated essay scoring (AES) and for generating formative feedback within automated writing evaluation (AWE). While AI-driven AES enhances efficiency and consistency, concerns regarding accuracy, bias, and ethical implications raise critical questions about its role in assessment. This paper examines the impact of generative AI on teacher agency through an ecological perspective, which considers agency as shaped by personal, institutional, and sociocultural factors. The analysis highlights the need for teachers to critically mediate AI-generated scores and feedback to align them with pedagogical goals, ensuring AI functions as an assistive tool rather than a determinant of assessment outcomes. Although AI can streamline assessment, over-reliance risks diminishing teachers’ evaluative expertise and reinforcing biases embedded in AI systems. Ethical concerns, including transparency, data privacy, and fairness, further complicate its adoption. To address these challenges, this paper proposes a framework for responsible AI integration that prioritizes bias mitigation, data security, and teacher-driven decision-making. The discussion concludes with pedagogical implications and directions for future research on AI-assisted writing assessment.
    • Teachers can actively mediate AI-generated scores to maintain agency.
    • Dependence on AES may weaken teachers’ evaluative skills.
    • Bias, data privacy, and AI opacity can undermine teachers’ decision-making.
    • AI literacy and hybrid assessment models can promote teacher autonomy.
    • A framework for protecting teacher agency in generative AI–based AWE is presented.

    doi:10.1016/j.asw.2025.100990
  5. Unveiling the antecedents of feedback-seeking behavior in L2 writing: The impact of future L2 writing selves and emotions
    Abstract

    While existing research on second or foreign language (L2) writing feedback has predominantly focused on the effectiveness of various feedback practices and their impacts on writing performance, limited attention has been devoted to learners’ proactive role in seeking feedback, and how this important yet underexplored construct correlates with conative and affective variables remains insufficiently examined. To help fill that void, we explored the concept of feedback-seeking behavior and its antecedents in L2 writing by examining its correlations with future L2 writing selves and emotions, in particular unpacking the mediating effect of emotions in the emotion-driven chain of “motivation→emotion→increased or decreased behavior” among 225 undergraduate English majors. Structural equation modeling revealed that ideal and ought-to L2 writing selves directly and significantly influenced emotions, and that emotions significantly impacted the two dimensions of feedback-seeking behavior. More importantly, the ideal L2 writing self indirectly influenced feedback monitoring and feedback inquiry through the mediation of writing enjoyment. Nevertheless, writing boredom exercised no significant mediating effect between future L2 selves and feedback-seeking behavior. These findings reinforce the learner-centered perspective that positions students as proactive agents and provide notable implications for L2 writing instruction, advancing our understanding of teacher feedback.
    • Learners with heightened L2 selves deployed more feedback-seeking strategies.
    • Experiencing L2 enjoyment fostered distinct feedback-seeking behaviors.
    • L2 boredom played no mediating role in the link between L2 selves and behavior.
    • More high-quality research treating L2 learners as proactive agents is needed.

    doi:10.1016/j.asw.2025.101009
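The mediation chain in the abstract above (motivation → emotion → behavior) can be miniaturized as a product-of-coefficients estimate of one indirect path. The sketch below runs on simulated data; the variable names are invented, and it is a toy analogue of a single SEM path, not the study's full model.

```python
import numpy as np

def indirect_effect(x, m, y):
    # Product-of-coefficients mediation: a = slope of mediator m on x;
    # b = slope of outcome y on m, controlling for x; indirect = a * b.
    a = np.polyfit(x, m, 1)[0]
    X = np.column_stack([np.ones_like(x), m, x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    b = coef[1]
    return a, b, a * b

rng = np.random.default_rng(3)
n = 1000
ideal_self = rng.normal(size=n)   # hypothetical "ideal L2 writing self"
enjoyment = 0.6 * ideal_self + rng.normal(scale=0.5, size=n)
seeking = (0.5 * enjoyment + 0.1 * ideal_self
           + rng.normal(scale=0.5, size=n))
a, b, ab = indirect_effect(ideal_self, enjoyment, seeking)
# ab recovers the built-in indirect path of 0.6 * 0.5 = 0.3
```

A nonzero a * b is the signature of mediation: the motivational variable moves the emotion, and the emotion in turn moves the behavior.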
  6. The relation between linguistic accuracy and scoring of Swedish EFL students’ writing during a high-stakes exam
    Abstract

    This paper examines the effect of linguistic accuracy (i.e., the absence of form, grammatical, and lexical errors) on scoring during the high-stakes national test of English in Swedish upper secondary school. Teachers are expected to score their own students’ texts with the help of assessment instructions containing benchmark texts (i.e., texts representing different score bands). The assessment instructions and the score bands provided to guide scoring are not explicit about how accuracy should influence scores. Two research questions were addressed: As measured by ordinal regression, to what extent does linguistic accuracy predict rater scores? Do the texts scored by teachers reflect the graded example texts in terms of how linguistic accuracy predicts scores? The results revealed, amongst other things, that the overall frequency of errors in a text significantly predicted scores, with the model explaining approximately 58 % of the variance in the outcome variable according to Nagelkerke’s pseudo R-squared. Accuracy also had a similar effect on scores in texts rated by teachers as in the benchmark texts. It was concluded that accuracy may have more of an impact on scores than constructs that are more explicit components of the score bands, such as lexical complexity.

    doi:10.1016/j.asw.2025.100995
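Nagelkerke's pseudo R-squared, the statistic cited in the abstract above, has a simple closed form: the Cox-Snell R² rescaled by its maximum attainable value. A minimal implementation, with invented toy log-likelihoods:

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    # Cox-Snell pseudo R-squared, rescaled by its maximum attainable
    # value so the statistic can reach 1 (Nagelkerke's correction).
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell

# Invented toy values: 100 texts, null vs. fitted log-likelihoods.
r2 = nagelkerke_r2(ll_null=-60.0, ll_model=-30.0, n=100)
print(round(r2, 3))  # 0.646
```

The rescaling matters for ordinal and logistic models because the raw Cox-Snell statistic cannot reach 1 even for a perfect fit.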
  7. Unacclimatized?: Understanding the potential of labor-based contract grading interventions in Chinese EFL writing contexts
    doi:10.1016/j.asw.2025.100993
  8. Extracting interpretable writing traits from a large language model
    Abstract

    Large language models (LLMs) are increasingly used to support automated writing evaluation (AWE), both for scoring and for feedback. However, LLMs present challenges to interpretability, making it hard to evaluate the construct validity of scoring and feedback models. BIOT (best interpretable orthogonal transformations) is a new method of analysis that makes dimensions of an embedding interpretable by aligning them with external predictors. It was originally developed to improve the interpretability of multidimensional scaling models. This paper shows that BIOT can be used to align LLM embeddings with an interpretable writing trait model, developed using multidimensional analysis of classical NLP features, that measures latent dimensions of writing style and writing quality. This makes it possible to determine whether an AWE model built using an LLM is aligned with known (and construct-relevant) dimensions of textual variation, supporting construct validity. Specifically, we examine the alignment between the hidden layers of DeBERTa, a small LLM that has been shown to be useful for a variety of natural language processing applications, and a writing trait model developed through factor analysis of classical features used in existing AWE models. Specific dimensions of the transformed DeBERTa layers are strongly correlated with these classical factors. When the transformation matrix derived using BIOT is applied to token vectors, it is also possible to visualize which tokens in the original text contributed to high or low scores on a specific dimension.
    • Large language models (LLMs) are increasingly used to support automated writing evaluation (AWE).
    • LLMs present challenges to interpretability, making it hard to evaluate the construct validity of scoring and feedback models.
    • BIOT is a new interpretation method that aligns embedding dimensions with external predictors.
    • BIOT can be used to align LLM embeddings with classical NLP measures of style and writing quality.
    • This demonstrates a general method to determine whether an LLM latently represents construct-relevant dimensions.

    doi:10.1016/j.asw.2025.101011
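BIOT itself solves a more elaborate alignment problem, but its core move, rotating embedding dimensions toward external interpretable predictors with an orthogonal transformation, can be sketched with plain orthogonal Procrustes. The example below recovers a hidden rotation from simulated data; it is an illustrative analogue, not the BIOT algorithm.

```python
import numpy as np

def orthogonal_alignment(E, T):
    # Orthogonal Procrustes: the rotation Q minimizing ||E @ Q - T||_F,
    # computed from the SVD of E^T T (the polar factor).
    U, _, Vt = np.linalg.svd(E.T @ T)
    return U @ Vt

rng = np.random.default_rng(2)
E = rng.normal(size=(500, 4))                      # stand-in embedding (n x d)
hidden, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # hidden orthogonal mixing
T = E @ hidden                                     # "interpretable" trait scores
Q = orthogonal_alignment(E, T)                     # recovers the hidden rotation
```

Because Q is orthogonal, the rotated dimensions carry exactly the same information as the original embedding; only the basis is made interpretable, which is the property that makes the approach useful for construct-validity checks.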
  9. How reliable and valid is peer evaluation in adolescents’ L2 argumentative writing?
    Abstract

    Peer evaluation is widely recognized for its educational benefits; however, its reliability and validity, particularly among adolescent second-language (L2) writers at the early stages of English language and literacy development, remain insufficiently explored. This explanatory sequential mixed-methods study investigated the reliability and validity of peer evaluation in English argumentative writing among 35 Grade 10 and 37 Grade 12 students from a public high school in Beijing, China. Twelve of the participating students (six at each grade) were interviewed about the validity, reliability, and value of peer evaluation. The findings indicated that peer evaluations demonstrated high levels of reliability and validity, with peer-assessed writing scores closely aligning with inter-teacher assessments. Notably, variations were observed among Grade 10 students, particularly in the evaluation of lower-order writing skills, such as grammar and vocabulary, which exhibited reduced validity. These results underscore the potential of peer evaluation in assessing higher-order content-level writing across varying levels of L2 English writing proficiency. The study also highlights areas where adolescent L2 writers may require additional support to enhance the effectiveness of peer evaluation practices in English argumentative writing. Implications for improving English argumentative writing instruction and refining peer evaluation strategies in high school L2 English classrooms are discussed.
    • Peer evaluation shows high reliability, similar to inter-teacher rating.
    • Peer evaluation works well for higher-order skills in L2 argumentative writing.
    • 10th graders struggled with evaluating lower-order skills like grammar.
    • 12th graders evaluate lower- and higher-order skills with greater validity than 10th graders.

    doi:10.1016/j.asw.2025.100992
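Alignment between peer and teacher scores of the kind reported above is often summarized with quadratic weighted kappa (QWK), a standard agreement statistic in writing assessment, though not necessarily the analysis used in this study. A self-contained sketch with invented scores on a 0-4 scale:

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, n_levels):
    # Chance-corrected agreement with quadratic penalties for larger
    # score disagreements; 1 = perfect agreement, ~0 = chance level.
    observed = np.zeros((n_levels, n_levels))
    for i, j in zip(rater_a, rater_b):
        observed[i, j] += 1
    expected = np.outer(observed.sum(1), observed.sum(0)) / observed.sum()
    idx = np.arange(n_levels)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_levels - 1) ** 2
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()

# Invented peer and teacher scores on a 0-4 scale.
peer    = [3, 2, 4, 1, 3, 2, 4, 0, 3, 2]
teacher = [3, 2, 3, 1, 4, 2, 4, 0, 3, 1]
qwk = quadratic_weighted_kappa(peer, teacher, n_levels=5)
print(round(qwk, 2))  # 0.9
```

The quadratic weights penalize a two-band disagreement four times as heavily as a one-band disagreement, which matches how score-band misses are usually treated in essay rating.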
  10. Assessing the effects of task complexity on cognitive demands in L2 writing
    Abstract

    The assessment of task-generated cognitive demands has been receiving increasing attention in task complexity research. However, scant attention has been paid to assessing cognitive demands when task complexity is manipulated along both resource-directing and resource-dispersing dimensions. To address this gap, the present study investigated the relative effects of reasoning demands and prior knowledge on cognitive demands in L2 writing. Eighty-eight EFL students completed two letter-writing tasks with varying reasoning demands under one of two conditions: with or without prior knowledge available. Cognitive demands were assessed with a post-task questionnaire, a dual-task method, and open-ended questions. The results revealed that reasoning demands and prior knowledge were strong determinants of cognitive demands, providing empirical evidence for Robinson’s Cognition Hypothesis. Moreover, the post-task questionnaire, the dual-task method, and the open-ended questions were found to assess distinct aspects of cognitive demands, which highlights the importance of data triangulation in exploring task complexity effects. The study offers language teachers and assessors implications for task design and implementation.
    • How reasoning demands and prior knowledge affect cognitive demands was underexplored.
    • Cognitive demands were assessed by both quantitative and qualitative methods.
    • Findings supported some assumptions underlying Robinson’s framework.
    • The independent measures assessed distinct aspects of cognitive demands.

    doi:10.1016/j.asw.2025.100998
  11. Assessing EFL students’ GenAI-assisted writing: Teachers’ pains, perceptions and practices
    doi:10.1016/j.asw.2025.100996
  12. Editorial Board
    doi:10.1016/s1075-2935(26)00012-7
  13. Development of a Genre Adherence Rubric (GAR) for applied linguistics research articles
    doi:10.1016/j.asw.2025.100991
  14. Assessing the effects of explicit coherence instruction on EFL students’ integrated writing performance
    Abstract

    As a key attribute of effective writing, coherence remains challenging to teach in language classrooms, with traditional writing instruction frequently overlooking coherence in favor of discrete, rule-based features. This mixed-methods study investigates the effectiveness of explicit coherence instruction on English-as-a-Foreign-Language (EFL) students’ performance on integrated writing tasks. The study employed a controlled experimental design with 64 upper-intermediate-level undergraduate students at a Chinese university, drawing on Hasan’s Cohesive Harmony theory as the theoretical framework. Half of the participants (n = 32) in the experimental group received explicit instruction on coherence with a focus on cohesive chains and cohesive devices in integrated writing, while the control group (n = 32) received standard paraphrasing instruction. Quantitative analysis revealed that the experimental group showed significant improvements in coherence scores and multiple cohesive chain measures. Qualitative discourse analysis of six students’ writing samples from the experimental group demonstrated varying levels of improvement in writing coherence, with high-performing students showing better use of identity chains and pronoun references. The findings revealed that explicit instruction on coherence significantly improved students’ performance in creating coherent integrated writing, particularly through the development of cohesive chains and appropriate use of cohesive devices. This study underscores the pedagogical value of teaching coherence to enhance writing quality and provides concrete strategies for developing more effective teaching approaches for integrated writing tasks in EFL contexts.
    • The study examined 64 Chinese EFL students using mixed-methods experimental design.
    • Cohesive Harmony theory served as the framework for assessing writing coherence.
    • Explicit instruction significantly improved coherence in integrated writing tasks.
    • High-performing students demonstrated superior identity chain development.

    doi:10.1016/j.asw.2026.101019
  15. Is it beneficial to strive for perfection in writing?: Exploring the relationship between perfectionism, motivational regulation, and second language (L2) writing performance
    Abstract

    Perfectionism, a personality trait characterized by the pursuit of flawlessness and high personal standards, and motivational regulation, the strategies through which individuals manage their motivational states, have received limited attention in second language (L2) writing. Framed within social cognitive theory, this study examines how two dimensions of perfectionism—perfectionistic strivings and perfectionistic concerns—relate to writing performance (syntactic complexity, accuracy, lexical complexity, and fluency) and how motivational regulation sub-strategies (interest enhancement, self-talk, and emotional control) mediate these relationships. Data from 689 university students in China were analyzed using questionnaires and argumentative writing samples. Results indicated that perfectionistic strivings positively predicted syntactic complexity, accuracy, and lexical complexity, while perfectionistic concerns negatively predicted these dimensions; neither dimension significantly affected fluency. Crucially, motivational regulation sub-strategies partially mediated the relations between perfectionism and writing performance. These findings underscore the importance of distinguishing perfectionism dimensions and targeting motivational regulation strategies to improve L2 writing. Implications for instruction and directions for future longitudinal research are discussed.
    • Perfectionistic strivings and concerns affect writing via motivational regulation.
    • Strivings improve syntax, accuracy, and lexical complexity; concerns hinder them.
    • Most motivational regulation sub-strategies mediate perfectionism’s impact on CALF.
    • Perfectionism influences writing through motivational regulation.

    doi:10.1016/j.asw.2025.101012
  16. Volume 67 editorial
    doi:10.1016/j.asw.2026.101016

October 2025

  1. Editorial
    doi:10.1016/j.asw.2025.100999
  2. The effect of metacognitive instruction with indirect written corrective feedback on secondary students’ engagement and functional adequacy in L2 writing
    doi:10.1016/j.asw.2025.100962
  3. Investigating a customized generative AI chatbot for automated essay scoring in a disciplinary writing task
    doi:10.1016/j.asw.2025.100959
  4. Improving writing feedback quality and self-efficacy of pre-service teachers in Gen-AI contexts: An experimental mixed-method design
    doi:10.1016/j.asw.2025.100960