Assessing Writing
1016 articles
October 2025
-
Abstract
Assessing the writing competence of pupils learning English as a foreign language (EFL) at primary school is challenging. This study aimed to examine a largely unexplored topic, namely the role of text characteristics in writing assessment, and analysed judgment accuracy differentiated by nine aspects of text quality (communicative effect, level of detail, coherence, cohesion, complexity of syntax and grammar, correctness of syntax and grammar, vocabulary, orthography and punctuation). Two hundred pre-service teachers each assessed four randomly assigned texts from learners in grade six. Their assessments were compared with the existing ratings of two experts from a previous study. We found a relative judgment accuracy between r = .34 and .60 for the nine assessment criteria, with vocabulary being assessed significantly more accurately than almost all other criteria. Orthography, punctuation, and the complexity and correctness of syntax and grammar were rated significantly more accurately than cohesion, level of detail, communicative effect and coherence. The pre-service teachers assessed most criteria more strictly and with higher variability than the experts. The results suggest that teacher education should offer pre-service teachers concrete opportunities to practise writing assessment, implement activities to strengthen the assessment of content- and structure-related criteria, and help them adjust their assessment rigour.
• Judgment accuracy in the assessment of primary school EFL learners’ texts.
• Relative judgment accuracy between r = .34 and .60 for the different criteria.
• Significant differences in relative judgment accuracy between assessment criteria.
• Linguistic text qualities are assessed with more accuracy than content- and structure-related aspects.
• Pre-service teachers are more rigorous and heterogeneous in rating than experts.
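As a rough illustration of the quantities this abstract reports, the sketch below compares one rater's scores with expert scores on a single criterion. It assumes that relative judgment accuracy is a Pearson correlation between the two sets of ratings, that higher scores mean better quality (so a negative mean difference indicates a stricter rater), and that variability can be summarised by a ratio of standard deviations. The function names and the ratings are invented for illustration and are not taken from the study.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of ratings."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def judgment_accuracy(rater, expert):
    """Summarise how one rater's scores relate to the expert scores.

    Assumed (hypothetical) indicators:
    - relative_accuracy: correlation with the expert ratings
    - level_difference: rater mean minus expert mean (negative = stricter,
      assuming higher scores mean better texts)
    - sd_ratio: rater spread divided by expert spread (> 1 = more variable)
    """
    return {
        "relative_accuracy": pearson_r(rater, expert),
        "level_difference": mean(rater) - mean(expert),
        "sd_ratio": pstdev(rater) / pstdev(expert),
    }


# Invented vocabulary ratings for four texts (not data from the study).
rater_vocab = [2, 3, 1, 4]
expert_vocab = [3, 4, 2, 4]
result = judgment_accuracy(rater_vocab, expert_vocab)
print(result)
```

In a design like the one described, such indicators would be computed per criterion and then compared across criteria; the significance tests the abstract mentions are outside the scope of this sketch.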
-
Exploring the scoring validity of holistic and dimension-based Comparative Judgements of young learners’ EFL writing
Abstract
Comparative Judgement (CJ) is a pairwise comparison evaluation method, typically conducted online. Multiple judges each compare the quality of a series of paired performances and, from their decisions, a rank order is constructed and scores are calculated. Research across different educational contexts supports CJ’s reliability for evaluating written performances, permitting more precise scoring of scripts and dimension-focused evaluation. However, scant insights are available about the basis of judges’ evaluations. This issue is important because argument-based approaches to validation (common in the field of language testing and adopted in this study) require evidence to support claims that scores are appropriate for the test purpose. Therefore, we investigate the scoring validity of CJ, both when used holistically (the standard application of CJ) and when evaluating scripts by individual criteria (termed dimensions in the research context). Twenty-seven judges evaluated 300 scripts addressing two writing task types in a national English as a Foreign Language examination for young learners in Austria. Judges reported via questionnaires what they had focused on while judging. Subsequently, eight judges provided think-aloud data while evaluating 157 scripts, offering further insight into the writing features they considered and their decision-making during CJ. Findings showed that while most judges adopted a decision-making process similar to traditional rating methods, some adapted their method to accommodate the nature of CJ evaluation. Furthermore, results indicated that the judges considered construct-relevant criteria when using CJ, both holistically and by dimension, thus offering support to an argument for the appropriateness of using CJ in this context.
• Comparative Judgement can offer an alternative to analytic rating of EFL writing.
• Judges with teaching or rating experience largely focus on relevant text features.
• Some judges adopt a decision-making process that appears well suited to CJ.
• Dimension-based CJ has the potential to provide richer feedback than holistic CJ.
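The step from pairwise decisions to a rank order and scores can be sketched as follows. The abstract does not say which statistical model was used; a Bradley-Terry model, a common choice for CJ, is assumed here, fitted with a simple iterative update. The function name `bradley_terry`, the script labels, and the decisions are all hypothetical.

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200):
    """Estimate script strengths from (winner, loser) decisions.

    Assumes every script wins at least once and all scripts are linked
    through comparisons; otherwise the estimates are not well defined.
    """
    wins = defaultdict(int)          # number of wins per script
    pair_counts = defaultdict(int)   # number of comparisons per pair
    scripts = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        scripts.update((winner, loser))

    strength = {s: 1.0 for s in scripts}
    for _ in range(iterations):
        updated = {}
        for s in scripts:
            # Sum over every opponent o that s was compared with.
            denom = sum(
                n / (strength[s] + strength[o])
                for pair, n in pair_counts.items() if s in pair
                for o in pair - {s}
            )
            updated[s] = wins[s] / denom
        total = sum(updated.values())
        strength = {s: v / total for s, v in updated.items()}  # normalise
    return strength


# Invented decisions: each tuple is (preferred script, other script).
decisions = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
strength = bradley_terry(decisions)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking, strength)
```

For dimension-based CJ, the same fit would presumably be run once per dimension on the decisions made for that dimension, giving a separate rank order for each.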
-
Which gender provides more specific peer feedback? Gender and assessment training’s effects on peer feedback specificity and intrapersonal factors
Abstract
This study investigated the effects of assessor gender (male vs. female), fictitious assessee gender (male vs. female), and assessment training (with vs. without) on peer feedback specificity (i.e., localisation and focus) and intrapersonal factors (i.e., trust in the self as an assessor and discomfort). The study involved 240 undergraduate psychology students (120 men, 120 women), with half receiving assessment training and the other half receiving only the task instructions. Participants were divided into eight subgroups based on training condition and self-reported gender and provided peer feedback in Eduflow on three writing samples (poor, average, excellent quality) attributed to fictitious male or female peer assessees. A total of 3017 peer feedback segments were analysed, revealing that male and female assessors, whether trained or untrained, were comparable in most peer feedback specificity categories when assessing fictitious male or female assessees. Nonetheless, female assessors excelled in certain categories of peer feedback specificity, while male assessors demonstrated strengths in others. Results also showed that assessors who received assessment training provided localised peer feedback in all the writing samples. Finally, gender and training did not affect participants’ trust in their abilities or their (dis)comfort when providing peer feedback.
July 2025
-
Abstract
Grammatical metaphor (GM) (Halliday, 1985) is essentially the shift by which a writer recasts an action or quality as a ‘thing’. As in most senses of metaphor, the goal is to “represent something as something else” (McGrath & Liardét, 2023, p. 33). This study investigated the use of GM in Linguaskill writing exam responses across CEFR proficiency levels (below-B1 to C1 or above). It used a pre-existing GM list (see McGrath & Liardét, 2023) to measure GM frequency in L2 responses, examined the correlation between GM frequency and proficiency scores, and qualitatively explored how candidates used GMs in their responses. Results show a moderate positive correlation between proficiency and GM use, with a dominance of process-to-thing shifts (e.g., transform→transformation) and emergence of GM use from lower to higher proficiency levels. This underscores GM's significance in crafting academically valued meanings in L2 contexts, suggesting its potential for informing instructional and assessment practices.
• Metaphorisation in writing is a useful metric for L2 writing assessment.
• Evidence suggests GM frequency correlates with increased performance.
• Learners progress from emergent arguments to presenting ideas more concisely.
• The majority of GM shifts were to ‘things’.
• The study provides further weight to arguments for meaning-based complexity.
-
Abstract
This paper introduces ASAP 2.0, a dataset of ∼25,000 source-based argumentative essays from U.S. secondary students. The corpus addresses the shortcomings of the original ASAP corpus by including demographic data, consistent scoring rubrics, and source texts. ASAP 2.0 aims to support the development of unbiased, sophisticated Automatic Essay Scoring (AES) systems that can foster improved educational practices by providing summative feedback to students. The corpus is designed for broad accessibility with the hope of facilitating research into writing quality and AES system biases.
• We introduce the ASAP 2.0 corpus.
• The corpus contains over 25,000 source-based essays.
• Each essay is scored for overall writing quality.
• The corpus can be used to computationally and quantitatively model source-based writing quality.
April 2025
-
Towards a better understanding of integrated writing performance: The influence of literacy strategy use and independent language skills
Abstract
This study explores how literacy strategy use and independent language skills (e.g., reading and writing) influence integrated writing (IW) performance. A total of 322 Secondary Four students from four schools in Hong Kong completed single-text reading, multiple-text reading, independent writing, and IW tasks, along with questionnaires investigating their reading strategy use and IW strategy use. Path analyses revealed that multiple-text reading and independent writing had comparable significant impacts on IW, mediating the influence of single-text comprehension. In addition, reading strategy use impacted IW indirectly through independent literacy skills and IW strategy use, while IW strategies exerted a direct influence on IW. Our findings underscore the critical role of language skills in mediating the influence of reading strategies on IW performance among young first language (L1) learners. The implications for research and practice are discussed, emphasizing the complexity of the IW construct and the need for balanced language skills and strategy instruction to enhance IW task performance.
• A novel exploration of concurrent effects of strategies and independent skills on IW.
• Multiple-text reading and independent writing directly influence IW performance.
• Independent skills mediate the impact of reading strategies on IW performance.
• Reading strategy use indirectly affects IW through independent skills and IW strategy use.
• Balanced language skills and strategy instruction are crucial for IW performance.
-
Abstract
In line with current trends in higher education, EAP programmes are held accountable for preparing students for higher education and assessing their readiness to access it. Thus, multimodal tasks including integrated writing (IW) assessments have seen a resurgence because they arguably closely mirror academic writing. However, test practicality constraints and variability in the use and format of these assessments mean rating scales often fall short in substantiating the central claims of IW assessment. We developed an integrated reading-writing scale taking into account reading-writing requirements and empirical research on IW tests designed to assess readiness for first-year humanities and social science courses. We approached test development as part of the ongoing validation efforts, detailing the considerations involved in the scale development process. We argue that alignment with academic writing requirements should guide the development of IW tests, thereby acknowledging and comprehending nuances of academic writing. The paper presents the considerations and decisions in scale design as part of the validation process from the start, which is a reminder that assessment is not just a quantitative exercise but a multifaceted process.
• The design of a rating scale for first-year undergraduate academic writing is detailed.
• Emphasis is placed on the role of reading in integrated writing scales.
• Academic argumentation, rather than solely source-use mechanics, is considered.
• Implications for construct operationalisation in academic evaluations are offered.
-
Validation of the individual and collective self-efficacy scale for teaching writing in post-secondary faculty
Abstract
Faculty actions in the classroom are known to impact student writing self-efficacy and academic achievement. The purpose of this paper was to validate Locke and Johnston’s Individual and Collective Self-Efficacy for Teaching Writing Scales, a tool originally validated with high school teachers, in a new population of post-secondary faculty. Exploratory and confirmatory factor analysis methods were used in two studies with independent samples: multidisciplinary faculty (N = 281) for the exploratory factor analysis (Study 1) and nursing faculty (N = 187) for the confirmatory factor analysis (Study 2). Three factors were identified in the questionnaire, which maintained the essence of the theoretical structure proposed by Locke and Johnston. Factor 1 was named Context and Process Competencies, Factor 2 Textural Competencies, and Factor 3 Motivational Competencies. This factor structure was confirmed with acceptable goodness of fit in the confirmatory factor analysis (Study 2). Learning to be a teacher of writing is a developmental process, and this measurement tool has important validation information that speaks to its usefulness in understanding that process.
• Instructional practices are known to impact student achievement levels.
• Faculty individual self-efficacy for teaching writing comprises three factors.
• Faculty undergo a slow enculturation process into teaching writing.
• This scale can be used to assess impact of teacher agency on student outcomes.