Exploring the scoring validity of holistic and dimension-based Comparative Judgements of young learners’ EFL writing

Rebecca Sickinger; John Pill; Tineke Brunfaut

doi:10.1016/j.asw.2025.100986

Assessing Writing Oct 2025 Open Access

Rebecca Sickinger (Lancaster University); John Pill (Lancaster University); Tineke Brunfaut (Lancaster University)

Abstract

Comparative Judgement (CJ) is a pairwise comparison evaluation method, typically conducted online. Multiple judges each compare the quality of a series of paired performances and, from their decisions, a rank order is constructed and scores calculated. Research across different educational contexts supports CJ’s reliability for evaluating written performances, permitting more precise scoring of scripts and dimension-focused evaluation. However, scant insights are available about the basis of judges’ evaluations. This issue is important because argument-based approaches to validation (common in the field of language testing and adopted in this study) require evidence to support claims about how scores are appropriate for the test purpose. Therefore, we investigate the scoring validity of CJ, both when used holistically (the standard application of CJ) and when evaluating scripts by individual criteria (termed dimensions in the research context). Twenty-seven judges evaluated 300 scripts addressing two writing task types in a national English as a Foreign Language examination for young learners in Austria. Judges reported via questionnaires what they had focused on while judging. Subsequently, eight judges provided think-aloud data while evaluating 157 scripts, offering further insight into the writing features they considered and their decision-making during CJ. Findings showed that while most judges adopted a decision-making process similar to traditional rating methods, some adapted their method to accommodate the nature of CJ evaluation. Furthermore, results indicated that the judges considered construct-relevant criteria when using CJ, both holistically and by dimension, thus offering support to an argument for the appropriateness of using CJ in this context.

• Comparative Judgement can offer an alternative to analytic rating of EFL writing.
• Judges with teaching or rating experience largely focus on relevant text features.
• Some judges adopt a decision-making process that appears well suited to CJ.
• Dimension-based CJ has the potential to provide richer feedback than holistic CJ.
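
The abstract describes how CJ turns many pairwise decisions into a rank order and scores, but this record does not say which statistical model the study used. As a rough illustration only, the sketch below fits a Bradley-Terry model, a common choice in CJ work, using a simple iterative (minorisation-maximisation) update. The function name `bradley_terry`, the script labels, and the toy judgements are all hypothetical and are not taken from the article.

```python
from collections import defaultdict


def bradley_terry(comparisons, n_iter=500, tol=1e-10):
    """Estimate relative strengths from (winner, loser) pairs.

    Illustrative sketch, not the study's procedure. Strengths are
    rescaled so that their mean is 1.0.
    """
    wins = defaultdict(float)         # number of comparisons each script won
    pair_counts = defaultdict(float)  # number of times each pair was judged
    for winner, loser in comparisons:
        if winner == loser:
            raise ValueError("a script cannot be compared with itself")
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0

    items = sorted({s for pair in pair_counts for s in pair})
    strength = {s: 1.0 for s in items}  # start with all scripts equal

    for _ in range(n_iter):
        updated = {}
        for i in items:
            denom = 0.0
            for pair, n in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}  # the other script in this pair
                    denom += n / (strength[i] + strength[j])
            updated[i] = wins[i] / denom
        mean = sum(updated.values()) / len(items)
        updated = {s: v / mean for s, v in updated.items()}
        change = max(abs(updated[s] - strength[s]) for s in items)
        strength = updated
        if change < tol:  # stop once the estimates have settled
            break
    return strength


# Hypothetical judgements: each tuple is (preferred script, other script).
judgements = [
    ("A", "B"), ("A", "B"), ("B", "A"),
    ("B", "C"), ("B", "C"), ("C", "B"),
    ("A", "C"), ("A", "C"), ("C", "A"),
]
scores = bradley_terry(judgements)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy data every pair is judged the same number of times, so the fitted order simply follows the win counts. Operational CJ tools typically report scores on a logit scale and add reliability statistics, which this sketch leaves out.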

Journal: Assessing Writing
Published: 2025-10-01
DOI: 10.1016/j.asw.2025.100986
Open Access: OA PDF Hybrid
Topics: teacher development; assessment; multilingual writers


