Validity evidences for scoring procedures of a writing assessment task. A case study on consistency, reliability, unidimensionality and prediction accuracy

Paula Elosua

doi:10.1016/j.asw.2022.100669

Assessing Writing Oct 2022 Open Access

Paula Elosua, University of the Basque Country

Abstract

Scoring is a fundamental step in the assessment of writing performance. The choice of scoring procedure and the adoption of a discrepancy resolution method can affect the psychometric properties of the scores and, therefore, the final pass/fail decision. Within a comprehensive framework that treats scoring as part of the score validation process, this paper evaluates the impact of the rater mean, parity and tertium quid procedures on score properties. Using data from a writing assessment task administered in a professional context, it analyses score reliability, dependability, unidimensionality and decision accuracy on two data sets: the complete data and the subsample of discrepant data. The results show that the tertium quid procedure performs better on reliability indicators but less well in defining construct unidimensionality.

Journal: Assessing Writing
Published: 2022-10-01
DOI: 10.1016/j.asw.2022.100669
Open Access: Hybrid
Topics: assessment, qualitative research

CrossRef global citation count: 3
