Journal of Writing Analytics

2 articles
Year: Topic: Clear
Export:
professional writing ×

January 2018

  1. Evolution of Instructor Response? Analysis of Five Years of Feedback to Students
    Abstract

    Background: Research incorporating large data sets and data and text mining methodologies is making initial contributions to writing studies. In writing program administration (WPA) work, one could best characterize the body of publications as small but growing, led by such work as Moxley and Eubanks’ 2015 “On Keeping Score: Instructors' vs. Students' Rubric Ratings of 46,689 Essays” and Arizona State University’s Science of Learning & Educational Technology (SoLET) Lab. Given the information that large-scale textual analysis can provide, it seems incumbent on program administrators to explore ways to make regular and aggressive use of such opportunities to give both students and instructors more resources for learning and development. This project is one attempt to add to this corpus of work; the sample for the study consisted of 17,534 pieces of student writing representing 141,659 discrete comments on that writing, with 58,300 unique words out of over 8.25 million total words written. This data is used to examine trends in the program’s instructor commentary over five years’ time.  By doing so, this study revisits a fundamental task of writing instruction—responding to student writing, and from the data’s results considers how large writing programs with constant turnover of graduate teaching assistants (GTAs) might manage their ongoing instructor professional development and how those GTAs will improve their ability to teach and respond to writing.Literature Review: Researchers have attempted to unpack and understand the task of instructor commentary for several decades; the published literature demonstrates a complex and occasionally ambivalent relationship with this central task of writing instruction. Recent scholarship has moved from the small-scale studies long used by the field to implement large-scale examinations of the instruction occurring in writing programs. Research questions: Three questions guided the inquiry:Does the work of new instructors (MA1s) more closely resemble the lexicon of novice or experienced responders to student writing?How does the new instructors’ work compare to that of more experienced (PHD1 or INS) instructors in the program throughout their time?How does their work evolve over a four-semester longitudinal time frame (as MA1 or MA2 experience levels) in the first-year writing program? [Please note that the abbreviations used above and throughout the article to designate instructor experience levels are as follows: MA1 (first-year master’s students); MA2 (second-year master’s students); PHD1 (first-year doctoral students); INS (instructors—those with 3 or more years’ experience teaching and who are not currently pursuing an additional degree—nearly all of these individuals held a Master’s degree)].Methodology: This study extends the work of Anson and Anson (2017) who first surveyed writing instructors and program administrators to create wordlists that survey respondents associated with “high-quality” and “novice” responses, and then examined a corpus of nearly 50,000 peer responses produced at a single university to learn to what extent instructors and student peers adopted this lexicon. Specifically, the study analyzes a corpus of instructor comments to students using the Anson and Anson wordlists associated with principled and novice commentary to see if new writing instructors align more closely with the concepts represented in either list during their first semester in the program.  It then tracks four cohorts for evolution and change in their vocabulary of feedback over their next three semesters in the program; the study also compares the vocabulary used in their comments to that used by experienced instructors in the program over the same time.Results: The study found that from the outset, the new instructors (MA1) incorporated more of the principled response terms than the novice response terms. Overall, in comparing the MA1 instructors with the most experienced group (INS), the results reveal three important findings about the feedback of both MA1s and INSs in this program.While there are some differences in commentary as seen via examination of the two lexicons, the differences are perhaps less than one might assume.The cohorts do increase their use of the principled terms as they move through the two years’ appointment in the program, but few of the increases demonstrate statistical significance.Few of the terms from either the novice or principled lexicon, with the exception of terms that also appear in the assignment descriptions, what I label as “content terms,” appear frequently in the overall corpus.Discussion: Based on the results, the instructors in this program had acquired a more consistent vocabulary, but not primarily one based on Anson and Anson’s two lexicons—instead, the most frequent and commonly used terms seem to come from a more local “canon,” that is, one based on the assignment descriptions and course outcomes. Regardless of whether the acquisition of a common vocabulary came from more global concepts or an assignment-based local canon, using common terms is something that Nancy Sommers (1982) saw as contributing to “thoughtful commentary” on student writing. As no one has previously studied how quickly new instructors acquire a professional vocabulary for responding to student writing, it is hard to know whether or not the results of this particular group of instructors would be considered “typical.” However, it may well be that the context of this writing program contributed to a more accelerated acquisition.Conclusions: Working with the lexicons developed via Anson and Anson’s survey is a useful starting point for understanding more of what our instructors actually do when responding to student writing, as well as for identifying critical differences in our instructors’ comments. The lexicons, though, only provide us with a subset of expected (thus acceptable) terms included in commentary—terms that afford students the opportunity to act upon receiving them via revision or transfer. Directions for Future Research: Additional research is necessary to expand and refine the lexicons and their impact on student writing. One possibility is to return to the current data set to engage in additional lexical analysis of both the novice and principled lexicons as well as the overall frequency tables to understand how terms are used in the context of response by the various instructor groups. Differences in the application of the terms might help us understand why comments might be labeled as more or less helpful to writers.  Another strategy is to examine the data in terms of markers of stance; finally, topic modeling could be used to locate more subtle differences in the instructor comments that are not as easily identifiable with lexical analysis. Such examinations could serve as a baseline for broadening the study out to other sets of assignments and commentary, perhaps helping us build a set of threshold concepts for talking about writing with our students. Ultimately, it is important to replicate and expand Anson and Anson’s survey to other stakeholder groups. As with much research on the teaching of writing, we default to the group most accessible to us—other writing professionals. Replicating this survey with other stakeholders—graduate teaching assistants, undergraduate students at both lower and upper division levels— could help us understand whether or not a gap exists in understanding what constitutes good feedback from the various stakeholders.

    doi:10.37514/jwa-j.2018.2.1.02

January 2017

  1. Assessing Writing Constructs: Toward an Expanded View of Inter-Reader Reliability
    Abstract

    Background: This study focuses on construct representation and inter-reader agreement and reliability in ePortfolio assessment of 1,315 writing portfolios. These portfolios were submitted by undergraduates enrolled in required writing seminars at the University of Pennsylvania (Penn) in the fall of 2014.  Penn is an Ivy League university with a diverse student population, half of whom identify as students of color. Over half of Penn’s students are women, 12% are international, and 12% are first-generation college students. The students’ portfolios are scored by the instructor and an outside reader drawn from a writing-in-the-disciplines faculty who represent 24 disciplines. The portfolios are the product of a shared curriculum that uses formative assessment and a program-wide multiple-trait rubric. The study contributes to scholarship on the inter-reader reliability and validity of multiple-trait portfolio assessments as well as to recent discussions about reconceptualizing evidence in ePortfolio assessment.  Research Questions: Four questions guided our study: What levels of interrater agreement and reliability can be achieved when assessing complex writing performances that a) contain several different documents to be assessed; b) use a construct-based, multi-trait rubric; c) are designed for formative assessment rather than testing; and d) are rated by a multidisciplinary writing faculty?   What can be learned from assessing agreement and reliability of individual traits? How might these measurements contribute to curriculum design, teacher development, and student learning? How might these findings contribute to research on fairness, reliability, and validity; rubrics; and multidisciplinary writing assessment? Literature Review: There is a long history of empirical work exploring the reliability of scoring highly controlled timed writings, particularly by test measurement specialists. However, until quite recently, there have been few instances of applying empirical assessment techniques to writing portfolios.  Developed by writing theorists, writing portfolios contain multiple documents and genres and are produced and assessed under conditions significantly different from those of timed essay measurement. Interrater reliability can be affected by the different approaches to reading texts depending on the background, training, and goals of the rater. While a few writing theorists question the use of rubrics, most quantitatively based scholarship points to their effectiveness for portfolio assessment and calls into question the meaningfulness of single score holistic grading, whether impressionistic or rubric-based. Increasing attention is being paid to multi-trait rubrics, including, in the field of writing portfolio assessment, the use of robust writing constructs based on psychometrics alongside the more conventional cognitive traits assessed in writing studies, and rubrics that can identify areas of opportunity as well as unfairness in relation to the background of the student or the assessor. Scholars in the emergent field of empirical portfolio assessment in writing advocate the use of reliability as a means to identify fairness and validity and to create great opportunities for portfolios to advance student learning and professional development of faculty.  They also note that while the writing assessment community has paid attention to the work of test measurement practitioners, the reverse has not been the case, and that conversations and collaborations between the two communities are long overdue. Methodology: We used two methods of calculating interrater agreement: absolute and adjacent percentages, and Cohen’s Unweighted Kappa, which calculates the extent to which interrater agreement is an effect of chance or expected outcome. For interrater reliability, we used the Pearson product-moment correlation coefficient. We used SPSS to produce all of the calculations in this study.  Results: Interrater agreement and reliability rates of portfolio scores landed in the medium range of statistical significance.  Combined absolute and adjacent percentages of interrater reliability were above the 90% range recommended; however, absolute agreement was below the 70% ideal.  Furthermore, Cohen’s Unweighted Kappa rates were statistically significant but very low, which may be due to “kappa paradox.” Discussion: The study suggests that a formative, rubric-based approach to ePortfolio assessment that uses disciplinarily diverse raters can achieve medium-level rates of interrater agreement and reliability. It raises the question of the extent to which absolute agreement is a desirable or even relevant goal for authentic feedback processes of a complex set of documents, and in which the aim is to advance student learning. At the same time, our findings point to how agreement and reliability measures can significantly contribute to our assessment process, teacher training, and curriculum. Finally, the study highlights potential concerns about construct validity and rater training.  Conclusion: This study contributes to the emergent field of empirical writing portfolio assessment that calls into question the prevailing standard of reliability built upon timed essay measurement rather than the measurement, conditions, and objectives of complex writing performances.  It also contributes to recent research on multi-trait and discipline-based portfolio assessment.  We point to several directions for further research:  conducting “talk aloud” and recorded sessions with raters to obtain qualitative data on areas of disagreement; expanding the number of constructs assessed; increasing the range and granularity of the numeric scoring scale; and investigating traits that are receiving low interrater reliability scores. We also ask whether absolute agreement might be more useful for writing portfolio assessment than reliability and point to the potential “kappa paradox,” borrowed from the field of medicine, which examines interrater reliability in assessment of rare cases. Kappa paradox might be useful in assessing types of portfolios that are less frequently encountered by faculty readers. These, combined with the identification of jagged profiles and student demographics, hold considerable potential for rethinking how to work with and assess students from a range of backgrounds, preparation, and abilities.  Finally, our findings contribute to a growing effort to understand the role of rater background, particularly disciplinarity, in shaping writing assessment. The goals of our assessment process are to ensure that we are measuring what we intend to measure, specifically those things that students have an equal chance at achieving and that advance student learning.  Our findings suggest that interrater agreement and reliability measures, if thoughtfully approached, will contribute significantly to each of these goals.

    doi:10.37514/jwa-j.2017.1.1.09