Journal of Writing Analytics

6 articles

January 2020

  1. A Line Matching Method for Reliable Higher-Order Theme Identification
    doi:10.37514/jwa-j.2020.4.1.09

January 2018

  1. Placing Writing Tasks in Local and Global Contexts: The Case of Argumentative Writing
    Abstract

    Background: Current research in composition and writing studies is concerned with issues of writing program evaluation and how writing tasks and their sequences scaffold students toward learning outcomes. These issues are beginning to be addressed by writing analytics research, which can be useful for identifying recurring types of language in writing assignments and how those can inform task design and student outcomes. To address these issues, this study provides a three-step method of sequencing, comparison, and diagnosis to understand how specific writing tasks fit into a classroom sequence as well as compare to larger genres of writing outside of the immediate writing classroom environment. By doing so, we provide writing program administrators with tools for describing what skills students demonstrate in a sequence of writing tasks and diagnosing how these skills match with writing students will do in later contexts. Literature Review: Student writing that responds to classroom assignments can be understood as genres, insofar as they are constructed responses that exist in similar rhetorical situations and perform similar social actions. Previous work in corpus analysis has looked at these genres, which helps us as writing instructors understand what kind of constructed responses are required of students and to make those expectations explicit. Aull (2017) examined a corpus of first-year undergraduate writing assignments in two courses to create “sociocognitive profiles” of these assignments. We analyze student writing that responds to similar writing tasks, but use a different corpus method that allows us to understand the tasks in both local and global contexts. By doing so, we gain confidence and depth in our understanding of these tasks, analyze how they sequence together, and are able to compare argumentative writing across institutions and contexts. 
Research Questions: Two questions guided our study: What is the trajectory of skills targeted by the sequence of tasks in the two first-year writing courses, as evidenced by the rhetorical strategies employed by the writers in successive assignments? Focusing on the final argument assignments, how similar are they to argumentative writing in other contexts, in terms of rhetorical profiles? Methodology: We first conducted a local analysis, in which we used a dictionary-based corpus method to analyze the rhetorical strategies used by writers in the first-year writing courses to understand how they built on each other to form a sequence. Having understood what skills students are demonstrating in a course, we then conducted a global analysis which calculated a “distance” between the first-year argument writing and a corpus of argument writing drawn from other contexts. Recognizing that there was a non-trivial distance, we then identified and evaluated the sources of the distance so that the writing tasks could be assessed or modified. Results: The local analysis revealed eight key rhetorical strategies that student writing exhibits between the two first-year writing courses. With this understanding, we then placed the argument writing in global contexts to find that the assignments in both courses differ somewhat from argument writing in other contexts. Upon analyzing this difference, we found that the first-year writing primarily differs in its usage of academic language, the personal register, assertive language, and reasoning. We suggest that these differences stem primarily from the rhetorical situation and learning objectives associated with first-year writing, as well as the sequencing of the courses. Discussion: The three-step method presented provides a means for writing program administrators to describe and analyze writing that students produce in their writing programs. 
We intend these steps to be understood as an iterative process, whereby writing programs can use these results to evaluate what rhetorical skills their students are exhibiting and to benchmark those against the program’s goals and/or other similar writing programs. Conclusions: By presenting these analyses together, we ultimately provide a cohesive method by which to analyze a writing program and benchmark students’ use of rhetorical strategies in relation to other argumentative contexts. We believe this method to be useful not only to individual writing programs, but to assessment literature broadly. In future research, we anticipate learning how this process will practically feed back into pedagogy, as well as understanding what placing writing tasks into a global context can tell us about genre theory.

    doi:10.37514/jwa-j.2018.2.1.03
  2. Developing an e-rater Advisory to Detect Babel-generated Essays
    Abstract

    Background: It is important for developers of automated scoring systems to ensure that their systems are as fair and valid as possible. This commitment means evaluating the performance of these systems in light of construct-irrelevant response strategies. The enhancement of systems to detect and deal with these kinds of strategies is often an iterative process, whereby as new strategies come to light they need to be evaluated and effective mechanisms built into the automated scoring systems to handle them. In this paper, we focus on the Babel system, which automatically generates semantically incohesive essays. We expect that these essays may unfairly receive high scores from automated scoring engines despite essentially being nonsense. Literature Review: We discuss literature related to gaming of automated scoring systems. One reason that Babel essays are so easy to identify as nonsense by human readers is that they lack any semantic cohesion. Therefore, we also discuss some literature related to cohesion and detecting semantic cohesion. Research Questions: This study addressed three research questions: Can we automatically detect essays generated by the Babel system? Can we integrate the detection of Babel-generated essays into an operational automated essay scoring system while making sure not to flag valid student responses? Does a general approach for detecting semantically incohesive essays also detect Babel-generated essays? Research Methodology: This article describes the creation of two corpora necessary to address the research questions: (1) a corpus of Babel-generated essays and (2) a corresponding corpus of good-faith essays. We built a classifier to distinguish Babel-generated essays from good-faith essays and investigated whether the classifier can be integrated into an automated scoring engine without adverse effects. 
We also developed a measure of lexical-semantic cohesion and examined its distribution in Babel and in good-faith essays. Results: We found that the classifier built on Babel-generated essays and good-faith essays and using features from the automated scoring engine can distinguish the Babel-generated essays from the good-faith ones with 100% accuracy. We also found that if we integrated this classifier into the automated scoring engine it flagged very few responses that were submitted as part of operational submissions (76 of 434,656). The responses that were flagged had previously been assigned a score of Null (non-scorable) or a score of 1 by human experts. The measure of lexical-semantic cohesion shows promise in being able to distinguish Babel-generated essays from good-faith essays. Conclusions: Our results show that it is possible to detect the kind of gaming strategy illustrated by the Babel system and add it to an automated scoring engine without adverse effects on essays seen during real high-stakes tests. We also show that a measure of lexical-semantic cohesion can separate Babel-generated essays from good-faith essays to a certain degree, depending on task. This points to future work that would generalize the capability to detect semantic incoherence in essays. Directions for Further Research: Babel-generated essays can be identified and flagged by an automated scoring system without any adverse effects on a large set of good-faith essays. However, this is just one type of gaming strategy. It is important for developers of automated scoring systems to continue to be diligent about expanding the construct coverage of their systems in order to prevent weaknesses that can be exploited by tools such as Babel. It is also important to focus on the underlying linguistic reasons that lead to nonsense sentences. Successful identification of such nonsense would lead to improved automated scoring and feedback.

    doi:10.37514/jwa-j.2018.2.1.08
  3. De-Identification of Laboratory Reports in STEM
    Abstract

    Background: Employing natural language processing and latent semantic analysis, the current work was completed as a constituent part of a larger research project for designing and launching artificial intelligence in the form of deep artificial neural networks. The models were evaluated on a proprietary corpus retrieved from a data warehouse, where it was extracted from MyReviewers, a sophisticated web application designed for peer review in written communication, which was actively used in several higher education institutions. The corpus of laboratory reports in STEM annotated by instructors and students was used to train the models. Under the Common Rule, research ethics were ensured by protecting the privacy of subjects and maintaining the confidentiality of data, which mandated corpus de-identification. Literature Review: De-identification and pseudonymization of textual data has remained an actively studied research question for several decades. Its importance is stipulated by numerous laws and regulations in the United States and internationally, including the HIPAA Privacy Rule and FERPA. Research Question: Text de-identification requires a significant amount of manual post-processing for eliminating faculty and student names. This work investigated automated and semi-automated methods for de-identifying student and faculty entities while preserving author names in cited sources and reference lists. It was hypothesized that a natural language processing toolkit and an artificial neural network model with named entity recognition capabilities would facilitate text processing and reduce the amount of manual labor required for post-processing after matching essays to a list of users’ names. The suggested techniques were applied with supplied pre-trained models without additional tagging and training. 
The goal of the study was to evaluate three approaches and find the most efficient one among those using a users’ list, a named entity recognition toolkit, and an artificial neural network. Research Methodology: The current work studied de-identification of STEM laboratory reports and evaluated the performance of three techniques: brute-force search with a user list, named entity recognition with the OpenNLP machine learning toolkit, and NeuroNER, an artificial neural network for named entity recognition built on the TensorFlow platform. The complexity of the task stemmed from a dilemma: names belonging to students, instructors, or teaching assistants must be removed, while all other names (e.g., authors of referenced papers) must be preserved. Results: The evaluation of the three selected methods demonstrated that automating de-identification of STEM lab reports is not possible in a setting where named entity recognition methods are employed with pre-trained models. The highest results were achieved by the users’ list technique with 0.79 precision, 0.75 recall, and 0.77 F1 measure, which significantly outperformed OpenNLP with 0.06 precision, 0.14 recall, and 0.09 F1, and NeuroNER with 0.14 precision, 0.56 recall, and 0.23 F1. Discussion: The low performance of the OpenNLP and NeuroNER toolkits was explained by the complexity of the task and the unattainability of customized models due to imposed time constraints. An approach for masking possible de-identification errors is suggested. Conclusion: Unlike multiple cases described in the related work, de-identification of laboratory reports in STEM remained a non-trivial, labor-intensive task. 
Applied out of the box, a machine learning toolkit and an artificial neural network technique did not improve on the performance of the brute-force approach based on user list matching. Directions for Future Research: Customized tagging and training on the STEM corpus were presumed to advance outcomes of machine learning and predominantly artificial intelligence methods. Application of other natural language toolkits may lead to a more effective solution.
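
    The F1 figures reported above combine precision and recall as their harmonic mean. The following is a minimal Python sketch of that calculation, not the authors' evaluation code. The function name `f1_score` and the dictionary layout are illustrative assumptions; the only data used are the rounded precision, recall, and F1 values quoted in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall, reported F1) values quoted in the abstract.
# The dictionary keys are labels chosen for this sketch.
reported = {
    "user list": (0.79, 0.75, 0.77),
    "OpenNLP": (0.06, 0.14, 0.09),
    "NeuroNER": (0.14, 0.56, 0.23),
}

for method, (precision, recall, f1_reported) in reported.items():
    recomputed = f1_score(precision, recall)
    print(f"{method}: recomputed F1 = {recomputed:.3f}, reported F1 = {f1_reported:.2f}")
```

    Because the inputs here are already rounded to two decimals, a recomputed value can differ from the reported one in the second decimal place. The abstract does not say whether the reported F1 scores were computed from unrounded counts, so any such gap should not be read as an error in the study.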

    doi:10.37514/jwa-j.2018.2.1.07

January 2017

  1. (Re)Visualizing Rater Agreement: Beyond Single-Parameter Measures
    Abstract

    Technique Identification: A new graphical technique is presented for visualizing and assessing inter-rater agreement in discrete ordinal or categorical data, such as rubric ratings. To that aim, a chance-corrected Kappa with two new features is derived. First, it is based on interpreting ratings for each subject as vectors to visualize the data. This is done by creating two-dimensional vectors from a subject-rating summary table, sorting the vectors by their slopes, and plotting them in that order to create a trajectory that displays all the data in context. Second, it presents a graph and accompanying statistics (Kappa, p-value) for each pair of ratings in an organized display so that all useful comparisons of the data are visually displayed and statistically assessed. This information is presented on a logical grid, usually called facets. Kappa is calculated in the usual way, by referencing the actual results with an average of random rating assignments. This average becomes a reference line on each graph as a visual cue, as well. The statistical basis for the Kappa and significance testing are derived, and the test assumptions are specified. Value Contribution: The most commonly used statistics for inter-rater agreement, such as the Cohen Kappa or Intraclass Correlation, give only a single parameter estimate of reliability from which to make judgments about ratings data. The technique presented here constructs graphs of all the data that allow visual inspection of the ratings versus a reference curve that represents chance-matching. The detailed reports on inter-rater agreement can show how to fine-tune ratings systems, such as understanding which parts of an ordinal scale are working best. This solves a practical problem for researchers who rely on rating-type classification by revealing which overall aspects of the rating system need to be improved and adds to the list of tools available for assessing rating reliability. 
In creating this approach to analysis of rater data, human usability is emphasized. Specifically, the use of geometry is designed to facilitate interpretability rather than being a mathematical derivation from first principles. Technique Application: Two applications are given, both involving social meaning-making. The first uses data from wine-judging to illustrate how the method can illuminate expertise in that domain. The results reproduce published findings that were based on a classical statistical method. A second sample application uses data from a university assessment of student writing in which ratings on a developmental scale are assigned by course instructors to their students. The rating program is an example of social meaning-making that can be used to generate larger data sets than are typical for classroom-based assessment programs. The analysis shows the strengths and weaknesses of the rating system in terms of reliability and demonstrates how that knowledge leads to improvements in assessment. Directions for Further Research: An argument is made for a public library of inter-rater data for empirical use by researchers. The social aspects of rating are discussed, and there is an illustration of the potential to derive new measures of inter-rater agreement from the meaning-making program that produces the data.

    doi:10.37514/jwa-j.2017.1.1.10
  2. Assessing Writing Constructs: Toward an Expanded View of Inter-Reader Reliability
    Abstract

    Background: This study focuses on construct representation and inter-reader agreement and reliability in ePortfolio assessment of 1,315 writing portfolios. These portfolios were submitted by undergraduates enrolled in required writing seminars at the University of Pennsylvania (Penn) in the fall of 2014.  Penn is an Ivy League university with a diverse student population, half of whom identify as students of color. Over half of Penn’s students are women, 12% are international, and 12% are first-generation college students. The students’ portfolios are scored by the instructor and an outside reader drawn from a writing-in-the-disciplines faculty who represent 24 disciplines. The portfolios are the product of a shared curriculum that uses formative assessment and a program-wide multiple-trait rubric. The study contributes to scholarship on the inter-reader reliability and validity of multiple-trait portfolio assessments as well as to recent discussions about reconceptualizing evidence in ePortfolio assessment.  Research Questions: Four questions guided our study: What levels of interrater agreement and reliability can be achieved when assessing complex writing performances that a) contain several different documents to be assessed; b) use a construct-based, multi-trait rubric; c) are designed for formative assessment rather than testing; and d) are rated by a multidisciplinary writing faculty?   What can be learned from assessing agreement and reliability of individual traits? How might these measurements contribute to curriculum design, teacher development, and student learning? How might these findings contribute to research on fairness, reliability, and validity; rubrics; and multidisciplinary writing assessment? Literature Review: There is a long history of empirical work exploring the reliability of scoring highly controlled timed writings, particularly by test measurement specialists. 
However, until quite recently, there have been few instances of applying empirical assessment techniques to writing portfolios.  Developed by writing theorists, writing portfolios contain multiple documents and genres and are produced and assessed under conditions significantly different from those of timed essay measurement. Interrater reliability can be affected by the different approaches to reading texts depending on the background, training, and goals of the rater. While a few writing theorists question the use of rubrics, most quantitatively based scholarship points to their effectiveness for portfolio assessment and calls into question the meaningfulness of single score holistic grading, whether impressionistic or rubric-based. Increasing attention is being paid to multi-trait rubrics, including, in the field of writing portfolio assessment, the use of robust writing constructs based on psychometrics alongside the more conventional cognitive traits assessed in writing studies, and rubrics that can identify areas of opportunity as well as unfairness in relation to the background of the student or the assessor. Scholars in the emergent field of empirical portfolio assessment in writing advocate the use of reliability as a means to identify fairness and validity and to create great opportunities for portfolios to advance student learning and professional development of faculty.  They also note that while the writing assessment community has paid attention to the work of test measurement practitioners, the reverse has not been the case, and that conversations and collaborations between the two communities are long overdue. Methodology: We used two methods of calculating interrater agreement: absolute and adjacent percentages, and Cohen’s Unweighted Kappa, which calculates the extent to which interrater agreement is an effect of chance or expected outcome. For interrater reliability, we used the Pearson product-moment correlation coefficient. 
We used SPSS to produce all of the calculations in this study.  Results: Interrater agreement and reliability rates of portfolio scores landed in the medium range of statistical significance.  Combined absolute and adjacent percentages of interrater reliability were above the 90% range recommended; however, absolute agreement was below the 70% ideal.  Furthermore, Cohen’s Unweighted Kappa rates were statistically significant but very low, which may be due to “kappa paradox.” Discussion: The study suggests that a formative, rubric-based approach to ePortfolio assessment that uses disciplinarily diverse raters can achieve medium-level rates of interrater agreement and reliability. It raises the question of the extent to which absolute agreement is a desirable or even relevant goal for authentic feedback processes of a complex set of documents, and in which the aim is to advance student learning. At the same time, our findings point to how agreement and reliability measures can significantly contribute to our assessment process, teacher training, and curriculum. Finally, the study highlights potential concerns about construct validity and rater training.  Conclusion: This study contributes to the emergent field of empirical writing portfolio assessment that calls into question the prevailing standard of reliability built upon timed essay measurement rather than the measurement, conditions, and objectives of complex writing performances.  It also contributes to recent research on multi-trait and discipline-based portfolio assessment.  We point to several directions for further research:  conducting “talk aloud” and recorded sessions with raters to obtain qualitative data on areas of disagreement; expanding the number of constructs assessed; increasing the range and granularity of the numeric scoring scale; and investigating traits that are receiving low interrater reliability scores. 
We also ask whether absolute agreement might be more useful for writing portfolio assessment than reliability and point to the potential “kappa paradox,” borrowed from the field of medicine, which examines interrater reliability in assessment of rare cases. Kappa paradox might be useful in assessing types of portfolios that are less frequently encountered by faculty readers. These, combined with the identification of jagged profiles and student demographics, hold considerable potential for rethinking how to work with and assess students from a range of backgrounds, preparation, and abilities.  Finally, our findings contribute to a growing effort to understand the role of rater background, particularly disciplinarity, in shaping writing assessment. The goals of our assessment process are to ensure that we are measuring what we intend to measure, specifically those things that students have an equal chance at achieving and that advance student learning.  Our findings suggest that interrater agreement and reliability measures, if thoughtfully approached, will contribute significantly to each of these goals.
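
    The agreement and reliability measures named in the methodology (absolute and adjacent agreement, Cohen's unweighted kappa, and the Pearson product-moment correlation) can be illustrated with a short Python sketch. The study itself used SPSS; the code below is only a standard-library approximation of the textbook formulas. The function names and the two score lists are invented for illustration and are not data from the study.

```python
from collections import Counter
from math import sqrt


def agreement_rates(a, b):
    """Return (absolute, adjacent) agreement: identical scores, and scores one point apart."""
    n = len(a)
    absolute = sum(x == y for x, y in zip(a, b)) / n
    adjacent = sum(abs(x - y) == 1 for x, y in zip(a, b)) / n
    return absolute, adjacent


def cohen_kappa(a, b):
    """Unweighted Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal score distribution.
    p_expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_expected == 1:
        return 1.0  # both raters used a single identical category throughout
    return (p_observed - p_expected) / (1 - p_expected)


def pearson_r(a, b):
    """Pearson product-moment correlation (undefined if either list is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical rubric scores on a 1-4 scale for eight portfolios (invented data).
instructor = [4, 3, 4, 2, 3, 4, 1, 3]
outside_reader = [4, 3, 3, 2, 4, 4, 2, 3]

absolute, adjacent = agreement_rates(instructor, outside_reader)
print(f"absolute = {absolute:.3f}, adjacent = {adjacent:.3f}")
print(f"kappa = {cohen_kappa(instructor, outside_reader):.3f}")
print(f"pearson r = {pearson_r(instructor, outside_reader):.3f}")
```

    In this invented example the two readers never differ by more than one point, so combined absolute and adjacent agreement is complete, yet kappa is much lower because it discounts the agreement expected from the marginal score distributions. That gap between percentage agreement and kappa is the general pattern the abstract refers to when it mentions the "kappa paradox."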

    doi:10.37514/jwa-j.2017.1.1.09