Journal of Writing Analytics
95 articles
January 2020
January 2019
-
Understanding Attainment Disparity: The Case for a Corpus-Driven Analysis of the Language used in Written Feedback Information to Students of Different Backgrounds
Abstract
Background: Disparity of attainment between different groups of students in UK higher education has been correlated with ethnicity (UUK & NUS, 2019). For example, students who declared their ethnicity as Black were 20% less likely to graduate with a top classification than those who declared their ethnicity as White (OfS, 2018a). The causes of such attainment gaps are complex, and one important factor may be the nature of the feedback given by academic staff on assignments written by different groups of students. This paper aims to explore the feasibility of investigating this hypothesis by analyzing written feedback and looking for patterns in feedback given to different groups of students. Literature Review: Research on attainment among Black and Minority Ethnic (BAME) students in the UK has explored a number of aspects, and has generally concluded that there are issues of “belonging” (Richardson, 2015), particularly in institutions where the majority of academic staff and students are White, but that no single variable can explain the disparity. The wording of feedback on lower-scoring papers has been shown to be more impersonal and distant than that given to students on higher-scoring papers (e.g., Gardner, 2004), which has the (unintended) result of increasing the sense of belonging of higher performing students in ways that can build incrementally over the years of a degree course. While there have been many such small-scale studies of written feedback, none have aimed to collect large quantities of authentic written feedback for analysis. Research Questions: The hypotheses that drive our exploration are that written feedback information (WFI) (Boud & Malloy, 2013) is worded differently to different groups of students, and that there is a direct relationship between this aspect of feedback and academic attainment as measured by grades on summative assessments. Specifically, we asked: 1. 
Can a framework of WFI functions be developed for our data that share a meaningful set of attributes? 2. Can these categories be used to differentiate WFI to different groups of students? Methodology: A small pilot corpus was compiled from written feedback comments on twelve student assignments from two large Faculties. Metadata was added to each file, and the WFI comments were annotated and analyzed according to a framework developed in a branching format through a recursive construction process informed by the literature reviewed and the data in the corpus. This technique was used to characterize the WFI styles of the two Faculties. Results: The results show that all WFI comments could be classified using the novel systematic framework developed, and that its binary nature enabled ready cross-tabulation with metadata variables. Praise and critique were found to be most frequent, with specific praise of ideas (P1A) accounting for 68% of all praise, and specific critique of content (C1A) accounting for 49% of all critique. Observations tend to be the longest feedback comments (average 15.4 words). When the two Faculties are compared, two different feedback styles are evident, with Fac1 providing more advice, query, and observation style feedback than Fac2, and Fac2 providing more praise and critique than Fac1.
January 2018
-
Abstract
This research note focuses on how corpus analysis tools can help researchers make sense of the data writing centers collect. Writing centers function, in many ways, like large data repositories; however, this data is under-analyzed. One example of data collected by writing centers is session notes, often collected after each consultation. The four institutions featured in this note (Michigan State University, the University of Michigan, Texas A&M University, and The Ohio State University) have analyzed a subset of their session notes, over 44,000 session notes comprising around 2,000,000 words. By analyzing the session notes using tools such as Voyant, a web-based application for performing text analysis, writing center researchers can begin to explore critically their large data repositories to understand and establish evidence-based practice, as well as to shape external messaging about writing center labor (separate from and in addition to impact on student writers) to institutional administrators, state legislators, and other stakeholders.
-
Abstract
Background: Research incorporating large data sets and data and text mining methodologies is making initial contributions to writing studies. In writing program administration (WPA) work, one could best characterize the body of publications as small but growing, led by such work as Moxley and Eubanks’ 2015 “On Keeping Score: Instructors' vs. Students' Rubric Ratings of 46,689 Essays” and Arizona State University’s Science of Learning & Educational Technology (SoLET) Lab. Given the information that large-scale textual analysis can provide, it seems incumbent on program administrators to explore ways to make regular and aggressive use of such opportunities to give both students and instructors more resources for learning and development. This project is one attempt to add to this corpus of work; the sample for the study consisted of 17,534 pieces of student writing representing 141,659 discrete comments on that writing, with 58,300 unique words out of over 8.25 million total words written. This data is used to examine trends in the program’s instructor commentary over five years’ time. By doing so, this study revisits a fundamental task of writing instruction (responding to student writing) and, from the data’s results, considers how large writing programs with constant turnover of graduate teaching assistants (GTAs) might manage their ongoing instructor professional development and how those GTAs will improve their ability to teach and respond to writing. Literature Review: Researchers have attempted to unpack and understand the task of instructor commentary for several decades; the published literature demonstrates a complex and occasionally ambivalent relationship with this central task of writing instruction. Recent scholarship has moved from the small-scale studies long used by the field to implement large-scale examinations of the instruction occurring in writing programs.
Research Questions: Three questions guided the inquiry: Does the work of new instructors (MA1s) more closely resemble the lexicon of novice or experienced responders to student writing? How does the new instructors’ work compare to that of more experienced (PHD1 or INS) instructors in the program throughout their time? How does their work evolve over a four-semester longitudinal time frame (as MA1 or MA2 experience levels) in the first-year writing program? [Please note that the abbreviations used above and throughout the article to designate instructor experience levels are as follows: MA1 (first-year master’s students); MA2 (second-year master’s students); PHD1 (first-year doctoral students); INS (instructors: those with 3 or more years’ experience teaching who are not currently pursuing an additional degree; nearly all of these individuals held a Master’s degree)]. Methodology: This study extends the work of Anson and Anson (2017) who first surveyed writing instructors and program administrators to create wordlists that survey respondents associated with “high-quality” and “novice” responses, and then examined a corpus of nearly 50,000 peer responses produced at a single university to learn to what extent instructors and student peers adopted this lexicon. Specifically, the study analyzes a corpus of instructor comments to students using the Anson and Anson wordlists associated with principled and novice commentary to see if new writing instructors align more closely with the concepts represented in either list during their first semester in the program.
It then tracks four cohorts for evolution and change in their vocabulary of feedback over their next three semesters in the program; the study also compares the vocabulary used in their comments to that used by experienced instructors in the program over the same time. Results: The study found that from the outset, the new instructors (MA1) incorporated more of the principled response terms than the novice response terms. Overall, in comparing the MA1 instructors with the most experienced group (INS), the results reveal three important findings about the feedback of both MA1s and INSs in this program. While there are some differences in commentary as seen via examination of the two lexicons, the differences are perhaps less than one might assume. The cohorts do increase their use of the principled terms as they move through the two years’ appointment in the program, but few of the increases demonstrate statistical significance. Few of the terms from either the novice or principled lexicon, with the exception of terms that also appear in the assignment descriptions, what I label as “content terms,” appear frequently in the overall corpus. Discussion: Based on the results, the instructors in this program had acquired a more consistent vocabulary, but not primarily one based on Anson and Anson’s two lexicons; instead, the most frequent and commonly used terms seem to come from a more local “canon,” that is, one based on the assignment descriptions and course outcomes. Regardless of whether the acquisition of a common vocabulary came from more global concepts or an assignment-based local canon, using common terms is something that Nancy Sommers (1982) saw as contributing to “thoughtful commentary” on student writing.
As no one has previously studied how quickly new instructors acquire a professional vocabulary for responding to student writing, it is hard to know whether or not the results of this particular group of instructors would be considered “typical.” However, it may well be that the context of this writing program contributed to a more accelerated acquisition. Conclusions: Working with the lexicons developed via Anson and Anson’s survey is a useful starting point for understanding more of what our instructors actually do when responding to student writing, as well as for identifying critical differences in our instructors’ comments. The lexicons, though, only provide us with a subset of expected (thus acceptable) terms included in commentary: terms that afford students the opportunity to act upon receiving them via revision or transfer. Directions for Future Research: Additional research is necessary to expand and refine the lexicons and their impact on student writing. One possibility is to return to the current data set to engage in additional lexical analysis of both the novice and principled lexicons as well as the overall frequency tables to understand how terms are used in the context of response by the various instructor groups. Differences in the application of the terms might help us understand why comments might be labeled as more or less helpful to writers. Another strategy is to examine the data in terms of markers of stance; finally, topic modeling could be used to locate more subtle differences in the instructor comments that are not as easily identifiable with lexical analysis. Such examinations could serve as a baseline for broadening the study out to other sets of assignments and commentary, perhaps helping us build a set of threshold concepts for talking about writing with our students. Ultimately, it is important to replicate and expand Anson and Anson’s survey to other stakeholder groups.
As with much research on the teaching of writing, we default to the group most accessible to us: other writing professionals. Replicating this survey with other stakeholders (graduate teaching assistants and undergraduate students at both lower and upper division levels) could help us understand whether or not a gap exists in understanding what constitutes good feedback from the various stakeholders.
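The wordlist comparison described in this abstract can be illustrated with a minimal sketch, assuming comments are plain strings and a lexicon is a set of single-word terms. The function name `lexicon_rate` and the sample terms are hypothetical; they are not the Anson and Anson wordlists or the study's actual procedure.

```python
import re

def lexicon_rate(comments, lexicon):
    """Return occurrences of lexicon terms per 1,000 words across all comments."""
    terms = {term.lower() for term in lexicon}
    total_words = 0
    hits = 0
    for comment in comments:
        # Simple lowercase word tokenizer; real studies may tokenize differently.
        tokens = re.findall(r"[a-z']+", comment.lower())
        total_words += len(tokens)
        hits += sum(1 for token in tokens if token in terms)
    return 1000 * hits / total_words if total_words else 0.0

# Hypothetical comments and a one-term lexicon, for illustration only.
sample_comments = ["Your thesis is clear", "Fix the comma"]
rate = lexicon_rate(sample_comments, {"thesis"})
```

Normalizing by total words lets groups that wrote different amounts of commentary be compared on the same scale.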
-
Abstract
The Writing Mentor™ (WM) application is a Google Docs add-on designed to help students improve their writing in a principled manner and to promote their writing success in postsecondary settings. WM provides automated writing evaluation (AWE) feedback using natural language processing (NLP) methods and linguistic resources. AWE features in WM have been informed by research about postsecondary student writers often classified as developmental (Burstein et al., 2016b), and these features address a breadth of writing sub-constructs (including use of sources, claims, and evidence; topic development; coherence; and knowledge of English conventions). Through an optional entry survey, WM collects self-efficacy data about writing and English language status from users. Tool perceptions are collected from users through an optional exit survey. Informed by language arts models consistent with the Common Core State Standards Initiative and valued by the writing studies community, WM takes initial steps to integrate the reading and writing process by offering a range of textual features, including vocabulary support, intended to help users to understand unfamiliar vocabulary in coursework reading texts. This paper describes WM and provides discussion of descriptive evaluations from an Amazon Mechanical Turk (AMT) usability task situated in WM and from users-in-the-wild data. The paper concludes with a framework for developing writing feedback and analytics technology.
-
Abstract
Background: Current research in composition and writing studies is concerned with issues of writing program evaluation and how writing tasks and their sequences scaffold students toward learning outcomes. These issues are beginning to be addressed by writing analytics research, which can be useful for identifying recurring types of language in writing assignments and how those can inform task design and student outcomes. To address these issues, this study provides a three-step method of sequencing, comparison, and diagnosis to understand how specific writing tasks fit into a classroom sequence as well as compare to larger genres of writing outside of the immediate writing classroom environment. By doing so, we provide writing program administrators with tools for describing what skills students demonstrate in a sequence of writing tasks and diagnosing how these skills match with writing students will do in later contexts. Literature Review: Student writing that responds to classroom assignments can be understood as genres, insofar as they are constructed responses that exist in similar rhetorical situations and perform similar social actions. Previous work in corpus analysis has looked at these genres, which helps us as writing instructors understand what kind of constructed responses are required of students and to make those expectations explicit. Aull (2017) examined a corpus of first-year undergraduate writing assignments in two courses to create “sociocognitive profiles” of these assignments. We analyze student writing that responds to similar writing tasks, but use a different corpus method that allows us to understand the tasks in both local and global contexts. By doing so, we gain confidence and depth in our understanding of these tasks, analyze how they sequence together, and are able to compare argumentative writing across institutions and contexts. 
Research Questions: Two questions guided our study: What is the trajectory of skills targeted by the sequence of tasks in the two first-year writing courses, as evidenced by the rhetorical strategies employed by the writers in successive assignments? Focusing on the final argument assignments, how similar are they to argumentative writing in other contexts, in terms of rhetorical profiles? Methodology: We first conducted a local analysis, in which we used a dictionary-based corpus method to analyze the rhetorical strategies used by writers in the first-year writing courses to understand how they built on each other to form a sequence. Having understood what skills students are demonstrating in a course, we then conducted a global analysis which calculated a “distance” between the first-year argument writing and a corpus of argument writing drawn from other contexts. Recognizing that there was a non-trivial distance, we then identified and evaluated the sources of the distance so that the writing tasks could be assessed or modified. Results: The local analysis revealed eight key rhetorical strategies that student writing exhibits between the two first-year writing courses. With this understanding, we then placed the argument writing in global contexts to find that the assignments in both courses differ somewhat from argument writing in other contexts. Upon analyzing this difference, we found that the first-year writing primarily differs in its usage of academic language, the personal register, assertive language, and reasoning. We suggest that these differences stem primarily from the rhetorical situation and learning objectives associated with first-year writing, as well as the sequencing of the courses. Discussion: The three-step method presented provides a means for writing program administrators to describe and analyze writing that students produce in their writing programs. 
We intend these steps to be understood as an iterative process, whereby writing programs can use these results to evaluate what rhetorical skills their students are exhibiting and to benchmark those against the program’s goals and/or other similar writing programs. Conclusions: By presenting these analyses together, we ultimately provide a cohesive method by which to analyze a writing program and benchmark students’ use of rhetorical strategies in relation to other argumentative contexts. We believe this method to be useful not only to individual writing programs, but to assessment literature broadly. In future research, we anticipate learning how this process will practically feed back into pedagogy, as well as understanding what placing writing tasks into a global context can tell us about genre theory.
-
Abstract
Aim: The use of validated measures of writing motivation is imperative to improving our understanding and development of interventions to improve student writing utilizing motivation as a mechanism. One of the most important malleable factors involved in improving student writing is motivation, particularly for secondary school students. This research note systematically examines the measures of writing motivation for students in grades 4–12 used by researchers over the last ten years and summarizes their psychometric and measurement properties to the extent provided in the underlying literature. This collection of measures and their properties and features is designed to make researchers more aware of the various options and to point out the need for additional measures. Problem Formation: Writing is crucial to college and career readiness, but adolescents are inadequately prepared to be proficient writers. Grades 4–12, once students have generally learned the basics of writing, are when students begin to develop more fluent and sophisticated writing abilities. They turn from learning to write to writing to learn, and writing is increasingly done across content areas and in multiple genres. Unfortunately, writing is a difficult skill to master, and students in middle and high school suffer from declining motivation. The ability to measure changes in writing motivation at this developmental stage will allow researchers to more effectively design and assess writing interventions. What are the current, validated measures of writing motivation available for researchers working with adolescents? Motivation research has grown significantly in the last ten years, and a variety of motivation constructs (e.g., self-efficacy, expectancy-value) and related measures are used across the field. 
In addition to the variety of motivation constructs used in research today, researchers require domain- or context-specific measures of motivation (e.g., science motivation) to enable an accurate understanding of the role of motivation in achievement. Despite increased developments in both motivation and writing research over the past few decades, the intersection of these two fields remains relatively unexplored (Boscolo & Hidi, 2007; Troia, Harbaugh, Shankland, Wolbers, & Lawrence, 2013). Information Collection: A thorough literature search was done to find measures of writing motivation used for this age group within the last 10 years. Psychometric properties, to the extent available in the underlying articles, of each measure are described. Conclusions: Ultimately, seven discrete measures of adolescent writing motivation were found, but only limited psychometric details were available for many of the measures. No “gold standard” measure was found; indeed, the measures utilized varied motivational constructs and rarely reported more than the Cronbach’s alpha of the underlying instrument. Researchers need to carefully parse through the related motivation literature to understand the most likely constructs to be implicated in their intervention. They need to consider factors specifically related to their study, such as how stable the construct being targeted is developmentally, whether the term and type of intervention will be sufficient to make an impact on the students’ motivation as suggested by the underlying motivational literature, and what the target of the intervention is. Appropriate motivational constructs to be measured will vary depending on the intervention and its anticipated theory of change. Directions for Further Research: Several underlying motivation constructs have been used in the measures described in this review, particularly self-efficacy.
However, a number of important motivation constructs, such as interest and self-determination theory, were not captured by the measures found. This review of currently available measures will give researchers options when wanting to include validated measures of writing motivation in their studies and suggests that additional, validated measures are needed to adequately cover the relevant motivational constructs.
-
Abstract
Collection and analysis of students' writing samples on a large scale is a part of the research agenda of the emerging writing analytics community that promises to deliver an unprecedented insight into characteristics of student writing. Yet with a large scale often comes variability of contexts in which the samples were produced: different institutions, different purposes of writing, different author demographics, to name just a few possible dimensions of variation. What are the implications of such variation for the ability of automated methods to create indices/features based on the writing samples that would be valid and meaningful? This paper presents a case study in system generalization. Building on a system developed to assess the expression of utility value (a social-psychology-based construct) in essays written by first-year biology students at one postsecondary institution, we vary data parameters and observe system performance. From the point of view of social psychology, all these variants represent the same underlying construct (i.e., utility value), and it is thus very tempting to think that an automatically produced utility-value score could provide a meaningful analytic, consistently, on a large collection of essays. However, findings from this research show that there are challenges: Some variations are easier to deal with than others, and some components of the automated system generalize better than others. The findings are then discussed both in the context of the case study and more generally.
-
Abstract
Background: Over a decade ago, the Stanford Study of Writing (SSW) collected more than 15,000 writing samples from undergraduate students, but to this point the corpus has not been analyzed using computational methods. Through the use of natural language processing (NLP) techniques, this study attempts to reveal underlying structures in the SSW, while at the same time developing a set of interpretable features for computationally understanding student writing. These features fall into three categories: topic-based features that reveal what students are writing about; stance-based features that reveal how students are framing their arguments; and structure-based features that reveal sentence complexity. Using these features, we are able to characterize the development of the SSW participants across four years of undergraduate study, specifically gaining insight into the different trajectories of humanities, social science, and STEM students. While the results are specific to Stanford University’s undergraduate program, they demonstrate that these three categories of features can give insight into how groups of students develop as writers. Literature Review: The Stanford Study of Writing (Lunsford et al., 2008; SSW, 2018) involved the collection of more than 15,000 writing samples from 189 students in the Stanford class of 2005. The literature surrounding the original study is largely qualitative (Fishman, Lunsford, McGregor, & Otuteye, 2005; Lunsford, 2013; Lunsford, Fishman, & Liew, 2013), so this study makes a first attempt at a quantitative analysis of the SSW. When considering the ethics of a computational approach, we find it important not to stray into the territory of writing evaluation, as purely evaluative systems have been shown to have limited instructional use in the classroom (Chen & Cheng, 2008; Weaver, 2006). Therefore, we find it important to take a descriptive, rather than evaluative approach.
All of the features that we extract are both interpretable and grounded in prior research. Topic modeling has been used on undergraduate writing to improve the prediction of neuroticism and depression in college students (Resnik, Garron, & Resnik, 2013), stance markers have been used to show the development of undergraduate writers (Aull & Lancaster, 2014), and parse trees have been used to measure the syntactic complexity of student writing (Lu, 2010). Research Questions: What computational features are useful for analyzing the development of student writers? Based on these features, what insights can we gain into undergraduate writing at Stanford and similar institutions? Methodology: To extract topic features, we use LDA topic modeling (Blei, Ng, & Jordan, 2003) with Gibbs Sampling (Griffiths, 2002). To extract stance features, we replicate the stance markers approach from a past study (Aull & Lancaster, 2014). To describe sentence structure, we use parse trees generated using Shift-Reduce dependency parsing (Sagae & Tsujii, 2008). For each parse tree, we use the tree depth and the average dependency length as heuristics for the syntactic complexity of the sentence. Results: Topic modeling was useful for sorting papers into academic disciplines, as well as for distinguishing between argumentative and personal writing. Stance markers helped us characterize the intersection between the majors that students hold and the topics that they are writing about at a given time. Parse tree complexity demonstrated differences between writing in different disciplines. In addition, we found that students of different disciplines have different syntactic features even during their first year at Stanford. Discussion: Topic modeling has given us a picture of interdisciplinary study at Stanford by showing how often students in the SSW wrote about topics outside their majors.
Furthermore, studying interdisciplinary Stanford students allowed us to examine the intersection of a student’s major and current topic of writing when analyzing the other two sets of features. Stance markers in the SSW show that both field of study and topic of writing influence the ways in which students employ metadiscourse. In addition, when looking at stance across years, we see that Seniors regress towards their First-Year habits. The complexity results raise the question of whether different disciplines have different “ideal” levels of writing complexity. Conclusions: The present study yields insight into undergraduate writing at Stanford in particular. Notably, we find that students develop most as writers during their first two years and that students of different majors develop as writers in different ways. We consider our three categories of features to be useful because they were able to give us these insights into the dataset. We hope that, moving forward, educators will be able to use this kind of analysis to understand how their students are developing as writers.
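The two structure-based heuristics named in this abstract, tree depth and average dependency length, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a parse is represented as a list of 1-based head indices with 0 marking the root, depth is counted in arcs from the root, and the example parse is hand-written rather than produced by a shift-reduce parser.

```python
def tree_depth(heads):
    """Longest path, counted in arcs, from the root to any token.

    heads[i] is the 1-based index of the head of token i+1; 0 marks the root.
    Assumes the heads form a well-formed tree (no cycles).
    """
    def arcs_to_root(token):
        depth = 0
        while heads[token - 1] != 0:
            token = heads[token - 1]
            depth += 1
        return depth

    return max(arcs_to_root(t) for t in range(1, len(heads) + 1))

def avg_dependency_length(heads):
    """Mean linear distance between each non-root token and its head."""
    lengths = [abs(pos - head) for pos, head in enumerate(heads, start=1) if head != 0]
    return sum(lengths) / len(lengths) if lengths else 0.0

# Hand-annotated parse of "The cat sat on the mat" (an assumed analysis):
# The->cat, cat->sat, sat=root, on->sat, the->mat, mat->on
example_heads = [2, 3, 0, 3, 6, 4]
depth = tree_depth(example_heads)
mean_length = avg_dependency_length(example_heads)
```

Whether depth counts arcs or nodes, and whether punctuation is included, are conventions that vary between studies, so absolute values are only comparable within one setup.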
-
Abstract
Background: It is important for developers of automated scoring systems to ensure that their systems are as fair and valid as possible. This commitment means evaluating the performance of these systems in light of construct-irrelevant response strategies. The enhancement of systems to detect and deal with these kinds of strategies is often an iterative process, whereby as new strategies come to light they need to be evaluated and effective mechanisms built into the automated scoring systems to handle them. In this paper, we focus on the Babel system, which automatically generates semantically incohesive essays. We expect that these essays may unfairly receive high scores from automated scoring engines despite essentially being nonsense. Literature Review: We discuss literature related to gaming of automated scoring systems. One reason that Babel essays are so easy to identify as nonsense by human readers is that they lack any semantic cohesion. Therefore, we also discuss some literature related to cohesion and detecting semantic cohesion. Research Questions: This study addressed three research questions: Can we automatically detect essays generated by the Babel system? Can we integrate the detection of Babel-generated essays into an operational automated essay scoring system while making sure not to flag valid student responses? Does a general approach for detecting semantically incohesive essays also detect Babel-generated essays? Research Methodology: This article describes the creation of two corpora necessary to address the research questions: (1) a corpus of Babel-generated essays and (2) a corresponding corpus of good-faith essays. We built a classifier to distinguish Babel-generated essays from good-faith essays and investigated whether the classifier can be integrated into an automated scoring engine without adverse effects.
We also developed a measure of lexical-semantic cohesion and examined its distribution in Babel and in good-faith essays. Results: We found that the classifier built on Babel-generated essays and good-faith essays and using features from the automated scoring engine can distinguish the Babel-generated essays from the good-faith ones with 100% accuracy. We also found that if we integrated this classifier into the automated scoring engine it flagged very few responses that were submitted as part of operational submissions (76 of 434,656). The responses that were flagged had previously been assigned a score of Null (non-scorable) or a score of 1 by human experts. The measure of lexical-semantic cohesion shows promise in being able to distinguish Babel-generated essays from good-faith essays. Conclusions: Our results show that it is possible to detect the kind of gaming strategy illustrated by the Babel system and add it to an automated scoring engine without adverse effects on essays seen during real high-stakes tests. We also show that a measure of lexical-semantic cohesion can separate Babel-generated essays from good-faith essays to a certain degree, depending on task. This points to future work that would generalize the capability to detect semantic incoherence in essays. Directions for Further Research: Babel-generated essays can be identified and flagged by an automated scoring system without any adverse effects on a large set of good-faith essays. However, this is just one type of gaming strategy. It is important for developers of automated scoring systems to continue to be diligent about expanding the construct coverage of their systems in order to prevent weaknesses that can be exploited by tools such as Babel. It is also important to focus on the underlying linguistic reasons that lead to nonsense sentences. Successful identification of such nonsense would lead to improved automated scoring and feedback.
-
Abstract
Background: The researchers conducted a corpus analysis of 548 research-based argument essays, totalling 1,465,091 words, written by first-year students at The City College of New York (CCNY). The purpose of this study was to better understand the ways in which CCNY students were constructing arguments in research essays in order to better support our instruction of the research essay. Curricular guidelines for the research assignment are general. Instructors are directed to require a research-based, persuasive argument that includes conflicting points of view. Model assignment sheets are provided to instructors, but they are free to write their own. Assignment sheets are not collected or approved. In the fall semester in which this corpus was collected, over 70 part-time instructors taught approximately 120 sections of the first- or second-semester composition course. Literature Review: The study of The City College of New York Corpus (CCNYC) partially replicates and relies on the analysis of three corpora of academic writing conducted by Zak Lancaster (2016a) in his examination of Gerald Graff’s and Cathy Birkenstein’s textbook They Say/I Say: The Moves that Matter in Academic Writing (2014). The current study also compares the CCNYC findings to studies of the frequency of stance and voice markers conducted by Ken Hyland (2012) and Ellen Barton (1993) and suggests the classroom use of corpus analysis as described by Raith Abid and Shakila Manan (2015) and Maggie Charles (2007). Research Questions: The study was guided by a narrowly focused interest in learning whether or not the CCNYC would demonstrate the range and distribution of rhetorical moves that Lancaster found in his study of academic writing (2016a). The analysis of the corpus consists of frequency counts; we did not conduct other statistical analyses.
Since we had little prior experience with corpus analysis, we wondered what would be revealed about students’ writing practices by a partial replication of Lancaster’s study. We did not reproduce Lancaster’s analysis but relied on his published results. This study served as an assessment tool, providing a microscopic view of a limited number of rhetorical moves across a large corpus of student essays. As a result of our study, we hoped to be able to create assignments for research essays that responded directly to the patterns that we saw in our students’ essays. Methodology: Modeled on Lancaster’s study and the templates of rhetorical moves offered by Graff and Birkenstein, concordances of terms used to introduce objections, offer concessions, and make counterarguments were drawn from the CCNYC and then analyzed to confirm that the rhetorical form was in fact functioning as one of the above rhetorical moves within the context of the essay in which it was found. Results: Our study demonstrates that CCNY students use fewer linguistic resources than their peers at other institutions, a finding that helps shape faculty development seminars. The corpus analysis reveals that while CCNY students introduce objections to their arguments at about the same rates as in other corpora, they are less likely to concede to those objections. In addition, when students made counterarguments, they used only a limited range of the linguistic resources available to them. Conclusions: The low rate of engagement with opposing points of view and the limited use of linguistic resources for counterarguments both suggest the potential value of focused, corpus-based instruction.
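The concordance step described in the Methodology can be illustrated with a minimal keyword-in-context (KWIC) sketch. This is not the authors' tooling: the function name `concordance`, the sample sentence, and the search phrase are all hypothetical, and the abstract's second step (manually confirming that each hit actually functions as the rhetorical move) is not automated here.

```python
import re


def concordance(text: str, phrase: str, width: int = 40) -> list[str]:
    """Return one keyword-in-context line per case-insensitive match of `phrase`.

    Each line shows up to `width` characters of left and right context,
    with the matched text in square brackets.
    """
    lines = []
    for match in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Hypothetical sample text and phrase, for illustration only.
sample = (
    "Admittedly, the policy has costs. Some critics may object that it is "
    "unfair, but admittedly few alternatives exist."
)
for line in concordance(sample, "admittedly", width=30):
    print(line)
```

Each returned line would still need to be read in context by a human coder, since a matching phrase does not guarantee that the writer is conceding or objecting.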
-
Abstract
Background: Employing natural language processing and latent semantic analysis, the current work was completed as a constituent part of a larger research project for designing and launching artificial intelligence in the form of deep artificial neural networks. The models were evaluated on a proprietary corpus retrieved from a data warehouse, where it was extracted from MyReviewers, a sophisticated web application purposed for peer review in written communication, which was actively used in several higher education institutions. The corpus of laboratory reports in STEM annotated by instructors and students was used to train the models. Under the Common Rule, research ethics were ensured by protecting the privacy of subjects and maintaining the confidentiality of data, which mandated corpus de-identification. Literature Review: De-identification and pseudonymization of textual data has remained an actively studied research question for several decades. Its importance is stipulated by numerous laws and regulations in the United States and internationally, such as the HIPAA Privacy Rule and FERPA. Research Question: Text de-identification requires a significant amount of manual post-processing for eliminating faculty and student names. This work investigated automated and semi-automated methods for de-identifying student and faculty entities while preserving author names in cited sources and reference lists. It was hypothesized that a natural language processing toolkit and an artificial neural network model with named entity recognition capabilities would facilitate text processing and reduce the amount of manual labor required for post-processing after matching essays to a list of users’ names. The suggested techniques were applied with supplied pre-trained models without additional tagging and training.
The goal of the study was to evaluate three approaches and find the most efficient one among those using a users’ list, a named entity recognition toolkit, and an artificial neural network. Research Methodology: The current work studied de-identification of STEM laboratory reports and evaluated the performance of the three techniques: brute-force search with a user list, named entity recognition with the OpenNLP machine learning toolkit, and NeuroNER, an artificial neural network for named entity recognition built on the TensorFlow platform. The complexity of the given task was determined by a dilemma: names belonging to students, instructors, or teaching assistants must be removed, while the rest of the names (e.g., authors of referenced papers) must be preserved. Results: The evaluation of the three selected methods demonstrated that automating de-identification of STEM lab reports is not possible in a setting where named entity recognition methods are employed with pre-trained models. The highest results were achieved by the users’ list technique with 0.79 precision, 0.75 recall, and 0.77 F1 measure, which significantly outperformed OpenNLP with 0.06 precision, 0.14 recall, and 0.09 F1, and NeuroNER with 0.14 precision, 0.56 recall, and 0.23 F1. Discussion: Low performance of OpenNLP and NeuroNER toolkits was explained by the complexity of the task and unattainability of customized models due to imposed time constraints. An approach for masking possible de-identification errors is suggested. Conclusion: Unlike multiple cases described in the related work, de-identification of laboratory reports in STEM remained a non-trivial labor-intensive task.
Applied out of the box, a machine learning toolkit and an artificial neural network technique did not enhance performance of the brute-force approach based on user list matching. Directions for Future Research: Customized tagging and training on the STEM corpus were presumed to advance outcomes of machine learning and predominantly artificial intelligence methods. Application of other natural language toolkits may lead to deducing a more effective solution.
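For readers comparing the three reported results, the F1 measure is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded precision and recall values quoted in the abstract; the function name `f1_score` is an illustrative choice, not something taken from the study. Because the inputs are already rounded to two decimals, the recomputed values can differ from the published F1 figures in the second decimal place.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall) pairs as quoted in the abstract.
reported = {
    "users' list": (0.79, 0.75),
    "OpenNLP": (0.06, 0.14),
    "NeuroNER": (0.14, 0.56),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.3f}")
```

The harmonic mean penalizes imbalance, which is why NeuroNER's comparatively high recall does little to offset its low precision.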
January 2017
-
Statistical and Qualitative Analyses of Students’ Answers to a Constructed Response Test of Science Inquiry Knowledge ↗
Abstract
Objective: We report on a comparative study of the language used by middle school students in their answers to a constructed response test of science inquiry knowledge. Background: Text analyses using statistical models have been conducted across a number of disciplines to identify topics in a journal, to extract topics in Twitter messages, and to investigate political preferences. In education, relatively few studies have analyzed the text of students’ written answers to investigate topics underlying the answers. Methodology: Two types of linguistic analysis were compared to investigate their utility in understanding students’ learning of scientific investigation practices. A statistical method, latent Dirichlet allocation (LDA), was used to extract topics from the texts of student responses. In the LDA model, topics are viewed as multinomial distributions over the vocabulary of documents. These topics were examined for content and used to characterize student responses on the constructed response items. The change from pre-test to post-test in proportions of use of each of the topics was related to students’ learning. Next, a qualitative method, systemic functional linguistic (SFL) analysis, was used to analyze the text of student responses on the same test of science inquiry knowledge. Student assessments were analyzed for two linguistic features that are important for convincing scientific communication: technical vocabulary usage and high lexical density. In this way, we investigated whether human judgement regarding the changes observed from texts based on the SFL framework agreed with the inference regarding the changes observed from the texts through LDA. Research questions: Two research questions were investigated in this study: (1) What do the LDA and SFL analyses tell us about students’ answers? (2) What are the similarities and differences of the two analyses? 
Data: The data for this study were taken from an NSF-funded host study on teaching science inquiry skills to middle school students who were a mix of both native English speakers and English-language learners. The primary objective was to enable participants to learn to take ownership of scientific language through the use of language-rich science investigation practices. The LDA analysis used a sample of 252 students’ pre- and post-assessments. The SFL analysis used a second sample of 90 students’ pre- and post-assessments. Results: In the LDA analysis, three topics were detected in student responses: “preponderance of everyday language (Topic 1),” “preponderance of general academic language (Topic 2),” and “preponderance of discipline-specific language (Topic 3).” Students’ use of topics changed from pre-test to post-test. Students on the post-test tended to have higher proportions of Topic 3 than students on the pre-test. In the SFL analysis, students tended to use more technical vocabulary and have higher lexical density in their written responses on the post-test than on the pre-test. Discussion: Results from the LDA and SFL analyses suggest that students responded using more discipline-specific language on the post-test than on the pre-test. In addition, the results of the two linguistic features from the SFL analysis, technical vocabulary usage and lexical density, were compared with the results from the LDA analysis. Conclusion: Results of the LDA and SFL analyses were consistent with each other and clearly showed that students improved in their ability to use the discipline-specific and academic terminology of the language of scientific communication.
-
Applying Natural Language Processing Tools to a Student Academic Writing Corpus: How Large are Disciplinary Differences Across Science and Engineering Fields? ↗
Abstract
Background: Researchers have been working towards better understanding differences in professional disciplinary writing (e.g., Ewer & Latorre, 1969; Hu & Cao, 2015; Hyland, 2002; Hyland & Tse, 2007) for decades. Recently, research has taken important steps towards understanding disciplinary variation in student writing. Much of this research is corpus-based and focuses on lexico-grammatical features in student writing as captured in the British Academic Written English (BAWE) corpus and the Michigan Corpus of Upper-level Student Papers (MICUSP). The present study extends this work by analyzing lexical and cohesion differences among disciplines in MICUSP. Critically, we analyze not only linguistic differences in macro-disciplines (science and engineering), but also in micro-disciplines within these macro-disciplines (biology, physics, industrial engineering, and mechanical engineering). Literature Review: Hardy and Römer (2013) used a multidimensional analysis to investigate linguistic differences across four macro-disciplines represented in MICUSP. Durrant (2014, in press) analyzed vocabulary in texts produced by student writers in the BAWE corpus by discipline and level (year) and disciplinary differences in lexical bundles. Ward (2007) examined lexical differences within micro-disciplines of a single discipline. Research Questions: The research questions that guide this study are as follows: 1. Are there significant lexical and cohesive differences between science and engineering student writing? 2. Are there significant lexical and cohesive differences between micro-disciplines within science and engineering student writing? Research Methodology: To address the research questions, student-produced science and engineering texts from MICUSP were analyzed with regard to lexical sophistication and textual features of cohesion.
Specifically, 22 indices of lexical sophistication calculated by the Tool for the Automatic Analysis of Lexical Sophistication (TAALES; Kyle & Crossley, 2015) and 38 cohesion indices calculated by the Tool for the Automatic Analysis of Cohesion (TAACO; Crossley, Kyle, & McNamara, 2016) were used. These features were then compared both across science and engineering texts (addressing Research Question 1) and across micro-disciplines within science and engineering (biology and physics, industrial and mechanical engineering) using discriminant function analyses (DFA). Results: The DFAs revealed significant linguistic differences, not only between student writing in the two macro-disciplines but also between the micro-disciplines. Differences in classification accuracy based on students’ years of study hovered at about 10%. An analysis of accuracies of classification by paper type found they were similar for larger and smaller sample sizes, providing some indication that paper type was not a confounding variable in classification accuracy. Discussion: The findings provide strong support that macro-disciplinary and micro-disciplinary differences exist in student writing in these MICUSP samples and that these differences are likely not related to student level or paper type. These findings have important implications for understanding disciplinary differences. First, they confirm previous research that found the vocabulary used by different macro-disciplines to be “strikingly diverse” (Durrant, 2015), but they also show a remarkable diversity of cohesion features. The findings suggest that the common understanding of the STEM disciplines as “close” bears reconsideration in linguistic terms. Second, the lexical and cohesion differences between micro-disciplines are large enough and consistent enough to suggest that each micro-discipline can be thought of as containing a unique linguistic profile of features.
Third, the differences discerned in the NLP analysis are evident at least as early as the final year of undergraduate study, suggesting that students at this level already have a solid understanding of the conventions of the disciplines of which they are aspiring to be members. Moreover, the differences are relatively homogeneous across levels, which confirms findings by Durrant (2015) but, importantly, extends these findings to include cohesion markers. Conclusions: The findings from this study provide evidence that macro-disciplinary and micro-disciplinary differences at the linguistic level exist in student writing, not only in lexical use but also in text cohesion. A number of pedagogical applications of writing analytics are proposed based on the reported findings from TAALES and TAACO. Further studies using different corpora (e.g., BAWE) or purpose-assembled corpora are suggested to address limitations in the size and range of text types found within MICUSP. This study also points the way toward studies of disciplinary differences using NLP approaches that capture data which goes beyond the lexical and cohesive features of text, including the use of part-of-speech tags, syntactic parsing, indices related to syntactic complexity and similarity, rhetorical features, or more advanced cohesion metrics (latent semantic analysis, latent Dirichlet allocation, Word2Vec approaches).
-
Abstract
Aim: This research note narrates existing and continuing potential crossover between the digital humanities and writing studies. I identify synergies between the two fields’ methodologies and categorize current research in terms of four permutations, or “valences,” of the phrase “writing analytics.” These valences include analytics of writing, writing of analytics, writing as analytics, and analytics as writing. I bring recent work in the two fields together under these common labels, with the goal of building strategic alliances between them rather than to delimit or be comprehensive. I offer the valences as one heuristic for establishing connections and distinctions between two fields engaged in complementary work without firm or definitive discursive borders. Writing analytics might provide a disciplinary ground that incorporates and coheres work from these different domains. I further hope to locate the areas in which my current research in digital humanities, grounded in archival studies, might most shape writing analytics. Problem Formation: Digital humanities and writing studies are two fields in which scholars are performing massive data analysis research projects, including those in which data are writing or metadata that accompanies writing. There is an emerging environment in the Modern Language Association friendly to crossover between the humanities and writing studies, especially in work that involves digital methods and media. Writing analytics accordingly hopes to find common disciplinary ground with digital humanities, with the goal of benefitting from and contributing to conversations about the ethical application of digital methods to its research questions. Recent work to bridge digital humanities and writing studies more broadly has unfortunately focused more on territorial and usability concerns than on identifying resonances between the fields’ methodological and ethical commitments.
Information Collection: I draw from a history of meta-academic literature in digital humanities and writing studies to review their shared methodological commitments, particularly in literature that recognizes and responds to pushback against the fields’ ostensible use of extra-disciplinary methods. I then turn to current research in both fields that uses and critiques computational techniques, which is most relevant to writing analytics’ articulated focus on massive data analysis. I provide a more detailed explanation, drawing from my categorization of this work, of the conversations in digital humanities surrounding the digital archives that enable data analysis. Conclusions: A review of past and current research in digital humanities and writing studies reveals shared attention to techniques for tokenizing texts at different scales for analysis, which is made possible by the curation of large corpora. Both fields are writing new genres to compose this analysis. In these genres, both fields emphasize process in their provisional work, which is sociocognitively repurposed in different rhetorical contexts. Finally, both fields recognize that the analytical methods they employ are themselves modes of composition and argumentation. An ethics of data transformation present in digital humanities, however, is largely absent from writing studies. This ethics comes to digital humanities from the influence of textual studies and archival studies. Further research in writing analytics might benefit from reframing writing corpora as archives—what Paul Fyfe (2017) calls a shift from “data mining” to “data archaeology”—in its analyses. This is especially true for analyses of text, which in particular foreground writing and analysis of writing as acts of transformation. 
Directions for Further Research: I recommend that future efforts to find crossover between digital humanities and writing studies do so by identifying their common values rather than trying to co-opt language and spaces or engaging in broad definitional work. I further provide a set of guiding principles that writing analytics might follow in order to pursue research that draws upon and contributes to both digital humanities and writing studies. These research projects might consider and account for the silences of writing corpora—unseen versions of documents, and documents’ elements not described in structured data—while attending to the silences that these efforts might in turn (re)produce.
-
Abstract
Background: Recent years have seen a shift of focus in the development of automated essay scoring (AES) systems, from merely assigning a holistic score to an essay to providing constructive feedback on it. Despite all the major advances in the domain, many objections persist concerning their credibility and readiness to replace human scoring in high-stakes writing assessments. The purpose of this study is to shed light on how to build a relatively simple AES system based on five baseline writing features. The study shows that the proposed AES system compares very well with other state-of-the-art systems despite its obvious limitations. Literature Review: In 2012, ASAP (Automated Student Assessment Prize) launched a demonstration to benchmark the performance of state-of-the-art AES systems using eight hand-graded essay datasets originating from state writing assessments. These datasets are still used today to measure the accuracy of new AES systems. Recently, Zupanc and Bosnic (2017) developed and evaluated another state-of-the-art AES system, called SAGE, which included new semantic and consistency features and provided automatic semantic feedback for the first time. SAGE’s agreement level between machine and human scores for ASAP dataset #8 (the dataset also of interest in this study), measured by quadratic weighted kappa, was 0.81, whereas it ranged between 0.60 and 0.73 for 10 other state-of-the-art systems (Chen et al., 2012; Shermis, 2014). Finally, this section discusses the limitations of AES, which come mainly from its failure to assess the higher-order thinking skills that all writing constructs are ultimately designed to assess.
Research Questions: The research questions that guide this study are as follows: RQ1: What is the power of the writing analytics tool’s five-variable model (spelling accuracy, grammatical accuracy, semantic similarity, connectivity, lexical diversity) to predict the holistic scores of Grade 10 narrative essays (ASAP dataset #8)? RQ2: What is the agreement level between the computer rater based on the regression model obtained in RQ1 and the human raters who scored the 723 narrative essays written by Grade 10 students (ASAP dataset #8)? Methodology: ASAP dataset #8 was used to train the predictive model of the writing analytics tool introduced in this study. Each essay was graded by two teachers. In case of disagreement between the two raters, the scoring was resolved by a third rater. Basically, essay scores were the weighted sums of four rubric scores. A multiple linear regression analysis was conducted to determine the extent to which a five-variable model (selected from a set of 86 writing features) was effective to predict essay scores. Results: The regression model in this study accounted for 57% of the essay score variability. The correlation (Pearson), the percentage of perfect matches, the percentage of adjacent matches (±2), and the quadratic weighted kappa between the resolved scores and predicted essay scores were 0.76, 10%, 49%, and 0.73, respectively. The results were measured on an integer scale of resolved essay scores between 10-60. Discussion: When measuring the accuracy of an AES system, it is important to take into account several metrics to better understand how predicted essay scores are distributed along the distribution of human scores. Using average ranking over correlation, exact/adjacent agreement, quadratic weighted kappa, and distributional characteristics such as standard deviation and mean, this study’s regression model ranks 4th out of 10 AES systems. 
Despite its relatively good rank, the predictions of the proposed AES system remain imprecise and do not even look optimal to identify poor-quality essays (binary condition) smaller than or equal to a 65% threshold (71% precision and 92% recall). Conclusions: This study sheds light on the implementation process and the evaluation of a new simple AES system comparable to the state of the art and reveals that the generally obscure state-of-the-art AES system is most likely concerned only with shallow assessment of text production features. Consequently, the authors advocate greater transparency in the development and publication of AES systems. In addition, the relationship between the explanation of essay score variability and the inter-rater agreement level should be further investigated to better represent the changes in terms of level of agreement when a new variable is added to a regression model. This study should also be replicated at a larger scale in several different writing settings for more robust results.
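Quadratic weighted kappa, the agreement statistic reported in this abstract, can be sketched in a few lines. The implementation below follows the standard definition (one minus the ratio of weighted observed disagreement to weighted chance-expected disagreement, with weights growing as the square of the score difference). It is an illustrative sketch rather than the authors' code; the function name and the toy scores are hypothetical.

```python
from collections import Counter


def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Quadratic weighted kappa between two equal-length lists of integer scores."""
    if len(human) != len(machine) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    if max_score <= min_score:
        raise ValueError("max_score must be greater than min_score")

    total = len(human)
    span = max_score - min_score
    observed = Counter(zip(human, machine))  # joint counts of (human, machine) pairs
    human_hist = Counter(human)              # marginal counts per rater
    machine_hist = Counter(machine)

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            weight = (i - j) ** 2 / span ** 2
            weighted_observed += weight * observed[(i, j)]
            # Expected count under independence of the two raters' marginals.
            weighted_expected += weight * human_hist[i] * machine_hist[j] / total

    if weighted_expected == 0:
        # Both raters gave the same single score to every essay.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical toy scores on the 10-60 scale mentioned in the abstract.
human_scores = [30, 40, 45, 50, 35, 20]
machine_scores = [32, 38, 45, 47, 36, 25]
print(round(quadratic_weighted_kappa(human_scores, machine_scores, 10, 60), 3))
```

Because the weights are quadratic, a prediction that misses by ten points is penalized far more heavily than ten predictions that each miss by one, which is one reason the abstract recommends reading kappa alongside exact and adjacent agreement rates.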
-
Abstract
Aim: This research note focuses on some of the consequences of big data as an emerging methodology. Its purpose is to provide a brief literature review of the method’s development and some of the critical questions researchers should consider as they move forward. Salvo (2012) contends that big data as a form of design of communication itself “is necessarily a rhetorically-based field” (p. 38). With big data as an up-and-coming methodology (McNely, 2012; Salvo, 2012), using caution in its application is a necessity for scholars. Not only should researchers seek out the unseen and untapped applications of big data, but they should learn its limitations as well (Spinuzzi, 2009). You adopt a methodology, you adopt its flaws. Problem Formation: This section identifies a gap in the field as it relates to some of the consequences of applying big data as a methodology and seeing it as a rhetorical tool. As big data gains steam in the field of humanities, some are sure to question what they see as a flaw: the act of quantifying language. This argument is not new, nor is its rebuttal. Harris (1954) discusses the distributional structure of language with each part of a sentence acting as co-occurrents, each in a particular position, and each with a relationship to the other co-occurrents (p. 146). Salvo (2012) argues that the combination of these new methodologies and technologies “knits together invention, arrangement, style, memory, and delivery in ways that challenge conceptions of print based literacy and textuality” (p. 39). While big data itself has several rhetorical methodologies embedded within, deciding which one to use depends on the amount of data and how it’s aggregated. Information Collection: As described above, this research note functions primarily as a brief review of literature. This section focuses on how writing analytics developed from content analysis in mass communications and shifted into latent semantic analysis assisted by computer technology.
Riffe, Lacy, & Fico (1995) offer a clear explanation of content analysis, which was developed with comparably small data sets in mind: “Usually, but not always, content analysis involves drawing representative samples of content, training coders to use the category rules developed to measure or reflect differences in content, and measuring reliability (agreement or stability over time) of coders applying the rules” (p. 2). Finding a representative sample of content was once a more feasible methodology, but in the digital age the amount of content increases exponentially every day. Conclusions: As latent semantic analysis is an extension of quantitative content analysis (and vice versa), and knowing that an adopted methodology carries adopted flaws, it makes sense to turn to some of the concerns voiced by mass communication scholars in order to understand limitations. As quantitative content analysis grew in popularity in mass communication, its methods were also refined. Reporting the reliability of a study adds credibility to the study itself, and when a human coder is involved, the reporting of this intercoder reliability becomes imperative (Hayes & Krippendorff, 2007; Krippendorff, 2008, 2011). While intercoder reliability measures the degree to which coders agree, researchers should also be keenly aware of the theory and valence informing their study, which impacts their coders, which ultimately impacts the results of the study itself. Directions for Further Research: As the field of writing studies begins to adopt big data methodologies, researchers must continue to challenge and question their applications, implementations, and implications, turning to familiar questions from our own fields. Big data is exciting and new, but it’s not the methodology to explain it all. It’s just as rhetorical as every other methodology; it’s just better at hiding it.
-
Abstract
Aim: The screencast (SC), a 21st century analytics tool, enables the simultaneous recording of audio and video feedback on any digital document, image, or website, and may be used to enhance feedback systems in many educational settings. Although previous findings show that students and teachers have had positive experiences with recorded commentary, this method is still rarely used by teachers in composition classrooms. There are many possible reasons for this, some of which include the accelerated pace at which classroom technology has changed over the past decade, concerns over privacy when new technologies are integrated into the classroom, and the general unease instructors may feel when asked to integrate a new technology system into their established composition pedagogy and response routine. The aim of this study was to replicate previous findings in favor of SC feedback and expand that body of research beyond instructor-to-student SC interactions and into the realm of SC-mediated peer review. Thus, this study seeks to improve on the widespread written peer review practices most common among writing instruction today, practices that tend to produce mediocre learning outcomes and fail to capitalize on 21st century technological innovations to enhance student learning. This research note demonstrates the validity of SC as a valuable writing analytics research tool that has the potential to collect and measure student learning. It also seeks to inspire those who have been reluctant to adopt SC in both digital learning and face-to-face educational environments by providing pragmatic guidance for doing so in ways that simultaneously increase student learning and facilitate a more rigorous and discursive peer-to-peer review process. Problem Formation: While research suggests positive student perceptions related to screencast instructor response, results in peer-to-peer screencast response are mixed. 
After several successful years of experience in instructor-to-student SC feedback, the author wondered what would happen if she asked students to use screencast technology to mediate peer review. How might students’ attitudes and perceptions impact the use of peer-to-peer screencast technology in the composition classroom? In order to address these questions, the author developed a survey measuring the user reliability of this new SC technology and the student affect and revision initiative it produces. Information Collection: This study extends Anson’s (2016) research and insights by reporting findings from a study of 138 writing students. Survey data was collected during the 2015-2016 academic year at three institutions. At High Point University, the author of this research note asked freshmen composition students in a traditional face-to-face lecture course to conduct a series of peer review sessions (including both traditional written comments and SC comments) over a 16-week semester. Students were surveyed after each peer review experience, and the results form the foundation of this research note’s conclusions. In addition to survey responses, researchers also collected the screencasts exchanged among peer-to-peer interactions within each educational setting. Conclusions: The author provides an in-depth analysis of students’ experiences, perceptions, and attitudes toward giving and receiving screencast feedback, focusing on the impact of this method on student revision initiative in comparison to that of a traditional written feedback system. Some conclusions are also drawn regarding the user reliability and effectiveness of the screencast technology, specifically the free software program known as Jing, a product available through Techsmith.com that enables a streamlined and user-friendly SC interface and cloud storage of all SC recordings through individualized hyperlinks, thereby alleviating concerns regarding student privacy. 
Directions for Further Research: While this research note provides compelling evidence to support the use of SC in composition classrooms, there are also many opportunities for continued study, particularly within the emerging field of writing analytics. While the actual student-to-student screencasts were collected in this study, they were not analyzed as a qualitative data set, and the researchers relied on self-reported survey data to assess the degree of revision initiative among the students surveyed. The screencasts themselves offer a treasure trove of data, should the researcher have the capability to code that data set or utilize automated natural language processing programs in the future. Perhaps this peer-to-peer SC feedback could be compared to similar corpus analyses of instructor-to-student feedback gathered by other writing analytics scholars. In addition, further research in this area could also collect the student writing itself and track revisions made by students after receiving SC feedback and traditional written feedback from their peers. In this way, researchers would be able to make comparisons between the actual changes made by the student writers, the extent of those changes (surface-level or higher-order revisions), and the student’s perceived degree of revision initiative reported in the survey. To facilitate future research in this area, the author has included teaching resources for those new to screencast technology and analytics.
-
Abstract
Background: Contemporary research in composition studies emphasizes the constitutive power of genres. It also highlights the prevalence of the most common genre in students’ transition into advanced college writing, the argumentative essay. Consistent with most research in composition, and therefore most studies of general, first-year college writing, such research has primarily emphasized genre context. Other research, in international applied linguistics research and particularly English for Academic Purposes (EAP), has focused less on first-year writers but has likewise shown the frequent use of argumentative essays in undergraduate writing. Together, these studies suggest that the argumentative essay is represented more than other genres in early college writing development, and that any given genre favors particular discourse features in contrast with other genres students might write. A productive next step, but one not yet realized, is to bring these discussions together, in research that uses context-informed corpus analysis that investigates students’ assignment contexts and analyzes the discourse that characterizes the tasks and genres students write. This study offers an exploratory, context-informed analysis of argumentative and explanatory writing by first-year college writers. Based on the corpus findings, the article underscores discourse as an integral part of the sociocognitive practices embedded in genres, and accordingly considers new ways to conceptualize student writing genres and to inform instruction and assignment design. Research questions: Four questions guided the inquiry: What are the key discursive practices associated with annotated bibliographies and argumentative essays written by the same students in the same course? What are the key discursive practices associated with visual analyses and argumentative essays written by the same students in the same course? 
What are the key discursive practices associated with the two argumentative tasks in comparison with the two explanatory tasks? Finally, how might corpus-based findings inform the design of particular assignment tasks and genres in light of a range of writing goals? Methodology: The article outlines a context-informed corpus analysis of lexical and grammatical keywords in part-of-speech tagged writing by first-year college students across courses at a U.S. institution. Using information from assignment descriptions and rubrics, the study considers four projects that also represent two macro-genres: an annotated bibliography and a visual analysis, both part of the explanatory macro-genre, and two argumentative essays, both part of the argumentative macro-genre. Results: The corpus analysis identifies lexical and grammatical keywords in each of the four tasks as well as in the macro-genres of argumentative versus explanatory writing. These include generalized, interpersonal, and persuasive discourse in argumentative essays versus more specified, informational, and elaborated discourse in explanatory writing, regardless of course or task. Based on these findings, the article discusses the discursive practices prioritized in each task and each macro-genre. Conclusions: The findings, based on key discourse patterns in tasks within the same course and in macro-genres across courses, pose important questions regarding writing task design and students’ adaptation to different genres. The macro-genre keywords specifically inform exploratory sociocognitive “profiles” of argumentative and explanatory tasks, offered in the final section. These argument and explanation profiles strive to account for discourse patterns, genre networks, and purposes and processes—in other words, multiple aspects of habituated thinking and writing practices entailed in each one relative to the other. 
As discussed in the conclusion, the profiles aim to (1) underscore discourse patterns as integral to the work of genres, (2) highlight adaptive discourse strategies as part of students’ meta-language for writing, and (3) identify multiple, macro-level (e.g., audience), meso-level (paragraph- and section-level), and micro-level (e.g., discourse patterns) aspects of genres to help instructors identify and specify multiple goals for writing assignments.
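The keyword analysis described in this abstract compares how often a word or tag occurs in one set of texts relative to another. The abstract does not name the keyness statistic used, so the sketch below assumes the log-likelihood measure that is common in corpus linguistics. The function name and all counts are hypothetical and are not taken from the study.

```python
from math import log

def log_likelihood_keyness(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness of one word in a target corpus vs. a reference corpus.

    freq_* are raw counts of the word; size_* are total token counts.
    Assumption: this statistic is an illustrative choice, not the one
    confirmed by the abstract.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref
    # Expected counts if the word were spread evenly across both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    def term(observed, expected):
        # A zero count contributes nothing (limit of x * log x as x -> 0).
        return observed * log(observed / expected) if observed > 0 else 0.0

    return 2 * (term(freq_target, expected_target) + term(freq_ref, expected_ref))

# Invented counts: a word appearing 30 times in 1,000 tokens of argumentative
# writing and 10 times in 1,000 tokens of explanatory writing.
score = log_likelihood_keyness(30, 1000, 10, 1000)
# The same relative frequency in both corpora gives a score of zero.
no_difference = log_likelihood_keyness(10, 1000, 20, 2000)
```

A higher score means the frequency difference is less likely to be due to chance. The score alone does not say which corpus overuses the word, so the relative frequencies still need to be compared.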
-
Abstract
Technique Identification: A new graphical technique is presented for visualizing and assessing inter-rater agreement in discrete ordinal or categorical data, such as rubric ratings. To that aim, a chance-corrected Kappa with two new features is derived. First, it is based on interpreting ratings for each subject as vectors to visualize the data. This is done by creating two-dimensional vectors from a subject-rating summary table, sorting the vectors by their slopes, and plotting them in that order to create a trajectory that displays all the data in context. Second, it presents a graph and accompanying statistics (Kappa, p-value) for each pair of ratings in an organized display so that all useful comparisons of the data are visually displayed and statistically assessed. This information is presented on a logical grid, usually called facets. Kappa is calculated in the usual way, by referencing the actual results with an average of random rating assignments. This average becomes a reference line on each graph as a visual cue, as well. The statistical basis for the Kappa and significance testing are derived, and the test assumptions are specified. Value Contribution: The most commonly used statistics for inter-rater agreement, such as the Cohen Kappa or Inter-Class Correlation, give only a single parameter estimate of reliability from which to make judgments about ratings data. The technique presented here constructs graphs of all the data that allow visual inspection of the ratings versus a reference curve that represents chance-matching. The detailed reports on inter-rater agreement can show how to fine-tune ratings systems, such as understanding which parts of an ordinal scale are working best. This solves a practical problem for researchers who rely on rating-type classification by revealing which overall aspects of the rating system need to be improved and adds to the list of tools available for assessing rating reliability.
In creating this approach to analysis of rater data, human usability is emphasized. Specifically, the use of geometry is designed to facilitate interpretability rather than being a mathematical derivation from first principles. Technique Application: Two applications are given, both involving social meaning-making. The first uses data from wine-judging to illustrate how the method can illuminate expertise in that domain. The results reproduce published findings that were based on a classical statistical method. A second sample application uses data from a university assessment of student writing in which ratings on a developmental scale are assigned by course instructors to their students. The rating program is an example of social meaning-making that can be used to generate larger data sets than are typical for classroom-based assessment programs. The analysis shows the strengths and weaknesses of the rating system in terms of reliability and demonstrates how that knowledge leads to improvements in assessment. Directions for Further Research: An argument is made for a public library of inter-rater data for empirical use by researchers. The social aspects of rating are discussed, and there is an illustration of the potential to derive new measures of inter-rater agreement from the meaning-making program that produces the data.
-
Abstract
Background: This study focuses on construct representation and inter-reader agreement and reliability in ePortfolio assessment of 1,315 writing portfolios. These portfolios were submitted by undergraduates enrolled in required writing seminars at the University of Pennsylvania (Penn) in the fall of 2014. Penn is an Ivy League university with a diverse student population, half of whom identify as students of color. Over half of Penn’s students are women, 12% are international, and 12% are first-generation college students. The students’ portfolios are scored by the instructor and an outside reader drawn from a writing-in-the-disciplines faculty who represent 24 disciplines. The portfolios are the product of a shared curriculum that uses formative assessment and a program-wide multiple-trait rubric. The study contributes to scholarship on the inter-reader reliability and validity of multiple-trait portfolio assessments as well as to recent discussions about reconceptualizing evidence in ePortfolio assessment. Research Questions: Four questions guided our study: What levels of interrater agreement and reliability can be achieved when assessing complex writing performances that a) contain several different documents to be assessed; b) use a construct-based, multi-trait rubric; c) are designed for formative assessment rather than testing; and d) are rated by a multidisciplinary writing faculty? What can be learned from assessing agreement and reliability of individual traits? How might these measurements contribute to curriculum design, teacher development, and student learning? How might these findings contribute to research on fairness, reliability, and validity; rubrics; and multidisciplinary writing assessment? Literature Review: There is a long history of empirical work exploring the reliability of scoring highly controlled timed writings, particularly by test measurement specialists. 
However, until quite recently, there have been few instances of applying empirical assessment techniques to writing portfolios. Developed by writing theorists, writing portfolios contain multiple documents and genres and are produced and assessed under conditions significantly different from those of timed essay measurement. Interrater reliability can be affected by the different approaches to reading texts depending on the background, training, and goals of the rater. While a few writing theorists question the use of rubrics, most quantitatively based scholarship points to their effectiveness for portfolio assessment and calls into question the meaningfulness of single score holistic grading, whether impressionistic or rubric-based. Increasing attention is being paid to multi-trait rubrics, including, in the field of writing portfolio assessment, the use of robust writing constructs based on psychometrics alongside the more conventional cognitive traits assessed in writing studies, and rubrics that can identify areas of opportunity as well as unfairness in relation to the background of the student or the assessor. Scholars in the emergent field of empirical portfolio assessment in writing advocate the use of reliability as a means to identify fairness and validity and to create great opportunities for portfolios to advance student learning and professional development of faculty. They also note that while the writing assessment community has paid attention to the work of test measurement practitioners, the reverse has not been the case, and that conversations and collaborations between the two communities are long overdue. Methodology: We used two methods of calculating interrater agreement: absolute and adjacent percentages, and Cohen’s Unweighted Kappa, which calculates the extent to which interrater agreement is an effect of chance or expected outcome. For interrater reliability, we used the Pearson product-moment correlation coefficient. 
We used SPSS to produce all of the calculations in this study. Results: Interrater agreement and reliability rates of portfolio scores landed in the medium range of statistical significance. Combined absolute and adjacent percentages of interrater reliability were above the 90% range recommended; however, absolute agreement was below the 70% ideal. Furthermore, Cohen’s Unweighted Kappa rates were statistically significant but very low, which may be due to “kappa paradox.” Discussion: The study suggests that a formative, rubric-based approach to ePortfolio assessment that uses disciplinarily diverse raters can achieve medium-level rates of interrater agreement and reliability. It raises the question of the extent to which absolute agreement is a desirable or even relevant goal for authentic feedback processes of a complex set of documents, and in which the aim is to advance student learning. At the same time, our findings point to how agreement and reliability measures can significantly contribute to our assessment process, teacher training, and curriculum. Finally, the study highlights potential concerns about construct validity and rater training. Conclusion: This study contributes to the emergent field of empirical writing portfolio assessment that calls into question the prevailing standard of reliability built upon timed essay measurement rather than the measurement, conditions, and objectives of complex writing performances. It also contributes to recent research on multi-trait and discipline-based portfolio assessment. We point to several directions for further research: conducting “talk aloud” and recorded sessions with raters to obtain qualitative data on areas of disagreement; expanding the number of constructs assessed; increasing the range and granularity of the numeric scoring scale; and investigating traits that are receiving low interrater reliability scores. 
We also ask whether absolute agreement might be more useful for writing portfolio assessment than reliability and point to the potential “kappa paradox,” borrowed from the field of medicine, which examines interrater reliability in assessment of rare cases. Kappa paradox might be useful in assessing types of portfolios that are less frequently encountered by faculty readers. These, combined with the identification of jagged profiles and student demographics, hold considerable potential for rethinking how to work with and assess students from a range of backgrounds, preparation, and abilities. Finally, our findings contribute to a growing effort to understand the role of rater background, particularly disciplinarity, in shaping writing assessment. The goals of our assessment process are to ensure that we are measuring what we intend to measure, specifically those things that students have an equal chance at achieving and that advance student learning. Our findings suggest that interrater agreement and reliability measures, if thoughtfully approached, will contribute significantly to each of these goals.
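The agreement and reliability measures named in the Methodology (absolute and adjacent percentages, Cohen's unweighted kappa, and the Pearson product-moment correlation) can be sketched in a few lines. The study itself used SPSS, so the Python below is only an illustration. The function name and the scores are invented, and "adjacent" is assumed here to mean scores within one point of each other, exact matches included.

```python
from collections import Counter
from math import sqrt

def agreement_stats(rater_a, rater_b):
    """Exact and adjacent agreement, Cohen's unweighted kappa, and Pearson r
    for two raters who scored the same items on a numeric scale."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)
    pairs = list(zip(rater_a, rater_b))

    exact = sum(a == b for a, b in pairs) / n
    # Assumption: adjacent agreement counts scores at most one point apart.
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / n

    # Cohen's kappa: (observed - expected) / (1 - expected), where expected
    # agreement comes from each rater's marginal score distribution.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    kappa = (exact - expected) / (1 - expected) if expected < 1 else float("nan")

    # Pearson product-moment correlation.
    mean_a, mean_b = sum(rater_a) / n, sum(rater_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs)
    var_a = sum((a - mean_a) ** 2 for a in rater_a)
    var_b = sum((b - mean_b) ** 2 for b in rater_b)
    pearson = cov / sqrt(var_a * var_b) if var_a and var_b else float("nan")

    return {"exact": exact, "adjacent": adjacent, "kappa": kappa, "pearson": pearson}

# Invented scores on a 1-4 scale, for illustration only.
instructor = [1, 2, 3, 4, 2, 3, 3, 4]
outside_reader = [1, 2, 3, 3, 3, 3, 4, 4]
stats = agreement_stats(instructor, outside_reader)
```

With these invented scores, adjacent agreement is perfect while exact agreement and kappa are much lower. This shows on a small scale how the measures can diverge for the same pair of raters.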
-
Abstract
Background: While it is commonly recognized that almost every work and research discipline utilizes its own taxonomy, the language used within a specific discipline may also vary depending on numerous factors, including the desired effect of the information being communicated and the intended audience. Different audiences are reached through publication of information, including research results, in different types of publication outlets such as newspapers, newsletters, magazines, websites, and journals. Prior research has shown that students, both undergraduate and graduate, as well as faculty may have a difficult time locating information in different publication outlet types (e.g., magazines, newspapers, journals). The type of publication may affect the ease of understanding and also the confidence placed in the acquired information. A text analytics tool for classifying the source of research as a newsletter (used as a substitute for newspaper articles), a magazine, or an academic journal article has been developed to assist students, faculty, and researchers in identifying the likely source type of information and classifying their own writings with respect to these possible publication outlet types. Literature Review: Literature on information literacy is discussed as this forms the motivation for the reported research. Additionally, prior research on using text mining and text analytics is examined to better understand the methodology employed, including a review of the original Scale of Theoretical and Applied Research system, adapted for the current research. Research Questions: The primary research question is: Can a text mining and text analytics approach accurately determine the most probable publication source type with respect to being from a newsletter, magazine, or journal?
Methodology: A text mining and text analytics algorithm, STAR’ (System for Text Analytics-based Ranking), was developed from a previously researched text mining tool, STAR (Scale of Theoretical and Applied Research), that was used to classify the research type of articles between theoretical and applied research. The new text mining method, STAR’, analyzes the language used in manuscripts to determine the type of publication. This method first mines all words from corresponding publication source types to determine a keyword corpus. The corpus is then used in a text analytics process to classify full newsletters, magazine articles, and journal articles with respect to their publication source. All newsletters, magazine articles, and journal articles are from the library and information sciences (LIS) domain. Results: The STAR’ text analytics method was evaluated as a proof of concept on a specific LIS organizational newsletter, as well as articles from a single LIS magazine and a single LIS journal. STAR’ was able to classify the newsletters, magazine articles, and journal articles with 100% accuracy. Random samples from another similar LIS newsletter and a different LIS journal were also evaluated to examine the robustness of the STAR’ method in the initial proof of concept. Following the positive results of the proof of concept, additional journal, magazine, and newsletter articles were used to evaluate the generalizability of STAR’. The second-round results were very positive for differentiating journals and newsletters from other publication types, but revealed potential issues for distinguishing magazine articles from other types of publications. Discussion: STAR’ demonstrates that the language used for transferring information within a specific discipline does differ significantly depending on the intended recipients of the research knowledge. Further work is needed to examine language usage specific to magazine articles. 
Conclusions: The STAR’ method may be used by students and faculty to identify the likely source of research or discipline-specific information. This may improve trust in the reliability of information due to different levels of rigor applied to different types of publications. Additionally, the STAR’ classifications may be used by students, faculty, or researchers to determine the most appropriate type of outlet and correspondingly the most appropriate type of audience for the reported information in their own manuscripts, thereby improving the chance for successful sharing of information to appropriate audiences who will deem the information to be reliable, through publication in the most relevant outlet type.
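The two-stage process described in the Methodology (mine words from each publication source type to form a keyword corpus, then use that corpus to classify full texts) can be illustrated with a toy sketch. This is not the STAR’ algorithm, whose details the abstract does not give. It assumes a simple rule in which a keyword is any word seen only in one source type's samples, and the function names and sample sentences are invented.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_keywords(samples_by_type, top_n=20):
    """For each source type, keep up to top_n words found only in its samples.

    Assumption: exclusivity is a stand-in selection rule for illustration.
    """
    counts = {t: Counter(w for doc in docs for w in tokenize(doc))
              for t, docs in samples_by_type.items()}
    keywords = {}
    for source_type, type_counts in counts.items():
        others = set().union(*(counts[o].keys() for o in counts if o != source_type))
        exclusive = Counter({w: n for w, n in type_counts.items() if w not in others})
        keywords[source_type] = {w for w, _ in exclusive.most_common(top_n)}
    return keywords

def classify(text, keywords):
    """Score a text by keyword hits per source type and return the best match."""
    tokens = tokenize(text)
    scores = {t: sum(tok in kws for tok in tokens) for t, kws in keywords.items()}
    return max(scores, key=scores.get), scores

# Invented one-sentence samples standing in for full publications.
samples = {
    "newsletter": ["Members are invited to the annual chapter meeting and luncheon."],
    "magazine": ["Five tips every librarian can try this summer."],
    "journal": ["We report regression results and discuss the methodology and limitations."],
}
keywords = build_keywords(samples)
label, scores = classify(
    "The methodology and regression limitations are discussed.", keywords
)
```

A real system would need far larger training samples and a way to handle ties and texts with no keyword hits. The reported difficulty with magazine articles suggests that overlapping vocabulary between source types is the harder problem.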
-
Measuring the Written Language Disorder among Students with Attention Deficit Hyperactivity Disorder ↗
Abstract
Background: Attention Deficit Hyperactivity Disorder (ADHD) is a mental health disorder. People diagnosed with ADHD are often inattentive (have difficulty focusing on a task for a considerable period), overly impulsive (make rash decisions), and hyperactive (move excessively, often at inappropriate times). ADHD is often diagnosed through psychiatric assessments with additional input from physical/neurological evaluations. Written Language Disorder (WLD) is a learning disorder. People diagnosed with WLD often make multiple spelling, grammar, and punctuation mistakes, have sentences that lack cohesion and topic flow, and have trouble completing written assignments. Typically, WLD is also diagnosed through psychological educational assessments with additional input from physical/neurological evaluation. Literature Review: Previous research has shown a link between ADHD and writing difficulties. Students with ADHD have an increased likelihood of having writing difficulties, and rarely is there a presence of writing difficulties without ADHD or another mental health disorder. However, the presence of writing difficulties does not necessarily indicate the presence of a WLD. There are other physical and behavioral factors of ADHD that can contribute to a student having a WLD as well. Therefore, a statistical association between these factors (in conjunction with written performance) and WLD must first be established. Research Question: To determine the statistical association between WLD and physical and behavioral aspects of ADHD that indicate writing difficulties, this research reviewed methodologies from the literature pertaining to contemporary diagnoses of writing difficulties in ADHD students, and revealed diagnostic methods that explicitly associate the presence of WLD with these writing difficulties among students with ADHD.
The results demonstrate the association between writing difficulties and WLD as it pertains to ADHD students using an integrated computational model employed on data from a systematic review. These results will be validated in a future study that will employ the integrated computational model to measure WLD among students with ADHD. Methodology: To measure the association of WLD among students with ADHD, the authors created a novel computational model that integrates the outcomes of common screening methods for WLD (physical questionnaire, behavioral questionnaire, and written performance tasks) with common screening methods for ADHD (physical questionnaire, behavioral questionnaire, adult self-reporting scales, and reaction-based continuous performance tasks (CPTs)). The outcomes of these screening methods were fed into an artificial neural network (ANN): first, to ‘artificially learn’ about measuring the prevalence of WLD among ADHD students, and second, to adjust the prevalence value based on information from different screening methods. This can be considered as the priming of the ANN. The ANN model was then tested with data from previous studies about ADHD students who had writing difficulties. The ANN model was also tested with data from students without ADHD or WLD, to serve as a control. Results: The results show that physical, behavioral, and written performance attributes of ADHD students have a high correlation with WLD (r = 0.72 to 0.80) in comparison to control students (r = 0.30 to 0.20), substantiating the link between WLD and ADHD. It should be noted that due to lack of female participation, most studies in the literature only employed and reported on the relationship between WLD and ADHD for male participants. Discussion and Conclusion: By testing ADHD students and control students against the WLD criteria, the study shows a strong correlation between WLD and ADHD.
There are limitations to the results’ accuracy in terms of a) sample size (average n=88, mean age = 19, 8 studies used for a meta-analysis), b) analysis (original study reviewing ADHD factors first, WLD factors second), and c) causation (the study only reviews prevalence of WLD in ADHD students, not causation). A clinical trial will validate the data and address some of these limitations in a future phase of the research. A computational causal model will be introduced in the discussion portion to illustrate how causation between writing metrics and WLD as it pertains to ADHD can be achieved. These results open the door to advancing pedagogical techniques in education, where students afflicted with ADHD and/or WLD could not only receive assistance for the behavioral aspects of their disorder, but also expect assistance for the learning aspects of their disorder, empowering them to succeed in their studies.