Journal of Writing Analytics
4 articles
January 2018
-
Abstract
Background: Over a decade ago, the Stanford Study of Writing (SSW) collected more than 15,000 writing samples from undergraduate students, but to this point the corpus has not been analyzed using computational methods. Through the use of natural language processing (NLP) techniques, this study attempts to reveal underlying structures in the SSW, while at the same time developing a set of interpretable features for computationally understanding student writing. These features fall into three categories: topic-based features that reveal what students are writing about; stance-based features that reveal how students are framing their arguments; and structure-based features that reveal sentence complexity. Using these features, we are able to characterize the development of the SSW participants across four years of undergraduate study, specifically gaining insight into the different trajectories of humanities, social science, and STEM students. While the results are specific to Stanford University’s undergraduate program, they demonstrate that these three categories of features can give insight into how groups of students develop as writers.
Literature Review: The Stanford Study of Writing (Lunsford et al., 2008; SSW, 2018) involved the collection of more than 15,000 writing samples from 189 students in the Stanford class of 2005. The literature surrounding the original study is largely qualitative (Fishman, Lunsford, McGregor, & Otuteye, 2005; Lunsford, 2013; Lunsford, Fishman, & Liew, 2013), so this study makes a first attempt at a quantitative analysis of the SSW. When considering the ethics of a computational approach, we find it important not to stray into the territory of writing evaluation, as purely evaluative systems have been shown to have limited instructional use in the classroom (Chen & Cheng, 2008; Weaver, 2006). Therefore, we take a descriptive, rather than evaluative, approach.
All of the features that we extract are both interpretable and grounded in prior research. Topic modeling has been used on undergraduate writing to improve the prediction of neuroticism and depression in college students (Resnik, Garron, & Resnik, 2013), stance markers have been used to show the development of undergraduate writers (Aull & Lancaster, 2014), and parse trees have been used to measure the syntactic complexity of student writing (Lu, 2010).
Research Questions: What computational features are useful for analyzing the development of student writers? Based on these features, what insights can we gain into undergraduate writing at Stanford and similar institutions?
Methodology: To extract topic features, we use LDA topic modeling (Blei, Ng, & Jordan, 2003) with Gibbs Sampling (Griffiths, 2002). To extract stance features, we replicate the stance markers approach from a past study (Aull & Lancaster, 2014). To describe sentence structure, we use parse trees generated using Shift-Reduce dependency parsing (Sagae & Tsujii, 2008). For each parse tree, we use the tree depth and the average dependency length as heuristics for the syntactic complexity of the sentence.
Results: Topic modeling was useful for sorting papers into academic disciplines, as well as for distinguishing between argumentative and personal writing. Stance markers helped us characterize the intersection between the majors that students hold and the topics that they are writing about at a given time. Parse tree complexity demonstrated differences between writing in different disciplines. In addition, we found that students of different disciplines have different syntactic features even during their first year at Stanford.
Discussion: Topic modeling has given us a picture of interdisciplinary study at Stanford by showing how often students in the SSW wrote about topics outside their majors.
Furthermore, studying interdisciplinary Stanford students allowed us to examine the intersection of a student’s major and current topic of writing when analyzing the other two sets of features. Stance markers in the SSW show that both field of study and topic of writing influence the ways in which students employ metadiscourse. In addition, when looking at stance across years, we see that Seniors regress towards their First-Year habits. The complexity results raise the question of whether different disciplines have different “ideal” levels of writing complexity.
Conclusions: The present study yields insight into undergraduate writing at Stanford in particular. Notably, we find that students develop most as writers during their first two years and that students of different majors develop as writers in different ways. We consider our three categories of features to be useful because they were able to give us these insights into the dataset. We hope that, moving forward, educators will be able to use this kind of analysis to understand how their students are developing as writers.
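Illustrative sketch: the two structure-based heuristics named in the Methodology (tree depth and average dependency length) can be computed from a dependency parse roughly as follows. The abstract does not give exact definitions, so this sketch assumes that depth counts edges on the longest root-to-token path and that dependency length is the linear distance between a token and its head. The head-index encoding, the function names, and the example parse are hypothetical and not taken from the study.

```python
from typing import List


def tree_depth(heads: List[int]) -> int:
    """Depth of a dependency tree given 1-based head indices (0 marks the root).

    heads[i] is the head of token i + 1. Depth is the number of edges on the
    longest path from the root down to any token (assumed definition).
    """
    def steps_to_root(token: int) -> int:
        steps = 0
        while heads[token - 1] != 0:
            token = heads[token - 1]
            steps += 1
            if steps > len(heads):
                raise ValueError("head list contains a cycle")
        return steps

    return max(steps_to_root(t) for t in range(1, len(heads) + 1))


def mean_dependency_length(heads: List[int]) -> float:
    """Average linear distance |i - head(i)| over all non-root tokens."""
    lengths = [abs(i - h) for i, h in enumerate(heads, start=1) if h != 0]
    return sum(lengths) / len(lengths) if lengths else 0.0


# Hypothetical hand-annotated parse of "The student wrote a long essay":
# The -> student, student -> wrote, wrote -> ROOT,
# a -> essay, long -> essay, essay -> wrote
example_heads = [2, 3, 0, 6, 6, 3]
print(tree_depth(example_heads))              # 2
print(mean_dependency_length(example_heads))  # 1.6
```

In practice the head indices would come from a parser's output rather than hand annotation; the arithmetic stays the same.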
January 2017
-
Applying Natural Language Processing Tools to a Student Academic Writing Corpus: How Large are Disciplinary Differences Across Science and Engineering Fields?
Abstract
• Background: Researchers have been working towards better understanding differences in professional disciplinary writing (e.g., Ewer & Latorre, 1969; Hu & Cao, 2015; Hyland, 2002; Hyland & Tse, 2007) for decades. Recently, research has taken important steps towards understanding disciplinary variation in student writing. Much of this research is corpus-based and focuses on lexico-grammatical features in student writing as captured in the British Academic Written English (BAWE) corpus and the Michigan Corpus of Upper-level Student Papers (MICUSP). The present study extends this work by analyzing lexical and cohesion differences among disciplines in MICUSP. Critically, we analyze not only linguistic differences in macro-disciplines (science and engineering), but also in micro-disciplines within these macro-disciplines (biology, physics, industrial engineering, and mechanical engineering).
• Literature Review: Hardy and Römer (2013) used a multidimensional analysis to investigate linguistic differences across four macro-disciplines represented in MICUSP. Durrant (2014, in press) analyzed vocabulary in texts produced by student writers in the BAWE corpus by discipline and level (year) and disciplinary differences in lexical bundles. Ward (2007) examined lexical differences within micro-disciplines of a single discipline.
• Research Questions: The research questions that guide this study are as follows:
1. Are there significant lexical and cohesive differences between science and engineering student writing?
2. Are there significant lexical and cohesive differences between micro-disciplines within science and engineering student writing?
• Research Methodology: To address the research questions, student-produced science and engineering texts from MICUSP were analyzed with regard to lexical sophistication and textual features of cohesion.
Specifically, 22 indices of lexical sophistication calculated by the Tool for the Automatic Analysis of Lexical Sophistication (TAALES; Kyle & Crossley, 2015) and 38 cohesion indices calculated by the Tool for the Automatic Analysis of Cohesion (TAACO; Crossley, Kyle, & McNamara, 2016) were used. These features were then compared both across science and engineering texts (addressing Research Question 1) and across micro-disciplines within science and engineering (biology and physics, industrial and mechanical engineering) using discriminant function analyses (DFA).
• Results: The DFAs revealed significant linguistic differences, not only between student writing in the two macro-disciplines but also between the micro-disciplines. Differences in classification accuracy based on students’ years of study hovered at about 10%. An analysis of accuracies of classification by paper type found they were similar for larger and smaller sample sizes, providing some indication that paper type was not a confounding variable in classification accuracy.
• Discussion: The findings provide strong evidence that macro-disciplinary and micro-disciplinary differences exist in student writing in these MICUSP samples and that these differences are likely not related to student level or paper type. These findings have important implications for understanding disciplinary differences. First, they confirm previous research that found the vocabulary used by different macro-disciplines to be “strikingly diverse” (Durrant, 2015), but they also show a remarkable diversity of cohesion features. The findings suggest that the common understanding of the STEM disciplines as “close” bears reconsideration in linguistic terms. Second, the lexical and cohesion differences between micro-disciplines are large enough and consistent enough to suggest that each micro-discipline can be thought of as containing a unique linguistic profile of features.
Third, the differences discerned in the NLP analysis are evident at least as early as the final year of undergraduate study, suggesting that students at this level already have a solid understanding of the conventions of the disciplines of which they are aspiring to be members. Moreover, the differences are relatively homogeneous across levels, which confirms findings by Durrant (2015) but, importantly, extends these findings to include cohesion markers.
• Conclusions: The findings from this study provide evidence that macro-disciplinary and micro-disciplinary differences at the linguistic level exist in student writing, not only in lexical use but also in text cohesion. A number of pedagogical applications of writing analytics are proposed based on the reported findings from TAALES and TAACO. Further studies using different corpora (e.g., BAWE) or purpose-assembled corpora are suggested to address limitations in the size and range of text types found within MICUSP. This study also points the way toward studies of disciplinary differences using NLP approaches that capture data beyond the lexical and cohesive features of text, including the use of part-of-speech tags, syntactic parsing, indices related to syntactic complexity and similarity, rhetorical features, or more advanced cohesion metrics (latent semantic analysis, latent Dirichlet allocation, Word2Vec approaches).
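Illustrative sketch: to give a sense of what a single cohesion index measures, the snippet below computes a simplified adjacent-sentence word overlap. It is only a toy example of the general idea and is not TAACO's implementation. The sentence splitter, the tokenizer, the function names, and the sample text are all assumptions made for illustration.

```python
import re
from typing import List


def split_sentences(text: str) -> List[List[str]]:
    """Naively split on ., ! and ?, then tokenize into lowercase words."""
    parts = [p for p in re.split(r"[.!?]+", text) if p.strip()]
    return [re.findall(r"[a-z']+", p.lower()) for p in parts]


def adjacent_overlap(text: str) -> float:
    """Mean share of word types in each sentence that also appear in the
    sentence immediately before it (0.0 when there are fewer than two
    sentences)."""
    sents = [set(tokens) for tokens in split_sentences(text)]
    ratios = [
        len(cur & prev) / len(cur)
        for prev, cur in zip(sents, sents[1:])
        if cur  # skip sentences with no word tokens
    ]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Hypothetical sample: the second sentence repeats the first one's words,
# the third shares none with the second.
sample = "The beam bends under load. The load bends the beam. Cells divide."
print(adjacent_overlap(sample))  # 0.5
```

An index like this yields one number per text; a set of such numbers per text is the kind of feature vector that a discriminant function analysis would then use to separate disciplines.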
-
Abstract
Background: In recent years, the development of automated essay scoring (AES) systems has shifted in focus from merely assigning a holistic score to an essay to providing constructive feedback on it. Despite all the major advances in the domain, many objections persist concerning their credibility and readiness to replace human scoring in high-stakes writing assessments. The purpose of this study is to shed light on how to build a relatively simple AES system based on five baseline writing features. The study shows that the proposed AES system compares very well with other state-of-the-art systems despite its obvious limitations.
Literature Review: In 2012, ASAP (Automated Student Assessment Prize) launched a demonstration to benchmark the performance of state-of-the-art AES systems using eight hand-graded essay datasets originating from state writing assessments. These datasets are still used today to measure the accuracy of new AES systems. Recently, Zupanc and Bosnic (2017) developed and evaluated another state-of-the-art AES system, called SAGE, which included new semantic and consistency features and provided automatic semantic feedback for the first time. SAGE’s agreement level between machine and human scores for ASAP dataset #8 (the dataset also of interest in this study) was measured and had a quadratic weighted kappa of 0.81, while it ranged for 10 other state-of-the-art systems between 0.60 and 0.73 (Chen et al., 2012; Shermis, 2014). Finally, this section discusses the limitations of AES, which come mainly from its failure to assess higher-order thinking skills that all writing constructs are ultimately designed to assess.
Research Questions: The research questions that guide this study are as follows:
RQ1: What is the power of the writing analytics tool’s five-variable model (spelling accuracy, grammatical accuracy, semantic similarity, connectivity, lexical diversity) to predict the holistic scores of Grade 10 narrative essays (ASAP dataset #8)?
RQ2: What is the agreement level between the computer rater based on the regression model obtained in RQ1 and the human raters who scored the 723 narrative essays written by Grade 10 students (ASAP dataset #8)?
Methodology: ASAP dataset #8 was used to train the predictive model of the writing analytics tool introduced in this study. Each essay was graded by two teachers. In case of disagreement between the two raters, the scoring was resolved by a third rater. Essay scores were the weighted sums of four rubric scores. A multiple linear regression analysis was conducted to determine the extent to which a five-variable model (selected from a set of 86 writing features) was effective in predicting essay scores.
Results: The regression model in this study accounted for 57% of the essay score variability. The correlation (Pearson), the percentage of perfect matches, the percentage of adjacent matches (±2), and the quadratic weighted kappa between the resolved scores and predicted essay scores were 0.76, 10%, 49%, and 0.73, respectively. The results were measured on an integer scale of resolved essay scores from 10 to 60.
Discussion: When measuring the accuracy of an AES system, it is important to take into account several metrics to better understand how predicted essay scores are distributed along the distribution of human scores. Using average ranking over correlation, exact/adjacent agreement, quadratic weighted kappa, and distributional characteristics such as standard deviation and mean, this study’s regression model ranks 4th out of 10 AES systems.
Despite its relatively good rank, the predictions of the proposed AES system remain imprecise and do not even appear optimal for identifying poor-quality essays (a binary condition: scores smaller than or equal to a 65% threshold), with 71% precision and 92% recall.
Conclusions: This study sheds light on the implementation process and the evaluation of a new simple AES system comparable to the state of the art and reveals that the generally obscure state-of-the-art AES system is most likely concerned only with shallow assessment of text production features. Consequently, the authors advocate greater transparency in the development and publication of AES systems. In addition, the relationship between the explanation of essay score variability and the inter-rater agreement level should be further investigated to better represent the changes in terms of level of agreement when a new variable is added to a regression model. This study should also be replicated at a larger scale in several different writing settings for more robust results.
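Illustrative sketch: the agreement metrics reported in the Results (exact matches, adjacent matches, quadratic weighted kappa) can be computed along the following lines. This uses the standard definition of quadratic weighted kappa. It assumes that "adjacent" means a difference within a tolerance and that exact matches count as adjacent, which the abstract does not specify. The function names and the toy scores are hypothetical and are not data from the study.

```python
from collections import Counter
from typing import Sequence, Tuple


def quadratic_weighted_kappa(
    human: Sequence[int], machine: Sequence[int], lo: int, hi: int
) -> float:
    """Quadratic weighted kappa on an integer scale lo..hi (requires hi > lo)."""
    span_sq = (hi - lo) ** 2
    total = len(human)
    observed = Counter(zip(human, machine))  # joint counts of score pairs
    hist_h, hist_m = Counter(human), Counter(machine)
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(lo, hi + 1):
        for j in range(lo, hi + 1):
            weight = (i - j) ** 2 / span_sq
            weighted_observed += weight * observed[(i, j)]
            # Expected count if the two raters scored independently.
            weighted_expected += weight * hist_h[i] * hist_m[j] / total
    if weighted_expected == 0:
        return 1.0  # both raters gave the same single score throughout
    return 1.0 - weighted_observed / weighted_expected


def agreement_rates(
    human: Sequence[int], machine: Sequence[int], tolerance: int
) -> Tuple[float, float]:
    """Return (exact, adjacent) agreement as fractions of all essays.

    Adjacent counts pairs within +/- tolerance and, by assumption here,
    includes exact matches.
    """
    pairs = list(zip(human, machine))
    exact = sum(h == m for h, m in pairs) / len(pairs)
    adjacent = sum(abs(h - m) <= tolerance for h, m in pairs) / len(pairs)
    return exact, adjacent


# Toy scores on a 1..4 scale (not taken from ASAP dataset #8).
human_scores = [1, 2, 3, 4]
machine_scores = [1, 2, 4, 4]
print(round(quadratic_weighted_kappa(human_scores, machine_scores, 1, 4), 3))  # 0.917
print(agreement_rates(human_scores, machine_scores, tolerance=1))  # (0.75, 1.0)
```

Because the quadratic weights penalize large disagreements much more than small ones, a system can show low exact agreement and still reach a fairly high kappa on a wide scale such as 10 to 60.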
-
Measuring the Written Language Disorder among Students with Attention Deficit Hyperactivity Disorder
Abstract
Background: Attention Deficit Hyperactivity Disorder (ADHD) is a mental health disorder. People diagnosed with ADHD are often inattentive (have difficulty focusing on a task for a considerable period), overly impulsive (make rash decisions), and hyperactive (move excessively, often at inappropriate times). ADHD is often diagnosed through psychiatric assessments with additional input from physical/neurological evaluations. Written Language Disorder (WLD) is a learning disorder. People diagnosed with WLD often make multiple spelling, grammar, and punctuation mistakes, have sentences that lack cohesion and topic flow, and have trouble completing written assignments. Typically, WLD is also diagnosed through psychological educational assessments with additional input from physical/neurological evaluation.
Literature Review: Previous research has shown a link between ADHD and writing difficulties. Students with ADHD have an increased likelihood of having writing difficulties, and writing difficulties rarely occur without ADHD or another mental health disorder. However, the presence of writing difficulties does not necessarily indicate the presence of a WLD. There are other physical and behavioral factors of ADHD that can contribute to a student having a WLD as well. Therefore, a statistical association between these factors (in conjunction with written performance) and WLD must first be established.
Research Question: To determine the statistical association between WLD and physical and behavioral aspects of ADHD that indicate writing difficulties, this research reviewed methodologies from the literature pertaining to contemporary diagnoses of writing difficulties in ADHD students, and revealed diagnostic methods that explicitly associate the presence of WLD with these writing difficulties among students with ADHD.
The results demonstrate the association between writing difficulties and WLD as it pertains to ADHD students using an integrated computational model employed on data from a systematic review. These results will be validated in a future study that will employ the integrated computational model to measure WLD among students with ADHD.
Methodology: To measure the association of WLD among students with ADHD, the authors created a novel computational model that integrates the outcomes of common screening methods for WLD (physical questionnaire, behavioral questionnaire, and written performance tasks) with common screening methods for ADHD (physical questionnaire, behavioral questionnaire, adult self-reporting scales, and reaction-based continuous performance tasks (CPTs)). The outcomes of these screening methods were fed into an artificial neural network (ANN): first, to ‘artificially learn’ about measuring the prevalence of WLD among ADHD students, and second, to adjust the prevalence value based on information from different screening methods. This can be considered as the priming of the ANN. The ANN model was then tested with data from previous studies about ADHD students who had writing difficulties. The ANN model was also tested with data from students without ADHD or WLD, to serve as a control.
Results: The results show that physical, behavioral, and written performance attributes of ADHD students have a high correlation with WLD (r = 0.72 to 0.80) in comparison to control students (r = 0.30 to 0.20), substantiating the link between WLD and ADHD. It should be noted that due to lack of female participation, most studies in the literature only employed and reported on the relationship between WLD and ADHD for male participants.
Discussion and Conclusion: By testing ADHD students and control students against the WLD criteria, the study shows a strong correlation between WLD and ADHD.
There are limitations to the results’ accuracy in terms of a) sample size (average n = 88, mean age = 19, 8 studies used for a meta-analysis), b) analysis (original study reviewing ADHD factors first, WLD factors second), and c) causation (the study only reviews prevalence of WLD in ADHD students, not causation). A clinical trial will validate the data and address some of these limitations in a future phase of the research. A computational causal model will be introduced in the discussion portion to illustrate how causation between writing metrics and WLD as it pertains to ADHD can be established. These results open the door to advancing pedagogical techniques in education, where students afflicted with ADHD and/or WLD could not only receive assistance for the behavioral aspects of their disorder, but also expect assistance for the learning aspects of their disorder, empowering them to succeed in their studies.