Journal of Writing Analytics

17 articles
Year: Topic: Clear
Export:
assessment ×

January 2026

  1. A Tribute to Robert J. Mislevy Part 1: Mapping the Skills of Tomorrow--Principled Assessment of Literacy and Numeracy Skills Embedded in U.S. Workplace Contexts
    doi:10.37514/jwa-j.2026.8.1.09

January 2022

  1. Book Review of Kelly-Riley, D., and Elliot, N. (Eds.). (2021). Improving outcomes: Disciplinary writing, local assessment, and the aim of fairness. MLA Press.
    doi:10.37514/jwa-j.2022.6.1.09
  2. Across Performance Contexts: Using Automated Writing Evaluation to Explore Student Writing
    doi:10.37514/jwa-j.2022.6.1.07

January 2021

  1. Communicating Assessment Information in the Context of a Workplace Formative Task
    doi:10.37514/jwa-j.2021.5.1.10
  2. Computer-Assisted Rhetorical Analysis: Instructional Design and Formative Assessment Using DocuScope
    doi:10.37514/jwa-j.2021.5.1.09

January 2020

  1. Writing Analytics and Program Assessment: How Novice Writers Use Rubric Terminology in Reflective Essays
    doi:10.37514/jwa-j.2020.4.1.05

January 2019

  1. Considering Consequences in Writing Analytics: Humanistic Inquiry and Empirical Research in The Journal of Writing Assessment
    doi:10.37514/jwa-j.2019.3.1.16
  2. When Scores Do Not Increase: Notes on Quantitative Approaches to Writing Assessment
    doi:10.37514/jwa-j.2019.3.1.12

January 2018

  1. Writing MentorTM: Writing Progress Using Self-Regulated Writing Support
    Abstract

    The Writing Mentor TM (WM) application is a Google Docs add-on designed to help students improve their writing in a principled manner and to promote their writing success in postsecondary settings. WM provides automated writing evaluation (AWE) feedback using natural language processing (NLP) methods and linguistic resources. AWE features in WM have been informed by research about postsecondary student writers often classified as developmental (Burstein et al., 2016b), and these features address a breadth of writing sub-constructs (including use of sources, claims, and evidence; topic development; coherence; and knowledge of English conventions). Through an optional entry survey, WM collects self-efficacy data about writing and English language status from users. Tool perceptions are collected from users through an optional exit survey. Informed by language arts models consistent with the Common Core State Standards Initiative and valued by the writing studies community, WM takes initial steps to integrate the reading and writing process by offering a range of textual features, including vocabulary support, intended to help users to understand unfamiliar vocabulary in coursework reading texts. This paper describes WM and provides discussion of descriptive evaluations from an Amazon Mechanical Turk (AMT) usability task situated in WM and from users-in-the-wild data. The paper concludes with a framework for developing writing feedback and analytics technology.

    doi:10.37514/jwa-j.2018.2.1.12
  2. Placing Writing Tasks in Local and Global Contexts: The Case of Argumentative Writing
    Abstract

    Background: Current research in composition and writing studies is concerned with issues of writing program evaluation and how writing tasks and their sequences scaffold students toward learning outcomes. These issues are beginning to be addressed by writing analytics research, which can be useful for identifying recurring types of language in writing assignments and how those can inform task design and student outcomes. To address these issues, this study provides a three-step method of sequencing, comparison, and diagnosis to understand how specific writing tasks fit into a classroom sequence as well as compare to larger genres of writing outside of the immediate writing classroom environment. By doing so, we provide writing program administrators with tools for describing what skills students demonstrate in a sequence of writing tasks and diagnosing how these skills match with writing students will do in later contexts. Literature Review: Student writing that responds to classroom assignments can be understood as genres, insofar as they are constructed responses that exist in similar rhetorical situations and perform similar social actions. Previous work in corpus analysis has looked at these genres, which helps us as writing instructors understand what kind of constructed responses are required of students and to make those expectations explicit. Aull (2017) examined a corpus of first-year undergraduate writing assignments in two courses to create “sociocognitive profiles” of these assignments. We analyze student writing that responds to similar writing tasks, but use a different corpus method that allows us to understand the tasks in both local and global contexts. By doing so, we gain confidence and depth in our understanding of these tasks, analyze how they sequence together, and are able to compare argumentative writing across institutions and contexts. Research Questions: Two questions guided our study: What is the trajectory of skills targeted by the sequence of tasks in the two first-year writing courses, as evidenced by the rhetorical strategies employed by the writers in successive assignments? Focusing on the final argument assignments, how similar are they to argumentative writing in other contexts, in terms of rhetorical profiles? Methodology: We first conducted a local analysis, in which we used a dictionary-based corpus method to analyze the rhetorical strategies used by writers in the first-year writing courses to understand how they built on each other to form a sequence. Having understood what skills students are demonstrating in a course, we then conducted a global analysis which calculated a “distance” between the first-year argument writing and a corpus of argument writing drawn from other contexts. Recognizing that there was a non-trivial distance, we then identified and evaluated the sources of the distance so that the writing tasks could be assessed or modified. Results: The local analysis revealed eight key rhetorical strategies that student writing exhibits between the two first-year writing courses. With this understanding, we then placed the argument writing in global contexts to find that the assignments in both courses differ somewhat from argument writing in other contexts. Upon analyzing this difference, we found that the first-year writing primarily differs in its usage of academic language, the personal register, assertive language, and reasoning. We suggest that these differences stem primarily from the rhetorical situation and learning objectives associated with first-year writing, as well as the sequencing of the courses. Discussion: The three-step method presented provides a means for writing program administrators to describe and analyze writing that students produce in their writing programs. We intend these steps to be understood as an iterative process, whereby writing programs can use these results to evaluate what rhetorical skills their students are exhibiting and to benchmark those against the program’s goals and/or other similar writing programs. Conclusions: By presenting these analyses together, we ultimately provide a cohesive method by which to analyze a writing program and benchmark students’ use of rhetorical strategies in relation to other argumentative contexts. We believe this method to be useful not only to individual writing programs, but to assessment literature broadly. In future research, we anticipate learning how this process will practically feed back into pedagogy, as well as understanding what placing writing tasks into a global context can tell us about genre theory.

    doi:10.37514/jwa-j.2018.2.1.03
  3. Structural Features of Undergraduate Writing: A Computational Approach
    Abstract

    Background: Over a decade ago, the Stanford Study of Writing (SSW) collected more than 15,000 writing samples from undergraduate students, but to this point the corpus has not been analyzed using computational methods. Through the use of natural language processing (NLP) techniques, this study attempts to reveal underlying structures in the SSW, while at the same time developing a set of interpretable features for computationally understanding student writing. These features fall into three categories: topic-based features that reveal what students are writing about; stance-based features that reveal how students are framing their arguments; and structure-based features that reveal sentence complexity. Using these features, we are able to characterize the development of the SSW participants across four years of undergraduate study, specifically gaining insight into the different trajectories of humanities, social science, and STEM students. While the results are specific to Stanford University’s undergraduate program, they demonstrate that these three categories of features can give insight into how groups of students develop as writers.Literature Review: The Stanford Study of Writing (Lunsford et al., 2008; SSW, 2018) involved the collection of more than 15,000 writing samples from 189 students in the Stanford class of 2005. The literature surrounding the original study is largely qualitative (Fishman, Lunsford, McGregor, & Otuteye, 2005; Lunsford, 2013; Lunsford, Fishman, & Liew, 2013), so this study makes a first attempt at a quantitative analysis of the SSW. When considering the ethics of a computational approach, we find it important not to stray into the territory of writing evaluation, as purely evaluative systems have been shown to have limited instructional use in the classroom (Chen & Cheng, 2008; Weaver, 2006). Therefore, we find it important to take a descriptive, rather than evaluative approach. All of the features that we extract are both interpretable and grounded in prior research. Topic modeling has been used on undergraduate writing to improve the prediction of neuroticism and depression in college students (Resnik, Garron, & Resnik, 2013), stance markers have been used to show the development of undergraduate writers (Aull & Lancaster, 2014), and parse trees have been used to measure the syntactic complexity of student writing (Lu, 2010).Research Questions: What computational features are useful for analyzing the development of student writers? Based on these features, what insights can we gain into undergraduate writing at Stanford and similar institutions?Methodology: To extract topic features, we use LDA topic modeling (Blei, Ng, & Jordan, 2003) with Gibbs Sampling (Griffiths, 2002). To extract stance features, we replicate the stance markers approach from a past study (Aull & Lancaster, 2014). To describe sentence structure, we use parse trees generated using  Shift-Reduce dependency parsing (Sagae & Tsujii, 2008). For each parse tree, we use the tree depth and the average dependency length as heuristics for the syntactic complexity of the sentence.Results: Topic modeling was useful for sorting papers into academic disciplines, as well as for distinguishing between argumentative and personal writing. Stance markers helped us characterize the intersection between the majors that students hold and the topics that they are writing about at a given time. Parse tree complexity demonstrated differences between writing in different disciplines. In addition, we found that students of different disciplines have different syntactic features even during their first year at Stanford.Discussion: Topic modeling has given us a picture of interdisciplinary study at Stanford by showing how often students in the SSW wrote about topics outside their majors. Furthermore, studying interdisciplinary Stanford students allowed us to examine the intersection of a student’s major and current topic of writing when analyzing the other two sets of features. Stance markers in the SSW show that both field of study and topic of writing influence the ways in which students employ metadiscourse. In addition, when looking at stance across years, we see that Seniors regress towards their First-Year habits. The complexity results raise the question of whether different disciplines have different “ideal” levels of writing complexity.Conclusions: The present study yields insight into undergraduate writing at Stanford in particular. Notably, we find that students develop most as writers during their first two years and that students of different majors develop as writers in different ways. We consider our three categories of features to be useful because they were able to give us these insights into the dataset. We hope that, moving forward, educators will be able to use this kind of analysis to understand how their students are developing as writers.

    doi:10.37514/jwa-j.2018.2.1.06
  4. Is this Too Polite? The Limited Use of Rhetorical Moves in a First-Year Corpus
    Abstract

    Background: The researchers conducted a corpus analysis of 548 research-based argument essays, totalling 1,465,091 words, written by first-year students at The City College of New York (CCNY). The purpose of this study was to better understand the ways in which CCNY students were constructing arguments in research essays in order to better support our instruction of the research essay. Curricular guidelines for the research assignment are general. Instructors are directed to require a research-based, persuasive argument that includes conflicting points of view. Model assignment sheets are provided to instructors, but they are free to write their own. Assignment sheets are not collected or approved. In the fall semester in which this corpus was collected, over 70 part-time instructors taught approximately 120 sections of the first- or second-semester composition course.Literature Review: The study of The City College of New York Corpus (CCNYC) partially replicates and relies on the analysis of three corpora of academic writing conducted by Zak Lancaster (2016a) in his examination of Gerald Graff’s and Cathy Birkenstein’s textbook They Say/I Say: The Moves that Matter in Academic Writing (2014). The current study also compares the CCNYC findings to studies of stance and voice markers frequency conducted by Ken Hyland (2012) and Ellen Barton (1993) and suggests the classroom use of corpus analysis as described by Raith Abid and Shakila Manan (2015), and Maggie Charles (2007).Research Questions: The study was guided by a narrowly-focused interest in learning whether or not the CCNYC would demonstrate the range and distribution of rhetorical moves that Lancaster found in his study of academic writing (2016a). The analysis of the corpus consists of frequency counts; we did not conduct other statistical analyses. Since we had little prior experience with corpus analysis, we wondered what would be revealed about students’ writing practices by a partial replication of Lancaster’s study. We did not reproduce Lancaster’s analysis but relied on his publised results. This study served as an assessment tool, providing a microscopic view of a limited number of rhetorical moves across a large corpus of student essays. As a result of our study, we hoped to be able to create assignments for research essays that responded directly to the patterns that we saw in our students’ essays.Methodology: Modeled on Lancaster’s study and the templates of rhetorical moves offered by Graff and Birkenstein, concordances of terms used to introduce objections, offer concessions, and make counterarguments were drawn from the CCNYC and then analyzed to confirm that the rhetorical form was in fact functioning as one of the above rhetorical moves within the context of the essay in which it was found.Results: Our study demonstrates that CCNY students use fewer linguistic resources than their peers at other institutions, a finding that helps shape faculty development seminars. The corpus analysis reveals that while CCNY students introduce objections to their arguments at about the same rates as in other corpora, they are less likely to concede to those objections. In addition, when students made counterarguments, they used only a limited range of the linguistic resources available to them.Conclusions: The low rate of engagement with opposing points of view and the limited use of linguistic resources for counterarguments all suggest the potential value of focused, corpus-based instruction.

    doi:10.37514/jwa-j.2018.2.1.04
  5. De-Identification of Laboratory Reports in STEM
    Abstract

    Background: Employing natural language processing and latent semantic analysis, the current work was completed as a constituent part of a larger research project for designing and launching artificial intelligence in the form of deep artificial neural networks. The models were evaluated on a proprietary corpus retrieved from a data warehouse, where it was extracted from MyReviewers, a sophisticated web application purposed for peer review in written communication, which was actively used in several higher education institutions. The corpus of laboratory reports in STEM annotated by instructors and students was used to train the models. Under the Common Rule, research ethics were ensured by protecting the privacy of subjects and maintaining the confidentiality of data, which mandated corpus de-identification.Literature Review: De-identification and pseudonymization of textual data remains an actively studied research question for several decades. Its importance is stipulated by numerous laws and regulations in the United States and internationally with HIPAA Privacy Rule and FERPA.Research Question: Text de-identification requires a significant amount of manual post-processing for eliminating faculty and student names.  This work investigated automated and semi-automated methods for de-identifying student and faculty entities while preserving author names in cited sources and reference lists. It was hypothesized that a natural language processing toolkit and an artificial neural network model with named entity recognition capabilities would facilitate text processing and reduce the amount of manual labor required for post-processing after matching essays to a list of users’ names. The suggested techniques were applied with supplied pre-trained models without additional tagging and training. The goal of the study was to evaluate three approaches and find the most efficient one among those using a users’ list, a named entity recognition toolkit, and an artificial neural network.Research Methodology: The current work studied de-identification of STEM laboratory reports and evaluated the performance of the three techniques: brute forth search with a user lists, named entity recognition with the OpenNLP machine learning toolkit, and NeuroNER, an artificial neural network for named entity recognition built on the TensorFlow platform. The complexity of the given task was determined by the dilemma, where names belonging to students, instructors, or teaching assistants must be removed, while the rest of the names (e.g., authors of referenced papers) must be preserved.Results: The evaluation of the three selected methods demonstrated that automating de-identification of STEM lab reports is not possible in the setting, when named entity recognition methods are employed with pre-trained models. The highest results were achieved by the users’ list technique with 0.79 precision, 0.75 recall, and 0.77 F1 measure, which significantly outweighed OpenNLP with 0.06 precision, 0.14 recall, and 0.09 F1, and NeuroNER with 0.14 precision, 0.56 recall, and 0.23 F1.Discussion: Low performance of OpenNLP and NeuroNER toolkits was explained by the complexity of the task and unattainability of customized models due to imposed time constraints. An approach for masking possible de-identification errors is suggested.Conclusion: Unlike multiple cases described in the related work, de-identification of laboratory reports in STEM remained a non-trivial labor-intensive task. Applied out of the box, a machine learning toolkit and an artificial neural network technique did not enhance performance of the brute forth approach based on user list matching.Directions for Future Research: Customized tagging and training on the STEM corpus were presumed to advance outcomes of machine learning and predominantly artificial intelligence methods. Application of other natural language toolkits may lead to deducing a more effective solution.

    doi:10.37514/jwa-j.2018.2.1.07

January 2017

  1. Discovering the Predictive Power of Five Baseline Writing Competences
    Abstract

    Background: A shift of focus has been marked in recent years in the development of automated essay scoring systems (AES) passing from merely assigning a holistic score to an essay to providing constructive feedback over it. Despite all the major advances in the domain, many objections persist concerning their credibility and readiness to replace human scoring in high-stakes writing assessments. The purpose of this study is to shed light on how to build a relatively simple AES system based on five baseline writing features. The study shows that the proposed AES system compares very well with other state-of-the-art systems despite its obvious limitations. Literature Review: In 2012, ASAP (Automated Student Assessment Prize) launched a demonstration to benchmark the performance of state-of-the-art AES systems using eight hand-graded essay datasets originating from state writing assessments. These datasets are still used today to measure the accuracy of new AES systems. Recently, Zupanc and Bosnic (2017) developed and evaluated another state-of-the-art AES system, called SAGE, which enclosed new semantic and consistency features and provided for the first time an automatic semantic feedback. SAGE’s agreement level between machine and human scores for ASAP dataset #8 (the dataset also of interest in this study) was measured and had a quadratic weighted kappa of 0.81, while it ranged for 10 other state-of-the-art systems between 0.60 and 0.73 (Chen et al., 2012; Shermis, 2014). Finally, this section discusses the limitations of AES, which come mainly from its omission to assess higher-order thinking skills that all writing constructs are ultimately designed to assess. Research Questions: The research questions that guide this study are as follows: RQ1: What is the power of the writing analytics tool’s five-variable model (spelling accuracy, grammatical accuracy, semantic similarity, connectivity, lexical diversity) to predict the holistic scores of Grade 10 narrative essays (ASAP dataset #8)? RQ2: What is the agreement level between the computer rater based on the regression model obtained in RQ1 and the human raters who scored the 723 narrative essays written by Grade 10 students (ASAP dataset #8)? Methodology: ASAP dataset #8 was used to train the predictive model of the writing analytics tool introduced in this study. Each essay was graded by two teachers. In case of disagreement between the two raters, the scoring was resolved by a third rater. Basically, essay scores were the weighted sums of four rubric scores. A multiple linear regression analysis was conducted to determine the extent to which a five-variable model (selected from a set of 86 writing features) was effective to predict essay scores. Results: The regression model in this study accounted for 57% of the essay score variability. The correlation (Pearson), the percentage of perfect matches, the percentage of adjacent matches (±2), and the quadratic weighted kappa between the resolved scores and predicted essay scores were 0.76, 10%, 49%, and 0.73, respectively. The results were measured on an integer scale of resolved essay scores between 10-60. Discussion: When measuring the accuracy of an AES system, it is important to take into account several metrics to better understand how predicted essay scores are distributed along the distribution of human scores. Using average ranking over correlation, exact/adjacent agreement, quadratic weighted kappa, and distributional characteristics such as standard deviation and mean, this study’s regression model ranks 4th out of 10 AES systems. Despite its relatively good rank, the predictions of the proposed AES system remain imprecise and do not even look optimal to identify poor-quality essays (binary condition) smaller than or equal to a 65% threshold (71% precision and 92% recall). Conclusions: This study sheds light on the implementation process and the evaluation of a new simple AES system comparable to the state of the art and reveals that the generally obscure state-of-the-art AES system is most likely concerned only with shallow assessment of text production features. Consequently, the authors advocate greater transparency in the development and publication of AES systems. In addition, the relationship between the explanation of essay score variability and the inter-rater agreement level should be further investigated to better represent the changes in terms of level of agreement when a new variable is added to a regression model. This study should also be replicated at a larger scale in several different writing settings for more robust results.

    doi:10.37514/jwa-j.2017.1.1.08
  2. (Re)Visualizing Rater Agreement: Beyond Single-Parameter Measures
    Abstract

    Technique Identification: A new graphical technique is presented for visualizing and assessing inter-rater agreement in discrete ordinal or categorical data, such as rubric ratings.  To that aim, a chance-corrected Kappa with two new features is derived. First, it is based on interpreting ratings for each subject as vectors to visualize the data. This is done by creating two-dimensional vectors from a subject-rating summary table, sorting the vectors by their slopes, and plotting them in that order to create a trajectory that displays all the data in context. Second, it presents a graph and accompanying statistics (Kappa, p -value) for each pair of ratings in an organized display so that all useful comparisons of the data are visually displayed and statistically assessed. This information is presented on a logical grid, usually called facets . Kappa is calculated in the usual way, by referencing the actual results with an average of random rating assignments. This average becomes a reference line on each graph as a visual cue, as well. The statistical basis for the Kappa and significance testing are derived, and the test assumptions are specified. Value Contribution: The most commonly used statistics for inter-rater agreement, such as the Cohen Kappa or Inter-Class Correlation, give only a single parameter estimate of reliability from which to make judgments about ratings data. The technique presented here constructs graphs of all the data that allow visual inspection of the ratings versus a reference curve that represents chance-matching. The detailed reports on inter-rater agreement can show how to fine-tune ratings systems, such as understanding which parts of an ordinal scale are working best. This solves a practical problem for researchers who rely on rating-type classification by revealing which overall aspects of the rating system need to be improved and adds to the list of tools available for assessing rating reliability. In creating this approach to analysis of rater data, human usability is emphasized. Specifically, the use of geometry is designed to facilitate interpretability rather than being a mathematical derivation from first principles. Technique Application: Two applications are given, both involving social meaning-making. The first uses data from wine-judging to illustrate how the method can illuminate expertise in that domain. The results reproduce published findings that were based on a classical statistical method. A second sample application uses data from a university assessment of student writing in which ratings on a developmental scale are assigned by course instructors to their students. The rating program is an example of social meaning-making that can be used to generate larger data sets than are typical for classroom-based assessment programs. The analysis shows the strengths and weaknesses of the rating system in terms of reliability and demonstrates how that knowledge leads to improvements in assessment. Directions for Further Research: An argument is made for a public library of inter-rater data for empirical use by researchers. The social aspects of rating are discussed, and there is an illustration of the potential to derive new measures of inter-rater agreement from the meaning-making program that produces the data.

    doi:10.37514/jwa-j.2017.1.1.10
  3. Assessing Writing Constructs: Toward an Expanded View of Inter-Reader Reliability
    Abstract

    Background: This study focuses on construct representation and inter-reader agreement and reliability in ePortfolio assessment of 1,315 writing portfolios. These portfolios were submitted by undergraduates enrolled in required writing seminars at the University of Pennsylvania (Penn) in the fall of 2014.  Penn is an Ivy League university with a diverse student population, half of whom identify as students of color. Over half of Penn’s students are women, 12% are international, and 12% are first-generation college students. The students’ portfolios are scored by the instructor and an outside reader drawn from a writing-in-the-disciplines faculty who represent 24 disciplines. The portfolios are the product of a shared curriculum that uses formative assessment and a program-wide multiple-trait rubric. The study contributes to scholarship on the inter-reader reliability and validity of multiple-trait portfolio assessments as well as to recent discussions about reconceptualizing evidence in ePortfolio assessment.  Research Questions: Four questions guided our study: What levels of interrater agreement and reliability can be achieved when assessing complex writing performances that a) contain several different documents to be assessed; b) use a construct-based, multi-trait rubric; c) are designed for formative assessment rather than testing; and d) are rated by a multidisciplinary writing faculty?   What can be learned from assessing agreement and reliability of individual traits? How might these measurements contribute to curriculum design, teacher development, and student learning? How might these findings contribute to research on fairness, reliability, and validity; rubrics; and multidisciplinary writing assessment? Literature Review: There is a long history of empirical work exploring the reliability of scoring highly controlled timed writings, particularly by test measurement specialists. However, until quite recently, there have been few instances of applying empirical assessment techniques to writing portfolios.  Developed by writing theorists, writing portfolios contain multiple documents and genres and are produced and assessed under conditions significantly different from those of timed essay measurement. Interrater reliability can be affected by the different approaches to reading texts depending on the background, training, and goals of the rater. While a few writing theorists question the use of rubrics, most quantitatively based scholarship points to their effectiveness for portfolio assessment and calls into question the meaningfulness of single score holistic grading, whether impressionistic or rubric-based. Increasing attention is being paid to multi-trait rubrics, including, in the field of writing portfolio assessment, the use of robust writing constructs based on psychometrics alongside the more conventional cognitive traits assessed in writing studies, and rubrics that can identify areas of opportunity as well as unfairness in relation to the background of the student or the assessor. Scholars in the emergent field of empirical portfolio assessment in writing advocate the use of reliability as a means to identify fairness and validity and to create great opportunities for portfolios to advance student learning and professional development of faculty.  They also note that while the writing assessment community has paid attention to the work of test measurement practitioners, the reverse has not been the case, and that conversations and collaborations between the two communities are long overdue. Methodology: We used two methods of calculating interrater agreement: absolute and adjacent percentages, and Cohen’s Unweighted Kappa, which calculates the extent to which interrater agreement is an effect of chance or expected outcome. For interrater reliability, we used the Pearson product-moment correlation coefficient. We used SPSS to produce all of the calculations in this study.  Results: Interrater agreement and reliability rates of portfolio scores landed in the medium range of statistical significance.  Combined absolute and adjacent percentages of interrater reliability were above the 90% range recommended; however, absolute agreement was below the 70% ideal.  Furthermore, Cohen’s Unweighted Kappa rates were statistically significant but very low, which may be due to “kappa paradox.” Discussion: The study suggests that a formative, rubric-based approach to ePortfolio assessment that uses disciplinarily diverse raters can achieve medium-level rates of interrater agreement and reliability. It raises the question of the extent to which absolute agreement is a desirable or even relevant goal for authentic feedback processes of a complex set of documents, and in which the aim is to advance student learning. At the same time, our findings point to how agreement and reliability measures can significantly contribute to our assessment process, teacher training, and curriculum. Finally, the study highlights potential concerns about construct validity and rater training.  Conclusion: This study contributes to the emergent field of empirical writing portfolio assessment that calls into question the prevailing standard of reliability built upon timed essay measurement rather than the measurement, conditions, and objectives of complex writing performances.  It also contributes to recent research on multi-trait and discipline-based portfolio assessment.  We point to several directions for further research:  conducting “talk aloud” and recorded sessions with raters to obtain qualitative data on areas of disagreement; expanding the number of constructs assessed; increasing the range and granularity of the numeric scoring scale; and investigating traits that are receiving low interrater reliability scores. We also ask whether absolute agreement might be more useful for writing portfolio assessment than reliability and point to the potential “kappa paradox,” borrowed from the field of medicine, which examines interrater reliability in assessment of rare cases. Kappa paradox might be useful in assessing types of portfolios that are less frequently encountered by faculty readers. These, combined with the identification of jagged profiles and student demographics, hold considerable potential for rethinking how to work with and assess students from a range of backgrounds, preparation, and abilities.  Finally, our findings contribute to a growing effort to understand the role of rater background, particularly disciplinarity, in shaping writing assessment. The goals of our assessment process are to ensure that we are measuring what we intend to measure, specifically those things that students have an equal chance at achieving and that advance student learning.  Our findings suggest that interrater agreement and reliability measures, if thoughtfully approached, will contribute significantly to each of these goals.

    doi:10.37514/jwa-j.2017.1.1.09
  4. Measuring the Written Language Disorder among Students with Attention Deficit Hyperactivity Disorder
    Abstract

    Background: Attention Deficit Hyperactivity Disorder (ADHD) is a mental health disorder. People diagnosed with ADHD are often inattentive (have difficulty focusing on a task for a considerable period), overly impulsive (make rash decisions), and are hyperactive (move excessively, often at inappropriate times). ADHD is often diagnosed through psychiatric assessments with additional input from physical/neurological evaluations. Written Language Disorder (WLD) is a learning disorder. People diagnosed with WLD often make multiple spelling, grammar, and punctuation mistakes, have sentences that lack cohesion and topic flow, and have trouble completing written assignments. Typically, WLD is also diagnosed through psychological educational assessments with additional input from physical/neurological evaluation. Literature Review: Previous research has shown a link between ADHD and writing difficulties. Students with ADHD have an increased likelihood of having writing difficulties, and rarely is there a presence of writing difficulties without ADHD or another mental health disorder. However, the presence of writing difficulties does not necessarily indicate the presence of a WLD. There are other physical and behavioral factors of ADHD that can contribute to a student having a WLD as well. Therefore, a statistical association between these factors (in conjunction with written performance) and WLD must first be established. Research Question: To determine the statistical association between WLD and physical and behavioral aspects of ADHD that indicate writing difficulties, this research reviewed methodologies from the literature pertaining to contemporary diagnoses of writing difficulties in ADHD students, and reveal diagnostic methods that explicitly associate the presence of WLD with these writing difficulties among students with ADHD. The results demonstrate the association between writing difficulties and WLD as it pertains to ADHD students using an integrated computational model employed on data from a systematic review. These results will be validated in a future study that will employ the integrated computational model to measure WLD among students with ADHD. Methodology: To measure the association of WLD among students with ADHD, the authors created a novel computational model that integrates the outcomes of common screening methods for WLD (physical questionnaire, behavioral questionnaire, and written performance tasks) with common screening methods for ADHD (physical questionnaire, behavioral questionnaire, adult self-reporting scales, and reaction-based continuous performance tasks (CPTs)). The outcomes of these screening methods were fed into an artificial neural network (ANN ) first, to ‘artificially learn’ about measuring the prevalence of WLD among ADHD students and second, to adjust the prevalence value based on information from different screening methods. This can be considered as the priming of the ANN. The ANN model was then tested with data from previous studies about ADHD students who had writing difficulties. The ANN model was also tested with data from students without ADHD or WLD, to serve as control. Results: The results show that physical, behavioral, and written performance attributes of ADHD students have a high correlation with WLD (r = 0.72 to 0.80) in comparison to control students (r = 0.30 to 0.20), substantiating the link between WLD and ADHD. It should be noted that due to lack of female participation, most studies in the literature only employed and reported on the relationship between WLD and ADHD for male participants. Discussion and Conclusion: By testing ADHD students and control students against the WLD criteria, the study shows a strong correlation between WLD and ADHD. There are limitations to the results’ accuracy in terms of a) sample size (average n=88, mean age = 19, 8 studies used for a meta-analysis), b) analysis (original study reviewing ADHD factors first, WLD factors second), and c) causation (the study only reviews prevalence of WLD in ADHD students, not causation). A clinical trial will validate the data and address some of these limitations in a future phase of the research. A computational causal model will be introduced in the discussion portion to illustrate how causation between writing metrics and WLD as it pertains to ADHD can be achieved. These results open the door to advancing pedagogical techniques in education, where students afflicted with ADHD and/or WLD could not only receive assistance for the behavioral aspects of their disorder, but also expect assistance for the learning aspects of their disorder, empowering them to succeed in their studies.

    doi:10.37514/jwa-j.2017.1.1.07