Journal of Writing Analytics

9 articles
Topic: teacher development

January 2024

  1. Digital Analysis of First-Year Composition Archive for Seeking Writing Teaching Job and Professionalization Purposes
    doi:10.37514/jwa-j.2024.7.1.03

January 2022

  1. The Relationship Between Teacher Efficacy, Writing Apprehension, and Writing to Learn Using Structural Equation Modeling
    doi:10.37514/jwa-j.2022.6.1.03

January 2020

  1. Gender Preferences in Writing Center Appointments: The Case for a Metadata-Driven Approach
    Abstract

    Writing center studies has sought to move towards research methods that are replicable, aggregable, and data-supported (RAD) as a means to scholarly legitimacy. While a number of RAD research methods have been identified (surveys, qualitative analysis, observation, case studies, experimentation, discourse analysis, teacher research, action research, and ethnography), one important source of information has been largely overlooked: the scheduling metadata that writing centers routinely collect in the course of normal operations. The present research seeks to demonstrate the validity of metadata-driven research by interrogating an area of writing center scholarship that has been predominantly studied through theoretical or small group means: the impact of gender on writing consultations. It investigates whether the gender of the writing consultant significantly affects a student’s choice in scheduling appointments.

    doi:10.37514/jwa-j.2020.4.1.10

January 2019

  1. Research in the Teaching of English: From Alchemy and Science to Methodological Plurality
    doi:10.37514/jwa-j.2019.3.1.15

January 2018

  1. Evolution of Instructor Response? Analysis of Five Years of Feedback to Students
    Abstract

    Background: Research incorporating large data sets and data and text mining methodologies is making initial contributions to writing studies. In writing program administration (WPA) work, one could best characterize the body of publications as small but growing, led by such work as Moxley and Eubanks’ 2015 “On Keeping Score: Instructors' vs. Students' Rubric Ratings of 46,689 Essays” and Arizona State University’s Science of Learning & Educational Technology (SoLET) Lab. Given the information that large-scale textual analysis can provide, it seems incumbent on program administrators to explore ways to make regular and aggressive use of such opportunities to give both students and instructors more resources for learning and development. This project is one attempt to add to this corpus of work; the sample for the study consisted of 17,534 pieces of student writing representing 141,659 discrete comments on that writing, with 58,300 unique words out of over 8.25 million total words written. This data is used to examine trends in the program’s instructor commentary over five years’ time. By doing so, this study revisits a fundamental task of writing instruction, responding to student writing, and from the data’s results considers how large writing programs with constant turnover of graduate teaching assistants (GTAs) might manage their ongoing instructor professional development and how those GTAs will improve their ability to teach and respond to writing. Literature Review: Researchers have attempted to unpack and understand the task of instructor commentary for several decades; the published literature demonstrates a complex and occasionally ambivalent relationship with this central task of writing instruction. Recent scholarship has moved from the small-scale studies long used by the field to implement large-scale examinations of the instruction occurring in writing programs. Research questions: Three questions guided the inquiry: (1) Does the work of new instructors (MA1s) more closely resemble the lexicon of novice or experienced responders to student writing? (2) How does the new instructors’ work compare to that of more experienced (PHD1 or INS) instructors in the program throughout their time? (3) How does their work evolve over a four-semester longitudinal time frame (as MA1 or MA2 experience levels) in the first-year writing program? [Please note that the abbreviations used above and throughout the article to designate instructor experience levels are as follows: MA1 (first-year master’s students); MA2 (second-year master’s students); PHD1 (first-year doctoral students); INS (instructors, meaning those with 3 or more years’ teaching experience who are not currently pursuing an additional degree, nearly all of whom held a master’s degree)]. Methodology: This study extends the work of Anson and Anson (2017), who first surveyed writing instructors and program administrators to create wordlists that survey respondents associated with “high-quality” and “novice” responses, and then examined a corpus of nearly 50,000 peer responses produced at a single university to learn to what extent instructors and student peers adopted this lexicon. Specifically, the study analyzes a corpus of instructor comments to students using the Anson and Anson wordlists associated with principled and novice commentary to see if new writing instructors align more closely with the concepts represented in either list during their first semester in the program. It then tracks four cohorts for evolution and change in their vocabulary of feedback over their next three semesters in the program; the study also compares the vocabulary used in their comments to that used by experienced instructors in the program over the same time. Results: The study found that from the outset, the new instructors (MA1) incorporated more of the principled response terms than the novice response terms. Overall, in comparing the MA1 instructors with the most experienced group (INS), the results reveal three important findings about the feedback of both MA1s and INSs in this program. (1) While there are some differences in commentary as seen via examination of the two lexicons, the differences are perhaps less than one might assume. (2) The cohorts do increase their use of the principled terms as they move through the two years’ appointment in the program, but few of the increases demonstrate statistical significance. (3) Few of the terms from either the novice or principled lexicon appear frequently in the overall corpus; the exception is terms that also appear in the assignment descriptions, which I label “content terms.” Discussion: Based on the results, the instructors in this program had acquired a more consistent vocabulary, but not primarily one based on Anson and Anson’s two lexicons; instead, the most frequent and commonly used terms seem to come from a more local “canon,” that is, one based on the assignment descriptions and course outcomes. Regardless of whether the acquisition of a common vocabulary came from more global concepts or an assignment-based local canon, using common terms is something that Nancy Sommers (1982) saw as contributing to “thoughtful commentary” on student writing. As no one has previously studied how quickly new instructors acquire a professional vocabulary for responding to student writing, it is hard to know whether or not the results of this particular group of instructors would be considered “typical.” However, it may well be that the context of this writing program contributed to a more accelerated acquisition. Conclusions: Working with the lexicons developed via Anson and Anson’s survey is a useful starting point for understanding more of what our instructors actually do when responding to student writing, as well as for identifying critical differences in our instructors’ comments. The lexicons, though, only provide us with a subset of expected (thus acceptable) terms included in commentary: terms that afford students the opportunity to act upon receiving them via revision or transfer. Directions for Future Research: Additional research is necessary to expand and refine the lexicons and their impact on student writing. One possibility is to return to the current data set to engage in additional lexical analysis of both the novice and principled lexicons as well as the overall frequency tables to understand how terms are used in the context of response by the various instructor groups. Differences in the application of the terms might help us understand why comments might be labeled as more or less helpful to writers. Another strategy is to examine the data in terms of markers of stance; finally, topic modeling could be used to locate more subtle differences in the instructor comments that are not as easily identifiable with lexical analysis. Such examinations could serve as a baseline for broadening the study out to other sets of assignments and commentary, perhaps helping us build a set of threshold concepts for talking about writing with our students. Ultimately, it is important to replicate and expand Anson and Anson’s survey to other stakeholder groups. As with much research on the teaching of writing, we default to the group most accessible to us: other writing professionals. Replicating this survey with other stakeholders (graduate teaching assistants and undergraduate students at both lower- and upper-division levels) could help us understand whether a gap exists among the various stakeholders in understanding what constitutes good feedback.

    doi:10.37514/jwa-j.2018.2.1.02
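
    As a rough illustration of the lexicon comparison described in the abstract above, the following sketch counts how often terms from two wordlists occur in a set of instructor comments, normalized per 1,000 words. The wordlists, the tokenizer, and the function name `lexicon_rates` are hypothetical stand-ins; the actual Anson and Anson lexicons and the study's tooling are not given in the abstract.

```python
import re
from collections import Counter

# Hypothetical stand-in wordlists; NOT the actual Anson and Anson lexicons.
PRINCIPLED_TERMS = {"audience", "purpose", "evidence"}
NOVICE_TERMS = {"awkward", "grammar", "wordy"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_rates(comments):
    """Return occurrences per 1,000 words for each lexicon across all comments."""
    counts = Counter()
    total_words = 0
    for comment in comments:
        for token in tokenize(comment):
            total_words += 1
            if token in PRINCIPLED_TERMS:
                counts["principled"] += 1
            elif token in NOVICE_TERMS:
                counts["novice"] += 1
    if total_words == 0:
        return {"principled": 0.0, "novice": 0.0}
    return {
        name: 1000 * counts[name] / total_words
        for name in ("principled", "novice")
    }


if __name__ == "__main__":
    sample = ["Consider your audience and purpose.", "Awkward grammar here."]
    print(lexicon_rates(sample))
```

    Running the same function separately on each cohort's comments, semester by semester, would give the kind of per-group rates that a longitudinal comparison could start from; significance testing is left out of this sketch.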
  2. De-Identification of Laboratory Reports in STEM
    Abstract

    Background: Employing natural language processing and latent semantic analysis, the current work was completed as a constituent part of a larger research project for designing and launching artificial intelligence in the form of deep artificial neural networks. The models were evaluated on a proprietary corpus retrieved from a data warehouse, where it was extracted from MyReviewers, a sophisticated web application designed for peer review in written communication, which was actively used in several higher education institutions. The corpus of laboratory reports in STEM annotated by instructors and students was used to train the models. Under the Common Rule, research ethics were ensured by protecting the privacy of subjects and maintaining the confidentiality of data, which mandated corpus de-identification. Literature Review: De-identification and pseudonymization of textual data have remained an actively studied research question for several decades. Their importance is stipulated by numerous laws and regulations in the United States (such as the HIPAA Privacy Rule and FERPA) and internationally. Research Question: Text de-identification requires a significant amount of manual post-processing for eliminating faculty and student names. This work investigated automated and semi-automated methods for de-identifying student and faculty entities while preserving author names in cited sources and reference lists. It was hypothesized that a natural language processing toolkit and an artificial neural network model with named entity recognition capabilities would facilitate text processing and reduce the amount of manual labor required for post-processing after matching essays to a list of users’ names. The suggested techniques were applied with supplied pre-trained models without additional tagging and training. The goal of the study was to evaluate three approaches and find the most efficient one among those using a users’ list, a named entity recognition toolkit, and an artificial neural network. Research Methodology: The current work studied de-identification of STEM laboratory reports and evaluated the performance of the three techniques: brute-force search with a user list, named entity recognition with the OpenNLP machine learning toolkit, and NeuroNER, an artificial neural network for named entity recognition built on the TensorFlow platform. The complexity of the task stemmed from a dilemma: names belonging to students, instructors, or teaching assistants must be removed, while all other names (e.g., authors of referenced papers) must be preserved. Results: The evaluation of the three selected methods demonstrated that automating de-identification of STEM lab reports is not possible in a setting where named entity recognition methods are employed with pre-trained models. The highest results were achieved by the users’ list technique with 0.79 precision, 0.75 recall, and 0.77 F1 measure, which significantly outperformed OpenNLP with 0.06 precision, 0.14 recall, and 0.09 F1, and NeuroNER with 0.14 precision, 0.56 recall, and 0.23 F1. Discussion: Low performance of the OpenNLP and NeuroNER toolkits was explained by the complexity of the task and the unattainability of customized models due to imposed time constraints. An approach for masking possible de-identification errors is suggested. Conclusion: Unlike multiple cases described in the related work, de-identification of laboratory reports in STEM remained a non-trivial, labor-intensive task. Applied out of the box, a machine learning toolkit and an artificial neural network technique did not improve on the performance of the brute-force approach based on user list matching. Directions for Future Research: Customized tagging and training on the STEM corpus were presumed to advance outcomes of machine learning and, in particular, artificial intelligence methods. Application of other natural language toolkits may lead to a more effective solution.

    doi:10.37514/jwa-j.2018.2.1.07
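
    To make the user-list technique and the reported metrics more concrete, here is a minimal sketch, assuming whole-word, case-insensitive matching against a roster of names and the standard definitions of precision, recall, and F1. The function names `mask_names` and `precision_recall_f1`, the `[NAME]` placeholder, and the sample text are hypothetical; the study's actual matching rules are not described in the abstract.

```python
import re


def mask_names(text, user_names, placeholder="[NAME]"):
    """Replace whole-word, case-insensitive matches of roster names."""
    # Longest names first, so "Ann Lee" is masked before a bare "Ann".
    for name in sorted(user_names, key=len, reverse=True):
        pattern = r"\b" + re.escape(name) + r"\b"
        # A function replacement avoids backslash handling in the placeholder.
        text = re.sub(pattern, lambda _m: placeholder, text, flags=re.IGNORECASE)
    return text


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    report = "Report by Ann Lee, citing Smith (2010)."
    print(mask_names(report, ["Ann Lee"]))
    print(precision_recall_f1(tp=3, fp=1, fn=1))
```

    A sketch like this also shows where the dilemma named in the abstract arises: if a student on the roster shares a surname with a cited author, plain list matching masks both, which is one source of false positives that manual post-processing would have to catch.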

January 2017

  1. Statistical and Qualitative Analyses of Students’ Answers to a Constructed Response Test of Science Inquiry Knowledge
    Abstract

    Objective: We report on a comparative study of the language used by middle school students in their answers to a constructed response test of science inquiry knowledge. Background: Text analyses using statistical models have been conducted across a number of disciplines to identify topics in a journal, to extract topics in Twitter messages, and to investigate political preferences. In education, relatively few studies have analyzed the text of students’ written answers to investigate topics underlying the answers. Methodology: Two types of linguistic analysis were compared to investigate their utility in understanding students’ learning of scientific investigation practices. A statistical method, latent Dirichlet allocation (LDA), was used to extract topics from the texts of student responses. In the LDA model, topics are viewed as multinomial distributions over the vocabulary of documents. These topics were examined for content and used to characterize student responses on the constructed response items. The change from pre-test to post-test in proportions of use of each of the topics was related to students’ learning. Next, a qualitative method, systemic functional linguistic (SFL) analysis, was used to analyze the text of student responses on the same test of science inquiry knowledge. Student assessments were analyzed for two linguistic features that are important for convincing scientific communication: technical vocabulary usage and high lexical density. In this way, we investigated whether human judgement regarding the changes observed from texts based on the SFL framework agreed with the inference regarding the changes observed from the texts through LDA. Research questions: Two research questions were investigated in this study: (1) What do the LDA and SFL analyses tell us about students’ answers? (2) What are the similarities and differences of the two analyses? 
Data: The data for this study were taken from an NSF-funded host study on teaching science inquiry skills to middle school students who were a mix of both native English speakers and English-language learners. The primary objective was to enable participants to learn to take ownership of scientific language through the use of language-rich science investigation practices. The LDA analysis used a sample of 252 students’ pre- and post-assessments. The SFL analysis used a second sample of 90 students’ pre- and post-assessments. Results: In the LDA analysis, three topics were detected in student responses: “preponderance of everyday language (Topic 1),” “preponderance of general academic language (Topic 2),” and “preponderance of discipline-specific language (Topic 3).” Students’ use of topics changed from pre-test to post-test. Students on the post-test tended to have higher proportions of Topic 3 than students on the pre-test. In the SFL analysis, students tended to use more technical vocabulary and have higher lexical density in their written responses on the post-test than on the pre-test. Discussion: Results from the LDA and SFL analyses suggest that students responded using more discipline-specific language on the post-test than on the pre-test. In addition, the results of the two linguistic features from the SFL analysis, technical vocabulary usage and lexical density, were compared with the results from the LDA analysis. Conclusion: Results of the LDA and SFL analyses were consistent with each other and clearly showed that students improved in their ability to use the discipline-specific and academic terminology of the language of scientific communication.

    doi:10.37514/jwa-j.2017.1.1.05
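
    Lexical density, one of the two features named in the abstract above, can be operationalized in more than one way, and the abstract does not say which definition the authors used. The sketch below assumes one common version, the proportion of content words among all word tokens, and uses a deliberately small, hypothetical function-word list; a fuller analysis would likely rely on a part-of-speech tagger.

```python
import re

# Small illustrative set of function words (an assumption, not the study's list).
FUNCTION_WORDS = {
    "a", "an", "the", "and", "or", "but", "if", "of", "in", "on", "at",
    "to", "for", "with", "by", "from", "is", "are", "was", "were", "be",
    "it", "this", "that", "i", "we", "you", "he", "she", "they",
}


def lexical_density(text):
    """Return the share of tokens that are not in the function-word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    content_tokens = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content_tokens) / len(tokens)


if __name__ == "__main__":
    pre = "The plant grew in the sun"
    post = "Photosynthesis converts sunlight into chemical energy"
    print(lexical_density(pre), lexical_density(post))
```

    Comparing the value for a student's pre-test answer with the value for the same student's post-test answer gives a simple per-student change score of the kind the SFL comparison implies.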
  2. I Hear What You’re Saying: The Power of Screencasts in Peer-to-Peer Review
    Abstract

    Aim: The screencast (SC), a 21st century analytics tool, enables the simultaneous recording of audio and video feedback on any digital document, image, or website, and may be used to enhance feedback systems in many educational settings. Although previous findings show that students and teachers have had positive experiences with recorded commentary, this method is still rarely used by teachers in composition classrooms. There are many possible reasons for this, some of which include the accelerated pace at which classroom technology has changed over the past decade, concerns over privacy when new technologies are integrated into the classroom, and the general unease instructors may feel when asked to integrate a new technology system into their established composition pedagogy and response routine. The aim of this study was to replicate previous findings in favor of SC feedback and expand that body of research beyond instructor-to-student SC interactions and into the realm of SC-mediated peer review. Thus, this study seeks to improve on the widespread written peer review practices most common among writing instruction today, practices that tend to produce mediocre learning outcomes and fail to capitalize on 21st century technological innovations to enhance student learning. This research note demonstrates the validity of SC as a valuable writing analytics research tool that has the potential to collect and measure student learning. It also seeks to inspire those who have been reluctant to adopt SC in both digital learning and face-to-face educational environments by providing pragmatic guidance for doing so in ways that simultaneously increase student learning and facilitate a more rigorous and discursive peer-to-peer review process. Problem Formation: While research suggests positive student perceptions related to screencast instructor response, results in peer-to-peer screencast response are mixed. 
After several successful years of experience in instructor-to-student SC feedback, the author wondered what would happen if she asked students to use screencast technology to mediate peer review. How might students’ attitudes and perceptions impact the use of peer-to-peer screencast technology in the composition classroom? In order to address these questions, the author developed a survey measuring the user reliability of this new SC technology and the student affect and revision initiative it produces. Information Collection: This study extends Anson’s (2016) research and insights by reporting findings from a study of 138 writing students. Survey data was collected during the 2015-2016 academic year at three institutions. At High Point University, the author of this research note asked freshmen composition students in a traditional face-to-face lecture course to conduct a series of peer review sessions (including both traditional written comments and SC comments) over a 16-week semester. Students were surveyed after each peer review experience, and the results form the foundation of this research note’s conclusions. In addition to survey responses, researchers also collected the screencasts exchanged among peer-to-peer interactions within each educational setting. Conclusions: The author provides an in-depth analysis of students’ experiences, perceptions, and attitudes toward giving and receiving screencast feedback, focusing on the impact of this method on student revision initiative in comparison to that of a traditional written feedback system. Some conclusions are also drawn regarding the user reliability and effectiveness of the screencast technology, specifically the free software program known as Jing, a product available through Techsmith.com that enables a streamlined and user-friendly SC interface and cloud storage of all SC recordings through individualized hyperlinks, thereby alleviating concerns regarding student privacy. 
Directions for Further Research: While this research note provides compelling evidence to support the use of SC in composition classrooms, there are also many opportunities for continued study, particularly within the emerging field of writing analytics. While the actual student-to-student screencasts were collected in this study, they were not analyzed as a qualitative data set, and the researchers relied on self-reported survey data to assess the degree of revision initiative among the students surveyed. The screencasts themselves offer a treasure trove of data, should the researcher have the capability to code that data set or utilize automated natural language processing programs in the future. Perhaps this peer-to-peer SC feedback could be compared to similar corpus analyses of instructor-to-student feedback gathered by other writing analytics scholars. In addition, further research in this area could also collect the student writing itself and track revisions made by students after receiving SC feedback and traditional written feedback from their peers. In this way, researchers would be able to make comparisons between the actual changes made by the student writers, the extent of those changes (surface-level or higher-order revisions), and the student’s perceived degree of revision initiative reported in the survey. To facilitate future research in this area, the author has included teaching resources for those new to screencast technology and analytics.

    doi:10.37514/jwa-j.2017.1.1.13
  3. Assessing Writing Constructs: Toward an Expanded View of Inter-Reader Reliability
    Abstract

    Background: This study focuses on construct representation and inter-reader agreement and reliability in ePortfolio assessment of 1,315 writing portfolios. These portfolios were submitted by undergraduates enrolled in required writing seminars at the University of Pennsylvania (Penn) in the fall of 2014.  Penn is an Ivy League university with a diverse student population, half of whom identify as students of color. Over half of Penn’s students are women, 12% are international, and 12% are first-generation college students. The students’ portfolios are scored by the instructor and an outside reader drawn from a writing-in-the-disciplines faculty who represent 24 disciplines. The portfolios are the product of a shared curriculum that uses formative assessment and a program-wide multiple-trait rubric. The study contributes to scholarship on the inter-reader reliability and validity of multiple-trait portfolio assessments as well as to recent discussions about reconceptualizing evidence in ePortfolio assessment.  Research Questions: Four questions guided our study: What levels of interrater agreement and reliability can be achieved when assessing complex writing performances that a) contain several different documents to be assessed; b) use a construct-based, multi-trait rubric; c) are designed for formative assessment rather than testing; and d) are rated by a multidisciplinary writing faculty?   What can be learned from assessing agreement and reliability of individual traits? How might these measurements contribute to curriculum design, teacher development, and student learning? How might these findings contribute to research on fairness, reliability, and validity; rubrics; and multidisciplinary writing assessment? Literature Review: There is a long history of empirical work exploring the reliability of scoring highly controlled timed writings, particularly by test measurement specialists. 
However, until quite recently, there have been few instances of applying empirical assessment techniques to writing portfolios.  Developed by writing theorists, writing portfolios contain multiple documents and genres and are produced and assessed under conditions significantly different from those of timed essay measurement. Interrater reliability can be affected by the different approaches to reading texts depending on the background, training, and goals of the rater. While a few writing theorists question the use of rubrics, most quantitatively based scholarship points to their effectiveness for portfolio assessment and calls into question the meaningfulness of single score holistic grading, whether impressionistic or rubric-based. Increasing attention is being paid to multi-trait rubrics, including, in the field of writing portfolio assessment, the use of robust writing constructs based on psychometrics alongside the more conventional cognitive traits assessed in writing studies, and rubrics that can identify areas of opportunity as well as unfairness in relation to the background of the student or the assessor. Scholars in the emergent field of empirical portfolio assessment in writing advocate the use of reliability as a means to identify fairness and validity and to create great opportunities for portfolios to advance student learning and professional development of faculty.  They also note that while the writing assessment community has paid attention to the work of test measurement practitioners, the reverse has not been the case, and that conversations and collaborations between the two communities are long overdue. Methodology: We used two methods of calculating interrater agreement: absolute and adjacent percentages, and Cohen’s Unweighted Kappa, which calculates the extent to which interrater agreement is an effect of chance or expected outcome. For interrater reliability, we used the Pearson product-moment correlation coefficient. 
We used SPSS to produce all of the calculations in this study.  Results: Interrater agreement and reliability rates of portfolio scores landed in the medium range of statistical significance.  Combined absolute and adjacent percentages of interrater reliability were above the 90% range recommended; however, absolute agreement was below the 70% ideal.  Furthermore, Cohen’s Unweighted Kappa rates were statistically significant but very low, which may be due to “kappa paradox.” Discussion: The study suggests that a formative, rubric-based approach to ePortfolio assessment that uses disciplinarily diverse raters can achieve medium-level rates of interrater agreement and reliability. It raises the question of the extent to which absolute agreement is a desirable or even relevant goal for authentic feedback processes of a complex set of documents, and in which the aim is to advance student learning. At the same time, our findings point to how agreement and reliability measures can significantly contribute to our assessment process, teacher training, and curriculum. Finally, the study highlights potential concerns about construct validity and rater training.  Conclusion: This study contributes to the emergent field of empirical writing portfolio assessment that calls into question the prevailing standard of reliability built upon timed essay measurement rather than the measurement, conditions, and objectives of complex writing performances.  It also contributes to recent research on multi-trait and discipline-based portfolio assessment.  We point to several directions for further research:  conducting “talk aloud” and recorded sessions with raters to obtain qualitative data on areas of disagreement; expanding the number of constructs assessed; increasing the range and granularity of the numeric scoring scale; and investigating traits that are receiving low interrater reliability scores. 
We also ask whether absolute agreement might be more useful for writing portfolio assessment than reliability and point to the potential “kappa paradox,” borrowed from the field of medicine, which examines interrater reliability in assessment of rare cases. Kappa paradox might be useful in assessing types of portfolios that are less frequently encountered by faculty readers. These, combined with the identification of jagged profiles and student demographics, hold considerable potential for rethinking how to work with and assess students from a range of backgrounds, preparation, and abilities.  Finally, our findings contribute to a growing effort to understand the role of rater background, particularly disciplinarity, in shaping writing assessment. The goals of our assessment process are to ensure that we are measuring what we intend to measure, specifically those things that students have an equal chance at achieving and that advance student learning.  Our findings suggest that interrater agreement and reliability measures, if thoughtfully approached, will contribute significantly to each of these goals.

    doi:10.37514/jwa-j.2017.1.1.09
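
    The agreement and reliability measures named in the abstract above (absolute and adjacent agreement, Cohen's unweighted kappa, and the Pearson product-moment correlation) can be sketched in plain Python as follows. The authors report using SPSS, so this is only an illustration; the function names are hypothetical, and treating "adjacent" as scores exactly one point apart is an assumption.

```python
from collections import Counter
from math import sqrt


def agreement_rates(rater1, rater2):
    """Return (absolute, adjacent) agreement as proportions of all scored items."""
    n = len(rater1)
    absolute = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Assumption: "adjacent" means the two scores differ by exactly one point.
    adjacent = sum(abs(a - b) == 1 for a, b in zip(rater1, rater2)) / n
    return absolute, adjacent


def cohens_kappa(rater1, rater2):
    """Cohen's unweighted kappa: observed agreement corrected for chance agreement."""
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    counts1, counts2 = Counter(rater1), Counter(rater2)
    p_expected = sum(counts1[k] * counts2[k] for k in counts1) / (n * n)
    if p_expected == 1:
        return float("nan")  # undefined when both raters use one identical category
    return (p_observed - p_expected) / (1 - p_expected)


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


if __name__ == "__main__":
    instructor = [1, 2, 3, 4]
    outside_reader = [1, 2, 4, 4]
    print(agreement_rates(instructor, outside_reader))
    print(cohens_kappa(instructor, outside_reader))
    print(pearson_r(instructor, outside_reader))
```

    Because kappa subtracts the agreement expected from each rater's marginal score distribution, it can come out low even when raw agreement is high if most scores cluster in a few categories, which is the situation the abstract refers to as the "kappa paradox."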