Matthew Johnson
Abstract
In this study, we examine the feasibility of augmenting student-written essays with essays generated by large language models (LLMs) for training automated essay scoring models. We found that, with appropriate instructions, generative AI systems such as GPT-4 and GPT-4o can generate essays similar to those written by students in terms of surface-level linguistic features, although material differences may still exist. Systematic analyses revealed that scoring models trained with synthetic data perform comparably to models trained on student essays, although performance varies across essay prompts and training-sample sizes. The augmented models can alleviate large discrepancies between human and AI scores at the subgroup level that may arise from a lack of training samples for a particular subgroup or from inherent biases in LLMs. We also explored an established token-importance method, DecompX, to identify and explain AI score predictions. Limitations of this study and directions for future research are also discussed.