On the Limitations of Human-Computer Agreement in
Automated Essay Scoring
Afrizal Doewes, Mykola Pechenizkiy
Eindhoven University of Technology
{a.doewes, m.pechenizkiy}@tue.nl
ABSTRACT
Scoring essays is generally an exhausting and time-consuming task
for teachers. Automated Essay Scoring (AES) makes the scoring
process faster and more consistent. The most logical way to assess
the performance of an automated scorer is by measuring its score
agreement with human raters. However, we provide empirical
evidence that an essay scorer that performs well from the
quantitative evaluation point of view can still be too risky to
deploy. We propose several input scenarios to evaluate the
reliability and validity of the system, such as off-topic essays,
gibberish, and paraphrased answers. We demonstrate that
automated scoring models with high human-computer agreement
fail to perform well on two out of three test scenarios. We also
discuss strategies to improve the performance of the system.
Keywords
Automated Essay Scoring, Testing Scenarios, Reliability and
Validity
1. INTRODUCTION
An Automated Essay Scoring (AES) system is computer software
designed to facilitate the evaluation of student essays. In theory,
AES systems work faster, reduce costs in terms of evaluators’ time,
and eliminate concerns about rater consistency.
The most logical way to assess the performance of an automated
scorer is by measuring the score agreement with the human raters.
The score agreement rate must exceed a specific threshold value for
the scorer to be considered well-performing. Consequently, most
studies have focused on increasing the level of agreement between
human and computer scoring. However, the process of establishing
reliability should not stop with the calculation of inter-coder
reliability, because automated scoring poses some distinctive
validity challenges such as the potential to misrepresent the
construct of interest, vulnerability to cheating, impact on examinee
behavior, and users’ interpretation and use of scores [1].
Bennett and Bejar [2] have argued that reliability scores, which rely
on human ratings to evaluate the performance of automated scoring,
are limited primarily because human graders are fallible. Human
raters may experience fatigue and have problems with scoring
consistency over time. Reliability calculations alone are therefore
not adequate for establishing validity [3]. An essay scorer that
performs well from the quantitative evaluation perspective is still
too risky to deploy before the system’s reliability and validity have
been evaluated.
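As an illustration of this quantitative criterion, the following minimal sketch computes quadratic weighted kappa, an agreement statistic commonly used in AES evaluation, between human and machine scores. The scores and the choice of metric are illustrative assumptions, not the exact setup of this paper.

# A minimal sketch, assuming quadratic weighted kappa (QWK) as the
# human-machine agreement statistic; the scores are hypothetical.
from sklearn.metrics import cohen_kappa_score

human_scores = [2, 3, 4, 2, 5, 3]      # hypothetical human ratings
machine_scores = [2, 3, 3, 2, 5, 4]    # hypothetical model predictions

qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
print(f"Human-machine agreement (QWK): {qwk:.3f}")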
The initial attempt to discuss validity issues regarding automated
scoring within the larger context of a validity argument for an
assessment was made by Clauser et al. [4]. They outlined the
potential validity threats that automated scoring could introduce to
the overall interpretation and use of scores. Enright and Quinlan
[5] discussed how evidence for a scoring process that uses both
human and e-rater scoring is relevant to the validity argument. They
described an e-rater model proposed to score one of the two writing
tasks in the TOEFL-iBT writing section; the automated scorer was
investigated as a tool to complement human judgment on essays
written by English language learners.
Several criticisms of Automated Essay Scoring (AES) systems were
highlighted in [6]. The authors argued that there were few studies
on how effectively automated writing evaluation was used as a
pedagogical tool in writing classes. In their study, students reacted
negatively to the automated assessment. One of the problems was
that the scoring system favored lengthiness: higher scores were
awarded to longer essays. It also overemphasized the use of
transition words, which immediately increased an essay’s score.
Moreover, it ignored coherence and content development, as an
essay could achieve a high score by having four or five paragraphs
with relevant keywords even if it had major coherence problems
and illogical ideas. Another concern is described in [7]: students
who know how the system evaluates an essay may be able to fool it
into assigning a higher score than is warranted. The authors
concluded that the system was not yet ready to serve as the sole
scorer, especially for high-stakes testing, without the help of expert
human raters.
Most researchers agree that human-automated score agreement
still serves as the standard baseline for measuring the quality of
machine score prediction. However, this measurement has an
inherent limitation: the agreement rate is usually derived only from
the data used for training and testing the machine learning model.
The aim of this paper is to highlight some limitations of the standard
performance metrics used to evaluate automated essay scoring
models, using several input scenarios to evaluate the reliability and
validity of the system: off-topic essays, gibberish, and paraphrased
answers (a sketch of these probes is given at the end of this section).
We provide empirical evidence that a well-performing automated
essay scorer, with a high human-machine agreement rate, is not
necessarily ready for operational deployment, since it fails to
perform well on two out of three test scenarios. In addition, we
discuss strategies to improve the performance of the system. This
paper begins with an explanation of the quantitative performance
acceptance criteria for an automated scoring model from [1]. We
then present the experimental settings, including the training
algorithm and essay features, used to create the model. Afterwards,
we discuss the experimental results, the analysis of model
performance, the reliability and validity evaluation, and strategies
for improvement; finally, we conclude our work.
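To make the three input scenarios concrete, the sketch below shows one way such probe inputs could be constructed for a trained scorer. The helper names and construction choices are hypothetical illustrations, not the exact procedure used in our experiments.

# A minimal sketch of the three probe scenarios; the function names and
# construction choices are hypothetical.
import random

def gibberish_essay(vocabulary, length=300):
    # A random word salad with no syntactic or topical structure.
    return " ".join(random.choices(vocabulary, k=length))

def off_topic_essay(other_prompt_essays):
    # A fluent essay, but one written in response to a different prompt.
    return random.choice(other_prompt_essays)

def paraphrased_essay(original, paraphrase_fn):
    # A meaning-preserving rewrite of a genuine answer; paraphrase_fn could
    # be lexical substitution or a paraphrasing model.
    return paraphrase_fn(original)

# Each probe is fed to the trained scorer, and its predicted score is
# compared with the score a careful human rater would assign.

Under this setup, a reliable and valid scorer would be expected to assign low scores to the first two probes and a score close to the original for the paraphrased answer.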