tl;dr-ELT

too long; didn’t read – ELT

Both teachers & learners appreciate the value of writing as a way to develop ideas, practise language & show what someone can really do in English. However, when it comes to assessing writing, especially at scale, we often fall back on shortcuts such as length, accuracy & vocabulary range, which leads me to wonder whether technology is now capable of moving beyond these surface cues to provide more meaningful assessment.

A recent study in Scientific Reports explores exactly this by testing a new AI-based system designed to judge English writing more consistently across different essay topics, not just familiar ones.

The study

Ren, Fan & Wang (2025) introduce a new Automated Essay Scoring model called HFC-AES (Hybrid Feature-based Cross-Prompt Automated Essay Scoring). Their focus is a long-standing weakness in automated marking: essays written to unseen or unfamiliar prompts are often scored unreliably.

To investigate this, the researchers trained & tested their system on large collections of learner essays, including the ASAP dataset as well as TOEFL11 & ICLE. The essays were written by non-native English users responding to multiple prompts, with human examiner scores used as the reference point.

The model works in two main stages:

  • A topic-independent stage that looks at overall writing quality using a mix of simple indicators (like sentence length or error patterns) & deeper neural analysis
  • A topic-related stage that checks how closely the essay content actually matches the task, using attention-based modelling to track relevance & organisation
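The paper's actual model is neural, but the two-stage idea can be illustrated with a deliberately simple sketch: hand-crafted surface features stand in for the topic-independent stage, and a crude word-overlap score stands in for the attention-based topic-related stage. Every function name, weight & threshold below is invented for illustration, not taken from the study.

```python
import re
from collections import Counter

def surface_features(essay):
    """Toy topic-independent stage: simple indicators of overall quality."""
    words = re.findall(r"[a-zA-Z']+", essay.lower())
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    return {
        "word_count": len(words),
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        "type_token_ratio": len(set(words)) / max(len(words), 1),
    }

def relevance_score(essay, prompt):
    """Toy topic-related stage: share of essay words that also appear in
    the prompt (a crude stand-in for attention-based relevance modelling)."""
    essay_words = Counter(re.findall(r"[a-zA-Z']+", essay.lower()))
    prompt_words = set(re.findall(r"[a-zA-Z']+", prompt.lower()))
    overlap = sum(c for w, c in essay_words.items() if w in prompt_words)
    return overlap / max(sum(essay_words.values()), 1)

def toy_score(essay, prompt):
    """Combine both stages with invented weights into a 0-5 band."""
    f = surface_features(essay)
    quality = min(f["word_count"] / 300, 1.0) * 0.5 + f["type_token_ratio"] * 0.5
    return round(5 * (0.6 * quality + 0.4 * relevance_score(essay, prompt)), 1)
```

The point of the sketch is the division of labour: one component judges the writing regardless of topic, the other checks whether the essay actually answers the task — which is why off-topic but fluent writing can still lose marks.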

Performance was measured using Quadratic Weighted Kappa (QWK), a common way of checking how closely automated scores match human ones.
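For readers curious about the arithmetic, QWK can be computed directly from two sets of integer scores: disagreements are penalised by the square of their distance, relative to what chance agreement would produce. The plain-Python sketch below illustrates the standard formula; it is not code from the study, & the example scores are invented.

```python
from collections import Counter

def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """QWK between two raters' integer scores on the same essays."""
    n = len(human)
    observed = Counter(zip(human, machine))   # joint score distribution
    hist_h = Counter(human)                   # human score marginals
    hist_m = Counter(machine)                 # machine score marginals
    span = (max_score - min_score) ** 2 or 1
    num = den = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            w = (i - j) ** 2 / span           # quadratic disagreement weight
            num += w * observed.get((i, j), 0) / n
            den += w * (hist_h.get(i, 0) / n) * (hist_m.get(j, 0) / n)
    return 1.0 - num / den if den else 1.0

human   = [2, 3, 4, 4, 5, 3]   # invented example scores
machine = [2, 3, 4, 3, 5, 3]   # one near-miss out of six essays
print(round(quadratic_weighted_kappa(human, machine, 2, 5), 3))  # → 0.909
```

A QWK of 1.0 means perfect agreement & values near 0 mean no better than chance, so the high figures reported in the study indicate the automated scores track human examiners closely.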

The findings

The results are striking. On average, the system’s scores matched human examiners very closely, and it performed better than several well-known automated marking tools, including ones based on BERT & GPT (two well-known types of AI language model, used to analyse & generate text respectively).

Some key takeaways:

  • Removing features linked to text organisation led to a clear drop in accuracy, showing that structure really matters
  • Features connected to task relevance were especially important for argumentative essays
  • Attention mechanisms helped most with essays that involved abstract thinking, weighing options or developing a position
  • The system was fast enough for practical use, scoring roughly 60–70 essays per minute

That said, the model still struggled with subtle elements of writing. Rhetorical questions, shifts in tone or carefully balanced arguments were sometimes undervalued, while fluent but shallow responses could receive higher scores than a human examiner might give.

Why this matters for ELT

What makes this study interesting for me isn’t the tech itself, but what it shows about writing. The model performs best when it can track coherence, organisation & relevance: the same things we often prioritise when marking by hand.

It also highlights a familiar tension. Accuracy & fluency are relatively easy to measure, for humans & machines alike. Depth of argument, stance & originality are much harder. Even advanced AI systems still find these aspects challenging.

To put it simply, imagine two essays with similar grammar & vocabulary. One develops an idea logically & stays focused on the task. The other sounds fluent but goes in circles. Systems like HFC-AES are becoming better at telling the difference, though they are not there yet.

Teacher Takeaways

  • Automated scoring is moving beyond grammar & word counts towards organisation & relevance
  • Task fulfilment is now a central focus for newer AI-based assessment tools
  • These systems can support feedback & large-scale marking, but they still miss nuance, voice & creativity

Rather than replacing teachers, research like this invites us to reflect on what we value most in student writing, & which aspects of that are hardest to capture, whether by humans or machines.

How do you decide what matters most when you assess writing in your classroom?


