Can AI Mark History Essays? A Look at the Evidence
With the rise of AI in education, it was only a matter of time before someone asked: can artificial intelligence reliably mark GCSE History papers?
To find out, we uploaded (and in one case typed out manually) a series of GCSE responses from 2024 that had been recalled from Edexcel. We then ran them through a range of AI tools, including Tutor2U’s new AI platform and MarkMe.ai, as well as a few others offering automated assessment services. The aim? To see how well each tool’s marks aligned with Edexcel’s – and whether this technology might one day support departmental marking at scale.
Here’s what we found.
The Data: 10 Scripts, Marked by AI and Edexcel
| Test | Question | Edexcel Mark | Tutor2U AI Mark | Notes |
|------|----------|--------------|-----------------|-------|
| 1 | Medicine paper, 8-mark source Q (Student A) | 4/8 | 4/8 | Matched perfectly. |
| 2 | Medicine paper, 8-mark source Q (Student B) | 7/8 | 5/8 | AI was noticeably harsher. |
| 3 | Medicine paper, 12-mark explain why Q (Student C) | 10/12 | 8/12 | AI wrongly claimed the answer didn’t go beyond stimulus points. |
| 4 | Elizabeth paper, 12-mark explain why Q (Student D) | 8/12 | 8/12 | Consistent with Edexcel. |
| 5 | Elizabeth paper, 12-mark explain why Q (Student E) | 12/12 | 11/12 | Slightly under-marked. |
| 6 | Germany paper, 16-mark interpretations Q (Student F) | 14/16 | 15/16 | Typed version (AI couldn’t read handwriting). |
| 7 | Germany paper, 16-mark interpretations Q (Student G) | 10/16 | 11/16 | Minor discrepancy. |
| 8 | Cold War paper, 8-mark narrative account Q (Student G) | 8/8 | 7/8 | Slightly under-marked. |
| 9 | Cold War paper, 8-mark narrative account Q (Student H) | 6/8 | 4/8 | AI was significantly harsher. |
| 10 | Medicine paper, 16-mark essay Q (Student H) | 16/16 | 12/16 | AI again failed to recognise when a student had moved beyond the stimulus material. |
Credit to my colleague, Matt Duncan, for gathering the data and running the analysis.
So, How Did the AI Do?
Overall, the results were mixed. The AI was capable of broadly accurate marking on lower-tariff questions such as the 8-markers. But on higher-tariff questions – particularly the 12- and 16-mark essays – it often under-awarded marks, especially when judging whether a student had “gone beyond the stimulus points”.
This seems to be a critical blind spot. In both Test 3 and Test 10, for example, the AI claimed that the response didn’t extend beyond the provided stimulus points – when in fact it clearly did. It raises a major concern: if AI is to support high-stakes marking, it must be able to detect nuance and subtle argumentation, particularly in extended historical analysis.
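To put “mixed” into numbers, here is a minimal sketch in Python that tallies the gap between each AI mark and Edexcel’s, using only the ten results from the table above (nothing here is assumed beyond those figures):

```python
# Marks from the results table: (Edexcel mark, Tutor2U AI mark) per test.
marks = [
    (4, 4), (7, 5), (10, 8), (8, 8), (12, 11),
    (14, 15), (10, 11), (8, 7), (6, 4), (16, 12),
]

diffs = [ai - edexcel for edexcel, ai in marks]
mean_signed = sum(diffs) / len(diffs)               # negative => AI harsher
mean_abs = sum(abs(d) for d in diffs) / len(diffs)  # typical size of the gap

print(f"Mean signed deviation: {mean_signed:+.1f} marks")   # -1.0
print(f"Mean absolute deviation: {mean_abs:.1f} marks")     # 1.4
```

On this sample, the AI sat a full mark below Edexcel on average, with a typical gap of 1.4 marks – small on the 8-markers, but widened by the larger shortfalls on the extended answers.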
Reviews of Each Platform
Tutor2U AI
Cost: £100 for 2,500 answers
Pros: Very affordable. Quick turnaround.
Cons: Too harsh on extended responses. Limited written feedback.
Verdict: Promising for small-mark questions, but not yet reliable for full essays.
TopMarks.ai
Cost: £40/month for just 90 scripts
Verdict: Too expensive for school departments to use at scale. Not pursued further.
Tilf.io
Cost: £162/year for 250 scripts/month
Verdict: Pricing not flexible enough for term-time marking peaks. No trial available.
ChatGPT
Verdict: Can work as a one-off tool for typed work, but unable to read handwriting and too clunky to scale for classroom or departmental use.
MarkMe.ai
Cost: £48/year for unlimited marking (claimed)
Trial Result: Marked the same 16-mark Germany answer from Test 6 and awarded it a “Grade 9” – though without giving a specific mark.
Pros: Detailed feedback highlighting strengths and areas for development.
Cons: Numerical marks unclear, and trial version was very limited.
Verdict: Strong potential for formative feedback, but questionable reliability on summative scores.
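For anyone weighing up the three platforms with published per-script prices, a rough back-of-the-envelope comparison makes the pricing verdicts concrete. This is a sketch only: it assumes the quoted allowances are fully used, annualises TopMarks.ai’s monthly price for comparison, and leaves out MarkMe.ai, whose claimed unlimited plan can’t be costed per script.

```python
# Rough cost per marked script, from the prices quoted above.
# Assumes the full allowance is actually used over the period paid for.
platforms = {
    "Tutor2U AI":  (100.0, 2500),         # £100 for 2,500 answers
    "TopMarks.ai": (40.0 * 12, 90 * 12),  # £40/month for 90 scripts/month
    "Tilf.io":     (162.0, 250 * 12),     # £162/year for 250 scripts/month
}

for name, (cost_gbp, scripts) in platforms.items():
    print(f"{name}: £{cost_gbp / scripts:.3f} per script")
# Tutor2U AI: £0.040 | TopMarks.ai: £0.444 | Tilf.io: £0.054
```

On these assumptions, TopMarks.ai works out at roughly ten times the per-script cost of the other two – which squares with the verdict that it is too expensive for departments to use at scale.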
What Did We Learn?
AI is fairly good at short-answer marking.
For source analysis and narrative accounts, the AI tools were reasonably accurate. In most cases, the AI marks were within 1 or 2 marks of Edexcel’s.

Extended responses remain a challenge.
The real issue arises with higher-mark questions. The AI too often fails to credit students for going beyond the stimulus points – especially where the argument is woven through the response rather than explicitly stated.

Feedback matters.
While Tutor2U’s AI was quick, its feedback was sparse. MarkMe.ai, by contrast, offered much more useful insights for students – even if its scoring lacked clarity. For teaching and learning purposes, this may be the more valuable trait.
Should We Be Using AI to Mark History Essays?
Right now, not as a replacement for teachers or exam boards. The technology simply isn’t ready to assess the complexity, nuance, and interpretation that history essays demand – particularly at the top end.
However, as a support tool, AI does show promise. It could help with:
Initial formative assessments
Feedback for draft answers
Bulk marking of shorter questions
Self-marking practice tasks for students
In the long term, the combination of teacher expertise plus AI efficiency might be the sweet spot. But for now, AI should be seen as a supplement to human judgement – not a substitute.