Estonia looking into AI grading for native language exams

The Education and Youth Board (Harno) is discussing with Tallinn University the possibility of using artificial intelligence to grade final mother tongue exams in the future.
A study conducted by the university's researchers showed that language models assign grades on the exam fairly similar to those given by humans.
The study's lead, Merilin Aruvee, a lecturer of L1 didactics and applied linguistics, has been involved in pilot projects for ninth-grade e-exams. Together with her doctoral student and junior researcher, Katarin Leppik, Aruvee developed a ninth-grade final exam essay, or writing task, and created new assessment criteria.
From there arose the question: could artificial intelligence be included in the grading?
"We described the criteria for assessing exam papers more precisely and examined assessment models elsewhere in the world. Harno assessors — experienced native language teachers — tried grading according to those criteria, we collected feedback and adjusted the model based on the teachers' recommendations. When artificial intelligence lecturer Andres Karjus began working at the Institute of Humanities, we started thinking: what if artificial intelligence graded students' exam papers on the basis of these criteria?" Aruvee said.
Inspiration also came from Kais Allkivi, who works at Tallinn University's Institute of Digital Technologies and has dealt with assessing the language proficiency of learners of Estonian as a second language using machine learning.
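For illustration, criteria-based grading with a language model can be as simple as embedding the assessment criteria in a prompt. The sketch below is hypothetical: the article does not describe the researchers' actual prompt, criteria wording, scale, or model, and the `ask_llm` helper is an assumed placeholder.

```python
# Illustrative sketch of criteria-based grading with a large language model.
# The criteria wording, scale, and ask_llm helper are hypothetical; the
# article does not describe the researchers' actual prompt or model.

CRITERIA = [
    "Use of the source text (quoting and paraphrasing marked and accurate)",
    "Formulation of the problem in the introduction",
    "Coherence of the text as a whole",
]

def build_grading_prompt(essay: str) -> str:
    """Assemble a prompt asking the model to score each criterion separately."""
    criteria_list = "\n".join(f"- {c}" for c in CRITERIA)
    return (
        "You are grading a ninth-grade mother tongue exam essay.\n"
        "Score each criterion from 0 to 5 and give a one-sentence justification:\n"
        f"{criteria_list}\n\nEssay:\n{essay}"
    )

# In use, the prompt would be sent to a language model, for example:
# scores = ask_llm(build_grading_prompt(essay_text))  # ask_llm is a hypothetical helper
```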

Model offers similar grades to humans
The researchers have been analyzing the 2024 and 2025 pilot exam papers in anonymized form, with only the text and the grade assigned by the assessors visible.
The results showed that language models assign grades fairly similar to those given by humans.
"While variation between people can be quite large, in 60 percent of cases, the language model stays within the same range in which people vary among themselves. This means that the grades given by the models do not differ excessively from those proposed by humans," Aruvee explained.
The use of artificial intelligence for grading is linked to changes in the basic school exams, which will become e-exams from 2027.
Among other innovations, the new exam will also require the use of a source text. In the researchers' model, the extent and principles of using the source text are clearly formulated to reduce subjectivity in the exam.
"According to our analysis, the criterion of using the source text stood out most clearly: it can be assessed very unambiguously. At the same time, a machine could assist a human here: the machine can compare the source text and the student's text and show which sentence parts have been copied directly from the source text or how the student has paraphrased it. In the case of the source text criterion, the machine's and the human's grades matched quite well," Aruvee explained.

Humans will not disappear
The lecturer said that while the language model is well suited to assessing the use of the source text, there are also criteria where a human eye may be necessary. These include the formulation of the problem in the introduction, the persuasiveness of a paragraph, or tying the text into a coherent whole in the conclusion.
"Personally, it seems to me that writing the conclusion is precisely the place where a human should rather give their subjective assessment. The question is whether the text creates, in their view, a meaningful and coherent whole. Here the assessor's experience counts, which is shaped by previously read texts and a general knowledge of ninth-grade students' writing level," she said.
Aruvee explained that the researchers' initial idea was to see how large language models behave in general and what they are capable of, since it is somewhat accidental that they can use Estonian at all.
"Language models mostly contain texts taken from the internet, but they are not familiar at all with a learner's text. A learner's text is different; in the case of a native-language learner it contains its own kind of errors, for example a very long and uninterrupted stream of thought where it is clear that the student's thinking is very sincere, even cute, but the train of thought is not divided into clear clauses, rather as if it had flowed directly from the head onto paper. The fact that a language model can assess such a text quite adequately was a rather big surprise for us," Aruvee said.
It must also be kept in mind that neither European Union directives nor general good practice allows a machine to make the final decision on a person's text.

Harno to see final results
Later this month, Aruvee will meet with Harno to present the results of the study.
"Before that, we cannot yet take a position on the study," a Harno spokesperson told ERR.
Aruvee said that, as a researcher, her task is to describe the situation and offer solutions.
"I can say: we did this work, these are the results, now we could think further. I am glad if these possibilities can be discussed publicly," she said.
"It is true that today's learner no longer exists only in the space of edited texts. Their writing skills need increasingly more support and precise guidance. During the pilot, for example, we learned from the ninth-grade material that the use of the source text still requires practice. It is necessary to develop the methodology of writing based on a source text and to create teaching materials that guide the learner to do this better," Aruvee said.
--
Editor: Helen Wright