Experiment: Which AI chatbots know Estonian language and culture?

ERR posed questions about the Estonian language and culture to five of the most popular large language models and compiled a ranking based on their responses. Grok provided the sharpest answers, while Estonian researchers are also satisfied with the performance of a model trained specifically on Estonian-language material.
"When Arno arrived at the schoolhouse with his father, the schoolhouse had already burned down." That is the kind of answer one can currently get from a language model trained on Estonian-language materials. We spoke with researchers and conducted a small experiment to gauge how much time-pressed basic school students can rely on technology when questions concern the specifics of the Estonian language and culture.
The sample for the mini-experiment included five widely used language models: Grok, Claude Sonnet, Gemini, ChatGPT and Mistral. In addition, we tested a chatbot being developed by Tallinn University of Technology, the University of Tartu and the Institute of the Estonian Language, which is trained specifically on open Estonian-language material. We submitted questions to the free versions of the models between February 9 and 13. For a more detailed description of the models, click on the graphic.
All of the language models answered a total of 20 questions, divided into two categories: Estonian language and Estonian cultural history. In drafting the questions, we aimed to cover as broad a range of topics as possible.
In the cultural history category, we asked about Juri Lotman's concept of the "semiosphere," for example, or requested that the models complete the sentence, "When Arno arrived at the schoolhouse with his father..." (The first sentence in Estonian writer Oskar Luts' novel "Kevade" ["Spring"] – ed.)
For the language-related questions, we tested the models' knowledge of dialects and also asked, for instance, how many vowels are in the word jäääär (Estonian word meaning edge of ice – ed.).
According to Kairit Sirts, associate professor of language technology at the Institute of Computer Science at the University of Tartu, the results were surprising in several respects. While the artificial intelligence barometer created by researchers also shows that the models are at a similar level, Sirts did not expect Grok's sensitivity to Estonian. Unlike several competitors, the model knew how to say "vacuum cleaner" in the Võro dialect: pudsunudsija, of course.
"It is difficult to say why Grok knows Estonian better. Since it uses posts from X in its training, perhaps some language examples reached the model that way," Sirts explained. In complete honesty, however, she said it is not fully known what data commercial models are trained on.
"The process of training models takes place in several stages. First, they are trained on texts. This is followed by a post-training process, during which the models are shown various tasks: how to respond and how to follow a user's instructions. This is a very important stage, but in the case of commercial models, we actually know even less about it," Sirts added.
In the cultural history questions, the results differed by only a few answers. According to Sirts, combining knowledge and forming connections is a language-independent skill that carries over from English-language training. "The models have probably been exposed to more cultural knowledge explicitly," she said. However, they have seen less meta-information specific to the Estonian language.
The wording of the questions also plays a major role in the results. "At least one question essentially required counting: how many vowels are in the word jäääär? That is more of a mathematical question. Models vary greatly in their ability to perform this kind of logical reasoning and it is not tied to any particular language," Sirts noted.
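For ordinary software, the task Sirts describes is trivial, which underlines her point that the difficulty for language models lies in the reasoning, not the language. A minimal Python sketch (the vowel set and function name are our own illustration, not part of the experiment):

```python
# The nine Estonian vowels (a minimal illustration, not from the experiment).
ESTONIAN_VOWELS = set("aeiouõäöü")

def count_vowels(word: str) -> int:
    """Count how many characters of `word` are Estonian vowels."""
    return sum(1 for ch in word.lower() if ch in ESTONIAN_VOWELS)

print(count_vowels("jäääär"))  # → 4 (the four consecutive ä's)
```

A deterministic program counts characters exactly; a language model instead has to reason over tokens, which is why, as Sirts notes, performance on such questions varies greatly between models.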
Tanel Alumäe, associate professor of speech technology at Tallinn University of Technology and head of its language technology lab, said an interesting phenomenon emerges in the models' language skills: data from larger languages also helps improve quality in smaller languages. "If you look at the models released over the past six months, they are already very good in Estonian. However, they all still make mistakes, and far more than a human does," Alumäe added.
Alumäe last tested the models' language and cultural awareness in the fall, when he evaluated their ability to decline words, find terms based on definitions, correct grammar, summarize texts and extract information. "In general, large commercial models struggle most with tasks that require producing very precise and correct grammar. Top models, however, handle declension very well, for example," Alumäe explained.
Open data Llama
Tallinn University of Technology, the University of Tartu and the Institute of the Estonian Language are developing a model based on open data that is trained specifically on Estonian-language materials. Training on open data means that all of the model's training material is public and verifiable. In turn, this means that the volume of data the model can draw on when generating responses is significantly more limited — in the case of the Estonian researchers' model, about 100 times smaller than that of leading commercial models.
It is therefore not surprising that the results of the Estonian-trained Llama model lagged behind the others. "The result is honestly very encouraging because we plan to start training a larger Llama model soon on the same principles, with 70 billion parameters, although even that is not actually as large as commercial models," Kairit Sirts explained.
According to Sirts, the model is currently being trained on data from the Estonian National Corpus managed by the Institute of the Estonian Language. In addition, the researchers have access to web data that is also used by other developers of open language models.
Unfortunately, this may not be enough to answer even simpler cultural history questions. "For a model to know what happens when Arno gets to the schoolhouse, it would have to have read the novel 'Kevade.' Books are complicated, however, because they are subject to copyright," Sirts explained. As a result, many cornerstone works of Estonian culture are not included in the Estonian-language training materials available to researchers.
In Sirts' assessment, it is also unrealistic for a model trained on open Estonian-language data to reach the same level as commercial models. "We want to do things openly and honestly. We are limited by the data that are available to us."
Aim not to compete with industry giants
According to the researcher, the goal is not to compete with tech giants but to create an open Estonian-language model that can also be downloaded. "With an open model running on your own server, it is possible to guarantee that data does not leave the building. This is important in scenarios involving sensitive or confidential information," Sirts said.
Tanel Alumäe added that dependence on companies from major powers must be reduced. "There are many tasks where we do not want to send our data to a server in the United States or China. In that case, you can take this free model and use it on your own closed server, so that the data do not leak anywhere," he said.
In Kairit Sirts' view, it is also necessary to build and maintain expertise. "The reason we work with these models is that the training process is technically quite complex. It is important that large technology companies do not dictate the terms and prices. If we have the capability to take an open model that is currently the best available and improve its Estonian-language performance, then we have some degree of control over the situation," Sirts added.
--
Editor: Marcus Turovski
