University of Tartu thesis: transfer learning boosts Estonian AI models

A doctoral thesis defended at the University of Tartu shows that effective Estonian-language artificial intelligence models can be built despite limited data by using cross-lingual transfer learning.
Modern language models require vast amounts of text, but Estonian, like many small languages, lacks sufficient digital data. This raises a key challenge: how to build capable models with scarce training data.
According to the author of the recently defended thesis, Hele-Andra Kuulmets, the solution lies not only in collecting more data but in combining existing resources more intelligently.
Most language model methods have been developed for English and cannot be directly applied to smaller languages. This is where transfer learning comes in — reusing knowledge learned from one language to improve models in another. When models are trained on multiple languages, their internal representations begin to align, allowing what is learned in one language to support understanding in others.

Kuulmets' results show that this works well in practice. The best-performing models used multilingual data in two stages: large-scale pretraining followed by fine-tuning for a specific language. These clearly outperformed models trained only on Estonian data.
The study found that large models, which are mostly trained on English, can transfer knowledge across languages even when data in the target language is scarce. Even a small amount of additional Estonian training can significantly improve performance. Such models can also be enhanced with synthetic data, including machine translations or text generated by other models. Even English-language instructions can help improve Estonian understanding.
A major obstacle for smaller languages is not only a lack of data, but also the lack of proper benchmarks for testing models. As part of the work, Kuulmets created a new evaluation dataset for four Finno-Ugric languages: Estonian, Võru, Livonian and Komi.
The thesis concludes that intelligently combining multilingual resources is currently the most effective way to develop language models for small languages.
The dissertation, titled "Cross-Lingual Transfer Learning and Evaluation in Low-Resource Settings," was supervised by Mark Fišel, with opponents Barbara Plank from the Ludwig Maximilian University of Munich and Jindřich Helcl from the University of Oslo.
--
Editor: Jaan-Juhan Oidermaa, Argo Ideon









