University of Tartu thesis: transfer learning boosts Estonian AI models

A doctoral thesis defended at the University of Tartu shows that effective Estonian-language artificial intelligence models can be built despite limited data by using cross-lingual transfer learning.
Modern language models require vast amounts of text, but Estonian, like many small languages, lacks sufficient digital data. This raises a key challenge: how to build capable models with scarce training data.
According to the author of the recently defended thesis, Hele-Andra Kuulmets, the solution lies not only in collecting more data but in combining existing resources more intelligently.
Most language model methods have been developed for English and cannot be directly applied to smaller languages. This is where transfer learning comes in — reusing knowledge learned from one language to improve models in another. When models are trained on multiple languages, their internal representations begin to align, allowing what is learned in one language to support understanding in others.

Kuulmets' results show that this works well in practice. The best-performing models used multilingual data in two stages: large-scale pretraining followed by fine-tuning for a specific language. These clearly outperformed models trained only on Estonian data.
The study found that large models, which are mostly trained on English, can transfer knowledge across languages even when data in the target language is scarce. Even a small amount of additional Estonian training can significantly improve performance. Such models can also be enhanced with synthetic data, including machine translations or text generated by other models. Even English-language instructions can help improve Estonian understanding.
A major obstacle for smaller languages is not only a lack of data, but also the lack of proper benchmarks for testing models. As part of the work, Kuulmets created a new evaluation dataset for four Finno-Ugric languages: Estonian, Võru, Livonian and Komi.
The thesis concludes that intelligently combining multilingual resources is currently the most effective way to develop language models for small languages.
The dissertation, titled "Cross-Lingual Transfer Learning and Evaluation in Low-Resource Settings," was supervised by Mark Fišel, with opponents Barbara Plank from the Ludwig Maximilian University of Munich and Jindřich Helcl from the University of Oslo.
--
Editor: Jaan-Juhan Oidermaa, Argo Ideon









