Indrek Ibrus: AI needs glasses, Estonian glasses

An audiovisual artificial intelligence model that is unfamiliar with small cultures may turn out to be the most powerful force for cultural homogenization of our time, writes Indrek Ibrus in a commentary originally published in Sirp.
Since the start of the year, large language models have been a hot topic. People worry whether they can properly understand Estonian and whether they're capable of speaking it. There's a fear that if AI-powered devices — refrigerators or other gadgets — use Estonian poorly or in overly simplified ways, the language itself will eventually deteriorate. That's why these models are being studied, tested and continuously retrained. The Estonian Language Institute is deeply involved, as are university researchers and Ministry of Justice officials working to compile training corpora, among others.
The government expects a major language development plan for the era of artificial intelligence by the end of the year. Hopefully, it will be delivered.
But is that enough? Let me offer a thought experiment. The Song Festival has just concluded — an event widely regarded as a cornerstone of Estonian identity, a core element of Estonian culture. Could an AI truly grasp the essence of such an event if it's only fed written material: song lyrics, the event program, rehearsal guides, newspaper reviews, interviews, essays written over the years and the like?
Some understanding may be achieved. But ultimately, that's still akin to Plato's shadows on the cave wall — a dim understanding of life's rich tapestry. Without video footage of the rehearsals, the parade, the festival itself; without experiencing the atmosphere of the Song Festival Grounds, the density of the crowds and their reactions, photos of raincoats, the mud at the field's edge, food stalls or the variety of stylized folk costumes — there's no way to understand the network of meanings surrounding the Song Festival.
And if you don't understand that, you can't meaningfully talk about the essence of Estonia or contribute to its culture and society. The Song Festival is just one example. It's difficult to speak fully and meaningfully about anything when you're half-blind. And that's exactly what large language models are.
In this light, a comment made a few months ago by Yann LeCun, one of the godfathers of AI, at an Nvidia conference was especially striking — almost shocking. He said that large language models no longer interest him much.
Why? Imagine trying to understand a soccer match but being allowed to read only a written recap in the newspaper. You never see the field, the passes or the goals. That's the position large language models are in when they "read" the world solely through text. LeCun emphasizes that while text conveys information, it omits all the visible and audible details — how the ball spins, how the crowd reacts, how a player falls. If a model can't see or hear, it must guess those details, often inaccurately, because it lacks real-world experience.
Young children learn differently. They observe and listen to millions of small fragments — facial expressions, the bounce of a ball, the patter of rain — and mentally stitch them together into a coherent worldview. Without reading a single syllable, they develop a rich, connected and detailed understanding of how the world works in their first few years of life.
LeCun calls this internal map in our minds a "world model." Such a model allows a child to predict what will happen if they stretch out their arm to hit a ball. You can't build that kind of world using text alone — you need video and sound, which provide a continuous picture of what's happening and allow a computer to form the same types of hypotheses and plans that humans create subconsciously.
To be a bit more technical: large language models (LLMs) work with sequences of discrete symbols, all derived from traditional writing practices. A tokenizer breaks each sentence into subword units drawn from a fixed vocabulary. In the process, however, the continuity of life is lost — the interdependent relationships between motion, speed, light and shadow, and sound vibrations disappear.
As a result, the model must "guess" variables that fall outside the text: how quickly a cup cools on a table, how long a pause lasts or how a change in a material's friction coefficient affects the next event. Plugging these gaps through statistical interpolation causes hallucinations and wastes computing power, because the model has no direct access to physical reality.
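To make that discreteness concrete, here is a minimal sketch of greedy subword tokenization, in the spirit of what real tokenizers do. The sentence, the five-entry vocabulary and the function below are invented for illustration; a production tokenizer learns tens of thousands of such units and falls back to raw bytes instead of raising an error.

```python
# Toy tokenizer: text becomes a sequence of discrete integer IDs,
# and everything continuous about the scene is already gone.
sentence = "Laulupidu algab vihmas"  # "The Song Festival begins in the rain"

# Hypothetical five-entry subword inventory, for illustration only.
vocab = {"Laulu": 0, "pidu": 1, " algab": 2, " vihma": 3, "s": 4}

def tokenize(text: str, vocab: dict[str, int]) -> list[int]:
    """Greedy longest-match segmentation into known subword units."""
    ids, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in vocab:
                ids.append(vocab[text[i:j]])
                i = j
                break
        else:  # a real tokenizer would fall back to raw bytes here
            raise ValueError(f"no subword unit covers {text[i:]!r}")
    return ids

print(tokenize(sentence, vocab))  # [0, 1, 2, 3, 4]
# The model receives only these integers: no rain, no light, no motion.
```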
That's why LeCun argues that the role of a neural network should not be to predict the next token in a sentence, but the next latent state: a compressed representation of all visible and audible reality. It's this continuity of latent space that enables a world model to simulate "what-if" scenarios more like a physics engine in a video game than a text-generation algorithm.
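The snippet below is a hedged sketch of the contrast LeCun draws, not any published architecture: instead of predicting the next token, an encoder compresses each observation and a predictor rolls the latent state forward in time. All weights, dimensions and names are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_LATENT = 1024, 64  # raw frame features vs. compressed state

# Random, untrained "encoder" and "predictor" weights, purely illustrative.
W_enc = rng.normal(size=(D_LATENT, D_OBS)) / np.sqrt(D_OBS)
W_pred = rng.normal(size=(D_LATENT, D_LATENT)) / np.sqrt(D_LATENT)

def encode(frame: np.ndarray) -> np.ndarray:
    """Compress an observation (e.g. a video frame) into a latent state."""
    return np.tanh(W_enc @ frame)

def predict_next(z: np.ndarray) -> np.ndarray:
    """Step the world model one tick forward in continuous latent space."""
    return np.tanh(W_pred @ z)

frame_t = rng.normal(size=D_OBS)  # stand-in for the video frame at time t
z = encode(frame_t)
for _ in range(10):               # a cheap "what-if" rollout, ten steps ahead
    z = predict_next(z)

# Training would pull predict_next(encode(frame_t)) toward the encoding of
# the next frame; because the space is continuous, rollouts behave more
# like stepping a physics engine than finishing a sentence.
```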
Another critical flaw in LLMs lies in their heuristic search mechanisms: to solve complex problems, the model generates thousands of token sequences and evaluates them, much like a programmer trying random lines of code until something works.
But the larger the solution space, the more explosively the combinatorics grow. Without internalized rules for how the world behaves, an LLM can't preemptively discard improbable trajectories — each attempt is a shot in the dark.
Video-based learning offers a solution. A sensory chain allows for energy- or contrast-based evaluation functions, which sharply reduce the search space by filtering out outcomes that violate the laws of physics or social norms early in the process. This swaps brute-force statistics for actual understanding; the algorithm doesn't just finish a sentence — it knows why the floor is wet and sticky if the cat knocks a bottle of Fanta off the table.
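Purely as a toy, that filtering idea fits in a few lines: an energy function scores each candidate next state, and high-energy (implausible) candidates are discarded before any expensive search begins. Real systems would learn the energy function from video; the distance-based rule and all numbers below are invented.

```python
import numpy as np

rng = np.random.default_rng(1)

def energy(state: np.ndarray, candidate: np.ndarray) -> float:
    """Low energy = plausible continuation. Toy rule: plausible
    successors stay close to the current state."""
    return float(np.sum((candidate - state) ** 2))

state = rng.normal(size=8)
candidates = [state + 0.1 * rng.normal(size=8) for _ in range(990)]  # plausible
candidates += [5.0 * rng.normal(size=8) for _ in range(10)]          # wild guesses

THRESHOLD = 2.0
survivors = [c for c in candidates if energy(state, c) < THRESHOLD]
print(f"kept {len(survivors)} of {len(candidates)} candidates")

# Instead of evaluating every trajectory blindly (the LLM's shot in the
# dark), the model prunes most of the search space before searching it.
```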
In other words, LeCun and his colleagues around the world are saying that AI reliant on words alone is like a blind film critic. For a machine to understand why a ball bounces the way it does or why the crowd gasps in unison, it must be shown the game itself — movement, colors and sounds — and must learn from those inputs just as we do from childhood.
Text can "say" what's written, but it "doesn't know" what happens when someone climbs onto the conductor's podium during the Song Festival. Video is essential for perceiving force, gravity, temperature and emotional modulation. These signals give the so-called world model its weight and scope, enabling it to predict sentences — not just complete them.
Let's be clear: this isn't just about physics, but about understanding complex relationships, which also benefit from latent-space modeling. Interestingly, LeCun finds support in the cultural semiotics of Juri Lotman to explain this point.
Lotman argued that every culture is a conglomerate of sign systems — a unique translation engine that takes a text created in one medium (whether auditory, visual, spatial or verbal) and converts it into another. Through translation, something new is created, yet every culture itself is the result of such interconnected, ongoing translations. Every culture thus has its own world model — a polymodal symphony where no single voice can carry the whole.
An artificial intelligence that can meaningfully serve Estonian culture and society — that understands its continuity and internal complexities — will sooner or later need video, photo and audio material as input.
Estonia's problem is that no one — no institution or agency — is prepared to provide these other modalities. No goals have been set. No groundwork has been laid. The Estonian Language Institute deals with language. The Ministry of Justice assists in building training corpora. And that's it.
Only at BFM (Baltic Film, Media and Arts School) are there experiments underway on how to create knowledge graphs from audiovisual material so they can be used to more effectively train AI. These graphs aim specifically to link information across different modalities.
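As an outside illustration only (the BFM experiments themselves are not publicly described in detail), a cross-modal knowledge graph can be as simple as typed nodes for items in each modality and edges asserting that they concern the same event. All labels below are invented.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    id: str
    modality: str  # "video", "audio", "image" or "text"
    label: str

@dataclass(frozen=True)
class Edge:
    src: str
    rel: str
    dst: str

# Invented examples of Song Festival material in three modalities.
nodes = [
    Node("v1", "video", "joint choir on stage, wide shot"),
    Node("a1", "audio", "joint choir recording of the same song"),
    Node("t1", "text",  "program booklet entry for that performance"),
]
edges = [
    Edge("v1", "records_same_event_as", "a1"),
    Edge("t1", "describes", "v1"),
]

# Such links let a model ground the words of a program or review in
# what was actually seen and heard at the event.
```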
But academic experiments are not enough. We need a systematic plan for how to begin training AI using visual and audiovisual material — and to decide who will take responsibility. This must begin before LeCun-style world model-based AIs become widespread. After all, video is now the primary medium for learning and teaching, and we must ensure that creators have video production tools that align as closely as possible with the specificities of Estonian culture. An audiovisual AI model that is unfamiliar with small cultures may become the most powerful force for cultural homogenization in history.
So which institutions should be responsible for such training efforts? I believe it's time to look to ERR and the National Archives with fresh eyes.
Until now, the mission of archives has been to preserve and restore cultural heritage. But one of them must now also be entrusted with training AI models that serve our native culture.
The most promising candidate is ERR, which — unlike the National Archives' Film Archive, focused mainly on cinematic art — has data reserves that contain far more representations of real-life Estonia. ERR's recordings of Midsummer bonfires on Vormsi, summer scooter rallies or student satellite launches from the technical university offer rich material for understanding the diversity of Estonian life.
Only with such material can we build, in Lotman's terms, translation bridges between modalities (whether as knowledge graphs, latent spaces or something else) and construct, in LeCun's terms, a world model of Estonian culture — not just a statistical compiler, but a living, predictive and deeply reflective thinking machine.
To make audiovisual material usable, ERR's archives need better metadata systems and a robust supplementary budget to hire development teams for this domain.
Admittedly, there are signs that the once-ambitious process to revise the ERR Act — started a couple of years ago — is now being reduced to a shadow of its former self under the leadership of the Ministry of Culture and at the urging of the Ministry of Finance. In this process, the modern goal of making ERR a driver and coordinator of domain-specific innovation in the digital age has been stripped away.
Yet this is a principle already recognized and embraced across most of Europe today: innovation and its coordination in the audiovisual and media sectors are among the ways public institutions can create value for society. That the Ministry of Culture is once again ducking under the bar — especially during the peak years of AI innovation — is shortsighted and irresponsible both for the future of national culture and the success of the ongoing AI transformation.
If we fail to pursue this direction, AI will remain stuck in front of the cave wall, drawing conclusions about our existence from only the few words that have been written down.
But if we bring in video, sound and images, machines will begin to "see" how the beach atmosphere in Pärnu differs from that on the shores of Lake Pühajärv, how Estonian language and literature draw from the daily rhythms of city life and how our driving culture differs from Finland's. Seeing these kinds of connections will not only reduce AI hallucinations about Estonian subject matter but also allow it to function as a coherent, evolving and socially useful predictor of national development.
In other words, the time has come not only to take audiovisual culture seriously, but to view it as a prerequisite for a meaningful AI leap forward and for improving Estonia's global competitiveness. And if we truly can take this seriously, and want to, then it's time to give our relevant institutions new mandates and support them in carrying them out.
--
Editor: Marcus Turovski