Martin Öövel and Krister Kruusmaa: Information locked away does not a society serve

The visibility and accessibility of Estonian-language content is not just a matter of convenience. When local journalism and culture become hard to access, public debate increasingly falls under the influence of foreign and unchecked information, write Martin Öövel and Krister Kruusmaa.
News recently broke that Postimees Grupp, Delfi Meedia and Õhtuleht have restricted access to their content over data mining concerns. Since this step concerns a significant part of our journalistic heritage, questions have arisen about how public access to preserved media content should function and who has the right to use it for technological development.
First, it is important to clarify that Digar has not been "locked" as a result of the media companies' actions. Nothing has changed for the average user. Many books and newspaper issues in Digar were already under restricted access (shown to the user behind a small lock icon). This fact deserves a closer explanation.
Who can access the digital archive?
Under the law, publishers — including newspapers and digital outlets — are required to send copies of their publications, including digitally born works, to the National Library. The National Library's responsibility is to preserve these works and, when appropriate, make them accessible to users. To provide such access, the National Library uses the digital archive Digar.
At the same time, all copyrights remain with the owners of the works. By law, publishers have the right to decide whether and how their works are displayed digitally. That is why many newspaper issues in Digar were already "locked."
Publishers' caution about widely sharing their content is entirely understandable, as it reflects their legitimate desire to protect their work and business model. However, it is worth asking whether restricting access is justified for news stories that are two, ten or even more than thirty years old.
Texts "behind the lock" can be accessed only at so-called authorized workstations — special computers located in a few libraries. This means that to read "locked" works, one must go to a reading room in person, even though the material itself is digital. While such a system ensures copyright protection, it no longer meets the expectations of the digital age. Authorized workstations are located in only five research libraries, all in either Tallinn or Tartu. This puts people living elsewhere in Estonia, as well as those with mobility difficulties, at a disadvantage.
The National Library is currently working on a legal and technical solution that would make it possible to create virtual authorized workstations in all public libraries, thereby expanding access across Estonia without infringing publishers' rights.
Who will fill the information void?
It is important to understand that the visibility and accessibility of Estonian-language content is not merely a matter of convenience. When our own journalistic and cultural material becomes difficult to access, public discourse increasingly comes to be shaped by foreign-language and unverified information.
The role of journalism is to inform society, and when the results of that work are not visible, it becomes harder to fight against propaganda and misinformation. One can also expect an even greater influx of foreign-language content in all areas of life, because when local information is unavailable, people turn to sources from abroad.
Education also suffers. Imagine a student in Saaremaa who wants to write a report on invasive species that threaten Estonia's native crayfish and needs sources for their research.
Living in a digital country, one would expect that all the necessary newspaper issues would be accessible from a home computer. In reality, the student would have to travel either to Tartu or Tallinn to read the articles at an authorized workstation. Most likely, they won't make that trip, even if they can afford the bus ticket. Instead, they'll choose an "easier" topic — one with information available from online sources. Limited access to local information drives people toward global and foreign content.
Several countries have recognized that in the information age, those who make their knowledge most accessible have the advantage — and they have addressed this challenge differently.
For example, in Norway, the National Library and publishers have entered into what is known as the Bookshelf Agreement (Bokhylla-avtalen), under which all works published up to 2005 — including those still under copyright — are freely available online from any Norwegian IP address. As a result, access to Norwegian-language information and culture has grown significantly and the agreement is regularly renewed to include more recent publications.
Data mining — who and why?
Now let's turn to data mining. As mentioned earlier, the decision made by media companies this spring did not actually change anything for the average reader, since a considerable portion of their archived content was already accessible only through authorized workstations. What the media houses did, however, was prohibit the use of their publications' content for data mining conducted for commercial purposes.
What does this mean? Data mining is the automated analysis of large datasets to identify connections and patterns. It is used in language technology, academic research, journalism analysis and cultural heritage studies. Media monitoring and many other services are also based on data mining.
Thus, data mining goes far beyond training artificial intelligence. However, the rise of generative AI has made it an economically sensitive activity. While data mining was previously used mainly for analytical purposes, the extracted data can now serve as raw material for products such as ChatGPT. The media houses invoked the so-called opt-out clause provided by law and prohibited any data mining of their works that could serve commercial interests.
As the publications themselves have explained, they are concerned about the proper and lawful use of their content. Their concern is understandable and has been echoed elsewhere as well — for instance, in the United States, several media outlets have sued AI companies to seek compensation for the use of their articles. Media content is clearly protected by intellectual property rights and its use must comply with the law.
The price of cultural heritage
The media companies' decision highlights a significant point of tension: Estonia still lacks a clear solution for how to support data mining that serves our collective interests. Not all business activity can be lumped together. At one extreme are global tech giants earning vast sums from covertly collected data; at the other are local, more transparent companies whose work could genuinely benefit society as a whole.
There are indications that the media houses are not so much worried about the integrity of their intellectual property as about lost revenue. For instance, some media companies have reportedly offered their articles to large AI firms — for a price, of course. The situation in which one hand shuts off public access to journalistic content while the other tries to sell it abroad is, to put it mildly, perplexing. It's hard not to think of the story of the Native Americans who sold Manhattan for glass beads, without realizing what they were giving away.
Or imagine, instead, a village woman from centuries past telling a folklore collector that she'll only sing for money — and besides, there's no point in collecting anymore, since all the old folk songs have already been sold to the Germans!
This is, admittedly, an exaggerated comparison — but only slightly. Jakob Hurt and his colleagues understood that collecting folklore was not merely archival work but an investment in the future. Their efforts helped lay the foundation of our national consciousness. Following their example, we should strive to ensure that our linguistic and cultural data — texts, images and all other valuable material — are preserved, protected and at the same time accessible. Most importantly, the benefits should go to the people of Estonia themselves, not to anyone else.
How to wisely manage data?
Under market logic, the tech giants have a strong advantage over local stakeholders. Estonia and our partners will never be able to pay for data mining at the level of Google or OpenAI. But if we do not support local AI innovation with high-quality data, we are simply painting ourselves into a corner. The dangers that arise from the absence of high-quality, native-language AI have been pointed out repeatedly.
In the future, Estonian-language AI may exist only as paid models offered by large corporations, and the subscription fees for those models would have us pay back the one-time revenue earned from our data.
One possible alternative is to empower open models. Open models could be run on our own servers here in Estonia without intermediary fees and without the data-protection problems that block AI adoption in many institutions.
Unfortunately, such models usually do not yet know Estonian well enough, and it is not possible to teach them without data mining. Rather than seeking one-off financial compensation, we should focus on locally valorizing our strategic resource: our data. That way we can still derive value from our data in ten, and perhaps even a hundred, years.
Opt-in instead of opt-out
The National Library welcomes the fact that the issues of cultural heritage accessibility and data mining have entered the public discussion. Now is the right time to sit down together. Media companies, publishers, authors and AI developers should work jointly to find a balance between access, intellectual property protection and the promotion of innovation.
Several other countries have already recognized that these are not mutually exclusive values but rather complementary ones. Again, we can look to Norway, where agreements have enabled local laboratories to create much higher-quality Norwegian-language AI models.
Iceland has shown other small nations how to protect their language and culture in the age of artificial intelligence. In Latvia and Finland, capable AI companies are operating with permission to use local data for model training.
Working toward a shared goal should also be possible in Estonia. In addition to the opt-out option provided by law, copyright holders should have the opportunity to "opt in," that is, to choose to make their creations available for the benefit of society. We must understand that there are no winners or losers here: ensuring access to native-language culture, both for ordinary readers and for data mining, serves our common interest.
--
Editor: Marcus Turovski










