UT's machine translation engine now supports 17 more Finno-Ugric languages

  • 2023-03-14
  • BNS/TBT Staff

TALLINN – Researchers at the University of Tartu Institute of Computer Science have added Livonian, Komi, Veps and 14 other low-resource Finno-Ugric languages to Neurotolge, the university's machine translation engine.

Most of these languages have become available on a public translation engine for the first time, as they are not part of Google Translate and similar services, the university said in a press release.

In total, the translation engine supports 23 Finno-Ugric languages: in addition to the more commonly supported Estonian, Finnish and Hungarian, it now includes Livonian, Voro, Proper Karelian, Livvi Karelian, Ludian, Veps, Northern Sami, Southern Sami, Inari Sami, Skolt Sami, Lule Sami, Komi, Komi-Permyak, Udmurt, Hill Mari and Meadow Mari, Erzya, Moksha, Mansi and Khanty.

The research group invites speakers and researchers of these languages to contribute corrected translations to improve translation quality. This can be done by editing translations at translate.ut.ee. Texts in these languages, such as poems, articles and books, are also of great help and can be sent to [email protected].

Lisa Yankovskaya, a research fellow in natural language processing at the University of Tartu Institute of Computer Science, said feedback is needed to improve the translation quality because many of these languages have extremely scarce resources for creating translation systems. This means two things. First, the translation quality can vary a lot, and it can be especially low when translating into low-resource languages. Second, the developers need the help of speakers of these languages, who can contribute correct translations on the platform.

She said there were several reasons for developing machine translation for low-resource languages. For example, philologists and others need machine translation to understand texts without learning the language. Translating into these languages is also a way of preserving endangered languages and supporting their speakers. This is why the translation system is open to all users and the software and the created models are open source.

The developers started working with Finno-Ugric languages in 2021, with the first system supporting Voro, Northern Sami and Southern Sami, said Maali Tars, scientific programmer at the Institute of Computer Science. Livonian, an extremely endangered language with only about 20 near-native speakers, was added in the same year. The developers intend to keep improving the quality of the machine translation system and to include more Finno-Ugric languages and dialects.

The developers worked with the Livonian Institute at the University of Latvia, the Voro Institute, the University of Eastern Finland, the Karelian language revitalization program of the University of Eastern Finland, and the Arctic University of Norway.