Enriching Odia Language Resource: Our Collective Effort

Odia (also spelled Oriya) is an Indo-Aryan language spoken in the Indian state of Odisha. Apart from Odisha, Odia has significant speaking populations in five neighboring states (Andhra Pradesh, Madhya Pradesh, West Bengal, Jharkhand, and Chhattisgarh) and one neighboring country (Bangladesh). Odia is recognized as a classical Indian language (the sixth of the 23 official languages, including English, to receive this prestigious status) and has a literary history of more than 1,000 years. Odia is spoken by 50 million people today. It is heavily influenced by the Dravidian languages as well as by Arabic, Persian, and English. Odia's inflectional morphology is rich, with a three-tier tense system. The prototypical word order is subject-object-verb (SOV) [2].

A multilingual country like India needs language corpora for low-resource languages, not only to give its citizens the natural language processing (NLP) technologies readily available in other countries but also to support their educational and cultural needs.

In today's digital world, there is a great need and potential for Odia machine translation (MT), primarily in the English-to-Odia direction. Machine translation needs a large number of parallel sentence pairs (a corpus) to reach good translation quality. For languages without large corpora (“low-resource” languages or language pairs), machine translation is severely limited. The Odia language lacks sizable online content, and only a few small, independent English-Odia parallel corpora are available. Although the Odia language has a rich cultural heritage, that heritage is not digitized or accessible, resulting in a lack of resources. Consequently, many machine translation systems do not support Odia.

Odia is neither available in popular corpora lists for machine translation nor listed in any shared task for machine translation. Although there have been a few attempts to build Odia corpora, none of them is large enough or suitable for machine translation, nor are they available online for research purposes.

These reasons strongly motivated our attempt to build an English-Odia parallel corpus and an Odia monolingual corpus suitable for machine translation and NLP research. Our first attempt was OdiEnCorp 1.0, built by collecting English-Odia parallel text from the Bible and from other online resources, mostly government websites for citizens, online digital libraries, and online magazines [1].

The Odia corpus development (OdiEnCorp) was initiated at the Institute of Formal and Applied Linguistics, Charles University, Czech Republic, and continued in collaboration with researchers from KIIT University, India, and Idiap Research Institute, Switzerland.

The OdiEnCorp 1.0 parallel corpus contains 29,346 sentence pairs, with 756K English and 648K Odia tokens. OdiMonoCorp contains 2.6 million tokens in 221K sentences and 71K paragraphs. Our corpora (OdiEnCorp and OdiMonoCorp) are available for research and non-commercial use under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License at http://hdl.handle.net/11234/1-2879. The released corpus has been used by many NLP researchers for machine translation, particularly for the Odia language, and has encouraged researchers to develop Odia corpora for machine translation.
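
To make the size figures above concrete, here is a minimal sketch of how sentence-pair and token counts can be computed for a parallel corpus. It assumes the corpus is held as a list of (English, Odia) string pairs and that tokens are separated by whitespace; the function name `corpus_stats` and the two toy sentence pairs are purely illustrative and are not taken from OdiEnCorp. The tokenization behind the published counts may differ, so numbers from this sketch need not match them.

```python
from typing import Dict, Iterable, Tuple


def corpus_stats(pairs: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count sentence pairs and whitespace-separated tokens on each side.

    Assumption: tokens are whatever str.split() yields; a real pipeline
    would likely use a proper tokenizer.
    """
    n_pairs = 0
    en_tokens = 0
    od_tokens = 0
    for english, odia in pairs:
        n_pairs += 1
        en_tokens += len(english.split())
        od_tokens += len(odia.split())
    return {"pairs": n_pairs, "en_tokens": en_tokens, "od_tokens": od_tokens}


# Toy data for illustration only (not from the released corpus).
sample = [
    ("I am going home .", "ମୁଁ ଘରକୁ ଯାଉଛି ।"),
    ("What is your name ?", "ତୁମର ନାମ କଣ ?"),
]

stats = corpus_stats(sample)
print(stats)
```

The same loop works on a generator that reads pairs from disk, so the whole corpus does not have to be loaded into memory.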

Due to the small size of OdiEnCorp 1.0, we could not achieve a good BLEU score (an automatic measure of translation quality) with a neural machine translation (NMT) system; a phrase-based machine translation system performed better. Also, we did not use the corpus in any machine translation shared task or campaign.
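
For readers unfamiliar with BLEU, the following is a simplified sketch of the idea: clipped n-gram precision for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It assumes a single reference per sentence, whitespace tokenization, and no smoothing; the function name `simple_bleu` and the example sentences are illustrative. This is not the evaluation script behind the scores discussed here, and published results are normally computed with a standard tool such as sacreBLEU.

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Return a multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(hypotheses: List[str], references: List[str], max_n: int = 4) -> float:
    """Corpus-level BLEU on a 0-100 scale (one reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = 0
    ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts = ngram_counts(h, n)
            r_counts = ngram_counts(r, n)
            # Counter intersection keeps the minimum count: "clipped" matches.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any n-gram order with zero matches gives BLEU = 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty punishes output shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1.0 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)


hyp = ["the cat sat on the mat"]
ref = ["the cat sat on a mat"]
score = simple_bleu(hyp, ref)
print(f"BLEU = {score:.2f}")
```

Because no smoothing is applied, short or poor translations easily score zero, which is one reason a small training corpus can look especially bad under this metric.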

To enrich OdiEnCorp 1.0, we tried to collect English-Odia parallel text from all possible sources. Because the Odia language lacks online content, we drew on its wealth of books in various fields (e.g. literature books, grammar books, study books), collecting books available in both languages. We also used Wikipedia content available in both English and Odia, but since multilingual Wikipedia pages are not translations of each other, we aligned and processed them manually to obtain a one-to-one sentence mapping.

The major sources we used are i) optical character recognition (OCR) based text extraction from the available grammar, study, and literature books, ii) Odia Wikipedia, iii) online resources, and iv) available corpora, as shown in the figure below.

Translation quality with OdiEnCorp 2.0 is better than with OdiEnCorp 1.0, as shown by the scores on the Dev and Test sets of OdiEnCorp 1.0 for the baseline NMT models trained on OdiEnCorp 1.0 vs. OdiEnCorp 2.0 [2].

We are thankful to the organizing committee members of the Workshop on Asian Translation (WAT), which organizes machine translation shared tasks for Asian languages. Many teams, from academia to major companies (e.g. Facebook AI, Microsoft Research, Systran), participate in it and present their best translation systems. In WAT 2019, 25 teams participated and submitted 400 translation results [3]. The workshop was hosted in conjunction with EMNLP-IJCNLP 2019 in Hong Kong.

This year, WAT 2020 includes Odia-to-English and English-to-Odia machine translation tasks using OdiEnCorp 2.0. To the best of our knowledge, this is the first time the Odia-English language pair is offered as a shared task in any machine translation competition, and we expect a warm response from participants. The task is particularly interesting for NLP researchers focusing on i) low-resource machine translation, ii) Indian language machine translation, and iii) multilingual translation.

The OdiEnCorp 2.0 parallel corpus covers many domains: the Bible, other literature, Wiki data relating to many topics, government policies, and general conversation. It is available for research and non-commercial use under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 License (CC BY-NC-SA) at http://hdl.handle.net/11234/1-3211.

We will always be grateful to the volunteers who supported the development of the corpora (OdiEnCorp 1.0 and OdiEnCorp 2.0). The released OdiEnCorp 2.0 is freely available for research and non-commercial purposes and helps many NLP researchers develop machine translation systems with our data and pursue research in this direction. As more researchers show interest and join us, we expect to enrich the available resources and build new NLP resources for the Odia language.

The members involved in OdiEnCorp 1.0 and OdiEnCorp 2.0 are listed below.

Team

References

[1] Parida, S., Dash, S. R., Bojar, O., Motlíček, P., Pattnaik, P., & Mallick, D. K. (2020, May). OdiEnCorp 2.0: Odia-English Parallel Corpus for Machine Translation. In LREC 2020 Workshop, Language Resources and Evaluation Conference, 11–16 May 2020 (p. 14).
[2] Parida, S., Bojar, O., & Dash, S. R. (2019, September). OdiEnCorp: Odia–English and Odia-Only Corpus for Machine Translation. In Smart Intelligent Computing and Applications: Proceedings of the Third International Conference on Smart Computing and Informatics (Vol. 1, p. 495). Springer Nature.
[3] Nakazawa, T., Doi, N., Higashiyama, S., Ding, C., Dabre, R., Mino, H., ... & Bojar, O. (2019, November). Overview of the 6th workshop on Asian translation. In Proceedings of the 6th Workshop on Asian Translation (pp. 1-35).