EKTB11: Neural network based text analysis models for Estonian (2018-2022)

During the last few years, natural language processing has undergone a considerable technological shift driven by rapid developments in artificial neural networks. For many text analysis tasks concerned with the automatic analysis of linguistic structure, neural models have proved more successful than the previously used statistical models. The automatic text analysis tools developed so far for Estonian are either rule-based or statistical. Based on the literature, their performance can be expected to improve when neural models are adopted. The goal of this project is to bring the automatic text analysis tools for Estonian up to date by moving them to neural technologies, thereby improving their accuracy and quality.

Principal Investigator: Kairit Sirts, PhD
Institution: Institute of Computer Science, University of Tartu
Funding: The national program “Estonian Language Technology 2018-2027”

Datasets

Estonian Web Treebank with manually annotated sentence and word boundaries. This is a subset of the Estonian Web Treebank (EWT) available at Universal Dependencies (UD).
New Estonian NER dataset: it consists of ca 130K new texts from news and web domains annotated with a rich annotation scheme.
Reannotated Estonian NER dataset: this is the Estonian NER dataset reannotated with the same rich annotation scheme as the New Estonian NER dataset.

Models

EstBERT 128 and EstBERT 512: Estonian-specific BERT models with maximum sequence lengths of 128 and 512 tokens respectively.
EstBERT NER: the EstBERT 128 model fine-tuned on the Estonian NER dataset. The model can predict PER, ORG and LOC entities.
EstBERT NER v2: the EstBERT 128 model fine-tuned on the New Estonian NER dataset and the Reannotated Estonian NER dataset. The model can predict the 11 entity types present in the rich label scheme used to annotate these datasets.
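
To illustrate how token-level NER predictions such as PER, ORG and LOC can be turned into entity mentions, here is a minimal sketch. It assumes the tags follow the common BIO scheme (B-PER, I-PER, O and so on), which this page does not state. The function name bio_to_spans and the example sentence with its tags are hypothetical and are not actual model output.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group token-level BIO tags into (entity_text, entity_type) pairs.

    Assumption: tags look like "B-PER", "I-PER", "O". An I- tag that does
    not continue an open entity of the same type is treated like "O".
    """
    spans: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None

    def close_entity() -> None:
        # Store the entity collected so far, if any, and reset the state.
        nonlocal current_tokens, current_type
        if current_tokens and current_type is not None:
            spans.append((" ".join(current_tokens), current_type))
        current_tokens = []
        current_type = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity.
            close_entity()
            current_tokens = [token]
            current_type = tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag ends any open entity.
            close_entity()

    close_entity()
    return spans


# Hypothetical example sentence with hand-written tags.
tokens = ["Kairit", "Sirts", "töötab", "Tartu", "Ülikoolis", "."]
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O"]
print(bio_to_spans(tokens, tags))
```

The same grouping logic would apply to the 11-type label scheme of EstBERT NER v2, since the function only depends on the B-/I- prefixes and not on a fixed list of entity types.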

Demos

Named Entity Recognition: this demo is based on the EstBERT NER model.
Vabamorph enhanced Lemmatiser: this is a neural lemmatizer that uses the lemmas proposed by the Vabamorph morphological analyzer to enhance the predictions.

Papers

Milintsevich, K., & Sirts, K. (2021). Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 3112-3122).
Tanvir, H., Kittask, C., Eiche, S., & Sirts, K. (2021). EstBERT: A Pretrained Language-Specific BERT for Estonian. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa) (pp. 11-19).
Sirts, K., & Peekman, K. (2020). Evaluating Sentence Segmentation and Word Tokenization Systems on Estonian Web Texts. In Human Language Technologies–The Baltic Perspective (pp. 174-181). IOS Press.
Milintsevich, K., & Sirts, K. (2020). Lexicon-Enhanced Neural Lemmatization for Estonian. In Human Language Technologies–The Baltic Perspective (pp. 158-165). IOS Press.
Kittask, C., Milintsevich, K., & Sirts, K. (2020). Evaluating Multilingual BERT for Estonian. In Human Language Technologies–The Baltic Perspective (pp. 19-26). IOS Press.
Tkachenko, A., & Sirts, K. (2018). Modeling Composite Labels for Neural Morphological Tagging. In Proceedings of the 22nd Conference on Computational Natural Language Learning (pp. 368-379).
Tkachenko, A., & Sirts, K. (2018). Neural Morphological Tagging for Estonian. In Human Language Technologies–The Baltic Perspective (pp. 166-174). IOS Press.

Contributors

Aleksei Dorkin, MA (2022)
ChengHan Chung, BSc (2021-2022)
Hasan Tanvir, MSc (2020-2021)
Kairit Peekman, BSc (2020, 2022)
Claudia Kittask, MSc (2019-2021)
Laura-Katrin Leman, MA (2019-2020)
Kirill Milintsevich, MSc (2018-2020)