IRTUM – Institutional Repository of the Technical University of Moldova

Automatic Detection of Arabicized Berber and Arabic Varieties

dc.contributor.author ADOUANE, Wafia
dc.contributor.author SEMMAR, Nasredine
dc.contributor.author JOHANSSON, Richard
dc.contributor.author BOBICEV, Victoria
dc.date.accessioned 2021-04-08T12:01:54Z
dc.date.available 2021-04-08T12:01:54Z
dc.date.issued 2016
dc.identifier.citation ADOUANE, Wafia, SEMMAR, Nasredine, JOHANSSON, Richard et al. Automatic Detection of Arabicized Berber and Arabic Varieties. In: Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), Dec. 2016, Osaka, Japan, pp. 63–72. Anthology ID W16-4809. en_US
dc.identifier.uri https://www.aclweb.org/anthology/W16-4809
dc.identifier.uri http://repository.utm.md/handle/5014/14046
dc.description Access full text: https://www.aclweb.org/anthology/W16-4809 en_US
dc.description.abstract Automatic Language Identification (ALI) is the detection of the natural language of an input text by a machine. It is a necessary first step for any language-dependent natural language processing task. Various methods have been successfully applied to a wide range of languages, and state-of-the-art automatic language identifiers are mainly based on character n-gram models trained on huge corpora. However, many languages are not yet automatically processed, for instance minority and informal languages. Many of these languages are only spoken and have no written form. Social media platforms and new technologies have facilitated the emergence of written forms of these spoken languages, based on pronunciation. These languages are not well represented on the Web and are commonly referred to as under-resourced languages; currently available ALI tools fail to recognize them properly. In this paper, we revisit the problem of ALI with a focus on Arabicized Berber and dialectal Arabic short texts. We introduce new resources and evaluate existing methods. The results show that machine learning models combined with lexicons are well suited to detecting Arabicized Berber and different Arabic varieties and to distinguishing between them, giving a macro-average F-score of 92.94%. en_US
dc.language.iso en en_US
dc.publisher The COLING 2016 Organizing Committee en_US
dc.rights Attribution-NonCommercial-NoDerivs 3.0 United States *
dc.rights.uri http://creativecommons.org/licenses/by-nc-nd/3.0/us/ *
dc.subject language identification en_US
dc.subject automatic language identification en_US
dc.subject natural language en_US
dc.subject machine learning en_US
dc.title Automatic Detection of Arabicized Berber and Arabic Varieties en_US
dc.type Article en_US
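
The abstract reports its result as a macro-average F-score. As a minimal sketch of how such a figure is commonly computed (per-class F1, then an unweighted mean over classes), the Python below may help; the function name `macro_f1` and the labels `"ber"` and `"arb"` are hypothetical, and this record does not state which exact averaging variant the authors used.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean.

    Assumption: this is the common per-class-F1 variant; the record does not
    say which variant the paper uses.
    """
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # true class g was missed

    scores = []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not data from the paper.
gold = ["ber", "ber", "arb", "arb"]
pred = ["ber", "arb", "arb", "arb"]
score = macro_f1(gold, pred)
```

Because every class contributes equally to the mean, a variety with few test texts weighs as much as a frequent one, which is why this measure is often preferred when classes are imbalanced.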


Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States.
