Hrvatska znanstvena bibliografija (Croatian Scientific Bibliography, CROSBI)

Bibliographic record number: 914740

Journal

Authors: Beliga, Slobodan; Ipšić, Ivo; Martinčić-Ipšić, Sanda
Title: Evaluation of Language Models over Croatian Newspaper Texts
Source: Information Technology and Control (1392-124X) 46 (2017), 4; 425-444
Type of work: article
Keywords: Statistical language model; Natural language regularity; Word-based language model; Category-based language model; Brown algorithm; POS class; N-gram; Perplexity; Croatian corpora
Abstract:
Statistical language modeling involves techniques and procedures that assign probabilities to word sequences or, in other words, estimate the regularity of the language. This paper presents the basic characteristics of statistical language models, reviews their use in a large set of speech and language applications, explains their formal definition and shows different types of language models. A detailed overview of n-gram and class-based models (as well as their combinations) is given chronologically, by type and complexity of the models, and with respect to their use in different NLP applications for different natural languages. The proposed experimental procedure compares three types of statistical language models: n-gram models based on words, categorical models based on automatically determined categories, and categorical models based on POS tags. In the paper, we propose a language model for contemporary Croatian texts and a procedure for determining the best n-gram order and the optimal number of categories, which leads to a significant decrease in language model perplexity, estimated from the Croatian News Agency (HINA) articles corpus. Using different language models estimated from the HINA corpus, we show experimentally that models based on categories contribute to a better description of the natural language than those based on words. The findings of the proposed experiment are applicable not only to Croatian but also to similar highly inflectional languages with rich morphology and non-mandatory sentence word order.
Original language: ENG
The paper is indexed in the following databases:
Scopus
SCI-EXP, SSCI and/or A&HCI
Science Citation Index Expanded (SCI-EXP) (part of the Web of Science Core Collection)
Category: Scientific
Scientific fields:
Computing, Information and communication sciences
Full text of the paper: 914740.Beliga_ITC_2017_EvaluationOfLanguageModelsOverCroatianNewsTexts.pdf (text attached 23 Dec 2017 at 21:00)
URLs: http://www.itc.ktu.lt/index.php/ITC/article/download/18367/9137
http://itc.ktu.lt/index.php/ITC/article/view/18367
http://itc.ktu.lt/index.php/ITC/article/view/18367/9137
DOI: 10.5755/j01.itc.46.4.18367
Full-text URL: http://www.itc.ktu.lt/index.php/ITC/article/download/18367/9137
Entered in CROSBI by: Slobodan Beliga (sbeliga@inf.uniri.hr), 23 Dec 2017 at 21:00
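
Illustrative note: the abstract above evaluates language models by perplexity. The following is a minimal sketch of how the perplexity of a word-based bigram model could be computed. It is not the authors' procedure: the add-k smoothing, the `<unk>` handling, the function names (`train_bigram`, `bigram_prob`, `perplexity`) and the tiny English toy corpus are all assumptions made for illustration, and none of the data comes from the HINA corpus.

```python
import math
from collections import Counter

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words


def train_bigram(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    contexts = Counter()
    bigrams = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # Words that can be predicted: training words, end marker, unknown word.
    vocab = {w for sent in sentences for w in sent} | {"</s>", UNK}
    return contexts, bigrams, vocab


def bigram_prob(prev, word, contexts, bigrams, vocab, k=1.0):
    """Add-k smoothed estimate of P(word | prev); smoothing choice is an assumption."""
    return (bigrams[(prev, word)] + k) / (contexts[prev] + k * len(vocab))


def perplexity(sentences, contexts, bigrams, vocab, k=1.0):
    """Perplexity = 2 ** (-average log2 probability per predicted token)."""
    log_sum = 0.0
    n_tokens = 0
    for sent in sentences:
        mapped = [w if w in vocab else UNK for w in sent]
        tokens = ["<s>"] + mapped + ["</s>"]
        for prev, word in zip(tokens[:-1], tokens[1:]):
            log_sum += math.log2(bigram_prob(prev, word, contexts, bigrams, vocab, k))
            n_tokens += 1
    return 2 ** (-log_sum / n_tokens)


# Toy data, invented for this sketch only.
train = [["the", "cat", "sat"], ["the", "dog", "sat"]]
contexts, bigrams, vocab = train_bigram(train)

pp_seen = perplexity([["the", "cat", "sat"]], contexts, bigrams, vocab)
pp_shuffled = perplexity([["sat", "cat", "the"]], contexts, bigrams, vocab)
print(f"perplexity, seen word order:     {pp_seen:.2f}")
print(f"perplexity, shuffled word order: {pp_shuffled:.2f}")
```

A lower perplexity means the model assigns higher probability to the test text. In this sketch the sentence with the familiar word order scores lower than the shuffled one. A category-based model of the kind the abstract compares would replace word histories with class histories, which this sketch does not attempt.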