Croatian Language N-Gram System (CROSBI ID 184676)
Journal contribution | original scientific paper | international peer review
Responsibility information
Dembitz, Šandor ; Blašković, Bruno ; Gledec, Gordan
English
Croatian Language N-Gram System
Large-scale n-gram models are available for only a small number of languages. Until now, Croatian has not been one of them. The research presented in this paper describes the development of an n-gram database system suitable for large-scale language modeling in Croatian. The process of n-gram collection relies on the Croatian academic online spellchecker Hascheck, which has been publicly available since 1993 and is today a popular language service, with average daily traffic exceeding one million tokens. The approach demonstrated in this paper eliminates the need for n-gram data cleaning in the post-processing phase, which is a serious issue for other languages. The spellchecker dynamics allowed Heaps’ law modeling to be applied to Croatian n-grams, which enabled the prediction of n-gram count growth.
Croatian; lexical n-gram; language modeling; Heaps’ law
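The abstract's mention of Heaps’ law can be illustrated with a minimal sketch. Heaps’ law relates the number of distinct items V to the number of processed tokens N as V = K * N^beta, so K and beta can be estimated by a least-squares line fit in log-log space. The function names `ngrams`, `fit_heaps`, and `predict_distinct` below are hypothetical, and the fitting procedure is a generic one assumed for illustration; the record does not say how the authors fitted their model.

```python
import math


def ngrams(tokens, n):
    """Return an iterator over the n-grams (as tuples) of a token list."""
    return zip(*(tokens[i:] for i in range(n)))


def fit_heaps(points):
    """Fit V = K * N**beta by least squares on log V versus log N.

    points: iterable of (N, V) pairs, where N is the number of tokens
    processed and V the number of distinct n-grams seen so far (both > 0).
    Returns the tuple (K, beta).
    """
    points = list(points)
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(v) for _, v in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx                       # slope of the log-log line
    k = math.exp(mean_y - beta * mean_x)   # intercept, mapped back from logs
    return k, beta


def predict_distinct(k, beta, n_tokens):
    """Predict the distinct n-gram count after n_tokens tokens."""
    return k * n_tokens ** beta
```

In use, one would record (N, V) checkpoints while counting distinct n-grams over a growing corpus, call `fit_heaps` on those checkpoints, and then call `predict_distinct` for a larger N to extrapolate the growth.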
Publication data
Volume: 243
Year: 2012
Pages: 696-705
Status: published
ISSN: 0922-6389
DOI: 10.3233/978-1-61499-105-2-696