Data source: CROSBI

Optimizing Sentence Boundary Detection for Croatian (CROSBI ID 186502)

Journal contribution | original scientific paper | international peer review

Šarić, Frane ; Šnajder, Jan ; Dalbelo Bašić, Bojana Optimizing Sentence Boundary Detection for Croatian // Lecture notes in computer science, 7499 (2012), 105-111. doi: 10.1007/978-3-642-32790-2_12

Responsibility details

Šarić, Frane ; Šnajder, Jan ; Dalbelo Bašić, Bojana

English

Optimizing Sentence Boundary Detection for Croatian

A number of natural language processing tasks depend on segmenting text into sentences. Tools that perform sentence boundary detection achieve excellent performance for some languages. We have tried to train a few publicly available language-independent tools to perform sentence boundary detection for Croatian. The initial results show that off-the-shelf methods used for English do not work particularly well for Croatian. After performing error analysis, we propose additional features that help in resolving some of the most common boundary detection errors. We use unsupervised methods on a large Croatian corpus to collect likely sentence starters, abbreviations, and honorifics. In addition to some commonly used features, we use these lists of words as features for a classifier that is trained on a smaller corpus with manually annotated sentences. The method we propose advances the state-of-the-art accuracy for Croatian sentence boundary detection on news corpora to 99.5%.

sentence boundary; Croatian language; logistic regression

The paper was presented at the 15th International Conference Text, Speech and Dialogue (TSD 2012), held in September 2012 in Brno, Czech Republic.

Publication details

Volume: 7499

Year: 2012

Pages: 105-111

Status: published

ISSN: 0302-9743

DOI: 10.1007/978-3-642-32790-2_12

Paper associations

Computing
