Data source: CROSBI

Correcting Word Merge Errors in Croatian Texts (CROSBI ID 567343)

Conference paper in proceedings | original scientific paper | international peer review

Mikša, Mladen ; Šnajder, Jan ; Dalbelo Bašić, Bojana. Correcting Word Merge Errors in Croatian Texts // Seventh International Conference on Formal Approaches to South Slavic and Balkan Languages / Tadić, Marko ; Dimitrova-Vulchanova, Mila ; Koeva, Svetla (eds.). Zagreb: Hrvatsko društvo za jezične tehnologije, 2010, pp. 67-75

Authorship details

Mikša, Mladen ; Šnajder, Jan ; Dalbelo Bašić, Bojana

English

Correcting Word Merge Errors in Croatian Texts

In many text processing tasks, character-level errors (due to mistyping, OCR, etc.) typically lead to performance degradation. Most approaches to error correction are dictionary-based and cannot be used to correct word boundary errors. Word boundary errors, especially word merge errors, are quite common in OCR-generated texts. In this paper we describe an approach to correcting word merge errors in texts written in Croatian. The approach is based on combinatorial optimization with a beam search strategy that determines the most plausible segmentation of the input token. The plausibility of a segmentation is assessed using a statistical language model and several heuristics. We evaluate the performance of our approach on a sample of artificially generated word merge errors. The achieved results are comparable to those of the approaches found in the literature.
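The segmentation idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the language model or the heuristics, so this sketch assumes a smoothed unigram model and a single made-up heuristic (a per-character penalty for unknown words). The names `make_unigram_scorer` and `segment`, and the parameters `alpha`, `beam_width`, and `max_word_len`, are hypothetical.

```python
import math
from collections import Counter


def make_unigram_scorer(corpus_tokens, alpha=1.0):
    """Build a log-probability function from a token list (assumed unigram model)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen words

    def log_prob(word):
        # Add-alpha smoothed unigram probability.
        p = (counts.get(word, 0) + alpha) / (total + alpha * vocab)
        # Assumed heuristic: unknown words pay a penalty per character,
        # so long out-of-vocabulary chunks lose to known-word splits.
        penalty = 0.0 if word in counts else -2.0 * len(word)
        return math.log(p) + penalty

    return log_prob


def segment(token, log_prob, beam_width=5, max_word_len=20):
    """Return the most plausible split of a merged token into words."""
    # beams[i] holds the best hypotheses (score, words) covering token[:i].
    beams = {0: [(0.0, [])]}
    for end in range(1, len(token) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            word = token[start:end]
            for score, words in beams[start]:
                candidates.append((score + log_prob(word), words + [word]))
        # Beam pruning: keep only the top-scoring hypotheses at this position.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams[end] = candidates[:beam_width]
    return beams[len(token)][0][1]


# Toy corpus, invented for illustration only.
toy_corpus = "ovo je mala kuća i ovo je mali pas".split()
scorer = make_unigram_scorer(toy_corpus)
print(segment("ovojemalakuća", scorer))  # ['ovo', 'je', 'mala', 'kuća']
```

With a unigram model, a beam width of 1 would already be exact; the beam becomes relevant once the score depends on neighbouring words, as with a higher-order language model.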

word merge errors; OCR errors; combinatorial optimization; language modeling; natural language processing; Croatian language

Contribution details

67-75

2010

published

Host publication details

Seventh International Conference on Formal Approaches to South Slavic and Balkan Languages

Tadić, Marko ; Dimitrova-Vulchanova, Mila ; Koeva, Svetla

Zagreb: Hrvatsko društvo za jezične tehnologije

978-953-55375-2-6

Conference details

Seventh International Conference on Formal Approaches to South Slavic and Balkan Languages

lecture

4-6 October 2010

Dubrovnik, Croatia

Research area

Computing
