Correcting Word Merge Errors in Croatian Texts (CROSBI ID 567343)
Conference paper in proceedings | original scientific paper | international peer review
Statement of responsibility
Mikša, Mladen ; Šnajder, Jan ; Dalbelo Bašić, Bojana
English
Correcting Word Merge Errors in Croatian Texts
In many text processing tasks, character-level errors (due to mistyping, OCR, etc.) lead to performance degradation. Most approaches to error correction are dictionary-based and cannot be used to correct word boundary errors. Word boundary errors, especially word merge errors, are quite common in OCR-generated texts. In this paper we describe an approach to correcting word merge errors in texts written in the Croatian language. The approach is based on combinatorial optimization with a beam search strategy that determines the most plausible segmentation of the input token. The plausibility of a segmentation is assessed using a statistical language model and several heuristics. We evaluate the performance of our approach on a sample of artificially generated word merge errors. The achieved results are comparable to those of approaches found in the literature.
word merge errors; OCR errors; combinatorial optimization; language modeling; natural language processing; Croatian language
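As an illustration of the idea summarized in the abstract, the following is a minimal sketch of segmenting a merged token with a position-wise beam search. It is not the authors' implementation: the record does not specify the language model or the heuristics, so the sketch assumes a simple unigram model, and the names `TOY_COUNTS`, `UNKNOWN_LOGPROB_PER_CHAR`, `word_logprob`, and `segment`, as well as all counts and parameter values, are hypothetical.

```python
import math

# Hypothetical toy unigram counts (assumption); a real system would
# estimate these from a large Croatian corpus.
TOY_COUNTS = {
    "ovaj": 50, "rad": 40, "ova": 30, "o": 20,
    "je": 80, "u": 90, "radu": 25, "opisan": 10,
}
TOTAL = sum(TOY_COUNTS.values())

# Assumed per-character penalty for strings missing from the toy vocabulary.
UNKNOWN_LOGPROB_PER_CHAR = -10.0


def word_logprob(word):
    """Log-probability of one word under the toy unigram model."""
    count = TOY_COUNTS.get(word)
    if count is None:
        # Penalise unknown strings in proportion to their length.
        return UNKNOWN_LOGPROB_PER_CHAR * len(word)
    return math.log(count / TOTAL)


def segment(token, beam_width=3, max_word_len=12):
    """Return the best-scoring segmentation of `token` found by beam search."""
    # beams[i] holds up to `beam_width` (score, words) hypotheses
    # that exactly cover the prefix token[:i].
    beams = [[] for _ in range(len(token) + 1)]
    beams[0] = [(0.0, [])]
    for end in range(1, len(token) + 1):
        candidates = []
        for start in range(max(0, end - max_word_len), end):
            piece = token[start:end]
            for score, words in beams[start]:
                candidates.append((score + word_logprob(piece), words + [piece]))
        # Prune: keep only the highest-scoring hypotheses ending here.
        candidates.sort(key=lambda hyp: hyp[0], reverse=True)
        beams[end] = candidates[:beam_width]
    return beams[-1][0][1]


if __name__ == "__main__":
    print(segment("ovajrad"))
    print(segment("ujeradu"))
```

Under these toy counts, `segment("ovajrad")` returns `["ovaj", "rad"]`. A larger beam width explores more alternative segmentations at higher cost; additional heuristics of the kind the abstract mentions would enter as extra terms in the hypothesis score.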
Contribution details
67-75.
2010.
published
Host publication details
Seventh International Conference on Formal Approaches to South Slavic and Balkan Languages
Tadić, Marko ; Dimitrova-Vulchanova, Mila ; Koeva, Svetla
Zagreb: Hrvatsko društvo za jezične tehnologije
978-953-55375-2-6
Conference details
Seventh International Conference on Formal Approaches to South Slavic and Balkan Languages
lecture
04.10.2010-06.10.2010
Dubrovnik, Croatia