Lemmatization and Morphosyntactic Tagging of Croatian and Serbian

Agić, Željko; Ljubešić, Nikola; Merkler, Danijela

Data source: CROSBI

Lemmatization and Morphosyntactic Tagging of Croatian and Serbian (CROSBI ID 599073)

Conference paper in proceedings | original scientific paper | international peer review

Agić, Željko; Ljubešić, Nikola; Merkler, Danijela. Lemmatization and Morphosyntactic Tagging of Croatian and Serbian // Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing. Sofia: Association for Computational Linguistics (ACL), 2013. pp. 48-57

Responsibility details

Authors

Agić, Željko; Ljubešić, Nikola; Merkler, Danijela

Basic information in the original language

Language

English

Title

Lemmatization and Morphosyntactic Tagging of Croatian and Serbian

Abstract

We investigate state-of-the-art statistical models for lemmatization and morphosyntactic tagging of Croatian and Serbian. The models stem from a new manually annotated SETIMES.HR corpus of Croatian, based on the SETimes parallel corpus. We train models on Croatian text and evaluate them on samples of Croatian and Serbian from the SETimes corpus and the two Wikipedias. Lemmatization accuracy for the two languages reaches 97.87% and 96.30%, while full morphosyntactic tagging accuracy using a 600-tag tagset peaks at 87.72% and 85.56%, respectively. Part-of-speech tagging accuracies reach 97.13% and 96.46%. Results indicate that more complex methods of Croatian-to-Serbian annotation projection are not required on such dataset sizes for these particular tasks. The SETIMES.HR corpus, its resulting models and test sets are all made freely available.

Keywords

lemmatization; tagging; Croatian; Serbian

Note

not recorded

Basic information in other languages

Language, title, abstract, keywords, note: not recorded

Paper details

Pages

48-57.

Year of publication

2013.

Publication status

published

Host publication details

Title

Proceedings of the 4th Biennial International Workshop on Balto-Slavic Natural Language Processing

Publisher

Sofia: Association for Computational Linguistics (ACL)

Conference details

Conference

4th Biennial International Workshop on Balto-Slavic Natural Language Processing (BSNLP 2013)

Type of participation

lecture

Conference dates

08.08.2013-09.08.2013

Conference location

Sofia, Bulgaria

Related records

Related persons

Nikola Ljubešić (CroRIS ID: 4119; MBZ: 272820) (author)

Željko Agić (CroRIS ID: 27179; MBZ: 291312) (author)

Related institutions

Filozofski fakultet u Zagrebu (130) (author's institution)

Related projects

Računalna sintaksa hrvatskoga jezika (result of work on the project)

Field

Information and communication sciences

Links

aclweb.org