Croatian Scientific Bibliography

Bibliographic record number: 700038

Conference proceedings paper

Authors: Beliga, Slobodan; Martinčić-Ipšić, Sanda
Title: Non-Standard Words as Features for Text Categorization
Source: MIPRO-CIS / Ribarić, Slobodan; Budin, Andrea (eds.). Opatija: MIPRO, 2014, pp. 1415-1419 (ISBN: 978-953-233-078-6).
Conference: IEEE 37th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO 2014)
Place and date: Opatija, Croatia, 26-30 May 2014
Keywords: text categorization; non-standard words; collection representation; features; accuracy
Abstract:
This paper presents the categorization of Croatian texts using Non-Standard Words (NSWs) as features. Non-Standard Words include numbers, dates, acronyms, abbreviations, currency, etc. NSWs in the Croatian language are determined according to the Croatian NSW taxonomy. For the purpose of this research, 390 text documents were collected to form the SKIPEZ collection, which has 6 classes: official, literary, informative, popular, educational and scientific. The text categorization experiment was conducted on three different representations of the SKIPEZ collection: in the first representation, the frequencies of NSWs are used as features; in the second, statistical measures of NSWs (variance, coefficient of variation, standard deviation, etc.) are used as features; the third combines the first two feature sets. Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms were used in the text categorization experiments. The best categorization results were achieved using the first feature set (NSW frequencies), with a categorization accuracy of 87%. This suggests that NSWs should be considered as features in highly inflectional languages such as Croatian. NSW-based features reduce the dimensionality of the feature space without standard lemmatization procedures, and therefore the bag-of-NSWs representation should be considered for further Croatian text categorization experiments.
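
The three representations described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the regular expressions below are hypothetical stand-ins for the Croatian NSW taxonomy, the function names are invented for illustration, and computing the statistical measures over one document's per-category counts is an assumption about what the second representation contains.

```python
import re
import statistics

# Hypothetical, heavily simplified NSW categories (NOT the Croatian NSW taxonomy).
# Order matters: more specific patterns run first and their matches are removed.
NSW_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}\.\d{1,2}\.\d{4}\."),
    "currency": re.compile(r"\b\d+(?:,\d+)?\s?(?:kn|EUR|USD)\b"),
    "acronym": re.compile(r"\b[A-ZČĆŠŽĐ]{2,}\b"),
    "number": re.compile(r"\b\d+\b"),
}


def nsw_frequencies(text):
    """Representation 1 (sketch): count of NSWs per category in one document."""
    counts = {}
    remaining = text
    for name, pattern in NSW_PATTERNS.items():
        # subn blanks out each match so later patterns cannot count it again.
        remaining, n = pattern.subn(" ", remaining)
        counts[name] = n
    return counts


def nsw_statistics(counts):
    """Representation 2 (assumed form): summary statistics of the counts."""
    values = list(counts.values())
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)
    return {
        "variance": statistics.pvariance(values),
        "std_dev": std_dev,
        "coeff_variation": std_dev / mean if mean else 0.0,
    }


def combined_features(text):
    """Representation 3 (sketch): frequencies followed by statistical measures."""
    counts = nsw_frequencies(text)
    stats = nsw_statistics(counts)
    return list(counts.values()) + list(stats.values())


sample = "Dana 26.5.2014. HAZU i MIPRO platili su 100 kn za 3 rada i 12 postera."
counts = nsw_frequencies(sample)
stats = nsw_statistics(counts)
features = combined_features(sample)
```

Each document then becomes a short fixed-length vector, which is the dimensionality reduction the abstract refers to; such vectors could be passed to any of the classifiers listed above.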
Type of participation: Lecture
Type of presentation in proceedings: Full paper (more than 1500 words)
Type of peer review: International peer review
Original language: ENG
Category: Scientific
Scientific fields: Computing, Information and communication sciences
URL: http://docs.mipro-proceedings.com/cis/CIS_15_2777.pdf
Entered in CROSBI by: Sanda Martinčić-Ipšić (smarti@inf.uniri.hr), 4 June 2014 at 14:20


