The applicability of lemmatisation in translation equivalents detection (CROSBI ID 28289)
Book chapter | original scientific paper
Statement of responsibility
Tadić, Marko ; Fulgosi, Sanja ; Šojat, Krešimir
English
The applicability of lemmatisation in translation equivalents detection
The aim of the research is to help identify translation equivalents (TEs) in 1:1 aligned sentences at the level of single-word units. The research is based on the Croatian-English parallel corpus compiled at the University of Zagreb. The method is entirely statistical, with no linguistic filter applied before or after processing, and has three steps: 1) generation of all possible pairs of tokens from 1:1 aligned sentences (Cartesian product); 2) application of mutual information (MI) to the generated pairs in order to detect candidates for real TEs; 3) sorting the pairs by calculated MI and choosing real TEs for further use. The same method was applied to non-lemmatized and lemmatized material. The latter showed 4.5% higher precision, which confirms our hypothesis that for the Croatian-English pair (and possibly for other morphologically rich languages like Croatian) the lemmatized form of corpus data helps statistical methods of TE detection.
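The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `rank_pairs`, the sentence-level counting of co-occurrences, and the use of pointwise mutual information with a base-2 logarithm are all assumptions made for illustration, since the abstract does not specify how MI was estimated.

```python
import math
from collections import Counter
from itertools import product


def rank_pairs(aligned):
    """Rank candidate translation pairs by pointwise mutual information.

    `aligned` is a list of (source_tokens, target_tokens) tuples, one per
    1:1 aligned sentence pair. Tokens may be word forms or lemmas.
    Assumption: probabilities are estimated from sentence-level counts.
    """
    n = len(aligned)
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()

    for src, tgt in aligned:
        src_set, tgt_set = set(src), set(tgt)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        # Step 1: Cartesian product of the tokens in the two sentences.
        pair_count.update(product(src_set, tgt_set))

    # Step 2: MI(s, t) = log2( p(s, t) / (p(s) * p(t)) ),
    # with each p estimated as count / n.
    scores = {
        (s, t): math.log2(c * n / (src_count[s] * tgt_count[t]))
        for (s, t), c in pair_count.items()
    }

    # Step 3: sort by descending MI; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

Running the same function once on word forms and once on lemmas would mirror the comparison described in the abstract. With lemmas, the inflected forms of a Croatian word are counted together, so its co-occurrence counts are less sparse. Plain MI is known to overrate pairs seen only once, so a real pipeline would likely add a frequency threshold; that is also an assumption, not something the abstract states.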
Croatian Language, English Language, Croatian-English Parallel Corpus, parallel corpus, lemmatization, translation equivalents, translation equivalents detection
Chapter information
195-206-x.
published
Book information
Barnbrook, Geoff ; Danielsson, Pernilla ; Mahlberg, Michaela
London : New York (NY): Continuum International Publishing Group
2004.
082647490X