Data source: CROSBI

Functional classification of Adenylation domains by Latent Semantic Indexing (LSI) (CROSBI ID 365101)

Thesis | graduate (diploma) thesis

Baranašić, Damir. Functional classification of Adenylation domains by Latent Semantic Indexing (LSI) / Starčević, Antonio (mentor); Žučko, Jurica (direct supervisor). Zagreb, Prehrambeno-biotehnološki fakultet, 2011

Responsibility details

Baranašić, Damir

Starčević, Antonio

Žučko, Jurica

English

Functional classification of Adenylation domains by Latent Semantic Indexing (LSI)

Latent semantic indexing (LSI) is an information retrieval method that has relatively recently been introduced into computational biology. In this work, LSI was adapted to predict the amino acid substrates activated by adenylation domains (A-domains). A-domains are obligatory subunits of non-ribosomal peptide synthetase (NRPS) modules; they recognise and activate the amino acid to be incorporated into the final product, a non-ribosomally synthesised peptide. Knowing the specific substrate of every sequenced A-domain would make it possible to predict the final product of a linear NRPS and perhaps to design novel biologically active natural products. Two methods were used to vectorize A-domain protein sequences and to construct the resulting term-document matrix: the “n-grams” method and a novel “tokenization” method. The “n-grams” method finds n-peptides in the protein sequence, whereas the “tokenization” method creates specific “tokens” that couple amino acid residues with their corresponding positions in the multiple sequence alignment. LSI uses a mathematical method called singular value decomposition (SVD) to reduce the unreliable information in the term-document matrix. The number of dimensions used in the analysis was obtained computationally and was found to agree with the empirically obtained optimal number of dimensions. Predictions were satisfactory with both “n-grams” and “tokenization” as vectorization methods, and the “tokenization” method generally showed better precision and robustness. A novel clustering method based on LSI was also developed. It showed satisfactory clustering results without the need to guess the number of clusters in advance, as methods such as k-means clustering require.
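
The pipeline described in the abstract (vectorization, term-document matrix, SVD-based dimension reduction) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the raw-count weighting, the token format (alignment column index plus residue), the choice k = 2, and the toy aligned sequences are all assumptions made for illustration only.

```python
from collections import Counter

import numpy as np


def ngram_terms(seq, n=3):
    """Return all overlapping n-peptides of an unaligned protein sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def position_tokens(aligned_seq):
    """Couple each residue with its alignment column; gap columns are skipped.

    The token format "<column><residue>" is a hypothetical choice.
    """
    return [f"{i}{aa}" for i, aa in enumerate(aligned_seq) if aa != "-"]


def term_document_matrix(docs):
    """Build a terms x documents matrix of raw term counts."""
    vocab = sorted({term for doc in docs for term in doc})
    row = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term, count in Counter(doc).items():
            matrix[row[term], j] = count
    return vocab, matrix


def lsi_document_vectors(matrix, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # Each row of the result is one document expressed in the top-k dimensions.
    return (s[:k, None] * vt[:k]).T


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Hypothetical aligned fragments, invented for this example only.
toy_alignment = ["DAWTIAAV", "DAWTIAAI", "DLTK-GHV"]

vocab, matrix = term_document_matrix([position_tokens(s) for s in toy_alignment])
doc_vectors = lsi_document_vectors(matrix, k=2)

print(cosine(doc_vectors[0], doc_vectors[1]))  # similar fragments
print(cosine(doc_vectors[0], doc_vectors[2]))  # dissimilar fragments
```

In a sketch like this, a substrate prediction for a new A-domain could be made by projecting its vector into the same latent space and taking the label of the most similar annotated sequence; how the thesis actually scores, weights, or selects the number of dimensions is not stated in the abstract.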

LSI; A-domains; protein tokenization; protein clustering; SVD; dimension reduction; specificity prediction

Publication details

37

01.07.2011.

defended

Degree-granting institution

Prehrambeno-biotehnološki fakultet

Zagreb

Related field

Biotechnology