An XML format for peptide datasets

Repar, Jelena; Škunca, Nives; Supek, Fran; Šmuc, Tomislav

Data source: CROSBI

An XML format for peptide datasets (CROSBI ID 543880)

Conference contribution in proceedings | abstract of a conference presentation | international peer review

Repar, Jelena ; Škunca, Nives ; Supek, Fran ; Šmuc, Tomislav An XML format for peptide datasets // ECCB'08 European Conference on Computational Biology. 2008

Responsibility information

Authors

Repar, Jelena ; Škunca, Nives ; Supek, Fran ; Šmuc, Tomislav

Basic information in the original language
Basic information in other languages

Language

English

Title

An XML format for peptide datasets

Abstract

A number of papers in the recent scientific literature deal with a set of seemingly dissimilar bioinformatic problems. Their connecting motif is that they all focus on sequences of peptides, or fragments of proteins. Generally, the peptide sequences in question have first been characterized by wet lab experiments. The results have been published in a paper and possibly stored in a highly specialized database. To use the sequences for bioinformatic analyses of the processes that generated them, a researcher needs to extract the peptides from the database, analyze them and form a dataset. Because this last step is time-consuming and arduous, other scientists tend to re-use such datasets in attempts to arrive at even better bioinformatic models and, hopefully, a better understanding of the processes. The peptide datasets we have collected so far can be roughly divided into four categories, with the possibility of adding more as the need arises:
a) posttranslational modification of proteins (e.g. phosphorylation)
b) cleavage by broad-specificity proteases (e.g. proteasome, HIV-1 protease)
c) determination of protein secondary structure
d) epitope recognition (e.g. T-cell epitope recognition)
As many of these processes are of medical as well as biological importance, there is great interest in their further study and explanation. The peptide datasets can all generally be viewed as sets of amino acid sequences to which a class label has been assigned experimentally. However unconnected their underlying problems may seem at first glance (e.g. prediction of protein secondary structure, prediction of phosphorylation sites in proteins), they often all serve for the construction of classification models by similar supervised machine learning approaches.
The final aim of the computational approaches is to develop a model that will best serve for in silico predictions of events occurring in living cells. Most attempts at modelling these problems have run into the question of how to represent amino acid sequences numerically, a numerical representation being necessary for many classification algorithms. Various approaches to finding the best representation have been taken, and they commonly compare amino acid representations on the same problem using the same classification method. Only rarely have researchers compared optimal amino acid representations between different problems. In light of the similarity between the various peptide classification problems, one cannot help wondering whether there is an optimal amino acid representation that would work best across different problems and that would explain which amino acid features are most prominent in shaping the natural processes in question. It is the aim of our future research to address this question in more detail. Despite the striking similarity between the peptide modelling problems, there is no established flow of information. Quite a few of the field-specialized databases are readily available, but they need extensive pre-processing to yield a dataset appropriate for modelling. Additionally, because of frequent database updates and vague descriptions of how datasets were produced, it is hard for different scientists to arrive at exactly the same dataset, and even slight variations in the datasets would invalidate a comparison of modelling approaches. Although some datasets are available on request, there is no established format for dataset exchange, so more time is wasted on managing different data formats. Therefore, easing data exchange by standardizing peptide dataset formats is a fundamental requirement for better research in protein structural biology.
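The numerical representation of amino acid sequences discussed above can be illustrated with orthogonal (one-hot) encoding, one commonly used choice. The abstract does not say which representations the authors compare, so this is only a minimal sketch; the function name `one_hot_encode` and the ordering of the alphabet are assumptions made for illustration.

```python
# Hypothetical illustration: one-hot encoding of a peptide.
# The alphabet ordering and function name are assumptions, not from the abstract.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, one-letter codes


def one_hot_encode(peptide):
    """Return a flat list of 0/1 values, 20 positions per residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vector = []
    for residue in peptide.upper():
        if residue not in index:
            raise ValueError(f"unknown residue: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[index[residue]] = 1  # mark the position of this residue
        vector.extend(row)
    return vector


if __name__ == "__main__":
    encoded = one_hot_encode("ACD")
    print(len(encoded))  # 3 residues x 20 positions
```

Each peptide of length n becomes a vector of 20n values that a standard classifier can consume. Other representations (for example, physicochemical property scales) would replace the 0/1 rows with real-valued ones.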
We have chosen Extensible Markup Language (XML) for such a data format. XML has been shown to be highly efficient at storing data in an orderly, researcher-comprehensible manner. It is both robust and extensible, which ensures not only correct data distribution but also adaptability when unforeseen needs arise. We hereby propose an XML format for peptide sequence datasets, which we hope will lead to more efficient information exchange within this specific area of protein structural biology. In the future, we hope to further stimulate the information flow by building a peptide dataset repository.
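The abstract does not specify the proposed schema, so the following is only a hypothetical sketch of what a labelled peptide dataset in XML could look like and how it might be read in Python. The element and attribute names (`peptideDataset`, `peptide`, `sequence`, `label`, `category`, `id`) and the sequences themselves are invented placeholders, not the authors' format or real experimental data.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: all element names, attributes, sequences and labels
# are placeholders invented for illustration.
EXAMPLE_XML = """<peptideDataset category="protease cleavage">
  <peptide id="p1">
    <sequence>ACDEFGHIK</sequence>
    <label>cleaved</label>
  </peptide>
  <peptide id="p2">
    <sequence>LMNPQRSTV</sequence>
    <label>not_cleaved</label>
  </peptide>
</peptideDataset>
"""


def load_peptides(xml_text):
    """Parse the hypothetical format into (sequence, class label) pairs."""
    root = ET.fromstring(xml_text)
    records = []
    for peptide in root.findall("peptide"):
        sequence = peptide.findtext("sequence", default="").strip()
        label = peptide.findtext("label", default="").strip()
        records.append((sequence, label))
    return records


if __name__ == "__main__":
    for sequence, label in load_peptides(EXAMPLE_XML):
        print(sequence, label)
```

Keeping each sequence next to its experimentally assigned class label mirrors the view of peptide datasets given in the abstract. Because XML is extensible, further elements (for example, source database or experimental method) could be added without breaking readers like this one.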

Keywords

XML; peptide; datasets; standardization

Note

Not recorded

Language

Not recorded

Title

Not recorded

Abstract

Not recorded

Keywords

Not recorded

Note

Not recorded

Contribution information

Year of publication

2008

Publication status

Published

Host publication information

Title

ECCB'08 European Conference on Computational Biology

Conference information

Conference

ECCB'08 European Conference on Computational Biology

Type of participation

Poster

Conference dates

22.09.2008-26.09.2008

Conference location

Cagliari, Italy

Related records

Related persons

Tomislav Šmuc (author)

Fran Supek (author)

Nives Škunca (author)

Jelena Repar (author)

Related institutions

Institut Ruđer Bošković (098) (author's institution)

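Related projects

Strojno učenje prediktivnih modela u računalnoj biologiji [Machine learning of predictive models in computational biology] (result of work on the project)

Molekularni mehanizmi rekombinacije i popravka DNA [Molecular mechanisms of DNA recombination and repair] (result of work on the project)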

Field

Computer science, Biotechnology, Biology

Links

eccb08.org