Improvement of Ensemble of Multi-Regression Structure-Toxicity Models by Clustering of Molecules in Descriptor Space (CROSBI ID 549928)
Conference contribution in proceedings | original scientific paper | international peer review
Responsibility details
Bašic, Ivan ; Lučić, Bono ; Nikolić, Sonja ; Papeš-Šokčević, Lidija ; Nadramija, Damir
English
Improvement of Ensemble of Multi-Regression Structure-Toxicity Models by Clustering of Molecules in Descriptor Space
For the data set published by Russom et al. (Environ. Toxicol. Chem. 16, 948-967 (1997)), which contains 704 organic molecules with measured acute aquatic toxicity (96-h LC50 tests), we calculated more than 1400 molecular descriptors with the Dragon 5.0 program.[1] After excluding descriptors with almost constant values and those with very low correlation with the logarithm of the LC50 values on the training set, about 620 descriptors remained and were used in the modeling process. The molecules were randomly partitioned into a training set of 560 molecules and a test set of 144 molecules. We developed and compared two kinds of ensembles of both linear and nonlinear multi-regression models: (1) normal ensembles and (2) ensembles obtained by clustering molecules according to their similarity (clustered ensembles). Molecules were clustered by calculating their Euclidean distances in normalized descriptor space. In this method, the final model for a selected test-set molecule is developed only on those training-set molecules that are close to it (as measured by Euclidean distance in normalized descriptor space). Although the results obtained by normal ensembles are very good (e.g. nonlinear ensemble of 8-descriptor models: rtrain = 0.91, strain = 0.54, rtest = 0.76, stest = 0.80), a significant improvement is obtained by taking the clustering of molecules into account when developing ensembles of linear models (e.g. 200 3-descriptor models in the ensemble: rtrain = 0.91, strain = 0.53, rtest = 0.836, stest = 0.70; or 200 5-descriptor models in the ensemble: rtrain = 0.94, strain = 0.45, rtest = 0.84, stest = 0.70). These results clearly indicate that information about similarity between molecules can improve structure-toxicity models, and we expect that this holds more generally.
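The workflow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the thresholds `min_std` and `min_abs_corr`, the neighbourhood size `k`, and the use of random descriptor subsets in place of the paper's descriptor selection are all assumptions made for illustration. Only linear models are shown.

```python
import numpy as np

def filter_descriptors(X, y, min_std=1e-6, min_abs_corr=0.1):
    """Return indices of descriptors that are neither near-constant nor
    weakly correlated with y (both thresholds are hypothetical)."""
    keep = X.std(axis=0) > min_std
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum())
    corr = np.zeros(X.shape[1])
    corr[keep] = (Xc[:, keep].T @ yc) / denom[keep]
    keep &= np.abs(corr) >= min_abs_corr
    return np.flatnonzero(keep)

def nearest_training_molecules(X_train, x_query, k):
    """Indices of the k training molecules closest to x_query, using
    Euclidean distance in z-score normalized descriptor space."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero
    Z = (X_train - mean) / std
    z = (x_query - mean) / std
    dist = np.linalg.norm(Z - z, axis=1)
    return np.argsort(dist)[:k]

def clustered_ensemble_predict(X_train, y_train, x_query,
                               n_models=20, n_desc=3, k=50, seed=0):
    """Average the predictions of n_models linear models, each fitted only
    on the k nearest training molecules. Descriptor subsets are drawn at
    random here, which is an assumption, not the published procedure."""
    rng = np.random.default_rng(seed)
    idx = nearest_training_molecules(X_train, x_query, k)
    X_loc, y_loc = X_train[idx], y_train[idx]
    preds = []
    for _ in range(n_models):
        cols = rng.choice(X_train.shape[1], size=n_desc, replace=False)
        A = np.column_stack([np.ones(len(idx)), X_loc[:, cols]])
        coef, *_ = np.linalg.lstsq(A, y_loc, rcond=None)
        preds.append(coef[0] + x_query[cols] @ coef[1:])
    return float(np.mean(preds))
```

In this sketch a separate local model set is fitted for every query molecule, so the cost grows with the size of the test set; a "normal" ensemble in the abstract's sense would correspond to setting `k` to the full training-set size.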
Acute aquatic toxicity; Organic molecules; QSAR models; Molecular descriptors; Distance based similarity; Clustering of molecules; Ensemble of multi-regression models; Clustered ensembles
doi:10.1063/1.3225331
Contribution details
408-411.
2009.
published
Host publication details
International Conference of Computational Methods in Sciences and Engineering 2008 ; Special Volume of the American Institute of Physics (AIP) - Conference Proceedings of ICCMSE 2008. Vol. 1148
Simos, Theodore
Melville (NY): American Institute of Physics (AIP)
978-0-7354-0685-8
Conference details
Unknown conference
poster