TY - JOUR
T1 - Uninformative variable elimination assisted by Gram-Schmidt Orthogonalization/successive projection algorithm for descriptor selection in QSAR
AU - Omidikia, Nematollah
AU - Kompany-Zareh, Mohsen
PY - 2013
Y1 - 2013
N2 - This study reports the use of Uninformative Variable Elimination (UVE) as a robust variable selection method. Each regression coefficient represents the contribution of the corresponding variable to the established model, but in the presence of uninformative variables and collinearity, the magnitude of a regression coefficient is not a reliable measure of that contribution. Successive Projection Algorithm (SPA) and Gram-Schmidt Orthogonalization (GSO) were implemented as pre-selection techniques for removing collinearity and redundancy among the variables in the model. Uninformative variable elimination-partial least squares (UVE-PLS) was then performed on the pre-selected data set, and a C value was calculated for each descriptor. The C values of UVE assisted by SPA or GSO can then be used to rank the variables according to their importance. Leave-many-out cross-validation (LMO-CV) was applied to the ordered descriptors to select the optimal number of descriptors. Two data sets were used: the Selwood data (31 molecules, 53 descriptors) and an anti-HIV data set (107 molecules, 160 descriptors). With GSO pre-selection for the Selwood data and SPA pre-selection for the anti-HIV data, the results were satisfactory in both the predictive ability of the constructed model and the number of selected informative descriptors. Applying GSO-UVE-PLS to the Selwood data under optimized conditions selected seven of the 53 descriptors, with q2=0.769 and R2=0.915. Applying SPA-UVE-PLS to the anti-HIV data selected nine of the 160 descriptors, with q2=0.81, R2=0.84 and Q2F3=0.8.
AB - This study reports the use of Uninformative Variable Elimination (UVE) as a robust variable selection method. Each regression coefficient represents the contribution of the corresponding variable to the established model, but in the presence of uninformative variables and collinearity, the magnitude of a regression coefficient is not a reliable measure of that contribution. Successive Projection Algorithm (SPA) and Gram-Schmidt Orthogonalization (GSO) were implemented as pre-selection techniques for removing collinearity and redundancy among the variables in the model. Uninformative variable elimination-partial least squares (UVE-PLS) was then performed on the pre-selected data set, and a C value was calculated for each descriptor. The C values of UVE assisted by SPA or GSO can then be used to rank the variables according to their importance. Leave-many-out cross-validation (LMO-CV) was applied to the ordered descriptors to select the optimal number of descriptors. Two data sets were used: the Selwood data (31 molecules, 53 descriptors) and an anti-HIV data set (107 molecules, 160 descriptors). With GSO pre-selection for the Selwood data and SPA pre-selection for the anti-HIV data, the results were satisfactory in both the predictive ability of the constructed model and the number of selected informative descriptors. Applying GSO-UVE-PLS to the Selwood data under optimized conditions selected seven of the 53 descriptors, with q2=0.769 and R2=0.915. Applying SPA-UVE-PLS to the anti-HIV data selected nine of the 160 descriptors, with q2=0.81, R2=0.84 and Q2F3=0.8.
KW - QSAR
KW - Ordered variable selection
KW - Gram-Schmidt Orthogonalization
KW - Selwood data
KW - Anti-HIV drugs
KW - Uninformative Variable Elimination
U2 - 10.1016/j.chemolab.2013.07.008
DO - 10.1016/j.chemolab.2013.07.008
M3 - Journal article
SN - 0169-7439
VL - 128
SP - 56
EP - 65
JO - Chemometrics and Intelligent Laboratory Systems
JF - Chemometrics and Intelligent Laboratory Systems
ER -