Significance of the structure of data in partial least squares regression predictions involving both natural and human experimental design

Åsmund Rinnan, Lars Munck

    4 Citations (Scopus)
    11 Downloads (Pure)

    Abstract

    When the chemical composition of food samples is predicted from near-infrared spectroscopy using partial least squares regression, deep knowledge of the origin of the information is lacking. We aim to open the Pandora's box of how the prediction of protein proceeds in a unique set of chemically diverse barley mutant samples. An external validation of the sources of co-variation in nature that are exploited by chemometric models would give a framework for manipulating the decisive information to make expensive calibration more economical. The barley samples were supplemented by two designed data sets: one mirroring the coarse composition of the barley samples by mixing six main chemical components, and one in which the biological covariance between the six chemical components had been reduced. The three original data sets give remarkably comparable prediction models, although their regression coefficients are quite different. The origin of the prediction ability of the data is elucidated by splitting the natural barley data into two parts: one based on simulated biology extracted from a set of chemical mixtures, and the residual after the chemistry has been removed from the raw data. As much as 98.1% of the spectral information in the natural barley data is explained by the simulated biology, leaving as little as 1.9% for the unexplained biological variation and noise. However, the unexplained biological variation still gives a fair prediction of protein (RMSECV = 1.23 and r² = 0.80, compared with RMSECV = 0.46 and r² = 0.97 for the natural data), and it gives a clear principal component analysis separation of the three genotype classes.
The results were interpreted by spectral inspection of the origin of the unique covariate patterns that appear in self-organised biological systems. These patterns should motivate researchers and industry to investigate the compressive effect that the model has on the essential deterministic biological data.
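The figures of merit quoted in the abstract (RMSECV, r², and the percentage of spectral information explained) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that r² denotes the squared Pearson correlation between reference and cross-validated predicted values, and that the explained percentage is a ratio of sums of squares between the residual and the raw data; the abstract does not spell out either definition. The function names and all numbers below are hypothetical and are not taken from the paper.

```python
import math


def rmsecv(reference, cv_predicted):
    """Root mean square error of cross-validation.

    `cv_predicted` holds the prediction for each sample made by a model
    that did not see that sample during calibration.
    """
    if not reference or len(reference) != len(cv_predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - p) ** 2 for y, p in zip(reference, cv_predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def squared_correlation(reference, cv_predicted):
    """Squared Pearson correlation (assumed meaning of r2 in the abstract)."""
    n = len(reference)
    mean_y = sum(reference) / n
    mean_p = sum(cv_predicted) / n
    cov = sum((y - mean_y) * (p - mean_p) for y, p in zip(reference, cv_predicted))
    ss_y = sum((y - mean_y) ** 2 for y in reference)
    ss_p = sum((p - mean_p) ** 2 for p in cv_predicted)
    return cov ** 2 / (ss_y * ss_p)


def explained_fraction(raw_spectra, residual_spectra):
    """Share of the raw sum of squares that is not left in the residual.

    Assumption: both inputs are lists of rows (one spectrum per row) and
    have already been pre-processed in the same way.
    """
    ss_raw = sum(v * v for row in raw_spectra for v in row)
    ss_residual = sum(v * v for row in residual_spectra for v in row)
    return 1.0 - ss_residual / ss_raw


# Hypothetical protein values (% dry matter), for illustration only.
reference = [10.0, 12.0, 14.0, 16.0]
cv_predicted = [10.5, 11.5, 14.5, 15.5]
print(rmsecv(reference, cv_predicted))
print(squared_correlation(reference, cv_predicted))

# Hypothetical one-sample, two-wavelength "spectra".
print(explained_fraction([[3.0, 4.0]], [[0.3, 0.4]]))
```

Under these assumptions, a residual that carries only a small share of the spectral sum of squares can still correlate with protein, which is the situation the abstract reports for the 1.9% residual.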

    Original language: English
    Journal: Journal of Chemometrics
    Volume: 26
    Issue number: 8-9
    Pages (from-to): 487-495
    Number of pages: 9
    ISSN: 0886-9383
    DOI
    Status: Published - Apr. 2012
