Metabolomics: from a chemometric point of view

Maja Hermann Kamstrup-Nielsen

Abstract

Metabolomics is the analysis of the whole metabolome, and metabolomics
studies aim to measure as many metabolites as possible. The use of
chemometrics in metabolomics studies is widespread, but there is a clear
lack of validation of the developed models. The focus of this thesis has been
how to handle complex metabolomics data properly in order to achieve
reliable and valid multivariate models. This has been illustrated by three
case studies with examples of forecasting breast cancer and early detection of
colorectal cancer based on data from nuclear magnetic resonance (NMR)
spectroscopy (Paper II), fluorescence spectroscopy (Paper III) and gas
chromatography coupled to mass spectrometry (GC-MS).
The principles of the three data acquisition techniques have been briefly
described and the methods have been compared. The techniques complement
each other, which makes room for data fusion where data from different
platforms can be combined.
Complex data are obtained when samples are analysed using NMR,
fluorescence and GC-MS. Chemometric methods that can be used to
extract the relevant information from the obtained data are presented. Focus
has been on principal component analysis (PCA), parallel factor analysis
(PARAFAC), PARAFAC2 and partial least squares discriminant analysis
(PLS-DA), all of which are described in depth. It can be a challenge to
determine the appropriate number of components in PARAFAC2, since no
specific tools have been developed for this purpose. Paper I presents a core
consistency diagnostic that aids in determining the number of components in a
PARAFAC2 model. Validation is especially important for PLS-DA models; if it
is not done properly, the developed models might reveal spurious
groupings. Furthermore, data from metabolomics studies contain many
redundant variables. It is suggested that these be eliminated using an
approach termed reduction of redundant variables (RRV), which is time
consuming but efficient, since it mitigates the curse of dimensionality and
decreases the risk of over-fitting.
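The risk of spurious groupings when variables far outnumber samples can be illustrated with a minimal sketch. It does not use PLS-DA or RRV; a simple nearest-centroid classifier is swapped in as a stand-in, and the data are pure random noise with arbitrary class labels. The sample sizes, the number of variables and all function names are assumptions made for the illustration only.

```python
import random


def centroid(rows):
    """Column-wise mean of a list of equally long samples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two samples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(sample, centroids):
    """Assign the class whose centroid lies closest to the sample."""
    return min(centroids, key=lambda label: sq_dist(sample, centroids[label]))


def accuracy(samples, labels, centroids):
    """Fraction of samples assigned to their labelled class."""
    hits = sum(predict(s, centroids) == y for s, y in zip(samples, labels))
    return hits / len(samples)


def noise_data(rng, n_samples, n_vars):
    """Pure noise; labels alternate 0/1 and carry no information."""
    samples = [[rng.gauss(0.0, 1.0) for _ in range(n_vars)]
               for _ in range(n_samples)]
    labels = [i % 2 for i in range(n_samples)]
    return samples, labels


rng = random.Random(0)  # fixed seed so the run is repeatable
train_x, train_y = noise_data(rng, 20, 1000)   # few samples, many variables
test_x, test_y = noise_data(rng, 200, 1000)    # independent test set

# "Fit" the classifier: one centroid per class, from training data only.
centroids = {
    label: centroid([x for x, y in zip(train_x, train_y) if y == label])
    for label in (0, 1)
}

train_acc = accuracy(train_x, train_y, centroids)
test_acc = accuracy(test_x, test_y, centroids)
print(f"training accuracy: {train_acc:.2f}")
print(f"test accuracy:     {test_acc:.2f}")
```

Under these assumptions the training accuracy should come out far higher than the test accuracy, although the labels carry no information at all. An apparent separation seen only on the data used to build a model is therefore no evidence of a real grouping.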
The use of appropriate multivariate models in metabolomics studies has been
presented in the three case studies. In the first case study, plasma samples
from healthy individuals have been analysed by NMR. Some of these
individuals developed breast cancer later in life, and they have been
separated from those who remained healthy by means of a properly validated
PLS-DA model based on NMR data with RRV and known risk markers. The
sensitivity and specificity values are 0.80 and 0.79, respectively, for a
model validated with a test set.
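As a reminder of how such figures are defined, sensitivity is the fraction of true cases that a model flags, and specificity is the fraction of true controls that it leaves unflagged. The sketch below computes both from test-set predictions. The function name and the toy labels are hypothetical and are not taken from the thesis data.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary class labels.

    sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1  # case correctly flagged
            else:
                fn += 1  # case missed
        else:
            if pred == positive:
                fp += 1  # control wrongly flagged
            else:
                tn += 1  # control correctly left unflagged
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical test-set labels: 1 = case, 0 = control.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Both values are only meaningful for validation when `y_pred` comes from samples that were not used to build the model.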
The second case study is based on plasma samples from individuals with
verified colorectal cancer and three types of control samples analysed by
fluorescence
spectroscopy. The acquired data have been analysed by PARAFAC models
and the components from the PARAFAC models have been used as variables
in seven PLS-DA models in order to separate the cancer samples from the
control groups. Sensitivity and specificity values of approximately 0.75 make
fluorescence spectroscopy a potential tool in early detection of colorectal
cancer.
Finally, plasma samples have been analysed using GC-MS. The method
requires extensive sample preparation and therefore the study can only be
considered a feasibility study with room for optimization. However, 14 plasma
samples were analysed and the results indicate that GC-MS-based
metabolomics in combination with PARAFAC2 modelling is applicable for
extracting relevant biological information from the plasma samples.
Overall, the work in this thesis shows that suitable and properly validated
chemometrics models used in metabolomics are very useful in forecasting and
early detection of cancer. The use of chemometrics in metabolomics can, for example,
increase the understanding of the underlying etiology of cancer and could be
extended to cover other diseases as well.
Original language: English
Publisher: Department of Food Science, University of Copenhagen
Number of pages: 141
Publication status: Published - 2013
