Towards Improved Biomarker Research: Data Analytical Challenges of High-dimensional Biological Data

Karin Kjeldahl

Abstract

This thesis examines the data analytical challenges associated with the search for biomarkers in large-scale biological data such as transcriptomics, proteomics and metabolomics data. Such studies aim to identify genes, proteins or metabolites that can be associated with, for example, a diet, a disease (e.g. cancer), a drug response or a physiological status.
The value of these omics studies has to some extent been questioned, as claimed biomarkers have often proven very difficult to verify in other studies. Conversely, in many studies it is difficult to identify strong biomarkers at all when strict validation is applied; this latter phenomenon is to some extent masked by publication bias, but has been widely observed among researchers working with omics data.
In this thesis, the background of this apparently small effect size of the biomarkers is investigated, followed by some suggestions which can potentially improve the chances of a successful outcome of an omics study. A method widely applied in the analysis of omics studies is Partial Least Squares (PLS) regression, one of the workhorses of the chemometrics toolbox, which is used for both regression and classification purposes. This method has proven its worth in multivariate data analysis throughout an enormous range of applications; a very classic data type is near infrared (NIR) data, but many similar data types have also been very successful.
Against that background, the general characteristics of omics data are described and related to the characteristics of classical NIR-type data. This shows that omics data, which are generally much bigger data sets than classical data, are not just simple extensions of NIR data. The sample type, analytical method and application types are different and introduce greater complexity, weaker signals and many potential sources of experimental and analytical bias and error. The risk of the latter is further increased by the complexity of the entire omics experimental setup, which often involves various project partners with very specific competencies.
In order to optimize the basis for a sound and fruitful data analysis, suggestions are given which focus on (1) collection of good data, (2) preparation of data for the data analysis and (3) a sound data analysis. If these steps are optimized, PLS is also a very good method for the analysis of omics data.
The five research papers included in the thesis touch upon different aspects of the issues discussed in the thesis.
Original language: English
Publisher: Department of Food Science, Faculty of Science, University of Copenhagen
Number of pages: 146
Publication status: Published - 2013
