Towards Improved Biomarker Research: Data Analytical Challenges of High-dimensional Biological Data

Karin Kjeldahl

Abstract

This thesis examines the data analytical challenges associated with the search for biomarkers in large-scale biological data such as transcriptomics, proteomics and metabolomics data. These studies aim to identify genes, proteins or metabolites that can be associated with, for example, a diet, a disease (e.g. cancer), a drug response or a physiological status.
The value of these omics studies has to some extent been questioned, as claimed biomarkers have often proven very difficult to verify in other studies. Conversely, in many studies it is difficult to identify strong biomarkers at all when strict validation is applied; this phenomenon is to some extent masked by publication bias, but has been widely observed among researchers working with omics data.
In this thesis, the background of this apparently small effect size of the biomarkers is investigated, followed by suggestions that can potentially improve the chances of a successful outcome of an omics study. A method widely applied in the analysis of omics studies is Partial Least Squares (PLS) regression, one of the workhorses of the chemometrics toolbox, which is used for both regression and classification purposes. This method has proven its worth in multivariate data analysis across an enormous range of applications; a classic data type is near infrared (NIR) data, but the method has also been very successful on many similar data types.
Against that background, the general characteristics of omics data are described and related to the characteristics of classical NIR-type data. This shows that omics data, which are generally much larger data sets than classical data, are not just simple extensions of NIR data. The sample type, analytical method and application types are different and introduce greater complexity, weaker signals and many potential sources of experimental and analytical bias and error. The risk of the latter is further increased by the complexity of the entire omics experimental setup, which often involves various project partners with very specific competencies.
To optimize the basis for a sound and fruitful data analysis, suggestions are given which focus on (1) collection of good data, (2) preparation of data for the data analysis and (3) a sound data analysis. If these steps are optimized, PLS is also a very good method for the analysis of omics data.
The five research papers included in the thesis touch upon different aspects of the issues discussed in the thesis.

Original language	English

Publisher	Department of Food Science, Faculty of Science, University of Copenhagen
Number of pages	146
Status	Published - 2013

Access to document

PhD thesis - Karin Kjeldahl, Accepted manuscript, 7.05 MB

https://rex.kb.dk/primo-explore/fulldisplay?docid=KGL01009087031&context=L&vid=NUI&search_scope=KGL&tab=default_tab&lang=da_DK

Citation formats

@phdthesis{2081dea1ea2c40f7a3e782d9799b0125,

title = "Towards Improved Biomarker Research: Data Analytical Challenges of High-dimensional Biological Data",

abstract = "This thesis examines the data analytical challenges associated with the search for biomarkers in large-scale biological data such as transcriptomics, proteomics and metabolomics data. These studies aim to identify genes, proteins or metabolites that can be associated with, for example, a diet, a disease (e.g. cancer), a drug response or a physiological status. The value of these omics studies has to some extent been questioned, as claimed biomarkers have often proven very difficult to verify in other studies. Conversely, in many studies it is difficult to identify strong biomarkers at all when strict validation is applied; this phenomenon is to some extent masked by publication bias, but has been widely observed among researchers working with omics data. In this thesis, the background of this apparently small effect size of the biomarkers is investigated, followed by suggestions that can potentially improve the chances of a successful outcome of an omics study. A method widely applied in the analysis of omics studies is Partial Least Squares (PLS) regression, one of the workhorses of the chemometrics toolbox, which is used for both regression and classification purposes. This method has proven its worth in multivariate data analysis across an enormous range of applications; a classic data type is near infrared (NIR) data, but the method has also been very successful on many similar data types. Against that background, the general characteristics of omics data are described and related to the characteristics of classical NIR-type data. This shows that omics data, which are generally much larger data sets than classical data, are not just simple extensions of NIR data. The sample type, analytical method and application types are different and introduce greater complexity, weaker signals and many potential sources of experimental and analytical bias and error. The risk of the latter is further increased by the complexity of the entire omics experimental setup, which often involves various project partners with very specific competencies. To optimize the basis for a sound and fruitful data analysis, suggestions are given which focus on (1) collection of good data, (2) preparation of data for the data analysis and (3) a sound data analysis. If these steps are optimized, PLS is also a very good method for the analysis of omics data. The five research papers included in the thesis touch upon different aspects of the issues discussed in the thesis.",

author = "Karin Kjeldahl",

year = "2013",

language = "English",

publisher = "Department of Food Science, Faculty of Science, University of Copenhagen",

}

TY - BOOK

T1 - Towards Improved Biomarker Research

T2 - Data Analytical Challenges of High-dimensional Biological Data

AU - Kjeldahl, Karin

PY - 2013

Y1 - 2013

AB - This thesis examines the data analytical challenges associated with the search for biomarkers in large-scale biological data such as transcriptomics, proteomics and metabolomics data. These studies aim to identify genes, proteins or metabolites that can be associated with, for example, a diet, a disease (e.g. cancer), a drug response or a physiological status. The value of these omics studies has to some extent been questioned, as claimed biomarkers have often proven very difficult to verify in other studies. Conversely, in many studies it is difficult to identify strong biomarkers at all when strict validation is applied; this phenomenon is to some extent masked by publication bias, but has been widely observed among researchers working with omics data. In this thesis, the background of this apparently small effect size of the biomarkers is investigated, followed by suggestions that can potentially improve the chances of a successful outcome of an omics study. A method widely applied in the analysis of omics studies is Partial Least Squares (PLS) regression, one of the workhorses of the chemometrics toolbox, which is used for both regression and classification purposes. This method has proven its worth in multivariate data analysis across an enormous range of applications; a classic data type is near infrared (NIR) data, but the method has also been very successful on many similar data types. Against that background, the general characteristics of omics data are described and related to the characteristics of classical NIR-type data. This shows that omics data, which are generally much larger data sets than classical data, are not just simple extensions of NIR data. The sample type, analytical method and application types are different and introduce greater complexity, weaker signals and many potential sources of experimental and analytical bias and error. The risk of the latter is further increased by the complexity of the entire omics experimental setup, which often involves various project partners with very specific competencies. To optimize the basis for a sound and fruitful data analysis, suggestions are given which focus on (1) collection of good data, (2) preparation of data for the data analysis and (3) a sound data analysis. If these steps are optimized, PLS is also a very good method for the analysis of omics data. The five research papers included in the thesis touch upon different aspects of the issues discussed in the thesis.

UR - https://rex.kb.dk/primo-explore/fulldisplay?docid=KGL01009087031&context=L&vid=NUI&search_scope=KGL&tab=default_tab&lang=da_DK

M3 - Ph.D. thesis

BT - Towards Improved Biomarker Research

PB - Department of Food Science, Faculty of Science, University of Copenhagen

ER -