Probabilistic Methods for Processing High-Throughput Sequencing Signals

Lasse Maretty Sørensen

Abstract

High-throughput sequencing has the potential to answer many of the big questions in biology
and medicine. It can be used to determine the ancestry of species, to chart complex
ecosystems and to understand and diagnose disease. However, going from raw sequencing
data to biological or medical insights is far from trivial.
A key challenge is that these methods cannot read the input sequences in their entirety.
Due to technological constraints, they instead provide the sequences of very many fragments
of the input molecules. Furthermore, not all nucleotides in these fragments are measured
correctly and the final output of a typical experiment thus consists of hundreds of millions
of error-containing sequence fragments.
This thesis concerns the development of methods for transforming such a raw sequencing
signal into a simpler representation from which biological inferences can then be made.
Importantly, the fact that the fragments are short and contain errors implies that there
may be significant uncertainty associated with the signal. By using probabilistic models,
we are able to quantify this uncertainty and propagate it to downstream analyses.
The first chapter describes a new method for reconstructing transcript sequences from RNA
sequencing data. The method is based on a novel sparse prior distribution over transcript
abundances and is markedly more accurate than existing approaches. The second chapter
describes a new method for calling genotypes from a fixed set of candidate variants. The
method queries the reads using a graph representation of the variants and thereby mitigates
the reference bias that characterises standard genotyping methods. In the last chapter, we
apply this method to call the genotypes of 50 deeply sequenced parent-offspring trios from
the GenomeDenmark project. By estimating the genotypes on a set of candidate variants
obtained from both a standard mapping-based approach and de novo assemblies, we
are able to find considerably more structural variation than previous studies.

Original language	English

Publisher	Department of Biology, Faculty of Science, University of Copenhagen
Number of pages	173
Status	Published - 2016

Access to the document

PHD-Lasse Maretty (publisher's published version, 17.4 MB)

https://rex.kb.dk:443/KGL:KGL:KGL01009279906

Citation formats

@phdthesis{a6644421e2b8490ba2f5cc31cfee7806,

title = "Probabilistic Methods for Processing High-Throughput Sequencing Signals",

abstract = "High-throughput sequencing has the potential to answer many of the big questions in biology and medicine. It can be used to determine the ancestry of species, to chart complex ecosystems and to understand and diagnose disease. However, going from raw sequencing data to biological or medical insights is far from trivial. A key challenge is that these methods cannot read the input sequences in their entirety. Due to technological constraints, they instead provide the sequences of very many fragments of the input molecules. Furthermore, not all nucleotides in these fragments are measured correctly and the final output of a typical experiment thus consists of hundreds of millions of error-containing sequence fragments. This thesis concerns the development of methods for transforming such a raw sequencing signal into a simpler representation from which biological inferences can then be made. Importantly, the fact that the fragments are short and contain errors implies that there may be significant uncertainty associated with the signal. By using probabilistic models, we are able to quantify this uncertainty and propagate it to downstream analyses. The first chapter describes a new method for reconstructing transcript sequences from RNA sequencing data. The method is based on a novel sparse prior distribution over transcript abundances and is markedly more accurate than existing approaches. The second chapter describes a new method for calling genotypes from a fixed set of candidate variants. The method queries the reads using a graph representation of the variants and thereby mitigates the reference bias that characterises standard genotyping methods. In the last chapter, we apply this method to call the genotypes of 50 deeply sequenced parent-offspring trios from the GenomeDenmark project. By estimating the genotypes on a set of candidate variants obtained from both a standard mapping-based approach and de novo assemblies, we are able to find considerably more structural variation than previous studies.",

author = "S{\o}rensen, {Lasse Maretty}",

year = "2016",

language = "English",

publisher = "Department of Biology, Faculty of Science, University of Copenhagen",

}

TY - BOOK

T1 - Probabilistic Methods for Processing High-Throughput Sequencing Signals

AU - Sørensen, Lasse Maretty

PY - 2016

Y1 - 2016

N2 - High-throughput sequencing has the potential to answer many of the big questions in biology and medicine. It can be used to determine the ancestry of species, to chart complex ecosystems and to understand and diagnose disease. However, going from raw sequencing data to biological or medical insights is far from trivial. A key challenge is that these methods cannot read the input sequences in their entirety. Due to technological constraints, they instead provide the sequences of very many fragments of the input molecules. Furthermore, not all nucleotides in these fragments are measured correctly and the final output of a typical experiment thus consists of hundreds of millions of error-containing sequence fragments. This thesis concerns the development of methods for transforming such a raw sequencing signal into a simpler representation from which biological inferences can then be made. Importantly, the fact that the fragments are short and contain errors implies that there may be significant uncertainty associated with the signal. By using probabilistic models, we are able to quantify this uncertainty and propagate it to downstream analyses. The first chapter describes a new method for reconstructing transcript sequences from RNA sequencing data. The method is based on a novel sparse prior distribution over transcript abundances and is markedly more accurate than existing approaches. The second chapter describes a new method for calling genotypes from a fixed set of candidate variants. The method queries the reads using a graph representation of the variants and thereby mitigates the reference bias that characterises standard genotyping methods. In the last chapter, we apply this method to call the genotypes of 50 deeply sequenced parent-offspring trios from the GenomeDenmark project. By estimating the genotypes on a set of candidate variants obtained from both a standard mapping-based approach and de novo assemblies, we are able to find considerably more structural variation than previous studies.

UR - https://rex.kb.dk:443/KGL:KGL:KGL01009279906

M3 - Ph.D. thesis

BT - Probabilistic Methods for Processing High-Throughput Sequencing Signals

PB - Department of Biology, Faculty of Science, University of Copenhagen

ER -
