Abstract
High-throughput sequencing has the potential to answer many of the big questions in biology
and medicine. It can be used to determine the ancestry of species, to chart complex
ecosystems and to understand and diagnose disease. However, going from raw sequencing
data to biological or medical insights is far from trivial.
A key challenge is that these methods cannot read the input sequences in their entirety.
Due to technological constraints, they instead provide the sequences of very many fragments
of the input molecules. Furthermore, not all nucleotides in these fragments are measured
correctly and the final output of a typical experiment thus consists of hundreds of millions
of error-containing sequence fragments.
This thesis concerns the development of methods for transforming such a raw sequencing
signal into a simpler representation from which biological inferences can then be made.
Importantly, the fact that the fragments are short and contain errors implies that there
may be significant uncertainty associated with the signal. By using probabilistic models,
we are able to quantify this uncertainty and propagate it to downstream analyses.
The first chapter describes a new method for reconstructing transcript sequences from RNA
sequencing data. The method is based on a novel sparse prior distribution over transcript
abundances and is markedly more accurate than existing approaches. The second chapter
describes a new method for calling genotypes from a fixed set of candidate variants. The
method queries the reads using a graph representation of the variants and thereby mitigates
the reference bias that characterises standard genotyping methods. In the last chapter, we
apply this method to call the genotypes of 50 deeply sequenced parent-offspring trios from
the GenomeDenmark project. By estimating the genotypes on a set of candidate variants
obtained from both a standard mapping-based approach and de novo assemblies, we
are able to find considerably more structural variation than previous studies.
| Original language | English |
|---|---|
| Publisher | Department of Biology, Faculty of Science, University of Copenhagen |
| Number of pages | 173 |
| Status | Published - 2016 |