Deciphering Transcriptional Regulation: Computational Approaches

Eivind Valen

Computational and RNA Biology

Abstract

The myriad of cells in the human body are all made from the same blueprint: the human
genome. At the heart of this diversity lies the concept of gene regulation, the process
in which it is decided which genes are used where and when. Genes do not function
as on/off buttons, but more like a volume control spanning the range from completely
muted to cranked up to maximum. The volume, in this case, is the production rate of
proteins. This production is the result of a two-step procedure: i) transcription, in which
a small part of DNA from the genome (a gene) is transcribed into an RNA molecule (an
mRNA); and ii) translation, in which the mRNA is translated into a protein. This thesis
focuses on the first of these steps, transcription, and specifically on its initiation.
Simplified, initiation is preceded by the binding of several proteins, known as
transcription factors (TFs), to DNA. This binding takes place mostly near the start of the
gene, in a region known as the promoter. This region contains patterns scattered in the
DNA that the TFs can recognize and bind to. Such binding can prompt the assembly of the
pre-initiation complex, which ultimately leads to transcription of the gene. To achieve the
regulation necessary to produce the multitude of tissues we observe, there exists a wide
range of TFs with different binding preferences that target different genes. By activating
different TFs in a context-dependent manner, the organism can produce customized
sets of proteins for each cell, resulting in different cell types.
This thesis presents several methods for the analysis and description of promoters. We
focus particularly on the binding sites of TFs and on computational methods for locating them.
We contribute to the field by compiling a database of TF binding preferences, which
can be used for site prediction, and by providing tools that help investigators use them. In
addition, a de novo motif discovery tool was developed that locates these patterns in
DNA sequences. This tool compared favorably with many contemporary methods.
A novel experimental method, cap-analysis of gene expression (CAGE), was recently
published, providing an unbiased overview of the transcription start site (TSS) usage in
a tissue. We have paired this method with high-throughput sequencing technology to
produce a library of unprecedented depth (DeepCAGE) for the mouse hippocampus. We
investigated this library in detail, focusing particularly on what characterizes a hippocampus
promoter. Pairing CAGE with TF binding site prediction, we identified a likely key
regulator of the hippocampus.
Finally, we developed a method for CAGE exploration. While the DeepCAGE library
characterized a full 1.4 million transcription initiation events, it did not capture
the complete TSS-ome of the hippocampus. We fitted two statistical models to the CAGE
data and extrapolated how deep sequencing needs to be to capture most of the events.
We concluded that while most genes are discovered, tag clusters and TSSs are not fully
explored.

Original language	English

Publisher	Museum Tusculanum
Number of pages	97
Publication status	Published - 2010

Cite this

@phdthesis{2e74014047a311df928f000ea68e967b,

title = "Deciphering Transcriptional Regulation: Computational Approaches",

author = "Eivind Valen",

note = "Supervisors: Assoc. Prof. Albin Sandelin Prof. Anders Krogh Assoc. Prof. Ole Winther",

year = "2010",

language = "English",

publisher = "Museum Tusculanum",

}

TY - BOOK

T1 - Deciphering Transcriptional Regulation

T2 - Computational Approaches

AU - Valen, Eivind

N1 - Supervisors: Assoc. Prof. Albin Sandelin Prof. Anders Krogh Assoc. Prof. Ole Winther

PY - 2010

Y1 - 2010

M3 - Ph.D. thesis

BT - Deciphering Transcriptional Regulation

PB - Museum Tusculanum

ER -
