Abstract
The myriad cell types of the human body are all built from the same blueprint: the human
genome. At the heart of this diversity lies gene regulation, the process
that determines which genes are used where and when. Genes do not function
as on/off buttons, but more like a volume control spanning the range from completely
muted to cranked up to maximum. The volume, in this case, is the production rate of
proteins. This production is the result of a two-step procedure: i) transcription, in which
a small part of DNA from the genome (a gene) is transcribed into an RNA molecule (an
mRNA); and ii) translation, in which the mRNA is translated into a protein. This thesis
focuses on the first of these steps, transcription, and specifically on its initiation.
Simplified, initiation is preceded by the binding of several proteins, known as transcription
factors (TFs), to DNA. This takes place mostly near the start of the gene, in a region known
as the promoter. This region contains patterns scattered in the DNA that the TFs can
recognize and bind to. Such binding can prompt the assembly of the pre-initiation complex,
which ultimately leads to transcription of the gene. In order to achieve the regulation
necessary to produce the multitude of tissues we observe, there exists a wide range of
these TFs having different binding preferences and targeting different genes. By activating
different TFs in a context-dependent manner, the organism can produce customized
sets of proteins for each cell, resulting in different cell types.
This thesis presents several methods for analysis and description of promoters. We
focus particularly on the binding sites of TFs and on computational methods for locating them.
We contribute to the field by compiling a database of binding preferences for TFs, which
can be used for site prediction, and by providing tools that help investigators use it. In
addition, we developed a de novo motif discovery tool that locates these patterns in
DNA sequences. This tool compared favorably with many contemporary methods.
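
To make the idea of site prediction from binding preferences concrete, the following is a minimal sketch of one common approach: converting a matrix of base counts into log-odds scores and sliding it along a sequence. It is not the thesis's own tool or database. The count matrix, the uniform background frequency, the pseudocount, and the function names `log_odds_matrix` and `scan` are all illustrative assumptions, and only the forward strand is scanned.

```python
import math

# Hypothetical base counts for an illustrative 4-bp motif.
# These numbers are made up for the example, not taken from any database.
COUNTS = {
    "A": [8, 0, 0, 1],
    "C": [0, 9, 0, 1],
    "G": [1, 0, 10, 0],
    "T": [1, 1, 0, 8],
}


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Turn per-position base counts into log2 odds against a uniform background."""
    width = len(counts["A"])
    matrix = []
    for i in range(width):
        # The pseudocount keeps unseen bases from producing log(0).
        total = sum(counts[base][i] for base in "ACGT") + 4 * pseudocount
        column = {
            base: math.log2(((counts[base][i] + pseudocount) / total) / background)
            for base in "ACGT"
        }
        matrix.append(column)
    return matrix


def scan(sequence, matrix, threshold):
    """Return (start, site, score) for every window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows with ambiguous bases such as N
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


matrix = log_odds_matrix(COUNTS)
hits = scan("TTACGTAA", matrix, threshold=4.0)
for start, site, score in hits:
    print(f"position {start}: {site} (score {score:.2f})")
```

In practice the threshold trades sensitivity against false positives, and a real scan would usually also score the reverse complement of each window.
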
A novel experimental method, cap-analysis of gene expression (CAGE), was recently
published, providing an unbiased overview of the transcription start site (TSS) usage in
a tissue. We have paired this method with high-throughput sequencing technology to
produce a library of unprecedented depth (DeepCAGE) for the mouse hippocampus. We
investigated this library in detail and focused particularly on what characterizes a hippocampus
promoter. Pairing CAGE with TF binding site prediction, we identified a likely key
regulator of the hippocampus.
Finally, we developed a method for CAGE exploration. While the DeepCAGE library
characterized a full 1.4 million transcription initiation events, it did not capture
the complete TSS-ome of the hippocampus. We fitted two statistical models to the CAGE
data and extrapolated how deep sequencing needs to be to capture most of the events.
We concluded that while most genes are discovered, tag clusters and TSSs are not fully
explored.

Original language | English |
---|---|
Publisher | Museum Tusculanum |
Number of pages | 97 |
Publication status | Published - 2010 |