A Probabilistic Genome-Wide Gene Reading Frame Sequence Model

Christian Theil Have; Søren Mørk

Abstract

We introduce a new type of probabilistic sequence model that models the sequential composition of the reading frames of genes in a genome.
Our approach extends gene finders with a model of the sequential composition of genes at the genome level, effectively producing a sequential genome annotation as output.
The model can be used to obtain the most probable genome annotation based on a combination of (i) a gene finder score for each gene candidate and (ii) the sequence of the reading frames of gene candidates through a genome.
The model, as well as a higher-order variant, is developed and tested using PRISM, a probabilistic logic programming language and machine learning system that serves as a fast and efficient model prototyping environment; bacterial gene finding performance is used as a benchmark of signal strength.
The model is used to prune a set of gene predictions from an underlying gene finder and is evaluated by its effect on prediction performance.
Since bacterial gene finding is to a large extent a solved problem, it forms an ideal proving ground for evaluating the explicit modeling of larger-scale gene sequence composition of genomes.

We conclude that the sequential composition of gene reading frames is a consistent signal present in bacterial genomes and that it can be effectively modeled with probabilistic sequence models.
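
The pruning idea described in the abstract can be illustrated with a small dynamic program. The sketch below is not the authors' PRISM model; it is a minimal first-order illustration under stated assumptions: each candidate carries a reading frame (one of six) and a gene finder score read as a probability in (0, 1), deleting a candidate costs `1 - score`, and keeping it costs `score` times a frame transition probability. The names `frame_transitions`, `prune_candidates`, and the `same_strand_bonus` parameter are hypothetical, and the transition table is hand-made here rather than learned from annotated genomes.

```python
import math

# The six reading frames: three on the forward strand, three on the reverse.
FRAMES = (1, 2, 3, -1, -2, -3)


def frame_transitions(same_strand_bonus=3.0):
    """Build a hypothetical table P(next frame | previous kept frame).

    Assumption for illustration only: consecutive genes are more likely
    to lie on the same strand. A real model would estimate this table
    from annotated genomes.
    """
    table = {}
    for prev in FRAMES:
        weights = {
            nxt: same_strand_bonus if (nxt > 0) == (prev > 0) else 1.0
            for nxt in FRAMES
        }
        total = sum(weights.values())
        table[prev] = {nxt: w / total for nxt, w in weights.items()}
    return table


def prune_candidates(candidates, transitions):
    """Return indices of candidates kept on the most probable path.

    candidates: list of (frame, score) pairs in genome order, 0 < score < 1.
    Viterbi-style search where the state is the frame of the last kept
    candidate (None while nothing has been kept yet).
    """
    best = {None: (0.0, ())}  # state -> (log probability, kept indices)
    for i, (frame, score) in enumerate(candidates):
        nxt = {}
        for last, (logp, kept) in best.items():
            # Option 1: delete the candidate; the state is unchanged.
            option = (logp + math.log(1.0 - score), kept)
            if last not in nxt or option[0] > nxt[last][0]:
                nxt[last] = option
            # Option 2: keep the candidate; the state becomes its frame.
            trans = 1.0 / len(FRAMES) if last is None else transitions[last][frame]
            option = (logp + math.log(score) + math.log(trans), kept + (i,))
            if frame not in nxt or option[0] > nxt[frame][0]:
                nxt[frame] = option
        best = nxt
    return list(max(best.values(), key=lambda entry: entry[0])[1])


# A weakly scored reverse-strand candidate between forward-strand genes.
example = [(1, 0.9), (1, 0.9), (-2, 0.55), (1, 0.9)]
kept = prune_candidates(example, frame_transitions())
print(kept)  # [0, 1, 3]
```

In this toy setting the third candidate is dropped because its modest score does not outweigh the cost of two strand switches. The higher-order variant mentioned in the abstract would condition on more than one preceding frame, which this sketch does not attempt.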

Original language	English
Publication date	Apr. 2014
Number of pages	12
Status	Published - Apr. 2014
Event	International Work-Conference on Bioinformatics and Biomedical Engineering - Granada, Spain. Duration: 7 Apr. 2014 → 9 Apr. 2014. Conference number: 2

Conference

Conference	International Work-Conference on Bioinformatics and Biomedical Engineering
Number	2
Country/Territory	Spain
City	Granada
Period	07/04/2014 → 09/04/2014

Access to document

http://iwbbio.ugr.es/2014/papers/IWBBIO_2014_paper_40.pdf

Citation formats

@conference{0e8ad58505534672ac8738975db4ac99,
title = "A Probabilistic Genome-Wide Gene Reading Frame Sequence Model",
abstract = "We introduce a new type of probabilistic sequence model that models the sequential composition of the reading frames of genes in a genome. Our approach extends gene finders with a model of the sequential composition of genes at the genome level, effectively producing a sequential genome annotation as output. The model can be used to obtain the most probable genome annotation based on a combination of (i) a gene finder score for each gene candidate and (ii) the sequence of the reading frames of gene candidates through a genome. The model, as well as a higher-order variant, is developed and tested using PRISM, a probabilistic logic programming language and machine learning system that serves as a fast and efficient model prototyping environment; bacterial gene finding performance is used as a benchmark of signal strength. The model is used to prune a set of gene predictions from an underlying gene finder and is evaluated by its effect on prediction performance. Since bacterial gene finding is to a large extent a solved problem, it forms an ideal proving ground for evaluating the explicit modeling of larger-scale gene sequence composition of genomes. We conclude that the sequential composition of gene reading frames is a consistent signal present in bacterial genomes and that it can be effectively modeled with probabilistic sequence models.",
author = "Have, {Christian Theil} and M{\o}rk, S{\o}ren",
year = "2014",
month = apr,
language = "English",
note = "International Work-Conference on Bioinformatics and Biomedical Engineering, IWBBIO ; Conference date: 07-04-2014 Through 09-04-2014",
}

TY - CONF
T1 - A Probabilistic Genome-Wide Gene Reading Frame Sequence Model
AU - Have, Christian Theil
AU - Mørk, Søren
N1 - Conference code: 2
PY - 2014/4
Y1 - 2014/4
AB - We introduce a new type of probabilistic sequence model that models the sequential composition of the reading frames of genes in a genome. Our approach extends gene finders with a model of the sequential composition of genes at the genome level, effectively producing a sequential genome annotation as output. The model can be used to obtain the most probable genome annotation based on a combination of (i) a gene finder score for each gene candidate and (ii) the sequence of the reading frames of gene candidates through a genome. The model, as well as a higher-order variant, is developed and tested using PRISM, a probabilistic logic programming language and machine learning system that serves as a fast and efficient model prototyping environment; bacterial gene finding performance is used as a benchmark of signal strength. The model is used to prune a set of gene predictions from an underlying gene finder and is evaluated by its effect on prediction performance. Since bacterial gene finding is to a large extent a solved problem, it forms an ideal proving ground for evaluating the explicit modeling of larger-scale gene sequence composition of genomes. We conclude that the sequential composition of gene reading frames is a consistent signal present in bacterial genomes and that it can be effectively modeled with probabilistic sequence models.
M3 - Paper
T2 - International Work-Conference on Bioinformatics and Biomedical Engineering
Y2 - 7 April 2014 through 9 April 2014
ER -
