Statistical alignment: Computational properties, homology testing and goodness-of-fit

J. Hein; C. Wiuf; B. Knudsen; M. B. Møller; G. Wibling

doi:10.1006/jmbi.2000.4061

Corresponding author: J. Hein

63 citations (Scopus)

Abstract

The model of insertions and deletions in biological sequences, first formulated by Thorne, Kishino, and Felsenstein in 1991 (the TKF91 model), provides a basis for performing alignment within a statistical framework. Here we investigate this model. Firstly, we show how to accelerate the statistical alignment algorithms several orders of magnitude. The main innovations are to confine likelihood calculations to a band close to the similarity based alignment, to get good initial guesses of the evolutionary parameters and to apply an efficient numerical optimisation algorithm for finding the maximum likelihood estimate. In addition, the recursions originally presented by Thorne, Kishino and Felsenstein can be simplified. Two proteins, about 1500 amino acids long, can be analysed with this method in less than five seconds on a fast desktop computer, which makes this method practical for actual data analysis. Secondly, we propose a new homology test based on this model, where homology means that an ancestor to a sequence pair can be found finitely far back in time. This test has statistical advantages relative to the traditional shuffle test for proteins. Finally, we describe a goodness-of-fit test, that allows testing the proposed insertion-deletion (indel) process inherent to this model and find that real sequences (here globins) probably experience indels longer than one, contrary to what is assumed by the model. (C) 2000 Academic Press.

Original language	English
Journal	Journal of Molecular Biology
Volume	302
Issue number	1
Pages (from-to)	265-279
Number of pages	15
ISSN	0022-2836
DOI	https://doi.org/10.1006/jmbi.2000.4061
Status	Published - 8 Sep 2000
Externally published	Yes

Citation formats

@article{6442235fb4cf412695a927c1a3f088be,

title = "Statistical alignment: Computational properties, homology testing and goodness-of-fit",

abstract = "The model of insertions and deletions in biological sequences, first formulated by Thorne, Kishino, and Felsenstein in 1991 (the TKF91 model), provides a basis for performing alignment within a statistical framework. Here we investigate this model. Firstly, we show how to accelerate the statistical alignment algorithms several orders of magnitude. The main innovations are to confine likelihood calculations to a band close to the similarity based alignment, to get good initial guesses of the evolutionary parameters and to apply an efficient numerical optimisation algorithm for finding the maximum likelihood estimate. In addition, the recursions originally presented by Thorne, Kishino and Felsenstein can be simplified. Two proteins, about 1500 amino acids long, can be analysed with this method in less than five seconds on a fast desktop computer, which makes this method practical for actual data analysis. Secondly, we propose a new homology test based on this model, where homology means that an ancestor to a sequence pair can be found finitely far back in time. This test has statistical advantages relative to the traditional shuffle test for proteins. Finally, we describe a goodness-of-fit test, that allows testing the proposed insertion-deletion (indel) process inherent to this model and find that real sequences (here globins) probably experience indels longer than one, contrary to what is assumed by the model. (C) 2000 Academic Press.",

keywords = "Goodness-of-fit, Homology testing, Statistical alignment",

author = "J. Hein and C. Wiuf and B. Knudsen and M{\o}ller, {M. B.} and G. Wibling",

year = "2000",

month = sep,

day = "8",

doi = "10.1006/jmbi.2000.4061",

language = "English",

volume = "302",

pages = "265--279",

journal = "Journal of Molecular Biology",

issn = "0022-2836",

publisher = "Academic Press",

number = "1",

}

TY - JOUR

T1 - Statistical alignment

T2 - Computational properties, homology testing and goodness-of-fit

AU - Hein, J.

AU - Wiuf, C.

AU - Knudsen, B.

AU - Møller, M. B.

AU - Wibling, G.

PY - 2000/9/8

Y1 - 2000/9/8

N2 - The model of insertions and deletions in biological sequences, first formulated by Thorne, Kishino, and Felsenstein in 1991 (the TKF91 model), provides a basis for performing alignment within a statistical framework. Here we investigate this model. Firstly, we show how to accelerate the statistical alignment algorithms several orders of magnitude. The main innovations are to confine likelihood calculations to a band close to the similarity based alignment, to get good initial guesses of the evolutionary parameters and to apply an efficient numerical optimisation algorithm for finding the maximum likelihood estimate. In addition, the recursions originally presented by Thorne, Kishino and Felsenstein can be simplified. Two proteins, about 1500 amino acids long, can be analysed with this method in less than five seconds on a fast desktop computer, which makes this method practical for actual data analysis. Secondly, we propose a new homology test based on this model, where homology means that an ancestor to a sequence pair can be found finitely far back in time. This test has statistical advantages relative to the traditional shuffle test for proteins. Finally, we describe a goodness-of-fit test, that allows testing the proposed insertion-deletion (indel) process inherent to this model and find that real sequences (here globins) probably experience indels longer than one, contrary to what is assumed by the model. (C) 2000 Academic Press.

AB - The model of insertions and deletions in biological sequences, first formulated by Thorne, Kishino, and Felsenstein in 1991 (the TKF91 model), provides a basis for performing alignment within a statistical framework. Here we investigate this model. Firstly, we show how to accelerate the statistical alignment algorithms several orders of magnitude. The main innovations are to confine likelihood calculations to a band close to the similarity based alignment, to get good initial guesses of the evolutionary parameters and to apply an efficient numerical optimisation algorithm for finding the maximum likelihood estimate. In addition, the recursions originally presented by Thorne, Kishino and Felsenstein can be simplified. Two proteins, about 1500 amino acids long, can be analysed with this method in less than five seconds on a fast desktop computer, which makes this method practical for actual data analysis. Secondly, we propose a new homology test based on this model, where homology means that an ancestor to a sequence pair can be found finitely far back in time. This test has statistical advantages relative to the traditional shuffle test for proteins. Finally, we describe a goodness-of-fit test, that allows testing the proposed insertion-deletion (indel) process inherent to this model and find that real sequences (here globins) probably experience indels longer than one, contrary to what is assumed by the model. (C) 2000 Academic Press.

KW - Goodness-of-fit

KW - Homology testing

KW - Statistical alignment

UR - http://www.scopus.com/inward/record.url?scp=0034623015&partnerID=8YFLogxK

U2 - 10.1006/jmbi.2000.4061

DO - 10.1006/jmbi.2000.4061

M3 - Journal article

C2 - 10964574

AN - SCOPUS:0034623015

SN - 0022-2836

VL - 302

SP - 265

EP - 279

JO - Journal of Molecular Biology

JF - Journal of Molecular Biology

IS - 1

ER -
