TY - JOUR
T1 - Estimation of allele frequency and association mapping using next-generation sequencing data
AU - Kim, Su Yeon
AU - Lohmueller, Kirk E
AU - Albrechtsen, Anders
AU - Li, Yingrui
AU - Korneliussen, Thorfinn Sand
AU - Tian, Geng
AU - Grarup, Niels
AU - Jiang, Tao
AU - Andersen, Gitte
AU - Witte, Daniel
AU - Jorgensen, Torben
AU - Hansen, Torben
AU - Pedersen, Oluf
AU - Wang, Jun
AU - Nielsen, Rasmus
PY - 2011
Y1 - 2011/6/1
N2 - Background: Estimation of allele frequency is of fundamental importance in population genetic analyses and in association mapping. In most studies using next-generation sequencing, a cost effective approach is to use medium or low-coverage data (e.g., < 15X). However, SNP calling and allele frequency estimation in such studies is associated with substantial statistical uncertainty because of varying coverage and high error rates. Results: We evaluate a new maximum likelihood method for estimating allele frequencies in low and medium coverage next-generation sequencing data. The method is based on integrating over uncertainty in the data for each individual rather than first calling genotypes. This method can be applied to directly test for associations in case/control studies. We use simulations to compare the likelihood method to methods based on genotype calling, and show that the likelihood method outperforms the genotype calling methods in terms of: (1) accuracy of allele frequency estimation, (2) accuracy of the estimation of the distribution of allele frequencies across neutrally evolving sites, and (3) statistical power in association mapping studies. Using real re-sequencing data from 200 individuals obtained from an exon-capture experiment, we show that the patterns observed in the simulations are also found in real data. Conclusions: Overall, our results suggest that association mapping and estimation of allele frequencies should not be based on genotype calling in low to medium coverage data. Furthermore, if genotype calling methods are used, it is usually better not to filter genotypes based on the call confidence score.
AB - Background: Estimation of allele frequency is of fundamental importance in population genetic analyses and in association mapping. In most studies using next-generation sequencing, a cost effective approach is to use medium or low-coverage data (e.g., < 15X). However, SNP calling and allele frequency estimation in such studies is associated with substantial statistical uncertainty because of varying coverage and high error rates. Results: We evaluate a new maximum likelihood method for estimating allele frequencies in low and medium coverage next-generation sequencing data. The method is based on integrating over uncertainty in the data for each individual rather than first calling genotypes. This method can be applied to directly test for associations in case/control studies. We use simulations to compare the likelihood method to methods based on genotype calling, and show that the likelihood method outperforms the genotype calling methods in terms of: (1) accuracy of allele frequency estimation, (2) accuracy of the estimation of the distribution of allele frequencies across neutrally evolving sites, and (3) statistical power in association mapping studies. Using real re-sequencing data from 200 individuals obtained from an exon-capture experiment, we show that the patterns observed in the simulations are also found in real data. Conclusions: Overall, our results suggest that association mapping and estimation of allele frequencies should not be based on genotype calling in low to medium coverage data. Furthermore, if genotype calling methods are used, it is usually better not to filter genotypes based on the call confidence score.
KW - Gene Frequency
KW - Genetics, Population
KW - Genotype
KW - High-Throughput Nucleotide Sequencing
KW - Humans
KW - Likelihood Functions
KW - Polymorphism, Single Nucleotide
KW - Sequence Analysis, DNA
U2 - 10.1186/1471-2105-12-231
DO - 10.1186/1471-2105-12-231
M3 - Journal article
C2 - 21663684
SN - 1471-2105
VL - 12
SP - 231
JO - BMC Bioinformatics
JF - BMC Bioinformatics
ER -