Abstract
Undetected mislabeled samples may affect the
results of genotype studies, particularly when rare genetic
variants are investigated. Mislabeled samples are often not
detected during quality control, and when they are detected they
are normally discarded because no reliable method exists to
recover the correct labels.
Here we describe a statistical method which, given a few extra
independent genotypes (barcode genotypes), detects mislabeled
samples and recovers the correct labels for sample mix-ups. We
have implemented the method in a program named
Wunderbar and evaluate its reliability on
simulated data. We find that even with only a small number of
barcode genotypes, Wunderbar identifies
mislabeled samples and sample mix-ups with high sensitivity
and specificity, even at a high genotyping error rate and
in the presence of dependency between the individual barcode
genotypes.
To detect mislabeled samples, we calculate the probability
that the discordance between the genotypes in the data and the
independent genotypes can be attributed to random
(non-mislabeling) genotyping errors. To identify mix-ups, we
calculate the probability that the set of genotypes shared
between sample x and sample y is identical by chance. Based on
this, we calculate a mix-up confidence score that penalizes
mismatches introduced by the proposed new label and
adjusts for dependency among the genotypes. This
confidence score is used to identify probable mix-ups.
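
The two probabilities described above can be illustrated with a minimal sketch. This is not the Wunderbar implementation: the abstract does not give the formulas, so the sketch assumes that genotyping errors are independent with a single known error rate (a binomial tail), and that barcode markers are independent with known genotype frequencies (a simple product). The function names `discordance_p_value` and `chance_match_probability`, and all numeric values, are hypothetical.

```python
from math import comb


def discordance_p_value(n_barcodes: int, n_mismatches: int, error_rate: float) -> float:
    """Probability of seeing at least n_mismatches discordant genotypes out of
    n_barcodes if every mismatch were an independent genotyping error.

    Assumption (not from the source): errors are i.i.d. with rate error_rate,
    so the mismatch count follows a binomial distribution.
    """
    return sum(
        comb(n_barcodes, k) * error_rate**k * (1 - error_rate) ** (n_barcodes - k)
        for k in range(n_mismatches, n_barcodes + 1)
    )


def chance_match_probability(genotype_freqs: list[dict[str, float]],
                             shared_genotypes: list[str]) -> float:
    """Probability that an unrelated sample carries the same genotype at every
    marker in shared_genotypes purely by chance.

    Assumption (not from the source): markers are independent, so the
    per-marker genotype frequencies are multiplied.
    """
    p = 1.0
    for freqs, genotype in zip(genotype_freqs, shared_genotypes):
        p *= freqs[genotype]
    return p


# Hypothetical example: 20 barcode genotypes, 1% genotyping error rate.
p_few = discordance_p_value(20, 1, 0.01)   # one mismatch: plausible by error alone
p_many = discordance_p_value(20, 5, 0.01)  # five mismatches: very unlikely by error alone

# Hypothetical genotype frequencies for three markers (allele frequency 0.5).
freqs = [{"AA": 0.25, "AB": 0.5, "BB": 0.25}] * 3
p_match = chance_match_probability(freqs, ["AA", "AB", "BB"])

print(p_few, p_many, p_match)
```

Under these assumptions, a small discordance probability suggests the sample is mislabeled, and a small chance-match probability suggests that a full match between sample x and sample y reflects a genuine mix-up. The penalization and dependency adjustment mentioned above are not modeled here.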
Original language | English |
---|---|
Article number | 370 |
Journal | International Journal of Bioscience, Biochemistry and Bioinformatics |
Volume | 4 |
Issue number | 5 |
Pages (from-to) | 355-360 |
Number of pages | 5 |
ISSN | 2010-3638 |
Publication status | Published - 2014 |