Identification of Mislabeled Samples and Sample Mix-ups in Genotype Data using Barcode Genotypes

Christian Theil Have; Emil Vincent Rosenbaum Appel; Niels Grarup; Torben Hansen; Jette Bork-Jensen

doi:10.7763/ijbbb.2014.v4.370

Abstract

Undetected mislabeled samples may affect the results of genotype studies, particularly when rare genetic variants are investigated. Mislabeled samples are often not detected during quality control, and if they are detected, they are normally discarded for lack of a reliable method to recover the correct labels.

Here we describe a statistical method which, given a few extra independent genotypes (barcode genotypes), detects mislabeled samples and recovers the correct labels for sample mix-ups. We have implemented the method in a program (named Wunderbar) and we evaluate the reliability of the method on simulated data. We find that even with only a small number of barcode genotypes, Wunderbar is capable of identifying mislabeled samples and sample mix-ups with high sensitivity and specificity, even with a high genotyping error rate and even in the presence of dependency between the individual barcode genotypes.

To detect mislabeled samples we calculate the probability that the discordance between genotypes in the data and in the independent genotypes can be attributed to random (non-mislabeling) genotyping errors. To identify mix-ups we calculate the probability of identifying the set of identical genotypes between sample x and sample y by chance. Based on this we calculate a mix-up confidence score with penalization for introducing mismatches in the proposed new label and adjustment for independency among the genotypes. This confidence score is used to identify probable mix-ups.
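
The detection step described in the abstract can be illustrated with a minimal sketch. This is not the Wunderbar implementation: it assumes, for illustration only, that genotyping errors occur independently at each barcode position with a fixed rate, so that the number of discordant positions follows a binomial distribution. The function names, the 0/1/2 genotype encoding, the error rate, and the threshold `alpha` are all hypothetical.

```python
from math import comb


def discordance_p_value(n_barcodes: int, n_discordant: int, error_rate: float) -> float:
    """Upper binomial tail: P(X >= n_discordant) for X ~ Binomial(n_barcodes, error_rate).

    Assumption (not from the paper): errors are independent across positions.
    """
    return sum(
        comb(n_barcodes, k) * error_rate**k * (1 - error_rate) ** (n_barcodes - k)
        for k in range(n_discordant, n_barcodes + 1)
    )


def count_discordant(study_genotypes, barcode_genotypes) -> int:
    """Count positions where the study genotype and the barcode genotype differ."""
    if len(study_genotypes) != len(barcode_genotypes):
        raise ValueError("genotype vectors must have the same length")
    return sum(1 for a, b in zip(study_genotypes, barcode_genotypes) if a != b)


def flag_mislabeled(samples, barcodes, error_rate=0.01, alpha=1e-3):
    """Return labels whose discordance is unlikely to be explained by random errors.

    samples, barcodes: dicts mapping a sample label to a list of genotypes
    (hypothetically encoded as 0, 1 or 2 copies of the minor allele).
    """
    flagged = []
    for label, genotypes in samples.items():
        d = count_discordant(genotypes, barcodes[label])
        if discordance_p_value(len(genotypes), d, error_rate) < alpha:
            flagged.append(label)
    return flagged
```

A sample with a single discordant barcode genotype out of ten is plausibly explained by genotyping error under these assumptions, whereas five discordant genotypes out of ten is not, so only the latter would be flagged. The mix-up confidence score, with its mismatch penalty and dependency adjustment, is not reproduced here.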

Original language	English
Article number	370
Journal	International Journal of Bioscience, Biochemistry and Bioinformatics
Volume	4
Issue number	5
Pages (from-to)	355-360
Number of pages	5
ISSN	2010-3638
DOIs	https://doi.org/10.7763/ijbbb.2014.v4.370
Publication status	Published - 2014

Cite this

Identification of Mislabeled Samples and Sample Mix-ups in Genotype Data using Barcode Genotypes. / Have, Christian Theil; Appel, Emil Vincent Rosenbaum; Grarup, Niels et al.

In: International Journal of Bioscience, Biochemistry and Bioinformatics, Vol. 4, No. 5, 370, 2014, p. 355-360.

Research output: Contribution to journal › Journal article › Research › peer-review

@article{a4415ffc01e34b529875eab08079a354,

title = "Identification of Mislabeled Samples and Sample Mix-ups in Genotype Data using Barcode Genotypes",

abstract = "Undetected mislabeled samples may affect the results of genotype studies, particularly when rare genetic variants are investigated. Mislabeled samples are often not detected during quality control, and if they are detected, they are normally discarded for lack of a reliable method to recover the correct labels. Here we describe a statistical method which, given a few extra independent genotypes (barcode genotypes), detects mislabeled samples and recovers the correct labels for sample mix-ups. We have implemented the method in a program (named Wunderbar) and we evaluate the reliability of the method on simulated data. We find that even with only a small number of barcode genotypes, Wunderbar is capable of identifying mislabeled samples and sample mix-ups with high sensitivity and specificity, even with a high genotyping error rate and even in the presence of dependency between the individual barcode genotypes. To detect mislabeled samples we calculate the probability that the discordance between genotypes in the data and in the independent genotypes can be attributed to random (non-mislabeling) genotyping errors. To identify mix-ups we calculate the probability of identifying the set of identical genotypes between sample x and sample y by chance. Based on this we calculate a mix-up confidence score with penalization for introducing mismatches in the proposed new label and adjustment for independency among the genotypes. This confidence score is used to identify probable mix-ups.",

author = "Have, {Christian Theil} and Appel, {Emil Vincent Rosenbaum} and Niels Grarup and Torben Hansen and Jette Bork-Jensen",

year = "2014",

doi = "10.7763/ijbbb.2014.v4.370",

language = "English",

volume = "4",

pages = "355--360",

journal = "International Journal of Bioscience, Biochemistry and Bioinformatics",

issn = "2010-3638",

publisher = "International Academy Publishing",

number = "5",

}

TY - JOUR

T1 - Identification of Mislabeled Samples and Sample Mix-ups in Genotype Data using Barcode Genotypes

AU - Have, Christian Theil

AU - Appel, Emil Vincent Rosenbaum

AU - Grarup, Niels

AU - Hansen, Torben

AU - Bork-Jensen, Jette

PY - 2014

Y1 - 2014

N2 - Undetected mislabeled samples may affect the results of genotype studies, particularly when rare genetic variants are investigated. Mislabeled samples are often not detected during quality control, and if they are detected, they are normally discarded for lack of a reliable method to recover the correct labels. Here we describe a statistical method which, given a few extra independent genotypes (barcode genotypes), detects mislabeled samples and recovers the correct labels for sample mix-ups. We have implemented the method in a program (named Wunderbar) and we evaluate the reliability of the method on simulated data. We find that even with only a small number of barcode genotypes, Wunderbar is capable of identifying mislabeled samples and sample mix-ups with high sensitivity and specificity, even with a high genotyping error rate and even in the presence of dependency between the individual barcode genotypes. To detect mislabeled samples we calculate the probability that the discordance between genotypes in the data and in the independent genotypes can be attributed to random (non-mislabeling) genotyping errors. To identify mix-ups we calculate the probability of identifying the set of identical genotypes between sample x and sample y by chance. Based on this we calculate a mix-up confidence score with penalization for introducing mismatches in the proposed new label and adjustment for independency among the genotypes. This confidence score is used to identify probable mix-ups.

AB - Undetected mislabeled samples may affect the results of genotype studies, particularly when rare genetic variants are investigated. Mislabeled samples are often not detected during quality control, and if they are detected, they are normally discarded for lack of a reliable method to recover the correct labels. Here we describe a statistical method which, given a few extra independent genotypes (barcode genotypes), detects mislabeled samples and recovers the correct labels for sample mix-ups. We have implemented the method in a program (named Wunderbar) and we evaluate the reliability of the method on simulated data. We find that even with only a small number of barcode genotypes, Wunderbar is capable of identifying mislabeled samples and sample mix-ups with high sensitivity and specificity, even with a high genotyping error rate and even in the presence of dependency between the individual barcode genotypes. To detect mislabeled samples we calculate the probability that the discordance between genotypes in the data and in the independent genotypes can be attributed to random (non-mislabeling) genotyping errors. To identify mix-ups we calculate the probability of identifying the set of identical genotypes between sample x and sample y by chance. Based on this we calculate a mix-up confidence score with penalization for introducing mismatches in the proposed new label and adjustment for independency among the genotypes. This confidence score is used to identify probable mix-ups.

U2 - 10.7763/ijbbb.2014.v4.370

DO - 10.7763/ijbbb.2014.v4.370

M3 - Journal article

SN - 2010-3638

VL - 4

SP - 355

EP - 360

JO - International Journal of Bioscience, Biochemistry and Bioinformatics

JF - International Journal of Bioscience, Biochemistry and Bioinformatics

IS - 5

M1 - 370

ER -
