TY - JOUR
T1 - Introduction of the Python script STRinNGS for analysis of STR regions in FASTQ or BAM files and expansion of the Danish STR sequence database to 11 STRs
AU - Friis, Susanne L
AU - Buchard, Anders
AU - Rockenbauer, Eszter
AU - Børsting, Claus
AU - Morling, Niels
N1 - Copyright © 2015 Elsevier Ireland Ltd. All rights reserved.
PY - 2016/3/1
Y1 - 2016/3/1
N2 - This work introduces the in-house developed Python application STRinNGS for analysis of STR sequence elements in BAM or FASTQ files. STRinNGS identifies sequence reads with STR loci by their flanking sequences, analyses the STR sequence and the flanking regions, and generates a report with the assigned SNP-STR alleles. The main output file from STRinNGS contains all sequences with read counts above 1% of the total number of reads per locus. STR sequences are automatically named according to the nomenclature used previously and according to the repeat unit definitions in STRBase (http://www.cstl.nist.gov/strbase/). The sequences are named with (1) the locus name, (2) the length of the repeat region divided by the length of the repeat unit, (3) the sequence(s) of the repeat unit(s) followed by the number of repeats, and (4) variations in the flanking regions. Lowercase letters in the main output file flag sequences with previously unknown variations in the STRs. SNPs in the flanking regions are named by their "rs" numbers and the nucleotides in the SNP position. Data from 207 Danes, sequenced with the Ion Torrent™ HID STR 10-plex that amplifies nine STRs (CSF1PO, D3S1358, D5S818, D7S820, D8S1179, D16S539, TH01, TPOX, vWA) and Amelogenin, were analysed with STRinNGS. Sequencing uncovered five common SNPs near four STRs and revealed 20 new alleles in the 207 Danes. Three short homopolymers in the D8S1179 flanking regions caused frequent sequencing errors. In 29 of 3726 allele calls (0.8%), sequences with homopolymer errors were falsely assigned as true alleles. An in-house R script compensated for these errors by compiling sequence reads that had identical STR sequences and identical nucleotides in the five common SNPs. In the output file from the R script, all SNP-STR haplotype calls were correct.
The 207 samples and six additional samples were sequenced for D3S1358, D12S391, and D21S11 using the 454 GS Junior platform in this and a previous work. Overall, next-generation sequencing (NGS) of the 11 STRs lowered the mean match probability 386-fold and increased the typical paternity indexes (i.e. the geometric mean) for trios and duos 47-fold and 23-fold, respectively, compared to the traditional PCR-CE typing of the same population.
AB - This work introduces the in-house developed Python application STRinNGS for analysis of STR sequence elements in BAM or FASTQ files. STRinNGS identifies sequence reads with STR loci by their flanking sequences, analyses the STR sequence and the flanking regions, and generates a report with the assigned SNP-STR alleles. The main output file from STRinNGS contains all sequences with read counts above 1% of the total number of reads per locus. STR sequences are automatically named according to the nomenclature used previously and according to the repeat unit definitions in STRBase (http://www.cstl.nist.gov/strbase/). The sequences are named with (1) the locus name, (2) the length of the repeat region divided by the length of the repeat unit, (3) the sequence(s) of the repeat unit(s) followed by the number of repeats, and (4) variations in the flanking regions. Lowercase letters in the main output file flag sequences with previously unknown variations in the STRs. SNPs in the flanking regions are named by their "rs" numbers and the nucleotides in the SNP position. Data from 207 Danes, sequenced with the Ion Torrent™ HID STR 10-plex that amplifies nine STRs (CSF1PO, D3S1358, D5S818, D7S820, D8S1179, D16S539, TH01, TPOX, vWA) and Amelogenin, were analysed with STRinNGS. Sequencing uncovered five common SNPs near four STRs and revealed 20 new alleles in the 207 Danes. Three short homopolymers in the D8S1179 flanking regions caused frequent sequencing errors. In 29 of 3726 allele calls (0.8%), sequences with homopolymer errors were falsely assigned as true alleles. An in-house R script compensated for these errors by compiling sequence reads that had identical STR sequences and identical nucleotides in the five common SNPs. In the output file from the R script, all SNP-STR haplotype calls were correct.
The 207 samples and six additional samples were sequenced for D3S1358, D12S391, and D21S11 using the 454 GS Junior platform in this and a previous work. Overall, next-generation sequencing (NGS) of the 11 STRs lowered the mean match probability 386-fold and increased the typical paternity indexes (i.e. the geometric mean) for trios and duos 47-fold and 23-fold, respectively, compared to the traditional PCR-CE typing of the same population.
U2 - 10.1016/j.fsigen.2015.12.006
DO - 10.1016/j.fsigen.2015.12.006
M3 - Journal article
C2 - 26722765
SN - 1872-4973
VL - 21
SP - 68
EP - 75
JO - Forensic Science International: Genetics
JF - Forensic Science International: Genetics
ER -