Assembly-free and alignment-free sample identification using genome skims

Shahab Sarmashghi; Kristine Bohmann; M. Thomas P. Gilbert; Vineet Bafna; Siavash Mirarab

doi:10.1007/978-3-319-89929-9

Shahab Sarmashghi^*, Kristine Bohmann, M. Thomas P. Gilbert, Vineet Bafna, Siavash Mirarab

^*Corresponding author for this work

Abstract

The ability to quickly and inexpensively describe the taxonomic diversity in an environment is critical in this era of rapid climate and biodiversity changes. The currently preferred molecular technique, barcoding, is low-cost and widely used, but has drawbacks. As sequencing costs continue to fall, an alternative approach based on genome-skimming has been proposed [1, 2]. This approach first applies low-pass (100 Mb – several Gb per sample) sequencing to voucher and/or query samples and then recovers marker genes and/or organelle genomes computationally. In contrast, we suggest using the unassembled sequence data for taxonomic identification, with an alignment-free approach based on the k-mer decomposition of the sequencing reads. Specifically, we first estimate the average sequencing depth and error rate for each genome skim by comparing our derived theoretical distribution of k-mer multiplicities with the histogram of k-mer counts computed using Jellyfish [3]. The genome length is then estimated from the average sequencing depth. Next, the similarity of two genome skims is measured by the Jaccard index between their k-mer collections. Finally, the Hamming distance between genomes is estimated from the Jaccard index, using the following formula, obtained by modeling the impact of low sequencing coverage, sequencing error, and differing genome lengths on the similarity of genome skims:

$$D = 1 - \left( \frac{2(\zeta_1 L_1 + \zeta_2 L_2)\,J}{\eta_1 \eta_2 (L_1 + L_2)(1 + J)} \right)^{1/k}.$$

In this equation, when coverage is low, we use all k-mers and set:

$$\eta_i = 1 - e^{-c_i (1 - k/\ell)(1 - \epsilon_i)^k}, \qquad \zeta_i = \eta_i + c_i (1 - k/\ell)\left(1 - (1 - \epsilon_i)^k\right).$$

For higher coverages, we remove k-mers with multiplicity below a threshold m, and set:

$$\zeta_i = \eta_i = 1 - \sum_{t=0}^{m-1} \frac{\left(c_i (1 - k/\ell)(1 - \epsilon_i)^k\right)^t}{t!}\, e^{-c_i (1 - k/\ell)(1 - \epsilon_i)^k}.$$

In these equations, k and ℓ are the k-mer length and read length, respectively, and c_i, ɛ_i, and L_i are replaced by the estimates of coverage, error rate, and genome length for each genome skim. The Jaccard index between two genome skims, J, is computed efficiently by Mash [4] using a hashing technique.

We have tested our tool, Skmer, on genome skims simulated from assemblies of 90 species from two genera of insects (Anopheles and Drosophila) and across the avian tree of life. We test the accuracy of the distances computed by Skmer, and subsequently use the distances to find the exact or closest match to a query sample in a reference set of genome skims. Compared with other k-mer-based tools, Skmer shows excellent performance in our simulation studies, especially when the coverage is below 4X [5]. Skmer makes the assembly-free approach to genome-skimming a viable alternative to traditional barcoding. The software is publicly available on GitHub (https://github.com/shahab-sarmashghi/Skmer.git).
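The distance correction described in the abstract can be illustrated with a short Python sketch. This is not the Skmer implementation: the function names (`kmer_set`, `jaccard`, `eta_zeta`, `corrected_distance`) are hypothetical, the exact set-based Jaccard index is a stand-in for the hashed estimate that Mash produces, and the clamping step is an added assumption to keep the result in [0, 1] when J is noisy. The coverage, error rate, and genome length are taken as given inputs rather than estimated from k-mer histograms.

```python
import math


def kmer_set(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index (intersection size over union size) of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def eta_zeta(c, eps, k, ell, m=None):
    """Return (eta, zeta) for one genome skim.

    c: coverage, eps: per-base error rate, k: k-mer length, ell: read length.
    m=None is the low-coverage case (all k-mers kept). An integer m >= 1 is
    the case where k-mers seen fewer than m times are removed.
    """
    lam = c * (1.0 - k / ell)        # c_i (1 - k/ell)
    rate = lam * (1.0 - eps) ** k    # c_i (1 - k/ell) (1 - eps_i)^k
    if m is None:
        eta = 1.0 - math.exp(-rate)
        zeta = eta + lam * (1.0 - (1.0 - eps) ** k)
        return eta, zeta
    # Truncated sum over t = 0 .. m-1 from the higher-coverage formula.
    tail = sum(rate ** t / math.factorial(t) for t in range(m))
    eta = 1.0 - tail * math.exp(-rate)
    return eta, eta


def corrected_distance(j, skim1, skim2, k, ell, m=None):
    """Estimate the distance D from a Jaccard index j.

    skim1, skim2: (coverage, error_rate, genome_length) tuples.
    """
    (c1, e1, l1), (c2, e2, l2) = skim1, skim2
    eta1, zeta1 = eta_zeta(c1, e1, k, ell, m)
    eta2, zeta2 = eta_zeta(c2, e2, k, ell, m)
    ratio = (2.0 * (zeta1 * l1 + zeta2 * l2) * j
             / (eta1 * eta2 * (l1 + l2) * (1.0 + j)))
    # Assumption, not from the abstract: clamp so that D stays within [0, 1].
    ratio = min(max(ratio, 0.0), 1.0)
    return 1.0 - ratio ** (1.0 / k)
```

With error-free reads and very high coverage, both η_i and ζ_i approach 1, and for equal genome lengths the formula reduces to D = 1 - (2J / (1 + J))^{1/k}, which is a convenient sanity check for any implementation.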

Original language	English
Book series	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	10812
Pages (from-to)	276-277
Number of pages	2
ISSN	0302-9743
DOI	https://doi.org/10.1007/978-3-319-89929-9
Status	Published - 1 Jan 2018
Event	22nd International Conference on Research in Computational Molecular Biology, RECOMB 2018 - Paris, France. Duration: 21 Apr 2018 → 24 Apr 2018

Conference

Conference	22nd International Conference on Research in Computational Molecular Biology, RECOMB 2018
Country/Territory	France
City	Paris
Period	21/04/2018 → 24/04/2018

Access to document

10.1007/978-3-319-89929-9

Citation formats

Assembly-free and alignment-free sample identification using genome skims. / Sarmashghi, Shahab; Bohmann, Kristine; Gilbert, M. Thomas P. et al.
In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 10812, 01.01.2018, pp. 276-277.

Publication: Contribution to journal › Conference abstract in journal › Research › peer-review

@article{deb207b07bdf499ba243d3b49d347c67,

title = "Assembly-free and alignment-free sample identification using genome skims",

author = "Shahab Sarmashghi and Kristine Bohmann and Gilbert, {M. Thomas P.} and Vineet Bafna and Siavash Mirarab",

year = "2018",

month = jan,

day = "1",

doi = "10.1007/978-3-319-89929-9",

language = "English",

volume = "10812",

pages = "276--277",

journal = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

issn = "0302-9743",

publisher = "Springer",

note = "22nd International Conference on Research in Computational Molecular Biology, RECOMB 2018 ; Conference date: 21-04-2018 Through 24-04-2018",

}

TY - ABST

T1 - Assembly-free and alignment-free sample identification using genome skims

AU - Sarmashghi, Shahab

AU - Bohmann, Kristine

AU - Gilbert, M. Thomas P.

AU - Bafna, Vineet

AU - Mirarab, Siavash

PY - 2018/1/1

Y1 - 2018/1/1

UR - http://www.scopus.com/inward/record.url?scp=85046131357&partnerID=8YFLogxK

U2 - 10.1007/978-3-319-89929-9

DO - 10.1007/978-3-319-89929-9

M3 - Conference abstract in journal

AN - SCOPUS:85046131357

SN - 0302-9743

VL - 10812

SP - 276

EP - 277

JO - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

JF - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

T2 - 22nd International Conference on Research in Computational Molecular Biology, RECOMB 2018

Y2 - 21 April 2018 through 24 April 2018

ER -
