SPIN - Species by Proteome INvestigation: Code, databases, and example data

Patrick Leopold Rüther (Creator)
Immanuel Mirnes Husic (Creator)
Pernille Bangsgaard (University of Copenhagen) (Creator)
Kristian Murphy-Gregersen (Creator)
Pernille Pantmann (Creator)
Milena Carvalho (Creator)
Ricardo Miguel Godinho (Creator)
Lukas Friedl (Creator)
João Cascalheira (Creator)
Alberto John Taurozzi (Creator)
Marie Louise Schjellerup Jørkov (Creator)
Michael M. Benedetti (Creator)
Jonathan Haws (Creator)
Nuno Bicho (Creator)
Frido Welker (Creator)
Enrico Cappellini (Creator)
Jesper Velgaard Olsen (Creator)

Data set

Description

Scripts and configuration files for species identification

The scripts for species identification were designed to work with RStudio 1.3.1093 on a Windows 10 machine. Small adjustments will be necessary to migrate them to other operating systems or environments. Because the search engines produce different output formats, there are two separate projects: one for DDA data analyzed with MaxQuant (1.6.0.17) and one for DIA data analyzed with Spectronaut (14.5.200813). The analysis is ideally, but not necessarily, done with the provided protein database.

For species determination based on DIA data, the raw files are searched in Spectronaut with both the library-based and the DirectDIA approach. Output files can be generated with the Spectronaut export schemes provided in the "Configuration" folder. The raw files should be specified and labeled following the "Configuration/Experimental annotation.csv" example. If libraries other than the ones provided with the SPIN article are used, the respective species should be included in "Configuration/Library list.csv". The SPIN protein databases are already in the "Databases" folder and can be extended with additional protein sequences by aligning them with the other sequences for the same gene. The Spectronaut outputs of the DirectDIA and library-based DIA searches need to be placed in the respective "Spectronaut output" folders. Lastly, the scripts need to be executed from RStudio by opening "R-Project/R-Project.Rproj", or from another program with adjusted working directories. The script "R-Project/scripts/main.R" executes the species identification pipeline by calling functions from the other scripts provided in the same folder. If executed successfully, the script produces a species identification table in .csv format along with a collection of consensus sequences of the analyzed samples.

Species identification based on DDA follows the same scheme with few changes. The data analysis needs to be done in MaxQuant using the provided gapless protein database.
The output files "evidence.txt" and "summary.txt" need to be moved to "DDA-based/MQ output". The procedure for running the species inference scripts is identical to DIA-based species identification.

Databases

PR210107 Merged Top20 aligned.fasta
Aligned protein database used for species identification by SPIN. Sequences for each gene have been subjected to a multiple sequence alignment using MUSCLE and saved in .fasta format including the gaps. The database contains predicted and experimental protein sequences from UniProt and NCBI spanning the 20 most common bone genes across all available mammalian species. When adding more sequences, they should be aligned within the respective gene group and named following the UniProt "fasta header" format: ">NCBI|[protein ID]|[protein ID] [gene alias] [protein description] OS=[species name] OX=[species ID] GN=[gene name]".

PR210107 Merged Top20 gapless.fasta
Gapless protein database used for species identification by SPIN. Generated by removing the gaps introduced by the multiple sequence alignment. This database is compatible with most search engines and can be configured with UniProt file parsing rules.

PR200512 HumanCons.fasta
Contaminants protein database. The list of contaminant protein sequences was inspired by the "contaminants.fasta" provided with MaxQuant (Tyanova, Temu, & Cox, 2016). Contaminants that are only relevant for samples from cell culture, such as bovine serum albumin and collagen, were removed because they can lead to false contaminant annotations in the bone proteome context. The remaining contaminant sequences are mostly from human keratins and common proteases used in bottom-up proteomics. The annotation of protein sequences was updated to the current UniProt format. This database should be used in conjunction with the main gapless database when setting up a database search for SPIN in MaxQuant or Spectronaut.
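
The relationship between the aligned and the gapless database can be illustrated with a short sketch. The SPIN scripts themselves are written in R; the Python snippet below is not taken from them and only shows the general idea of stripping alignment gaps while keeping the headers unchanged. It assumes that gaps are written as "-" (the usual convention in aligned FASTA files), and the two records, their protein IDs, and their species names are made-up placeholders.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def remove_gaps(records: Iterable[Tuple[str, str]],
                gap_char: str = "-") -> List[Tuple[str, str]]:
    """Drop the gap character from every sequence; headers stay untouched."""
    return [(header, seq.replace(gap_char, "")) for header, seq in records]


# Hypothetical aligned records (placeholder IDs, species, and sequences).
aligned_text = """\
>NCBI|XP_000001|XP_000001 COL1A1 hypothetical entry OS=Example species OX=0 GN=COL1A1
GPP--GAK
GE-R
>NCBI|XP_000002|XP_000002 COL1A1 hypothetical entry OS=Other species OX=0 GN=COL1A1
GPPSAGAK
GEPR
"""

gapless = remove_gaps(read_fasta(aligned_text.splitlines()))
for header, seq in gapless:
    print(f">{header}\n{seq}")
```

After gap removal the sequences no longer share a common coordinate system, which is why position-based information such as alignment sites has to be taken from the aligned file.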
Species identification helper files

PR201105 Manual SpeciesFineStructure Peptides.csv
Fine grouping peptides. Collection of manually selected peptide sequences that are robust markers for distinguishing species from closely related species that are otherwise hard to tell apart. Species are grouped in "clusters", each of which describes a group of closely related species that can be distinguished using the selected peptides. Amino acid variants and peptide sequences are given for every species within each cluster. The "Site" column refers to the position in the global sequence alignment obtained by pasting all bone genes in alphabetical order, and the "Comment" column indicates the identification frequency in the reference samples. To expand the list, it is sufficient to provide the cluster, species, and peptide sequences for every species in the cluster.

Library list.csv
Library list. Simple list of all species with available spectral libraries. Closely related species without available species-specific libraries, such as the American bison or buffalo, were included as well. The list was used for merging library-based and DirectDIA results. For every DirectDIA species call that did not appear in this list, the library-DIA species was replaced with the DirectDIA species.

FASTAfix.csv
Missing gene annotations. Small helper file to add gene annotations that were missing or inconsistent in UniProt.
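
The merging rule described for "Library list.csv" can be sketched as follows. This is an illustrative Python snippet, not the R code shipped with the data set; the function name, the per-sample dictionaries, and the example species calls are all hypothetical. It implements only the stated rule: the library-based call is kept unless the DirectDIA call names a species that is absent from the library list.

```python
from typing import Dict, Set


def merge_species_calls(library_calls: Dict[str, str],
                        directdia_calls: Dict[str, str],
                        library_species: Set[str]) -> Dict[str, str]:
    """Combine per-sample species calls from library-based DIA and DirectDIA.

    The DirectDIA call overrides the library-based call only when the
    DirectDIA species does not appear in the library list.
    """
    merged = {}
    for sample, lib_species in library_calls.items():
        dd_species = directdia_calls.get(sample)
        if dd_species is not None and dd_species not in library_species:
            merged[sample] = dd_species
        else:
            merged[sample] = lib_species
    return merged


# Hypothetical library list and species calls, for illustration only.
library_species = {"Bos taurus", "Bison bison", "Ovis aries"}
library_calls = {"S1": "Bos taurus", "S2": "Ovis aries", "S3": "Bos taurus"}
directdia_calls = {"S1": "Bos taurus", "S2": "Rangifer tarandus", "S3": "Bison bison"}

merged = merge_species_calls(library_calls, directdia_calls, library_species)
for sample in sorted(merged):
    print(sample, merged[sample])
```

In this toy input, sample S2 takes the DirectDIA call because that species is not in the list, while S3 keeps the library-based call because its DirectDIA species is listed. This mirrors why closely related species without their own libraries were added to the list.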

Date made available	2022
Publisher	Zenodo

DOI
10.5281/zenodo.6406044

Access to dataset

open

Citation formats

Rüther, P. L., Husic, I. M., Bangsgaard, P., Murphy-Gregersen, K., Pantmann, P., Carvalho, M., Godinho, R. M., Friedl, L., Cascalheira, J., Taurozzi, A. J., Jørkov, M. L. S., Benedetti, M. M., Haws, J., Bicho, N., Welker, F., Cappellini, E., Olsen, J. V. (2022): SPIN - Species by Proteome INvestigation: Code, databases, and example data. Zenodo. 10.5281/zenodo.6406044
