Hominid Palaeoproteomic Reference Dataset

Dataset

Description

This dataset contains the 'Hominid Palaeoproteomic Reference Dataset'. We used PaleoProPhyler ( https://github.com/johnpatramanis/Proteomic_Pipeline ) to generate a palaeoproteomic reference dataset of protein sequences from ancient and present-day hominids. Using the first two modules of PaleoProPhyler, we translated 176 publicly available whole genomes from extant and extinct hominid groups. We also translated 8 ancient hominin genomes from VCF files, including those of 3 Neanderthals and one Denisovan. Since the dataset is tailored for palaeoproteomic analyses, we chose to translate proteins that have previously been reported as present in either teeth or bones. We compiled a list of 1,696 proteins from previous works and successfully translated 1,543 of them. For each protein, the canonical isoform and all alternative protein-coding isoforms were translated, leading to a total of around 10,058 protein sequences for each individual in the dataset. Details on the processing of the sequences can be found in the supplementary materials of PaleoProPhyler ( https://github.com/johnpatramanis/Proteomic_Pipeline/blob/main/GitHub_Tutorial/Supplementary.pdf ). The full list of the translated proteins can be found here: https://github.com/johnpatramanis/Proteomic_Pipeline/blob/main/Reference_Protein_List.txt and a table with information on each sample included in the dataset can be found here: https://github.com/johnpatramanis/Proteomic_Pipeline/blob/main/Reference_Sample_List.csv

Content: The zipped file contains five items: one txt file, two fasta files and two folders:
- PalaeoProPhyler_Publication_Data_for_Tree.fa contains all of the sequences used to generate the phylogenetic tree presented in the PaleoProPhyler manuscript.
- ALL_PROT_REFERENCE.fa contains all of the sequences generated as part of the Hominid Palaeoproteomic Reference Dataset described above, in a single fasta file.
- PER_PROTEIN is a folder containing one fasta file for each protein within the Hominid Palaeoproteomic Reference Dataset. Each protein fasta file holds the sequences of all individuals for that particular protein.
- PER_SAMPLE is a folder containing one fasta file for each sample/individual within the Hominid Palaeoproteomic Reference Dataset. Each sample fasta file holds the sequences of all proteins for that particular sample.
- Reference_Protein_List.txt is a txt file containing two columns. The first column lists all the proteins selected for translation. The second column describes where each of these proteins was mentioned or identified. If a protein was identified in a publication, the title of the publication is given. If a protein was identified in one of the publications of our group (E. Cappellini group), the identifier 'our samples' is given. If multiple publications support a protein, they are all given, separated by commas.

NOTE: Depending on which samples you use, we strongly encourage you to cite the original publication(s) that provided the DNA data from which we translated the proteins:
Modern humans: M. Byrska-Bishop et al. “High Coverage Whole Genome Sequencing of the Expanded 1000 Genomes Project Cohort Including 602 Trios”. bioRxiv (2021).
Non-human great apes: Javier Prado-Martinez et al. “Great ape genetic diversity and population history”. In: Nature 499.7459 (2013), pp. 471–475.
Orangutans (Pongo): Alexander Nater et al. “Morphometric, behavioral, and genomic evidence for a new orangutan species”. In: Current Biology 27.22 (2017), pp. 3487–3498.
Neanderthals, Denisovan and ancient anatomically modern humans: Kay Prüfer et al. “A high-coverage Neandertal genome from Vindija Cave in Croatia”. In: Science 358.6363 (2017), pp. 655–658. Fabrizio Mafessoni et al. “A high-coverage Neandertal genome from Chagyrskaya Cave”. In: Proceedings of the National Academy of Sciences 117.26 (2020), pp. 15132–15136.
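
Example: the fasta files described under Content (ALL_PROT_REFERENCE.fa and the files inside PER_PROTEIN and PER_SAMPLE) can be read with a few lines of Python. The sketch below is only an illustration and is not part of PaleoProPhyler; the function name read_fasta and the toy records are our own assumptions. It makes no assumption about how the headers in this dataset are laid out: each header line is kept verbatim (minus the leading ">") and mapped to its concatenated sequence. If two records share the same header, the later one overwrites the earlier one.

```python
from typing import Dict, Iterable, List, Optional


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary.

    Hypothetical helper written for illustration only.
    Headers are stored without the leading '>' and are not interpreted.
    """
    records: Dict[str, str] = {}
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines may be wrapped; collect them until the next header.
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Toy input (made-up headers and sequences, not taken from the dataset).
example_lines = [">sample_A", "MKV", "LLA", ">sample_B", "MKI"]
example_records = read_fasta(example_lines)
```

To read a real file, pass an open file handle instead of a list, for instance `with open(path) as handle: records = read_fasta(handle)`, where `path` points to one of the fasta files extracted from the zipped archive.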
Date made available: 2022
Publisher: Zenodo
