
MicroBinfie Podcast, Episode 81: The people behind the benchmark datasets for SARS-CoV-2

Released on April 29, 2022

Podcast Episode: Benchmark Datasets for SARS-CoV-2

In this episode, we welcome Lingzi Xiaoli and Jill Hagey to discuss their development of benchmark datasets for SARS-CoV-2.

Links and Resources

Explore the datasets: CDC SARS-CoV-2 Datasets
Check out a previous related episode for part 1 of the conversation.

Related Works

Previous paper on bacterial datasets: PeerJ Article

Connect with Our Guests

Jill Hagey:
- Twitter: @JillHagey
- Website: jvhagey.github.io
Lingzi Xiaoli:
- LinkedIn: Lingzi Xiaoli

Stay tuned to learn more about the insights and implications of these datasets in the field of virology and genomics.

Extra notes

The discussion revolves around creating benchmark datasets for SARS-CoV-2 and the methods used in acquiring and processing these datasets.
Jill explored using Selenium, a browser automation tool with a Python package, to automate sequence downloads by driving a web browser programmatically, although this approach was later abandoned in favor of more efficient methods.
Checking and verifying metadata for the datasets was crucial. There was an emphasis on sample consistency, such as making sure all samples were Illumina, paired-end reads generated with ARTIC primers.
Incorrect or inconsistent metadata entries on platforms like SRA were a challenge, particularly when the metadata did not specify the sequencing technology or when primer naming conventions differed (e.g., multiple ways of writing "ARTIC V3").
Initial quality control (QC) on sequence data involved FastQC, depth of coverage analysis using SAMtools, and running the Titan pipeline, which provided various QC metrics including Pangolin lineage assignments and detection of amino acid mutations.
Representative samples for variants of concern (VOCs) were selected with tools like Snippy, favoring samples with the fewest SNP differences from CDC internal references while ensuring that key mutations, primarily in the spike protein, were present.
Filtering was aided by linking GISAID assemblies to their SRA records, so that only high-quality datasets were selected for further comparison.
Decisions on which viral lineages to include in the study were based on CDC-defined variants of interest or concern, because there was no WHO nomenclature at the time.
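
The representative-sample selection described in these notes can be sketched roughly as follows. This is a minimal illustration, not the workflow used by the guests: it assumes the SNP count of each candidate against a reference has already been computed (for example with Snippy), and the field names (`sample_id`, `snp_count`, `mutations`), the sample IDs, and the mutation labels are all hypothetical placeholders.

```python
def pick_representative(candidates, required_mutations):
    """Pick the candidate with the fewest SNPs that carries every required mutation.

    `candidates` is a list of dicts with hypothetical keys:
      sample_id  - any identifier for the sample
      snp_count  - SNP differences from the reference (assumed precomputed)
      mutations  - iterable of mutation labels observed in the sample
    Returns the chosen dict, or None if no candidate has all required mutations.
    """
    required = set(required_mutations)
    # Keep only samples that carry every required mutation.
    eligible = [c for c in candidates if required <= set(c["mutations"])]
    if not eligible:
        return None
    # Fewest SNPs wins; ties are broken by sample_id so the result is stable.
    return min(eligible, key=lambda c: (c["snp_count"], c["sample_id"]))


# Illustrative input only; these are not real records.
candidates = [
    {"sample_id": "sample_A", "snp_count": 4, "mutations": ["S:N501Y", "S:D614G"]},
    {"sample_id": "sample_B", "snp_count": 1, "mutations": ["S:D614G"]},
    {"sample_id": "sample_C", "snp_count": 2, "mutations": ["S:N501Y", "S:D614G"]},
]
best = pick_representative(candidates, ["S:N501Y", "S:D614G"])
print(best["sample_id"])
```

Here `sample_B` has the fewest SNPs but is skipped because it lacks one of the required mutations, which mirrors the idea of checking for key mutations before minimizing SNP distance.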

Key technical tools and processes mentioned include:

Selenium for web automation.
SAMtools for depth of coverage analysis.
Snippy for SNP comparison.
Titan pipeline for detailed quality control.
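
As a rough illustration of the depth of coverage step, the sketch below summarizes the three-column, tab-separated output (reference, position, depth) that `samtools depth` writes for a single BAM file. It only parses text that has already been produced; the `min_depth` threshold of 10 and the function name are assumptions for the example, not values from the episode.

```python
def summarize_depth(lines, min_depth=10):
    """Summarize per-position depth lines of the form 'ref<TAB>pos<TAB>depth'.

    Returns the number of positions seen, the mean depth, and the fraction of
    positions with depth >= min_depth (min_depth is an illustrative default).
    """
    depths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        _ref, _pos, depth = line.split("\t")[:3]
        depths.append(int(depth))
    if not depths:
        return {"positions": 0, "mean_depth": 0.0, "breadth": 0.0}
    covered = sum(1 for d in depths if d >= min_depth)
    return {
        "positions": len(depths),
        "mean_depth": sum(depths) / len(depths),
        "breadth": covered / len(depths),
    }


# Illustrative input only; a real run would read the file written by samtools.
example = [
    "MN908947.3\t1\t0",
    "MN908947.3\t2\t10",
    "MN908947.3\t3\t20",
    "MN908947.3\t4\t30",
]
summary = summarize_depth(example)
print(summary)
```

Note that positions with zero depth only appear in the input if samtools was asked to report them (the `-a` option); otherwise the breadth figure is computed over covered positions only and will look better than it is.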

Challenges highlighted pertained predominantly to automatic data retrieval, metadata verification, and naming conventions in datasets, reflecting common difficulties in the field of microbial bioinformatics.
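
To make the naming-convention problem concrete, here is a minimal sketch of how free-text primer labels might be normalized before filtering. It is an assumption-laden example rather than the method used for the datasets: the record keys (`platform`, `layout`, `primer_scheme`) are hypothetical, and the regular expression covers only a handful of spellings.

```python
import re


def normalize_primer_scheme(raw):
    """Map free-text labels such as 'Arctic V3' or 'artic_v3' to 'ARTIC V3'.

    Returns None when the label is empty or no ARTIC version can be recognized.
    """
    if not raw:
        return None
    # Lowercase and treat the common misspelling "arctic" as "artic".
    text = raw.strip().lower().replace("arctic", "artic")
    match = re.search(
        r"artic[\s_\-]*(?:v(?:ersion)?[\s_\-.]*)?(\d+(?:\.\d+)?)", text
    )
    if match is None:
        return None
    return f"ARTIC V{match.group(1)}"


def passes_filters(record, wanted_scheme="ARTIC V3"):
    """Check a metadata record (hypothetical keys) for Illumina, paired-end, and primer scheme."""
    platform = (record.get("platform") or "").strip().upper()
    layout = (record.get("layout") or "").strip().upper()
    scheme = normalize_primer_scheme(record.get("primer_scheme"))
    return platform == "ILLUMINA" and layout == "PAIRED" and scheme == wanted_scheme


# Illustrative records only.
records = [
    {"platform": "ILLUMINA", "layout": "PAIRED", "primer_scheme": "Arctic V3"},
    {"platform": "illumina", "layout": "paired", "primer_scheme": "artic_v3"},
    {"platform": "ILLUMINA", "layout": "SINGLE", "primer_scheme": "ARTIC v3"},
    {"platform": None, "layout": "PAIRED", "primer_scheme": "ARTIC V3"},
]
kept = [r for r in records if passes_filters(r)]
print(len(kept))
```

Records with a missing platform are rejected rather than guessed at, which reflects the point that unspecified sequencing technology makes a record hard to trust.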

Episode 81 transcript