
MicroBinfie Podcast, 123 The Revolution of Hash Databases in cgMLST

Released on March 21, 2024

In this episode of the MicroBinfie Podcast, hosts Dr. Andrew Page and Dr. Lee Katz explore hash databases and their use in cgMLST (core genome multilocus sequence typing) for microbial bioinformatics.

Challenges with Siloed MLST Databases

The episode begins by addressing the obstacles faced by bioinformaticians due to fragmented MLST databases worldwide. These silos impede synchronization and effective genomic surveillance. To overcome these challenges, the concept of using hash databases for allele identification is introduced.

Hashing in Genomics

Hashing is explained as a technique for deriving a compact, effectively unique identifier directly from a genetic sequence. Because any lab can compute the same identifier from the same sequence, databases can be synchronized without a central system assigning allele names. Dr. Katz elaborates on the principles of hashing in genomics, noting that even a single nucleotide polymorphism (SNP) produces a different hash, which makes hashing well suited to differentiating alleles.
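As a rough illustration of the idea, here is a minimal Python sketch that hashes two allele sequences differing by one SNP. The sequences, the function name `allele_hash`, the choice of SHA-256, and the uppercase/strip normalization are all assumptions made for this example; they are not taken from the episode or from any particular tool.

```python
import hashlib


def allele_hash(sequence: str) -> str:
    """Return the SHA-256 hex digest of an allele sequence.

    Normalizing to uppercase and stripping whitespace is an assumption:
    every lab would have to agree on the same normalization, otherwise
    identical alleles would receive different hashes.
    """
    normalized = sequence.strip().upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


# Two made-up sequences that differ at a single position (A -> G).
allele_a = "ATGGCTAGCTAGGATCCGATTAA"
allele_b = "ATGGCTAGCTAGGATCCGATTGA"

hash_a = allele_hash(allele_a)
hash_b = allele_hash(allele_b)

print(hash_a)
print(hash_b)
print(hash_a == hash_b)  # False: one SNP gives a different digest
```

Any lab running the same function on the same sequence gets the same identifier, which is what removes the need for a central allele numbering service.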

Popular Hashing Algorithms

The hosts discuss several hashing algorithms, such as MD5 and SHA-256, covering their benefits and the risk of hash collisions, where two different sequences produce the same hash. They point out that longer, more complex hashes markedly reduce the chance of such collisions.
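To make the effect of hash length concrete, the sketch below applies the standard birthday-bound approximation to CRC32 (32 bits), MD5 (128 bits), and SHA-256 (256 bits). It assumes hash values are uniformly distributed and that nobody is deliberately constructing collisions; the allele count of one million and the example sequence are hypothetical.

```python
import hashlib
import math
import zlib


def collision_probability(n_items: int, bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n(n-1) / 2^(bits+1)).

    Assumes hash outputs are uniformly distributed over 2^bits values.
    """
    exponent = -n_items * (n_items - 1) / (2.0 ** (bits + 1))
    # -expm1(x) equals 1 - exp(x) but stays accurate for tiny probabilities.
    return -math.expm1(exponent)


seq = b"ATGGCTAGCTAGGATCCGATTAA"  # made-up allele sequence

# Algorithm name -> (hex digest of the example sequence, digest size in bits)
digests = {
    "CRC32": (format(zlib.crc32(seq), "08x"), 32),
    "MD5": (hashlib.md5(seq).hexdigest(), 128),
    "SHA-256": (hashlib.sha256(seq).hexdigest(), 256),
}

n_alleles = 1_000_000  # hypothetical number of distinct alleles

for name, (digest, bits) in digests.items():
    p = collision_probability(n_alleles, bits)
    print(f"{name:8s} {bits:3d} bits  p(collision) ~ {p:.3g}  {digest}")
```

Under this model a 32-bit hash is almost certain to contain at least one collision at a million alleles, while the 128-bit and 256-bit probabilities are vanishingly small.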

Practical Implementation in Bioinformatics

The episode also covers the practical side of integrating hash databases into bioinformatics software. Because a hash matches only an identical sequence, allele calling must use exact matching rather than similarity searches. Examples of current and upcoming tools, such as EToKi, illustrate how hash databases can be used.
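A minimal sketch of what exact matching against a hash database could look like is shown below. The database layout (locus name mapped to a set of hashes), the locus names, the sequences, and the function names are hypothetical, and the sketch assumes the allele sequence has already been extracted at consistent locus boundaries, since a hash cannot tolerate even one extra base.

```python
import hashlib


def allele_hash(sequence: str) -> str:
    """SHA-256 hex digest of a normalized allele sequence (assumed scheme)."""
    return hashlib.sha256(sequence.strip().upper().encode("ascii")).hexdigest()


# Hypothetical hash database: locus name -> set of known allele hashes.
known_alleles = {
    "locus_0001": {allele_hash("ATGGCTAGCTAGGATCCGATTAA")},
    "locus_0002": {allele_hash("ATGAAACCCGGGTTTTAG")},
}


def call_allele(locus: str, extracted_sequence: str, database: dict) -> tuple:
    """Return (hash, is_known) for a sequence extracted at a locus.

    There is no 'closest hit': the hash is either in the database or not.
    """
    digest = allele_hash(extracted_sequence)
    is_known = digest in database.get(locus, set())
    return digest, is_known


# An identical sequence (case differences are normalized away) is found.
print(call_allele("locus_0001", "atggctagctaggatccgattaa", known_alleles)[1])

# A sequence with one SNP is simply a new allele, not a near match.
print(call_allele("locus_0001", "ATGGCTAGCTAGGATCCGATTGA", known_alleles)[1])
```

The lookup is a set membership test, so the hard part shifts to finding and trimming each locus identically everywhere before hashing.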

Sequence Types in cgMLST

The conversation includes the concept of sequence types in cgMLST and the complexities involved in naming and standardizing them within a decentralized database system. Alternatives, such as allele codes, are suggested as potential simplifications for representing sequence types.

Adoption by Larger Bioinformatics Organizations

The potential for larger bioinformatics organizations, such as PHA4GE or GMI, to adopt the hashing approach is discussed, stressing the importance of a standardized and community-supported framework to ensure the continued effectiveness and longevity of hash databases in microbial genomics.

This episode presents a comprehensive overview of how hash databases could revolutionize microbial genomics. By addressing longstanding issues of database synchronization and allele identification, hash databases pave the way for more efficient and collaborative genomic surveillance worldwide.
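The synchronization benefit can be sketched as follows: two labs build their allele databases independently, and merging them is a plain set union because identical sequences already share an identifier. The lab data, locus names, and function names are invented for illustration, and SHA-256 with uppercase normalization is an assumed convention rather than anything specified in the episode.

```python
import hashlib


def allele_hash(sequence: str) -> str:
    """SHA-256 hex digest of a normalized allele sequence (assumed scheme)."""
    return hashlib.sha256(sequence.strip().upper().encode("ascii")).hexdigest()


def build_database(observed):
    """Build {locus: set of allele hashes} from (locus, sequence) pairs."""
    database = {}
    for locus, sequence in observed:
        database.setdefault(locus, set()).add(allele_hash(sequence))
    return database


def merge_databases(first, second):
    """Merge two hash databases; no identifier reconciliation is needed."""
    merged = {locus: set(hashes) for locus, hashes in first.items()}
    for locus, hashes in second.items():
        merged.setdefault(locus, set()).update(hashes)
    return merged


# Two hypothetical labs that never exchanged allele numbers.
lab_one = build_database([
    ("locus_0001", "ATGGCTAGCTAGGATCCGATTAA"),
    ("locus_0001", "ATGGCTAGCTAGGATCCGATTGA"),
])
lab_two = build_database([
    ("locus_0001", "ATGGCTAGCTAGGATCCGATTGA"),  # also seen by lab one
    ("locus_0002", "ATGAAACCCGGGTTTTAG"),
])

merged = merge_databases(lab_one, lab_two)
print({locus: len(hashes) for locus, hashes in merged.items()})
```

The allele seen by both labs collapses to a single entry after the merge, with no mapping table between the two labs' naming systems.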

Extra notes

Key Points Related to Microbial Bioinformatics:

Siloed Databases in Genomic Surveillance: There's a significant challenge in synchronizing MLST databases across different regions (e.g., England and the U.S.), which leads to issues with allele identification due to independent naming systems.
Hashing in Bioinformatics:
- Hashing is proposed as a potential solution for synchronizing genomic data. It uses one-way algorithms to convert allele sequences into fixed-length identifiers (effectively unique integers), making it easier to synchronize databases globally.
- Different hashing algorithms exist, such as MD5, SHA-256, and CRC32. While MD5 and SHA-256 seem to avoid collisions effectively in this context, CRC32 was found to have some collisions when hashing large datasets.
Bioinformatics Tools and Software:
- Current MLST software often employs loose matching programs like BLAST, but exact matching through hashes requires different software approaches.
- Tools such as EToKi, along with an anticipated new tool, might use hashing for allele identification and database management.
- chewBBACA, although it uses CRC32, serves as an example of a hash database, with room for improvement.
Technical Challenges and Solutions:
- A risk associated with hashing is hash collisions, where different sequences might produce the same hash. Increasing hash complexity can mitigate this risk.
- New methodologies involve using exact matching with hashes, though this requires adapting existing software and possibly creating new tools to handle the unique identifiers effectively.
Ongoing Research and Development:
- The development of allele codes, similar to SNP addresses, aims to decentralize allele identification further, though challenges remain in standardizing and adopting these systems.
- A specification for hashing methodologies is being refined, with standardization efforts needed for broader adoption by international bodies like PulseNet International or other organizations such as GMI and PHA4GE.
Community and Future Directions:
- There's an emphasis on involving international organizations to adopt and maintain these specifications, ensuring they become a widely accepted standard.
- The conversation highlights the need for new practices in microbial bioinformatics that advance beyond traditional genomic typing, especially for larger-scale schemes like cgMLST compared to traditional 7-gene MLST.

This discussion emphasizes the importance of adapting traditional methodologies and tools in microbial bioinformatics to better manage and synchronize genomic data globally using innovative methods such as hashing.

Episode 123 transcript