Released on March 21, 2024
In this episode of the Micro Binfie Podcast, hosts Dr. Andrew Page and Dr. Lee Katz explore the captivating world of hash databases and their usage in cgMLST (core genome Multilocus Sequence Typing) for microbial bioinformatics.
The episode begins by addressing the obstacles faced by bioinformaticians due to fragmented MLST databases worldwide. These silos impede synchronization and effective genomic surveillance. To overcome these challenges, the concept of using hash databases for allele identification is introduced.
Hashing is explained as a technique for deriving a compact, fixed-length identifier directly from a genetic sequence. Because the same sequence always yields the same hash, any lab can compute the identifier independently, which simplifies database synchronization without requiring extensive system support. Dr. Katz elaborates on the principles of hashing in genomics, noting that even a single nucleotide polymorphism (SNP) produces a different hash, which makes the approach well suited to differentiating alleles.
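As a rough illustration of the idea, the sketch below hashes two short allele sequences that differ by one base. The sequences, the `allele_hash` helper, and the normalization step (trimming whitespace and uppercasing) are assumptions made for the example, not details taken from the episode; MD5 is used only because it is one of the algorithms the hosts mention.

```python
import hashlib


def allele_hash(sequence: str) -> str:
    """Return the MD5 hex digest of an allele sequence.

    The sequence is stripped and uppercased first (an assumed convention)
    so that formatting differences do not change the identifier.
    """
    normalized = sequence.strip().upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


# Two made-up alleles that differ at a single position (G -> C).
allele_a = "ATGGCTAAAGGTCTGTAA"
allele_b = "ATGGCTAAAGGTCTCTAA"

hash_a = allele_hash(allele_a)
hash_b = allele_hash(allele_b)

print(hash_a)
print(hash_b)
print("same allele:", hash_a == hash_b)

# Two labs that store the same sequence with different formatting
# still arrive at the same identifier without any coordination.
print("labs agree:", allele_hash("atggctaaaggtctgtaa\n") == hash_a)
```

The point of the sketch is that the identifier comes from the sequence itself, so no central authority has to hand out allele numbers.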
The hosts discuss various hashing algorithms, such as MD5 and SHA-256, covering their benefits and the risk of hash collisions, where two different sequences produce the same hash. They point out that longer hashes (SHA-256 produces a 256-bit digest, MD5 a 128-bit one) markedly reduce the chance of such collisions.
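To put rough numbers on the collision risk, the sketch below uses the standard birthday-bound approximation, p ≈ 1 − exp(−n(n−1) / 2^(b+1)), for n hashed alleles and a b-bit hash. It assumes hash outputs behave like uniformly random values and that nobody is crafting collisions on purpose (deliberate MD5 collisions can be constructed); the function name and the allele counts are illustrative.

```python
import math


def collision_probability(n_alleles: int, hash_bits: int) -> float:
    """Approximate chance of at least one accidental collision.

    Birthday bound for n_alleles values drawn uniformly at random
    from a space of 2**hash_bits possible hashes.
    """
    space = 2.0 ** hash_bits
    exponent = -n_alleles * (n_alleles - 1) / (2.0 * space)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# Illustrative comparison: a short 32-bit hash, MD5 (128 bits), SHA-256 (256 bits).
for label, bits in [("32-bit", 32), ("MD5", 128), ("SHA-256", 256)]:
    p = collision_probability(10**9, bits)
    print(f"{label:8s} {bits:3d} bits, 1e9 alleles: p = {p:.3g}")
```

Under these assumptions, accidental collisions are negligible for both MD5 and SHA-256 at realistic database sizes, while a short hash would collide almost immediately.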
The episode also delves into the practical aspects of integrating hash databases into bioinformatics software. Because a hash carries no information about sequence similarity, allele lookups against a hash database must be exact matches: a sequence that differs by one base from a known allele looks entirely new. Examples of current and upcoming tools, like EToKi, illustrate how hash databases can be utilized.
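A minimal sketch of what exact-match lookup against a hash database could look like is shown below. It does not reflect how EToKi or any other tool is implemented; the locus names, sequences, and helper functions are all hypothetical. It also shows two independently built databases being merged by a plain set union, with no renumbering of alleles.

```python
import hashlib


def allele_hash(sequence: str) -> str:
    """SHA-256 hex digest of a whitespace-trimmed, uppercased sequence."""
    return hashlib.sha256(sequence.strip().upper().encode("ascii")).hexdigest()


def build_hash_db(alleles: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each locus to the set of hashes of its known allele sequences."""
    return {locus: {allele_hash(s) for s in seqs} for locus, seqs in alleles.items()}


def merge_hash_dbs(*dbs: dict[str, set[str]]) -> dict[str, set[str]]:
    """Combine databases locus by locus using set union."""
    merged: dict[str, set[str]] = {}
    for db in dbs:
        for locus, hashes in db.items():
            merged.setdefault(locus, set()).update(hashes)
    return merged


def is_known(db: dict[str, set[str]], locus: str, sequence: str) -> bool:
    """Exact-match lookup: True only if this precise sequence was seen before."""
    return allele_hash(sequence) in db.get(locus, set())


# Hypothetical allele collections held by two separate labs.
lab_a = build_hash_db({"locus_1": {"ATGAAACGTTAA"}})
lab_b = build_hash_db({"locus_1": {"ATGAAACGATAA"}, "locus_2": {"ATGCCCGGGTAA"}})
merged = merge_hash_dbs(lab_a, lab_b)

print(is_known(merged, "locus_1", "ATGAAACGTTAA"))  # allele from lab A
print(is_known(merged, "locus_1", "ATGAAACGTTAG"))  # one base off: treated as new
```

Because the lookup is a set membership test, a near-identical sequence gets no partial credit; finding the closest known allele would need a separate alignment or search step.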
The conversation includes the concept of sequence types in cgMLST and the complexities involved in naming and standardizing them within a decentralized database system. Alternatives, such as allele codes, are suggested as potential simplifications for representing sequence types.
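One way to see why decentralized naming of sequence types is awkward, and how hashing might extend to whole profiles, is the sketch below. It is an assumption for illustration only, not the allele code scheme mentioned in the episode: it derives a profile identifier by hashing the sorted locus and allele-hash pairs. The `profile_hash` function, the locus names, and the placeholder allele hashes are hypothetical.

```python
import hashlib


def profile_hash(allele_hashes: dict[str, str]) -> str:
    """Derive one identifier for a cgMLST profile (locus name -> allele hash)."""
    # Sort by locus so the identifier does not depend on input order.
    lines = [f"{locus}\t{allele_hashes[locus]}" for locus in sorted(allele_hashes)]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# Placeholder allele hashes, shortened for readability.
isolate_1 = {"locus_1": "a1f3", "locus_2": "9c0e", "locus_3": "77b2"}
isolate_2 = {"locus_3": "77b2", "locus_1": "a1f3", "locus_2": "9c0e"}  # same profile
isolate_3 = {"locus_1": "a1f3", "locus_2": "9c0e", "locus_3": "0000"}  # one locus differs

print(profile_hash(isolate_1) == profile_hash(isolate_2))
print(profile_hash(isolate_1) == profile_hash(isolate_3))
```

A profile hash of this kind is reproducible anywhere, but with thousands of cgMLST loci nearly every isolate would get its own identifier, and the hash says nothing about how closely two profiles are related. That limitation is one reason a more structured representation can be attractive.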
The potential for larger bioinformatics organizations, such as PHA4GE or GMI, to adopt the hashing approach is discussed, stressing the importance of a standardized and community-supported framework to ensure the continued effectiveness and longevity of hash databases in microbial genomics.
This episode presents a comprehensive overview of how hash databases could revolutionize microbial genomics. By addressing longstanding issues of database synchronization and allele identification, hash databases pave the way for more efficient and collaborative genomic surveillance worldwide.
Key Points Related to Microbial Bioinformatics:
Siloed Databases in Genomic Surveillance: There's a significant challenge in synchronizing MLST databases across different regions (e.g., England and the U.S.), which leads to issues with allele identification due to independent naming systems.
Hashing in Bioinformatics:
Bioinformatics Tools and Software:
Technical Challenges and Solutions:
Ongoing Research and Development:
Community and Future Directions:
This discussion emphasizes the importance of adapting traditional methodologies and tools in microbial bioinformatics to better manage and synchronize genomic data globally using innovative methods such as hashing.