Hello, and thank you for listening to the Microbid Key Podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil Ali Khan and Professor Andrew Page. Nabil is the head of informatics at the Quadram Institute in Norwich, UK. Andrew is the director of technical innovation for Theogen in Cambridge, UK. I am Dr. Lee Katz, and I am a senior bioinformatician at Centers for Disease Control and Prevention in Atlanta in the United States. Welcome to the Microbid Key Podcast. I'm your host, Andrew Page, and I'm joined today with Lee Katz. Nabil isn't here today, unfortunately, but he sends his best wishes. So we're going to be talking about hash databases. And if you don't know what these are, listen to us for the next few minutes, and we will absolutely tell you all about these and why they are kind of cool and kind of awesome, both at the same time, but also kind of dangerous. So I know, Lee, you've had a little bit of a dabbling in hashes and also in the blockchain and things like that. So do you want to kind of give us a little primer on what this is, what the concept is? Yeah, maybe I'll start with like, why? Because it's kind of out of frustration. We have MLST databases around the world, and they're used for genomic surveillance or population structure or maybe other exercises or applications. And everyone is kind of siloed. And it sucks because one database in England could be, just using you as an example, sorry. Could be, you know, getting all these wonderful genomes and getting MLST alleles, and they increment those alleles by integers. And then over here in the US, we might be doing surveillance or other exercises, and we have our own siloed database. And those integers are going up, and we might have the same alleles across the pond, but they're named differently. And it's frustrating. How do we synchronize those databases better? Or, you know, if it isn't a well-funded institution, what if it's just, you know, some place and they just stand up a database, how do they synchronize easily without good system support or without resources? So some people around the world have floated the idea of hashing the alleles. And for those who don't know, hashing is just a one-way algorithm to take a string and turn it into a large integer. Or usually large. And that integer could be, that hash could be the allele sequence. And so that would be a unique identifier that if I found it and you found it in your institution, we'd have the same identifier and we could synchronize our databases much more easily. And I guess in this case, MLST works on alleles. If there's a single difference, then it is a different number. So a hash is perfect because if there's a single difference, you get a different hash. Yeah, absolutely. It's a totally different hash. I do not understand personally, like what the actual underlying algorithm is. There are several different hashing algorithms out there, but once you introduce even one snip, the whole string, the whole integer, sorry, turns out totally different. And so people would have heard of say MD5, which is an old school way of hashing. And as far as I know, what it does is you take say a block of text, imagine a page of text and you format it into a matrix. Okay, so kind of a little squashed down matrix of a slim width. And then you do mathematical function just along the columns. And the idea is basically is one way you compress it, you lose data, but you get a number at the end, which more or less cannot be, you can't get unless you have the same text. It is possible to get key clashes. So it's used heavily in cryptography. So you can, in theory, come up with a different way of different input data, which would give you the same hash. But these numbers are so big and the space is so large that it doesn't happen very often. So it's quite robust. And for what we do, it's grand. There is a theoretical risk that you can have two different alleles, which will result in the same hash, but it's minusculely small. And if you put in extra safeguards, actually, it becomes even smaller. So have you heard of the birthday attack in cryptography? Barely. You have to remind me about that one. So if you, the probability, I think of a safe room for people, the probability of two people sharing the same birthday, I think is like the square root number of people in the room. Or something like that. Oh, God, no, I got it totally wrong. It's something like, is it there's a 50% chance at a certain level? Anyway, basically, if you have a random pool of people and you take two out, there's a higher chance that randomly two of those will have the same birthday, rather than saying you and you, like who has a birthday on the 1st of May? That's a harder problem. Whereas is who, what any two people in this room have the same birthday, that's a much easier problem on a smaller space. And so that can happen with hashes. And as a computer scientist, I'm failing to explain this very, very well. And I totally apologize. But let's get back to away from hashes and back to the actual databases themselves. So you're talking about people in England, incrementing these numbers and what happens if they're not well funded. So the, those are obviously your centralized databases with gatekeepers and sometimes committees who, you know, and there's verification and validation and things like that. So is there not a risk that you'll have, you know, database will become polluted if you just allow anyone and everyone to have their own hashes and then say, yes, this is an allele. Yeah. So taken together, like there's a, there's a risk of collision. So this idea that you have any two DNA sequences that they might collide, like there's a risk that you have a, a two sets that have the same hash. And the way that we get around that is we can increase the complexity of the hash. So there's a, an algorithm in, at least I found it in Python called CRC32 that produces a hash, but it's a small integer. And when I hashed a whole MLST database, I did find some collisions. So that means that some of the sequences did have the same hash. So that could have consequences. You might not have the distances right between any two genomes. They might have different DNA sequences, but the same hash. So there would be a zero distance with that locus versus a one distance. But then there's some more complex hashes. A lot of the world seems to be coalescing around MD5SUM, but I mean, I guess it's not really definitive at this point. And MD5SUM, when I hashed a whole database, I did not find collisions. And I guess- How much did you hash on HyperAquazon? I hashed all of the databases I could find on Chewbacca.online. And so those are a bunch of whole genome MLST databases. I think that's some CGMLST databases. And that includes Salmonella, E. coli, Listeria, Campylobacter- The lesson. Yeah. And well, I tried it on some other hash sums. So another one that a lot of people know about is SHA-256. And also I didn't find any collisions there. And I thought, well, what if people started adding more and more? So you're asking about polluting the database. What if people add more and more and more? So I took the biggest database I could find, Salmonella, and I just started introducing random mutations. And I did that 10 times per allele. So I made a database 11 times the size of the original database. What you just using random mutations. And I hash summed all of it. And of course, the CRC32, I found collisions. Both mt5sum and SHA-256, I did not find any collisions. And I showed my work on the, on the GitHub repo. So I guess we can put that into show notes. I guess the other thing is if you restrict, if you say that you can have duplicate hashes, but for different alleles, that's fine. Sorry, for different genes. And then say for Salmonella, I don't know, whatever it is, is it 3000 genes in the scheme? It doesn't really matter at that point, you know, because you kind of split, you know, 3000 copies, potentially of the same thing, which makes life a bit easier. True. I started doing that experiment with CRC32, but I actually didn't need to do that experiment to worry about within loci. I didn't find any collisions across any alleles across any loci. Oh, okay. So it was even better than expected. Yeah, that's really cool. Because that means that you no longer then have to wait, you know, you submit stuff, you wait for your sequence to be accepted. And then you have, you know, you go through that whole loop. And then people would say, can you prove it? can you do Sanger sequencing? You know, there's a few barriers to some of these databases where because they're very suspicious and they want high quality calls in there only. Obviously, that's more for MLST, but you can't do the same kind of QC on CGMLST itself, because obviously there's so many more genes. But that's really cool. And how are people going to use this in practice? Is there software then that people are going to use? Because no one is going to go and make their own. Yeah, to do this, you actually need to change the idea behind a lot of software. So a lot of good MLST software uses BLAST or other loose matching programs. It doesn't have to be an exact match, but with hashes, even one snip changes it to a different hash. So you have exact or not exact matching. There was an idea out there that I tried, but it sort of failed to produce many hashes for each allele. And then you could have kind of a loose matching, like how many of the 10 hashes for the allele that you made did it match on. But it started getting way too complicated, way too fast. It made a huge database, and it ultimately didn't matter. So you still need some software that's either exact or not exact. So some software that does that right now is Itoki. Another one is, I believe, coming out, we're going to see this in the next few months, I believe, is the new BMURU caller that we're trying out. I don't know what it's called yet. I don't know where the software is. I don't know anything. I've just seen like little hints that it might be coming out. And I believe some other software out there. So there's a version of Chewbacca that uses a hash database. Unfortunately, it uses CRC32, which, it does produce collisions, but they are rare still. But I just, I want to be careful about that kind of stuff. So Chewbacca, it's called Chewy Snake by our colleagues in Germany. And I'm forgetting right now. And I think, you know, this is sort of notable. I was looking at Torsten's MLST7gene site, and he does allow for people to do hashes on those alleles too. And I was like, really? Wow. Like, when did he do that? And I, you know, on GitHub you can, or in Git, you can click on blame, which I think is the funniest subcommand. So if you put, if you push blame, it'll tell you line by line who made the change and when. And Torsten made that change for hashing five years ago. So again, ahead of his time. Very, very ahead of his time. He always was. So how are the sequence types allocated? Because those are integers as well. And that's a representation of the collection of alleles. Yeah. So you're touching on something where it got complicated real fast. So when I was designing this, I was thinking, okay, I'll just hash all of the alleles, but wait, what if you guys produce a sequence type and you name it sequence type 10, but over here, we get the sequence type at a later date and we call it sequence type 30. Like now we have different databases. So now there's a method in the specification, which again, we'll have them show notes. The specification shows how to hash the hashes of alleles to create the sequence type. However, like if you are talking to people like a medic or something and you say, oh, you know, I have sequence, this patient has sequence type 11 versus this patient has sequence type ABC 1589201. They'd be like, really? What is that? Right. So yeah, maybe it'll go the way of COVID where you guys had really cool names for different SNPs, like a laser. Maybe we'll have something like that. Well, I mean, there's the COVID naming is still done by essentially, you know, GitHub issues and stuff like that, where it's manually allocated, you know, like BA1, BA2, dot 1, 2, 3, 4, 5 in a very nice structured manner, but it is still very manual and effectively that's kind of a committee informal or otherwise that is the gatekeeper there. Yeah. There's one other notable thing happening right now because sequence types are very difficult, I guess I would say they're, they're less useful for CGMLST than it was for 7-genome LST because every single genome basically gives you a new sequence type. So people rarely use it. We have something else kind of happening in the world right now called allele codes. And we described this in our paper, Stevens et al from a couple of years ago. I guess we'll have to put that in show notes too. Is that like SNP address? It's like SNP address. That's right. We actually copied it from that. Oh, very good. So, but that, I don't know how to decentralize that yet. So it's a little outside the scope of this. I guess, you know, one step at a time, like if we can just get the alleles to be decentralized, then we're probably 90% away there. Yeah. So there's more hashing in this database that makes it more and more and more complicated. I won't get into it. There are, I basically took a few days to make a whole specification to draw it out and it's working at this point. I think it could be used with certain callers like eToki and I think that other callers could probably adapt to it. We'll see how it goes. Can I ask, right? You've made a specification there, which looks really cool, but this is in your personal GitHub account. And would it not make more sense for longevity to have say phage or some kind of organization like that to take it over where you have that weight of the international community, weight of the international community kind of feeding in and maintaining it and expanding it to make it a true standard? Yeah. One day I think I'm going to take this to phage and, or maybe another organization. We'll see where I can, where I can go and just sort of ask them to adopt me on this. I'm looking for a parent to adopt me on this specification. There are, there are a few organizations that I agree probably could use it and could probably take it to fruition. Probably phage. Another organization is GMI. That's a possibility. And, and yet one more organization that I actually made this for is PulseNet International. So there are some very viable candidates. Since I made it for PulseNet International, that's possibly more of a place to go to with this, but you know, I really, this is, this is me just like brainstorming aloud. So maybe I shouldn't be doing that recording, but okay. I guess if I, if I said it and it's in the recording, we can't edit it. So it's there. Grand. Well, I look forward to seeing this in production. You know, it's a super useful thing. And I know it's something people have been talking about for a long time. So it's great to see that you've actually gone on and done something about it rather than just, you know, sitting back and riffing on it in a podcast and not doing anything. You've actually got code there. So fair play, Lee. I think we'll leave it there and we'll see you guys in a week or two. Thank you so much for listening to us at home. If you liked this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at MicroBinfy. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.