Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is the head of informatics at the Quadram Institute in Norwich, UK. Andrew is the director of technical innovation for Theiagen Genomics in Cambridge, UK. I am Dr. Lee Katz, and I am a senior bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.

Hey, welcome to the MicroBinfie podcast. Today, we have Andrew here. Nabil is off taming horses or something, I don't know; he's just taking a little vacation, I guess. We're going to talk about some software that Andrew was involved with, called GAMBIT. And he basically brought me into this saying, I'll riff on it, and I don't care if you don't know anything about it. And I'm just going to enjoy this. I'm going to ask a lot of questions, I guess. So what is GAMBIT? What have you brought me into?

Okay, so GAMBIT is a tool for typing bacteria, right? And eukaryotes, too. It uses k-mers under the hood, but it's done in a very clever way. I've come into this very late in the day. It was originally written by Jared Lumpe, sorry if I pronounce the name wrong, over the past few years, and now it's maintained by the Nevada State Public Health Laboratory and David Hess. So I've come into this late in the day basically to expand upon it, find new uses for it, build databases, and all this kind of jazz, you know, to keep it alive. Because with any software project, you need continuous development to keep it relevant and to keep it good.

So GAMBIT, right? It is absolutely awesome, and I'm a bit of a convert now. It does a targeted k-mer search. So you take k-mers with a very particular prefix, a prefix of the kind usually found at the beginning of a gene, right? And then you take another, say, 11 bases after that. So you have a fixed bit, like your anchor, and then you have the variable bit. And from there, you can actually build up a signature. And with bacteria, the really interesting thing is, if you have, say, Salmonella, which is about a five-megabase genome, you get about 10,000 k-mers out of that, because the number of k-mers you get roughly correlates with the size of the genome. And they're all different. And actually, when you start looking at these, you can very quickly boil a species down to a set of k-mers in a targeted fashion. You get a nice little set, and that defines the species. And what's really nice about it is that, unlike, say, Mash, where you randomly subsample, this is targeted, so you always get all the k-mers with this particular prefix. And they've gone and built databases, you know, say, from all of RefSeq, and then looked at the diameters. That is, you look at all the k-mers that occur in each genome, and then for a species you say, well, these k-mers are in this species. And that gives you a rough idea of how diverse or how compact that species is. And it also gives you a quantitative measure of how close you are to that species, or whether you're within that species.
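To make the targeted search concrete, here is a minimal Python sketch of the idea just described: scan for a fixed anchor prefix and collect the variable bases that follow it. The ATGAC prefix and 11-base suffix length are taken from the GAMBIT publication, but treat everything here (function names, the both-strands scan) as a reconstruction of the concept, not GAMBIT's actual code.

```python
PREFIX = "ATGAC"   # GAMBIT's published default prefix, assumed here
SUFFIX_LEN = 11    # the "variable bit" after the anchor

def revcomp(seq: str) -> str:
    """Reverse complement, ACGT only."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def targeted_kmers(genome: str) -> set[str]:
    """Collect every 11-base window that directly follows the prefix,
    on both strands, skipping windows with ambiguous bases."""
    kmers = set()
    for seq in (genome, revcomp(genome)):
        pos = seq.find(PREFIX)
        while pos != -1:
            start = pos + len(PREFIX)
            suffix = seq[start:start + SUFFIX_LEN]
            if len(suffix) == SUFFIX_LEN and set(suffix) <= set("ACGT"):
                kmers.add(suffix)
            pos = seq.find(PREFIX, pos + 1)
    return kmers

# A ~5 Mb Salmonella assembly yields on the order of 10,000 such k-mers.
print(targeted_kmers("TTATGACGGGTTAACCGGTTAGC"))  # {'GGGTTAACCGG'}
```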
And actually, if you were to graph it all out, they're quite diverse. So E. coli and Salmonella are quite far apart. And then they've gone one step further, which is to clinically validate this in the lab: actual validated pipelines. The problem with genome sequencing, not metagenomics, is that calling a species a species seems straightforward, right? But often people run Kraken or something like that, and it'll say, well, you got 80% Salmonella, you got 10% E. coli, you've got all these other things; you might get 2,000 species in there. And that's bioinformatics noise. But if you absolutely want to say, bang, bang, bang, this is a Salmonella, it's nothing else, I'm absolutely confident, and then have a value, a number, associated with how confident you are, that's where this comes in. So it's very, very accurate at calling a species a species. And if it can't, then it says, actually, I can only make a genus-level call, or it won't call it at all. So it's actually more conservative in species calls, which is what you need in public health. You absolutely want to be certain. And while Salmonella is easy, once you start getting into the harder parts of speciation... take TB and M. bovis, say: those are like 99% similar, you know? And calling those apart can be difficult bioinformatically, particularly if you're looking at reads. It can be, well, is it really, is it not? So it's actually very good at those edge cases as well.

And so, yeah, I'm a big convert. I love it. I've been using it quite a lot. I know the lab in Nevada and a few other labs have been using it as well. And within Theiagen, within our pipelines, the TheiaProk pipeline, which is kind of our standard bacterial processing pipeline, has been using it like a traffic cop, you know, to decide which sub-pipelines to run. Do you run SISTR or do you run ECTyper or whatever, depending on what species it is. And it's actually been validated properly in the lab, which is a very, very rare thing for bioinformatics. So I'm super impressed by this, and I've been doing a lot of work on it.

So to actually validate something in the lab for clinical use is a big freaking deal. I don't know if people know this, but at least in the US, to validate something for clinical use, it has to go through a regulatory body, at least on a federal level. I don't know about Nevada.

Wait, wait, hopefully I've said that right. Obviously I'm not an American. It's been validated for public health use.

Okay. So if it's validated for public health use and not for clinical use, that's different. And it's still a big deal.

It is a big deal, yeah.

What is CLIA, do you know? Because CLIA is clinical, and that's the big freaking deal.

Argh. I should have double-checked this before talking about it, which is a bit embarrassing. But it is super cool, and I've been doing more work on it recently, trying to use it in different ways. Actually, bioinformatically, it's really well architected. It stores all the k-mers in a database in HDF5 format, which is super, super compressed, and it's a very nice format for very large data sets.
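As an illustration of the storage idea, and the offset trick Andrew explains below, here is a toy h5py sketch: every genome's signature concatenated into one flat array, plus a small offsets table so any single genome's chunk can be sliced out without reading the rest of the file. This is a guess at the general shape of such a layout, not GAMBIT's actual on-disk schema.

```python
import numpy as np
import h5py

# One sorted array of k-mer codes per genome (toy values here).
signatures = [
    np.array([5, 17, 900, 4123], dtype=np.uint32),
    np.array([17, 301, 4123, 90000, 250000], dtype=np.uint32),
]

# Flatten everything into one dataset and record where each genome's
# chunk begins and ends, so one signature can be read without the rest.
bounds = np.cumsum([0] + [len(s) for s in signatures]).astype(np.uint64)
with h5py.File("signatures.h5", "w") as f:
    f.create_dataset("values", data=np.concatenate(signatures), compression="gzip")
    f.create_dataset("bounds", data=bounds)

# Random access: jump straight to genome i's chunk via its offsets.
with h5py.File("signatures.h5", "r") as f:
    i = 1
    lo, hi = f["bounds"][i], f["bounds"][i + 1]
    sig = f["values"][lo:hi]   # only this slice is read from disk
print(sig)                     # [    17    301   4123  90000 250000]
```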
It's hard to work with, but if you remember, the PacBio files used to be in HDF5, and Nanopore raw data files are in HDF5. So, you know, it is used, but it's usually used for the heavy-duty stuff. So at the core it's that, and then it has a database on top of that, MySQL, or was it Postgres? I know it has an SQL database for the metadata, to kind of link it all together. And that means that you can very rapidly get in and out. So you can actually process the data very quickly, because you can jump to different points within the files. And so it's super, super fast. And by fast, I mean really, really fast at typing stuff.

I've made some changes, or some enhancements. So I've got a GAMBIT tool suite, and that has things like, you know, being able to build a database from anything you download. It's been used for eukaryotes, too. For the Candida auris outbreak, it has been used to type the different Candida in that case. And of course, that's obviously an emerging pathogen of concern because of the multi-drug resistance. And yeah, it's just really, really cool. And then I've been working... oh, I've got some extra things that I've thrown in. Like, you can do kind-of pan-genomics and build core genomes, because you can just look at the set of common k-mers, and that works super well. So you can compress it all down, and you can have a very small database that really, you know, nails what something actually is. So anyway, sorry, this is more like an advertisement. I apologize, Lee, but I just think it's a really nice way to work with k-mers, you know, where it's targeted, and it's just so fast, and oh, it's memory efficient, and yeah.

All right, let's go into it. Let's go into it now, because you've put in a lot of technical stuff and a lot of computational stuff I'm kind of curious about. So I'm going to go a little bit further back. It sounds like it has a database of, what is it called, the k-mers with the SNP in the middle, kind of like kSNP? Is that right? How did you get this database, and what is it made of? Did I understand what the k-mers are?

What you do is you take your FASTA file, your genome sequence. You look for a target prefix, so that's, say, five bases that you're looking for, and then you take the next 11 bases after that. Okay? And you store those bases in a database. Actually, what you do is you convert them to numbers. So you have a space of numbers from 1 to, I don't know, let's say 4 million, right? Each number representing a different k-mer. It's basically the binary representation of that suffix, and that's what gets put into the database. So then you're just storing a lot of numbers, and you store them in chunks. So for each genome, you store, say there's 100, you store 100 for one genome and 100 for the next, and then it's just the offset that is stored in a database separately. And so you can then jump straight into the database at exactly the right point. It might say, start at an offset of 100 and read the next 200, and so you can extract that array very, very fast. So you can actually do these operations, like, super quick, and you're just pulling numbers out. You don't have to work with a representation like nucleotides; you're just working with numbers.
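Here is what that conversion to numbers looks like in a minimal sketch: two bits per base packs an 11-base suffix into an integer between 0 and 4^11 - 1 = 4,194,303, the "about 4 million"-sized space mentioned above. GAMBIT's exact bit layout may differ; this just demonstrates the arithmetic.

```python
# Two bits per base packs an 11-base suffix into an integer in [0, 4**11).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_to_int(kmer: str) -> int:
    """Encode a k-mer as an integer, two bits per base."""
    code = 0
    for base in kmer:
        code = (code << 2) | BASE_CODE[base]
    return code

def int_to_kmer(code: int, k: int = 11) -> str:
    """Decode the integer back to its k-mer string."""
    bases = "ACGT"
    return "".join(bases[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))

assert int_to_kmer(kmer_to_int("ACGTACGTACG")) == "ACGTACGTACG"
assert kmer_to_int("T" * 11) == 4**11 - 1   # 4,194,303: the ~4 million space
```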
And of course, working with an integer or a long is a lot easier than working with strings of text, because you can do comparisons and stuff like that, or just simple intersections. So as a computer scientist, I think this is just super cool. It's the way that I would have loved to have done many things.

That's awesome. So the database stores the integer representation of each k-mer, at a known offset per genome. And the other part that I didn't quite understand is that it has a reference database, I think, because it's going to tell you what the species is. So how does it have a list of different species and what those k-mers are?

So you basically download everything from RefSeq and build a database. And so we have some standard databases that I've built, as well as the one from the original publication. But you can do any species you want, or any collection of genomes you want, and the important thing is the metadata. So originally, it was all based on the NCBI taxonomy, which is very poor. So one thing I've done is to base it off GTDB, which is already using ANI to get rid of all the quirks. And my first step is actually to look at that giant GTDB spreadsheet, which does kind of a QC: what's got good CheckM scores, low contamination, good completeness, say, lower numbers of contigs. Who likes 600 contigs in a genome assembly? I don't. So you can threshold things there to do a bit of quality control. And they've sorted out species as well. So GTDB has already speciated things and put in a novel species, or a novel genus, where it thinks one should go. And that helps an awful lot to sort out all of the noise that's in NCBI. Because as you and I both well know, you can put in a genome and say it's anything, and it'll get accepted. And, you know, if you misclassify an E. coli as a Salmonella, that could throw things off, and it can throw things off for years. So yeah, you've got to be careful.

That's really cool. And then I'm going to zoom ahead from your earlier description into this description a little bit. So HDF5: that's been like a curiosity for me, but also like a headache at the same time. Like, how do you parse those things? It seems like you can only analyze things single-threaded with it. It just seems like a huge beast. So maybe this is a little bit critical, but why did you guys decide to go that way? What advantages does HDF5 give you? Should everybody be considering it?

So the analysis the software does can be parallelized, and it parallelizes very, very well. In terms of extracting data from a file, it works very quickly, so that's not the bottleneck at all. It is the comparison of those numbers that is the bottleneck, and the I/O. So yeah, I've seen when I've run it that if you give it, say, 32 threads, it will run pretty much maxing those out. So obviously the parallelization is very, very good for the bits that need parallelizing. And this is, I guess, an eternal question. My PhD was in heterogeneous supercomputing, so parallelizing stuff, and I know firsthand that you can waste a lot of time parallelizing stuff that doesn't need to be parallelized. In this case, I think the balance has been gotten right: the stuff that is very slow is what has been parallelized. Building databases, on the other hand, is actually really intensive.
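And here is roughly what those integer comparisons buy you: with two signatures stored as sorted arrays of unique k-mer codes, a Jaccard-style distance (the kind of quantitative closeness measure discussed earlier) falls out of a single intersection call. A sketch of the idea under those assumptions, not GAMBIT's implementation.

```python
import numpy as np

def jaccard_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a, b: sorted arrays of unique k-mer codes (one genome signature each)."""
    shared = np.intersect1d(a, b, assume_unique=True).size
    union = a.size + b.size - shared
    return 1.0 - shared / union

sig_a = np.array([5, 17, 900, 4123], dtype=np.uint32)
sig_b = np.array([17, 301, 4123, 90000], dtype=np.uint32)
print(jaccard_distance(sig_a, sig_b))  # 0.666..., i.e. 2 shared of 6 total
```

Comparing integers like this, rather than strings of nucleotides, is exactly why the comparison step stays cheap enough to be I/O-bound.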
And so, like, I've recently been trying to build a database of everything that's in RefSeq, you know, just without QC. And yeah, after like 10 days I had to stop it, and it was using a huge amount of RAM and all this. It was my bad, you know; I thought, oh yeah, I'll see what happens. And so I've had to come up with a different strategy, which is more incremental building. So, you know, do one species at a time and add it to the database, rather than doing one big bang. That gives the same results in the end, but it uses less memory and whatnot. Anyway, it's a nice little way of doing things. And that's what I like about bioinformatics, you know, where you can take the same concepts and same ideas, change them slightly, and actually get quite different and better results. And yeah, I think we should all be doing this for everything.

Yeah, I agree with that. All right, so just to cap this off: how do I get into GAMBIT? If I want to try it out, do you have like a demo somehow, or a slideshow? And what should I even try it on?

So there is a dedicated GitHub repository, called GAMBIT Suite, which has not just the GAMBIT software but all the extra programs built on top of it. And you can just download it and run it. There are databases available; you just download those. So basically you have a command, you have a database, which is two files you download, kind of like Kraken, and then you just give it some FASTA files and it just runs. And then it'll tell you what, say, the closest matches are, or what the distance is, that kind of stuff. And it's very fast. So I'd say just check it out. And there are papers as well. There are multiple papers.

Excellent. Because academics love papers.

That's true.

All right. Well, thank you so much for showing us GAMBIT. I came into this totally ignorant, and there's actually a huge universe here that I want to explore now. So I appreciate it.

Awesome. Thank you very much.

Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.