Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is the head of informatics at the Quadram Institute in Norwich, UK. Andrew is the director of technical innovation for Theiagen Genomics in Cambridge, UK. I am Dr. Lee Katz, and I am a senior bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.

Hey, welcome to the MicroBinfie podcast. Today, we have Andrew here. Nabil is off taming horses or something, I don't know; he's just taking a little vacation, I guess. We're going to talk about some software that Andrew was involved with, called GAMBIT. And he basically brought me into this saying, I'll riff on it, and I don't care if you don't know anything about it. And I'm just going to enjoy this. I'm going to ask a lot of questions, I guess. So what is GAMBIT? What have you brought me into?

Okay, so GAMBIT is a tool for typing bacteria, right? And eukaryotes, too. It uses k-mers under the hood, but it's done in a very clever way. I've come into this very late in the day. It was originally written by Jared Lumpe, sorry if I pronounce the name wrong, over the past few years, and now it's maintained by the Nevada State Public Health Laboratory and David Hess. So I've come into this late in the day basically to expand upon it, find new uses for it, build databases, and all this kind of jazz, you know, to keep it alive. Because with any software project, you need continuous development to keep it relevant and to keep it good.

So GAMBIT, right? It is absolutely awesome, and I'm a bit of a convert now. It does a targeted k-mer search. So you take k-mers with a very particular prefix, a prefix of the kind usually found at the beginning of a gene, right? And then you take another, say, 11 bases after that. So you have a fixed bit, like your anchor, and then you have the variable bit. And from there, you can actually build up a signature. And with bacteria, the really interesting thing is, if you have, say, Salmonella, which is about a five-megabase genome, you get about 10,000 k-mers out of that, because the number of k-mers you get roughly correlates with the size of the genome. And they're all different. And actually, when you start looking at these, you can very quickly boil a species down to a set of k-mers in a targeted fashion. You get a nice little set, and that defines the species. And what's really nice about it is that, unlike, say, Mash, where you randomly subsample, this is targeted, so you always get all the k-mers with this particular prefix. And they've gone and built databases, you know, say, from all of RefSeq, and then looked at the diameters. That is, you look at all the k-mers that occur in each genome, and then for a species you say, well, these k-mers are in this species. And that gives you a rough idea of how diverse or how compact that species is. And it also gives you a quantitative measure of how close you are to that species, or whether you're within that species.
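To make the targeted search concrete, here is a minimal Python sketch of the idea just described: scan for a fixed anchor prefix and collect the variable bases that follow it. The ATGAC prefix and 11-base suffix length are taken from the GAMBIT publication, but treat everything here (function names, the both-strands scan) as a reconstruction of the concept, not GAMBIT's actual code.

```python
PREFIX = "ATGAC"   # GAMBIT's published default prefix, assumed here
SUFFIX_LEN = 11    # the "variable bit" after the anchor

def revcomp(seq: str) -> str:
    """Reverse complement, ACGT only."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def targeted_kmers(genome: str) -> set[str]:
    """Collect every 11-base window that directly follows the prefix,
    on both strands, skipping windows with ambiguous bases."""
    kmers = set()
    for seq in (genome, revcomp(genome)):
        pos = seq.find(PREFIX)
        while pos != -1:
            start = pos + len(PREFIX)
            suffix = seq[start:start + SUFFIX_LEN]
            if len(suffix) == SUFFIX_LEN and set(suffix) <= set("ACGT"):
                kmers.add(suffix)
            pos = seq.find(PREFIX, pos + 1)
    return kmers

# A ~5 Mb Salmonella assembly yields on the order of 10,000 such k-mers.
print(targeted_kmers("TTATGACGGGTTAACCGGTTAGC"))  # {'GGGTTAACCGG'}
```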
And actually, if you were to graph it all out, they're quite diverse. So E. coli and Salmonella are quite far apart. And then they've gone one step further, which is to clinically validate this in the lab: actual validated pipelines. The problem with genome sequencing, not metagenomics, is that calling a species a species seems straightforward, right? But often people run Kraken or something like that, and it'll say, well, you got 80% Salmonella, you got 10% E. coli, you've got all these other things; you might get 2,000 species in there. And that's bioinformatics noise. But if you absolutely want to say, bang, bang, bang, this is a Salmonella, it's nothing else, I'm absolutely confident, and then have a value, a number, associated with how confident you are, that's where this comes in. So it's very, very accurate at calling a species a species. And if it can't, then it says, actually, I can only make a genus-level call, or it won't call it at all. So it's actually more conservative in species calls, which is what you need in public health. You absolutely want to be certain. And while Salmonella is easy, once you start getting into the harder parts of speciation... take TB and M. bovis, say: those are like 99% similar, you know? And calling those apart can be difficult bioinformatically, particularly if you're looking at reads. It can be, well, is it really, is it not? So it's actually very good at those edge cases as well.

And so, yeah, I'm a big convert. I love it. I've been using it quite a lot. I know the lab in Nevada and a few other labs have been using it as well. And within Theiagen, within our pipelines, the TheiaProk pipeline, which is kind of our standard bacterial processing pipeline, has been using it like a traffic cop, you know, to decide which sub-pipelines to run. Do you run SISTR or do you run ECTyper or whatever, depending on what species it is. And it's actually been validated properly in the lab, which is a very, very rare thing for bioinformatics. So I'm super impressed by this, and I've been doing a lot of work on it.

So to actually validate something in the lab for clinical use is a big freaking deal. I don't know if people know this, but at least in the US, to validate something for clinical use, it has to go through a regulatory body, at least on a federal level. I don't know about Nevada.

Wait, wait, hopefully I've said that right. Obviously I'm not an American. It's been validated for public health use.

Okay. So if it's validated for public health use and not for clinical use, that's different. And it's still a big deal.

It is a big deal, yeah.

What is CLIA, do you know? Because CLIA is clinical, and that's the big freaking deal.

Argh. I should have double-checked this before talking about it, which is a bit embarrassing. But it is super cool, and I've been doing more work on it recently, trying to use it in different ways. Actually, bioinformatically, it's really well architected. It stores all the k-mers in a database in HDF5 format, which is super, super compressed, and it's a very nice format for very large data sets.
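As an illustration of the storage idea, and the offset trick Andrew explains below, here is a toy h5py sketch: every genome's signature concatenated into one flat array, plus a small offsets table so any single genome's chunk can be sliced out without reading the rest of the file. This is a guess at the general shape of such a layout, not GAMBIT's actual on-disk schema.

```python
import numpy as np
import h5py

# One sorted array of k-mer codes per genome (toy values here).
signatures = [
    np.array([5, 17, 900, 4123], dtype=np.uint32),
    np.array([17, 301, 4123, 90000, 250000], dtype=np.uint32),
]

# Flatten everything into one dataset and record where each genome's
# chunk begins and ends, so one signature can be read without the rest.
bounds = np.cumsum([0] + [len(s) for s in signatures]).astype(np.uint64)
with h5py.File("signatures.h5", "w") as f:
    f.create_dataset("values", data=np.concatenate(signatures), compression="gzip")
    f.create_dataset("bounds", data=bounds)

# Random access: jump straight to genome i's chunk via its offsets.
with h5py.File("signatures.h5", "r") as f:
    i = 1
    lo, hi = f["bounds"][i], f["bounds"][i + 1]
    sig = f["values"][lo:hi]   # only this slice is read from disk
print(sig)                     # [    17    301   4123  90000 250000]
```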
It's hard to work with, but if you remember, the PacBio files used to be in HDF5, and Nanopore raw data files are in HDF5. So, you know, it is used, but it's usually used for the heavy-duty stuff. So at the core it's that, and then it has a database on top of that, MySQL, or was it Postgres? I know it has an SQL database for the metadata, to kind of link it all together. And that means that you can very rapidly get in and out. So you can actually process the data very quickly, because you can jump to different points within the files. And so it's super, super fast. And by fast, I mean really, really fast at typing stuff.

I've made some changes, or some enhancements. So I've got a GAMBIT tool suite, and that has things like, you know, being able to build a database from anything you download. It's been used for eukaryotes, too. For the Candida auris outbreak, it has been used to type the different Candida in that case. And of course, that's obviously an emerging pathogen of concern because of the multi-drug resistance. And yeah, it's just really, really cool. And then I've been working... oh, I've got some extra things that I've thrown in. Like, you can do kind-of pan-genomics and build core genomes, because you can just look at the set of common k-mers, and that works super well. So you can compress it all down, and you can have a very small database that really, you know, nails what something actually is. So anyway, sorry, this is more like an advertisement. I apologize, Lee, but I just think it's a really nice way to work with k-mers, you know, where it's targeted, and it's just so fast, and oh, it's memory efficient, and yeah.

All right, let's go into it. Let's go into it now, because you've put in a lot of technical stuff and a lot of computational stuff I'm kind of curious about. So I'm going to go a little bit further back. It sounds like it has a database of, what is it called, the k-mers with the SNP in the middle, kind of like kSNP? Is that right? How did you get this database, and what is it made of? Did I understand what the k-mers are?

What you do is you take your FASTA file, your genome sequence. You look for a target prefix, so that's, say, five bases that you're looking for, and then you take the next 11 bases after that. Okay? And you store those bases in a database. Actually, what you do is you convert them to numbers. So you have a space of numbers from 1 to, I don't know, let's say 4 million, right? Each number representing a different k-mer. It's basically the binary representation of that suffix, and that's what gets put into the database. So then you're just storing a lot of numbers, and you store them in chunks. So for each genome, you store, say there's 100, you store 100 for one genome and 100 for the next, and then it's just the offset that is stored in a database separately. And so you can then jump straight into the database at exactly the right point. It might say, start at an offset of 100 and read the next 200, and so you can extract that array very, very fast. So you can actually do these operations, like, super quick, and you're just pulling numbers out. You don't have to work with a representation like nucleotides; you're just working with numbers.
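Here is what that conversion to numbers looks like in a minimal sketch: two bits per base packs an 11-base suffix into an integer between 0 and 4^11 - 1 = 4,194,303, the "about 4 million"-sized space mentioned above. GAMBIT's exact bit layout may differ; this just demonstrates the arithmetic.

```python
# Two bits per base packs an 11-base suffix into an integer in [0, 4**11).
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def kmer_to_int(kmer: str) -> int:
    """Encode a k-mer as an integer, two bits per base."""
    code = 0
    for base in kmer:
        code = (code << 2) | BASE_CODE[base]
    return code

def int_to_kmer(code: int, k: int = 11) -> str:
    """Decode the integer back to its k-mer string."""
    bases = "ACGT"
    return "".join(bases[(code >> (2 * (k - 1 - i))) & 3] for i in range(k))

assert int_to_kmer(kmer_to_int("ACGTACGTACG")) == "ACGTACGTACG"
assert kmer_to_int("T" * 11) == 4**11 - 1   # 4,194,303: the ~4 million space
```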
And of course, working with an integer or a long is a lot easier than working with strings of text, because you can do comparisons and stuff like that, or just simple intersections. So as a computer scientist, I think this is just super cool. It's the way that I would have loved to have done many things.

That's awesome. So the database stores the integer representation of each k-mer, at a known offset per genome. And the other part that I didn't quite understand is that it has a reference database, I think, because it's going to tell you what the species is. So how does it have a list of different species and what those k-mers are?

So you basically download everything from RefSeq and build a database. And so we have some standard databases that I've built, as well as the one from the original publication. But you can do any species you want, or any collection of genomes you want, and the important thing is the metadata. So originally, it was all based on the NCBI taxonomy, which is very poor. So one thing I've done is to base it off GTDB, which is already using ANI to get rid of all the quirks. And my first step is actually to look at that giant GTDB spreadsheet, which does kind of a QC: what's got good CheckM scores, low contamination, good completeness, say, lower numbers of contigs. Who likes 600 contigs in a genome assembly? I don't. So you can threshold things there to do a bit of quality control. And they've sorted out species as well. So GTDB has already speciated things and put in a novel species, or a novel genus, where it thinks one should go. And that helps an awful lot to sort out all of the noise that's in NCBI. Because as you and I both well know, you can put in a genome and say it's anything, and it'll get accepted. And, you know, if you misclassify an E. coli as a Salmonella, that could throw things off, and it can throw things off for years. So yeah, you've got to be careful.

That's really cool. And then I'm going to zoom ahead from your earlier description into this description a little bit. So HDF5: that's been like a curiosity for me, but also like a headache at the same time. Like, how do you parse those things? It seems like you can only analyze things single-threaded with it. It just seems like a huge beast. So maybe this is a little bit critical, but why did you guys decide to go that way? What advantages does HDF5 give you? Should everybody be considering it?

So the analysis the software does can be parallelized, and it parallelizes very, very well. In terms of extracting data from a file, it works very quickly, so that's not the bottleneck at all. It is the comparison of those numbers that is the bottleneck, and the I/O. So yeah, I've seen when I've run it that if you give it, say, 32 threads, it will run pretty much maxing those out. So obviously the parallelization is very, very good for the bits that need parallelizing. And this is, I guess, an eternal question. My PhD was in heterogeneous supercomputing, so parallelizing stuff, and I know firsthand that you can waste a lot of time parallelizing stuff that doesn't need to be parallelized. In this case, I think the balance has been gotten right: the stuff that is very slow is what has been parallelized. Building databases, on the other hand, is actually really intensive.
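And here is roughly what those integer comparisons buy you: with two signatures stored as sorted arrays of unique k-mer codes, a Jaccard-style distance (the kind of quantitative closeness measure discussed earlier) falls out of a single intersection call. A sketch of the idea under those assumptions, not GAMBIT's implementation.

```python
import numpy as np

def jaccard_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a, b: sorted arrays of unique k-mer codes (one genome signature each)."""
    shared = np.intersect1d(a, b, assume_unique=True).size
    union = a.size + b.size - shared
    return 1.0 - shared / union

sig_a = np.array([5, 17, 900, 4123], dtype=np.uint32)
sig_b = np.array([17, 301, 4123, 90000], dtype=np.uint32)
print(jaccard_distance(sig_a, sig_b))  # 0.666..., i.e. 2 shared of 6 total
```

Comparing integers like this, rather than strings of nucleotides, is exactly why the comparison step stays cheap enough to be I/O-bound.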
And so, like, I've recently been trying to build a database of everything that's in RefSeq, you know, just without QC. And yeah, after like 10 days I had to stop it, and it was using a huge amount of RAM and all this. It was my bad, you know; I thought, oh yeah, I'll see what happens. And so I've had to come up with a different strategy, which is more incremental building. So, you know, do one species at a time and add it to the database, rather than doing one big bang. That gives the same results in the end, but it uses less memory and whatnot. Anyway, it's a nice little way of doing things. And that's what I like about bioinformatics, you know, where you can take the same concepts and same ideas, change them slightly, and actually get quite different and better results. And yeah, I think we should all be doing this for everything.

Yeah, I agree with that. All right, so just to cap this off: how do I get into GAMBIT? If I want to try it out, do you have like a demo somehow, or a slideshow? And what should I even try it on?

So there is a dedicated GitHub repository, called GAMBIT Suite, which has not just the GAMBIT software but all the extra programs built on top of it. And you can just download it and run it. There are databases available; you just download those. So basically you have a command, you have a database, which is two files you download, kind of like Kraken, and then you just give it some FASTA files and it just runs. And then it'll tell you what, say, the closest matches are, or what the distance is, that kind of stuff. And it's very fast. So I'd say just check it out. And there are papers as well. There are multiple papers.

Excellent. Because academics love papers.

That's true.

All right. Well, thank you so much for showing us GAMBIT. I came into this totally ignorant, and there's actually a huge universe here that I want to explore now. So I appreciate it.

Awesome. Thank you very much.

Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.