Hello and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil Alikhan and Professor Andrew Page. Nabil is the head of informatics at the Quadram Institute in Norwich, UK. Andrew is the director of technical innovation for Theiagen in Cambridge, UK. I am Dr. Lee Katz, and I am a senior bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.

Hey, welcome back. We're talking with Titus Brown, and we have a lot to talk about. We wanted to get a little bit more into SourMash. Andrew, I think you had a question about it.

Yeah, so SourMash is based on references, but there are a lot of very poor quality references out there. There are even some we've called out that are like 10 megabases. So I was wondering, how do you get around that? What is your strategy for avoiding all the crud out there in the real world?

Yeah, really good question. So I have declared that that is not my problem. And there's actually a lot of...

You can't just say it's not your problem.

But I have tenure.

This is Michael Scott declaring bankruptcy.

So my first very real answer is: thank God for GTDB. We have been doing a lot of investigation recently into taxonomy, for example, and before that, large-scale genome databases. One of the things that SourMash lets us do is search all of GenBank, or all of GenBank microbial, which is now up to around 1.3 million bacterial genomes. And one of the things we wanted to do with SourMash was make use of all available reference sequences, because I think that's really important for bioinformaticians. One of my objections to most... well, objection is maybe a strong word. One of the problems I think is really endemic to microbiome bioinformatics is that most software packages do some sort of curated subselection of the available genomes. And that curated subselection, I'm going to be sort of mean here, is usually decided by what their software can handle. And I thought, well, I'd rather have it be decided by what somebody wants to include, whoever that may be.

So even with GTDB, I've noticed in the spreadsheets they provide that they run CheckM. Do you do any filtering at that point to get rid of the stuff that's really obviously very poor quality, or do you just take...

So GTDB is very transparent. We just take the GTDB databases that they release, and they have very transparent quality metrics. From what I can tell, they do an excellent job of mediating between wanting as broad a representation of the bacterial and archaeal tree of life as possible and not wanting to include too many bad sequences. Most of our knowledge of about a dozen different phyla comes entirely from MAGs, metagenome-assembled genomes, and some of those just aren't going to be very good quality, but it's important to have them in the database so that you can do a good job of analyzing things. So we just take the GTDB database as it is.
We do have a side effort, and it's been on the back burner for a couple of years, where we are finding problems in GenBank genomes and GTDB genomes specifically, using a piece of software called Charcoal, which takes advantage of SourMash to look for contigs that are very similar between very widely distant taxonomic units. So yes, there's some contamination and some messiness in there. I will say, and this is going to be a little bit of an advertisement, a little bit of a teaser, that SourMash itself seems to be relatively robust to that kind of contamination in terms of taxonomic assignment, for reasons that I would be happy to get into. But basically, our design principle for SourMash is that whatever databases people make available, we should be able to search. And if that database is bad, that's a problem for the people doing the searches; it's not our problem in SourMash. We're trying to follow the, what's that, the small tools manifesto: each tool should do one thing and one thing well. SourMash searches things well, and if you give it the wrong database, it's going to search the wrong database just fine. So I love that GTDB is around and is doing a good, systematic job of curating GenBank. I love that GenBank is around, because then you can go find things that may be bad and decide for yourself. I don't decide whether it's too bad to report within SourMash. So that's sort of my attitude towards the crud, if that makes sense.

Right, okay. Because I take the complete opposite approach, where it's like, kill all the contamination as quickly as possible. You know, if an E. coli has 3,000 contigs, it's probably very, very poor quality, and all that. But yeah, it's kind of cool that you just deal with it.

Well, so go ahead, Lee.

I feel like we're on a different level, me and Andrew, because we're looking at the supplied stuff, and I agree, I'm taking Andrew's approach: I need to get rid of all the crud. So, for example, I'll have a SNP pipeline and I need to get rid of all the SNPs that might be noise, even if some of them are real, or I need to get rid of some of the k-mers, even if some of them are real, just because I need to make sure I have the highest fidelity phylogeny. But we're on a different level. I feel like you're at the base level, where you're describing the SourMash library, and you want to make sure that it's as versatile as possible and that it does that one thing well.

Right, right. So there is another philosophy to bring into this. I'm bringing in a lot of philosophical statements today, I guess. I was trained by physicists, and there's sort of a physics-style approach, I'm going to call it a physics-style approach, I don't actually know if it is, but if you look at what CERN or the various particle physics colliders do, they generate masses of data, and then they go through a multi-stage triage pipeline where they basically say: okay, this is clearly noise, we're going to get rid of that; and this next bit is signal we already know how to understand, and we're not interested in stuff we already understand. And so you go through these multiple stages of filtering. And a principle that my group has brought to SourMash is that once you get rid of something, you can never get it back.
So our job for that first search is: you have the biggest, messiest, noisiest database ever, you have the biggest, messiest, noisiest data ever, and we're going to give you all of that. And then if you want to filter after that, we'll make that easy as well. So with SourMash, you're never locked into a particular database. You're never locked into the public databases that we have made available for download, as though you could only use those. You can always have your own curated versions of things. We've actually made it very easy to use our databases and then sub-select just the things you want from them. There are always ways to pull out all of the matches, not just the refined ones we give you. Basically, we want to give you that flexibility, and we don't want to make decisions for you. If you want to make decisions about what to include in your matches, we'll help you with that, but it's not our responsibility to tell you what's good or not. And part of the reason for that, I mean, I think we're also robust in the face of certain kinds of noise, but part of the reason is that that messy E. coli with the 3,000 contigs might actually really matter and might be that hot lead that you dive into: this can't possibly be right, what's going on? And then you discover that, yeah, it's not right, but parts of it are interesting, and then you can take that and run with it. And if we had made the decision that that's a bad E. coli to begin with, then you would be denied that opportunity.

I guess Lee and myself work more at the public health end, which is much nearer to making decisions. You need to have more confidence in what you are actually presenting to people at the end of the day than, I guess, the big bag of everything where all possible results are on the table. Whereas when we're working, it's much more like: we're absolutely confident this is a Salmonella, and it's in this outbreak, and this is kind of what it looks like. What do you think, Lee?

Yeah, I think I agree. And we need this situation, where someone like Titus is developing that base-level tool that follows the Unix manifesto and does that one thing well. Where would we be if I was the person who developed SAMtools? It would be shit. We need someone who makes the tool so that we can develop around it. We need that.

Well, thanks, Lee. I like that perspective. So there's a tendentious topic that I would like to bring up, which is that most of our tools for taxonomic assignment of bacteria in metagenomes are quite terrible. You can see that with benchmarking, right? If you do benchmarking just at the species level, most tools are either terrible at sensitivity or terrible at specificity, and there are only like three or four tools that sit in that golden spot of being specific enough that you can sort of trust the results and sensitive enough that you're not throwing away most of the data. SourMash, obviously, since I'm the one saying it, is one of them, somewhat surprisingly, because we didn't develop it for taxonomy purposes in the beginning. mOTUs is another, and then I think MEGAN, and there's one more that I'm missing, another alignment-based one, that are pretty good. So I guess what I'm saying is I don't think most of the public tools out there get you where you want to go in the first place.
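[Editor's note: below is a minimal, hypothetical Python sketch of the "search everything, then filter afterwards" idea Titus describes above, using plain k-mer sets and a containment score. It is illustrative only and is not the SourMash implementation or API; every name in it is made up for this example.]

def kmers(seq, k=31):
    # Toy stand-in for a real sketch: the set of k-mers in a sequence.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def containment(query_kmers, ref_kmers):
    # Fraction of the query's k-mers found in the reference.
    if not query_kmers:
        return 0.0
    return len(query_kmers & ref_kmers) / len(query_kmers)

def search_all(query_kmers, database):
    # Search whatever database you were given, however messy, and report
    # every non-zero match; no quality judgment is made at this stage.
    hits = [(name, containment(query_kmers, ref)) for name, ref in database.items()]
    return sorted((h for h in hits if h[1] > 0), key=lambda h: -h[1])

def filter_hits(hits, min_containment=0.1):
    # Filtering is a separate, later decision that the user makes; nothing
    # is discarded before they get a chance to look at it.
    return [h for h in hits if h[1] >= min_containment]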
So I'm puzzled by, not puzzled, I'm sort of raising this question: if we've been so worried about the end product being high quality, how did we end up in a situation where most of the tools out there aren't actually meeting your requirements in the first place?

Because a lot of them are funded by three-year grants, and the postdoc moves on, and then the software just kind of disappears. And then we reinvent the wheel by starting from scratch because we think it's an easy problem, and so on and so forth.

Perfect. Yeah, okay. I mean, we sort of know why, but I guess I would posit, and I don't want this to be the pull quote or the BuzzFeed headline, but bioinformatics is terrible. Everybody knows bioinformatics is terrible. We're jury-rigging together piles of software that weren't really designed to work with each other, that use terrible interchange formats, that have unknown specificity and sensitivity. Ground truth is often unknown. And then we're slapping a web interface on top of that and going, well, I guess we're done here. It's not a good situation that we're in.

Yeah, sorry, go ahead. I would go ahead and say that it's also like that xkcd about dependency management, where there's that one person who's been maintaining that one piece of software since whenever ago. And I would say again that we need the people at the base level developing things like SourMash or SAMtools. We need that, and we need to build on a very solid foundation. And, I'm going to use your slang there, we've been jury-rigging a bunch of stuff together, but it's not going to work unless we keep going back to the foundation and keep making that good. But the reality is, we're in public health and we need answers now. We're going to keep addressing that, so we're going to jury-rig things together as best we can to get to the public health solutions that we can.

Yep. I'll say a couple of things. I'm very proud of SourMash, because I put a lot of time and effort and love into it, as have a lot of people in my group and elsewhere. And there is a chance, just like with khmer, that tomorrow I will wake up and say, you know what, I'm going to go work on something else. I wrote a blog post about this, I think for khmer, more like ten or so years ago, where I asked, how should I think about sunsetting software? And for me there's another way I think about this, which is that I don't actually care if people use SourMash particularly. It's great, it's foundational to my own research program, I have multiple grad students in the lab doing cool biology with SourMash and I'm helping support them. But for me, the important bit of SourMash is, and this is going to sound really pretentious, and I mean it for all software, not just SourMash: I wanted to raise the bar.

I love when you preface things like that. Sorry, go ahead.

I wanted to raise the bar, right? Whatever replaces SourMash, I want it to be no worse than SourMash. I don't want a situation where things that we've learned in implementing SourMash are forgotten because we didn't write them down.
We didn't publish them. We didn't talk about them. And so someone comes along and says, you know what, I'm a grad student on a three-year grant and I can do this better than Titus did. They probably can. And then they do do it 80% better, but then they mess up and it's 20% worse in some particular way. That would be a bad way for science to go. So for me, the things that I'm actually proudest of in SourMash are things we've learned, and that's where my publication efforts go. Like FracMinHash: pretty cool. If you don't mind throwing away 99.9% of your data, it works really well for some things. Min-set-cov, I haven't really talked about that, but we have a really nice way of taking extremely redundant sets of matches to the microbial genomes and distilling them down to the shortest list of matches that covers all of the known matches. It actually turns out it's an algorithm that they teach in first-year grad studies, and probably third-year CS undergrad, and we sort of reinvented it by hook or by crook because it solved a particular problem. Then I presented it to computer scientists and they were like, yeah, that's great, that's been known for 50 years, good work putting it into your software. But it works really well, and so we've learned that. And then something else I've learned, and this is another nice teaser and maybe a pull quote for your BuzzFeed-style advertisements for the podcast: I know how to write a virtually perfect taxonomic classifier that will classify at the species level with a hundred percent accuracy. We've gotten pretty close with SourMash. If I cared about solving that problem, we could write something that's perfect, because we understand the problem now. I'm not convinced it's the right problem to solve. If you guys are in public health, my guess is you don't really care about species level that much; you care more about strain level. You're disagreeing with me heavily. Go ahead.

Species can be quite hard depending on the species. If you get into Mycobacterium tuberculosis, for example, that's a really hard problem, because historically people have classified species based on what they've seen, not what it actually is genomically.

Sure. Sorry, let me rephrase my quote: given a taxonomy and a reference database, the taxonomy may be wrong and the reference database may be bad, but given that taxonomy and that reference database, we can write a perfect species-level taxonomic classifier.

I'd argue it'd be difficult, and that's fine, because I like this theme where we can discuss things at the jury-rigged public health level versus the ground level of computer science. So how confident are you? Would you stand up in a court of law and say, absolutely?

I would hedge it a little bit. So we've done the analysis, and I would hedge it and say there are, I think, about a dozen genomes, and I can give you their names, where based on k-mers alone they are indistinguishable from each other. But as long as you can find me one k-mer, or sorry, any combination of k-mers, that uniquely distinguishes every member of a species from every other genome out there, we can do that taxonomic assignment with extremely high accuracy.
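[Editor's note: for readers unfamiliar with the two ideas mentioned here, below is a minimal, hypothetical Python sketch of FracMinHash-style downsampling (keep only the k-mer hashes that fall below a cutoff set by a "scaled" factor) and of a greedy minimum set cover over redundant matches. It is illustrative only, not the SourMash implementation; the hash function and all names are assumptions made for this example.]

import hashlib

MAX_HASH = 2 ** 64

def hash_kmer(kmer):
    # Toy stand-in for a real 64-bit k-mer hash; not what SourMash actually uses.
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

def frac_minhash(kmers, scaled=1000):
    # Keep roughly 1/scaled of all k-mer hashes: those below MAX_HASH / scaled.
    cutoff = MAX_HASH // scaled
    return {h for h in map(hash_kmer, kmers) if h < cutoff}

def greedy_min_set_cov(query_hashes, ref_sketches):
    # Greedy minimum set cover: repeatedly pick the reference that explains the
    # most still-unexplained query hashes, until nothing more can be explained.
    remaining = set(query_hashes)
    cover = []
    while remaining and ref_sketches:
        name, sketch = max(ref_sketches.items(), key=lambda kv: len(remaining & kv[1]))
        gained = remaining & sketch
        if not gained:
            break
        cover.append((name, len(gained)))
        remaining -= gained
    return cover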
You need to write that up if you haven't already.

Yeah, so we're close to it, and I just want to shout out to Leighton Pritchard here. I was gassing on about this in one of our collaboration meetings, and I said, look, it's something about combinations of k-mers, this, that, the other thing. And he said, oh, you should look up this cryptography term called unicity. The unicity distance of a cryptographic message is the shortest set of symbols that allows you to uniquely decode that message. And you can apply it to genomes: it's the smallest number of k-mers that uniquely distinguishes this genome from any others. It won't be a big surprise to you that the unicity distance for most genomes in GTDB is one. There's almost always one k-mer that distinguishes that genome from everything else. And I forget the number, but at the genus level there are only a few genomes that can't be uniquely distinguished from each other by k-mers. And underneath, it turns out that what SourMash is doing is implicitly calculating this unicity distance and using it to track down genomes.

That is actually really cool. And it is really cool that there's so much out there that we can do and use to make stuff even better. So what I would love to see in the very near future is an amazing way of just calling a species a species with 100% accuracy, because that is really needed for public health. It sounds so easy, but a lot of algorithms just work by taking the top hit out of a list and saying, more or less, we think this is tuberculosis, or it's not, or this is Mycobacterium bovis, and we're 99.99% sure, but we're not 100% sure. So maybe it's something we can have in the near future. Definitely.

This is actually a paper we're going to publish soon, too, on just how much we care about species, and that's under review. Not to talk about our lab too much, but it's an incredible problem. I've come across species that our reference lab has told me differ from another species by one gene, or there are different lineages of Listeria where, okay, we're all calling them Listeria monocytogenes, but really they act as though they're four different species, and we have to call them all the same species because they all have to be treated equally under the regulatory framework. So I would just love to see if that could be maneuvered with k-mers somehow. There are so many ins and outs. That's why I think you probably do need to hedge your bets on that, but it'd be interesting to see how close you could get to it. And I hope you're taking notes here, because this is our shopping list, you know?

Yeah, well, I'll just say there are some limits, and I think the challenge that I've had in writing this up is finding a way to talk about the limits. I understand them intuitively, I even understand them in a mathematical formalism, but conveying that to people so that they know the difference between what the computers can do and what biology is saying is, what's the right way to put this, an ongoing challenge. Biology is hard.

Yeah, biology is really hard. Who would have known?
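[Editor's note: a small, hypothetical Python sketch of the unicity idea discussed above, the smallest number of k-mers that uniquely distinguishes one genome from all the others. It only checks the simple case mentioned in the conversation, whether a single globally unique k-mer exists per genome; all names are illustrative, and this is not how SourMash computes anything.]

from collections import Counter

def kmers(seq, k=31):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def unique_kmer_counts(genomes, k=31):
    # For each genome, count k-mers seen in no other genome. A count of one or
    # more means a single k-mer suffices to pin that genome down, i.e. its
    # "unicity distance", in the sense used above, is one.
    sketches = {name: kmers(seq, k) for name, seq in genomes.items()}
    occurrences = Counter(km for s in sketches.values() for km in s)
    return {name: sum(1 for km in s if occurrences[km] == 1)
            for name, s in sketches.items()}

# Genomes with a count of zero are the ones that cannot be distinguished by any
# single k-mer and would need a combination of k-mers instead.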
All right, that's all we have time for today, but thank you so much, Titus, for having a chat with us again about all these really interesting things and about all the future stuff you're going to do. And we will be back next time. Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter @MicroBinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.