Hello and thank you for listening to the MicroBinfie podcast. Here, we will be
discussing topics in microbial bioinformatics. We hope that we can give you some
insights, tips, and tricks along the way. There is so much information we all
know from working in the field, but nobody really writes it down. There's no
manual, and it's assumed you'll pick it up. We hope to fill in a few of these
gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil
is the head of informatics at the Quadram Institute in Norwich, UK. Andrew is
the director of technical innovation for Theiagen in Cambridge, UK. I am Dr. Lee
Katz, and I am a senior bioinformatician at the Centers for Disease Control and
Prevention in Atlanta in the United States.

All right. So hey, we're here with
Titus Brown. Andrew and I are your hosts today, and we wanted to get a little
bit into SourMash. We were just having an offline conversation about just how
well SourMash scales, especially in the context of metagenomics. I don't know
if you want to give a little intro on what SourMash actually is. I think we've
kind of skirted around it on the previous one. Sure, yeah, yeah, yeah. So
SourMash
itself is a command line Python package that sits on top of a Rust library and
does very fast lossy compression of k-mer datasets with the goal of enabling
overlap analysis between large collections of k-mers. So you can go grab it from
Conda, or you can do a pip install, and it will let you do things like take two
samples, two DNA or RNA samples, sketch them into much smaller datasets, and
then determine things like the Jaccard similarity of the two datasets and the
overlap between the datasets in either direction. It also lets you do things
like search very large databases of these sketches for matches. So that's like
80% of what SourMash does: it sketches things and lets you compare them to each
other in a variety of ways. And the coolest thing
about it, one of the two coolest things about it is that it scales extremely
well. So we typically, with the default parameters, we typically compress
sequences by around a factor of a thousand. So you can take an E. coli genome
that's five megabases in size, and you end up representing that with 5,000
hashes that are sitting in a cute little JSON file on disk. So, you know, in the
neighborhood of 30 KB, and that lets you do things like find similar genomes,
discover whether that E. coli is in a metagenome, and, basically with those two
things, do taxonomic assignments, that kind of stuff. Yeah, so I originally came
across you just mimicking MASH with a library, but it sounds like with the 5,000
hashes, you might have a few default parameters that are different at this
point. Is that what I'm picking up on? Well, we use a
different sketching technique. So to some extent we use a compatible sketching
technique, but rather than retaining a fixed number of hashes like MASH does,
where MASH says, you pick how many hashes you want, and that's how many we're
going to extract, and that's going to be forevermore what you use for
comparisons, we extract hashes at what I would call a fixed sampling rate. So if
you have a three gigabase genome, say human, and you use our standard technique
for downsampling, which is this FracMinHash approach, you would end up with
about 3 million hashes, whereas MASH would still give you whatever, you know,
500 or 1,000. The downside of this, of course, for SourMash is that your
sketches get bigger as your data sets get bigger. So for
metagenomes, if you have a very large metagenome, you're going to get a very
large sketch for it. But the upsides are that you can then do containment
analysis. So you can say, oh, for this metagenome, what genomes are in this
metagenome? And you can answer that question with just the sketches. You don't
need to go back to the raw data, which is how most of the other techniques,
including the MASH screen technique, work. So SourMash is really focused on
metagenomics and containment and overlap analyses, rather than on Jaccard
similarity comparisons. Okay, incredible. So my personal story with SourMash is,
I thought it would be great for a genomic epidemiology platform. And I started
developing this with my student at UGA, at the University of Georgia. And I
think this is a compliment to you, that your code is so elegant, we figured out
how to code it up, make a nice platform where we could query anything in under a
thousand lines of code. And that number is important, because we submitted it to
JOSS. Oh, no. And they desk rejected it. Oh, no. Because it was under a thousand
lines of code. And so we've been spending the last half a year just adding in
more features, so we could have more code in there. That says something good
about SourMash. Thank you. And, you know, I'm worried about what it says about
JOSS. A little bit. We'll have a better paper
for it. You know, at the end of every review process, it's painful and then you
have a better paper. So I think there's that. But then I think it's definitely
an arbitrary number. Wow. I guess it's to avoid the, you know, the 20 line bash
scripts, you know, which I'm sure they've gotten. Yeah. So I serve on the, I
guess I'm a guest editor still. I don't know if I've officially joined the
editorial board for pyOpenSci, which is sort of focused on Python and open
science. And they have a review process that sits on top of JOSS's. So the idea
is that they will review your package with their own standards for coding and
tests and stuff. And then as part of that, you'll get an expedited review
process for JOSS. And I will say that consistent topics of discussion on the
pyOpenSci editorial channel are, is this enough? Is this big enough? Is this
different enough? And I'll just say, you know, my philosophy, which I have
espoused openly on the editorial board, is we shouldn't be worried about if it's
different enough. We should maybe worry about big enough, yes. But the
difference is in the eye of the beholder. And, you know, if there's a lot of
diversity out there, we should let the environment select rather than
prejudging. But it would add to the review burden. So I don't know. I take a
very microbiome approach to this, right? Everything is everywhere, and the
environment selects. Yeah. Yeah. This is
like your software. There might be noise out there, but you know, let, it should
be robust enough. Yeah. But so, since I led you to this topic of scaling, I
want to tell you about something that we're kind of absurdly excited about,
despite the fact that we don't have that many great use cases for it. And that
is this thing called Branchwater. And if you go to
branchwater.sourmash.bio, I'm hoping this makes your heads explode in a nice
way. This is a real-time search for a genome within all of the SRA metagenomes.
So go to the examples and click on, this is my favorite search here. Click on
the bottom one, SAR11. And now click submit. And if you'd count out loud, did
you click submit? Okay. And if you could just count out loud. 1, 2, 3, 4, 5, 6,
7, 8, 9, 10. Did you actually click submit? Oh, there we are. Let me try it
again. Scroll down, scroll down. All right. Oh, let's try that again. All right.
There we are. Okay. So now you can count. 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
11. Okay. So what we just did was we did a search for SAR11, which is a
ubiquitous marine microbe. And it returned 9,000 accession IDs. Those are 9,000
different SRA metagenomes that contain SAR11. But it gets better. Scroll down.
Oh, okay. And you can see a map of all of the samples where this genome is
found. And SAR11 is everywhere where there's a marine. Everywhere marine. So
basically what this map is, is it's a map of everywhere people have done ocean
sampling and deposited the results in the sequence read archive. Oh, this is
incredible. Oh. I guess for me, is it in any human gut microbiomes or anything
of that? Like, could you see it as contamination? So can you scroll up a little
bit? I don't remember if in this one we have done a way to filter. Go over to
organism on the right. At the top of the spreadsheet. So you can see activated
sludge. We don't have a good way to do. Can you do a keyword search for gut? I
don't know. There. So it's in some bovine gut metagenomes. And because often
you'll see. Do a search for human. Do a search for human. Often human will be,
or often contamination will be like marine and soil bacteria that happen to be
in sterile body sites and stuff like that. Well, it's in one gut and one oral
metagenome. Cool. Wow. So there's a funny story behind this and we're going to
come back to Luiz again. So Luiz Irber, my co-conspirator on SourMash, who, you
know, watched his advisor's GitHub, saw me contributing stuff to this
repository called SourMash, and then was like, ooh, this could use better code
and more tests and, you know, continuous integration and a release process.
And he just started adding pull requests and, you know, better data structures,
just started adding code. SourMash became a core part of his thesis. In fact,
his thesis is basically all about SourMash and the data structures in
SourMash, and then how to monitor GenBank and automatically construct
signature sketches, SourMash sketches, for every new sequence that came into
GenBank, and then how to search all of those things. And so the story behind
Branchwater, this website that I'm
showing you, is that the underlying code was written because, and this is my
interpretation, Luiz may disagree, Luiz didn't want to write his thesis. So
Luiz has a very sneaky way of convincing me to do things. He had convinced me
that we should swap out the C++ extension library under SourMash for Rust. And
he basically said, look, Rust is just a better language and it'll let us do
multi-threading very easily. And I was like, great, because the biggest
problem in SourMash was that we didn't do multi-threading. We still don't do
multi-threading; I'll maybe tell you more about that in a second. But I was
like, well, I don't know. My biggest worry is I don't know
Rust, or I didn't know Rust and I didn't want to have to learn it in order to
maintain SourMash. And something I've learned over the years is that everybody
leaves the lab but me. So I'm the maintainer of last resort. So anyway, what he
did was he put in I don't know how many weeks or months of work. He wrote a pull
request that swapped out the C++ layer for a Rust library. And then he's like,
hey, it's all working. And I was like, okay. He's like, can I push the button? I
was like, I guess. And so we pushed the button and swapped it out, and between
SourMash 2 and SourMash 3 the underlying extension library was switched over to
Rust. And we had great tests on the Python side. I love writing automated tests
for things. So
all of the tests passed, everything worked great. And then Luiz was like, well,
I don't want to write my thesis. I want to play more with Rust. So what he did
was he wrote a multi-threaded Rust front end to SourMash called Branchwater. And
his use case for this was, hey, I've been sketching all of the metagenomes in
the sequence read archive for a while now. We're up to like 500,000 at this
point. And I can't search them using SourMash because it's too slow and single-
threaded. So I'm going to write a Rust front end that lets me use all of the
underlying ideas in SourMash, but just isn't as easy to use because it's a Rust
command line front end rather than a Python, you know, a fully developed Python
front end. But it will be blazingly fast because Rust makes multi-threading more
or less trivial. So he wrote something that would not have passed JOSS's review
criteria. It was under a thousand lines of code. I think it's like 200 lines of
code. He wrote something called Branchwater, and it was basically SourMash, but
multi-threaded. And we could scale it up to something like between 30 and 40
threads before it overwhelmed our file system. And he developed this software
that let us search all of the 500,000 sequence read archive metagenomes in about
18 hours on one of our compute nodes. Wow. And then
we were like, hey, what can we use this for? What's a use case? And we didn't
have one. And over the next couple of years, we found two. There was one use
case that a guy named Adrian Viehweger, he's an MD and bioinformatician in
Germany, came up with, which was source tracking. They had an outbreak of, I
think it was Klebsiella, in their hospital. And he wanted to not just do source
tracking within the hospital, but across all of the public genomes and
metagenomes. So he used Branchwater to search all of the metagenomes, found one
match to the particular strain, along with a bunch of other matches in the
genome database, and then they published that, and they tracked the origin of
the outbreak to Greece somewhere. And that's where it fell apart because they
didn't have enough high density sampling in Greece. And then another
collaborator here at UC Davis found another use
case, which was doing biogeography on interesting genomes. So my friend and
collaborator, Dawn Sumner, and her then postdoc, Christy Grettenberger, and a
former student of mine named Jessica Lumian basically got hit with COVID. Not
them, the lab got hit with COVID. They stopped doing experimental work and they
needed something to do. And they decided, well, we have these five Antarctic
cyanobacterial MAGs that we've isolated from our Antarctic mat, you know, lake
work, and we want to know where they are. This will be a great COVID exercise,
because we can't do experiments, so we're going to do some bioinformatics. And
so they did a search, again using this Branchwater tool, across all of the
metagenomes and they found that, you know, long story short, this paper is
available on bioRxiv and will be published hopefully soon. They found that some
of the Antarctic cyanobacteria were very limited to cold environments. Others
were cosmopolitan, and yet others were very specific to extreme environments,
not just cold. But they could do this.
They could take five MAGs and they could search all of the sequence read
archive metagenomes and then dig into what the matches were, and so on and so
forth. And that's a really nice validation paper for the Branchwater stuff.
It's also pretty exciting scientifically, but I of course was excited that it
showed that our matches were legit. So that's where we were as of like two or
three years ago. And then Luiz graduated and kept on working on SourMash in his
spare time. And I had a postdoc, Tessa Pierce-Ward, who started to get very
interested in this, and through a combination of things, Luiz was invited to
give a talk at the Joint Genome Institute and decided that 18 hours was too
slow. So he developed an inverted index for the
entire sequence read archive. That's what you're seeing right here, where you
can do essentially real-time search using a RocksDB-based inverted index of the
hashes across now a million sequence read archive metagenomes. And Tessa, along
with Suzanne Fleishman and Adam Rivers at the USDA, put together a web
interface on top of that. That's what we are using right here, which lets you do
the search and then get real-time visualization of the GPS coordinates for
where these things showed up in metagenomes. So we're really excited
about this. This is sort of petabase scale search. We're searching something in
the neighborhood of 10 petabases of sequence read archive metagenomes in real
time. Admittedly, what we've done is we've indexed them all. So we scrunched
them down by a factor of a thousand. So it's actually about 14 terabytes on
disk, and then they're indexed into a RocksDB database and you can take any
genome you want. It will scrunch the genome down using our sketching technique
on the browser side. So because we're using Rust, we now have a WebAssembly
version of SourMash that runs fully in the browser. It does the sketching on
the client side and then sends it to the backend Branchwater server, which does
the search. And then there's JavaScript on the front end that does the
visualization, and it all works quite well. It's also fully deployable on
private metagenomes. If you had your own collection of hundreds or thousands of
metagenomes, you could stand all this up on your own. It's all free and open
source software. It's not super well documented yet. We're working on that. And
now the biggest problem, and I think
this is a hilarious statement on bioinformatics as well. The biggest problem is
it's not clear what the use cases really are. There's no one dominating use
case. Source tracking, maybe. We have yet to demonstrate that it works really
well for that. Biogeography, sure. But how many people want to do biogeography?
It's not entirely clear. Despite this, like we're really excited about it
because it's just cool. There's a whole NIH initiative on petabase scale search
and lots of people are implementing stuff in the space and we still don't really
know what we would use it for. There's a lot of use cases, like, for example, if
you get, I don't know, XDR resistance popping up in typhoid, where did it come
from? Did it come from something that's been sampled on the other side of the
world, you know, and kind of reconstructing that story of how resistance moves
around. Or if you find something in a metagenome and you think, is this just a
local British thing, you know, because
of our lifestyles and we're very insular? Or is it widespread? And is this the
true cause of a pathogen? So, right. So I'm going to say two or three things
here. So one is absolutely, we totally want to do that stuff. And I think we
can. There's a couple problems though. One is this question of what the level of
specificity is. Like, are we getting down to strain level stuff? I think our
answer is yes, but we haven't really conclusively proven that yet. Another is
we don't have funding. Again, we talked a lot about this off recording, but I
would say it's a chicken and egg problem. I think we can do this, but right now
we need to figure out how to support the work necessary to show that it
actually really works, if that makes sense. And there's not a lot of funding
out there in this exploratory realm of, well, we have some vague and fuzzy use
cases, but nobody's really incredibly urgent about solving this problem
tomorrow, and there's no call for this that's sitting on my desk. This is an
incredibly fortunate discussion because I think
I'm chomping at the bit just like Andrew is here, because there are so many use
cases. Excellent. Like definitely source tracking, and you can say source
tracking of AMR genes for sure. CDC is actively engaged in like wastewater
analysis and tracking where pathogens are getting higher or lower, or where
they're present or they're absent. And then getting into the early discussion,
like what species is it? What lineage is it? You can start diving in like, and
just start tracking where all these subpopulations are going on. This is totally
a public health problem. That's awesome. Well, so there's two other flies in the
ointment that I want to mention just before people get too excited. One is that
the resolution is still somewhat limited. If you want to search for things under
10 kb in size, you start to get false negatives. It works exceedingly well above
that. We know how to adjust the parameters to drop the false negatives. We would
just need to resketch the entire sequence read archive, which would take us a
couple of months. So that's straightforward to do. The other thing is, and
this is going to be a weird one, it's actually where a lot of the lab's
follow-on work has been: we're dealing with sketches here, so you don't
actually get the sequences. What you get is a statement, like, this thing you
searched for, it's probably in this metagenome. And then you need to follow up
on it. So, you know, there's a level of indirection. We've settled on this sort
of hit-to-lead terminology. Like, we're giving you hits that are probably pretty
good leads for you to follow up on. We're not giving you a
definitive answer. I mean, sure. Like you'd get the answers from that genome.
And then you could probably just retroactively, or what's it called, reflex
culture some things and actually do like a molecular clock or get some
directionality for sure. In my opinion, sketches are definitely a way to narrow
down what you need to do, but it's usually not the last thing you do. Like you
have to polish it. Yeah, exactly. But I
think we can get to the point where, you know, there's a bunch of theoretical
work, and here I want to shout out my collaborator David Koslicki and Tessa
Pierce-Ward, who I mentioned before, who together with one of David's grad
students, oh no, I'm blanking on his name. His last name is Rahman Hera, and
I've just forgotten his first name and I'm so sorry, but there's a publication
out. They've been able to tie our sketch statistics to ANI. So we can tell you
for a given match what the ANI is between the query and the match. So we can do
things like that. And we have good limits and sort of bounds on where our stuff
works well and where it doesn't work. We know how far away you get before you
start to lose signal and all of that other stuff. And that's actually all
published at this point. So we're very interested in accurately characterizing
the limits of our techniques so that you know when you need to pick up a
different tool. So we don't think SourMash is the thing that everybody should
use. We want to just be really clear about where it is useful and what it can
do. And coming back around to the beginning of this segment of the podcast, it
scales ridiculously well. And that has been really cool. And it's nice to hear
that there's a bunch of really good, strong use cases in public health because
we've always felt that way, but I guess we failed to reach out to the right
people, who appear to be you two. Thank you so much for that. We
all love tools that scale ridiculously well. It is often overlooked and, you
know, our desire to search everything versus our desire to get a result in a
reasonable time, you know, obviously conflict sometimes. So thank you so much
for talking to us again today. And I think this will be the final time we talk
to you, but it's been great over the past few weeks. And yeah, good luck in all
your endeavors. Thank you. It's been great. Been great being on here.

Thank you so much for listening to us at home. If you like this podcast, please
subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your
choice. Follow us on Twitter @microbinfie. And if you don't like this podcast,
please
don't do anything. This podcast was recorded by the Microbial Bioinformatics
Group. The opinions expressed here are our own and do not necessarily reflect
the views of CDC or the Quadram Institute.
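
For readers who want to see the sketching ideas from the conversation in code, here is a minimal Python illustration: keep every k-mer hash that falls below a fixed fraction of the hash space (the fixed sampling rate), compare sketches by Jaccard similarity and containment, and answer "which samples contain this genome?" with an inverted index from hash to sample. This is a toy under stated assumptions, not SourMash's or Branchwater's implementation. All function names are hypothetical, BLAKE2 is a stand-in hash function, reverse complements are ignored, the "genomes" are random strings, and the sampling rate is 1 in 100 rather than the 1 in 1,000 discussed in the episode so that small inputs still produce a useful number of hashes.

```python
import hashlib
import random
from collections import Counter, defaultdict

K = 31               # k-mer size
SCALED = 100         # keep roughly 1 in 100 distinct k-mers (demo assumption)
MAX_HASH = 2 ** 64   # size of the 64-bit hash space


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2 is a stand-in hash here)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k=K, scaled=SCALED):
    """Keep every k-mer hash below MAX_HASH / scaled: a fixed sampling rate."""
    threshold = MAX_HASH // scaled
    hashes = (hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < threshold}


def jaccard(a, b):
    """Shared hashes divided by all hashes seen in either sketch."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(query, target):
    """Fraction of the query sketch that is also present in the target."""
    return len(query & target) / len(query) if query else 0.0


def build_inverted_index(sketches):
    """Map each hash to the set of sample names whose sketch contains it."""
    index = defaultdict(set)
    for name, sketch in sketches.items():
        for h in sketch:
            index[h].add(name)
    return index


def search(index, query, min_containment=0.1):
    """Look up each query hash once and report containment per sample."""
    counts = Counter()
    for h in query:
        for name in index.get(h, ()):
            counts[name] += 1
    return {name: n / len(query) for name, n in counts.items()
            if n / len(query) >= min_containment}


rng = random.Random(42)


def random_dna(n):
    """Random sequence standing in for a real genome."""
    return "".join(rng.choices("ACGT", k=n))


genome = random_dna(100_000)
# Pretend metagenomes: "ocean" holds the genome plus three unrelated ones,
# "soil" holds four unrelated ones.
ocean = [genome] + [random_dna(100_000) for _ in range(3)]
soil = [random_dna(100_000) for _ in range(4)]

genome_sketch = frac_minhash(genome)
# The sketch of a collection is the union of the sketches of its parts.
ocean_sketch = set().union(*(frac_minhash(s) for s in ocean))
soil_sketch = set().union(*(frac_minhash(s) for s in soil))

index = build_inverted_index({"ocean": ocean_sketch, "soil": soil_sketch})
hits = search(index, genome_sketch)

print("sketch sizes:", len(genome_sketch), len(ocean_sketch))
print("jaccard:", jaccard(genome_sketch, ocean_sketch))
print("containment:", containment(genome_sketch, ocean_sketch))
print("hits:", hits)
```

Because every hash in the genome sketch is also in the "ocean" sketch, containment is 1.0 while Jaccard similarity is only roughly a quarter, which illustrates why containment rather than Jaccard is the useful measure when asking whether a genome is present in a much larger metagenome. The index lookup touches only the query's hashes, so its cost depends on the query size rather than on the number of samples indexed.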