Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is the head of informatics at the Quadram Institute in Norwich, UK. Andrew is the director of technical innovation for Theiagen in Cambridge, UK. I am Dr. Lee Katz, and I am a senior bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.

Hi, and welcome to the MicroBinfie podcast. We're here again today with Titus Brown, who is going to continue his journey in research with us. He'll be talking about some of the research he's done recently, or over the past few years. Titus, you talked about sourmash last time, and MinHash, and k-mers. Could you give a bit more of an explanation of that: how did they come about, and what are they?

Sure, yeah, happy to. It's one of my driving obsessions these days. I'd worked with k-mers quite a bit before sourmash came along, mostly in this package called khmer. I will say as a side note, and this is a hat tip to Michael Crusoe: one of my rules after khmer is never name a software package after a country or a people. It turns out the Khmer are a people, and there are national aspects to it. The real reason is that it's rather rude, which I didn't really understand at the time. And then the other problem, which is sort of a funny problem but is tied to that, is that you get very different kinds of spam when you've named your software that way. I still get lots of spam for real estate from that part of the world; I don't know exactly why.

Anyway, I'd worked with k-mers quite a bit, with khmer and this digital normalization stuff, and it was great. K-mers are really just fantastic, computer-friendly ways of looking at sequencing data. But khmer was really heavyweight when you picked it up to look at metagenomes or large genomes, and it turns out there are just lots of k-mers in large genomes and large metagenomes. I don't know that I was intentionally looking for something lightweight at the time, but MinHash was just incredibly attractive, because you can take a representative subsample of k-mers at random from a large dataset and use that subset as a sketch to do very fast comparisons of otherwise intractably large datasets. In particular, with MinHash you can estimate something called the Jaccard similarity, which tells you, on a numerical scale, whether two things are basically the same or different. Its complement is a full-on distance metric, so you can use it with clustering and large-scale search and other things without much fear.
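To make the sketching idea concrete, here is a minimal bottom-sketch MinHash in Python. It is an illustrative toy under our own naming (kmers, hash_kmer, and minhash_sketch are hypothetical helpers, and MD5 stands in for a proper sketching hash), not how Mash or sourmash actually implement it:

```python
# Illustrative bottom-sketch MinHash; all helper names are our own.
import hashlib
import heapq
import random

def kmers(seq, k=21):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer with a stable hash."""
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def minhash_sketch(seq, k=21, n=1000):
    """Keep the n smallest distinct hash values (a 'bottom sketch')."""
    return set(heapq.nsmallest(n, {hash_kmer(km) for km in kmers(seq, k)}))

def estimate_jaccard(sk_a, sk_b, n=1000):
    """Classic bottom-sketch estimator: of the n smallest hashes in the
    union of the two sketches, count how many appear in both."""
    union_bottom = set(heapq.nsmallest(n, sk_a | sk_b))
    return len(union_bottom & sk_a & sk_b) / len(union_bottom)

# Two 50 kb sequences sharing their first half: Jaccard should be ~0.33.
random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(50000))
variant = genome[:25000] + "".join(random.choice("ACGT") for _ in range(25000))
print(estimate_jaccard(minhash_sketch(genome), minhash_sketch(variant)))
```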
The biggest problem with MinHash (and I should say the history is a bit more complicated, but MinHash and Mash were introduced for genomics in, I think, a 2015 or 2016 paper) is that MinHash itself doesn't work well for metagenomics. The reason is that with the MinHash technique, you can't accurately estimate the Jaccard similarity for two sets of very different sizes. For two bacterial genomes that are approximately the same size, when you want to know how similar they are, Mash works great. But if you have a bacterial genome and a large metagenome, and you want to know whether that bacterial genome is in the metagenome, or how much it overlaps with it, Mash will not give you an accurate answer in most circumstances. That's because of the way the downsampling is done: essentially, it picks a fixed number of k-mers to represent the entire dataset, and if one of the datasets is much larger than the other, the overlap between those fixed-size sets is going to be very small indeed.

So, sourmash started out as just a straightforward re-implementation of Mash in Python. I wanted to be able to play with it a little bit. I love re-implementing simple algorithms and getting a really good feeling for how they work underneath. I should say here that for the first year, sourmash actually did the wrong thing. It was a broken re-implementation of Mash, but it let me get far enough in until someone corrected me. And so I spent a year or two with sourmash, and my grad student, Luiz Irber, was working on it at the time as well, struggling to figure out how to use it for metagenomics. And that's when I discovered that I had basically re-implemented, or rediscovered, something called ModHash, which had been published at the same time as MinHash. Instead of sampling a fixed number of hashes or k-mers from a set, it sampled evenly across a set. So, rather than saying, "I'm going to use a thousand k-mers to represent this large dataset," you would say, "I'm going to use one in a thousand k-mers to sample this large dataset." What that meant was that when you were comparing sets of very different sizes, the smaller set would have a smaller number of k-mers associated with it, the larger set would have a larger number of k-mers associated with it, and the overlap estimation would be accurate. So, you could do Jaccard calculations with this ModHash approach, just like you did with MinHash, but you could also do containment and overlap-based analyses. You could say, "Ah, I have a bacterial genome, and I have a metagenome. How many of the k-mers in this bacterial genome are also in the metagenome?" And you could get an accurate estimate of that with this ModHash-based approach.

We actually implemented a slightly different version of ModHash that we've now called FracMinHash, and for the CS wonks out there, it's a bottom-sketch version of ModHash. What you do is you say: I'm going to take all of hash space, all of the possible k-mers (there's a fixed number of k-mers, 4^k, for any given k), and shuffle it with a hash function. Then I'm going to take the bottom 1/1000th of all possible hashes, of all possible k-mers, and use those as my representative subset. Any time one of those bottom-1/1000th-of-hash-space k-mers shows up in a dataset, I'm going to record it as a k-mer that's present.
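A toy version of that scaled, FracMinHash-style sampling might look like the following, reusing the kmers() and hash_kmer() helpers from the sketch above; again, this is our own illustration, not sourmash's internals. The key difference from the fixed-size sketch is that the sketch now grows with the dataset, so containment of a small genome in a huge metagenome can be estimated sensibly:

```python
# Toy FracMinHash ("scaled") sketch: keep every k-mer whose hash lands in
# the bottom 1/scaled fraction of hash space, not a fixed number of hashes.
MAX_HASH = 2**64 - 1

def frac_sketch(seq, k=21, scaled=1000):
    """Keep hashes below MAX_HASH/scaled, i.e. roughly 1 in `scaled` k-mers."""
    threshold = MAX_HASH // scaled
    return {h for h in (hash_kmer(km) for km in kmers(seq, k)) if h < threshold}

def containment(query, subject):
    """Estimated fraction of the query's k-mers present in the subject."""
    return len(query & subject) / len(query) if query else 0.0

# A genome fully present in a metagenome gives containment near 1.0, even
# though the metagenome's sketch is far larger than the genome's:
#   containment(frac_sketch(genome_seq), frac_sketch(metagenome_seq))
```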
So, do you not get issues with noise then, you know, just from sequencing?

Absolutely, yeah. But I guess this is another philosophical point that I would make, one that informs a lot of my thinking about data analysis: often you don't know what's noise and what isn't noise. So it's better to have techniques that account for the noise, that are robust in the face of noise, but don't rely on you removing it in advance. What I would say in response is: yes, all of that noise does show up in your dataset, it shows up in your sketch, and that's fine; you just need to develop methods to account for it.

So, are those methods like removing the low abundance ones, or do you have other ones in mind?

Yeah, so with our earlier work in khmer, we'd focused a lot on how to accurately remove low abundance k-mers from datasets, and the problem is that in metagenomes, that's a terrible idea. If you remove low abundance k-mers from metagenomes, you remove not only the noise but also a lot of signal. That's because, of course, metagenomes, like transcriptomes, are variable abundance: they're sampling a mixture that has some high abundance and some low abundance things. And it turns out a lot of the real data in your metagenome, especially when you're looking at things like soil, is in fact low abundance, so if you remove low abundance k-mers, you get rid of real stuff. We developed a fairly decent streaming, or semi-streaming, approach in khmer that let us remove low abundance k-mers from reads that were otherwise high abundance, and that worked okay, but it still looked at all of the data, and so it was too slow for real practical use. Any time you have to iterate across a dataset more than once, you've already sort of lost in metagenomics.

So with sourmash, over the years, I've reached this détente with erroneous k-mers. It turns out that if you can do things fast enough, when you're doing reference-based approaches, you don't really care about the erroneous k-mers. They tend not to match things. That's my 80%-true statement.

I guess when I've looked at novel k-mers before, a lot of times the novel stuff is things like, you know, phage and whatnot. So it is kind of interesting, but it's maybe not what you're interested in.

Right. Well, this is a good time to say that the biggest strength and the biggest weakness of sourmash is that it's entirely reference-based, or at least the most popular applications of sourmash are entirely reference-based. By and large, if you don't have it in your database, we don't find it with sourmash. And that's intentional; that's a feature, not a bug. In particular, it lets us ignore these issues of sequencing errors, because they just don't match. We have found that in reporting out matches, when we ask, for a metagenome, how much of the metagenome is known, if we weight the reporting by the abundance of things, then we can accurately say, gosh, you know, 5% of the abundance-weighted k-mers in this metagenome are unknown, and those are probably mostly errors. We can say things like that, and that turns out to be pretty effective at weeding out the issues of erroneous k-mers. Of course, you can still filter them out and do other things if you have a genome, or other cases where you're just interested in the high abundance k-mers.
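Here is a small sketch of that abundance-weighted reporting idea, in the same illustrative spirit (the function and its inputs are hypothetical, not sourmash's actual gather machinery): weighting each distinct k-mer by how many times it occurs means that one-off erroneous k-mers contribute almost nothing to the reported unknown fraction.

```python
# Hypothetical illustration of abundance-weighted "how much is unknown"
# reporting for a metagenome against a reference k-mer set.
from collections import Counter

def unknown_fraction(reads, reference_kmers, k=21):
    """Abundance-weighted fraction of k-mer observations absent from the
    reference: each distinct k-mer counts once per occurrence."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    unknown = sum(c for km, c in counts.items() if km not in reference_kmers)
    return unknown / total if total else 0.0

# An unweighted version would divide distinct unmatched k-mers by all
# distinct k-mers, letting singleton error k-mers dominate the result.
```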
But by and large, these days, it's taken us about five years to reach the conclusion that you don't need to do abundance filtering, or even error trimming of any particular kind, for the kinds of reference-based metagenomics we're doing.

That's a lot quicker as well, because nearly all the pipelines that I've seen for genomics start off with "let's trim the reads and get rid of the bad stuff." So if you can skip that step, that saves a lot of time.

Yes, yeah. So, okay. We took this FracMinHash-based approach, and we figured that out probably around 2017. We implemented it robustly in sourmash, and we implemented a lot of additional things in sourmash to make sure that you were comparing apples to apples, you know, comparing the same k-mer sizes in the sketches, and that you had efficient ways of searching large collections of k-mers. We developed fairly robust Python APIs for interacting with large sets of k-mers. And this illustrates what I would say is my characteristic approach to bioinformatics, which is that I don't want a tool that does one thing. I want a library that lets me do whatever crazy thing comes across my brain today. And it's going to be different tomorrow, and different the day after that. So I care much more about rapid iteration than I do about getting the thing right in the first place. I think this is both good and bad. It means that often the state of our software and the state of our research program is a bit of a mess. It's very hard to convey to people, because we're figuring things out; we're doing research on the computational approaches. On the flip side, it does let us figure out new ways of dealing with data that often seem to be different from what other people have done. Like, digital normalization came out of nowhere and was very successful for a couple of years, and sourmash also sort of came out of nowhere. I feel like we do some things very well, and then we utterly ignore other things that other people are doing very well, because that's not a problem that I need to solve; it's already been solved by other people. And it makes life fun. This is probably an admission I shouldn't make, but I don't read a lot of other people's papers when I'm tackling a problem. I try to gain a feeling for what the underlying technical problem is, and then I figure out whether it's something I've seen somewhere else. And if it is, then I go borrow that approach.
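As an aside on those Python APIs: interacting with sourmash sketches looks roughly like this, based on our reading of the sourmash 4.x API (check the project documentation for current signatures; the toy sequences here are placeholders for data you would load from FASTA):

```python
# Rough sketch of the sourmash Python API (4.x era): n=0 plus scaled=1000
# requests a FracMinHash-style "scaled" sketch rather than a fixed-size one.
import random
import sourmash

# Placeholder sequences; real use would read them from FASTA files.
random.seed(0)
seq1 = "".join(random.choice("ACGT") for _ in range(100000))
seq2 = seq1[:50000] + "".join(random.choice("ACGT") for _ in range(50000))

mh1 = sourmash.MinHash(n=0, ksize=31, scaled=1000)
mh2 = sourmash.MinHash(n=0, ksize=31, scaled=1000)
mh1.add_sequence(seq1)
mh2.add_sequence(seq2)

print(mh1.jaccard(mh2))       # Jaccard estimate; sketches must share ksize
print(mh1.contained_by(mh2))  # fraction of mh1's hashes also present in mh2
```

On the command line, recent sourmash versions express roughly the same workflow as `sourmash sketch dna -p k=31,scaled=1000` followed by `sourmash compare` or `sourmash gather`.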
Go ahead.

So, can I ask, are you doing this work these days, or are you outsourcing it to grad students? Are you just, you know, the overlord sitting there taking all the glory?

Wow. Well, for better or for worse, the professor does tend to get a lot of the credit, but I sort of have two modes of working. With the more computational people in the lab, I tend to take a nerd sniping approach. I don't know if you're familiar with this XKCD comic; well, you should go look up nerd sniping, because I'm not going to do a good job of relaying it. Lee is bringing it up right now. The idea of nerd sniping is that if you can lay out interesting problems in front of people, they'll often pick them up and run with them, and you don't have to solve them yourself. Nerd sniping as XKCD tells it is a little more mean, but nerd sniping is how I recruited Luiz Irber into sourmash, and he's been one of my two major collaborators on sourmash. You know, I started putting software in a GitHub repo; he thought, oh, that looks interesting and is solving some of the problems I have in my work. He started adding pull requests himself, and over time adding new data structures and algorithms, and eventually he basically became one of the co-equal contributors to sourmash. So that's how I deal with the technical people. Go ahead, Lee.

So you mentioned Luiz a couple of times. Was he at a different school, like an undergrad looking for grad school, and you just picked him up and nerd sniped him? How did that go, since you mentioned him?

Yeah. So Luiz did his undergrad work, and then also worked as a staff scientist after his undergrad, in Brazil, for a number of years. He saw that I was working in Python and open source, and we met up at a PyCon; I don't even remember when, it must have been around 2010 or 2011. He saw I was interested in writing good software and that I was doing interesting work. He didn't know anything about biology, and he wrote me this lovely email saying, "Hey, I am a scientist from Brazil. I'm interested in grad school. I do Python programming. What do you think?" And I said, great, come to grad school. And so he came to Michigan State as a grad student, in 2013 or 2014 I think, to work with me.

Yeah, well, funny story there. He came midway through the year, so he came in winter, and I remember vividly: he flew into Detroit and then took a bus from Detroit to East Lansing, and it was in one of the worst snow storms we'd had in ten years. They actually closed Michigan State later that year because it had gotten so cold. I remember I pulled up in my dad-mobile minivan to the bus stop to pick him up, and he got off the bus with his wife, Stefania, and they were trudging through, you know, three-foot-deep snow, and they just looked miserable. They got into the minivan and sort of sat there in shell-shocked silence, and I could just see his wife giving him this look, like, what have you gotten us into? So I think both of them were quite glad when, later that year, I said, well, we might be moving to California. But yeah, it was definitely a big transition for him.

And at the same time, just from a personal point of view: when we started seeing Mash go online, this was my first encounter with you. I started seeing you commenting on the Mash repo, asking things like, what's the random seed you guys are using behind the scenes? What's this? I'm going to develop this thing called sourmash. And at the same time I was thinking, I can just turn this into trees and clustering, and this would be great. Then I dropped off the face of the earth in 2017 for paternity leave, but you kept going, and when I resurfaced, I saw all this happening, and it was really cool. Just to let you know, I thought it was very transformative too, and I really enjoyed seeing this from the other side, I guess from the East Coast.

Awesome. Thank you. Yeah.
I think one of the sets of experiences I brought to bioinformatics, to my research work, was actually this open source mentality that I learned from this guy named Mark Galassi, who, if you Google him, has done a whole bunch of stuff in physics. He works at Lawrence Liver... sorry, the one down in Los Alamos. There we are; why did I have trouble figuring that out? He got me my first Unix account. When I was in junior high and high school, he was a grad student in physics in the department where my father was a physics professor, and so I got early access to Unix from that. He introduced me to Unix and open source programming, and to screen. I don't know if you use screen or tmux, but he basically was like, you just need to be using this. And that's something that's carried through the years; it probably stands out more than any one other thing he's done for me, despite all the stuff he's done for me. Screen was really transformative. So he introduced me to open source, and I started writing some open source code in high school. I participated in various different projects: I wrote a real-time chat program based on talk, called ring, and I worked on a game called dominion for a little while. So I was into open source programming before I really got into scientific research. And then when I got into research, and especially when I started my own lab, it was like, well, I feel like I should just continue doing this in an open source way. Something that's stuck through all of the different work I've done is that, you know, we're supposed to be doing this stuff in the open. We're supposed to be letting others make use of what we do. So why don't we figure out how to make that work, rather than hiding things for personal enrichment until we figure out how to get appropriate credit for them? So it's great that you saw what I was doing because I was posting on the Mash repo, and I like to continue doing that kind of stuff.

Well, on that note, I think we'll call it a day there for this week. Thank you so much, Titus, for joining us again, and we will pick it up again in another week. Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at MicroBinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.