Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is the head of informatics at the Quadram Institute in Norwich, UK. Andrew is the director of technical innovation for Theiagen in Cambridge, UK. I am Dr. Lee Katz, and I am a senior bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.

Hi, and welcome to the MicroBinfie podcast. We're here again today with Titus Brown, who is going to continue his journey in research with us. He'll be talking about some of the research he's done recently, or over the past few years. Titus, you talked about sourmash last time, and MinHash, and k-mers. Could you give a bit more of an explanation of that: how did they come about, and what are they?

Sure, yeah, happy to. It's one of my driving obsessions these days. I'd worked with k-mers quite a bit before sourmash came along, mostly in this package called khmer. I will say as a side note, and this is a hat tip to Michael Crusoe: one of my rules after khmer is never name a software package after a country or a people. It turns out the Khmer are a people, and there are national aspects to it. The real reason is that it's rather rude, which I didn't really understand at the time. And then the other problem, which is sort of a funny problem but is tied to that, is that you get very different kinds of spam when you've named your software that way. I still get lots of spam for real estate from that part of the world; I don't know exactly why.

Anyway, I'd worked with k-mers quite a bit, with khmer and this digital normalization stuff, and it was great. K-mers are really just fantastic, computer-friendly ways of looking at sequencing data. But khmer was really heavyweight when you picked it up to look at metagenomes or large genomes, and it turns out there are just lots of k-mers in large genomes and large metagenomes. I don't know that I was intentionally looking for something lightweight at the time, but MinHash was just incredibly attractive, because you can take a representative subsample of k-mers at random from a large dataset and use that subset as a sketch to do very fast comparisons of otherwise intractably large datasets. In particular, with MinHash you can estimate something called the Jaccard similarity, which tells you, on a numerical scale, whether two things are basically the same or different. Its complement is a full-on distance metric, so you can use it with clustering and large-scale search and other things without much fear.
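To make the sketching idea concrete, here is a minimal bottom-sketch MinHash in Python. It is an illustrative toy under our own naming (kmers, hash_kmer, and minhash_sketch are hypothetical helpers, and MD5 stands in for a proper sketching hash), not how Mash or sourmash actually implement it:

```python
# Illustrative bottom-sketch MinHash; all helper names are our own.
import hashlib
import heapq
import random

def kmers(seq, k=21):
    """Yield every k-length substring of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer with a stable hash."""
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def minhash_sketch(seq, k=21, n=1000):
    """Keep the n smallest distinct hash values (a 'bottom sketch')."""
    return set(heapq.nsmallest(n, {hash_kmer(km) for km in kmers(seq, k)}))

def estimate_jaccard(sk_a, sk_b, n=1000):
    """Classic bottom-sketch estimator: of the n smallest hashes in the
    union of the two sketches, count how many appear in both."""
    union_bottom = set(heapq.nsmallest(n, sk_a | sk_b))
    return len(union_bottom & sk_a & sk_b) / len(union_bottom)

# Two 50 kb sequences sharing their first half: Jaccard should be ~0.33.
random.seed(1)
genome = "".join(random.choice("ACGT") for _ in range(50000))
variant = genome[:25000] + "".join(random.choice("ACGT") for _ in range(25000))
print(estimate_jaccard(minhash_sketch(genome), minhash_sketch(variant)))
```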
The biggest problem with MinHash (and I should say the history is a bit more complicated, but MinHash and Mash were introduced for genomics in, I think, a 2015 or 2016 paper) is that MinHash itself doesn't work well for metagenomics. The reason is that with the MinHash technique, you can't accurately estimate the Jaccard similarity for two sets of very different sizes. For two bacterial genomes that are approximately the same size, when you want to know how similar they are, Mash works great. But if you have a bacterial genome and a large metagenome, and you want to know whether that bacterial genome is in the metagenome, or how much it overlaps with it, Mash will not give you an accurate answer in most circumstances. That's because of the way the downsampling is done: essentially, it picks a fixed number of k-mers to represent the entire dataset, and if one of the datasets is much larger than the other, the overlap between those fixed-size sets is going to be very small indeed.

So, sourmash started out as just a straightforward re-implementation of Mash in Python. I wanted to be able to play with it a little bit. I love re-implementing simple algorithms and getting a really good feeling for how they work underneath. I should say here that for the first year, sourmash actually did the wrong thing. It was a broken re-implementation of Mash, but it let me get far enough in until someone corrected me. And so I spent a year or two with sourmash, and my grad student, Luiz Irber, was working on it at the time as well, struggling to figure out how to use it for metagenomics. And that's when I discovered that I had basically re-implemented, or rediscovered, something called ModHash, which had been published at the same time as MinHash. Instead of sampling a fixed number of hashes or k-mers from a set, it sampled evenly across a set. So, rather than saying, "I'm going to use a thousand k-mers to represent this large dataset," you would say, "I'm going to use one in a thousand k-mers to sample this large dataset." What that meant was that when you were comparing sets of very different sizes, the smaller set would have a smaller number of k-mers associated with it, the larger set would have a larger number of k-mers associated with it, and the overlap estimation would be accurate. So, you could do Jaccard calculations with this ModHash approach, just like you did with MinHash, but you could also do containment and overlap-based analyses. You could say, "Ah, I have a bacterial genome, and I have a metagenome. How many of the k-mers in this bacterial genome are also in the metagenome?" And you could get an accurate estimate of that with this ModHash-based approach.

We actually implemented a slightly different version of ModHash that we've now called FracMinHash, and for the CS wonks out there, it's a bottom-sketch version of ModHash. What you do is you say: I'm going to take all of hash space, all of the possible k-mers (there's a fixed number of k-mers, 4^k, for any given k), and shuffle it with a hash function. Then I'm going to take the bottom 1/1000th of all possible hashes, of all possible k-mers, and use those as my representative subset. Any time one of those bottom-1/1000th-of-hash-space k-mers shows up in a dataset, I'm going to record it as a k-mer that's present.
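A toy version of that scaled, FracMinHash-style sampling might look like the following, reusing the kmers() and hash_kmer() helpers from the sketch above; again, this is our own illustration, not sourmash's internals. The key difference from the fixed-size sketch is that the sketch now grows with the dataset, so containment of a small genome in a huge metagenome can be estimated sensibly:

```python
# Toy FracMinHash ("scaled") sketch: keep every k-mer whose hash lands in
# the bottom 1/scaled fraction of hash space, not a fixed number of hashes.
MAX_HASH = 2**64 - 1

def frac_sketch(seq, k=21, scaled=1000):
    """Keep hashes below MAX_HASH/scaled, i.e. roughly 1 in `scaled` k-mers."""
    threshold = MAX_HASH // scaled
    return {h for h in (hash_kmer(km) for km in kmers(seq, k)) if h < threshold}

def containment(query, subject):
    """Estimated fraction of the query's k-mers present in the subject."""
    return len(query & subject) / len(query) if query else 0.0

# A genome fully present in a metagenome gives containment near 1.0, even
# though the metagenome's sketch is far larger than the genome's:
#   containment(frac_sketch(genome_seq), frac_sketch(metagenome_seq))
```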
So, do you not get issues with noise then, you know, just from sequencing?

Absolutely, yeah. But I guess this is another philosophical point that I would make, one that informs a lot of my thinking about data analysis: often you don't know what's noise and what isn't noise. So it's better to have techniques that account for the noise, that are robust in the face of noise, but don't rely on you removing it in advance. What I would say in response is: yes, all of that noise does show up in your dataset, it shows up in your sketch, and that's fine; you just need to develop methods to account for it.

So, are those methods like removing the low abundance ones, or do you have other ones in mind?

Yeah, so with our earlier work in khmer, we'd focused a lot on how to accurately remove low abundance k-mers from datasets, and the problem is that in metagenomes, that's a terrible idea. If you remove low abundance k-mers from metagenomes, you remove not only the noise but also a lot of signal. That's because, of course, metagenomes, like transcriptomes, are variable abundance: they're sampling a mixture that has some high abundance and some low abundance things. And it turns out a lot of the real data in your metagenome, especially when you're looking at things like soil, is in fact low abundance, so if you remove low abundance k-mers, you get rid of real stuff. We developed a fairly decent streaming, or semi-streaming, approach in khmer that let us remove low abundance k-mers from reads that were otherwise high abundance, and that worked okay, but it still looked at all of the data, and so it was too slow for real practical use. Any time you have to iterate across a dataset more than once, you've already sort of lost in metagenomics.

So with sourmash, over the years, I've reached this détente with erroneous k-mers. It turns out that if you can do things fast enough, when you're doing reference-based approaches, you don't really care about the erroneous k-mers. They tend not to match things. That's my 80%-true statement.

I guess when I've looked at novel k-mers before, a lot of times the novel stuff is things like, you know, phage and whatnot. So it is kind of interesting, but it's maybe not what you're interested in.

Right. Well, this is a good time to say that the biggest strength and the biggest weakness of sourmash is that it's entirely reference-based, or at least the most popular applications of sourmash are entirely reference-based. By and large, if you don't have it in your database, we don't find it with sourmash. And that's intentional; that's a feature, not a bug. In particular, it lets us ignore these issues of sequencing errors, because they just don't match. We have found that in reporting out matches, when we ask, for a metagenome, how much of the metagenome is known, if we weight the reporting by the abundance of things, then we can accurately say, gosh, you know, 5% of the abundance-weighted k-mers in this metagenome are unknown, and those are probably mostly errors. We can say things like that, and that turns out to be pretty effective at weeding out the issues of erroneous k-mers. Of course, you can still filter them out and do other things if you have a genome, or other cases where you're just interested in the high abundance k-mers.
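Here is a small sketch of that abundance-weighted reporting idea, in the same illustrative spirit (the function and its inputs are hypothetical, not sourmash's actual gather machinery): weighting each distinct k-mer by how many times it occurs means that one-off erroneous k-mers contribute almost nothing to the reported unknown fraction.

```python
# Hypothetical illustration of abundance-weighted "how much is unknown"
# reporting for a metagenome against a reference k-mer set.
from collections import Counter

def unknown_fraction(reads, reference_kmers, k=21):
    """Abundance-weighted fraction of k-mer observations absent from the
    reference: each distinct k-mer counts once per occurrence."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    unknown = sum(c for km, c in counts.items() if km not in reference_kmers)
    return unknown / total if total else 0.0

# An unweighted version would divide distinct unmatched k-mers by all
# distinct k-mers, letting singleton error k-mers dominate the result.
```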
But by and large, these days, it's taken us about five years to reach the conclusion that you don't need to do abundance filtering, or even error trimming of any particular kind, for the kinds of reference-based metagenomics we're doing.

That's a lot quicker as well, because nearly all the pipelines that I've seen for genomics start off with "let's trim the reads and get rid of the bad stuff." So if you can skip that step, that saves a lot of time.

Yes, yeah. So, okay. We took this FracMinHash-based approach, and we figured that out probably around 2017. We implemented it robustly in sourmash, and we implemented a lot of additional things in sourmash to make sure that you were comparing apples to apples, you know, comparing the same k-mer sizes in the sketches, and that you had efficient ways of searching large collections of k-mers. We developed fairly robust Python APIs for interacting with large sets of k-mers. And this illustrates what I would say is my characteristic approach to bioinformatics, which is that I don't want a tool that does one thing. I want a library that lets me do whatever crazy thing comes across my brain today. And it's going to be different tomorrow, and different the day after that. So I care much more about rapid iteration than I do about getting the thing right in the first place. I think this is both good and bad. It means that often the state of our software and the state of our research program is a bit of a mess. It's very hard to convey to people, because we're figuring things out; we're doing research on the computational approaches. On the flip side, it does let us figure out new ways of dealing with data that often seem to be different from what other people have done. Like, digital normalization came out of nowhere and was very successful for a couple of years, and sourmash also sort of came out of nowhere. I feel like we do some things very well, and then we utterly ignore other things that other people are doing very well, because that's not a problem that I need to solve; it's already been solved by other people. And it makes life fun. This is probably an admission I shouldn't make, but I don't read a lot of other people's papers when I'm tackling a problem. I try to gain a feeling for what the underlying technical problem is, and then I figure out whether it's something I've seen somewhere else. And if it is, then I go borrow that approach.
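As an aside on those Python APIs: interacting with sourmash sketches looks roughly like this, based on our reading of the sourmash 4.x API (check the project documentation for current signatures; the toy sequences here are placeholders for data you would load from FASTA):

```python
# Rough sketch of the sourmash Python API (4.x era): n=0 plus scaled=1000
# requests a FracMinHash-style "scaled" sketch rather than a fixed-size one.
import random
import sourmash

# Placeholder sequences; real use would read them from FASTA files.
random.seed(0)
seq1 = "".join(random.choice("ACGT") for _ in range(100000))
seq2 = seq1[:50000] + "".join(random.choice("ACGT") for _ in range(50000))

mh1 = sourmash.MinHash(n=0, ksize=31, scaled=1000)
mh2 = sourmash.MinHash(n=0, ksize=31, scaled=1000)
mh1.add_sequence(seq1)
mh2.add_sequence(seq2)

print(mh1.jaccard(mh2))       # Jaccard estimate; sketches must share ksize
print(mh1.contained_by(mh2))  # fraction of mh1's hashes also present in mh2
```

On the command line, recent sourmash versions express roughly the same workflow as `sourmash sketch dna -p k=31,scaled=1000` followed by `sourmash compare` or `sourmash gather`.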
Go ahead.

So, can I ask, are you doing this work these days, or are you outsourcing it to grad students? Are you just, you know, the overlord sitting there taking all the glory?

Wow. Well, for better or for worse, the professor does tend to get a lot of the credit, but I sort of have two modes of working. With the more computational people in the lab, I tend to take a nerd sniping approach. I don't know if you're familiar with this XKCD comic; well, you should go look up nerd sniping, because I'm not going to do a good job of relaying it. Lee is bringing it up right now. The idea of nerd sniping is that if you can lay out interesting problems in front of people, they'll often pick them up and run with them, and you don't have to solve them yourself. Nerd sniping as XKCD tells it is a little more mean, but nerd sniping is how I recruited Luiz Irber into sourmash, and he's been one of my two major collaborators on sourmash. You know, I started putting software in a GitHub repo; he thought, oh, that looks interesting and is solving some of the problems I have in my work. He started adding pull requests himself, and over time adding new data structures and algorithms, and eventually he basically became one of the co-equal contributors to sourmash. So that's how I deal with the technical people. Go ahead, Lee.

So you mentioned Luiz a couple of times. Was he at a different school, like an undergrad looking for grad school, and you just picked him up and nerd sniped him? How did that go, since you mentioned him?

Yeah. So Luiz did his undergrad work, and then also worked as a staff scientist after his undergrad, in Brazil, for a number of years. He saw that I was working in Python and open source, and we met up at a PyCon; I don't even remember when, it must have been around 2010 or 2011. He saw I was interested in writing good software and that I was doing interesting work. He didn't know anything about biology, and he wrote me this lovely email saying, "Hey, I am a scientist from Brazil. I'm interested in grad school. I do Python programming. What do you think?" And I said, great, come to grad school. And so he came to Michigan State as a grad student, in 2013 or 2014 I think, to work with me.

Yeah, well, funny story there. He came midway through the year, so he came in winter, and I remember vividly: he flew into Detroit and then took a bus from Detroit to East Lansing, and it was in one of the worst snow storms we'd had in ten years. They actually closed Michigan State later that year because it had gotten so cold. I remember I pulled up in my dad-mobile minivan to the bus stop to pick him up, and he got off the bus with his wife, Stefania, and they were trudging through, you know, three-foot-deep snow, and they just looked miserable. They got into the minivan and sort of sat there in shell-shocked silence, and I could just see his wife giving him this look, like, what have you gotten us into? So I think both of them were quite glad when, later that year, I said, well, we might be moving to California. But yeah, it was definitely a big transition for him.

And at the same time, just from a personal point of view: when we started seeing Mash go online, this was my first encounter with you. I started seeing you commenting on the Mash repo, asking things like, what's the random seed you guys are using behind the scenes? What's this? I'm going to develop this thing called sourmash. And at the same time I was thinking, I can just turn this into trees and clustering, and this would be great. Then I dropped off the face of the earth in 2017 for paternity leave, but you kept going, and when I resurfaced, I saw all this happening, and it was really cool. Just to let you know, I thought it was very transformative too, and I really enjoyed seeing this from the other side, I guess from the East Coast.

Awesome. Thank you. Yeah.
I think one of the sets of experiences I brought to bioinformatics, to my research work, was actually this open source mentality that I learned from this guy named Mark Galassi, who, if you Google him, has done a whole bunch of stuff in physics. He works at Lawrence Liver... sorry, the one down in Los Alamos. There we are; why did I have trouble figuring that out? He got me my first Unix account. When I was in junior high and high school, he was a grad student in physics in the department where my father was a physics professor, and so I got early access to Unix from that. He introduced me to Unix and open source programming, and to screen. I don't know if you use screen or tmux, but he basically was like, you just need to be using this. And that's something that's carried through the years; it probably stands out more than any one other thing he's done for me, despite all the stuff he's done for me. Screen was really transformative. So he introduced me to open source, and I started writing some open source code in high school. I participated in various different projects: I wrote a real-time chat program based on talk, called ring, and I worked on a game called dominion for a little while. So I was into open source programming before I really got into scientific research. And then when I got into research, and especially when I started my own lab, it was like, well, I feel like I should just continue doing this in an open source way. Something that's stuck through all of the different work I've done is that, you know, we're supposed to be doing this stuff in the open. We're supposed to be letting others make use of what we do. So why don't we figure out how to make that work, rather than hiding things for personal enrichment until we figure out how to get appropriate credit for them? So it's great that you saw what I was doing because I was posting on the Mash repo, and I like to continue doing that kind of stuff.

Well, on that note, I think we'll call it a day there for this week. Thank you so much, Titus, for joining us again, and we will pick it up again in another week. Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at MicroBinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.