Hello, and thank you for listening to the MicroBinfie Podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is the Head of Informatics at the Quadram Institute in Norwich, UK. Andrew is the Director of Technical Innovation for Theiagen in Cambridge, UK. I am Dr. Lee Katz, and I am a Senior Bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.
Hi, and welcome to the MicroBinfie Podcast. We are here today with Titus Brown. He is a full professor at UC Davis. He has had his hands in many things that you may or may not realize you work with. So anyway, let's get into it and learn what all those things are. We have a few different topics we can get into. But first of all, I'm sorry: it's about 6am over there, to help accommodate our different time zones, with Andrew in England and me in Atlanta. So are you awake, Titus? Maybe that's the first question.
Yeah, thanks for having me on the podcast. I normally get up at 5am, and that's when I get most of my deep work time, I guess, is the term for it. So I'm really happy to be here, and I'm fully caffeinated.
Could you tell us about yourself and what your background is?
Sure, gosh. So my background is I did a math bachelor's degree at Reed College, and that's when I started getting really interested in research. I worked in two different areas as an undergrad. I worked in digital evolution, or digital life, work that has been continued by my collaborators at the time to this day.
And then also, I worked in something called physical meteorology, where we were looking at climate. We were measuring the albedo of the Earth by using the moon as a mirror, following work that was initiated by Leonardo da Vinci, actually. And so, yeah. It turns out that you can observe the ratio of the intensities of the dark side of the moon and the bright side of the moon using a ground-based telescope, and that informs you about sunlight that's bouncing off the Earth's atmosphere, hitting the moon, and then reflecting back to you. And so you can get a whole-hemisphere measurement of the Earth's albedo in real time, just by observing the moon. Anyway, I did that as an undergrad. That work's also still going on, just without me as part of it. And the digital evolution stuff got me really interested in biology. So for grad school, I asked a couple of different scientists that I knew, mostly physicists, what I should do. And without any hesitation at all, all three of them said, you should do biology. If I was starting today in research, I would do biology. And these were all pretty senior physicists. So it was startling to hear three separate physicists say biology is where the future's at. So I went to Caltech and I bounced around a couple of labs in the beginning and eventually found a developmental biology lab where I started doing experimental molecular biology. And I don't think many people know that I actually did experiments for six or seven years. Yeah, in the lab, wet bench: PCRs, microinjections, and in situs, all the good things. And I worked for about eight or nine years in the lab of Eric Davidson, who did a lot of work on cis-regulatory analysis and gene regulatory networks in sea urchins. So my most in-depth formal training, actually, at this point is in developmental biology.
And this was in the late nineties and early 2000s, and that's when genome sequencing really started to blossom. We started getting lots and lots of sequences. And so I was in this lab and I had a background in open source programming at the time, and I had a math background. All of a sudden we had too much data, and I was someone in the lab that knew how to deal with it. And so I started developing tools to help people do large-scale BLAST searches, which I guess is where a lot of bioinformaticians got their start. And then it moved on to custom software tools for doing comparative sequence analysis, and a web server to run the compute-heavy stuff, and, you know, really, really nice interactive bioinformatics applications within a large lab. And I think this is where one of my, I would say, formative philosophies for my life came along. I worked on developing a comparative sequence analysis tool that would let people look for conserved non-coding elements in very large segments of eukaryotic DNA. I developed this graphical user interface and I developed this backend web server, and then I wrote a tutorial. And then I went back to the lab and started working on my own stuff. Over the next two years, about eight different people in the lab picked up the tool, used it on their genomic DNA to find candidate cis-regulatory elements, and then tested them in the lab, all without really having to talk to me, since I'd written tutorials and documentation and stuff. And so I showed up at the every-18-months sea urchin meeting to discover that eight different people in my lab had actually used my tool to further their research, had gotten good results, and were ready to publish. And I had never had to talk to them or support them in any way, because I'd done a decent job in writing the software and so on.
And I thought to myself, this is the way to do bioinformatics. I get to write software and then I don't have to talk to people, and they advance their own science. And ever since then, I've sort of had this philosophy. I love writing software. I don't particularly like talking to people on a daily basis because, I guess, I'm an archetypal bioinformatician type. And so what I do instead is I do my best to make the software usable, and I write good tutorials and documentation. And that is what spreads my labor more easily than anything else I've found. So that's been behind a lot of the software work that I've done since then: to try and do good documentation and tutorials. So I got my PhD. I did a very short postdoc working in chick neural crest, also at Caltech. And then I went to Michigan State University as an assistant professor. And another big career shift happened then. I was planning to do gene regulatory network stuff. I was planning to combine good embryology with large-scale gene expression analysis. And I arrived at Michigan State in 2008, and right about then was when the Illumina GAII sequencer became widely available. I remember walking around the halls at Michigan State University, and my colleagues would run up to me with zip disks in their hand and say, my antenna transcriptome is on the zip disk, and it's got all the secrets to my organism. I can't open it in Excel. What do I do? And within a couple of years, I switched pretty decisively over to thinking, well, probably I don't need to generate data. Everybody on the planet, every biologist on the planet, is going to be generating more data than they can handle. So maybe I should switch to analyzing data instead. And so I took a pretty sharp right turn and I switched over to doing non-model transcriptomics and metagenomics, because that was what my colleagues needed.
And I figured that's what everybody else I knew in biology would need. And that's when I started working on k-mers pretty much full time, and I haven't stopped since then. So it's been, I don't know, somewhere between 12 and 15 years of k-mer focused bioinformatics as a way to deal with all of this data. That work ended up being what got me tenure. I worked on this k-mer software, and digital normalization, things that let us deal with large non-model transcriptomes and metagenomes.
I've used that software. It's really, really good. I used it years ago for downsampling. You know, with a bacterium, if you get 200x or 1,000x or 10,000x coverage, you don't want to assemble that; the assemblers often fall over or they pick the wrong thing to focus on. And it was quite nice, you know, to zoom in and just make stuff work.
Yeah. Thank you. I had this sort of epiphany, you know, about digital normalization that I still remember. I think I was driving home with my wife and I'd been looking at this ridiculously deeply sequenced single-cell E. coli data set. I remember thinking, why do we need 400x coverage? Wouldn't it be better if we just had like 50 to 100x coverage? And I was like, wait, that could work. What if we could estimate the coverage of each read and just throw it away once the coverage got high enough? And I think once I got home, I dipped out while my wife made dinner, and I should say that I cook regularly, so, you know, I don't always do this. And I think within 30 minutes I had something functioning, because I had a nice Python interface to the software and so on. So once I had the right idea, it was very easy to implement.
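As an illustration of the idea just described (estimate each read's coverage and discard the read once coverage is high enough), here is a minimal Python sketch. It is not the software discussed in the episode. It assumes an exact in-memory `Counter` instead of a memory-efficient probabilistic counting structure, it estimates coverage as the median count of a read's k-mers among the reads kept so far, and the names `kmers`, `digital_normalize`, `k`, and `cutoff` are hypothetical, chosen only for this example.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only while its estimated coverage is below cutoff.

    Coverage is estimated as the median count of the read's k-mers,
    counted over the reads that have been kept so far.
    """
    counts = Counter()  # assumption: exact counts; fine for small inputs only
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read is shorter than k, nothing to estimate
        estimated_coverage = median(counts[km] for km in read_kmers)
        if estimated_coverage < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept
```

With 100 identical copies of a short read and `cutoff=5`, only the first five copies are kept, while a read from an unseen region still passes through. Using the median rather than the mean is a design choice in this sketch so that a few erroneous, low-count k-mers in a read do not drag its coverage estimate down.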
And that ended up solving a lot of problems for us with the RNA-seq and also the metagenome stuff we were working on. I was working on soil metagenomes at the time. And that's another one of my sort of basic philosophies: if you can analyze a soil metagenome, any other metagenome is easier. So if you develop tools that work on the scale of soil metagenomes, then your tools will be as broadly applicable as they can be. So, not soon after, I mean, this is all happening over a period of about seven years, I ended up getting recruited, more or less, out to UC Davis, where I took the position I hold now, which is as a professor in the School of Veterinary Medicine. And it wasn't a natural position for me. I was very nervous about taking a position in a vet school. I sort of joke, and I think the joke has some level of accuracy, that they basically recruited me and hired me because they wanted somebody that they could have coffee with whenever they wanted to talk about data science or sequence analysis. And so, you know, in the vet school, I pursue my normal research program, wherever that may lead me. And then pretty regularly, I have coffee or meetings with my colleagues where they say, well, what would you do if you were in this situation? Sometimes I collaborate with them, and more often than not, they go away and they do their work. And sometimes I interact with the graduate students and help them analyze their data. But it's been a really interesting experience. You know, the other joke is everything has DNA. So if you work with DNA sequence, you can fit in pretty much anywhere. And when I arrived at UC Davis, I went through one of the two or three toughest periods of my career. I was exhausted from moving. My wife was starting a new job. I had a whole bunch of new people in the lab, and I thought to myself, I'm never going to be creative again.
And at the time I came across two papers. I came across this paper by David Koslicki and, I think it's Daniel Falush, called MetaPalette, which was about k-mers and species-level specificity. I think I was asked to review that paper, and I was asked to review another paper, which was the Mash paper by Ondov et al. from Adam Phillippy's lab. And that was on MinHash. And those two papers together sort of sparked something in me. And as part of my review for the Mash paper, actually, funny story, I didn't review the Mash paper the first time out. I gave it to my grad students and had them write the review. And I read the review and I was like, okay, it seems like a cool paper. I agreed with it and I sent it off. And then it came back, and one of the other reviewers had been kind of critical of the paper. And so I was like, okay, well, I need to dive in and figure out, am I going to be an advocate for this paper or not? And I read the paper and I said, oh my God, this actually works. This is amazing. I'd never come across MinHash before because I'm not a computer scientist. As you may note from all of my background, one of the things I have never formally trained in is computer science, although I was a computer science professor at Michigan State.
How did you become a professor without actually studying the topic?
You do a lot of programming. I guess you pick up some stuff. And the department at Michigan State, again, I'm channeling a little bit of my understanding, but the department was mainly concerned with, can he teach some of the CS classes? And it turns out that I could. I wasn't doing CS research, although of course my research ended up being pretty algorithmic. But, you know, I could teach programming. I could teach data structures and algorithms. I could teach some stuff.
I just wasn't trained that way. So it worked out okay. So I'd never heard of MinHash, and I read this paper and I thought, oh my God, this is amazing. And then I re-implemented it as part of the review. You know, it's like five lines of Python or 10 lines of Python to implement MinHash. And in my review, I wrote, I don't know what the other reviewer is smoking. This is the best thing ever, and you absolutely need to publish it. I didn't actually say, I don't know what the other reviewer is smoking, but I said, this is really transformative. Like, practically speaking, this is transformative. I know it's an old algorithm, but what the authors have done here has shown that it works exceedingly well for genomic sequence of all kinds. And it's going to change everything. I don't think I wrote, it's going to change everything, but that's how I feel now looking back. And a year or two later, I got this very nice email from, I don't remember if it was Adam or somebody else, saying, thanks for your review. It really helped push the paper over into publication. And so that was really nice. So those two papers together really lit a fire under me, and I started implementing this software called sourmash. And then I had a grad student at the time, Luiz Irber, who, as far as I can tell, just monitored my GitHub, saw that I was working on sourmash, and started contributing pull requests. And that's how sourmash got started. And that's now one of my main obsessions, which is the sort of MinHash-derived way of dealing with k-mers. We extended MinHash in a different way, not in a super novel way, but in a way that made sense for metagenomics. And ever since then I've been sort of mentally, and I guess computationally, exploring what you can do if you could look at all the k-mers and didn't have to worry about the fact that there are so many of them. And that's basically how sourmash works.
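To illustrate the MinHash idea mentioned above, here is a minimal Python sketch. It is not the Mash or sourmash code. It assumes a bottom-k sketch (keep the `num` smallest distinct k-mer hashes) plus a fixed-fraction variant that keeps every hash below `2**64 / scaled`; the hash function (BLAKE2b truncated to 64 bits) is a stand-in, and all function and parameter names are hypothetical. Real tools also hash canonical k-mers so that both DNA strands match, which this sketch skips.

```python
import hashlib


def kmers(seq, k):
    """Yield every overlapping k-mer in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (stand-in hash for illustration)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def bottom_sketch(seq, k=21, num=500):
    """Bottom-k MinHash: keep the num smallest distinct hash values."""
    hashes = {hash_kmer(km) for km in kmers(seq, k)}
    return set(sorted(hashes)[:num])


def estimate_jaccard_bottom(a, b, num):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the num smallest hashes of the union and count how many of
    them are present in both sketches.
    """
    merged = sorted(a | b)[:num]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)


def fraction_sketch(seq, k=21, scaled=1000):
    """Fixed-fraction variant: keep hashes below 2**64 // scaled.

    On average this keeps about 1/scaled of the distinct k-mers.
    """
    max_hash = 2**64 // scaled
    return {h for km in kmers(seq, k) if (h := hash_kmer(km)) < max_hash}


def jaccard(a, b):
    """Plain Jaccard similarity of two hash sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Because the fixed-fraction sketch grows with the size of the input instead of being capped at `num` hashes, two such sketches of very different sizes (say, a genome and a metagenome) can be compared directly with plain set operations. That is the motivation assumed for this variant here.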
It just throws away 99.9% of the k-mers. And that turns out to let you look at pretty much all of the genomes, all of the metagenomes, at once without really having to worry about memory or disk the way you would if you were looking at all of the k-mers. And so for the last, I guess, six years, I've really been focused on that. I was really worried about being creative again, and ironically, I look back on that period and it's been the most creative and most interesting part of my scientific career, or at least one of the most. So it just goes to show you, your own perception of where you are in your career and what you're doing may bear no resemblance whatsoever to the actual reality on the ground.
Absolutely. So yeah, in terms of k-mers, I was looking at a method recently called GAMBIT. It's really cool. You should check it out.
Gambit?
GAMBIT. G-A-M-B-I-T. It's from David Hess's lab. And it's really good because you look for not all k-mers, but k-mers with a particular prefix, and then you look at the bit after that. So it's like targeted k-mers. And I think it's based on the idea that the k-mers usually start at the beginning of start codons. So you get approximately one k-mer per gene in a bacterium, which is kind of cool because you could do lots of really good typing. And Lee has just gotten it up there. Anyway, targeted k-mers work really well. Check it out.
Yeah. So the other thing is, I did take a six-year detour from thinking obsessively about k-mers somewhere in there. And this overlaps a lot with COVID. Due to some of the training activities that I had done starting in 2010, I got connected with an NIH program officer, Vivien Bonazzi, who recruited me to work on the Data Commons Pilot Phase Consortium that the NIH was running. This was in about 2019 or so.
And I spent, well, I guess four or five years really focused on practical aspects of implementing infrastructure for large-scale data reuse at the NIH. That project started as the Data Commons Pilot Phase Consortium, and then it morphed into the Common Fund Data Ecosystem, which was a more targeted effort focused on the NIH Common Fund specifically. And I ended up being one of two people running a large NIH consortium for a couple of years. I left that in April of this year. And I think my takeaway from that is that I don't like meetings enough to keep doing coordination work, but it was really interesting and informative in thinking about where the real obstacles are to data reuse. And I would just say, I think the question of data reuse and FAIRness, as in findability, accessibility, interoperability, and reusability, is probably up there among the wicked problems that we face in science, and in biology in particular. And I think it's unfortunately one of these things where we need socio-technical solutions. We need solutions that really factor in how people work. And that is historically something that science has not been very good at.
And when you get into different countries and different states with different laws, it just becomes a minefield.
That's right.
And the amount of money that's spent on giving the illusion of sharing data, but without actually sharing data, is just obscene.
Yeah. Yeah.
I think we need to leave it there. We're going to cut this off and make everyone really upset at me and make everyone wait for the next podcast episode. So thank you. That was a good note to end on. Thank you very much for joining us on this podcast, and we will see you again shortly.
Thank you so much for listening to us at home.
If you liked this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.