Hello, and thank you for listening to the MicroBinfie Podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is a Senior Bioinformatician at the Centre for Genomic Pathogen Surveillance, University of Oxford. And Andrew is the Director of Technical Innovation for Theiagen in Cambridge, UK. I am Dr. Lee Katz, and I am a Senior Bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.
Hi, welcome back to the Microbial Bioinformatics Podcast. We're here again with Kostas Konstantinidis, and Nabil's out on holiday. Andrew's here with me. And let's get a little bit into metagenomics.
Yeah, so I have a question about soil metagenomics, okay? So you said in the last episode that 20 years ago soil metagenomics, or the soil microbiome, was unknown at that point, really. And that's why you got into the field, because there's so much out there to explore and whatnot. However, it seems like there's still a lot to explore, because if I went out to my garden right now and sequenced some of the soil from my flowerbeds, I'm quite sure I'd find many novel genera and species and probably even worse out there. So tell me about your foray into the soil microbiome, and what do you have to do to actually close all of those gaps?
I don't know. This is a very difficult question. Indeed, soil is probably the most diverse environment in terms of the organisms that it harbors. And I think you are right about what you said about your backyard. I'm not sure we need to describe all this diversity, but I think we need to understand what drives it, like the mechanisms. 
And I think, like in the human field, the soil microbiome field is also going in that direction. Like, OK, we have done the description of diversity in the last two decades. Now is more the time to understand the mechanisms, how it works. And I don't think we need to describe all of this diversity. It might be an interesting academic exercise, but I don't necessarily see a lot of practical applications. I think now our focus will be, you know, how they are adapting to changes, like, for example, climate change or agricultural practices, and how we can manipulate them to do what we need them to do, like, for example, retain more nitrogen in the soils, which is not so different from the human microbiome, right? How can you keep a human healthy? Do you have to give them some probiotics, give them some good microbes? I think that's where the field is, and I think that's what the future should be, really, to understand how it works. So then you can model it, but also manipulate it when it's not healthy, you know, make it healthy again with some additions. And I will say that we are not there yet. I think there's a lot to learn, because probably the era from 2000 to 2010 was more descriptive, and also after 2010, and I think only now, recently, have we started asking some more hypothesis-driven questions and using metagenomics to help us answer them. But I think overall, it's a great topic. I think it's very important. In my view, it's as important as the human microbiome, but the human microbiome is getting all the attention right now. But, you know, if you think about it, soil is really feeding us through agriculture, etc. So I hope there's more funding to understand these mechanisms and how it works. There is some funding, but in my view, it's still a big frontier, and I think it needs more study, and I don't think we have major breakthroughs or conclusions yet. There are some exceptions, I think. 
Well, what I always found fascinating is where plasmids come out of nowhere or phage come out of nowhere, and clearly they're somewhere in the environment or somewhere in some kind of microbiome, and they magically appear and they cause problems or they don't cause problems. But there's so much we don't know about what's around us. I had a postdoc go out and take some soil samples from around the Institute and then, just as a quality control step, sequence them on the MinION to do some long-read sequencing and test out some protocols. And, you know, we're getting fully complete chromosomes of completely novel species and novel genera. And that's just phenomenal that we can just do that. There's so much we don't know just right under our feet, literally. And that's the world over as well, you know, because obviously we have diversity everywhere. So I think it's going to keep us occupied for many, many, many years. But anyway, that's kind of a bit of a digression. You and your lab have actually done quite a fair bit on lots of different software. And I think we should probably go on and talk about that. Over to you, Lee.
Yeah, I think that earlier the soil question was just so heavy. Maybe I can help frame it just a little bit, and bring in the software for sure, because I just know secondhand from listening to your talks while I was at Georgia Tech and elsewhere that you've developed your own metagenomic software to study, for example, the metagenomics in a lake or elsewhere. And maybe you can tell us kind of where you're coming from, why you develop the software, and what metagenomics you're looking at.
OK, there's too much to answer here. No, that's fair. The question is very broad. I think I'm going to give a couple of examples. So indeed, we have developed some pieces of software. As far as I remember, there was usually a wall that we ran into. 
And there was no good tool available to do exactly what we wanted to do. And that's when we took the time to develop the software. So, for example, and you know, you guys mentioned that to me offline, Nonpareil. This is a tool that, you know, a great former student, Luis Miguel, developed. And basically, what Nonpareil does is, if you have a metagenome, it can tell you how much of the total DNA that was in that sample you have sequenced. So have you sequenced, I don't know, 10% of what is there or 100%? And, you know, when we developed Nonpareil, probably, I'm going to say, 2013, 2014, there were no tools available to do that. Now there are a couple of other tools, obviously. But that's what I'm saying, as an example: my students were coming to me and saying, you know, Kostas, we sequenced. And then I was asking them, how much did we sequence? Did we sequence all of it? And then we realized, oh, we need to somehow quantify that. And Nonpareil is doing exactly that. I think it's a very handy tool if you want to know, oh, I have sequenced 99% or I have sequenced 9%. And also, the other cool thing is that it's based on a simple idea. And I will say most of our software, even if it looks complicated, has a simple idea behind it. For Nonpareil, the simple idea is looking at how redundant the reads are. So if you have sequenced everything that is there, most of your reads should be redundant to each other. And the other thing it does, and I think it's very cool, is that if you have undersampled, it projects how much sampling you need to do to get to the 99%. And when we did the comparative analysis between lakes, the human gut, and the soils and sediments, soils were the most diverse. I think we estimated that in order to get to 99% coverage of what is there in the DNA we extracted, we usually needed on the order of a terabase of sequencing. 
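The redundancy idea described above can be illustrated with a toy sketch. This is not Nonpareil's actual implementation or interface: the function names (`kmers`, `redundancy`, `sample_reads`), the shared k-mer definition of a "redundant" read, and the random single-genome "community" are all assumptions made for illustration. It only shows the intuition that the fraction of reads matching some other read rises as sequencing effort approaches the total DNA in the sample; the projection to 99% coverage is not attempted here.

```python
import random
from collections import Counter


def kmers(read, k):
    """Return the set of k-mers contained in one read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def redundancy(reads, k=20):
    """Fraction of reads that share at least one k-mer with another read.

    Sharing a k-mer is a simplified stand-in for "this read overlaps
    another read"; it is an assumption of this sketch.
    """
    if not reads:
        return 0.0
    per_read = [kmers(r, k) for r in reads]
    seen_in = Counter()  # k-mer -> number of reads containing it
    for ks in per_read:
        seen_in.update(ks)
    redundant = sum(1 for ks in per_read if any(seen_in[x] > 1 for x in ks))
    return redundant / len(reads)


def sample_reads(genome, n_reads, read_len=50, rng=None):
    """Draw n_reads substrings of length read_len at random positions."""
    rng = rng or random.Random(0)
    last_start = len(genome) - read_len
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical 2 kb "community" consisting of a single random genome.
    genome = "".join(rng.choice("ACGT") for _ in range(2000))
    for n in (5, 50, 500, 2000):
        reads = sample_reads(genome, n, rng=rng)
        print(f"{n:5d} reads -> redundancy {redundancy(reads):.2f}")
```

Running the script prints the redundancy at increasing sequencing depth; low values suggest the sample is undersampled, and values near 1 suggest most of what is there has been sequenced.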
And so, as an example, I think this is how we have been working: hitting a wall, then taking the time to develop a piece of software to help us. In terms of what we are doing in the lab in metagenomic projects, we use them a lot to understand how microbial communities respond to perturbation. Say, in the human gut, when there is an infection, that's a kind of perturbation. But also in the soil, for example, where there is an oil spill, what is going to happen and how do microbial communities respond? Hopefully we can also understand how fast horizontal transfer happens; that's an interesting question to me. So going back to what Andrew was talking about, about all these plasmids, et cetera, I think it's very cool that we have the technology now to see them. And I think some of the questions that are interesting, at least to me, are to figure out how fast they are moving between organisms. Is it on a timescale of days or months or years? And do they help them adapt to the perturbation? So to me, that's an important question that we are working on, as an example of what Lee asked me about, you know, what we are doing with these metagenomes.
Yeah. So do you want to go a little bit into the perturbations that you did and which environment you did that in? That sounds interesting.
I don't know. I have a couple of examples because we have several projects in the lab. I want to give just one. So if you remember, you know, almost 10 years ago there was this big oil spill in the Gulf of Mexico. Sure. The Deepwater Horizon accident, where millions of gallons came on shore. So we went there, and we took metagenomes before the oil reached the sand but also when the oil was in the sand. So this is the coastal environment, the beach sand. And we saw that, you know, there are at least a couple of microbes that we could not detect before. And during the oil, they were like 10, 20% of the total microbial community, which is very high. 
Like, you don't usually see 10, 20% unless you have, for example, these oil spills or a human infection in the gut, where there is an organism that goes to 10, 20% of the total. So apparently these organisms are very good at degrading oil, and they are hanging there at very low abundance. But when they see the oil, they grow very fast. So we sequenced the metagenome. We assembled the genome. We named it as a Candidatus. And now my recently funded NSF project, US NSF, that is the National Science Foundation, is about figuring out what biosurfactants these organisms excrete that make them so successful at breaking down oil. So that's what I'm saying. Now we are going more into the mechanism, you know, okay, why are they so successful? Are they excreting something that, you know, dissolves the oil, and they grow on the oil? And that's, I think, a nice example of what I mean: okay, we did the descriptive study, but now I think it's important to understand, you know, how they are so successful. And obviously in this case there's also the practical application, right? Cleaning the oil, right? If we can figure out how they do it, if they are excreting some surfactants, maybe that becomes a very interesting biotechnology to deal with the next oil spill. I think that gives a nice example, in my view at least, and it's not done, it's work in progress. So we don't have a good surfactant yet, but I hope we can discover it in the next two, three years. But I think- It's incredible. Yeah, an example of how you use the tools, and now we are more on the mechanism and understanding how it works, why they are so successful. But again, like what Andrew said, I think what really facilitated our research was the metagenomic tool. We were able to assemble the genome out of the metagenome, see that it's a novel taxon, and see that it has some very interesting genes in the genome that, you know, have homology. They are similar to genes that previously have been shown to be involved in excreting surfactants. 
And so that made us create the hypothesis that maybe that's what they do and why they're so successful. So I think it's really a project where this technology has helped a lot to go to the next step and do something that I think is very exciting right now.
Can I ask this question? I feel like, since this is a bioinformatics podcast, what is the software or the algorithm that you're using to go from metagenomes to figuring out whatever this gene pathway is, to get to figuring out surfactants and just kind of classifying all that?
Well, okay. Another difficult question that I think-
Ignore me if it's too difficult. I don't wanna ambush you again.
I think it's difficult because basically there are many tools that you can use for every one step. And I cannot pretend I know all of them. It's a very difficult thing to keep up with, you know, the number of pieces of software that have been published even for the same task. And if I tell you, okay, we use X, there might be another 10 that are better than X. But X is what we found working stably in our hands, and we are using it. So I am okay to say, you know, what we are using, but I wanna warn the audience that there is a ton of other software for the same purpose, for some of the same steps. And I cannot pretend I know all of them. It has been very difficult to keep up.
Fair, very fair.
So, you know, the steps are, okay, you need to extract the DNA. Usually we use a kit. In the old days it was MO BIO; now I think Qiagen has bought them. We found it to work well in most of the soils. And then we do the sequencing. In the old times with short-read technology, but I see long reads also becoming important now. And then I think for the assembly and binning, it will depend a little on what data you have. So this work I'm talking about, the oil spill, was done, at least the metagenome sequencing, et cetera, seven, eight years ago. 
So it's based on short reads only, using some of the tools that were very good back then for the assembly and then for the next step, which is the binning. And now there are other tools that I have seen in the last three, four years, and maybe they are doing better or slightly better. I don't know. I cannot tell you. And then usually you have to do your bioinformatic sequence analysis. And we like to use the curated databases as a first step, like, for example, Swiss-Prot, or what has come out of the Swiss-Prot idea. But as I always say to my students, for the functions that we really care about, we also check against GenBank at the end, because the latest information would be in GenBank, but not necessarily in the curated database. The curated databases lag a couple of years behind, in my view. So they're very good and give you reliable information when you wanna annotate something, but the latest will always be in GenBank and not necessarily in the curated databases. And I think that's more or less what everybody's doing in the field. They might use a different tool for one of these steps. There are several steps. I think there are at least five, six distinct steps. And there are pipelines that do all of them for you nowadays as well. So I think it's becoming mature as a technology, as a tool, but not standardized yet. Every one of them will give you slightly different results, and it's up to the user to figure out what the advantages of the different tools are for what they wanna do. And this is kind of the process.
Can I ask, what kind of tools and resources and methods have you got in the pipeline that are coming soon? Because I'm really interested to know, what is the future of this field? Or can you not reveal it yet? That's fine as well. But we want a teaser.
I'm not sure I'm up to date. My students are, much more than me, and then I learn through them. But I think the field is also still fluctuating a lot. 
So I don't think I can give you a great answer. But, for example, I think long-read technologies are gonna revolutionize the field. Five years ago, that was not possible, because for long-read sequencing you needed a lot of DNA, and of high molecular weight, and we couldn't extract that from the soils or sediments. We always got small pieces. But I have seen recently a couple of studies that managed to do that, also because the technologies have become better. They need smaller amounts of DNA. Now I think we are at a few nanograms, whereas five years ago we were at a few micrograms. So I think the long-read technology is changing the field a lot, because basically you can get better genomes, more complete or fully complete. And I think there are a couple of tools that are doing this, or they do hybrid assemblies and hybrid binning. I don't know the names of all of them, but I think that's where the state of the art is right now. And I think the long-read technology is gonna revolutionize the field again. It is already doing it, but I think it's gonna become cheaper and more accessible to everybody in the next couple of years. So if you're asking me, that's what I think a lot of the future is. The short read is still very useful, especially for diversity discovery and abundance estimates. But I think the long read has caught up, and I think we're gonna see more of it in the next couple of years.
And I guess the advantage of, like, a MinION is you can literally take it to the field. You can take it on a ship, you can take it to a beach where there's an oil spill, which is phenomenal when you think about it. You can actually get the freshest of fresh samples on the fly, which does matter actually, because so many bacteria will die if you freeze them.
Yeah, yeah, that's another big advantage, I think.
Yeah, in my field, we have ethical poo. 
So from gut samples where it's fresh and it's only been refrigerated, it's never been frozen. And you need that, you need it fresh to get the most out of the microbiome, because you see a huge difference. As soon as you go and freeze it at least once, you know, you'll get a huge die-off. And every day.
Yeah, so we're almost at the end of our discussion here, and at this time I just wanted to ask you if there are any other points you want to make before we sign off. Some interesting things from your lab or some interesting.
Or if you have job positions you want to advertise, to get people to come research with you.
Yes, we are always looking for people who are interested in working with us. As I told you, we are working both on the human microbiome, usually through collaborators, like, for example, at CDC or Emory University, the big medical institutions in Atlanta but also outside Atlanta. So if people are interested, they can check out the website and shoot an email. We have some interesting publications on ANI gaps, including within species, that are coming up. People should check them out and let us know what they think. And it was a great pleasure talking with you guys and, you know, you asked me some very difficult questions and broad questions that I'm not sure I answered well. I tried my best, but it was a pleasure for sure.
Well, thanks everyone for listening. This is Dean Kostas and he's been an amazing guest. Just the originator of so many things that you may or may not have known that you've been using this whole time. And we'll see you next time.
Thank you, Lee. Bye.
Thank you so much for listening to our podcast. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. This podcast was recorded by the Microbial Bioinformatics Group. 
The opinions expressed here are our own and do not necessarily reflect the views of CDC, Theiagen, or the Centre for Genomic Pathogen Surveillance.