Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S.
Hello and welcome back to the MicroBinfie podcast. Andrew and Nabil are your hosts today, and this is part two of our extended holiday special on bacterial taxonomy. With us are Professor Ian Sutcliffe, Professor Phil Hugenholtz, and Professor Mark Pallen. Join us, and we will jump right back into it, talking about integrating modern genomics into taxonomy and nomenclature.
So Phil, GTDB is definitely the gold standard now. Despite what anyone else says to me, it is definitely the gold standard for modernizing and integrating genomics into the classification and taxonomic problems that we were talking about in the previous episode. So Phil, what is GTDB, though, and what does it try to accomplish?
Well, basically it's a taxonomy that's based on genomic comparisons. We are far from the first to make phylogenetic trees from concatenations of single-copy conserved genes, so that's been around for a while. And I guess the main thing that we've done differently is, first of all, that we have been as comprehensive as possible. So we have taken as many genomes as we can get, and we've got them from a single source, which is NCBI, from the Genome Assembly Archive there.
And we have taken a set of concatenated single-copy genes, 120 in this case, and that's a moving target because, depending on who you talk to, there'll be people saying, you know, we should or shouldn't include a particular single-copy gene. But what I would stress is that the trees that come from these types of comparisons are pretty comparable for the large part. And actually I'm very happy to see a kind of convergence on the structure of, say, the bacterial domain that's coming out recently. But because we make the tree from as many genomes as we can get our hands on, we're sort of forced to use a suboptimal, or heuristic, way of inferring the tree using FastTree, which is the only maximum likelihood inference method that will scale. And we've had some criticism of that, that, you know, it's not perfect, and it definitely is not, but it's adequate for the task. And then what we've really spent more than half our time on is overlaying a hierarchical taxonomy on top of that. So within any given tree, particularly if you've got a tree that's got 30,000 tips, you actually have many, many stable interior nodes, and we're using the canonical taxonomy: domain, phylum, class, order, family, genus, species. You often have many more nodes that you could put a label on, but we're using those seven ranks. So we're basically taking a tree and overlaying a taxonomy onto it. And what we've tried to do is do that in a systematic way by taking relative evolutionary divergence into account. And I'm very happy with this. Actually, when I wrote the original grant, I hadn't incorporated that into it; it sort of came up later. And I think that was a real winner for this taxonomy, because I'm actually not interested in taxonomy per se, I'm interested in evolution. And so having a standardized taxonomy that takes evolution into account, I think, is a great way to move forward.
And so the idea here is that if you have a node in a tree that represents an ancestor, say a family ancestor, it's comparable to other family ancestors within the same tree. And what you're really saying is that the organisms at those interior nodes co-existed on the planet together, more or less; that's the explicit idea there. It sounds pretty simple, right? But actually, it's a lot of work, and we've spent an awful lot of time on it. To give you an idea, we're working on the latest release, which is 207 at the moment, and there are 17,000 extra species to include in the tree. And we've been curating it for the last two months. So it does have a lot of human curation. There's also a lot of automation that gets us to a point that makes that curation feasible. We've had criticisms that it's just thrown together, and it's not looked at, and it's just automated. But that's not really the case; there's a lot of manual curation of the taxonomy.
When you do the relative evolutionary divergence, how do you actually pin the nodes to a particular time? How do you decide that this branch is 100 million years old, or a billion years old?
We don't actually connect it to time; it's relative. So we set the root of the tree at zero, and then all of the tips at one, and then we do a linear interpolation from the root to all tips. And so it's a relative time. Generally, genera fall from about 0.85 to 0.95 on that zero-to-one scale. It's also important to point out that you don't compare those relative values between trees, since people have wanted to do that. You can compare within a tree, but not between trees. You can then make a time tree out of it, but then you need to have some points that you can pin down to particular times, which, as you know, is very, very difficult for bacteria, because we don't have a fossil record to work with. But in theory, you could do that.
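As a rough illustration of the root-at-zero, tips-at-one interpolation described above, here is a minimal Python sketch. It assumes the formulation given in the GTDB publications, RED(n) = p + (d / u) * (1 - p), where p is the RED of the parent, d is the branch length to the parent, and u is the mean distance from the parent to the tips below n. The `Node` class, the function names, and the toy tree are hypothetical and are not taken from the GTDB code base; unique node names and positive branch lengths are assumed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    length: float = 0.0                      # branch length to the parent
    children: list = field(default_factory=list)


def tip_distances(node):
    """Distances from `node` down to each of its descendant tips."""
    if not node.children:
        return [0.0]
    return [c.length + d for c in node.children for d in tip_distances(c)]


def assign_red(root):
    """Return {node name: RED}, with the root at 0 and every tip at 1."""
    red = {root.name: 0.0}

    def visit(node):
        p = red[node.name]                   # RED of the parent
        for child in node.children:
            if not child.children:
                red[child.name] = 1.0        # tips are fixed at 1
            else:
                # u: mean distance from the parent to the tips below `child`
                dists = [child.length + d for d in tip_distances(child)]
                u = sum(dists) / len(dists)
                red[child.name] = p + (child.length / u) * (1.0 - p)
            visit(child)

    visit(root)
    return red


# Toy tree (hypothetical): one slow-evolving and one fast-evolving clade.
toy = Node("root", 0.0, [
    Node("slow", 1.0, [Node("s1", 1.0), Node("s2", 1.0)]),
    Node("fast", 3.0, [Node("f1", 3.0), Node("f2", 3.0)]),
])
red_values = assign_red(toy)
print(red_values)
```

In this toy tree the slow and fast clades both come out at a RED of 0.5, even though their raw branch lengths differ threefold, which is the rate normalization that a flat sequence-identity threshold does not provide. The values are only meaningful within this one tree.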
And it is a simple approach that we developed, but then of course, as you often find, others have done it, and there are other ways of doing this relative evolutionary divergence. There was a nice paper from Antonis Rokas's group, where they did it on fungi with a more sophisticated model and compared it against our simple model. And I was very happy to see that it was actually pretty consistent. One thing I'll add is that we create corridors, because we're fitting seven ranks to a tree. We have a corridor, and that's actually a saving grace, because we didn't want to go in and completely upend the existing taxonomy. We wanted to have some movement so that you'll have deeper and shallower genera, and deeper and shallower families, within a corridor. So we can actually accommodate taxonomic opinion. I know that one of your guests spent a lot of time on Mycobacterium, and we can accommodate splitting or lumping that, because it's within the corridor, right? So we're following taxonomic opinion. But if somebody then comes out and says, you know what, Mycobacterium is a phylum, well, we can't follow that because it's completely outside our scheme. But within a corridor, there is some flexibility. And we do try to follow the taxonomic opinion of others. We do try to follow nomenclature as much as we can. And I certainly don't want to be the arbiter of nomenclature, because that's a very sticky wicket, which Ian can speak to better. But we try to take the nomenclature as best we can. But yeah, it's a very passionate area, I guess, as we'll get on to later. To my mind, we create a taxonomic framework, and then you can fill that taxonomic framework with names, as Mark has been doing very, I guess the word is aggressively, or let's say very enthusiastically. And I admire that he made a very good case for it. As a microbial ecologist, I have some reservations too, but I'm very interested to see where that will go.
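The corridor idea can be sketched as a simple interval check on a node's relative evolutionary divergence. This is only an illustration under stated assumptions: the genus interval uses the approximate 0.85 to 0.95 range quoted in the episode, the phylum interval is a made-up placeholder, and `CORRIDORS` and `fits_corridor` are hypothetical names rather than anything from GTDB itself.

```python
# Rank corridors as (low, high) RED intervals.
# Only the genus interval reflects the range quoted in the episode;
# the phylum interval is a hypothetical placeholder for illustration.
CORRIDORS = {
    "phylum": (0.25, 0.45),   # hypothetical
    "genus": (0.85, 0.95),    # approximate range quoted in the episode
}


def fits_corridor(rank, red, corridors=CORRIDORS):
    """True if a node with this RED value can carry a name at this rank."""
    low, high = corridors[rank]
    return low <= red <= high


# A deeper (lumped) and a shallower (split) genus both fit the corridor...
print(fits_corridor("genus", 0.86), fits_corridor("genus", 0.94))
# ...but the same shallow node cannot be called a phylum.
print(fits_corridor("phylum", 0.94))
```

The point of the interval, rather than a single cut-off, is that both a lumped and a split treatment of a group can be accommodated as long as the named node stays inside the corridor for its rank.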
Before we move on to nomenclature, I think there are some interesting questions about taxonomy. There are some surprises when you look at molecules and sequences. I mentioned Mycoplasma: in the old days, Mycoplasma was thrown out into its own phylum, and they were seen as, oh, they must be primitive organisms, they're so different from everything else, you know. And then Carl Woese, I mean, I remember reading his review, Bacterial Evolution, and it was just an amazing moment to see that. In there, he spends a lot of time saying, well, the mycoplasmas have evolved quickly, they're not a phylum. I'm not even sure where we are now, are they a family or a class? I can't remember. But they're an order, okay, so they're shoved down into the Firmicutes. And so how many other surprises are there like that, do you think, in your classification, where people who haven't thought about taxonomy much will say, hang on, I never realised that?
Mycoplasma is a good example of why you don't want to use a straight threshold. Taxonomists have been quite keen on using flat sequence-based identity thresholds. And I think that's when you get into trouble, because a very fast clock group like the mycoplasmas looks much more deeply related than it actually is once you take that evolutionary divergence into account. That's one of the most striking ones. I guess the other one is the CPR, or the Candidate Phyla Radiation, another very fast clock group. The building consensus now is that it's a derived feature, and the long branches are due to rapid evolution. And it's a single phylum rather than up to 100 phyla based on the RED values, and a sister lineage of the Chloroflexota, if I'm using the new nomenclature. So yeah, I guess that was another one.
I mean, I was very concerned about the CPR, because Jill Banfield's group had been driving the CPR, and Jill is a very smart cookie, and I was a bit concerned that we had such a discordance between our estimates on it, you know, one phylum versus 100. So I was very keen to try to resolve that, and then got involved with people like Tom Williams to try to root the tree, because you need to know where the root is: the CPR is often portrayed, in the iconic picture from the Hug paper, as a basal lineage. And the most recent analyses, and there have been a few, there's one from our combined effort and one from Frank Aylward, indicate that the CPR is actually a sister lineage of the Chloroflexota, and derived. So again, that harks back to this: my main interest is actually in the evolution and understanding how the organisms came to be.
Yeah. I want to cross over to Ian on his thoughts on all that Phil has presented here.
Well, I was going to jump in and say something about Mycoplasma, because Mycoplasma has been fairly controversial at the other end of the taxonomic hierarchy, the species and genus level, as well. And it's quite a useful illustration of the way rules work in the code, because there's been a lot of controversy about the fact that Radhey Gupta proposed large-scale reclassifications within the Mycoplasma group. It's been long recognised that the rules of nomenclature say that the genus that contains the type species, Mycoplasma mycoides, which was described in the early 20th century, must be called Mycoplasma. But of the many other species of Mycoplasma that are known, now that we can do things like phylogenomic analysis, they don't fit in the same part of the tree as that particular species.
So Radhey Gupta came up with a scheme that proposed the renaming of those species into genera like Mesomycoplasma and Metamycoplasma and so forth. I think I've got those names right. And that's been met with almost universal hostility because people don't like the renaming of organisms, basically. But my take on that is that actually that's what the science says. And so that raises this question of what do you do when things are renamed, and how long does it take people to get used to them? And you know, Radhey has followed the rules of the code; his paper is very proficient in that respect. All of his nomenclature proposals were done correctly. So all of the names he proposed are valid names, and, you know, they've left the genus that contains the type species Mycoplasma mycoides called Mycoplasma, and they've come up with names that recognizably connect those other genera to their original roots in Mycoplasma. I think that's a really good pragmatic approach. It's interesting to observe that it's been met with considerable hostility, and he's also had the same experience with Mycobacterium, as Phil mentioned earlier.
Can I just chip in, Ian? It's ironic because Mycoplasma is actually in a separate family from all of those new genera, you know, Metamycoplasma and Mesomycoplasma. So I think that's been an issue in nomenclature for a while, that the type is so far removed from the majority of the other ones that we've described. But yeah, we follow that approach in GTDB too.
I do think this is what we're seeing in the current sort of Twitter storm around phyla. It's that people say, well, we've always called it this one thing. Why can't we keep calling it that? And actually we, as taxonomists, also have to be flexible to adapt to changes in knowledge, and changes in knowledge, particularly methodological changes, tend to change names.
I think that, to go back to the historic dimension on this, we have a taxonomic framework and a nomenclature framework that was established in the mid 20th century and proved itself reasonably proficient or fit for purpose, particularly after order was brought to some of the chaos by the publication of the Approved Lists of names in 1980. And in that era, because of the lack of molecular methods, we were highly reliant on combinations of phenotypic methods and things like wet-lab DNA-DNA hybridisation. But then in the 1990s, and certainly with the advent of PCR, we saw the revolution that was caused by 16S rRNA analyses. And that resulted in a raft of reclassifications, but people got used to it. And, you know, my view is that people quite quickly get used to new classifications and to applying new names. Now in the last decade, we've seen a raft of reclassifications on the back of phylogenomic analysis. But that's because we've got this new tool, which is phylogenomic analysis. I do think there has been a raft of reclassifications, and some of them are, like in the example I mentioned by Radhey Gupta, fairly large-scale papers. But I think actually the system will settle down, because as everyone starts to apply genomic taxonomy, we should arrive at a relatively stable classification, and we should arrive at it relatively quickly. I mean, I know Phil is passionate about the GTDB; I would say that other classifications are available.
I made the point on Twitter: bacterial taxonomy is a solved problem now. Phil has shown us the way. Okay, you can argue about the details, a few little things here and there, but the broad framework is now accepted. And these arguments are going to stop soon. And we should celebrate that. We should stop quibbling about all these little issues, stand back and see the big picture and say, look, wow, we've got a method, it works, we can see the big picture.
We have a method that doesn't just work for the 2% that we can culture, but works for the 98% we can't. I mean, you used the term sublime to describe the kind of grandeur of microbial life. We have a method that works not just for that 2%, not just for the 2% of 2% that are clinically important, but works for all bacteria, wherever, going forward for centuries and millennia to come. So I think that these little petty arguments, we should put them in their place and say, right, you can argue about angels on pinheads. But the grand vista is ahead of us now, and this is something to celebrate. This is amazing. It's an amazing time to be alive. And we should be positive about that, rather than be pernickety about all this kind of stuff. That's my opinion.
I thank Mark for those comments. But I would say, let's consider it more a prototype to show that it's possible. And science moves forward, methods improve. I'm not saying it's going to last for centuries.
If we're going to go back into the detail, can I just ask you, how does the NCBI taxonomy work? And how does that differ from what you're doing? Why do you have to do what you're doing when they have their own taxonomy? So the NCBI has its definition of phylum and class and order, and it has presumably some way of deciding how to throw things into those ranks. But it's a bit opaque to me. I don't know, Phil, if there is order in their approach or whether it's all ad hoc. What's your feeling? Do they use a DNA-based approach to make those decisions?
Yeah, I think they do to some degree. They certainly use average nucleotide identities to define species, as we do for GTDB. And they do a lot of work, and we rely very heavily on them; you know, we use a lot of their work. So I'm not about to diss them. They don't use a rank-normalized approach, and they use many different sources of information: published information, some of their own work.
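The use of average nucleotide identity (ANI) to define species can be sketched as a threshold check against species representative genomes. This is a minimal sketch under stated assumptions: the 95% cut-off is the commonly used convention and is not a figure given in the episode, the ANI values are invented, and `assign_species` is a hypothetical helper, not the method of either NCBI or GTDB.

```python
def assign_species(ani_to_representatives, threshold=95.0):
    """Pick the closest species representative, if it is close enough.

    ani_to_representatives: {species name: ANI (%) between the query genome
    and that species' representative genome}. Returns the best-matching
    species name, or None if no representative reaches the threshold
    (i.e. the query would seed a new species cluster).
    """
    if not ani_to_representatives:
        return None
    best = max(ani_to_representatives, key=ani_to_representatives.get)
    return best if ani_to_representatives[best] >= threshold else None


# Invented ANI values for two query genomes.
print(assign_species({"species_A": 97.8, "species_B": 88.1}))
print(assign_species({"species_A": 91.2, "species_B": 88.1}))
```

The first query falls inside the assumed 95% radius of `species_A`; the second matches nothing closely enough and would be treated as a candidate new species. Computing the ANI values themselves requires a dedicated tool and is outside this sketch.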
So yeah, I think it's a combination effort to come up with the taxonomy. I mean, one of the reasons I got into this initially was to try to get a bit more clarification around all the uncultured stuff. Initially that was through 16S, when I used to curate Greengenes: on the NCBI taxonomy page there were just miles of these environmental clones that were just unclassified, and that used to bug me. So I got into it from that point, and GTDB is sort of the next iteration of that, where we're using genomes. And I think that's one of our major contributions, providing a full classification for all of these uncultured taxa, which is definitely, I would say, missing to a large degree from the NCBI taxonomy.
Yeah, let's follow on from that with uncultured. Andrew, you had a question or a point.
Yeah, I've made the mistake of doing long-read sequencing on metagenomic samples, and then I made the second mistake where I fed all these reads or assemblies into a 16S database that shall remain nameless, and also then into GTDB, and I was quite surprised by some of the major differences in the high-level calls that these are making. Some of them are just crazily different, and I guess what I'm wondering is, is 16S just a load of rubbish? Will we look back in a few years and say this was a very big misadventure, we spent huge amounts of money on it, and, you know, most of it has worked out to be complete crap?
So I would not say that. I mean, having made my start in 16S, I would not agree with that. I would say what's happened, if you're making large 16S trees and you're using environmental sequences, and I published this many years ago, is that through the PCR process you can create chimeric sequences, and this tends to corrupt your trees to some degree. And we were estimating at least 4% of the 16S sequences are chimeric to some degree.
And even worse, some of the chimeras are actually highly reproducible, because what's happening is the polymerase is falling off at a particular bend in the secondary structure, since the template will form secondary structure, and then it will recombine locally with another sequence. So there's a well-known instance in the human gut where you get a chimera between two quite distantly related bacteria, almost always around the same region. So I think that's adding noise to it. And also, I think if you're trying to make a tree with 30,000 tips out of 16S, you do seem to get an effect from the chimeras, but you also just get a reduction in the bootstrap support for those very, very large trees. That's been my observation anyway.
Yeah. I want to bring Ian into the conversation on this.
I was going to agree with Phil. I think 16S has been highly successful. I think, you know, some of the details we got from the 16S trees are a little blurred, but 16S was a real milestone in terms of our ability to delineate species well, and there have been reclassifications, as I said earlier. And now, I think what's really important is this recognition that we have good metrics like ANI or digital DDH that can allow us to define species pretty convincingly. And I think the interesting work to be done, and this is where some of the big reclassifications are coming in, is at the level of the higher taxa. You know, how do those species group into genera, how do genera group into families, and so forth? And that's where, you know, I absolutely agree that the GTDB is well recognised as the gold standard. And I think some of the reclassifications that we're seeing come from mistakes that were introduced through 16S being applied at those higher taxon levels. Its utility as a metric for delineating species, certainly from the mid nineties through to, I guess, the first decade of this century, has been extremely valuable.
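To make the chimera point above concrete, here is a toy Python sketch of one simple way to flag a possible chimera: check whether the two halves of a sequence are each best explained by a different reference, and whether that two-parent explanation beats the best single reference. This is an illustrative assumption, not the method used in the work mentioned in the episode; real chimera checkers work on proper alignments and scan many breakpoints. All names and sequences are hypothetical, and the sequences are assumed to be pre-aligned and of equal length.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def best_match(segment, references, start, end):
    """Name and identity of the reference closest to `segment` over [start, end)."""
    return max(
        ((name, identity(segment, ref[start:end])) for name, ref in references.items()),
        key=lambda pair: pair[1],
    )


def looks_chimeric(query, references, min_gain=0.05):
    """Flag `query` if two different parents explain it better than any single one."""
    mid = len(query) // 2
    best_single = max(identity(query, ref) for ref in references.values())
    left_name, left_id = best_match(query[:mid], references, 0, mid)
    right_name, right_id = best_match(query[mid:], references, mid, len(query))
    # Length-weighted identity of the two-parent model.
    two_parent = (left_id * mid + right_id * (len(query) - mid)) / len(query)
    return left_name != right_name and (two_parent - best_single) >= min_gain


# Hypothetical aligned reference fragments and two queries.
refs = {"taxon_A": "AAAAAAAAAA", "taxon_B": "CCCCCCCCCC"}
print(looks_chimeric("AAAAACCCCC", refs))  # halves match different parents
print(looks_chimeric("AAAAAAAAAC", refs))  # one parent explains both halves
```

A single midpoint breakpoint is a deliberate simplification; it is enough to show why a reproducible breakpoint near the same region would produce the same artefact repeatedly.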
I just wanted to ask Ian really quickly, for those people who would be interested, what are some of the alternative methods to GTDB? You kind of touched on it, but didn't mention anything specific.
People have different tools that they use for processing genomic data. I work very closely with my colleague, Vartul Sangal, who is very enamoured of a tool which I can never pronounce. Phil might be able to say it properly. Is it PhyloPhlAn?
PhyloPhlAn.
And you will get slightly different clustering from these different phylogenetic approaches. And that's really what I meant. I mean, obviously I'm not actually aware of other really large-scale, user-friendly websites like GTDB, where you can look at the whole hierarchy of the bacterial and archaeal worlds like you can within GTDB. I was thinking more of the application of different phylogenetic methods to studies at, you know, say the genus level or the family level.
I can make a plug for other tools. There's TYGS, which is from the DSMZ and LPSN guys. I think it stands for Type Strain Genome Server. So they have a larger-scale taxonomy. There are other tools, but yeah, so that's what I mean by just being a little bit cautious about saying this is the only solution, because it's not. But I think more generally, we should look at it as part of our adolescence as we move from the old phenotype-based classification, which is not going to scale, to a fully genotype-based classification, which will scale. I think that's the most important thing. And, you know, Mark made that point very well earlier on.
I think another thing that came out, just to pick up on a comment I made earlier: I've written, and indeed others have written, to criticise this sort of salami slicing, what I described earlier as the one colony, one species, one paper approach. And I'm conscious of the fact that that's quite a negative criticism in the sense that it's telling people what not to do.
The question that, I think, should be asked alongside that is what should people be doing? I personally think that we now have the opportunity to look at things at a slightly larger scale, whether it's larger-scale papers that look at, say, genera and families, or papers that look at bundles of species together.
We've done that with Aharon and others: when we did the chicken gut microbiome, we named over 600 species in a single paper. It's possible, and it is the way ahead. It doesn't scale, like you say, to have a single paper for each species. We've got to move ahead and scale with the times. What we did was we initially had a spreadsheet where we named all the species and we put in traditional protologues explaining the Latin names and so forth. But because we heard down the grapevine that there were some nomenclature experts who don't accept Excel spreadsheets as accessory supplementary information, and who wanted the protologues to be in the body of the manuscript, we injected into the body of the manuscript over 100 additional pages that contained just those protologues. We paid over a thousand dollars more to get the manuscript published. But I think we broke the mould there, because we showed that you can name hundreds of species in a single paper. And yeah, you can do it, just get on with it. What's the problem?
And I think what we're picking up on here is that it is a shift. It's a shift in behaviour, so it will take a little while to bed in. But there was a mentality: there's a 2010 paper, titled something like notes on the classification of prokaryotic species, that has become a bit of a millstone around the taxonomist's neck. And that really clearly states, I think possibly even in the abstract, that in order to classify something, you should characterise it as thoroughly as possible using this panel of polyphasic approaches. And actually, this goes back to the comment I made earlier about the emphasis on diagnosis rather than description.
Once you have a genome sequence, you don't need to do all of that extra work to come up with a robust classification. And that allows you, to pick up on the way Phil described it, to accept that not every leaf on the tree is particularly interesting. My belief is that you can use a classification to come up with a framework, and then you can look at that bigger picture and go back in and interrogate the bits that look interesting in more detail, and do the characterisation downstream. You know, say a particular taxon looks particularly interesting as a source of metabolites; then one might go back in and start interrogating genomes from members of that taxon for their biosynthetic gene clusters, and use all those kinds of experimental approaches. And that's the thing that possibly redefines, or is a reappraisal of, the working methods of the jobbing taxonomist. And that will take time to settle in.
There's a distinction between a taxonomist and a microbial ecologist or a microbial physiologist. For taxonomy, we can be fundamentalists. All we require is a description and a circumscription. That's all that the code tells us to do. It doesn't tell us what methods are used for that. And with any reviewer that says that you have to do a certain thing, we just need to be fearless and stand up to them and say, no, this is enough, and let's draw a line. This phylogenetic placement is robust. This circumscription is clear. Please get out of my way. I've got other things to do than argue with you.
All right, and I think that's an excellent point to end it. Don't fear, we will be back with more from our esteemed guests. This is Andrew and Nabil talking with Professors Ian, Phil and Mark about bacterial taxonomy on our holiday special of the MicroBinfie podcast. And we will see you in part three.
Thank you so much for listening to us at home.
If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud or the platform of your choice. Follow us on Twitter at MicroBinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.