Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S.
Hello, and welcome to the Microbial Bioinformatics podcast. Leo and Nabil, myself, are your hosts for today, and we're talking with Dr. Emma Hodcroft about all things SARS-CoV-2 and Nextstrain. Emma is a postdoctoral researcher at the ISPM, University of Bern, and a member of the SIB, that's the Swiss Institute of Bioinformatics in Switzerland. Emma has had many lives, including college in Texas, graduate school in Scotland, and now a research position in Switzerland. Her topics have been very diverse, from HIV to tuberculosis, and now SARS-CoV-2. She currently works on Nextstrain and SARS-CoV-2, and there is plenty to unpack today. We're also joined by Dr. Leo Martins, who's head of phylogenomics at QIB. So he runs analyses for other researchers at the Institute, especially when they involve phylogenetic inference. At the same time, he's been developing and implementing new software where there are gaps in the current landscape, to help the community. So one of the examples is uvaia, which we've been using internally here at QIB for SARS-CoV-2. So welcome to the show, both of you. Good to have you on.
Thank you so much. Really happy to be here.
So let's start with something simple.
So you've been a major developer on Nextstrain over the last couple of years. So what is Nextstrain? What is it actually for, for people who don't know?
That's a great opening question, because I think even with the publicity that Nextstrain has had in the pandemic, it still might not be completely clear what Nextstrain is. And Nextstrain is kind of two things. So it's a website. If you go to nextstrain.org, you can look at these beautiful visualizations of pathogen evolution over time, but it's also a software package. So it's also something you can download and you can run and you can actually create the things that you see on the Nextstrain website. And the way that this works is it's phylogenetic analysis, which essentially means that we take the genomes of pathogens, mostly viruses, but we have done some bacteria as well. And we essentially look for differences and similarities in those genomes. And we can use that to cluster, kind of draw lines between which sequences are more similar and which are more different. And we can create these great graphs that let us look at how pathogens have changed over time. And also because we often have the sample date and the location, we can infer things about when they've moved and where they've moved around the world.
And when you do that local installation, I've noticed there are two components, one called Augur and one called Auspice. And what are both of those?
So this is another kind of secret split, that even within the software part, that's actually two parts as well. And this is tied a little bit to the fact that we have the software and the website. Augur is the part that is really the bioinformatics pipeline. It's what takes in the raw sequences and does all of the analysis, all of the phylogenetic work, all of the inference, and it will produce an output file that has all of the results in there together.
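To make the idea of looking for differences and similarities between genomes concrete, here is a minimal Python sketch. It is not how Augur itself works (the names `count_differences`, `pairwise_differences`, and the toy sequences are hypothetical); it only counts mismatches between sequences that are assumed to be aligned already, which is the kind of signal a tree-building program starts from.

```python
from itertools import combinations
from typing import Dict, Tuple

VALID_BASES = set("ACGT")


def count_differences(seq_a: str, seq_b: str) -> int:
    """Count positions where two aligned sequences differ.

    Positions where either sequence has a gap or an ambiguous base
    (anything other than A, C, G, T) are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for base_a, base_b in zip(seq_a, seq_b)
        if base_a != base_b and base_a in VALID_BASES and base_b in VALID_BASES
    )


def pairwise_differences(sequences: Dict[str, str]) -> Dict[Tuple[str, str], int]:
    """Return the number of differences for every pair of sequence names."""
    return {
        (name_a, name_b): count_differences(sequences[name_a], sequences[name_b])
        for name_a, name_b in combinations(sorted(sequences), 2)
    }


# Toy, made-up alignment purely for illustration.
toy_alignment = {
    "sample_A": "ACGTACGT",
    "sample_B": "ACGTACGA",
    "sample_C": "TCGTACGA",
}
differences = pairwise_differences(toy_alignment)
```

In this toy data, sample_B sits between sample_A and sample_C (one difference from each), which is the sort of relationship a phylogenetic tree makes visible at scale.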
And then Auspice is the part that actually makes the visualization. We don't actually produce things like pictures or movies from the pipeline. I think that's a common misconception. We just produce one file or a couple of files that are loaded with all of the information we've inferred. And then Auspice is what takes that text data, that numerical data, and turns it into the beautiful and colorful trees that you can see on the website and allows those to be interactive. So essentially, Augur is what does the kind of donkey work, does the analysis, and Auspice is then what makes it really beautiful and interactive.
And how did you get involved with all of this to begin with? Because I think Nextstrain, in its conception, was before you joined the group?
Yes, absolutely. So Nextstrain predates me. It was started at the end of 2014, beginning of 2015, by Trevor Bedford, who's at the Fred Hutch Cancer Research Institute in Seattle, and Richard Neher, who's at the University of Basel here in Switzerland. And it originally started out as Nextflu, so it was really flu-specific. And the idea was, can you do things and learn more about how flu is evolving and eventually have some idea about what is the next flu, so the next flu strain that is circulating? And until the pandemic, that was still the majority of the Nextstrain focus. And I'm sure that will return as SARS-CoV-2 dies down. And the reason that I got involved was that, just by chance, I happened to share an office with Trevor. I like to say before he was famous. So he was a postdoc at the University of Edinburgh when I was a research assistant there, and then I eventually started my PhD. And we just happened to be put into an office together, and so I was always around the work that he was doing. Now, he wasn't doing Nextstrain then. This is 2009, 2010, 2011. But he was already really interested in doing cool visualizations, especially around flu.
So in particular, he had these great graphics that he could show about antigenic evolution and phylogenetic evolution in flu. And it was so intuitive. I'd never really seen anything like it. And it just seemed to convey these changes so efficiently. And so I was very interested in this idea of what better visualization can do for phylogenetics. So I kind of kept track of what Trevor was doing after he left Edinburgh. And I heard about this Nextstrain program. And when I saw a job ad for a position here in Switzerland to work with Richard Neher on Nextstrain, I was like, okay, that's what I'm doing. So that's how I got involved.
And who else is actually involved in the project? You mentioned Trevor, and you mentioned Richard. Are there any other people you'd like to shout out at this point who are pulling the strings in the background?
Oh, man, there are so many great people that have been involved in Nextstrain. And there are at least a few that predated me that I never really had the pleasure of working with. But certainly at the moment, some of the biggest influences have been James Hadfield. So he is especially the person who's come up with Auspice. So a lot of the awesome visualization stuff and the cool interactive stuff and a lot of the features we've added during the pandemic have come from James. John Huddleston has also been really, really involved in Nextstrain, and he's made a lot of contributions for helping to get the pipeline to be more efficient and thinking through things like making sure things are usable, making sure that we're using appropriate functions for doing things. Tom Sibley is also someone who joined about the same time I did, and he's been amazing as far as helping to set up a lot of the kind of backend work on Nextstrain. So making sure that we have all of the software kind of organized in a way that makes sense, making sure that the databases are all working.
Those are people that have been involved in the longer term, but we've got a lot of people that have contributed to the project through the pandemic. I'm not going to try and name everyone just because I'm sure I would leave someone out and then that would be really bad.
Definitely, and we will put links to the Nextstrain website and all of that in the show notes so you can go and marvel at all the wonderful people who have contributed to the project. It is huge. And speaking of its size, Leo had a question.
If I remember correctly, I think you were the first to use Nextstrain for non-virus, right? Yes. For bacteria. So yeah. So why don't you take us through that? How did you get into that and what was the motivation for that?
Yeah, so this is a bit of a coincidence or one of these chance things, as I think so often happens in science, that maybe people don't realize. But when I arrived in Basel, I was actually moving away from HIV. I'd been in HIV since my research assistantship. So for many years, I think seven or eight years, and I was kind of touring pathogens. So I was very curious as to what else we can do these kinds of analyses on. And as part of that, Richard Neher, who was my supervisor at the time, was working with someone who was working on tuberculosis. And this was a particularly appealing bacterium because tuberculosis, it's a little bit nicer than most bacteria. It doesn't do very much recombination, none of this gene loss, gene gain nonsense. And so in some ways, you can treat it like a really big virus. And so this seemed like a great pathogen to think about, can we expand Nextstrain to this? Now, even though we didn't have to think about things like recombination, you still have to deal with a bunch of different stuff if you're going to move to bacteria. Because of course, when I say a really big virus, bacteria are huge compared to viruses.
So you really have to start thinking about, okay, how can we keep all of the same features? But we need to do this a lot more efficiently, because if we try and pass in entire FASTA files of tuberculosis genomes, we're going to blow computers up. So a lot of my work on that was figuring out how to apply the same algorithms, in the same kind of efficient ways we were able to process viruses, but expanding that up to bacteria. And of course, that also comes down to things that might seem a little bit boring, like file types and data storage, but are really critical if you want to make sure that not only can you take in the formats that those researchers are using, but that the results also come out in a way that people can access. You haven't just invented another data type that no one knows how to use. And this is pretty exciting. And in the end, I think that we accomplished this very well. We can run bacterial sequences pretty effectively in Nextstrain right now.
In terms of that technical challenge, how is Nextstrain actually implemented to achieve that, or just in general, what languages are you coding in?
So the vast majority of Augur, which again is the bioinformatics part, the analysis part of Nextstrain, is in Python. I hadn't actually used Python before I started working with Nextstrain, but it's a pretty flexible, pretty easy language to get into if you have experience in other languages. And it also makes it, I think, quite accessible because Python has become a little bit of a lingua franca, at least in the scientific community. So it also makes it easier for other people to collaborate as well, if they have additions that they want to make. For the visualization side, this is a bit more complicated. There's some React in there. There's some JavaScript in there. When you get into this web-based stuff, you have to start being much more kind of a jack of all trades.
And it's also maybe worth mentioning that Augur actually calls a lot of other programs. We didn't want to reinvent the wheel. So for example, to actually do the phylogenetic reconstruction, you can use existing programs like FastTree or IQ-TREE. And these of course are often written in languages like C, or other more efficient languages, to run these kinds of more complex algorithms more efficiently.
You've got at least two, maybe a couple of other languages in there. How do you collaborate? I mean, it is a massive software package, for people who haven't had a look at it. There are a lot of moving parts. So how do you collaborate and how do you work together with a huge team of people all around the world?
Yeah. So I think the biggest challenge here is actually the around the world part, because I think one thing that surprises people about Nextstrain is that, I guess we're not small, but all in all, the number of people that really work on Nextstrain on a day-to-day basis is probably like 10, maybe 12 or so. It's not a huge number. And a lot of those people are people like me, where we work on Nextstrain, but that's not our job. We are really scientists doing research. And then on the side, we work on Nextstrain. And so we're not even full-time Nextstrain employees, I guess you might say, or full-time Nextstrain people, but we are very distributed. So we have people here in Switzerland. We've got Richard Neher's lab, Cornelius and Ivan, and then myself at the University of Bern. James Hadfield is actually in New Zealand. And then of course, Trevor and his team are in Seattle. So we really have covered the time zones and this does make scheduling meetings a bit of a challenge. Seattle's very lucky. They're kind of in the middle. So they get to have a nice lunchtime meeting, but it's usually very early morning for James and very late night for us. And we have a meeting every two weeks.
So we do keep that kind of, not in-person, but real-time communication. And I think this is critical, but we also depend a lot on Slack and on GitHub issues and GitHub communication, because we do work on a time delay. If I have a question or put something up today, I'm not going to get an answer until tonight or until sometime in the night when I'm sleeping. And so it's about making sure that things are clear and that we do this communication and we put things in the open, so that if there's an important discussion, maybe between James and Trevor while we're asleep, there's a record of that left online or in an issue somewhere, and those of us who missed it in the night can wake up and be up to date and make sure that we know what's going on. I think this has actually worked really well. Of course, this is how we operated before the pandemic. And if anything, I think we've all gotten much more used to this kind of communication, but I do think that it works quite well. And I think it's also helpful that we're not too big of a team because it means that it's much easier to ensure that people are on the same page as far as what our goals are, what we're trying to tackle next, and what the important pillars of Nextstrain are.
All right. And I'm going to ask, following from that, a question that's one of Andrew's favorites: to allow all of this to happen asynchronously as well, you're going to need to have a lot of very good testing involved, unit tests or integration testing, to allow all of those changes to work together. Am I correct?
So you are correct. And this is something that has really improved since I arrived at Nextstrain and particularly in the past couple of years. When I arrived at Nextstrain, certainly when I first opened up the code, I was not impressed. I mean, this was only a couple of years after Nextstrain had really started. And it was still really, really flu focused.
Everything in the code assumed you were working on flu. And if you weren't working on flu, you kind of had to trick the code into thinking that this might be flu, but it's going to be a little bit different. And as you move to other pathogens, you know, we're really lucky with flu. We have a lot of data for flu. We have a lot of sequences. We have usually pretty standardized things like sequence names and antigenic titer data. As soon as you step away from flu, you don't have these luxuries. And so I found that kind of annoying as someone who has never worked on flu. Luckily, soon after I arrived, there was a big push to clean up the code and to make it more modular and more flexible. And I think this has really been a key part of what has made Nextstrain much more popular since about 2018: we now have these little modules of code that you can call from the Nextstrain package. So you don't have to run one monolithic thing for your analysis. You can call different steps that are focused on, you know, bringing in the data, cleaning it, aligning your sequences, renaming them, doing the phylogenetic analysis, and so on and so on. You use as many of those as you might need for your analysis and leave out any that you don't need. For example, in flu, you would want to look at your antigenic data. For other viruses, you may not have that, so you can skip it entirely. And this has also made it just really flexible to adapt to other pathogens, which of course in the current scenario was really useful, but has been useful for many people even beyond that. And certainly in the last few years, we've also done the much more technical part. I personally have not been greatly involved in this, but putting in things like unit tests and making sure that we have, you know, mini runs that can execute when we update the code to make sure that there aren't big errors. Of course, you know, with anything this large, nothing's ever perfect.
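The modular design and the "mini runs" described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the step names (`uppercase`, `drop_short`, `rename`) and the `run_pipeline` helper are hypothetical and are not Augur's actual commands or API. The point is that each step is a small independent function, an analysis chains only the steps it needs, and a tiny input can serve as a quick smoke test after a code change.

```python
from typing import Callable, Dict, List

Records = Dict[str, str]            # sequence name -> sequence
Step = Callable[[Records], Records]  # every step takes and returns records


def uppercase(records: Records) -> Records:
    """Cleaning step: normalise all sequences to upper case."""
    return {name: seq.upper() for name, seq in records.items()}


def drop_short(min_length: int) -> Step:
    """Build a filtering step that removes sequences shorter than min_length."""
    def step(records: Records) -> Records:
        return {name: seq for name, seq in records.items() if len(seq) >= min_length}
    return step


def rename(prefix: str) -> Step:
    """Build a renaming step that adds a prefix to every sequence name."""
    def step(records: Records) -> Records:
        return {prefix + name: seq for name, seq in records.items()}
    return step


def run_pipeline(records: Records, steps: List[Step]) -> Records:
    """Apply the chosen steps in order; steps left out of the list are skipped."""
    for step in steps:
        records = step(records)
    return records


# A "mini run": a tiny made-up input that exercises the whole chain quickly.
mini_input = {"s1": "acgtacgt", "s2": "acg", "s3": "ACGTTTGA"}
mini_output = run_pipeline(mini_input, [uppercase, drop_short(5), rename("demo/")])
```

A pathogen without, say, a renaming requirement would simply leave that step out of the list, which mirrors the idea of skipping antigenic analysis when no antigenic data exist.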
I think probably no piece of code that's more than five lines is going to be absolutely perfect and something will catch you out. But with the tests that we have and with many people using Nextstrain in many ways, we usually catch something before too long and can make a correction if we need to.
What's the relation between Nextstrain and Nextclade and Nextalign?
So that's a really good question, because Nextalign and Nextclade were mostly developed by Richard Neher and Ivan Aksamentov. This was in response to the pandemic. And so this started out as a question of how we can more efficiently align the SARS-CoV-2 sequences. And this is something we might talk a little bit about later, but of course, the volume of data we've had in the pandemic is really unprecedented. And one thing we were running into is that aligning all of this data before we did the Nextstrain analysis was proving to be a real bottleneck. But we also knew that SARS-CoV-2 sequences are actually not very diverse compared to a lot of pathogens that we work on. And we thought that this probably meant that there was a more efficient way that we could do this analysis, without having to account for a lot of the things that a more general aligner for more diverse sequences needs to account for. And so they set about developing this Nextalign. And this is what we now use in Nextstrain for SARS-CoV-2, and it works really well. It also can be used for other pathogens. And this also started turning into, okay, are there ways that we can build on this super fast aligner to help people also understand more about their SARS-CoV-2 data? And Ivan came up with a great idea of turning this into a web interface where people could not only align their sequences in a browser window (it actually doesn't come to us, it all happens in your browser) but also figure out their clade.
So we obviously have different variants, different lineages, different clades within SARS-CoV-2, alpha, beta, gamma, and of course the many scientific names that you have. And with Nextclade, you can upload your sequences, align them, get information about what clade they are, and also get a lot of great quality control information as well. So information about your coverage, weird mutations you might see, mutations that might impact primer sites, homoplasies, all of this kind of stuff. But really these are part of Nextstrain just because we use them in Nextstrain, and they're part of the Nextstrain pipeline, and they were developed by people that are involved in Nextstrain. They are a little bit outside of the main Nextstrain, Augur, and Auspice package, but it's also great to be providing more tools that hopefully are helping people with their analysis.
And so out of all of those features, even out of ones you didn't mention, what are you most proud of that you've developed or you've added?
I haven't done as much development with Nextstrain during the pandemic as I did pre-pandemic. There's been a big change as far as how much time I unfortunately have to invest in coding and software development, which is something I really enjoy, but it's certainly not my strongest feature. There are people that are better at this. And in a pandemic, you have to kind of go to where you think you can do the best work. But I think that a lot of the work I did on bacteria, I'm still pretty proud of this. So that did involve thinking a lot about how to algorithmically create these phylogenetic trees more efficiently. Of course, in bacteria, you have a lot of bases that never change, and then you have a few bases that do change. And they're both important, but in different ways. And so were there ways that we could divide this data up and take only the bits that we needed when we needed them?
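The idea of dividing the data into bases that never change and bases that do can be illustrated with a short Python sketch. This is not the code used in Nextstrain; `split_sites` and the toy alignment are hypothetical, and the input is assumed to be an alignment where every sequence has the same length. The sketch keeps only the variable columns for downstream work, while recording how many constant sites of each base were set aside so that information is not lost.

```python
from typing import Dict, List, Tuple


def split_sites(
    alignment: Dict[str, str],
) -> Tuple[List[int], Dict[str, int], Dict[str, str]]:
    """Split an alignment into variable columns and counts of constant columns.

    Returns (variable_positions, constant_counts, reduced_alignment), where
    positions are 0-based and constant_counts tallies constant A/C/G/T columns.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")

    variable_positions: List[int] = []
    constant_counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    for position in range(length):
        column = {alignment[name][position] for name in names}
        if len(column) == 1:
            base = next(iter(column))
            if base in constant_counts:  # ignore constant gap/N columns
                constant_counts[base] += 1
        else:
            variable_positions.append(position)

    # Keep only the columns that differ between samples.
    reduced = {
        name: "".join(alignment[name][position] for position in variable_positions)
        for name in names
    }
    return variable_positions, constant_counts, reduced


# Toy, made-up alignment purely for illustration.
toy = {
    "tb1": "ACGTACGTAC",
    "tb2": "ACGTACCTAC",
    "tb3": "ACGAACGTAC",
}
positions, constants, reduced = split_sites(toy)
```

Here ten columns shrink to two variable ones; for a bacterial genome with millions of bases and comparatively few differences between samples, the same idea reduces the data passed between steps dramatically.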
And I also got to work on changing the visualization so that this works a little bit better with bacteria. So for bacteria, you also have, of course, positive and negative sense genes. So you need to have kind of a display of two strands for a bacterium, whereas for a virus, you just have one strand. And so I was able to work on getting this visualization to work better and getting the zoom function to work a little better. I often now think, when I use it, of ways that I could have made it even better. But in general, I really enjoyed working on that because it really was a top-to-tail way of working on Nextstrain, from the really algorithmic basic parts of working on TreeTime, which is how we make time-resolved phylogenies, through adding new components to the Augur pipeline, and all the way up to changing the visualization in Auspice.
So now that we've solved coronaviruses, can we expect to see more bacteria?
I'm hopeful that people will start using Nextstrain for bacteria more, but of course, phylogenetic analysis of bacteria is always a little bit fraught, because you have to really decide what are the questions that you're trying to answer. If you have a bacterium, it may not even have the same genes between all of your samples, so there's always a question in bacteria of what is the genome that you're going to use, or are you just going to use one gene, or part of the genome. And because of things like recombination and gene swapping, you're also going to have to think about things like how do we interpret this phylogenetically, because as you go back in time, if you've had recombination, for example, that phylogenetic tree is going to be impacted by that, and it's not so straightforward to interpret. Now, for viruses, of course, we also have recombination, but at least in my experience, we can often figure out a little bit more straightforwardly how to deal with this, because it's only recombination, it's not gene swapping.
In bacteria, this often seems to get a lot more complicated a lot more quickly. Having said that, I've used Nextstrain to look into outbreaks of things like Campylobacter, and for really short time periods, it can work just as well for bacteria as for viruses, because there's less chance there's been a gene loss or gene gain, and there's less chance there's been recombination, and often this can still be really insightful. So, for example, if you wanted to look at an outbreak in a town or in a building or something, you can see this still very clearly with bacterial data, just as you would with a virus, and so you can still get good information there. I would say that even for larger analyses with bacteria, of course, phylogenetics is still useful, you just have to be a little bit more careful in crafting your questions and making sure that you're interpreting the tree in light of what may have happened to the bacteria.
We haven't discussed the real elephant in the room, which is SARS-CoV-2, which I think we will do in the next episode. We're going to take a break here, so that's all the time we have for today. So, I'd like to thank my guests, Emma Hodcroft and Leo Martins, for joining me, and we'll see you next time on the MicroBinfie podcast.
The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.