Hello, and thank you for listening to the MicroBinfie podcast. Here we will be
discussing topics in microbial bioinformatics. We hope that we can give you some
insights, tips, and tricks along the way. There's so much information we all
know from working in the field, but nobody writes it down. There is no manual,
and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My
co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz.
Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work
on microbes in food and the impact on human health. I work at the Centers for
Disease Control and Prevention and am an adjunct member at the University of
Georgia in the U.S.

Hello, and welcome to the Microbial Bioinformatics podcast. Leo and
Nabil, myself, are your hosts for today, and we're talking with Dr. Emma
Hodcroft about all things SARS-CoV-2 and Nextstrain. Emma is a postdoctoral
researcher at the ISPM, University of Bern in Switzerland, and a member of the
SIB, that's the Swiss Institute of Bioinformatics in Switzerland. Emma has had
many lives, including college in Texas, graduate school in Scotland, and now a
research position in Switzerland. Her topics have been very diverse, from HIV to
tuberculosis, and now SARS-CoV-2. She currently works on Nextstrain and
SARS-CoV-2, and there is plenty to unpack today. We're also joined by Dr. Leo
Martins, who's head of phylogenomics at QIB. So he runs analyses for other
researchers at the Institute, especially when they involve phylogenetic
inference. At the same time, he's been developing and implementing new software
where there are gaps in the current landscape, to help the community. So one of
the examples is uvaia, which we've been using internally here at QIB for
SARS-CoV-2. So welcome to the show, both of you. Good to have you on. Thank you so
much. Really happy to be here. So let's start with something simple. So you've
been a major developer on Nextstrain over the last couple of years. So what is
Nextstrain? What is it actually for, for people who don't know? That's a great
opening question, because I think even with the publicity that Nextstrain has
had in the pandemic, it still might not be completely clear what Nextstrain is.
And Nextstrain is kind of two things. So it's a website. If you go to
nextstrain.org, you can look at these beautiful visualizations of pathogen
evolution over time, but it's also a software package. So it's also something
you can download and you can run and you can actually create the things that you
see on the Nextstrain website. And the way that this works is it's phylogenetic
analysis, which essentially means that we take the genomes of pathogens, mostly
viruses, but we do some bacteria as well, or we have done some bacteria as well.
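
To make the idea of comparing genomes concrete, here is a minimal Python sketch.
It is not Nextstrain's code: the sample names and sequences are invented, and it
assumes the sequences are already aligned to the same length. It only counts the
positions at which each pair of sequences differs, which is the kind of raw
signal a tree-building program starts from.

```python
from itertools import combinations

# Toy, pre-aligned sequences; names and bases are invented for illustration.
aligned = {
    "sample_A": "ATGCCGTA",
    "sample_B": "ATGCCGTT",
    "sample_C": "ATGACGTT",
}


def count_differences(seq1, seq2):
    """Count positions where two equal-length aligned sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq1, seq2))


def pairwise_differences(seqs):
    """Return {(name1, name2): number_of_differences} for every pair."""
    return {
        (n1, n2): count_differences(seqs[n1], seqs[n2])
        for n1, n2 in combinations(sorted(seqs), 2)
    }


distances = pairwise_differences(aligned)
for pair, n_diff in distances.items():
    print(pair, n_diff)
```

Pairs with fewer differences would sit closer together in a tree; real
phylogenetic software uses evolutionary models rather than raw counts.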
And we essentially look for differences and similarities in that genome. And we
can use that to cluster, kind of draw lines between which sequences are more
similar and which are more different. And we can create these great graphs that
let us look at how pathogens have changed over time. And also because we often
have the sample date and the location, we can infer things about when they've
moved and where they've moved around the world. And when you do that local
installation, I've noticed there are two components, one called Augur and one
called Auspice. And what are both of those? So this is another kind of secret
split: even within the software part, there are actually two parts as well. And
this is tied a little bit to the fact that we have the software and the
website. Augur is the part that is really the bioinformatics pipeline.
It's what takes in the raw sequences and does all of the analysis, all of the
phylogenetic work, all of the inference, and it will produce an output file that
has all of the results in there together. And then Auspice is the part that
actually makes the visualization. We don't actually produce from the pipeline
like pictures or movies. I think that's a common misconception. We just produce
one file or a couple of files that are loaded with all of the information we've
inferred. And then Auspice is what takes that text data, that numerical data,
and turns it into the beautiful and colorful trees that you can see on the
website and allows those to be interactive. So essentially, Augur is what does
the kind of donkey work, does the analysis, and Auspice is then what makes it
really beautiful and interactive. And how did you get involved with all of this
to begin with? Because I think Nextstrain, in its conception, was before you
joined the group? Yes, absolutely. So Nextstrain predates me. It was started at
the end of 2014, beginning of 2015, by Trevor Bedford, who's at the Fred
Hutchinson Cancer Research Center in Seattle, and Richard Neher, who's at the
University of Basel here in Switzerland. And it originally started out as
Nextflu, so it
was really flu-specific. And the idea was, can you do things and learn more
about how flu is evolving and eventually have some idea about what is the next
flu, so the next flu strain that is circulating? And until the pandemic, that
was still the majority of the Nextstrain focus. And I'm sure that will return as
SARS-CoV-2 dies down. And the reason that I got involved was that I actually,
just by chance, I happened to share an office with Trevor. I like to say before
he was famous. So he was a postdoc at the University of Edinburgh when I was a
research assistant there, and then eventually started my PhD. And we just
happened to be put into an office together, and so I was always around the work
that he was doing. Now, he wasn't doing Nextstrain then. This is 2009, 2010,
2011. But he was already really interested in doing cool visualizations,
especially around flu. So in particular, he had these great graphics that he
could show about antigenic evolution and phylogenetic evolution in flu. And it
was so intuitive. I'd never really seen anything like it. And it just seemed to
convey these changes so efficiently. And so I was very interested in this idea
of what can better visualization do for phylogenetics. So I kind of kept track
of what Trevor was doing after he left Edinburgh. And I heard about this
Nextstrain program. And when I saw a job ad for a position here in Switzerland
to work with Richard Neher on Nextstrain, I was like, okay, that's what I'm
doing. So that's how I got involved. And who else is actually involved in the
project? You mentioned Trevor, and you mentioned Richard. Are there any other
people you'd like to shout out at this point who are pulling the strings in the
background? Oh, man, there's so many great people that have been involved in
Nextstrain. And there's quite a few that, or at least a few that predated me
that I never really had the pleasure of working with. But certainly at the
moment, some of the biggest influences have been James Hadfield. So he is
especially the person who's come up with Auspice. So a lot of the awesome
visualization stuff and the cool interactive and a lot of the features we've
added during the pandemic have come from James. John Huddleston has also been
really, really involved in Nextstrain, and he's made a lot of contributions for
helping to get the pipeline to be more efficient and thinking through things
like making sure things are usable, making sure that we're using appropriate
functions for doing things. So Tom Sibley is also someone who joined about the
same time I did, and he's been amazing as far as helping to set up a lot of the
kind of backend work on Nextstrain. So making sure that we have all of the
software kind of organized in a way that makes sense, making sure that the
databases are all working. Those are people that have been involved in the
longer term, but we've got a lot of people that have contributed to the project
through the pandemic. I'm not going to try and name everyone just because I'm
sure I would leave someone out and then that would be really bad. Definitely,
and we will put links to the Nextstrain website and all of that in the show
notes so you can go and marvel at all the wonderful people who have contributed
to the project. It is huge. And speaking on its size, Leo had a question. If I
remember correctly, I think you were the first to use Nextstrain for non-virus,
right? Yes. For bacteria. So yeah. So why don't you take us through that? How
did you get into that and what was the motivation for that? Yeah, so this is a
bit of a coincidence or one of these chance things as I think so often happens
in science that maybe people don't realize. But when I arrived in Basel, I
actually, I was moving away from HIV. I'd been in HIV since my research
assistantship. So for many years, I think seven or eight years, and I was kind
of touring pathogens. So I was very curious as to what else, you know, what else
can we do these kinds of analyses on? And as part of that, Richard Neher, who was
my supervisor at the time, was working with someone who was working on
tuberculosis. And this was a particularly appealing bacterium because
tuberculosis, it's a little bit nicer than most bacteria. It doesn't do very
much recombination, none of this gene loss, gene gain nonsense. And so in some
ways, you can treat it like a really big virus. And so this seemed like a great
pathogen to think about, can we expand Nextstrain to this? Now, even though we
didn't have to think about things like recombination, you still have to deal
with a bunch of different stuff if you're going to move to bacteria. Because of
course, when I say a really big virus, bacteria are huge compared to viruses. So
you really have to start thinking about, okay, how can we keep all of the same
features? But we need to do this a lot more efficiently, because if we try and
pass in entire FASTA files of tuberculosis genomes, we're going to blow computers
up. So a lot of my work on that was figuring out how to apply the same
algorithms and in the same kind of efficient ways we were able to process
viruses, but expanding that up to bacteria. And of course, that also comes down
to things that might seem a little bit boring, like file types and data storage,
but are really critical if you want to make sure that not only you can take in
the formats that those researchers are using, but that they also come out in a
way that people can access that data type. You haven't just invented another
data type that no one knows how to use. And this is pretty exciting. And in the
end, I think that we accomplished this very well. We can run bacterial sequences
pretty effectively in Nextstrain right now. In terms of that technical
challenge, what languages is Nextstrain actually implemented in to achieve
that, or just in general, what languages are you coding in? So the vast
majority of Augur, which again is the bioinformatics part, the analysis part of
Nextstrain, is in Python.
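
As an illustration of the efficiency problem described for bacterial genomes,
one general approach is to store only the positions where a sample differs from
a reference instead of passing whole genomes around. The Python sketch below
shows that idea under stated assumptions: it is not Augur's implementation, the
function names `variable_sites` and `rebuild` are hypothetical, and the toy
sequences are invented and assumed to be aligned to the reference.

```python
def variable_sites(reference, samples):
    """Return {position: {sample_name: base}} for positions that differ
    from the reference in at least one sample (0-based positions)."""
    sites = {}
    for name, seq in samples.items():
        if len(seq) != len(reference):
            raise ValueError(f"{name} is not aligned to the reference")
        for pos, (ref_base, base) in enumerate(zip(reference, seq)):
            if base != ref_base:
                sites.setdefault(pos, {})[name] = base
    return sites


def rebuild(reference, sites, name):
    """Reconstruct one sample's full sequence from the reference plus
    its recorded differences."""
    seq = list(reference)
    for pos, calls in sites.items():
        if name in calls:
            seq[pos] = calls[name]
    return "".join(seq)


# Toy data, invented for illustration.
reference = "ACGTACGTAC"
samples = {
    "s1": "ACGTACGTAC",
    "s2": "ACGAACGTAC",
    "s3": "ACGAACGTTC",
}

sites = variable_sites(reference, samples)
print(sites)
print(rebuild(reference, sites, "s3"))
```

Only two positions need to be stored for these three samples, and each full
sequence can still be recovered when it is needed.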
I hadn't actually used Python before I started working with Nextstrain, but it's
a pretty flexible, pretty easy language to get into if you have experience in
other languages. And it also makes it, I think, quite accessible because Python
has become a little bit of a lingua franca, at least in the scientific
community. So it also makes it easier for other people to collaborate as well,
if they have additions that they want to make. For the visualization side, this
is a bit more complicated. There's some React in there. There's some JavaScript
in there. When you get into this web-based stuff, you have to start being much
more kind of jack of all trades. And it's also maybe worth mentioning that
within Augur, Augur actually calls a lot of other programs. We didn't want to
reinvent the wheel. So for example, to actually do the phylogenetic
reconstruction, you can use existing programs like FastTree or IQ-TREE. And these
of course are often written in languages like C, or other more efficient
languages, to run these kinds of more complex algorithms faster. You've got at least
two, maybe a couple of other languages in there. How do you collaborate? I mean,
it is a massive software package for people who haven't had a look at it.
There's a lot of moving parts. So how do you collaborate and how do you work
together with a huge team of people all around the world? Yeah. So I think the
biggest challenge here is actually the around the world part, because I think
one thing that surprises people about Nextstrain is that I guess that we're not
small, but all in all, the number of people that really work on Nextstrain on a
day-to-day basis is probably like 10, maybe 12 or so. It's not a huge number.
And a lot of those people are people like me, where we work on Nextstrain, but
that's not our job. We are really scientists doing research. And then on the
side, we work on Nextstrain. And so we're not even like full-time Nextstrain
employees, I guess you might say, or full-time Nextstrain people, but we are
very distributed. So we have people here in Switzerland. We've got Richard
Neher's lab, Cornelius and Ivan, and then myself at the University of Bern.
James Hadfield is actually in New Zealand. And then of course, Trevor and
his team are in Seattle. So we really have covered the time zones and this does
make scheduling meetings a bit of a challenge. Seattle's very lucky. They're
kind of in the middle. So they get to have a nice lunchtime meeting, but it's
usually very early morning for James and very late night for us. We have a
meeting every two weeks, so we do keep that kind of in-person, well, not
in-person, but real-time communication. And I think this is
critical, but we also depend a lot on Slack and on GitHub issues and GitHub
communication, because we do work on a time delay. If I have a question or put
something up today, I'm not going to get an answer until tonight or until
sometime in the night when I'm sleeping. And so making sure that things are
clear and that we do this communication and we put things in the open, that if
there's an important discussion maybe between James and Trevor while we're
asleep, that there's a record of that left online or in an issue somewhere so
that those of us who missed it in the night can wake up and be up to date and
make sure that we know what's going on. I think this has actually worked really
well. Of course, this is how we operated before the pandemic. And if anything, I
think we've all gotten much more used to this kind of communication, but I do
think that it works quite well. And I think it's also helpful that we're not too
big of a team because it means that it's much easier to ensure that people are
on the same page as far as what our goals are, what we're trying to tackle next,
and what the important pillars of Nextstrain are. All right. And I'm going to
ask, following from that, a question that's one of Andrew's favorites: to
allow all of this to happen asynchronously as well, you're going to need to have
a lot of very good testing involved, unit tests or integration testing to allow
all of those changes to work together. Am I correct? So you are correct. And
this is something that has really improved since I arrived at Nextstrain and
particularly in the past couple of years. When I arrived at Nextstrain, I was
certainly not impressed when I first opened up the code. It
was very much still, I mean, this was only a couple of years after Nextstrain
had really started. And it was still really, really flu focused. Everything in
the code assumed you were working on flu. And if you weren't working on flu, you
kind of had to trick the code into thinking that this might be flu, but it's
going to be a little bit different. And as you move to other pathogens, you
know, we're really lucky with flu. We have a lot of data for flu. We have a lot
of sequences. We have usually pretty standardized things like sequence names and
antigenic titer data. As soon as you step away from flu, you don't have these
luxuries. And so I found that kind of annoying as someone who has never worked
on flu. Luckily, soon after I arrived, there was a big push to clean up the code
and to make it more modular and more flexible. And I think this has really been
a key of what has made Nextstrain much more popular since about 2018 is that we
now have these little modules of code that you can call from the Nextstrain
package. So you don't have to run one monolithic thing for your analysis. You
can call different steps that are focused on, you know, bringing in the data,
cleaning it, aligning your sequences, renaming them, doing the phylogenetic
analysis, and so on and so on and so on. As many of those as you might need for
your analysis and leave out any that you don't need. For example, in flu, you
would want to look at your antigenic data. For other viruses, you may not have
that, so you can skip it entirely. And this has also made it just really
flexible to adapt to other pathogens, which of course, in the current scenario, was
really useful, but has been useful, of course, for many people, even beyond
that. And certainly in the last few years, we've also done the much more
technical part. I personally have not been greatly involved in this, but putting
things in like unit tests and making sure that we have, you know, mini runs that
can execute when we update the code to make sure that there aren't big errors.
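
The combination of small, independent steps and a quick "mini run" can be
sketched in Python as below. This is an illustration only and does not use
Augur's real interface: the step functions (`drop_short`, `uppercase`,
`rename`), the `run_pipeline` helper, and the toy records are all hypothetical.
The point is that each analysis chooses which steps to chain, and a tiny test
run catches obvious breakage after a code change.

```python
def drop_short(records, min_length=5):
    """Cleaning step: keep only sequences with at least min_length bases."""
    return {name: seq for name, seq in records.items() if len(seq) >= min_length}


def uppercase(records):
    """Normalisation step: convert all bases to upper case."""
    return {name: seq.upper() for name, seq in records.items()}


def rename(records, prefix="study/"):
    """Renaming step: add a prefix to every sequence name."""
    return {prefix + name: seq for name, seq in records.items()}


def run_pipeline(records, steps):
    """Apply the chosen steps in order; unused steps are simply left out."""
    for step in steps:
        records = step(records)
    return records


def smoke_test():
    """Mini run on toy data, meant to be executed after every code change."""
    toy = {"a": "acgtac", "b": "acg"}
    # One analysis filters and normalises.
    assert run_pipeline(toy, [drop_short, uppercase]) == {"a": "ACGTAC"}
    # Another skips filtering but renames.
    assert run_pipeline(toy, [uppercase, rename]) == {
        "study/a": "ACGTAC",
        "study/b": "ACG",
    }
    return True


if __name__ == "__main__":
    smoke_test()
    print("mini run passed")
```

A test like this does not prove the code is correct, but it fails loudly when a
change breaks one of the steps or the way they fit together.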
Of course, you know, with anything this large, nothing's ever perfect. I think
probably no piece of code that's more than five lines is going to be absolutely
perfect and something will catch you out. But with the tests that we have and
with many people using Nextstrain in many ways, we usually catch something
before too long and can make a correction if we need to. What's the relation
between Nextstrain and Nextclade and Nextalign? So that's a really good question
because Nextclade and Nextalign were mostly developed by Richard Neher and Ivan
Aksamentov. This was in response to the pandemic. And so this started out as a
way of how can we more efficiently align the SARS-CoV-2 sequences. And this is
something we might talk a little bit about later, but of course, the volume of
data we've had in the pandemic is really unprecedented. And one thing we were
running into is that aligning all of this data before we did the Nextstrain
analysis was proving to be a real bottleneck. But we also knew that SARS-CoV-2
sequences are actually not very diverse compared to a lot of pathogens that we
work on. And we thought that this probably meant that there was a more efficient
way that we could do this analysis without having to account for a lot of the
things that you need to do if you're going to write a more diverse aligner that
needs to account for a lot of different things. And so they set about developing
this Nextalign. And this is what we now use in Nextstrain for SARS-CoV-2, and it
works really well. It also can be used for other pathogens. And this also
started turning into, okay, are there ways that we can build on this super fast
aligner to help people also understand more about their SARS-CoV-2 data? And
Ivan came up with a great idea of turning this into a web interface where people
could not only align their sequences in a browser window. It actually doesn't
come to us. It all happens in your browser. But also figure out their clade. So
we obviously have different variants, different lineages, different clades
within SARS-CoV-2, Alpha, Beta, Gamma, and of course the many scientific names
that you have. And with Nextclade, you can upload these sequences, your
sequences, and get information about, well, you can align them, get information
about what clade they are, and also get a lot of great quality control
information as well. So information about your coverage, weird mutations you
might see, mutations that might impact primer sites, homoplasies, all of this
kind of stuff. But really these are part of Nextstrain just because we use them
in Nextstrain, and they're part of the Nextstrain pipeline, and they were
developed by people that are involved in Nextstrain. They are a little bit
outside of the main Nextstrain, Augur, and Auspice package, but it's also great
to be providing more tools that hopefully are helping people with their
analysis. And so out of all of those features, even out of ones you didn't
mention, what are you most proud of that you've developed or you've added? I
haven't done as much development with Nextstrain during the pandemic as I did
pre-pandemic. There's been a big change as far as how much time I unfortunately
have to invest in coding and software development, which is something I really
enjoy, but it's certainly not my strongest feature. There are
people that are better at this. And in a pandemic, you have to kind of go to
where you think you can do the best work. But I think that a lot of the work I
did on bacteria, I'm still pretty proud of this. So that did involve thinking a
lot about how to algorithmically create these phylogenetic trees more
efficiently, how to throw out, of course, in bacteria, you have a lot of bases
that never change, and then you have a few bases that do change. And they're
both important, but in different ways. And so were there ways that we could
divide this data up and take only the bits that we needed when we needed it? And
I also got to work on changing the visualization so that this works a little bit
better with bacteria. So for bacteria, you also have, of course, positive and
negative sense genes. So you need to have kind of a display of two strands for a
bacterium, whereas for a virus, you just have one strand. And so I was able to work
on getting this visualization to work better and getting the zoom function to
work a little better. I often now think when I use it of ways that I could have
made it even better. But in general, I really enjoyed working on that because it
really was a top-to-tail way of working on Nextstrain, from the really
algorithmic basic parts of working on TreeTime, which is how we make
time-resolved phylogenies, through adding new components to the Augur pipeline, and
all the way up to changing the visualization in Auspice. So now that we solved
coronaviruses, can we expect to see more bacteria? I'm hopeful that people will
start using Nextstrain for bacteria more, but of course, phylogenetic analysis
of bacteria is always a little bit fraught, because you have to really decide
what are the questions that you're trying to answer. If you have a bacterium, it
may not even have the same genes between all of your samples, so there's always
a question in bacteria of what is the genome that you're going to use, or are
you just going to use one gene, or part of the genome, and because of things
like recombination and gene swapping, you're also going to have to think about
things like how do we interpret this phylogenetically, because as you go back in
time, if you've had recombination, for example, that phylogenetic tree is going
to be impacted by that, and it's not so straightforward to interpret. Now, for
viruses, of course, we also have recombination, but at least in my experience,
we can often figure out a little bit more straightforwardly how to deal with
this, because it's only recombination, it's
not gene swapping. In bacteria, this often seems to get a lot more complicated a
lot more quickly. Having said that, I've used Nextstrain to look into outbreaks
of things like Campylobacter, and for really short time periods, it can work just
as well for bacteria as for viruses, because there's less chance there's been a
gene loss or gene gain, and there's less chance there's been recombination, and
often this can be still really insightful. So, for example, if you wanted to
look at an outbreak in a town or in a building or something, you can see this
still very clearly, just as you would with a virus, with bacterial data as well,
and so you can still get good information there. I would say that even for
larger analyses with bacteria, of course, phylogenetics are still useful, you
just have to be a little bit more careful in crafting your questions and making
sure that you're interpreting the tree in light of what may have happened to the
bacteria. We haven't discussed the real elephant in the room, which is SARS-
CoV-2, which I think we will do in the next episode. We're going to take a break
here, so that's all the time we have for today. So, I'd like to thank my guests,
Emma Hodcroft and Leo Martins for joining me, and we'll see you next time on the
MicroBinfie podcast.

The opinions expressed here are our own and do not necessarily reflect the views
of CDC or the Quadram Institute.