Hello, and thank you for listening to the Microbid P podcast. Here, we will be
discussing topics in microbial bioinformatics. We hope that we can give you some
insights, tips, and tricks along the way. There is so much information we all
know from working in the field, but nobody really writes it down. There's no
manual, and it's assumed you'll pick it up. We hope to fill in a few of these
gaps. My co-hosts are Dr. Nabil Ali Khan and Professor Andrew Page. Nabil is a
senior bioinformatician at the Center for Genomic Pathogen Surveillance,
University of Oxford. And Andrew is the Director of Technical Innovation for
Theogen in Cambridge, UK. I am Dr. Lee Katz, and I'm a senior bioinformatician
at Centers for Disease Control and Prevention in Atlanta in the United States.
Hello, and welcome to the Microbial Bioinformatics podcast. Nabil and I are your
co-hosts today. Today, we're talking about minimum spanning trees. This is
something we touched on talking about tree visualization last time, and it's
something we use, but people might not understand. So we're going to also talk
about the application of minimum spanning trees in our lines of work. Someone
was recently asking why anyone would want to even use one. So a minimum spanning
tree is a concept from graph theory, and we sometimes just call these MSTs. It
applies to a connected, undirected graph where each edge has a weight or cost
associated with it. The MST is a subset of the edges in the graph that connects
all the vertices or nodes together without any cycles in the minimum possible
total edge weight. So over to you, Nabil. Help us understand this a little bit
more. Yeah, so that's the technical dictionary definition of what a minimum
spanning tree is. I think it's very easy to imagine, even though it sounds
complicated. I mean, if you think about a constellation of nodes, of points in
space on a piece of paper or something, and between them, there's all a bunch of
known distances between all of those different points. What the minimum, that's
your graph, basically, nodes and a bunch of nodes and edges. And what the
minimum spanning tree says is, well, find the tree. So a tree, in this case, is
something that models the hierarchical relationship between all of those nodes,
those points. And it's just a connected acyclical graph. So an acyclical graph,
meaning it doesn't have any loops. And there's one path between the two nodes.
There's no redundant connections or anything. So that basically means connect
all the dots, all the points up together, so that there's only one connection
between two nodes. And there's no loops. In a tree that we usually use, so just
to drive this home, usually a tree we'll use is a bifurcating tree or
multifurcating tree, like whatever you want to call it. It starts with a node
and just keep going out and out and out. With a minimum spanning tree, it sounds
like it's more like, just to drive this home, it sounds like it's just an
acyclic graph. So it's just like a set of profiles connected to each other,
maybe. How would you separate these out? I mean, a phylogenetic tree is
actually, is a tree. It has the same thing. It's a bunch of nodes connected
together. And there's no loops. And there's this sort of hierarchical
relationship there. But the theoretical differences is that in the minimum
spanning tree and trees in general, compared to phylogenetic trees, they can
have, one node can connect to multiple other nodes at the same time. It's not
bifurcating. In most phylogenetic trees, you'll see one node and then two
children, and then two children from each of those and so on and so forth. That
makes sense to me. Yeah, but so yeah, you want, so you have your graph and you
want a tree and you don't just want any tree, you want a spanning tree. So
something that touches all of the nodes on your graph that you have in front of
you, and you want the minimum spanning tree, which is the minimum, you want to
do this with as low total edge weight. So as little distance as possible. People
use this in very contrived maths questions, like imagine you have a bunch of
houses and you want to connect them up with telephone and you have telephone
wires and you're trying to figure out what's the least amount of copper you need
to use, right? The sort of thing where you have a bunch of houses spread across
the countryside and you're disconnecting them all up. It's one of those kind of
maths-y problems. And we use it in bioinformatics. I think it's from, it
might've been eBurst was the first real implementation that uses, certainly one
of the first. And in this case, the data you've got is you've got genomes,
you've got strains and the distances between them is the distance in terms of
genetic distance. So alleles, allele distance, usually because it was originally
used for MLST, but you could use it for anything, any type of distance. So it
could be MILVA, it could be any measurement of distance between these strains.
And you then have all your data points out and you just connect them all up and
see the relationship between them with the minimum spanning tree. There is sort
of a problem though, that the definition of how you create the minimum spanning
tree, I explained it to you what it is. And there's nothing in that that
mentioned, we didn't talk anything about evolution or genome or genes or
bacteria or microbes or anything. It's entirely just the maths problem of just
minimizing the distances, connecting all the points up, but minimizing the
distance. So that runs into problems. And one of the problems you'll find, it's
described in the Go eBurst paper, is that you will run into a case where if
you're just optimizing the minimum distance, you will find multiple solutions.
Because you'll find multiple solutions just for optimizing on the minimum
spanning tree for the lowest distances, lowest possible total edge weight. So in
that case, what do you do? Because you don't want, no one's interested in having
10 different solutions to 10 different trees for your question. You have to pick
one. And then there are different approaches that you can find in the literature
of how people have tried to pick the correct tree that's sort of in line with
what you'd expect, biologically speaking. So a minimum spanning tree is that you
see then in, say, bionumerics or you go to eBurst or you go to PhyloBiz or you
go to GrapeTree or whatever is not actually a minimum spanning tree in the
strictest sense. Oh, so with GrapeTree or these other ones, you're kind of
optimizing it in the context of evolution? No, a lot of them is just a deal.
First, a lot of the decisions is just to deal with the tiebreaker problem where
you're going to have just on minimum edge weight, total edge weight, you're
going to find multiple trees. So now you have to try and pick one. So that's the
first thing you're probably going to run into. But then, yeah, I think in
GrapeTree, there is the underlying algorithm does try to fix certain things to
make it more realistic. And particularly if you have missing LDL data between
different texts in your data set. So GrapeTree definitely, the minimum spanning
tree produces definitely not a minimum spanning tree in the mathematical sense
that you can see in the code that I can point to in the code base that it
doesn't. It isn't actually a real minimum spanning tree. Yeah, but in some of
the other tools, they also make minor changes here and there. So it's not
strictly rank and file definition. It's not a true minimum spanning tree. Well,
I like to use minimum spanning trees for outbreak analysis, actually.
Ordinarily, we'll try to put everything in the context of strictly in terms of
evolution, just using a bifurcating tree. But sometimes you know that you just
have a founder that started the outbreak. And it's a bottleneck expansion event.
So you get one founder and it expands outward evolutionarily. So you're
expecting tons of the pathogens to have one specific genetic profile. And then
just a minority of differences that you can probably see radiating outwards
using a minimum spanning tree. So if you just are expecting a couple of SNPs or
a couple of alleles difference, it doesn't make sense to try to correct those
distances using evolutionary ideas. It just makes sense to just try to show the
distances between them in a nice, what'd you call it last time? A blobby format.
So I like that way of looking at it. And then you can look at those profiles.
And if you have a majority in one profile, the circle, the blob for that profile
gets bigger and the minorities are smaller and you show the distance between
those blobs. It's really interesting because there's so...  much sort of wrong
with the minimum spanning tree from an evolutionary perspective. Like it's just
anyone with a solid to say this is not appropriate. This is totally
inappropriate. One of the critical things that most people will pick on
immediately is the fact that a minimum spanning tree does not have hypothetical
nodes. It does. The thing that you like about it is the thing that's problematic
for other people, which is the fact that there's no, there's no concept of an
internal split. So what it will do is it will force one of your existing nodes
to be the founder, which may not be true. And on long evolutionary timescales,
it usually isn't true, but you usually don't have such good sampling to have the
founder. So you want that hypothetical node shown. So these are the internal
branches. These are the internal splits in a tree. People have seen them all the
time. But then as you say, rightly that in an outbreak scenario, and you've, you
have in principle, sampled every patient that's come through your traceback
support, you know, your contact tracing supports this idea that you've got the
full panorama of all of the different strains that have, that are implicated in
this outbreak. You can do that, right? Like ideally, or in principle, you can do
that. So it's an interesting application of it. I think one of the other things
people really like about minimum spanning trees is the distances between the
different strains are something that they can understand very easily. So usually
it's distances in terms of number of alleles or number of repeats or number of
whatever it is, snips, whatever the measurement they're used to using, PFG
bands, whatever, whatever, whatever it is, they can look at it. The, the axes on
the, on the tree is in that number. So just number of alleles away, that's
something they can understand something like substitutions per site is something
a little less intuitive. I think that's a good point. You can use all these
different distances in there. It makes it pretty versatile though. Yeah, because
you're sort of getting away from thinking about it in terms of evolution and
mutation and clock rate. And you're talking about this in terms of just distance
difference in some way, because you can make a minimum spanning tree of any
data, as we sort of said at the beginning, any, any data, any, just any nodes
that, that distance between the different nodes can be anything. You, you may
even have a really contrived way of calculating the distance between them of
evolution, number of evolutionary events. You might say number, you could, you
could do something like insane, like number of plasmid insertions, a number of
structural variation or something, and it would still work. Yeah. Yeah. Speaking
of different scenarios that might work, you have in our own document where we're
sharing between each other, a few notes, you have a transmission network
actually. So ordinarily you and I are thinking about, honestly, I mean, we're
focusing usually on what we, what our backgrounds are with, with foodborne,
which has usually a food intermediate vehicle. You might have a common ancestor
in the food actually. But a transmission network, if you're looking at something
like tuberculosis, where it's person to person, Shigella, a transmission network
with a minimum spanning tree might be more appropriate too, because you might
not have a lot of hypothetical common ancestors if you've sampled everything.
Like if you have a population of people who have unfortunately or inadvertently
shared tuberculosis between each other, like you'll, you'll see person to person
and the lines in a minimum spanning tree would represent those transmission
lines and you wouldn't need a food vehicle intermediate. I think, I think for
that kind of application, it also is a lot more familiar to people who aren't
used to phylogenetic trees because they might've seen transmission networks. And
so, you know, you can put the two together and yeah, if it is a sort of closed
space where you know, you don't have a possible third party that is perhaps
infecting both, you know, you can kind of map out transmission. It is something
controlled that you can, you can follow then the transmission network and the
minimum spanning tree may be quite similar and they're both fairly same type of
interpretation. So I think that I always felt was why people tended to gravitate
to this sort of thing. Well, we're at time. Do you want to wrap it up? Do you
have? Yeah. Well, I mean, we were just talking really quickly about minimum
spanning trees and why we would ever be interested in it and why people would
use it. And hopefully it's helpful for people out there to learn about this kind
of odd application of, of a fairly messy concept. And yeah, so that's it from
the Micro Binfy podcast and we'll see you next time. Thank you so much for
listening to our podcast. If you like this podcast, please subscribe and rate us
on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on
Twitter at Micro Binfy. This podcast was recorded by the Microbial
Bioinformatics Group. The opinions expressed here are our own and do not
necessarily reflect the views of CDC, Theogen, or the Center for Genomic
Pathogen Surveillance.