Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S.

Hello. Welcome to another Software Deep Dive, where we interview authors of a bioinformatics software package. Today, we're talking about Kraken, the taxonomic classification software, and in the hot seat are Dr. Jennifer Lu and Natalia Rincon. Jen is a staff scientist working with Steven Salzberg in the Johns Hopkins University Center for Computational Biology. Nat Rincon is a PhD student in biomedical engineering, also in Steven Salzberg's lab. There was recently a paper in Nature Protocols, on which Jen and Nat were co-authors, about metagenomic analysis using the Kraken software suite, and there seem to be a lot of tools that have been developed around Kraken. So today, we're going to deep dive firstly into Kraken itself and then into the wider Kraken cinematic universe. Let's kick it off with something easy for people who've been, you know, living under a rock. What is Kraken? What's the problem that Kraken is trying to solve?

So Kraken was developed back in 2013-2014 to be classification software. And for those that are not as familiar with the space, that basically means it's trying to take sequencing reads and tell you, as specifically as possible, what those reads are. So it's going to assign them to some kind of taxon: it might assign them to a specific species, or, if a read is DNA that's shared between different genomes across the taxonomic space, it'll assign them to the genus level, or even sometimes just to a general bacterial level. So it tries to tell you what those reads are.

So when someone wants to look at reads and they want to figure out how to classify them, why does someone choose Kraken over a different piece of software? Like, what are the unique selling points?

So when Kraken came out, it was one of the fastest programs out there, especially compared to BLAST. A lot of people are very familiar with BLAST, where you can just BLAST reads and it'll try to tell you what they match. But because of the algorithm that Kraken uses, which is exact k-mer matching, it's significantly faster and, I would say, uses less memory. And especially when you have millions or billions of reads, which you might have in any given sequencing experiment, it's just going to be a lot faster at telling you what's in your sample. And I guess to add to that, I would say that with BLAST you'd want to do shorter sequences because of the speed, whereas with Kraken you can really do classification of whole genome sequencing samples.

Other than just using BLAST, what other tools were out there at the time?

So if you look at many of the comparison papers that are out there today, you'll see that Kraken is often compared to MetaPhlAn and to MegaBLAST.
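To make the exact k-mer matching idea described above a little more concrete, here is a minimal, illustrative Python sketch of the general approach: look up every k-mer of a read in a database that maps k-mers to taxa, then place the read at the lowest taxon consistent with the hits. The toy taxonomy, database, and k value are invented for the example, and the placement rule shown (the lowest common ancestor of all hits) is a simplification of Kraken's actual weighted root-to-leaf scoring.

```python
# Toy sketch of k-mer based read classification.
# The taxonomy, the k-mer database, and k are all invented for illustration.

# Parent pointers for a tiny taxonomy: species -> genus -> root.
PARENT = {
    "E. coli": "Escherichia",
    "E. fergusonii": "Escherichia",
    "Escherichia": "root",
    "root": None,
}

K = 5
# Minimal "database": k-mer -> lowest common ancestor of genomes containing it.
KMER_TO_TAXON = {
    "ACGTA": "E. coli",        # k-mer seen in only one species
    "CGTAC": "Escherichia",    # k-mer shared by several species in the genus
    "GTACG": "Escherichia",
}

def lineage(taxon):
    """Path from a taxon up to the root."""
    path = []
    while taxon is not None:
        path.append(taxon)
        taxon = PARENT[taxon]
    return path

def lowest_common_ancestor(taxa):
    """LCA of a set of taxa in the toy tree (deepest shared ancestor)."""
    shared = set.intersection(*(set(lineage(t)) for t in taxa))
    return max(shared, key=lambda t: len(lineage(t)))

def classify(read):
    """Look up every k-mer of the read; place the read at the LCA of all hits."""
    hits = {KMER_TO_TAXON[read[i:i + K]]
            for i in range(len(read) - K + 1)
            if read[i:i + K] in KMER_TO_TAXON}
    return lowest_common_ancestor(hits) if hits else "unclassified"

print(classify("ACGTACG"))  # species- and genus-level hits -> "Escherichia"
print(classify("TTTTTTT"))  # no hits in the database -> "unclassified"
```

In the first example, one k-mer is species-specific and another is shared across the genus, so the read is pushed up to the genus, which mirrors the behaviour described above for reads containing DNA shared between genomes.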
There are various other tools that have also been brought up in comparison to Kraken: QIIME being one, if we're talking about the 16S space, and Centrifuge, which was also developed in Steven's lab but approaches the problem in a very different way. And so there are a number of comparison papers that, I suppose, will tell you various different results as to which program is actually better. I've learned to skim through these comparison papers, but not give them too much heed. There's quite a bit of ongoing debate, but I think one thing that is notable is that, through all of these comparisons and as the space continues to evolve, Kraken continues to be one of the leading tools in the space. I consider Kraken to be a fairly easy-to-run suite. It's fairly fast, and with the new improvements it doesn't require too much memory. And I think it does well at classification, well at what it's supposed to do.

You said you don't pay them much heed, but has there ever been some classifier that came out where you thought, they did a good job at that, I didn't address that, or that did something different from mine, or that's a good idea that I should incorporate too?

I think there have been instances. So, for example, in the 16S space, there's a recent tool called Emu that came out. And I think that, because of the ways the databases are structured, and 16S is a whole different world in and of itself, I consider Emu to be a pretty good tool for doing 16S classification, especially in comparison to Kraken. I think there are still some bugs to be figured out, but I think that's more on the database side rather than the tool itself. I haven't really found that Kraken itself, as a tool and as a method, needed to be improved upon significantly. Obviously, since its inception, one of the big things was decreasing the memory size of Kraken databases, and that's something that has been addressed in recent versions. The actual classification method itself I consider to be pretty good. We're constantly looking at the databases, at the increasing database size, and just trying to make sure that runtimes and memory usage stay limited.

For those people who don't know, what would be the typical amount of resources required to run Kraken?

So it changes by the day, based on the number of genomes that are out there. The base database for Kraken uses all the complete genomes in NCBI RefSeq: the bacterial, archaeal, and viral genomes, plus human, currently GRCh38. Originally the databases were not significantly big; years ago they probably ran on the order of 50 gigs. But because of the increase in the number of genomes, the databases for Kraken 1 got to 300 gigs. And so that's when we had to really scale back and change the way the databases are stored. Kraken 2 now addresses this and brings the database size down to about 30 to 50 gigs of RAM, which I think is fairly reasonable. But then you also have the option of MiniKraken. And so this has been a constant: since the creation of Kraken, we've been aware that people are not always going to have 30 gigs of RAM on whatever system they're running Kraken on. And so they provided MiniKraken, where you could say, OK, my system only has, say, 8 or 16 gigs of RAM.
Can I run Kraken still? And so you can choose to make a MiniKraken database, where you specify how much memory you want the database to take, and it'll build the database in 8 or 16 gigs of RAM. This will decrease the sensitivity, so you'll have more unclassified reads, but it still allows you to run Kraken. The only thing of note with the different database sizes is that if the database is smaller, you have less information to compare against, so you'll likely have many more unclassified reads. So if you're really doing an in-depth analysis, you'll probably need some external compute resources, but you can potentially run it on your laptop or with limited resources.

So while we're on the subject of databases, my go-to at the moment is using the GTDB taxonomy with all the genomes that they pull in. What is your preferred taxonomy?

For our own projects, we've always just gone with NCBI RefSeq, as that's changed and evolved. But I know that some people do use the GTDB taxonomy, and I know that there are people that have built Kraken databases for it; specifically, they do provide GTDB Kraken databases. And that's the other thing that is kind of useful: through Ben Langmead's lab, we provide Kraken databases that are pre-built, so you don't have to go through the process of getting the genomes and building them, and the same goes for the GTDB Kraken databases. These are pre-built databases that you can just download and then just run Kraken. Otherwise, you would have to build your own database, which for some people can take a lot of time, be a little bit more difficult, and run into other issues.

I was wondering, who are the major people currently involved in the development process, other than both of you?

So Kraken itself was originally conceptualized and created by Derrick Wood and Steven Salzberg. And so I would consider Kraken to be kind of Derrick Wood's baby, in a sense. It's been a project that he's been on since the beginning and a project that he's very, very involved in. Over the years, as we've re-evaluated how Kraken reports things, and as Kraken itself has evolved and gotten better, other people have been involved in the process. Florian Breitwieser wrote KrakenUniq, which we'll talk about later, and he and Steven created that new version of Kraken, which was based off of the original Kraken. So it uses the same concept, but gives you more information. Other people involved in the maintenance of KrakenUniq are Christopher Pockrandt, who's now in Germany, and Alexei from Steven's lab, who is still maintaining the software. Others that have also been involved include Ben Langmead, who I mentioned; he's the one providing these pre-built Kraken databases, and he and Derrick worked together to create Kraken 2, which is the main version of Kraken today. Martin Steinegger, who's also on the protocol paper with us, is the one that headed the efforts for the protocol paper alongside me and Nat. So between all of us, that's quite a large group of people. I've kind of become the face of Kraken in a way, where I'll still be looking through all the GitHub issues and trying to help with this.
But yeah, so a lot of us are now involved, and we're developing more and more tools and things to assist with Kraken, just trying to make sure that it's usable for everybody. And I guess the only thing I'd add is that Jen wrote Bracken for abundance estimation, and you can feed the Kraken output into Bracken. And then I'm part of the newest additions to KrakenTools, which are the diversity ones. So after you have your abundance estimation, you can easily find alpha diversity and beta diversity within the same suite.

So Nat, you're kind of writing the next addition, the newest twist on Kraken. What can we expect? What are alpha and beta diversity?

So those two are just more specific metrics for when you're trying to look at diversity in a sample. In the protocol paper, in the first section, we have two pipelines, sort of. The first one goes through just looking at reads, knowing that you have two different diversity levels, and being able to have a quantitative metric for it, which is what alpha diversity and beta diversity give you. One is within-community richness, and the other is more about comparing between communities.

That's nice. So I think I'll address the elephant in the room. Do you guys know where the name comes from?

So I believe, and I was not involved in the creation of this name, so this is what I've heard from the people that were originally involved: they just chose Kraken. So it's not an acronym; they chose Kraken as a kind of mythological creature. I think it came after the fact that the original Kraken relied on Jellyfish, and Jellyfish is a k-mer counting tool. You still use it for k-mer counting today, but in the original Kraken, you relied on Jellyfish to create the database. And so they kind of wanted to go off of that, the sea creature kind of thing, and make it more of a mythological sea creature. And so that's how they chose Kraken. But it's not an acronym, as far as I'm aware. And if Derrick decides to let us know otherwise, I'm sure you'll hear from him.

Have you ever heard Derrick say, release the Kraken?

Many times. He and Steven did write a paper recently on releasing the Kraken, where they went through the process of creating the software and how they conceptualized it. Yeah, and that was a 2021 paper that they released.

So once I've run Kraken, I've put my reads in and it's matched them against the database, what are the outputs I can expect from base Kraken?

So base Kraken will give you two outputs. One is a very long text file where, for every single read, it's going to give you the read ID, whether or not it was classified, and then the taxonomy ID of the classification, along with a breakdown of all the k-mers. And so Kraken uses exact k-mer matching to try to tell you what's in your sample, but then it also decides, for each read, okay, this is what this read is: it belongs to this genus, or it belongs to this species. So it gives you that very long text file, where every single line is a different read with its taxonomic classification. But then the other output that Kraken provides, which I think is a little bit easier to read, is the Kraken report, which gives you a breakdown, for every taxonomy ID, of how many reads you have. And so there are kind of two numbers that it gives for every taxonomy ID.
One is, for that taxonomy ID, how many reads are classified directly to that ID. And the other is for that taxonomy ID and its whole subtree. So, for example, if you have a genus, it's going to say, okay, there are maybe 100 reads that belong just to this genus, and I can't specify any further what they belong to. But then it's also going to give you another number that's going to be larger, say, okay, there are 150 in this subtree. And so that would mean there are 50 reads that I can assign to species within that genus, but there are 100 that are just at the genus, and together that's the 150. And I'm sorry if that doesn't make a ton of sense. It's easier if you look at the report itself.

Yeah. So that's just the count. How does someone interpret that report? What would you say is a meaningful hit that a particular species is actually there? I mean, is five reads enough, or a thousand reads? How do you guys interpret the report?

I think it really depends on what the sample is originally. We've actually found, and this is the reason why we don't tell people to filter initially, that taxa with very few reads can still be incredibly meaningful. And I think we found this when we were looking at brain samples, where we sequenced brain samples, classified them with Kraken, and looked at the reports. We compared the reports to each other and found one instance where a sample had very few reads, I think only on the order of maybe a dozen reads, that belonged to a species, but that species was not found in any of the other samples. And so that kind of raised a flag for us. And it was like, okay, well, this probably isn't a contaminant, so what can we look into further? And we found that it was a true pathogen, a true species that was in the sample. And that was really important for our downstream analysis. There are cases where you care more about the overall picture of what's in your sample, and so you might not care about those very few reads hitting various species, but there are instances where you want to look at all of them collectively.

And I guess one thing I'd add to that is that, depending on what you're using it for, you might want to use KrakenUniq versus Kraken 2. So if we're doing infectious pathogen detection, like in the brain samples, and we've done it for these corneal samples, we really are looking for a couple of reads that will give us a candidate pathogen as the cause of the infection. So we really are looking for 5, 10, 20 reads. But if you're really just looking at diversity in a sample, Kraken 2 is faster. Maybe it won't give you the sort of resolution that KrakenUniq will, but yeah.

Great. Are you also kind of making a distinction between Kraken 1 and Kraken 2 when you say KrakenUniq, because it's basically counting k-mers versus comparing at the minimizer level? So is that going to be something that gets in the way? Is that kind of what you're saying?

So the KrakenUniq functionality is included now in Kraken 2. And so when we say KrakenUniq, we're talking about running... so there is KrakenUniq and there is the unique-counting mode in Kraken 2, and these are pretty much the same thing, the same concept. And yes, they are k-mer versus minimizer.
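Going back to the report format described a moment ago, here is a short Python sketch that reads a standard Kraken-style report and prints the two counts side by side. It assumes the commonly documented six-column, tab-separated layout (percentage of reads in the clade, reads in the clade, reads assigned directly to the taxon, rank code, taxonomy ID, indented name); if extra columns such as unique-minimizer counts are enabled, the positions shift, so treat the parsing as a sketch to adapt rather than a finished tool.

```python
# Sketch: parse a Kraken-style report and show clade vs. direct read counts.
# Assumes the common six-column layout:
#   % of reads in clade, reads in clade, reads assigned directly,
#   rank code, taxonomy ID, indented scientific name.
import sys

def parse_report(path):
    rows = []
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 6:
                continue
            pct, clade_reads, direct_reads, rank, taxid = fields[:5]
            rows.append({
                "percent": float(pct),
                "clade_reads": int(clade_reads),    # reads in this taxon plus its subtree
                "direct_reads": int(direct_reads),  # reads that stop exactly at this taxon
                "rank": rank,
                "taxid": taxid,
                "name": fields[-1].strip(),         # name is indented to show the tree
            })
    return rows

if __name__ == "__main__":
    for row in parse_report(sys.argv[1]):
        if row["rank"] == "G":  # genus-level lines, as in the example above
            print(f"{row['name']}: {row['clade_reads']} reads in the subtree, "
                  f"{row['direct_reads']} assigned directly to the genus")
```

In the genus example above, such a line would show 150 reads in the subtree and 100 assigned directly to the genus itself.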
But conceptually, it just means that you're going to get an additional column in your Kraken report, which counts the number of unique k-mers or minimizers, and that can help validate the result, validate that count. Because what we found is that if there is contamination, either in the database or in the sample, the number of unique k-mers or minimizers is not going to reflect the number of reads. So you might have a lot of reads but very few k-mers, and that probably means there's some kind of contamination somewhere.

Can I ask, in terms of the input data, what is the impact of, say, using Nanopore over Illumina? And do you do things like check to see, if you have forward and reverse reads for Illumina, you know, whether it's calling the same species for both reads of a pair? Is there anything like that?

We typically suggest... so Kraken was built with Illumina reads in mind, and so it is kind of assuming that error rate. What people have found with Kraken and Nanopore reads, which you can run Kraken with, is that because of the higher error rate there is a bit of a loss in accuracy when it comes to classification. So you would probably want to adjust parameters slightly. If you were building a database with Nanopore reads in mind, you would probably want a smaller k-mer size. We haven't quite figured out what that k-mer size might be, but Nanopore is giving you longer reads, so there are more k-mers in each read that you could classify against. And so it is that balancing act of trying to figure out what the best parameters are.

So that is actually a really important take-home, because I've been using a database built for Illumina, but with Nanopore data. So I need to change things, I think.

The interesting thing is that what we've found is that, because Nanopore chemistries have been changing so frequently, it's a lot harder to fine-tune a Kraken database to a Nanopore dataset when the accuracy of those reads is constantly changing. Nanopore is, I would say, a comparatively new technology, and they're still fine-tuning the chemistry. We work very closely with Winston Timp's lab, who are always testing new chemistries, and every time we think we might get close to the perfect parameters for Nanopore reads, suddenly the chemistry has changed, and so the accuracy of the reads has changed. So I will say that I've found it's OK to just use the same base databases for both Nanopore and Illumina reads. Just be aware that you might have slight differences in the classification because of the difference in accuracy.

So I had, this is sort of a help desk question. One of the questions I run into is that sometimes reads are assigned to the root, but no further, and sometimes reads are unclassified. What's the difference?

So for the reads that are unclassified, it means they were not found in the database; they didn't have any hits. If a read is classified at the root, it means it had many hits across the taxonomy, so we can't place it at a more specific level. That's basically it. I think, in that instance, for the unclassified reads, you might want to look into those further. And so you can extract out the unclassified reads and try to see what else they might hit. And remember that the base Kraken databases are only the microbes plus human, and so you might have, say, some vertebrate DNA in there, or some plant DNA, that's not being detected.
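For that "extract out the unclassified reads" step, Kraken 2 can write unclassified reads to a file directly (via its --unclassified-out option), and KrakenTools includes a script for pulling out reads by taxon, but a minimal do-it-yourself version is easy to sketch in Python. This assumes the usual per-read output layout, where the first tab-separated column is C or U (classified or unclassified) and the second is the read ID; the file names here are made up for the example.

```python
# Sketch: collect unclassified read IDs from Kraken's per-read output and
# filter the original FASTQ down to just those reads.
# Assumes column 1 is C/U and column 2 is the read ID; file names are examples.

def unclassified_ids(kraken_output):
    """Read IDs of every record Kraken left unclassified."""
    ids = set()
    with open(kraken_output) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2 and fields[0] == "U":
                ids.add(fields[1])
    return ids

def filter_fastq(fastq_in, fastq_out, keep_ids):
    """Copy only the FASTQ records whose IDs are in keep_ids."""
    with open(fastq_in) as fin, open(fastq_out, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # FASTQ records are 4 lines
            if not record[0]:
                break  # end of file
            read_id = record[0][1:].split()[0]  # strip '@' and any description
            if read_id in keep_ids:
                fout.writelines(record)

ids = unclassified_ids("sample.kraken")
filter_fastq("sample.fastq", "unclassified.fastq", ids)
print(f"kept {len(ids)} unclassified read IDs")
```

The extracted reads could then be checked against something broader, for example BLAST or a Kraken database that includes more than the standard microbes-plus-human set.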
And so I think those are of interest. The ones that end up at the root, I think, end up there because of contamination, or just because of k-mers that may be shared between, say, vectors and human. Those k-mers are found across the taxonomy at such a distance that Kraken is saying there's nothing it can really assign them to. It might be DNA that's found in both bacteria and humans, or something like that, and that will cause the reads to sit at the root. And those just aren't very informative at all.

So you said that some of them might be vectors or something similar, kind of artificial. I've often heard people asking about the kitome; I think this also comes from Andrew sometimes. Have you ever seen someone use a kitome Kraken database, or something similar, to make use of that?

I don't personally know much about that. In our databases, we always include a vector database, vector sequences, as well, just to make sure that Kraken detects them and doesn't confuse them for something else. But apart from that, I don't know much about the kitome.

Yeah, like you might give the kit contaminants, like the vectors, a different taxonomy ID, just a little bit under the root or something. Have you? OK, I guess you got the answer to that one.

Yeah, they have their own taxonomy ID; I believe synthetic sequences is what Kraken specifies them as.

So just to get nerdy for a second, what language is it all written in?

It's a mix of Perl and C++.

That's brilliant to hear. I wouldn't have guessed.

I have talked to Steven about this previously, and, not to make anybody feel old, but he says it's a bit of a generational difference, Perl versus Python. For many of the tools that we're developing today, we're writing them in Python and not in Perl. And so I guess that the generational difference fell sometime between Derrick and us.

So what is actually written in C++? Which bits?

A lot of the database building. I think Perl is just used for some of the basic stuff, processing the inputs, but C++ is used for a lot of the database building, the compacting of all the sequences, and the writing of the bytes and so on. Kraken databases are very specifically designed to fit all of that genome information into a very small space, and a lot of that heavy memory management is written in C++. And the classification itself as well, because it needs to read in all of that information and be able to search through it in a short amount of time.

Yeah, I had a little bit of an appreciation for the Perl in there when I was looking at it one day, and it was actually editing the Jellyfish database. I was just staring at it and learning from it that day. I don't know if I totally absorbed it, but it opened my eyes that I could actually edit a binary file like that and insert the taxonomy IDs. It was cool.

Which features are you both most proud of regarding Kraken?

I think for me, it's just the simplicity of the classification algorithm. Using exact matching of k-mers seems like such a simple concept, but it spawned this whole Kraken world of tools that are very widely used, because it's a tool that's very fast and very good at what it does. Yeah, so that's basically what I'm most proud of.

And then for me, it's probably just being able to be part of this community, right?
Joining this lab and sort of getting led by Jen to, I don't know, help continue the Kraken world. I don't know.

That's awesome. All right. Thanks, everybody. Today, we've been talking with the developers, Dr. Jen Lu and Nat Rincon, about Kraken, the taxonomic classification software, and the wider suite of tools around it. I think we'll have a little bit more to do in the next episode, so stick around. See you next time.

Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.