Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the US.

Hello, and welcome to another Software Deep Dive, where we interview the authors of a bioinformatics software package. Today we're going to continue the conversation about the software developed around Kraken, and in the hot seat are Dr. Jennifer Lu and Natalia Rincon. Jen is a staff scientist working with Steven Salzberg in the Johns Hopkins University Center for Computational Biology. Nat Rincon is a PhD student in biomedical engineering, also in Steven Salzberg's lab. There was recently a paper in Nature Protocols, on which Jen and Nat were co-authors, about metagenomic analysis using the Kraken software suite, and there seem to be a lot of tools that have been developed around Kraken.

Yeah, we sort of touched on this last episode, but there are different versions of Kraken; it's not just one Kraken to rule them all. There's Kraken 2 and there's KrakenUniq, we've mentioned the unique k-mer counting that's now in Kraken 2, and then there's the chunked database loading for KrakenUniq. Can you break it down for us? What's Kraken lacking?

Oh, that was a good one. So the original Kraken, Kraken 1 as we refer to it, is the tool written by Derrick Wood back in 2013, 2014, and that was the start of this whole world. It introduced exact k-mer matching: we use a k-mer size of 31, with Jellyfish used to build the database. So that was the original. As time went on, some changes were needed, based on how we were using it in our lab and how other people were using it. KrakenUniq, I believe, was the next one released. KrakenUniq is very much based on Kraken 1: it uses the same database structure and the same algorithms for everything, and it just adds one extra step, which is unique k-mer counting. That provides an additional column in the report which, without going into all the details, tells you how many unique k-mers are covered by the reads you have for each taxon. The assumption is that if you're classifying reads against a species, you'd expect those reads to be fairly spread out across the whole genome rather than all being the same read, and so you'd also expect the number of unique k-mers to be fairly high. That extra column helps you verify things. So those were the first two of the tools.
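To make that unique k-mer idea concrete, here is a minimal Python sketch of what the extra KrakenUniq column is measuring. It is only an illustration, not KrakenUniq itself: the real tool estimates the number of distinct k-mers per taxon with a HyperLogLog-style counter rather than storing them all, and it handles canonical k-mers and a full taxonomy, none of which is reproduced here. The taxon labels and example reads below are made up.

```python
# Toy illustration of the "unique k-mer" column added by KrakenUniq.
# Counts, per taxon, how many distinct k-mers the classified reads cover.
from collections import defaultdict

K = 31  # Kraken's default k-mer size, as mentioned in the episode

def kmers(seq, k=K):
    """Yield every k-mer of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def unique_kmer_counts(classified_reads):
    """classified_reads: iterable of (taxon, read_sequence) pairs.
    Returns {taxon: number of distinct k-mers across its reads}."""
    seen = defaultdict(set)
    for taxon, read in classified_reads:
        for km in kmers(read):
            seen[taxon].add(km)
    return {taxon: len(kms) for taxon, kms in seen.items()}

if __name__ == "__main__":
    # The same read repeated adds read count but no new unique k-mers,
    # whereas reads spread across a genome keep adding new ones; that is
    # the signal the extra KrakenUniq column gives you.
    reads = [("taxA", "ACGT" * 20), ("taxA", "ACGT" * 20),
             ("taxB", "ACGTTGCA" * 10), ("taxB", "TTGGCCAA" * 10)]
    print(unique_kmer_counts(reads))
```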
But as we mentioned on the last episode, the number of genomes has increased significantly, so there needed to be smaller databases. We needed some way to make the databases a lot smaller, because 300 gigs of RAM was just not going to cut it for a lot of people. And that's when Kraken 2 comes into play. Kraken 2 was released by Derrick, Ben Langmead, and myself. The database structure is significantly changed and uses a probabilistic data structure, so you're sacrificing a little bit of accuracy for a very significant decrease in database size. Accuracy is fairly similar to the original Kraken, but because of the probabilistic data structure it is slightly decreased. It also uses minimizers, and there are other details I'm not going to go into, but it took that 300-gig database and brought it down to about 30 to 50 gigs, which is a very big change and a very big improvement. There were a number of additional changes that I won't go into either, but the one other thing I will mention is that Kraken 2 is also faster than Kraken 1.

Can I push just a little bit, since this is a bioinformatics podcast? Do you want to go a little bit into the algorithm? Not all the way, just a little. Into the Kraken 2 algorithm? Yeah, why minimizers?

Essentially, it's trying to map each k-mer to a shorter sequence, the minimizer, which then maps back into the original k-mer lookup process. There are spaced seeds and minimizers, which take the full k-mers and reduce them down in such a way that you're not biasing any particular sequence, but you're also kind of randomly masking out individual parts of each k-mer in order to shrink the database. That's the easiest way I can describe it without hurting my brain.

And I guess one of the things that makes it faster is some clever caching. Since we're using minimizers, we don't have to store the k-mers, and we don't need to jump straight to a specific k-mer: we go to a specific l-mer first, and everything is sorted so that when you go to that position it's now in the cache, and the next k-mer you look at is likely to have that same l-mer and be in that same group. That speeds things up a lot.

And then there are the other tools. So now we have KrakenUniq and Kraken 2, and the next question is, okay, which tool do I use? One of the things we did recently was to incorporate the KrakenUniq unique k-mer counting into Kraken 2. The one big difference is that it uses minimizers instead of k-mers, so it's counting unique minimizers rather than unique k-mers, but you now have the KrakenUniq functionality in Kraken 2. People still use KrakenUniq if they do have the RAM available for the large database and if they're a little more concerned about the slight decrease in accuracy with Kraken 2, so we suggest you can use either one. The other big update with KrakenUniq is that there is now a flag so that, if you have a large database, say 300 gigs, you don't need to load it all into memory at once. I guess we've been calling it KrakenUniq chunking; I'm not sure there's an official name for it, but it's in the same KrakenUniq code base. It's just another flag you specify when running KrakenUniq to say, I only have this much RAM available, can you run with just that much? It will then compare your sequencing reads against one portion of the database at a time, so you're sacrificing a little bit of speed in order to be able to classify against only a small chunk of the database at a time.
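As a rough picture of the minimizer idea Jen and Nat describe, here is a toy Python sketch that picks, for each k-mer, the smallest l-mer inside it, so that neighbouring k-mers usually share a minimizer and hit the same part of the index. The lengths (k = 31, l = 15) and the plain lexicographic ordering are placeholders for illustration only; Kraken 2's actual scheme uses its own default lengths, spaced-seed masking, and a hashed ordering, which are not reproduced here.

```python
# Toy minimizer sketch: map each k-mer to a shorter l-mer chosen from within it.
K, L = 31, 15  # illustrative lengths, not Kraken 2's defaults

def minimizer(kmer, l=L):
    """Lexicographically smallest l-mer inside a k-mer (toy ordering)."""
    return min(kmer[i:i + l] for i in range(len(kmer) - l + 1))

def minimizers_of(seq, k=K, l=L):
    """Pair every k-mer of seq with its minimizer."""
    return [(seq[i:i + k], minimizer(seq[i:i + k], l))
            for i in range(len(seq) - k + 1)]

if __name__ == "__main__":
    seq = "ACGTGCATTACCGGTTAGCATCGGATCAATCGGTACCAGT"
    # Adjacent k-mers often share the same minimizer, so consecutive lookups
    # land in the same region of the index, which is what helps the cache.
    for kmer, m in minimizers_of(seq)[:5]:
        print(kmer, "->", m)
```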
So those are the main Kraken tools and the iterations of Kraken. Then there are three other things I want to mention. The first is Bracken, which I'm the author of. It was written before the release of Kraken 2, but it works with both KrakenUniq and Kraken 2, and it provides abundance estimation. As I said, Kraken is a classification algorithm: it isn't going to give you read counts only at the species level, it's going to classify each read as specifically as possible, so you might have reads assigned at the genus level or the family level depending on how specific those reads are. But for people who want to know, as a percentage of their sample, how much there is of one species versus another, Bracken uses a Bayesian algorithm to estimate the abundance at any given level. You can do genus abundance, you can do species abundance, and it will re-estimate the read counts that way.

The next tool I want to mention is Pavian, which is near and dear to my heart, although I'm not the author of it. Pavian was written by Florian, the original author of KrakenUniq, and it makes analyzing these reports a million times easier. Pavian is a graphical interface written in Shiny, in R, that allows you to compare read counts across species and gives you some nice graphical interpretations of your sample. You just provide the Kraken reports and it gives you amazing visualizations, so you can see what's in your sample based on the report.

And the last tool, which Natalia and I have been working on, is KrakenTools. This is an ongoing project for us where we're creating a set of scripts that help people do downstream analysis, depending on what you're looking to use your Kraken results for. Are you after statistical metrics for measuring what's in your sample? Do you just want some visualizations? Are you trying to extract reads? KrakenTools is the catch-all set of scripts that should help you do all of these things. That wasn't exactly brief, but it's an overview of all the different tools we have in our wheelhouse to date.
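To sketch the shape of the Bracken re-estimation step described above, here is a toy Python function that pushes genus-level reads down to species. It is deliberately simplified and is not the Bracken algorithm: real Bracken derives its redistribution probabilities from how k-mers are shared among the genomes in the database (the Bayesian part), whereas this sketch just splits genus-level reads in proportion to the species counts already observed. The example numbers and species names are invented.

```python
# Toy version of abundance re-estimation: reads Kraken left at the genus level
# are apportioned down to species, here simply in proportion to the reads
# already assigned directly to each species.

def redistribute(genus_reads, species_reads):
    """genus_reads: reads assigned only to the genus.
    species_reads: {species: reads assigned directly to that species}.
    Returns re-estimated species-level read counts."""
    total = sum(species_reads.values())
    if total == 0:
        return dict(species_reads)  # nothing to apportion against
    return {sp: count + genus_reads * count / total
            for sp, count in species_reads.items()}

if __name__ == "__main__":
    # 1,000 reads stuck at the genus, plus reads classified to species:
    observed = {"S. enterica": 8000, "S. bongori": 2000}
    print(redistribute(1000, observed))
    # -> {'S. enterica': 8800.0, 'S. bongori': 2200.0}
```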
Yeah. So I think last time one of the great things about the whole Kraken suite was that Nat said she came into it as a family of tools. You're a third-year PhD student, so how did all of this strike you? It seems like if I were coming into it, it would be a little bit overwhelming: there are a ton of tools, and you're the next PhD student looking at all of them. Is it overwhelming, or have you got a handle on it? Or was it so easy that you already have a handle on it?

I guess I started with it as soon as I started my PhD. I was really lucky: right at the beginning of my PhD we got this sweet data set of ocular sarcoidosis and eye infections, patients and controls with whole-genome sequencing, so I was immediately put onto basically the infectious pathogen detection pipeline. That was one of my very first projects and it's the very first paper I had with the Salzberg lab, and it was one of the original reasons I wanted to join. I think having Jen around has basically been my lifesaver. So no, it hasn't been overwhelming, but it really could have been, and it's been really great.

That speaks to you being a good mentor. I do my best.

So what are some of the applications you've both been using this suite of tools for? When would you use one or the other?

Our lab focuses more on pathogen detection than on microbiome analysis, but I'll go through both. Microbiome analysis is something a lot of people have used Kraken for: if you look up papers that cite Kraken, there are a lot out there that use it just to give a general overview of the microbes in a sample. That can be anywhere from microbes in drinking water and various waterways to the gut microbiome and other human microbiome studies, and Kraken is very useful in those cases. I even looked at whether we could use Kraken to detect the kinds of pathogens and microbes that might be in the Inner Harbor water in Baltimore. But I would say that our particular lab does a lot more pathogen detection. One of the really cool things is that our lab has been able to use Kraken to detect infections, that is, microbes in samples where they shouldn't be. We're very fortunate to work with Johns Hopkins Hospital and some doctors there who have patients they need to diagnose. They'll take a CSF or brain biopsy sample, we'll send it through sequencing, and then we'll analyze it with Kraken to try to detect what might be infecting the patient. There are many cases where we aren't able to find anything, but there are some very cool cases that have been published: we were able to detect tuberculosis in one patient, for example, and a viral infection in another, and successfully diagnose those patients using a combination of Kraken and Pavian.

And to add to that, in terms of whether you should use Kraken 2, KrakenUniq, or Kraken 1: we usually use KrakenUniq, built on Kraken 1, for the infectious pathogen detection, since Kraken 2 has a small false-positive rate that, for us, isn't good enough when you're trying to find that needle in a haystack. But if we're just looking at how much diversity is in a sample, or a rough estimate, not even that rough, of the abundance of species, Kraken 2 is much faster and needs less storage, so we use it for those sorts of problems.

And I will say that you can build both a Kraken 2 database and a KrakenUniq database in the same folder. They use different file names and different database structures, so you don't need two separate storage areas; for a lot of our databases we have both a KrakenUniq and a Kraken 2 database built, and you don't need two copies of the library files or two copies of the taxonomy or anything like that.

Do you have any advice for people who want to do microbiome analysis? Because quite often you have an over-representation of pathogens versus anything that's not a pathogen. If you look at Salmonella, virtually everything in there is going to be Typhimurium or Typhi, that kind of thing, because those cause serious disease in animals and humans.

Well, we typically would just run Kraken and then Bracken. One of the big difficulties around Kraken is that you're very dependent on what's in your database, and that in turn depends on what's been sequenced, what's been genotyped, and what's actually available in NCBI RefSeq Complete, right?
What we have found is that those databases are heavily bacterial and viral. You're not going to have a lot of the eukaryotic pathogens, simply because of what's been studied, and you have to be aware of that when you're building your database. Bracken should help adjust some of those read counts during the re-estimation, based on which reads are unique and how many reads you'd expect to get classified with just the base database. But that is something we're aware of; I don't have a better answer. It's all about what genomes are out there, so we're reliant on that.

Yeah, so I guess if you're shedding DNA during an infection, which is what happens, then the database itself can't really account for that over-shedding of DNA from an infection. That's just beyond the scope, huh?

Yeah, for infections we wouldn't really use the Kraken results to say how much of the infection is there; it's presence or absence rather than how much. We're still refining how we use sequencing for pathogen detection, but we're also reliant on the doctors themselves, and on verification from pathologists, for any of our infectious disease analyses.

That's fair. In bioinformatics we're not alone; we work with other people. So, I have some other use cases that I use Kraken for, but I'd like to hear whether you've come across any special use cases that aren't metagenomics specifically. Have you seen anyone use Kraken in a surprising way? I'll give you an example of what I do. Part of our quality control analysis is taking a single genome and treating it like a metagenome. The null hypothesis is that it is a single genome and that there are no conflicting taxa in the reads, and Kraken can give us a result showing whether or not there is a conflicting taxon in the reads. So if we have over a certain percentage of reads that are conflicting, say Listeria reads in our E. coli sample, then we can say there's some contamination there.

We do a lot of work with contamination; I think that's something Steven Salzberg is very passionate about, contamination in genomes. I hadn't really thought about this, but we have used Kraken to detect contamination in, for example, eukaryotic pathogen genomes. As I said, we're very reliant on the databases and what's out there, and NCBI RefSeq Complete has very few eukaryotic pathogen genomes, but there are many, many draft genomes for a lot of eukaryotic pathogens. So we actually use Kraken to detect any and all contamination in those draft genomes. We compare all these eukaryotic pathogen genomes against, say, bacteria and human, and we even built very large databases for vertebrates and plants, to try to mask out anything that could be contaminating. Very often we found that a lot of these eukaryotic pathogen assemblies contained contaminating sequences from their hosts, the very organisms you'd expect them to be infecting. For example, some of these eukaryotic pathogens were various forms of malaria found in, say, chicken or cow, and you'd find a lot of chicken DNA that we had to mask out of those draft genomes. So we use Kraken for a lot of this contamination detection work.
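Here is a small Python sketch of the isolate QC idea Lee describes: parse a Kraken-style report and check whether the expected species dominates the classified reads. It assumes the standard six-column report layout (percentage of reads, clade read count, directly assigned read count, rank code, taxid, indented name); the 90% cut-off is only an example threshold, not a recommendation from the authors, and the file and species names are placeholders.

```python
# Toy isolate QC screen: treat a single-genome run like a metagenome and flag
# it if the expected species does not dominate the Kraken report.
import csv
import sys

def species_fractions(report_path):
    """Return {species_name: percent_of_reads} from a Kraken-style report."""
    fractions = {}
    with open(report_path) as handle:
        for row in csv.reader(handle, delimiter="\t"):
            pct, _clade, _direct, rank, _taxid, name = row[:6]
            if rank == "S":  # species-level clades only
                fractions[name.strip()] = float(pct)
    return fractions

def passes_qc(report_path, expected_species, min_expected_pct=90.0):
    """True if the expected species holds at least min_expected_pct of reads."""
    return species_fractions(report_path).get(expected_species, 0.0) >= min_expected_pct

if __name__ == "__main__":
    # e.g. python qc_check.py sample.kreport "Salmonella enterica"
    report, species = sys.argv[1], sys.argv[2]
    print("PASS" if passes_qc(report, species) else "FAIL: check for contamination")
```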
And I know that we're now also including CHM13, the full T2T Consortium human genome, in our databases, which has yielded some interesting contamination results as well.

I run Kraken on all of my genomic data, even if it's isolate data, because I don't believe what anybody tells me. Someone says it's a Salmonella. Yeah, okay, sure. Sure it is. Quite often it's not. And if you ever want to see how often people get it wrong: we run Kraken on all the genomes in Enterobase, and we expect 90% of the genomic content to be the species we expect it to be; otherwise we just say something's gone wrong. And yeah, there's a lot of trash. People saying, oh, it's a Salmonella, and no, it's not.

So just to close out this episode, here's a question for both of you: what's next for Kraken and the Kraken development team?

Yeah, so we're continuing to maintain all the GitHub repositories to make sure that Kraken continues to be an accurate and fast tool for everybody. And we'll continue to keep an eye on the metagenomics and classification fields to see if there are any improvements that can be made. KrakenTools is going to continue being developed as the need arises for new scripts and downstream analyses; we're going to keep adding to the KrakenTools suite to make it as complete and as useful as possible. And one other thing, which I think is the next thing we need to tackle if people are going to keep using Kraken as more and more genomes become available: we need to figure out how to make the database smaller. So we're looking at different ways of indexing and sketching to make that more doable, or even hosting it on a compute resource; we're trying to figure out how we should continue and how we can improve it.

That's awesome, guys. I look forward to the future. Today we've been talking with the developers, Dr. Jennifer Lu and Natalia Rincon, about the software suite built around Kraken, the taxonomic classification tool. Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at MicroBinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.