Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Hello, and welcome to a new thing we are doing called Software Deep Dives, where we interview the author of a bioinformatics software package. Today, Henk den Bakker is in the hot seat with Sepia. He's an assistant professor at the University of Georgia; that's the state, not the country, in the U.S. I work with him through the food safety informatics group at the University of Georgia. Henk is an alum of the Wiedmann Lab and has a rich history working on things like Listeria and Campylobacter, and, correct me if I'm wrong later on, I think fungi and a lot of other things out there. He got into computational biology and bioinformatics from working on all these different things. We're interviewing him today on Sepia. First of all, Henk, what is Sepia? I pronounce it 'Sepia', but that's not to correct you. Sepia is, I would say, yet another read classifier. That's what it is. And why do we need it? Being a taxonomist and somebody who uses read classifiers a lot, there were just a lot of things that I wanted to have in a read classifier that I didn't have yet, and I wrote Sepia to address all those things. So Sepia uses a couple of data structures. One of them is the compact hash table that Kraken 2 uses, and actually some of the principles and algorithms that Kraken 2 uses to classify reads. And one of the main components in read classification is the use of taxonomy. Taxonomy is very important, and I'm a taxonomist by training. So integrating new taxonomies, for instance the GTDB taxonomy versus the NCBI taxonomy (there are several taxonomies currently being used in the metagenomics field), and seeing how that influences your ability to classify reads, especially reads from organisms that are not necessarily well known, really interests me. So it's a tool to experiment with those things. The other thing is that I'm interested in data structures. The compact hash table of Kraken 2 is already pretty compact, but can we make it even more compact by combining it with something like a perfect hash function? So Sepia has those things. So maybe to step back a little bit, you said hash table there. How are you actually storing the sequence information? How are you encoding it? So the sequence information is currently encoded in bits. Basically, the compact hash table, or the hash table with the perfect hash function, is one big vector containing unsigned 32-bit integers. We can use those unsigned 32-bit integers to store both information from part of the sequence and the associated taxon that goes along with that sequence. So we first hash the sequence, then we use that hash to find the position in that vector, and then we use either the hash value of your minimizer, or part of your sequence, to confirm whether that's a good match or not.
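A minimal sketch of the packed-cell idea just described, loosely in the spirit of Kraken 2's compact hash table. The 12/20 bit split, the names, and the single-slot lookup are illustrative assumptions, not Sepia's actual code:

```rust
// Illustrative only: pack a short minimizer fingerprint and a taxon ID
// into one u32 cell, so the whole index is a single big Vec<u32>.
const TAXON_BITS: u32 = 20; // assumed split: 20 bits taxon, 12 bits fingerprint
const TAXON_MASK: u32 = (1 << TAXON_BITS) - 1;

fn pack(fingerprint: u32, taxon: u32) -> u32 {
    (fingerprint << TAXON_BITS) | (taxon & TAXON_MASK)
}

/// Hash the minimizer, jump to a slot, then compare the stored fingerprint
/// to confirm the match before trusting the taxon ID in the cell.
fn lookup(table: &[u32], minimizer_hash: u64) -> Option<u32> {
    let slot = (minimizer_hash % table.len() as u64) as usize;
    let cell = table[slot];
    if cell == 0 {
        return None; // empty slot, no hit
    }
    let fingerprint = ((minimizer_hash >> 52) & 0xFFF) as u32; // top 12 bits
    if cell >> TAXON_BITS == fingerprint {
        Some(cell & TAXON_MASK) // recovered taxon ID
    } else {
        None // fingerprint mismatch: a collision, not a real match
    }
}

fn main() {
    let mut table = vec![0u32; 1 << 16];
    let h: u64 = 0x9e37_79b9_7f4a_7c15; // stand-in for hash(minimizer)
    let slot = (h % table.len() as u64) as usize;
    table[slot] = pack(((h >> 52) & 0xFFF) as u32, 1280); // 1280: made-up taxon
    assert_eq!(lookup(&table, h), Some(1280));
}
```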
So how long would your k-mers be then in that case? So currently my k-mers can go up to 31 base pairs, but you don't have to use the full 31 base pairs. One of the things that I'm excited about is that I can actually extend the size of the k-mers so we can go up to 64 base pairs. The language that I'm using, Rust, has an unsigned 128-bit integer type, so we should be able to make it even bigger; I don't know how that affects the performance of the software. So I got really excited when I looked at your code, because I saw this and thought it was Perl. It has many of the same constructs as Perl and a similar kind of layout and syntax and whatnot. But I was very disappointed when you told me you'd abandoned Perl for some other frivolous fly-by-night language called Rust. Can you tell me more about that? So Rust, let's see. First I probably have to explain that I never abandoned Perl. I can read Perl, and I write most of my scripts and things like that in it. Where Python is fast enough, I use Python; that's my go-to language at the moment. But if I write code that is really performance critical, like a read classifier, where I want to classify a couple of million reads across tens of datasets within a limited amount of time, that's where I use Rust. If we go to the Rust website, they say it much better than me: a language empowering everyone to build reliable and efficient software. And I think you can really get the same performance out of Rust as C and C++; you can get to that level. It seems to be C with some types thrown in on top of that. Yes. So it's a compiled language. Awesome. I don't see it, actually. Which parts are you seeing that have Perl in them? I feel like when I started learning it, going from Perl to Rust, it just blew me away. I had to go step by step through the tutorial and learn a whole new language. Well, the way I see it is, you look at it and it says 'use' and then a library and then a semicolon, you know. That's very Perly. True, true. Okay. Which got me. And then, you know, all the curly brackets and stuff like that. It is a very beautiful language, actually. Yeah, I think so. Honestly, my experience was getting so frustrated with it, but then being gobsmacked when it was performing: I translated my Perl over to Rust and I got like a 10 or 20 fold speed increase. That's insane. Yeah. This language is insane. Sorry. Back to you, Henk. Yes. That's absolutely the reason I chose it. The other thing is, I can read C++ and I can read C, just to look at algorithms and at the details of people's code. But what always frustrates me there is having to skip between files, like your header files and whatever you need. Here you just have one file; that's where your code is. And it's not overly verbose like Java either, where you have to put in a million different objects and stuff like that. Yep. Okay. So maybe let's get back to why you didn't just use Kraken 2 and why you went and made your own classifier. So I think Kraken 2 is fabulous and it's fast, but there are just things. I don't know if it already exists, but one of the things that frustrated me was that there wasn't a batch mode. So if you start a Kraken 2 run, the first thing that the software does is load the index, or database, whatever you want to call it. And if it's large, no matter how big your computer is, that takes a long time. It usually takes longer than the actual action of classifying your reads.
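To make the encoding point concrete, here is a minimal sketch of two-bits-per-base packing (an editorial illustration, not Sepia's code): 31 bases fit in a u64 with two bits to spare, and Rust's u128 holds up to 64 bases.

```rust
// A=00, C=01, G=10, T=11; two bits per base.
fn encode_kmer(kmer: &[u8]) -> Option<u128> {
    let mut packed: u128 = 0;
    for &base in kmer {
        let bits: u128 = match base {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None, // ambiguous base: discard this k-mer
        };
        packed = (packed << 2) | bits;
    }
    Some(packed)
}

fn main() {
    assert_eq!(encode_kmer(b"ACGT"), Some(0b00_01_10_11)); // 27
    assert_eq!(encode_kmer(b"ACNT"), None);
}
```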
Absolutely. So one of the things Sepia can do is a batch mode, where loading the database is done once, and then you can specify a delimited file with your sequence data files and your sample names, and it will just do it all in one go. So it takes like a minute to load your 80 to 90 gig index, and then it takes like 10 seconds per sample to classify the reads and give you nice summary files and all those things. So one thing with read classifiers, I find, is that you can have bits that are shared by different species, like maybe mobile genetic elements or AMR genes or virulence genes or whatever, and that can sometimes throw some weird curveballs. And it's also influenced by the number of samples that happen to be sequenced. So, say, Salmonella: that's massively over-represented compared to the generic stuff you find in the soil. So how does your classifier work in that case? In that case, it will just be as bad as other classifiers. That's the other thing that I'm very focused on: indexes that use reference strains. Not type strains necessarily, but instead of trying to index all of Salmonella, you take like a median or centroid strain from a population and use that as a reference. That takes away some of the effect of those genera being over-represented, but you still have that same problem. You can use HyperLogLog, for instance, to estimate how many k-mers, or minimizers, or whatever, are represented by some of those elements. Say you sequence a soil microbiome, you run your read classifier, and you have a hundred thousand reads that match Salmonella. A hundred thousand reads should be enough to cover a Salmonella genome several times. But if you find that it's actually a small subset of the k-mers, say 2,000 k-mers, compared to the whole genome, which should be like 4 or 5 million k-mers, then you can say it's probably a shared gene instead of the organism itself. So I'm working on that currently. The other thing that I find really helpful, and that I integrated into Sepia from the start, is what I call a hit ratio. So that's a minimizer-based estimate of the average k-mer similarity of your reads compared to the reference strain that the reads are classified as. So kind of like a match score of some description. Exactly, yes. It correlates really well with ANI, so average nucleotide identity. And I find that really useful. For one, if you have a really high score, and since you're working with k-mers that's something like 0.98 or 0.99 (you never get a one unless you have exactly the same strain that you recovered from the metagenome), you have a pretty good indication that you have that organism. The other thing is that you can filter out a lot of the noise. With these read classifiers, things usually get over-classified: these classifiers always go to the lowest level that they can get to. But if you have a k-mer similarity of 0.01, you know that's clearly noise; that's just the read classifier not being that great and doing its over-classification thing.
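A back-of-the-envelope sketch of that shared-element check, using the numbers from the Salmonella example above; the function and thresholds are hypothetical, not Sepia's API:

```rust
// Many reads hitting a taxon, but covering only a sliver of its distinct
// k-mers, points to a shared element (AMR gene, mobile element) rather
// than the organism itself. Thresholds here are made up for illustration.
fn looks_like_shared_element(
    reads_assigned: u64,     // reads classified to this taxon
    distinct_kmers_hit: u64, // HyperLogLog cardinality estimate
    genome_kmers: u64,       // expected distinct k-mers in the genome
) -> bool {
    let genome_fraction = distinct_kmers_hit as f64 / genome_kmers as f64;
    reads_assigned > 10_000 && genome_fraction < 0.01
}

fn main() {
    // 100,000 "Salmonella" reads covering only ~2,000 of ~4,500,000 k-mers:
    // suspicious, since that many reads should cover the genome several times.
    assert!(looks_like_shared_element(100_000, 2_000, 4_500_000));
}
```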
Just switching gears for a second: if you feel like it, do you want to give any hints on what you've been using Sepia for in any applied research? So currently I'm working on some metagenome projects, like using metagenomics to detect animal intrusion in farmlands, and using metagenomics to predict how long ago that animal dropped its feces on your land. We're also working on mapping the microbiome of the retail environment, so actually at the far end of the farm-to-fork continuum in food safety, and seeing how we can relate microbiome data to the occurrence of things like Listeria or Salmonella. Where I use Sepia there is that if I have 16S datasets, I can quickly scan through a dataset within seconds and pick out reads of interest. Especially with an amplicon database like 16S, that's super, super fast. You don't even need the batch mode there. Have you used it for coronavirus yet? No, but I know it can do coronavirus, yes. I suppose that would be the next thing you need for your paper when you write it up: it's coronavirus capable. I made sure of that. Awesome. Yeah, because you're going to get that everywhere, I'd imagine, at some point. All the reagent contamination will be coming through and whatnot. Yes. So tell me, going back to your animal droppings and food safety bit, does that mean that if you get, say, contaminated lettuce or cantaloupes or whatever, you can give some kind of classification on where that might've come from? Yes and no. One of the first things I did for that project was see how long the native microbiota of animal droppings persists. That's what's particularly indicative of which species was associated with those droppings, and I looked at how fast it disappears. What you see in the first couple of days is that most of the typical obligate anaerobes disappear, and they're the most indicative, I think. But I was quite surprised. So it depends; it depends on how long ago the fecal contamination occurred. That's basically it. If you think about foodborne pathogens like Salmonella, Listeria, and E. coli, they're actually some of the species that I found in those datasets; especially, we had some nice Escherichia that lasted the longest. So with Listeria, I've always heard that it's very difficult to recover from the environment; you have to do an overnight culture, that kind of thing, rather than just picking it up off the ground or in a factory and doing something with it. So what type of samples would you be dealing with in that case? So that's absolutely true for all those datasets. You find Listeria, I mean, you find it in small numbers, even in datasets that you sequenced without any prior enrichment. So what we've done for a couple of these projects is a culture enrichment, and that actually takes a lot longer than an overnight culture and all those things. So is it culture enrichment like where Americans go abroad and share their culture? Yeah. Sorry, I have to restrain myself there, having grown up in the Netherlands. Yes, so culture enrichment means that you take a sample and you expose it to an environment that positively affects a few organisms, or your organisms of interest. So it can be by using certain antibiotics in the medium, etc.
So for Listeria, that would typically start with an overnight culture, and it takes at least a couple of days to get from a soil sample to your Listeria cultures. Okay, so maybe let's get into the deep technical bit, right? Yes. And I recall that you were involved somehow in BIGSI with Zamin Iqbal; did his kind of work influence you in any way? Absolutely. I have another piece of software, also in Rust, called ColorID. And it uses a version of BIGSI, a Rust version that can actually be downloaded as a crate; it's publicly available. So it builds a BIGSI in memory. It's an in-memory use of a BIGSI: your index gets loaded into memory, and that makes it really, really fast. So you cannot make BIGSIs that are as big as, like, the entire SRA, NCBI's SRA, but you can index tens of thousands of strains in a relatively small data structure. That kind of terrifies me, because I know that BIGSI at one point was running on half a terabyte of RAM. Exactly, yeah. So how much RAM does your algorithm require? That all depends on how many accessions you have. Off the top of my head, let's see... are we talking about BIGSI now, or about... We're talking about Sepia, sorry. Sepia, okay. So for a small dataset, let me quickly... So if we have, for instance... I guess what I'm asking is, can I run it on my laptop, or do I need a bigger virtual machine to run it on? That all depends. So if you want to do, say, all of the current GTDB version, release R202, the latest version, and you want to have all reference strains, there we have about 50,000 references. And they are everything: archaea, bacteria; they include everything from cultured organisms to metagenome-assembled genomes. Any humans? Not yet. But if we look at the GTDB database with Sepia, that is 98 gigabytes, and that has to be loaded into your RAM. So it won't work on my laptop. It won't work on your laptop. But 98 gigs is quite good, actually; I know Kraken can require a fair whack of RAM. Yes, but here's the thing. One of the things that I've been experimenting with is the k-mer size versus the minimizer size, and how much that influences the accuracy of your read classification. After playing around with some values, a k-mer size of 31 with a minimizer size of 21 actually gets you a significantly smaller database, even compared to using the same values in Kraken. So is that kind of indicating that there might be people who want to know what the parameters are for lower memory, or for making it faster? Absolutely, absolutely. Yep. So are you documenting that, or detailing that? Yes, I will. I forget if we actually said this on the recording, but we literally just got access to Sepia as we started this podcast, so we're kind of talking and looking at the same time. Yes. And I really like talking about these things, because I'm literally writing the markdown, the updated and extensive markdown. One of the things, since I've been working on this over several years, is that the help functions are really, really helpful. And I can tell you that because there are things that I didn't remember from, say, over a year ago, and I look at my help function and, oh my. So everything works through the help function. Well done for doing it right. And I have to say that one of the things that makes it really easy is this one crate. What do you mean by a crate?
Is that like a container or something? Crates are packaged pieces of Rust software; not really containers. Are they close to library files? Yes, they're like library files that you can download from a central repository. So they're basically libraries or modules? Yeah, Rust just calls them crates. Yeah. And you get them at crates.io. Yeah. And so the crate that's responsible for me writing good help functions is actually called Clap. The Clap crate is fabulous. So something that I should probably tell people is that we're recording this while you guys are in the middle of, is it a hurricane or a tropical depression or something, with tornado warnings? By now it's a tropical depression, and I think it's just about past us. Yes, there were tornado warnings this morning and flash flood warnings. So it's dedication that you're on this podcast. Absolutely. Yes. Absolutely. I had to cross some streams. I'm, what, like 30 miles away from Henk right now, and we're getting... I guess it's just really, really wet. We're fortunate that we're pretty inland, but it was thundering and lightning and everything early this morning. Did you have a different kind of experience over there? Same here. It really kind of sucks when you have to take the dogs out in the morning. I have three dogs, and they don't like rain. They don't like rain. How do you do it, Henk? How do you take them out? I just drag them out, and we go for a small walk instead of a longer walk. The thing is that they may not do their business outside, because they refuse to; they don't like rain. Now the question is, have you sequenced your business? I haven't sequenced my own business. Yet. I think there are some people at CDC that are still interested, probably. Oh yeah, I'm not supposed to advertise it, but if you ask me offline, I'll tell you how you can donate. Are we still talking about dogs or about humans here? Oh, Henk switched it to humans. Yeah, we're talking about my business. We at Quadram also look for anonymous donors to donate regularly. Do you get a lot of donors over there? I know this is a tangent. Well, there is quite a big requirement for ethical poo, so that you can do R&D, that kind of stuff. Have you guys donated? Me? No, no, no. It's anonymous; it has to be anonymous. It's anonymous. There we go. Yeah, same here. I donated anonymously, and nobody knows it's mine. So we're going to go rooting through NCBI now to see if we can find Lee's poo. Yeah, they're supposed to have scrubbed the human DNA, but I feel like... we all know that can be problematic. Yes, yes. Because people use things like Kraken just to find the human reads; maybe they should use Sepia. Yeah, that's one of the functions that I want to add: a read filter. It was going to be integrated into Sepia, and now I'm going to have it as a standalone as well. So have you written a paper on this yet? No. I'm working on a million papers. So after this, I will get something out as soon as possible. Grand. So in the fullness of time, you'll write a paper. Yeah, maybe whenever we're locked in next time for the next hurricane. Yes. Or the next lockdown. I think that's always the thing with writing software. You're writing loads of code and potentially help functions, and then you have to write a paper.
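Since Clap came up as the crate behind those help functions, here is a minimal sketch of a Clap-style command line (clap 4.x with the derive feature; the subcommand name and flags below are invented for illustration, not Sepia's real options):

```rust
use clap::Parser;

/// Yet another read classifier (illustrative help text only).
#[derive(Parser)]
#[command(name = "sepia-demo", version, about)]
struct Cli {
    /// Path to the prebuilt index to load into memory
    #[arg(short, long)]
    index: String,

    /// K-mer size used when the index was built
    #[arg(short, long, default_value_t = 31)]
    kmer: usize,

    /// Minimizer size (must be smaller than the k-mer size)
    #[arg(short, long, default_value_t = 21)]
    minimizer: usize,
}

fn main() {
    let cli = Cli::parse();
    // `--help` output is generated from the doc comments above, which is
    // what makes a Clap CLI self-documenting.
    println!("index: {}, k: {}, m: {}", cli.index, cli.kmer, cli.minimizer);
}
```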
I mean, there are a couple of neat things. For instance, the data structure that uses the perfect hash function needs to know the set of all k-mers or minimizers that you want to index before the perfect hash function can be built. So I wrote a variation on the compact hash map, and that's the compact hash set. It's a set that can take gigantic, ginormous numbers of k-mers, and with it you can infer the set of all k-mers in your dataset before you start building your hash map. So can you take two different databases and then do set operations on them and, basically, start doing something like GWAS? Oh, we're onto something here, actually. That would be interesting. I haven't thought about that. Yeah, because, you know, what's common, what's different; extract those out and then maybe go and mine for interesting things. Yes, that shouldn't be too hard. And if you can do more complex set operations, you can do some pretty phenomenal things. Yes. Okay, off we go. Implement that and we'll write the Nature paper then. That sounds quite similar to UniFrac. I don't know that one. It's just a distance metric between communities. You feed it the number of reads supporting each taxon, and it meshes everything together. Okay. So I mean, if you could take this and just make the right output, it could plonk straight into that kind of software. So what would that look like? I mean, you could have a set of reference genomes which are your cases and a set of reference genomes that are your controls. And then you say, okay, go and build me two separate databases, then take, say, the intersection, or whatever is not in the intersection. And then you have a unique database, maybe for finding Listeria or whatever. That would be pretty cool. Are you trying to scoop yourself on PlasmidTron? It's basically PlasmidTron that I'm reinventing here, and obviously done in a better way, because PlasmidTron was very much hacked together. Yeah. I use UniFrac a lot these days in my microbiome work. Yeah. I was curious, you were talking much earlier about the batch mode, being able to run samples through in a batch. And then I noticed in the source code that you've got some callouts to Redis. Is that what's underwriting that, or what's your use of Redis in this? Redis. So this is one of the leftovers. What did I use it for? Oh, yes. Here. Before I started building my own compact hash set for those big things, I tried to do it with Redis, but it ran out of... Okay. Basically, I couldn't use it to do the set operations there. So that's actually vestigial. Yeah. All right. There's a lot of vestigial stuff in the current code, which I may remove. So Redis, for those who don't know, is an in-memory data structure store, like a cache. Basically, it's a giant key-value store. You can use it for a whole bunch of different things, and you'll find it all over the place. So the way that taxonomies are stored in the index is actually different from Kraken 2. There are a lot of things that are very similar to Kraken 2, but also different. The taxonomies are stored as directed acyclic graphs. That way, you can look up the taxonomy of a single organism, or of an identified k-mer, fairly quickly. It goes from the lowest to the highest taxonomy level, so it's always, say, seven or eight steps that you need to infer a taxonomy. And then you can do some set functions to figure out what the most recent common ancestor is.
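A minimal sketch of that lowest-to-highest walk and the set-style most-recent-common-ancestor step; the parent-map representation and names are assumptions for illustration, not Sepia's structures:

```rust
use std::collections::HashMap;

// Store each taxon's parent, climb from the lowest assigned taxon to the
// root (about seven or eight steps for a standard lineage), and intersect
// two lineages to find the lowest common ancestor.
fn lineage(parent: &HashMap<u32, u32>, mut taxon: u32) -> Vec<u32> {
    let mut path = vec![taxon];
    while let Some(&p) = parent.get(&taxon) {
        path.push(p);
        taxon = p;
    }
    path // lowest taxon first, root last
}

fn lowest_common_ancestor(parent: &HashMap<u32, u32>, a: u32, b: u32) -> u32 {
    let ancestors_a = lineage(parent, a);
    // First taxon on b's path to the root that also lies on a's path.
    lineage(parent, b)
        .into_iter()
        .find(|t| ancestors_a.contains(t))
        .expect("taxa share a root")
}

fn main() {
    // Tiny made-up tree: 1 = root, 2 = a phylum, 3 and 4 = two genera under it.
    let parent: HashMap<u32, u32> = HashMap::from([(2, 1), (3, 2), (4, 2)]);
    assert_eq!(lowest_common_ancestor(&parent, 3, 4), 2);
}
```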
I was curious about a couple of things in that. What would the output look like for Sepia, actually? Because is it the sort of Kraken classification where each read gets assigned a thing, and then you get that hierarchical report of the number of reads, or whatever chunks, that support a particular taxon, and the number that uniquely map to a particular taxon? So currently there are two outputs. There is a summary file, but it doesn't use that hierarchical structure of Kraken. So that's just a straight assignment to a particular genus or species, much like the Kraken output? Yes, exactly. So what it gives you is a taxon, the number of reads that hit that taxon, and the average k-mer or minimizer similarity per read, depending on what you're using. And if you use the HLL, so the HyperLogLog function, it will give you an estimate of the total number of minimizers, or actually k-mers, that were found for that specific taxon, the cardinality, and then the total number divided by the cardinality. So you can infer kind of a coverage per organism. Okay, that's good, because that sounds more digestible than the raw Kraken report that you get. I mean, that's not something you can just palm off to someone else who doesn't necessarily know how to interpret it. So it's good that you've got something that sounds a lot more human, or digestible. Yes, yeah. So to keep the code fast, everything is in u32 or whatever; all your taxonomic designations are encoded. But once you have the summary file and the per-read classification file, everything is human readable. And I made sure that there is a separate folder in the Sepia repository called scripts, with a Python script that actually generates Krona plots, or the input for Krona plots, from the classification file that Sepia generates. And another file that I call the plus file: it gives you not only the average k-mer similarity, but also the distribution of those k-mer similarities, so you can see what the curve looks like. I made that in the past to see if I could use a machine learning algorithm to filter the noise from the real hits. Yeah, that sounds good. And definitely the Krona output: professors like the Krona output. Nice and clickable for them. Yes, exactly. Interactive. Yeah, exactly. I think we kind of touched on this, but I am curious: what happens if there are reads that are very diverse, completely unrelated to your reference database? What is the chance that this program falsely assigns them to one of those taxonomies just because it has no idea? I think you touched on this confidence value that would help, but what would be the propensity here? So the propensity of read classifiers in general, I think, is to just assign it to the lowest taxon possible. That's where you get it. So that's where that k-mer similarity comes in. If you give it a closer look, a very, very low k-mer similarity usually throws out those hits as being true hits for that organism. The other possibility is that it just gets classified as no hits; I specifically have a no-hits category. But I mean, the danger is real. I've heard a couple of talks, virology talks I think, where they used read classifiers and were thrown off by weird or disturbing classifications, which turned out not to be the things they seemed. Yeah, like Yersinia pestis on the subway, or whatever that was. Yes, yeah. I mean, that was a naive case, but yeah.
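Circling back to the summary output described above, a sketch of what one summary row might carry; the struct and field names are invented, not Sepia's actual format:

```rust
// One row per taxon: read count, average k-mer/minimizer similarity (the
// "hit ratio"), HyperLogLog cardinality, and a coverage-like estimate.
struct TaxonSummary {
    taxon: String,
    reads_assigned: u64,
    avg_kmer_similarity: f64, // correlates with ANI
    total_kmer_hits: u64,     // all k-mer/minimizer hits to this taxon
    distinct_kmers: u64,      // HyperLogLog cardinality estimate
}

impl TaxonSummary {
    /// Total hits divided by the cardinality: roughly a per-organism
    /// coverage, as described in the conversation.
    fn coverage_estimate(&self) -> f64 {
        self.total_kmer_hits as f64 / self.distinct_kmers as f64
    }
}

fn main() {
    let row = TaxonSummary {
        taxon: "Salmonella enterica".to_string(), // made-up example numbers
        reads_assigned: 100_000,
        avg_kmer_similarity: 0.98,
        total_kmer_hits: 9_000_000,
        distinct_kmers: 4_500_000,
    };
    println!("{}: ~{:.1}x", row.taxon, row.coverage_estimate()); // ~2.0x
}
```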
Could you test your software on that subway dataset? That would be a good test. Yeah, I'm really curious. Is that dataset still out there, or has it been retracted? We can make up a dataset; I mean, it's not that hard. I think there are some comparison papers out there for read classifiers, kind of like an Assemblathon, but a read-classifier-athon. I can't remember the names of the papers, but they do have some datasets that they use that are these kind of gotcha ones, which should throw off some of these tools. And so that would be a good benchmark: pulling those down and having a go with those. A little more seriously, maybe the first benchmark should be something like the Zymo mock communities, that kind of thing. Yep. Oh, the sky's the limit, right? Now the code is out, we can try to break it as much as we'd like. What do you think people should be looking at first when they get to the repo? We're coming up with all sorts of awesome things here. Just give the software a run and see what you can do with it. The current implementation of the HyperLogLog function is not something I wrote myself, and it makes my code very slow, so I wouldn't turn that on first. The other thing I use Sepia for is read classification of Oxford Nanopore reads. Because you have that flexibility in setting those parameters, you can really play with the ideal parameters for read classification of Oxford Nanopore or other noisy reads. Okay, so somebody first coming to your repo should try out their Oxford Nanopore set on it? Yep. I'm curious. I mean, what you will see is that your average k-mer values are of course highly affected. They're not comparable to what you will find for Illumina data, but it does a pretty good job, I think. All right. I know that minimap2 has an error model to cope with PacBio and Nanopore. I don't know if that's a thing you can just flag. I should have a look at that, definitely. So currently, I think what works best is if I use smaller databases, for instance with something like Kalamari: if I make a Kalamari database with a k-mer size of just 21, which is fairly small, you can wiggle your way past those critical errors that Nanopore introduces. That works pretty well. So smaller k-mers definitely seem to do a good job, as long as they're not too small, because then they match everything and that's not very valuable. Do you have like a two-pass hierarchy kind of thing? So maybe you start off with k-mers of, say, 11 or something crazy small, and have a second pass? Yeah. Do you want to explain what Kalamari is? Oh yeah. It's a database of curated reference genomes, mostly bacterial, mostly foodborne, that we are using in-house over here, but I also have it up on GitHub. It's basically a list of accessions of these things, a script to download them, and documentation on how to build it for different databases. And I'm looking forward to documentation on how to build it for Sepia. Oh, I just found the dataset that was used for comparing the read classifiers, and that is the CAMI, I'd just blanked on the name, the CAMI dataset. So that's the Critical Assessment of Metagenome Interpretation, and I think the paper is in Nature Methods 2017, Sczyrba et al., if you want to look that up. Yes, absolutely. I mean, for the audience, you know, if they want to use it themselves. Me too. I want to see this. I haven't seen this before. Oh, there's a couple of them. There's Sczyrba et al.
There's one, McIntyre et al. 2017 in Genome Biology, which I think is the sequel to that. And then what I've played around with in the past is Chris Quince's DESMAN tool, which has a simulated set of different E. coli strains all mixed together, like co-infection or mixed-infection kind of things. You can pull that down and use it as well if you want, because I don't think the other two datasets in those papers really cover that intraspecies problem. So between those, you know, if you're able to outperform everyone on those, then your thing is golden, if anyone wants to play around with those datasets. Yeah. I have a feeling that read classifiers really struggle with strain-level differences, unless you use a really big k-mer size and all those things. So a former colleague played around with doing intraspecies comparison, and his trick was to weight the classification based on these pools of assignments. Basically, the logic was: if a k-mer, or read, or whatever, was assigned to multiple strains in the same bucket, in the same species, and not assigned outside of that, then that was more convincing than one that was spread across Salmonella and E. coli and something, something. Yeah. That was one of the tricks, and there are a couple of other tricks he used, but you do have to have a very good representative database of the species to be able to pull that apart properly. So you kind of have to understand the whole population structure of the species before you can really do that effectively, which is annoying, because you often don't have the time to do that. Absolutely, yeah. I mean, for old-hat stuff like Salmonella and E. coli, you can just pull those down from all the publications in that space. But if you're doing something a bit weird, I don't know, like oral pathogens, that's a fun one. Tannerella, what's the other one? Treponema, Tannerella, the stuff the dentist tells you to worry about. Those ones are more difficult to get at. Gingivitis? Yeah, these cause gingivitis. They're the red complex bacteria; if you look them up, they're the ones that cause gingivitis and periodontitis. And there's very little known about them. We know there are these species and these communities, but we don't really know much about them, not in the same way we know about the enterics. Porphyromonas, that's the third guy. Oh yeah, Porphyromonas. Porphyromonas, Tannerella forsythia, and Treponema denticola. So yeah, if you have a weird species, then intraspecies comparison becomes really tough. You just don't know. Oh yeah, so one thing I want to mention is that Sepia will also check the consistency of your taxonomy. So if, for instance, the same genus name is found in different lineages, it will flag it so you can have a look at it. That's the main thing. If you combine a plant taxonomy with a bacterial taxonomy, you will find that there are some genus names that are used in both domains of life. Yep. Is Candidatus one of those that is a bit difficult? Because they'll just stick that on anything, right? Yep, that would be really difficult. And also disease names as well, you know: if it causes pneumonia, well, sure, we'll call the species pneumoniae. Yep. So the thing is that nodes in my taxonomy don't have just the genus name; the name is the whole taxonomy string, and that fixes it. I learned that pretty quickly when I started to combine plant and bacterial taxonomy with zoological taxonomy.
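A minimal sketch of that consistency check: flag any genus name that appears under more than one higher lineage. The representation is hypothetical, not Sepia's implementation:

```rust
use std::collections::{HashMap, HashSet};

// The same genus name under different higher lineages (say, a plant and a
// bacterial homonym) gets flagged for a human to look at.
fn flag_genus_homonyms(lineages: &[&str]) -> Vec<String> {
    let mut seen: HashMap<String, HashSet<String>> = HashMap::new();
    for lineage in lineages {
        if let Some((parents, genus)) = lineage.rsplit_once(';') {
            seen.entry(genus.trim().to_string())
                .or_default()
                .insert(parents.to_string());
        }
    }
    seen.into_iter()
        .filter(|(_, parents)| parents.len() > 1)
        .map(|(genus, _)| genus)
        .collect()
}

fn main() {
    // "Bacillus" really is a homonym: a bacterial genus and an insect genus.
    let lineages = [
        "Bacteria;Firmicutes;Bacillaceae;Bacillus",
        "Eukaryota;Arthropoda;Bacillidae;Bacillus",
        "Bacteria;Proteobacteria;Enterobacteriaceae;Salmonella",
    ];
    assert_eq!(flag_genus_homonyms(&lineages), vec!["Bacillus"]);
}
```

Naming each node by its full lineage string, as described above, sidesteps the collision entirely.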
Oh, yeah. Yeah, people might not realize that the people who decide those things don't talk to each other. No, they don't talk to each other. There are at least three different codes of nomenclature; actually, there are four: the botanical and the bacterial, and then the zoological and the viral. Yeah, we didn't get into the whole part about how you know so much about taxonomy and how that led you to all this; that's for another time, I guess. Yep, yep, that's a good subject for another time. Did we say, Henk, where the name Sepia came from? Oh, yeah. So the name Sepia is actually a tribute to Kraken, because the Kraken is a big octopus, a cephalopod, and Sepia is also a cephalopod. And it refers to the rust color, like the pigment that you can make from its ink sac, which is rusty colored. So it's a humble cephalopod compared to the big Kraken, and it's a nod to Rust. Oh, that's interesting. So, a cuttlefish. Yeah, exactly. Or the sepia, okay, sepia. All right, well, thanks for a great discussion. This was a quick chat about Sepia, the classifier that Henk den Bakker created. There are always some interesting facts about these tools, so I'm glad that we talked it through, especially where the actual name came from. I loved diving into Rust and everything else about that, too. For those who are listening, you can check it out on GitHub; that'll be in the show notes. And that's all the time we have for today. See you next time. Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at MicroBinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.