Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Hello and welcome to the MicroBinfie podcast. Nabil and I are your co-hosts today, and we are joined by Drs. Erin Young and Kelsey Florek. Erin works as a bioinformatician at the Utah Department of Health. Kelsey is the senior genomics and data scientist at the Wisconsin State Laboratory of Hygiene and a steering committee member of StaPH-B. You might know both Erin and Kelsey from their work on the StaPH-B Toolkit and the StaPH-B Docker repo. So, first question to you, Erin: what is a container? So when people have asked me what a container is, I try to liken it to this: if you're going to do bioinformatics in your core environment, that's very similar to citizen science. You use what's there in your environment and you're able to still do a lot of great things. Some things require a laboratory bench and pipettes and fume hoods and that kind of thing, and that's where you get your conda environments, where you can control quite a bit more for your experiments. And then containers are like doing an experiment on the International Space Station: you get to control every aspect of your environment, which can give you a lot of power and a lot of usability. Yeah. 
And every time you finish using it, you just throw it away, throw away the whole space station and get a fresh one, a fresh copy. Yeah. So it's not a perfect analogy. When developing containers, I have that in mind: everything that I need to use needs to be in the container, and it all needs to work nicely with everything else in the container, but it can also be optimized for everything else in the container. So you guys started a whole repository of containers that I feel has become wildly popular. So Kelsey, do you want to describe what that repository is? Yeah. So we put it together, Curtis Kapsak and I; he started it and we evolved it together. And it was really a centralized resource of these containerized tools that a lot of public health laboratories were using. And it was kind of an interesting story when it first started. I had just joined the laboratory and I was looking at ways to work through some of our workflows using our HPC environment, the high-performance compute cluster at the university. And they had introduced me to this topic of Docker, and I was completely impressed initially. I was like, this is crazy: you can install all of the resources you need into a small environment and then have that distributed across hundreds or thousands of nodes. So we started looking at how we could potentially use this as a resource. If we created a container in Wisconsin, you know, maybe Curtis could use that (at the time he was in Colorado), or Kevin in Virginia. And so we started thinking that if we could put together some sort of resource where we had all of these things together in one spot, that might make it valuable for anybody that's trying to follow in our footsteps. Instead of having to install all these different dependencies and deal with conflicting Python versions, you know, maybe we can have a resource for every tool that we use. 
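To make "everything you need lives inside the container" concrete, here is a minimal Dockerfile sketch; the tool choice and base image are illustrative only, not an actual StaPH-B recipe:

```dockerfile
# Illustrative only: pin a base image and install one tool plus every
# dependency it needs, so the container carries its whole environment.
FROM ubuntu:20.04

# Install the tool and its runtime dependencies in a single layer,
# then clean the package cache to keep the image small.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 \
        ncbi-blast+ \
    && rm -rf /var/lib/apt/lists/*

# Conventional working directory for data mounted in from the host
WORKDIR /data
```

Because the base image and every package are declared in one file, anyone who builds or pulls this image gets the identical environment, which is the "space station" level of control described above.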
And so we started putting those together, and then it became widely adopted pretty quickly among the public health laboratories, because it was something that we could easily point to and say: oh, you want to run SPAdes? There's a SPAdes container. You don't have to deal with installing anything; just install Docker and you're good to go. So that was really the origin of the StaPH-B Docker repository. And it's since grown as more contributors have come into the picture, as we've added documentation and even more resources on top of that, workflows and toolkits to leverage those containers. So the value add is the fact that you've all gotten together and built the central resource. And so now, when it comes to installing those tools and you want to deploy these containers, it's sort of done. You've done the work that people would otherwise have had to do individually on their own systems and environments; one by one, that would have been 50 labs' worth of work. You've got two people who've just solved it, and everyone else can just piggyback off it. Is that the main benefit here? It's a sort of economies of scale. Yes. It means that when we have a container for certain things, we don't have to teach a bunch of people in all 50 labs how to install Perl libraries. That is an excellent reason to do anything: to avoid teaching people how to use CPAN. Perl is amazing. Come on. Yeah. Okay. Fine. Yeah. Yep. No, we love Perl. We love Perl here. Along those lines, which containers are you most proud of, or which ones have you seen be a lot more popular than expected? 
So this isn't a container that I maintain, but I feel like Curtis and his Pangolin containers are probably the most widely utilized, and maybe one of the reasons why Pangolin lineage determination is so popular as we talk about the SARS-CoV-2 outbreak, because it is such an easy container to use, and he's generally up to date with the latest release. I would have to say, actually, I did a recent design off of something that Yuli has put together recently, the datasets, the SARS-CoV-2 dataset repository. So I put that into a container, and it is quite a list of dependencies. Thank you. Thank you. Big heart sign. I will have to say, whenever I give a talk about Docker and containers, I just say: yeah, just go blag it off the StaPH-B site. They've got pretty much anything; if you have to come and ask me about containers, they've got everything you want. So just go for it. I think my favorite is... I think the whole Torstyverse is up there. So there's Shovill and Prokka. Did you do Nullarbor or not? I don't think so. Not yet, but we can definitely add that. No, don't. It's a mess. It'll take you forever. No, it has a lot... I think Torsten will admit it has a lot of dependencies. It does a lot for you, but yeah, no, I think there are other alternatives that are a bit more lightweight, and they make more sense to fit into this. I'm more of a fan of the new paradigm where we do everything modular and run things through workflow languages. I think that's where we've moved towards. Nullarbor was written before that was really the zeitgeist. If there is a tool that's missing, and there are still a lot of tools that are missing, you don't have to be part of the StaPH-B community to contribute to the Docker repo. There's just a specific format we request for your pull request, like where your Dockerfile goes and that kind of thing. Yeah. 
Give the quick plug on how somebody can go in and read up on how to do the standard contribution and how they might contribute. Well, the basic steps are: go to the StaPH-B docker-builds repo on GitHub, which hopefully this podcast has some sort of link to somewhere, and you'll want to fork it into your own repository. The Dockerfiles are organized by tool, then version, each with a Dockerfile and README. And ideally your Dockerfile works for you before you submit your pull request, and there's some way to test it. And then, if it goes through all of the checks and the QC process, it will end up on StaPH-B's Docker Hub and Quay projects. So I'm going to be a bit mean and open a question: there are other resources, like say BioContainers, which is just hoovering up everything on Bioconda and making Docker images for those. And I think the Galaxy team does Singularity images as well, built around that. So what's the value add here for StaPH-B? I mean, it must be a lot of work to keep it up. Why continue with it if you've got these other resources? That's actually a really good question, and it's something that we've definitely been considering going forward: what are the future goals and steps for this project? And I think the one major use case for StaPH-B's Docker repository being in existence is that it's designed specifically for state public health laboratories. And one of the things, you know, that all state public health laboratories are thinking about, at least now or in the future, is validation of workflows, pipelines, and tools. And we're trying to develop these containers in a way that they can be utilized in those validated environments. So one of the things that we've done is started looking at, instead of pulling down specific tags, advising people to pull down the actual hash of the image, so that you know exactly the image that you're getting every single time. 
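The tag-versus-hash distinction can be sketched in a workflow configuration like this; this is a hypothetical Nextflow-style config fragment, and `<digest>` is a placeholder, not a real image hash:

```
// Hypothetical nextflow.config fragment: pin the container by digest,
// not by tag, so reruns always use the byte-identical image.
process {
    // tag form (mutable, can change if the image is rebuilt):
    //   container = 'staphb/spades:3.15.5'
    // digest form (immutable, resolves to exactly one image):
    container = 'staphb/spades@sha256:<digest>'
}
```

A tag is just a movable label, so a rebuilt image can silently replace the one a lab validated; the sha256 digest identifies the image content itself, which is what a validated environment needs.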
There's also the potential to use secure or locked containers or images that are available through Docker, so that you can say there is only one person that has the keys to make any changes to that container. So again, just adding that level of security to the containerization aspect. And then on top of that, some of the future for the StaPH-B repository includes things like multi-stage builds to incorporate testing and things like that. So being able to have a test dataset that will run through the workflow or the tool and make sure that it works, it's approved, and it passes all the QC and validation checks that would be appropriate for a workflow that's utilizing that image. And so I think that's really where the StaPH-B containers will continue to exist, as well as just being a resource for these emerging pathogens or emerging needs that laboratories will have. So... Yeah, I totally agree with that. I was doing a bit of a reviewer-three moment there. I always point people to the StaPH-B ones, for exactly the reason you're saying, and it's good to hear, but my feeling from interacting with it was that the containers tended to be more stable. Certain repositories do seem to just hoover up the latest Bioconda or whatever, and they're a bit broken. So I don't want the email... someone goes, I'll use the containers on Bioconda, whatever, whatever. And then the next day they come back and say, oh, it doesn't work for me, something-something version is not there on the path. All right, okay, fine. No, use the StaPH-B ones, because they're probably going to work and you don't have to keep bothering me. Definitely there is scope for a more public-health-oriented, more provenance-focused and more conservative set of images, rather than the latest and greatest, which I think you find with more academic software, where we're a bit loose and lazy with it, I guess. Yeah, we like the LTS versions of things. Yeah, exactly. 
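The multi-stage build-plus-testing idea mentioned above can be sketched in a Dockerfile like this; it is a hypothetical example, not an actual StaPH-B recipe, and the simple `--version` check stands in for a real test dataset:

```dockerfile
# Hypothetical multi-stage Dockerfile: an "app" stage holding the tool,
# and a "test" stage that exercises it at build time.
FROM ubuntu:20.04 AS app
RUN apt-get update && apt-get install -y --no-install-recommends spades \
    && rm -rf /var/lib/apt/lists/*

FROM app AS test
# The build of this stage fails if the tool is broken; a real test
# would run a small dataset through the tool and check the output.
RUN spades.py --version
```

Building with `--target test` runs the check, while the published image can be built from the `app` stage alone, so the test harness never bloats the image users pull.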
I think you can badge it like this: yeah, this is the LTS version, the Prokka LTS. You can, there you go, for the container names, you can put some extra thing like that. Yeah, it does smack of the kind of CentOS versus Fedora feel, doesn't it? Or why would you use CentOS over Ubuntu, kind of thing. It's all just Linux. When are you going to publish this? I already get on Curtis's case about this. Yeah, so we definitely do want to get a publication out. We had something in the works for a while, and then the pandemic hit and really just slowed a lot of that progress. Our number one priority is making sure that we have the containers there, and our biggest next priority has been looking more at continuous integration across the board, continuous testing, things like that. And once those issues get worked out, I think we'll definitely be ready to push forward on a publication. But there are a few things that we want to push forward first, and with the limited time that we all have, that's kind of our focus. I'm sure they'll put a manuscript together very quickly, Lee. They're not as slow as you. I deserve that, yeah. I'm kidding, I'm kidding, I'm kidding. You're kidding and I deserve it. It can be both things. Yeah, that's what makes it funny. So do you guys want to talk about the StaPH-B Toolkit? What is it? Yeah, so the StaPH-B Toolkit is a Python-based utility that is designed to abstract away some of the more difficult points of working with a container. It was originally designed when we were at a phase where we were instructing new bioinformaticians, or even just people in state laboratories, how to work with the command line and how to work with tools. And working with a container directly on the command line is a bit challenging, just because it requires a lot of knowledge of how containers and images work. It requires knowledge of how to mount things in Linux and how to deal with file systems. 
And so much of that is not really important to the goal of analyzing sequence data and getting an answer or a result. And so the initial scope of the StaPH-B Toolkit was to create an interface, with a help menu, that gives users much simpler access to these Docker images. Aside from that, we also started thinking about how we could incorporate some of the smaller workflows that we were working on. And then that grew and grew, and we've started incorporating the Nextflow workflows as well. Again, adding that level of abstracting away some of the more technical components of running a workflow, and adding a user interface that makes things just a little bit simpler for anybody trying to do this on the command line, so that they don't have to worry about so many different details of working with a Linux environment. So again, it was really focused on training, but then it's been really useful as a route to providing a single package of resources. So having Cecret in there, having Monroe in there. We also have Dryad in there, as well as several other workflows, so that, you know, you can pip install this toolkit and then have everything you need. And so that's kind of been the direction that the toolkit has been going in, and it's been slowly evolving to make better use of the Docker images as well as the workflows that are being developed in StaPH-B. Do you guys remember Homebrew for macOS and, like, Brewsci? I mean, when I was going over the StaPH-B Toolkit, it did feel a lot like that easy, off-the-shelf package management solution. I don't think Brewsci is maintained anymore. Shaun Jackman, I think they abandoned it at some point, basically because containers made a lot more sense to deploy that way. But then you're back to the sort of nebulous obscuredness or opaqueness of dealing with the container. And so now you've written, again, a package manager on top of that. 
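What "abstracting away the Docker details" amounts to can be sketched in a few lines of Python. This is a toy illustration of the idea, not the real StaPH-B Toolkit API: a wrapper assembles the full `docker run` invocation, mounting the current host directory into the container, so the user only supplies the tool command.

```python
# Toy sketch of a container wrapper: the user asks for "spades.py --version"
# and the wrapper fills in the docker run boilerplate (mounts, workdir).
# Image name and mount point are illustrative, not the toolkit's real defaults.
from pathlib import Path

def build_docker_command(image, tool_args, workdir="/data"):
    """Assemble the argument list a wrapper would execute: mount the
    current host directory at `workdir` inside the container, start
    there, and append the user's tool command."""
    host_dir = str(Path.cwd())
    return [
        "docker", "run", "--rm",            # throw the container away after use
        "-v", f"{host_dir}:{workdir}",      # host path -> container path
        "-w", workdir,                      # `ls` inside now shows host files
        image,
    ] + list(tool_args)

cmd = build_docker_command("staphb/spades:3.15.5", ["spades.py", "--version"])
print(" ".join(cmd))
```

The user never has to reason about the inside-versus-outside path mapping that trips people up; the wrapper makes the current directory the container's working directory every time, which is exactly the kind of convention a toolkit can enforce.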
What do you think about this sort of cycle? We seem to be going in a cycle where we're now making containers more digestible for people, much like we've done with package management in the past. And what you've been talking about sounds a lot like nf-core and their module system as well. So this seems to be a trend to me; where is this going, what do you think about that, and what are the advantages of what you're doing over what else is out there? I definitely see the field moving in this direction, and I do wonder about the long-term usage of the toolkit. As I said, it was really born out of a training opportunity, a training tool to help introduce people to working in the command line and using these tools. And as we move forward with some of these adaptations, like Terra utilizing a web interface to run workflows, I think a lot of laboratories may decide that it is a better use of their time to work with these tools and resources through an application interface rather than a command-line interface. But it still continues to surprise me how often the toolkit is downloaded and used. And sometimes, for some of the software in there, we question, if it hasn't been maintained in a while, should we just remove it? And then we get a bunch of pushback from people saying, no, it's incredibly useful, please don't remove it. So it's definitely one of those ones where I'm not sure what the lifespan of this tool will be, but it's something that I think, at the very least, will be a useful training measure for anybody getting introduced to the command line and wanting to run some of these Docker images. We've discussed this before, we've discussed this a bit, and I think we've been saying this on the podcast for a while, but containers and workflow languages are just... that's the mainstream now, that's the bioinformatics line of thought. 
And if you're not jumping on that bandwagon, you're going to get left behind, to be honest. So I think there is a definite lifespan if you're making it easier, especially Docker. Try explaining it to someone: so it's a computer kind of inside the computer, and you have to do a thing to make it see files outside of the container. And they're just like: what? It's the same computer, what are you talking about? And this maps to this, and so is it on the inside or is it on the outside? So when I type ls, which way am I seeing it? That's a complete mind job, a kind of Inception thing. So having any tool that makes it more digestible for someone is definitely going to be used. Something like that, if it's maintained and if it doesn't become bloated or anything, if it's all nice and sleek and easy to use, you're going to see usage for the next couple of years easily. So, no, you're not going to get rid of it that easily. And I don't plan to. And, you know, it surprises me sometimes how often I will need to run a one-off of, like, SAMtools or something like that. And I'll just do it through the StaPH-B Toolkit, just because it's installed and I know it's there, and I don't have to do the whole Docker command or worry about any of that. So, yeah, it's actually been quite useful. So I think what people might be thinking is: yeah, yeah, yeah, but I can just do conda create X, Y, Z, samtools, and then I'm up and running; why would I do this? And your problem is going to be when you start playing around with workflows and you want to make your application more portable. Then you run into the fact that conda doesn't port very well. It's not something you can drag and drop onto the cluster, onto HPC, or even when you're shifting laptops or something like that. You're going to have a problem, and the containers make more sense, but then the containers are more opaque. 
So I think there's a definite use case there for this kind of application, making it more digestible. And it's going to hook into the workflows, and you're saying that you've already packaged in workflows. So it's sort of like, why would you not want to use it? I definitely hope that's the case. And I know Erin has been really valuable in contributing Cecret to the StaPH-B Toolkit. And so, yeah, actually, I'd like to hear your perspective on it. So Cecret is in the StaPH-B Toolkit as well as in its standalone repo, and that's an interesting dynamic, keeping them both updated. Generally, they sometimes get out of sync. Don't let Kelsey know. But the StaPH-B Toolkit is also there because, as these public health bioinformaticians are looking for options on how to analyze their data, there is a massive number of workflows for basically anything. If you just do a Google search for, say, AMR detection, you'll probably find 10 million antimicrobial resistance gene workflows for all kinds of species and organisms. And that can be a bit overwhelming. So one of the things the StaPH-B Toolkit does is curate workflows for the StaPH-B community and ensure that these workflows have some sort of public health focus. There's not a workflow for everything yet, but there are a lot of really helpful workflows there. Yeah, I think now, with the improvement in the languages, it's much easier to prototype them and throw something up to fill that gap. It's going to converge on a space where we're all just picking up and using each other's workflows. I'm curious, if we've got time, to jump back to the point about Terra and more web applications sitting on top of this. Do you really think it's just going to be... that we're going to just run something like that? That this is just an intermediate step until we sort out how to do a web application sitting on top of these workflows, and then that's all anyone will interact with? 
I think there's a lot of commercial interest in that. And I think the simpler you can make workflows to use, like the simpler you can make any sort of interface for a workflow, whether it's a Snakemake workflow or Nextflow or CWL, the easier you can get non-bioinformaticians to use it, the more likely that workflow is to survive long-term. And so I think there is a huge push and drive to go in that direction. I definitely agree with some of that perspective. I think there's definitely a use case in United States public health laboratories, where laboratories doing sequencing need easier access to some of the workflows and things that are happening. And I think some of these commercialized approaches, Nextflow Tower, Terra, are great examples of how some of these laboratories with very limited resources might be able to get access to some of these things. Before the pandemic, there was really a big push to get these labs to learn how to use the command line, and I think there's been a realization that a lot of these laboratories just don't have the resources or capability, or even the compute environments, to do that. And so a cloud-based web tool, I think, is a perfect solution for them. At the same time, for a lot of laboratories that are developing a lot of expertise around genetics and genomics analyses and sequencing, I think it's going to be hard to get away from the command line. There have always been things, quote unquote, threatening the command line, and it's stood the test of time. And I think that really speaks to how flexible it is for working with massive amounts of files and various file systems, or across networks and things like that. Stuff that just really isn't... I mean, you can definitely abstract some of that away and put it into a web interface, but at a certain point you need to go back to that root access and have the ability and flexibility to access those tools and resources. 
And I think the Docker images and the Toolkit and all these things will definitely continue to persist and provide a resource there for people who want that flexibility, who are going to be sitting more on the cutting edge of these development cycles and contributing to some of the resources that we have at StaPH-B. That all sounds very exciting. And either way, it sounds like everyone's actually finally going to have a choice of how they want to approach their analysis. And hopefully in a few years, they won't have the excuse of saying, oh, I couldn't look at it because I don't know how to use whatever. It's like: no, no, no, no, no. Installing is easy now. Workflows are easy now; command line, web, whatever, it's all available. You've got no excuse. What ST is it? Like, what, whatever. Yeah, then it will be fun to focus on the analytic aspects, what questions can be asked with the tools, instead of training people how to install the tools. We can finally do some biology. Is that what you're saying? Yes. Let's make biology easier for everybody. I think that's the end goal. I think, by making the tools more digestible, we'll realize we don't actually know very much about the biology at all. Well, that gives you more podcast episodes. So, right. Oh, we're here forever. It's always chopping and changing. We will be running this for ages. That's a nice positive note, and I think we will draw this episode to a close. Well, there you have it. You're a dinosaur if you don't catch up with containers and the StaPH-B Toolkit. Guys, get on it. No excuses, it's too easy now. It's too easy. You've got to jump on it. We appreciate Drs. Erin Young and Kelsey Florek for coming on and talking with us today. And we hope to see you next time. Thank you so much for listening to us at home. 
If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.