Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Hello and welcome to the MicroBinfie podcast. Nabil and I are your co-hosts today, and we are joined by Drs. Erin Young and Kelsey Florek. Erin works as a bioinformatician at the Utah Department of Health. Kelsey is the senior genomics and data scientist at the Wisconsin State Laboratory of Hygiene and a steering committee member of StaPH-B. You might know both Erin and Kelsey from their work on the StaPH-B Toolkit and the StaPH-B Docker repo. So, first question to you, Erin: what is a container? So when people have asked me what a container is, I try to liken it to this: if you're going to do bioinformatics in your core environment, that's very similar to citizen science. You use what's there in your environment and you're able to still do a lot of great things. Some things require a laboratory bench and pipettes and fume hoods and that kind of thing, and that's where you get your conda environments, where you can control quite a bit more for your experiments. And then containers are like doing an experiment on the International Space Station: you get to control every aspect of your environment, which can give you a lot of power and a lot of usability. Yeah. 
And every time you finish using it, you just throw it away, throw away the whole space station and get a fresh one, a fresh copy. Yeah. So it's not a perfect analogy. When developing containers, I have that in mind: everything that I need to use needs to be in the container, and it all needs to work nicely with everything else in the container, but it can also be optimized for everything else in the container. So you guys started a whole repository of containers that I feel has become wildly popular. So Kelsey, do you want to describe what that repository is? Yeah. So we put it together, Curtis Kapsak and I; he started it and we evolved it together. And it was really a centralized resource of these containerized tools that a lot of public health laboratories were using. And it was kind of an interesting story when it first started. I had just joined the laboratory and I was looking at ways to work through some of our workflows using our HPC environment, the high-performance compute cluster at the university. And they had introduced me to this topic of Docker, and I was completely impressed initially. I was like, this is crazy: you can install all of the resources you need into a small environment and then have that distributed across hundreds or thousands of nodes. So we started looking at how we could potentially use this as a resource. If we created a container in Wisconsin, you know, maybe Curtis could use that (at the time he was in Colorado), or Kevin in Virginia. And so we started thinking that if we could put together some sort of resource where we had all of these things together in one spot, that might make it valuable for anybody that's trying to follow in our footsteps. Instead of having to install all these different dependencies and deal with conflicting Python versions, you know, maybe we can have a resource for every tool that we use. 
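To make "everything you need lives inside the container" concrete, here is a minimal Dockerfile sketch; the tool choice and base image are illustrative only, not an actual StaPH-B recipe:

```dockerfile
# Illustrative only: pin a base image and install one tool plus every
# dependency it needs, so the container carries its whole environment.
FROM ubuntu:20.04

# Install the tool and its runtime dependencies in a single layer,
# then clean the package cache to keep the image small.
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 \
        ncbi-blast+ \
    && rm -rf /var/lib/apt/lists/*

# Conventional working directory for data mounted in from the host
WORKDIR /data
```

Because the base image and every package are declared in one file, anyone who builds or pulls this image gets the identical environment, which is the "space station" level of control described above.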
And so we started putting those together, and then it became widely adopted pretty quickly among the public health laboratories, because it was something that we could easily point to and say: oh, you want to run SPAdes? There's a SPAdes container. You don't have to deal with installing anything; just install Docker and you're good to go. So that was really the origin of the StaPH-B Docker repository. And it's since grown as more contributors have come into the picture, as we've added documentation and even more resources on top of that, workflows and toolkits to leverage those containers. So the value add is the fact that you've all gotten together and built the central resource. And so now, when it comes to installing those tools and you want to deploy these containers, it's sort of done. You've done the work that people would otherwise have had to do individually on their own systems and environments; one by one, that would have been 50 labs' worth of work. You've got two people who've just solved it, and everyone else can just piggyback off it. Is that the main benefit here? It's a sort of economies of scale. Yes. It means that when we have a container for certain things, we don't have to teach a bunch of people in all 50 labs how to install Perl libraries. That is an excellent reason to do anything: to avoid teaching people how to use CPAN. Perl is amazing. Come on. Yeah. Okay. Fine. Yeah. Yep. No, we love Perl. We love Perl here. Along those lines, which containers are you most proud of, or which ones have you seen be a lot more popular than expected? 
So this isn't a container that I maintain, but I feel like Curtis and his Pangolin containers are probably the most widely utilized, and maybe one of the reasons why Pangolin lineage determination is so popular as we talk about the SARS-CoV-2 outbreak, because it is such an easy container to use, and he's generally up to date with the latest release. I would have to say, actually, I did a recent design off of something that Yuli has put together recently, the datasets, the SARS-CoV-2 dataset repository. So I put that into a container, and it is quite a list of dependencies. Thank you. Thank you. Big heart sign. I will have to say, whenever I give a talk about Docker and containers, I just say: yeah, just go blag it off the StaPH-B site. They've got pretty much anything; if you have to come and ask me about containers, they've got everything you want. So just go for it. I think my favorite is... I think the whole Torstyverse is up there. So there's Shovill and Prokka. Did you do Nullarbor or not? I don't think so. Not yet, but we can definitely add that. No, don't. It's a mess. It'll take you forever. No, it has a lot... I think Torsten will admit it has a lot of dependencies. It does a lot for you, but yeah, no, I think there are other alternatives that are a bit more lightweight, and they make more sense to fit into this. I'm more of a fan of the new paradigm where we do everything modular and run things through workflow languages. I think that's where we've moved towards. Nullarbor was written before that was really the zeitgeist. If there is a tool that's missing, and there are still a lot of tools that are missing, you don't have to be part of the StaPH-B community to contribute to the Docker repo. There's just a specific format we request for your pull request, like where your Dockerfile goes and that kind of thing. Yeah. 
Give the quick plug on how somebody can go in and read up on how to do the standard contribution and how they might contribute. Well, the basic steps are: go to the StaPH-B docker-builds repo on GitHub, which hopefully this podcast has some sort of link to somewhere, and you'll want to fork it into your own repository. The Dockerfiles are organized by tool, then version, each with a Dockerfile and README. And ideally your Dockerfile works for you before you submit your pull request, and there's some way to test it. And then, if it goes through all of the checks and the QC process, it will end up on StaPH-B's Docker Hub and Quay projects. So I'm going to be a bit mean and open a question: there are other resources, like say BioContainers, which is just hoovering up everything on Bioconda and making Docker images for those. And I think the Galaxy team does Singularity images as well, built around that. So what's the value add here for StaPH-B? I mean, it must be a lot of work to keep it up. Why continue with it if you've got these other resources? That's actually a really good question, and it's something that we've definitely been considering going forward: what are the future goals and steps for this project? And I think the one major use case for StaPH-B's Docker repository being in existence is that it's designed specifically for state public health laboratories. And one of the things, you know, that all state public health laboratories are thinking about, at least now or in the future, is validation of workflows, pipelines, and tools. And we're trying to develop these containers in a way that they can be utilized in those validated environments. So one of the things that we've done is started looking at, instead of pulling down specific tags, advising people to pull down the actual hash of the image, so that you know exactly the image that you're getting every single time. 
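The tag-versus-hash distinction can be sketched in a workflow configuration like this; this is a hypothetical Nextflow-style config fragment, and `<digest>` is a placeholder, not a real image hash:

```
// Hypothetical nextflow.config fragment: pin the container by digest,
// not by tag, so reruns always use the byte-identical image.
process {
    // tag form (mutable, can change if the image is rebuilt):
    //   container = 'staphb/spades:3.15.5'
    // digest form (immutable, resolves to exactly one image):
    container = 'staphb/spades@sha256:<digest>'
}
```

A tag is just a movable label, so a rebuilt image can silently replace the one a lab validated; the sha256 digest identifies the image content itself, which is what a validated environment needs.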
There's also the potential to use secure or locked containers or images that are available through Docker, so that you can say there is only one person that has the keys to make any changes to that container. So again, just adding that level of security to the containerization aspect. And then on top of that, some of the future for the StaPH-B repository includes things like multi-stage builds to incorporate testing and things like that. So being able to have a test dataset that will run through the workflow or the tool and make sure that it works, it's approved, and it passes all the QC and validation checks that would be appropriate for a workflow that's utilizing that image. And so I think that's really where the StaPH-B containers will continue to exist, as well as just being a resource for these emerging pathogens or emerging needs that laboratories will have. So... Yeah, I totally agree with that. I was doing a bit of a reviewer-three moment there. I always point people to the StaPH-B ones, for exactly the reason you're saying, and it's good to hear, but my feeling from interacting with it was that the containers tended to be more stable. Certain repositories do seem to just hoover up the latest Bioconda or whatever, and they're a bit broken. So I don't want the email... someone goes, I'll use the containers on Bioconda, whatever, whatever. And then the next day they come back and say, oh, it doesn't work for me, something-something version is not there on the path. All right, okay, fine. No, use the StaPH-B ones, because they're probably going to work and you don't have to keep bothering me. Definitely there is scope for a more public-health-oriented, more provenance-focused and more conservative set of images, rather than the latest and greatest, which I think you find with more academic software, where we're a bit loose and lazy with it, I guess. Yeah, we like the LTS versions of things. Yeah, exactly. 
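The multi-stage build-plus-testing idea mentioned above can be sketched in a Dockerfile like this; it is a hypothetical example, not an actual StaPH-B recipe, and the simple `--version` check stands in for a real test dataset:

```dockerfile
# Hypothetical multi-stage Dockerfile: an "app" stage holding the tool,
# and a "test" stage that exercises it at build time.
FROM ubuntu:20.04 AS app
RUN apt-get update && apt-get install -y --no-install-recommends spades \
    && rm -rf /var/lib/apt/lists/*

FROM app AS test
# The build of this stage fails if the tool is broken; a real test
# would run a small dataset through the tool and check the output.
RUN spades.py --version
```

Building with `--target test` runs the check, while the published image can be built from the `app` stage alone, so the test harness never bloats the image users pull.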
I think you can badge it like this: yeah, this is the LTS version, the Prokka LTS. You can, there you go, for the container names, you can put some extra thing like that. Yeah, it does smack of the kind of CentOS versus Fedora feel, doesn't it? Or why would you use CentOS over Ubuntu, kind of thing. It's all just Linux. When are you going to publish this? I already get on Curtis's case about this. Yeah, so we definitely do want to get a publication out. We had something in the works for a while, and then the pandemic hit and really just slowed a lot of that progress. Our number one priority is making sure that we have the containers there, and our biggest next priority has been looking more at continuous integration across the board, continuous testing, things like that. And once those issues get worked out, I think we'll definitely be ready to push forward on a publication. But there are a few things that we want to push forward first, and with the limited time that we all have, that's kind of our focus. I'm sure they'll put a manuscript together very quickly, Lee. They're not as slow as you. I deserve that, yeah. I'm kidding, I'm kidding, I'm kidding. You're kidding and I deserve it. It can be both things. Yeah, that's what makes it funny. So do you guys want to talk about the StaPH-B Toolkit? What is it? Yeah, so the StaPH-B Toolkit is a Python-based utility that is designed to abstract away some of the more difficult points of working with a container. It was originally designed when we were at a phase where we were instructing new bioinformaticians, or even just people in state laboratories, how to work with the command line and how to work with tools. And working with a container directly on the command line is a bit challenging, just because it requires a lot of knowledge of how containers and images work. It requires knowledge of how to mount things in Linux and how to deal with file systems. 
And so much of that is not really important to the goal of analyzing sequence data and getting an answer or a result. And so the initial scope of the StaPH-B Toolkit was to create an interface, with a help menu, that gives users much simpler access to these Docker images. Aside from that, we also started thinking about how we could incorporate some of the smaller workflows that we were working on. And then that grew and grew, and we've started incorporating the Nextflow workflows as well. Again, adding that level of abstracting away some of the more technical components of running a workflow, and adding a user interface that makes things just a little bit simpler for anybody trying to do this on the command line, so that they don't have to worry about so many different details of working with a Linux environment. So again, it was really focused on training, but then it's been really useful as a route to providing a single package of resources. So having Cecret in there, having Monroe in there. We also have Dryad in there, as well as several other workflows, so that, you know, you can pip install this toolkit and then have everything you need. And so that's kind of been the direction that the toolkit has been going in, and it's been slowly evolving to make better use of the Docker images as well as the workflows that are being developed in StaPH-B. Do you guys remember Homebrew for macOS and, like, Brewsci? I mean, when I was going over the StaPH-B Toolkit, it did feel a lot like that easy, off-the-shelf package management solution. I don't think Brewsci is maintained anymore. Shaun Jackman, I think they abandoned it at some point, basically because containers made a lot more sense to deploy that way. But then you're back to the sort of nebulous obscuredness or opaqueness of dealing with the container. And so now you've written, again, a package manager on top of that. 
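What "abstracting away the Docker details" amounts to can be sketched in a few lines of Python. This is a toy illustration of the idea, not the real StaPH-B Toolkit API: a wrapper assembles the full `docker run` invocation, mounting the current host directory into the container, so the user only supplies the tool command.

```python
# Toy sketch of a container wrapper: the user asks for "spades.py --version"
# and the wrapper fills in the docker run boilerplate (mounts, workdir).
# Image name and mount point are illustrative, not the toolkit's real defaults.
from pathlib import Path

def build_docker_command(image, tool_args, workdir="/data"):
    """Assemble the argument list a wrapper would execute: mount the
    current host directory at `workdir` inside the container, start
    there, and append the user's tool command."""
    host_dir = str(Path.cwd())
    return [
        "docker", "run", "--rm",            # throw the container away after use
        "-v", f"{host_dir}:{workdir}",      # host path -> container path
        "-w", workdir,                      # `ls` inside now shows host files
        image,
    ] + list(tool_args)

cmd = build_docker_command("staphb/spades:3.15.5", ["spades.py", "--version"])
print(" ".join(cmd))
```

The user never has to reason about the inside-versus-outside path mapping that trips people up; the wrapper makes the current directory the container's working directory every time, which is exactly the kind of convention a toolkit can enforce.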
What do you think about this sort of cycle? We seem to be going in a cycle where we're now making containers more digestible for people, much like we've done with package management in the past. And what you've been talking about sounds a lot like nf-core and their module system as well. So this seems to be a trend to me; where is this going, what do you think about that, and what are the advantages of what you're doing over what else is out there? I definitely see the field moving in this direction, and I do wonder about the long-term usage of the toolkit. As I said, it was really born out of a training opportunity, a training tool to help introduce people to working in the command line and using these tools. And as we move forward with some of these adaptations, like Terra utilizing a web interface to run workflows, I think a lot of laboratories may decide that it is a better use of their time to work with these tools and resources through an application interface rather than a command-line interface. But it still continues to surprise me how often the toolkit is downloaded and used. And sometimes, for some of the software in there, we question, if it hasn't been maintained in a while, should we just remove it? And then we get a bunch of pushback from people saying, no, it's incredibly useful, please don't remove it. So it's definitely one of those ones where I'm not sure what the lifespan of this tool will be, but it's something that I think, at the very least, will be a useful training measure for anybody getting introduced to the command line and wanting to run some of these Docker images. We've discussed this before, we've discussed this a bit, and I think we've been saying this on the podcast for a while, but containers and workflow languages are just... that's the mainstream now, that's the bioinformatics line of thought. 
And if you're not jumping on that bandwagon, you're going to get left behind, to be honest. So I think there is a definite lifespan if you're making it easier, especially Docker. Try explaining it to someone: so it's a computer kind of inside the computer, and you have to do a thing to make it see files outside of the container. And they're just like: what? It's the same computer, what are you talking about? And this maps to this, and so is it on the inside or is it on the outside? So when I type ls, which way am I seeing it? That's a complete mind job, a kind of Inception thing. So having any tool that makes it more digestible for someone is definitely going to be used. Something like that, if it's maintained and if it doesn't become bloated or anything, if it's all nice and sleek and easy to use, you're going to see usage for the next couple of years easily. So, no, you're not going to get rid of it that easily. And I don't plan to. And, you know, it surprises me sometimes how often I will need to run a one-off of, like, SAMtools or something like that. And I'll just do it through the StaPH-B Toolkit, just because it's installed and I know it's there, and I don't have to do the whole Docker command or worry about any of that. So, yeah, it's actually been quite useful. So I think what people might be thinking is: yeah, yeah, yeah, but I can just do conda create X, Y, Z, samtools, and then I'm up and running; why would I do this? And your problem is going to be when you start playing around with workflows and you want to make your application more portable. Then you run into the fact that conda doesn't port very well. It's not something you can drag and drop onto the cluster, onto HPC, or even when you're shifting laptops or something like that. You're going to have a problem, and the containers make more sense, but then the containers are more opaque. 
So I think there's a definite use case there for this kind of application, making it more digestible. And it's going to hook into the workflows, and you're saying that you've already packaged in workflows. So it's sort of like, why would you not want to use it? I definitely hope that's the case. And I know Erin has been really valuable in contributing Cecret to the StaPH-B Toolkit. And so, yeah, actually, I'd like to hear your perspective on it. So Cecret is in the StaPH-B Toolkit as well as in its standalone repo, and that's an interesting dynamic, keeping them both updated. Generally, they sometimes get out of sync. Don't let Kelsey know. But the StaPH-B Toolkit is also there because, as these public health bioinformaticians are looking for options on how to analyze their data, there is a massive number of workflows for basically anything. If you just do a Google search for, say, AMR detection, you'll probably find 10 million antimicrobial resistance gene workflows for all kinds of species and organisms. And that can be a bit overwhelming. So one of the things the StaPH-B Toolkit does is curate workflows for the StaPH-B community and ensure that these workflows have some sort of public health focus. There's not a workflow for everything yet, but there are a lot of really helpful workflows there. Yeah, I think now, with the improvement in the languages, it's much easier to prototype them and throw something up to fill that gap. It's going to converge on a space where we're all just picking up and using each other's workflows. I'm curious, if we've got time, to jump back to the point about Terra and more web applications sitting on top of this. Do you really think it's just going to be... that we're going to just run something like that? That this is just an intermediate step until we sort out how to do a web application sitting on top of these workflows, and then that's all anyone will interact with? 
I think there's a lot of commercial interest in that. And I think the simpler you can make workflows to use, like the simpler you can make any sort of interface for a workflow, whether it's a Snakemake workflow or Nextflow or CWL, the easier you can get non-bioinformaticians to use it, the more likely that workflow is to survive long-term. And so I think there is a huge push and drive to go in that direction. I definitely agree with some of that perspective. I think there's definitely a use case in United States public health laboratories, where laboratories doing sequencing need easier access to some of the workflows and things that are happening. And I think some of these commercialized approaches, Nextflow Tower, Terra, are great examples of how some of these laboratories with very limited resources might be able to get access to some of these things. Before the pandemic, there was really a big push to get these labs to learn how to use the command line, and I think there's been a realization that a lot of these laboratories just don't have the resources or capability, or even the compute environments, to do that. And so a cloud-based web tool, I think, is a perfect solution for them. At the same time, for a lot of laboratories that are developing a lot of expertise around genetics and genomics analyses and sequencing, I think it's going to be hard to get away from the command line. There have always been things, quote unquote, threatening the command line, and it's stood the test of time. And I think that really speaks to how flexible it is for working with massive amounts of files and various file systems, or across networks and things like that. Stuff that just really isn't... I mean, you can definitely abstract some of that away and put it into a web interface, but at a certain point you need to go back to that root access and have the ability and flexibility to access those tools and resources. 
And I think the Docker images and the Toolkit and all these things will definitely continue to persist and provide a resource there for people who want that flexibility, who are going to be sitting more on the cutting edge of these development cycles and contributing to some of the resources that we have at StaPH-B. That all sounds very exciting. And either way, it sounds like everyone's actually finally going to have a choice of how they want to approach their analysis. And hopefully in a few years, they won't have the excuse of saying, oh, I couldn't look at it because I don't know how to use whatever. It's like: no, no, no, no, no. Installing is easy now. Workflows are easy now; command line, web, whatever, it's all available. You've got no excuse. What ST is it? Like, what, whatever. Yeah, then it will be fun to focus on the analytic aspects, what questions can be asked with the tools, instead of training people how to install the tools. We can finally do some biology. Is that what you're saying? Yes. Let's make biology easier for everybody. I think that's the end goal. I think, by making the tools more digestible, we'll realize we don't actually know very much about the biology at all. Well, that gives you more podcast episodes. So, right. Oh, we're here forever. It's always chopping and changing. We will be running this for ages. That's a nice positive note, and I think we will draw this episode to a close. Well, there you have it. You're a dinosaur if you don't catch up with containers and the StaPH-B Toolkit. Guys, get on it. No excuses, it's too easy now. It's too easy. You've got to jump on it. We appreciate Drs. Erin Young and Kelsey Florek for coming on and talking with us today. And we hope to see you next time. Thank you so much for listening to us at home. 
If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.