Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Welcome to the Microbial Bioinformatics podcast. Nabil, Andrew, and I are your hosts today. If you haven't already, it might help to listen to the last episode discussing WDL. We're talking about Terra.bio, an open-source platform that manages genomic data and computational resources on a cloud backend. If that confuses you, stay and listen to the episode. Joining us to discuss Terra.bio is Dr. Joel Sevinsky, the founder and CEO of Theiagen Genomics. Dr. Sevinsky has leveraged over two decades of experience in systems biology and has taken aim at transforming public health and infectious disease surveillance through an innovative implementation of NGS and bioinformatics technologies. During a three-year tenure at the Colorado Department of Public Health and Environment, he led several initiatives to build NGS testing and bioinformatics capacity at the state, regional, and national levels. In 2019, Dr. Sevinsky left CDPHE and created Theiagen Genomics. Dr. Sevinsky and his team now work with over 40 public health laboratories nationally and more than two dozen internationally in Africa and Asia, building bioinformatics capacity for public health. We are also joined by Dr.
Danny Park, the group leader for viral computational genomics at the Broad Institute of MIT and Harvard. Over the past nine years, he has facilitated the conversion of research-grade viral genomic and metagenomic analysis pipelines into portable, containerized cloud compute workflows that have been in use by collaborating genomics labs in West Africa since 2015. Dr. Park has also co-chaired the infrastructure working group in PHA4GE, the Public Health Alliance for Genomic Epidemiology, which works to define standards and best practices for compute infrastructure for public health labs globally, and what is required of labs to support modern containerized bioinformatics workflows while maximizing the portability and reusability of those workflows. I'll get things kicked off with the opening question: what is Terra? Danny, do you want to give us what Terra is in your eyes? I can speak to that, and I think it helps to clarify some of the different layers of things that are going on, because it is a little unique. Terra, in my mind, is, in a way, the web UI. It's the platform that most end users are going to interact with to manage their data, their compute, their pipelines, execute their work, that type of stuff. Underneath all that is the execution engine and the languages that are used in there. So you might hear about Cromwell and other things that actually handle the orchestration, but Terra is the glue. It's the thing above all that, that the end user will interact with. That's good. And I think, was that a design decision based off what we were talking about last time with WDL, of trying to keep things portable and modular? I think there's a big part of modularity in there, and we can get into some of the things that I found to be unique about Terra in that modularity space.
The execution engine, the actual software that interprets the WDL, dispatches jobs, spins up cloud VMs, and moves data around, is a standalone thing that you could run on a cluster. And it's also not the only engine that runs WDL; there are other ones too, right? But that is modular and separate from Terra. Even some of the underlying bits of how Terra manages data or moves the pipelines around are also modular. And there's a whole set of APIs designed around those, not just at the Broad, but through that whole standards consortium, around how to talk about data repositories, what they call tool repositories, execution engines, that type of stuff. Terra is the glue above all that, and the pieces within it are things you could break out and implement separately. Anything you'd like to add to that, Joel? Yeah. When I look at Terra, I think about when I first became involved with workflow managers and working on the command line and all that was involved there. When you run something at the command line, where is it going to be stored, in what directory, and how deep down the rabbit hole do you have to go to find your results? Terra creates this infrastructure platform around the Cromwell engine, and also around some other Google resources, to make cloud computing, workflow management, and containerization much more accessible for mere mortals. It adds not only ease of use, visualizing through a GUI application with point and click for those who might not be command-line savvy, but also functionality for data management: capturing the results after a workflow and organizing them in a way that can actually be useful for timely decision-making.
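To make that layering concrete, here is a minimal sketch of a WDL workflow of the sort Cromwell executes underneath Terra. The names and the container image are illustrative, not from the episode; the point is that the workflow and task definitions are plain text that any WDL engine, not just Terra, can run.

```wdl
version 1.0

workflow hello {
  input {
    String name = "world"
  }
  call say_hello { input: name = name }
  output {
    File greeting = say_hello.out
  }
}

task say_hello {
  input {
    String name
  }
  command <<<
    echo "Hello, ~{name}!" > greeting.txt
  >>>
  output {
    File out = "greeting.txt"
  }
  runtime {
    # Each task declares its own container; Terra/Cromwell runs it in isolation
    docker: "ubuntu:20.04"
    memory: "1 GB"
    cpu: 1
  }
}
```

Because the runtime requirements travel with the task, the same file can be dispatched to a cloud VM by Terra or run locally with a standalone engine.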
So this is sounding very futuristic and quite different to what we're used to. The way you're describing it, effectively you don't need to own compute, because it's provided from a cloud backend. If you have the pipeline you want specified in WDL, you can just run it in whatever environment you want. You don't need to care about the execution engine, the thing that actually runs those tasks, because Terra is doing it abstractly for you in the background, under the hood. So effectively it's like: I have data, I spin this up, I insert data, I get out a Nature paper. I mean, that's the dream, isn't it? There's a little more in that last step there that you left as an exercise to the user, but yes, it'll get you up to the doorstep, and you've got to do the other bits yourself. But this is cutting out the start, where it's: I need to buy server racks, I need to set up my environment, I need to write the workflow myself because the architecture of my working environment is set up in a particular way, so I have to do it this way and can't just use what someone else has done. All of that, you're just cutting out. It's: here's data, I'm going to get straight to work. And not only that, Terra allows you to cut out some of the sysadmin work of running a workflow manager. If you're working with workflow managers, whether it's Nextflow or Cromwell or something else, there's a certain amount of configuration that goes on for the specific environment you're working in. With Terra, you just log in, create your workspace, and you're off and running, and you don't have to worry about that configuration, because Terra is managing those resources for you.
It really was designed for the end user. Think of the bench scientist, maybe not the informatician: someone who has generated the data, knows what kinds of pipelines they want to run, and those pipelines exist in some form available to them, either through a collaborator or off the shelf. They just want to turn that data into analyzed results. They don't have to bring their own compute infrastructure or hardware, or even their own tools necessarily. They just need to connect the data and the pipelines together in the workspace. We've touched generically on what problems Terra is trying to solve, but I'm interested, first from you, Danny: where did the conception of Terra come from, and what was it trying to do at the beginning, in its infancy? Then maybe we can bring in Joel to talk about his perspective now in public health and where he sees it going as well. When Terra was created, in the beginning, what was it tailored for, Danny? What was it trying to solve? I mentioned this last time: I'm at the Broad Institute, but not in the group that makes it. So this is a little bit from the outside looking in, or at least the origin story that I'm aware of. It was birthed maybe a little less than a decade ago, mostly out of human genomics, and in particular cancer genomics, where there's just a lot of data, and large data at that. Its real origin was motivated at the Broad because the scale of compute that needed to happen for human cancer genomics was simply outpacing our ability to grow our HPCs and server racks. We just could not add more servers; we were actually getting to the point where we could not add more electrical power input. It just wasn't really making sense. And there were a couple of other things that weren't making sense.
Why were we sloshing around humongous data sets that had 50 copies of the same humongous thing? Again, I say this a little bit as the outside looking in, because I'm not in human genetics or cancer genetics. I don't have 40-terabyte data sets to deal with, and I assume a lot of you don't either. I don't need 5,000 cores of compute, because my compute needs aren't at that scale. But a lot of what Terra, and a lot of cloud compute platforms in general, not just Terra (DNAnexus, Seven Bridges, all that type of stuff), a lot of that went to cloud to scale. That was their motivation. For me, again, I work in a group that has had a long-running priority in handing pipelines that we developed in our group to institutions globally, particularly some of the folks who work in West Africa. So I had no motivation around scale. We didn't need bazillion-terabyte data sets or that much processor power. I mean, there are a few things that eat some memory, but we could get that done on a cluster, generally. For me, it was more about the portability piece, and there are two pieces to that. One is that I learned pretty quickly that web wrappings around a compute infrastructure, a nice web UI that gives users flexibility to do lots of different things but abstracts away a lot of the command-line stuff, increased accessibility, not just overseas, but in my own group. Our wet-lab bench scientists didn't have to bother the informaticians for every little assembly or alignment or little report that they needed. They could do it themselves. But also, for accessibility overseas: you didn't need to buy server racks. You didn't have to power those with UPSs and diesel generators.
If you didn't have reliable power, then as long as you could upload the data, your power could go out all day and all night, but the compute would go on, and the data would survive. And especially for microbial stuff, it's kind of nice, because cloud compute doesn't end up being that expensive, since it's not actually that intensive computationally. It's partly about Terra, but it's partly about web-based platforms for cloud compute in general. The initial motivations came out of human genetics. They came out of scale, out of doing things bigger than you could do in your own institution. But what's nice for us on the microbial side is that even if that wasn't our problem to solve, we could benefit in a lot of other ways, like portability to environments that didn't have a compute cluster or even stable power. That wasn't their priority, but it happened to work out that the solution was the same in the end. And Joel, what about you, in terms of using Terra in the public health space? What do you find are the problems that Terra solves, or is fit to solve, really well? Yeah, there are a few of them, because the first, I guess you could say, four years of public health experience for me, trying to build bioinformatics capacity, were really focused around training users at the command line: getting them on virtual machines, teaching them how to run scripts and find the results. A lot of times you can have these courses, or even just-in-time courses, and teach people, but unless somebody's doing this every single day, they tend to forget. You're not getting the real full benefit of the training and the time that you're putting in there. And it's not something that's very sustainable unless the laboratory has several bioinformaticists on staff who are constantly running these analyses.
And in the U.S., maybe you have about a dozen laboratories that can do that, but then you still end up with another 40-plus laboratories that can't. For me, what Terra was really addressing was that need to get users away from the command line. Also a need for an open-source solution. Most of the options you had for getting away from the command line and into more of a GUI were commercial applications, and they're very closed-box: a lot of times you don't know what the algorithm is, how exactly it's being run, how it's being implemented. So not only do you need to get away from the command line, you need something, in my opinion, that's open source, so that public health scientists can actually see what's going on. And then also, and this is almost the hardest one, you need a resource that's acceptable to government institutions, right? Because you're going to have to go through some type of IT approval, and you're going to have to go through some type of procurement approval. Where Terra was able to address some of that is that because it was on Google Cloud, and because Terra managed all the Google resources, you could generally get IT buy-in, because it was more of a software-as-a-service in their mind. If they wanted to put a label on it: software as a service, where Terra was managing all those backend components. It was just a BYOB solution, as I call it, bring your own billing account, so you had to partner with some type of cloud reseller to get a Google billing account.
And then also, from a data security and privacy standpoint, we were able to have concessions where, okay, all the data that's going to be on there is non-PII, non-PHI, and you could assure that and say: listen, the only thing that's going to be there is genomic data that is either already publicly available or will be publicly available based on grant deliverables. So Terra really delivered that: getting away from the command line, open source, and in a very SOP-driven fashion. Once you taught users how to use Terra, which you could do very quickly, it becomes more of an SOP-type procedure, which is what they're used to with their wet-bench protocols: you can follow an exact process to do an analysis and get results. So there is one common thread I heard from both of you, which subtly worked its way in there: the demand for bioinformatics expertise far outstrips our ability to produce it, to supply it, and that seems to be one of the critical things Terra is providing. And a lot of other platforms, obviously; I mean, here on the channel we love Galaxy, and there are tons of other options out there that, to me, empower the user to do their own work. One quote we always say is: you don't want everyone to have to be a sysadmin to get their work done. Trying to move away from that. That, for me, was the one subtle echo again: we really can't produce bioinformatics capacity as quickly as we actually need it. Yeah. And I think there are two pieces to that, right?
I mean, when I say I have a bench scientist in my own lab who maybe in the past would have had to bug an informatics person just to do some basic analyses, it's not just that there aren't enough, that I wish I could hire more informaticians; it's more that I wish that informatician could spend their time doing something more useful. If I had the choice, they would be developing more pipelines, or running other analyses that require a little more thought than just assisting everyone. In a way, I don't want them to simply be part of the analysis infrastructure, just a human cog in the machine, right? On the other side, and for us this was a big part of it, a lot of our work in West Africa actually started out of the NIH and Wellcome H3Africa program. A lot of that was about not just developing the capacity in a lot of these countries to have pathogen sequencing labs and genomics labs, but the ability to actually do all the analyses and the interpretation themselves. If you create a dependency on a human informatician, and they don't have one in-house, then you're actually creating a dependency across institutions, or perhaps across countries, that isn't necessarily the long-term healthy dependency that you want. Right. So that's part of why we were aggressively trying to push this into the hands of the people who generated the data, the scientists who know best what they actually want to do with it. Better to put it in their hands. And to me, Terra really increased the accessibility of the bioinformatics work that was being done by the bioinformaticists who were already in public health.
An example of this is the containerization and workflows that were being brought up for SARS-CoV-2 analysis. You could look at things that Kevin Libuit or Erin Young were doing with Monroe or Cecret. They had these workflows designed around a chain of commands to different containers, and you could readily adapt their work using the WDL workflow language, with just some subtle changes in the structure, and then make it readily available to anybody who can get access to Terra. So it's really a way of increasing the accessibility of the work that's already being done, for a larger audience, without requiring them to have some type of on-prem resource or an on-prem bioinformaticist. I mean, Joel, part of your sales pitch as a company is that you can get a new lab up and running in how many minutes, or something like that? It can be 60 to 90 minutes, depending on the type of work that they want to get done. Joel, you've mentioned the S word, SARS-CoV-2, and we also had Danny talk about human genomics and the big data problem they were trying to solve, which is part of the motivation for Terra to begin with. Let's talk about that. Let's talk about COVID-19, which I guess is the first microbial big data problem, and where and how Terra fitted into that story. That was a crazy time, for many, many reasons. When the pandemic started in 2020, in February, I was working with a lot of different public health laboratories, and then they became inaccessible to me. The reason is that they were dealing with a dumpster fire of specimens coming into the laboratory, and standing up diagnostic testing to get a handle on what was actually going on with the pandemic. So in February, March, April, May, there wasn't a lot of conversation about bioinformatics and sequencing.
And I thought, oh, this is really bad for business. Everybody's gone, the meetings are shutting down. But little did I realize it was like one of those scenes before a tsunami, when you're sitting there on the beach going, where'd all the water go? The water recedes, everybody goes out, and then the water comes. That's what happened in that summer of 2020: programs like PulseNet and other CDC programs did a phenomenal job building up a workforce and infrastructure for sequencing. So all these laboratories nationally had the sequencing capabilities they needed to perform SARS-CoV-2 surveillance. What they didn't have was the bioinformatics, or there were labs that had the bioinformatics, but they weren't the majority. So you had to have something that was accessible to the average public health laboratory, for them to take their sequencing results, when surveillance started up in earnest that summer of 2020, be able to analyze them on their own, and start informing their leadership and decision makers as to what was going on in their region with the pandemic. And that's where, as I touched upon before, there were some workflows already available, with Cecret and Monroe, that were written in Nextflow but built upon the containers of the StaPH-B repository. Those containers were freely available, and new containers were coming on for VADR and Pangolin and a lot of other SARS-CoV-2-specific tools in a really short period of time. Andrew Lang, I think, did the first iteration of it: he just wrote a little workflow looking at what Cecret and Monroe were doing and copied it. Let's not recreate the wheel of somebody else's best practice here.
Let's grab one of those workflows, rewrite it, and we got a working workflow in WDL. And this is a key thing about Terra too. What's really unique about Terra right now, in my opinion, as to where this whole workflow management space stands, is, one, the workflow repository, or registry, that you can get through Dockstore. It's not like you have to clone something out of GitHub. From within Terra, you can click on a workflow tab and say, show me all the workflows available, and it brings you to Dockstore. You literally just click on the workflow to bring it in. So you have this easily accessible registry of workflows. We were able to put that workflow in that registry, so anybody who had Terra could access it. The other thing that is really important about Terra, and Danny talked about this, is the data management side: the fact that you have data tables within your workspaces. The workspaces aren't used just as a place to bring in a workflow and launch an analysis. The workspace also manages your data and gives you the ability to share that data between collaborators. So laboratories could very quickly upload their data, import a workflow, say I want to run this workflow, click on the number of specimens they want to run it on, and then click run. Forty minutes later, they had all their results laid out in a data table, whether it was 10 specimens, 100 specimens, or 1,000 specimens. And what's really great about this, getting to the pandemic side of things, is there were instances where we could get a lab up and running in 60 to 90 minutes. All we had to teach them, and we were able to record it so they could go back and look at it, because everything was done virtually at this time, is: we'd get on with the laboratory, and we'd make sure that they had Google credentials, because Google authentication is used for Terra.
We'd get them into their workspace, and we could clone a workspace, which, once again, is that portable reproducibility. We set up a workspace, we just make a copy of it and give it to them, and the workflows could already be in there. Then all we have to teach them is: this is how you upload your data into a Terra workspace, whether you're grabbing it from BaseSpace or grabbing it locally; this is the way you create your data table; and here's the workflow you want to use. Then you click this workflow, you put in the input parameters, you select your specimens, and hit go. It's very SOP bioinformatics. When the analysis is done, they don't have to look through a directory structure for the results. It's populated in a data table. So they could look, very much in a BioNumerics PulseNet way of looking for your experiment results, at data tables where each row is a specimen and each column is a particular result or output from a bioinformatics algorithm in that workflow. It really helped bring on a lot of labs very, very quickly, able to perform their own analyses. Hence why there are over 40 labs using Terra nationally right now. It really addressed the needs of the end user, that end user being the public health scientist who has an understanding of bioinformatics that is probably more on the wet-bench side of things, but needs an easily accessible, browser-based bioinformatics solution for their analysis, and a highly scalable one. I mean, we've done runs with easily tens of thousands of specimens at once, launching that on Terra. And once again, you have that cloud backend. Terra is exclusively cloud-based, so you have that scalability baked in from the very beginning, and you don't even have to worry about it. So does that mean people didn't have to deal with any comma-separated files? They could just look at it straight in the browser?
They didn't have Excel files, nothing? There's a tiny bit of Excel in there. When you set up your data table in the beginning, you do need a tab-separated value file to upload to create your initial data table. And you can download all your results in that tab-separated value form. But for the most part, you get away from that. Once you get your data in there, it's really nice, because your data table is essentially a table of tab-separated value strings. But those strings, for the most part, if they have to do with a file result, are actually pointers to a Google Cloud Storage bucket location, to an object in a storage bucket. When Terra looks at that data table, it renders those addresses as links. So if you want to download, say, the BAM file for an alignment, to look at something, you just click in the data table and you download that BAM file. Also, what you don't see in Terra, which is kind of hidden but is there if you're a bioinformaticist, is all the backend logging and auditing, which is still available in the Google Cloud Storage bucket where the execution directories are and where everything was done. You could actually go in as a bioinformaticist and look at the exact command that was run in the bash script, with all the parameters that were given. It gives this extensive detail on what was actually run, which makes bioinformaticists very comfortable, because you can see everything that was done in the most minute detail, but it's still super accessible to all the bench scientists.
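As an illustration of the data table being described (the sample names, columns, and bucket paths here are hypothetical, not from the episode), the TSV load file for a Terra data table uses an `entity:<type>_id` header for its first column, and file-valued cells are `gs://` object paths that Terra renders as download links:

```tsv
entity:sample_id	read1	read2	collection_date
SAMPLE-001	gs://example-workspace-bucket/uploads/SAMPLE-001_R1.fastq.gz	gs://example-workspace-bucket/uploads/SAMPLE-001_R2.fastq.gz	2020-07-14
SAMPLE-002	gs://example-workspace-bucket/uploads/SAMPLE-002_R1.fastq.gz	gs://example-workspace-bucket/uploads/SAMPLE-002_R2.fastq.gz	2020-07-15
```

After a workflow run, output columns (assemblies, QC metrics, lineage calls) are appended to the same rows, which is the "each row is a specimen, each column is a result" view discussed above.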
So that's why it took off in the pandemic: it was a solution that was available right then to provide public health labs with that infrastructure, with those tools, to have situational awareness of what was going on with SARS-CoV-2 in their region and make decisions based on that. I had one question. You mentioned data sharing between the different groups. Does Terra provide rich user access control for data? Because that's something that's not usually implemented, or is implemented rather poorly, in other offerings, in my opinion. Yes, it does, especially from a public health perspective. We have multiple projects with different laboratories, and some of them can be, let's say, a collaboration where you have multiple industry partners and state partners and county partners. When you use something like Terra, because it's a centralized system, and we could get into a whole other discussion between centralized and decentralized, but because it's a centralized system and you're using Google authentication, it makes it very, very easy to control permissions as to who can view a workspace, who can actually compute in that workspace, who can add other users, who can share it, et cetera. So there's a lot of granularity and hierarchy in the permissions you can have for these different workspaces, which lends itself very well to multi-collaborator projects where you have to aggregate data across multiple sites in real time and perform very large-scale analyses. So yes, it has those great permissions. So I'm wondering: errors will occasionally occur when you run stuff. Storage will disappear, connections will drop, or maybe things run out of memory, that kind of thing. What does Terra do to counteract those or take them into account? Yeah. The major error we have is spaces in filenames.
It's the bane of all our existence, but for the most part, the errors aren't that bad. And this is, once again, the nice thing about having a system with workflow managers and containers: usually, if one person's having an error, everybody's having an error, and you can address it rather quickly. So you can have a small team of bioinformaticists. This is where I think Theiagen has excelled in supporting the workflows that we developed and made completely open source for everybody to use. If you have 40 different laboratories using those workflows, errors are identified very, very quickly, and they can be corrected very, very quickly, because you can update something in GitHub, which immediately gets propagated through Dockstore and into Terra as a new version. Which is another key thing about those workflows: the versioning, so that previous versions are always available. Laboratories don't have to change until they're ready to change, to re-verify for their own QC management requirements for the laboratory. For other errors, this is where the collaboration between the Broad Institute and Theiagen has been really useful: we have a lot of access to the Terra engineers and different teams, and can even get people like Danny, who aren't directly involved with Terra, to bug them and say, hey, there's something going on. When we have errors, they usually are identified quickly and resolved quickly. And we'd be remiss not to mention Christine Loreth from the Broad Institute, who has been instrumental in these collaborations and in helping out a lot of times. She used to be the first person we'd complain to. Now we have more formal channels and access to a wider group of people. But for the most part, there are rarely any connectivity errors. You have the robust Google Cloud backend supporting things.
I think the vast majority of errors that we deal with are more syntax errors: something wrong with a file name, a space. Do you have any fun acronyms for user error, like problem between keyboard and chair? PEBKAC, or something like that. Yeah. Well, this is also the really nice thing about supporting it: the approach we've really taken is that everything's open source, and what we want to promote is iteration. If we're going to fail, we want to fail early and often, and we want to have this practitioner-led R&D where we just open everything up, people use it, and we get the feedback as quickly as possible and improve it as quickly as possible. We're not afraid to be wrong. We're not afraid to make mistakes. We'd prefer not to be wrong, we'd prefer not to make mistakes, but that's part of the process. And because we open things up so quickly, and to so many people, and we make things open source, those are usually found pretty darn quickly. Usually it's by Andrew Gorzalski in Nevada, who's an exceptional collaborator who uses everything that we build, and is usually the first person to find an error or an inaccuracy, and we love him for it. So, does that answer your question, Andrew? I guess maybe I come from a world where we use HPC a lot more, and you get random PhD students who will launch 10,000 jobs on a cluster and crash some random piece of kit that you've never heard of. Then your stuff goes down, and you have to restart it, and whatever. So it's great that Terra does, I guess, just by virtue of using the cloud, smooth over some of those. Yeah.
So one of the things they say around here is that, philosophically, when you pivot from on-prem resources, where you have a fixed pool of resources and you have to manage fair usage of it, to cloud compute, people naively say, "Oh, cloud compute is effectively infinite. You ask for as much as you want, and everyone is effectively isolated from each other." But we always find ways to probe the limits of the infinite; we discover some of the edges of those simplifying assumptions over time. It's possible to run out of things. It's possible for one person's crazy workload to impact others. And those are always moments of, "Oh, we discovered something new about the system." I was going to say that we broke Terra a lot, and it's made it better. The Broad Institute has been really open and willing to cooperate and learn from us how we're using it, which is different from the way their traditional stakeholders have been using it. In an academic sense or a clinical study sense, you have all this data that you're putting in at once, and things are much more organized and structured. Public health does things differently. I like to consider public health like the mail: the specimens just keep coming, right? They don't stop. And there is no set procedure. A lot of times the data comes in and then you get the metadata; other times you have the metadata and then the data comes in; and other times you've got to correct everything. So there were things that we were doing at scale in Terra where, yes, the cloud is infinitely scalable, but interacting with the cloud through a browser might not be infinitely scalable. So there were some learning curves there as to how public health wanted to use Terra.
And then we did break some things, and the Broad fixed them. Well, the scaling is on a different dimension, right? We're not working with thousands of human genomes; we're working with millions of tiny genomes. So we discovered a different way that the scalability needs to work. It's a bit beyond the scope of this particular topic, maybe more of a general informatics discussion point, but to your question about how these errors get handled: there's user error, there's error in the code that we write, and there are errors that happen at an infrastructural level, but there's a fuzzy space in between those, right? Like the space in the file name: that's a user error, but I could harden my code to be more resilient to it. So there are things you can do at the layer of writing the pipelines to make life a little more fault tolerant for the user. But where does the responsibility lie? Similarly with out-of-memory errors: okay, technically that's the fault of the folks who wrote the WDL that said this step needs that much memory, and maybe we should write it to actually request as much memory as it needs. But the Terra folks added this nice little "retry with more memory" button, so that you didn't have to bug the pipeline developer every single time you just wanted to see if maybe that's all it was. So in a way there are different modes of responsibility, but there is a fuzzy boundary between them. I think we should start moving to the final points, which are the future and how people get involved. We've actually covered most of the perks of Terra just as we've been going along. Yeah.
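As a sketch of what Danny is describing: a WDL task declares its resource needs in a `runtime` block, and an underestimate of `memory` there is the classic source of out-of-memory failures. The task name, tool, and values below are hypothetical, not taken from any specific Theiagen or Broad workflow:

```wdl
task assemble_genome {
  input {
    File reads  # single-end reads, for illustration only
  }
  command <<<
    spades.py -s ~{reads} -o assembly_out
  >>>
  runtime {
    docker: "staphb/spades:3.15.5"  # hypothetical container tag
    cpu: 4
    memory: "8 GB"  # if real peak usage exceeds this, the task is killed out of memory
  }
  output {
    File contigs = "assembly_out/contigs.fasta"
  }
}
```

Terra's "retry with more memory" option simply reruns the failed task with that memory value scaled up, so users don't have to ask the pipeline author to edit the `runtime` block just to test whether memory was the problem.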
Well, I can talk a little bit about where it is and where it's headed, at least as far as I can see in the bits that I'm involved in. It is an actively evolving thing. A few years ago, what could Terra do and what could it not do? In the past it was strictly a platform that would run pipelines described in WDL on a Google Cloud backend, and that's what it did. If you wanted to run on Azure, or run some Nextflow or whatever, those things were just not within scope. Because of various grants and partnerships and priorities, a lot of those things are now actively evolving on its development team's roadmap. When Terra was born several years ago, its main partnership was between the Broad Institute and Verily, which is why all of it was really baked into Google. A year or two ago, it announced a new three-way partnership between Verily, Microsoft Azure, and the Broad to bring it there as well. There are all sorts of reasons for that, but there's been active work under the hood. You can't yet spin up a workspace backed by an Azure bucket and Azure compute resources, but within some period of time (I'm not involved in their roadmapping) you will be able to, and there are some intermediate checkpoints along the way. Presently you can run Galaxy pipelines in Terra. It's not quite as smooth as running a WDL; you can't quite just click "import from Dockstore," but it works for the folks who do that. Nextflow is on the roadmap at some point. I know there's some really alpha experimentation with it, but the idea is that it's supposed to be a platform that can generically work with a few different backends.
One day it will hopefully be better at being multilingual: manage your data, help you share your data, manage your compute backends. I know some of the projects it has going with the NIH right now, with DS-I Africa, have the ambition of having some of that data locality, and even compute, be able to work with on-prem resources in Africa. That's a longer-term timeframe, and it's dependent on several key pieces actually being pulled off properly, but those are the ambitions for the future. In the near term, what can a user do today? It's Google, and it's WDLs, and a little Galaxy if you know how to tinker around with it. So, Joel, anything you'd like to add to that? When thinking about the future, I think what the past couple of years of this pandemic and Terra utilization have taught us is that this model of a browser interface, a workflow manager, and a cloud backend works for public health bioinformatics. And it works really, really well. So the idea now is: how do we improve that, and what considerations do we have to take into account? To me, they focus around this whole concept of centralized versus decentralized resources: sometimes you want all your eggs in one basket, sometimes you don't. There's also the topic Danny just touched on, Nextflow. We have a couple of different workflow languages that are predominant, so having some better harmonization there would definitely help. And then some better data management capabilities. That's one place where Terra has excelled compared to other solutions: data management is just baked into the solution. But even that data management can be improved quite a bit, especially in the context of public health. When you think about LIMS integration or reporting or things like that, there are plenty of other improvements.
The multi-cloud aspect is definitely something there too. Azure would be great; I think if you cover just Google and Azure, you'd cover 99% of the use cases. I do see a migration of public health towards using Azure resources, and I think a lot of this is driven by pre-existing contracts and familiarity with Microsoft products and Microsoft enterprise solutions at the state level. So being able to adopt more Azure solutions could definitely help. Overall, I think this is great; it's looking really bright. We're seeing a lot of other workflows being developed for other pathogens that are already being used, and I think we're going to have a lot of options. And I think we're going to see a lot of these solutions that right now may appear to be competitors actually complement one another and strengthen the overall public health system. So how does a listener get on board with Terra now, after listening to us? They're so excited. How can they get involved and use Terra? Terra.bio. You just need a Google authentication and you can register for Terra online. And then there's plenty of training material out there on how to use Terra and different workflows. Dockstore.org is another resource for this. I guess we'll have to make sure we have all these links in the show notes. And then we could probably put some links to some of the Theiagen trainings that are publicly available, and other online resources. The Terra group also has a lot of great tutorial videos on the basic functionality of using Terra. So there's already a lot of information out there. You just need a Google ID, and it could be as simple as a free Gmail account; there's some mechanism for free trial credits and that sort of thing, just to get your feet wet.
And yeah, there's a lot of material out there: the Broad has a bunch for the human genetics use cases, and Theiagen has a bunch of good ones for the viral and bacterial use cases. I think those give a flavor of what it looks like for the end user, even if you don't use those specific workflows. For those who are interested in writing or porting their own pipelines, there's a decent amount of material out there on how to write your WDLs; Lynn Langit has a few, and there are a couple of others produced by the Broad team as well. All right. So I think I'll throw open one wildcard question to both of you, and then we'll bring today's recording to a close. In the background, we've been hinting at cloud computing as a key way of underwriting a lot of microbial bioinformatics infrastructure, and from the way both of you are talking, you seem to feel that's the way we're going to go. What do you say to that? Well, I'll start off. When I got involved eight or nine years ago to help port our work, through the H3Africa project, to a lot of our collaborators in West Africa, I did not at the outset think that cloud computing was going to end up being the strategy, until we actually dabbled in it, saw users using it, and realized this is actually the fastest, simplest way, and it checks a ton of different boxes. There's a whole episode you could perhaps do around things like security, ownership, all the usual concerns, so we'll leave those for some other time. But for me, again, it was realizing that it wasn't about going to scale, it wasn't about going to petabytes, all the reasons why everyone else went to cloud first; it's actually about access, being able to increase access to the code and to the data. That's the thing that really worked for us.
I definitely agree with Danny on the access aspect of things. I've been a cloud evangelist probably since around 2009, the first time I fired up AWS and was able to do a two-week analysis in two hours for $80. I was sold: this is just the greatest thing ever. And in public health, I don't see how you don't go in this direction, especially when technology changes so quickly and infrastructure can be expensive to maintain. When you look at how you use cloud computing, you're moving to a model where your computational infrastructure is no longer infrastructure. It's no longer capital equipment; it's actually a consumable. You consume cloud resources, and you can budget for cloud resources based on the number of specimens that you're going to run through your lab in a week. So I think there's going to be this big paradigm shift in the way that compute is used, especially in public health for infectious disease surveillance: it becomes a consumable, no longer capital and infrastructure. On the cost side, it's worth realizing how small the cost is for microbial work, especially compared to the data generation costs. Yeah, and that's what we emphasize a lot: it's a couple of dollars per specimen for analysis versus a hundred dollars for data generation. All right. And on that practical note, I think we'll draw today's episode to a close. That's all the time we have for today. We've been talking about Terra.bio, an open source platform that manages genomic data and computational resources on a cloud backend. Please check the podcast description; there will be links to a lot of the resources we've been talking about today. I'd like to thank our guests, Joel and Danny, and we will see you next time on the MicroBinfie podcast. Thank you so much for listening to us at home. If you liked this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice.
Follow us on Twitter at @MicroBinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.