Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil Ali Khan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the US.

Welcome to the Microbial Bioinformatics podcast. Nabil, Andrew, and I are your hosts today. We're talking about the Workflow Description Language, WDL. Joining us to discuss WDL is Dr. Joel Sevinsky, the founder and CEO of Theiagen Genomics. Dr. Sevinsky has leveraged over two decades of experience in systems biology and has taken aim at transforming public health and infectious disease surveillance through an innovative implementation of NGS and bioinformatics technologies. During a three-year tenure at the Colorado Department of Public Health and Environment, he led several initiatives to build NGS testing and bioinformatics capacity at the state, regional, and national levels. In 2019, Dr. Sevinsky left CDPHE and created Theiagen Genomics. Dr. Sevinsky and his team now work with over 40 public health laboratories nationally and more than two dozen internationally in Africa and Asia, building bioinformatics capacity for public health. Also joining us is Dr. Danny Park, the group leader for viral computational genomics at the Broad Institute of MIT and Harvard.
Over the past nine years, he has facilitated the conversion of research-grade viral genomic and metagenomic analysis pipelines into portable, containerized cloud compute workflows that have been in use by collaborating genomics labs in West Africa since 2015. Dr. Park has also co-chaired the infrastructure working group in PHA4GE, the Public Health Alliance for Genomic Epidemiology, which works to define standards and best practices for compute infrastructure for public health labs globally, and what is required of labs to support modern containerized bioinformatics workflows while maximizing the portability and reusability of these workflows. Congrats, that's awesome.

So to kick off the conversation, I mentioned WDL in the introductions. So please tell me, what is WDL?

WDL is short for the Workflow Description Language. It's affectionately pronounced "widdle", and it's one of the languages that's commonly used to encapsulate bioinformatics pipelines in a way that's portable. It's also worth saying that the whole Cromwell thing and some of these names don't translate so well across the pond. I think WDL translates really well across the pond, because Andrew just typed to us that a widdle is a term used for pee in Ireland. So I think we're off to a great start. Yeah, you'd actually say, it's more for children: will you go for a widdle? There you go, you've learned something new today. I'm trying to think about how to turn this into an analogy, but it's not really working. There's probably a lesson in there somewhere.

Anyway, back to the actual WDL, the Workflow Description Language. What is this?

Right. So the idea is that a lot of computational pipelines, especially in the genomics space, consist of gluing together invocations of lots of different tools. You take the data and you run it through this tool and then through that tool.
And if you've come to the place where you've learned how to containerize those tools so that they're at least executable in lots of different environments, then the question is how do you actually describe all the other little bits, right? There are the file format conversions in between each step. The file here goes from the output of that tool to the input of this one, and you plug it all together. So WDL is one of those languages that's meant to help folks describe how to glue it all together. It starts with the presumption that your tools are already containerized. And given that, how do the pieces come together? What are the other bits you have to call, you know, before and after each step to get it ready for the next step? What does the flow look like? Does it have to fan out and come back together? Are there conditionals? That type of stuff. And it does it in a way that's meant to be portable across a few different execution environments. So that's what the Workflow Description Language is.

So if somebody gave me a WDL workflow that, I don't know, maps reads to the human genome, I could take that WDL workflow and run it, like, on my Raspberry Pi or my HPC?

Presuming that those are platforms that can run Docker containers and meet the hardware requirements. Like many of these languages, WDL gives you the opportunity to say, oh, some of these steps are memory intensive; you're going to need this much memory, or that much CPU, maybe this much disk. So presuming that the place where you're trying to execute it has those capabilities, then yeah, that's the idea.

Is this something comparable to, say, Nextflow or Snakemake, that kind of thing?

In a lot of ways it solves similar problems. A lot of folks may be familiar with Nextflow or Snakemake. They all come at it slightly differently.
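As a rough sketch of what's being described here, a minimal WDL file pairs a containerized command with declared resource requirements. This example is hypothetical and not from the episode: the task name, tool choice, and container tag are all illustrative.

```wdl
version 1.0

# Hypothetical task: map reads with BWA inside a container.
task map_reads {
  input {
    File reads
    File reference
    Int cpu = 4
  }
  command <<<
    bwa index ~{reference}
    bwa mem -t ~{cpu} ~{reference} ~{reads} > aligned.sam
  >>>
  output {
    File sam = "aligned.sam"
  }
  runtime {
    docker: "staphb/bwa:0.7.17"   # assumed image name and tag
    cpu: cpu
    memory: "8 GB"                # hints the engine uses for scheduling
  }
}

workflow map_reads_wf {
  input {
    File reads
    File reference
  }
  call map_reads { input: reads = reads, reference = reference }
  output {
    File sam = map_reads.sam
  }
}
```

An engine then schedules the container wherever it happens to run, for example `miniwdl run map_reads.wdl reads=sample.fastq reference=ref.fasta` or `java -jar cromwell.jar run map_reads.wdl --inputs inputs.json` (the file names here are hypothetical).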
In fact, the other one you haven't mentioned yet is the Common Workflow Language, CWL, which is quite popular in certain circles. A number of these languages were actually independently birthed around the same time, almost a decade ago, trying to solve similar problems. I think there's a lot of commonality in what each of them is trying to do. They all have slightly different unique features in how they do it and what it's like for a developer or for a user.

So WDL, if I'm not wrong, is the language specification. It's not strictly something that will execute the... So there isn't, like, a program I run called WDL that actually executes the scripts?

Right, that's exactly right. And in that sense, it's actually quite different from Nextflow or Snakemake, which have both the language specification and the execution engine tied together. WDL is more similar to CWL in that it is a language spec designed for multiple different implementations, and the language is defined pretty much orthogonally to the common implementations where you might actually execute those things. So there isn't a thing called a WDL tool or something to run it; there are several engines out there that you could use, and that's part of what we can talk about in terms of its portability. WDL is actually kind of interesting because, out of all of these, it's probably the most formally defined as a language spec. You can look up its formal spec on the OpenWDL GitHub page. There's a consortium of people who modify the spec and add features to the language, but it's not built on top of Python or Groovy or anything like that. It's a standalone language that's meant to be parsed.

So who are the main members of the OpenWDL consortium?

It's a great question. I'm not actually deeply involved in that one.
I think anyone who has interest and participates by the guidelines can join, if I'm not wrong, but the most active members tend to be the folks who are invested in the implementations. So that will be people from the Broad Institute, who operate Terra and Cromwell; people from DNAnexus, who operate their platform and use WDL a lot; and the folks from CZI, who maintain miniwdl.

I guess that answers my next question, which is what kind of financial support and backing is behind this? It sounds quite impressive, with so many different large organizations involved.

The investment in the consortium around defining the language comes from the people who are invested in the implementations of it, right? They have their own reasons to be interested and invested in its success, and in making sure that the way they want to extend the language stays consistent and portable with the other folks who are using the same language.

So I want to step back and ask a more generic question, now that we've briefly explained what WDL is about. What is the benefit? What is the problem that WDL, and workflow languages generically, are trying to solve? It might not appear obvious to people; especially when they start looking at the documentation for WDL, they'll see that this is a very complicated specification of tasks. Why are we going to this level of complexity for bioinformatics analysis?

Well, on that last note, I'll say that the documentation for OpenWDL is actually targeted towards the people who maintain the language spec and that type of thing. I tend to go there just as a language reference, but it's not actually intended for the users of it per se; it's more for the folks who implement the engines around it. But the problem it's trying to solve is inherently one of portability.
It was born around the time a lot of different folks were working on cloud implementations of compute for genomics pipelines. And inherently, at that point, once you were lifting things out of everyone's research clusters and institutional sets of tailored scripts, you know, if you're going to put in the effort to move things to the cloud, you might as well put in the effort to make sure it's done in a way that the next time you move to something else, you don't have to rework all of that. So a lot of the portability concepts were baked in from the beginning.

So I wanted to ask Joel, because you're a fan of WDL, right? Absolutely. What's your take on WDL? How do you interact with it in your work? And why do you see this as something to invest your time in?

Yeah, it's a good question. From a public health perspective, especially over the past five or six years, it's been really satisfying watching things get to the point where they are right now. And this isn't necessarily a WDL-specific answer; it's more of a workflow-language-specific answer. On your previous podcast, you chatted with Kelsey and Erin and Curtis and Kevin and others and talked about StaPH-B and containers. All along, the goal for bioinformatics in public health has been to make things more standardized and reproducible, and, as Danny just talked about, portable, scalable, and easy to use: those five main topics. When I started in public health six years ago, the only standardized bioinformatics we had at the state level was BioNumerics through PulseNet. And as we started expanding to other pathogens, or wanting alternative pipelines to some of the enterics tools for local use, you started seeing bioinformaticists come into public health. We were doing things kind of the old way. I think, Nabil, in a previous episode you talked about the way bioinformatics was done 10 years ago and the way that it's done now.
Well, five or six years ago in public health, we were doing things the way bioinformatics was done 10 years ago every place else. We were writing a lot of custom scripts. We were using a lot of on-prem resources, or just starting to use virtual machines in the cloud. One of the projects I was involved in when I was at Colorado was kind of a primitive version of workflow management and containers. I can remember this too: it had to do with Lyve-SET, right? We wanted to run Lyve-SET, and Lyve-SET, sorry, Lee, was a bear to install from scratch on a system, and it definitely suffered from the "it works on my machine" kind of syndrome. We wanted a way for other laboratories to use Lyve-SET and not have to go through the, I'll call it process, the process of installing it themselves. We essentially fired up a virtual machine on Google Cloud, got Lee to install Lyve-SET on there, and then made an image of it. Then we would distribute that image to colleagues to run Lyve-SET, which, for those of you who don't know, was used for enteric surveillance and looking at phylogenetics. This was kind of the beginning of public health recognizing the power of this modularity and portability. And, you know, with the VMs we had a way of standardizing things, a way of making things reproducible and, in some sense, portable, in that if you had access to Google Cloud, you could have access to these VMs. But it definitely wasn't that scalable, and it certainly wasn't easy to use, because you still had to be a card-carrying bioinformaticist, or have extensive training on the command line, to successfully run a lot of these things. So that's where workflow managers come in, and it's not only workflow managers, it's also the containerization.
You had a number of public health scientists, with Curtis and Kevin and Kelsey and Jake from Minnesota and others, really start contributing to the StaPH-B Docker repository, which has been discussed on previous podcasts. And once you had this registry of containerized tools, it really allowed you to leverage workflow managers to not only standardize workflows and make them portable and reproducible, but also to make them very scalable and easy to use. So yes, I've been a huge fan of WDL, and later we'll get into Terra, which is the kind of context where I think some of these workflow languages really blossom. It really helped solve those five topics and let us expand bioinformatics.

Can I take a moment to defend myself on Lyve-SET? Sure, sure. You have right of reply, Lee. I'll just say, you're right. It's hard to install. I don't know if I know anybody that successfully installed it. I'm done with my defense.

No, I mean, that was a huge thing, getting Lyve-SET into a container in the Docker repository. That was fantastic, because, I think in another episode, with Robert Petit, someone talked about the sysadmin aspect of being a bioinformaticist. Some people might enjoy that; I never enjoyed it, because it was all peripheral to getting work done and asking interesting scientific questions. And so if you can remove that sysadmin aspect of things, which workflow managers do a lot of, then you get to do more of what you enjoy. As I said, all those stupid BioPerl things. Yeah, you've got to cut that.

So there was one day on our Slack board, Joel, where you said, when are you going to make Lyve-SET v2?
And I forgot how the conversation went, but I eventually made a meme: here lies Lee Katz, never made Lyve-SET version two. Well, it wasn't even Lyve-SET two; with Lyve-SET you had to make sure you used version, like, 1.4f. Yeah. It's sort of an inside joke with Lyve-SET, and this is a problem slash solution that's never going to go away, where quality management takes over everything and you have to version everything. And I thought I fixed it. It's almost like writing final, final, final on the document that you submitted as a manuscript. I had patch number 1.1; no, I fixed one little thing, now it's 1.1.1, and now it's 1.1.4. Oh no, someone found a bug; let's call that version a. I started poking at the management system a little bit, and I said, well, here's your version. We finally finished it.

Well, that was a great time in public health bioinformatics. And that gets back to the origins of StaPH-B, which have been discussed before, and Heather Carleton's role in that. We jokingly say Heather said, "stop bugging us", but no, it wasn't Heather saying stop bugging us. Heather is an amazing scientist and always willing to help. I think what she recognized was that there were several bioinformaticists starting to join public health laboratories at the state level, and they didn't know one another, but they knew her, and they kept reaching out to her. And she suggested, and this had to be back in, wow, around January 2017, I think, on one of my calls to her: I'm having a number of other bioinformaticists ask me similar questions; you guys should meet up and chat about things.
And I think at the time it was Kelly Oakeson and Kevin Libuit and Sean Wang who were the other three asking her lots of questions. I think our official get-together was at the APHL meeting in 2017, because we had been chatting virtually a lot and then met in person in 2017. So no, Heather didn't say stop bugging us. I think she just recognized that there was stuff going on, that there are unique challenges at the state level, and that state public health bioinformaticists in StaPH-B should start chatting with one another. We finally got the story. I'm so satisfied.

We've touched a little bit on WDL as a language specification that's separate from its implementations, and the differences between that and, say, Nextflow and Snakemake. But I was wondering, what makes both of you really excited about WDL? Why do you use it, or why did you buy into it versus other workflow languages?

I can tell you why I ended up there. One thing I should preface: I do work at the Broad Institute, in the infectious disease program, and WDL and Terra, which you'll hear about later, also come out of the Broad Institute, but from completely different groups. I actually have no involvement in its origins or its ongoing development, and I had no particular loyalty to it going in. It turned out to be the language and the platform that we ended up adopting in our viral group for a number of reasons. For me personally, a lot of my priorities in our research group have to do with making sure that the computational experience we have in our group is equivalent to that of our collaborators and the folks we work with, so that everyone else can do the same things we can do.
I actually wasn't necessarily looking to use something out of the Broad Institute. But, you know, the reason we ended up there: I remember, five, six, seven years ago, I was trying to figure out how to formalize a lot of our pipelines in a way that would be more portable. These were the early days; many of these languages existed at the time, but in an earlier, more primordial form, perhaps. So I wasn't terribly committed to anything back in, say, 2015; you could kind of tell, oh, there's going to be something good in a couple of years. I actually feel like at this point there are multiple good options that back then were still in their early days. So I didn't commit too much, but a year or two after that, maybe 2016 or 2017, something tipped the balance for me. I had been partnering a lot with a commercial cloud vendor called DNAnexus in the US. They happen to be maybe one of the oldest commercial genomic cloud compute vendors; they're more used for, you know, big cancer projects and prenatal screening and that type of stuff. We had been working with them since, say, 2014, to expand compute access to our partners in West Africa, and a lot of our collaborators were using their platform. At that time we were wrapping our pipelines in their kind of proprietary, in-house pipeline language, which wasn't easy for us to do; we needed a lot of their hands-on help, because it's their own kind of thing. But after WDL had been around for a few years, we saw DNAnexus adopting it as their interoperability language. They started writing parsers that would take pipelines in WDL and turn them into something that could run on their platform, mostly by compiling them to their in-house language.
Now they have that for CWL and Nextflow as well. But at the time, when they came out with it for WDL, I think that's when I realized, oh, this is actually more than just a Broad thing. All of this seems to have legs in a larger, global way, and now it actually looks portable enough to legitimately run in more than one place. So that's around the time I started porting a lot of our viral pipelines to that language, because I realized this was how we would deploy. It became the primary way we deployed our pipelines, on DNAnexus at first, Terra later, but also just for running stuff on the command line whenever we need to get quick stuff done in-house.

So I just had a look there, and I see that you've got parsers implemented in Scala, Java, and Python, which is quite a variety of parsers for a language. That's kind of cool, and I think it shows the success of it. It's not just a one-hit wonder.

I'm not involved in that, but, you know, a lot of the original parsers: the Broad team that works on Terra and its underlying engine, Cromwell, a lot of that stuff is written in Scala; a lot of the CZI stuff, miniwdl and so forth, is written in Python, I believe. And once that whole ecosystem came into place, for our viral stuff, in our CI builds on GitHub, we just ran everybody's validators and automated tests, just to make sure that, oh, it'll run on the Scala implementation and it'll run on the Python implementation.
And, you know, every once in a while you discover some edge cases where something wasn't cross-portable, but actually it's surprising how much works out of the box. And that's the thing, I think, if we get into some of what makes it different, not necessarily better or worse, but just different: an informatician who works with WDL and works with Nextflow and works with Snakemake will realize there's kind of a spectrum of flexibility for the pipeline developer. Snakemake gives you a lot of power: you can write input rules in pure Python and do a lot of stuff right in the Snakefile, right? They'll find WDL actually quite limiting. It's a hard specification of a language; there's a very limited set of things you can do. Compared to a lot of these other things, you can spin up a Docker container, run anything that you can run inside that container, and plug files together. And you're not allowed to make too many assumptions about file paths and things like that, because that's all non-portable stuff. But within those limitations, it creates more portability. When the developer is forced to assume that the file isn't necessarily going to be accessible in some shared directory on your NFS mount, that means the implementation can abstract away a lot of that backend, and you can put it on a cloud bucket, or you can put it on your local laptop, and it'll all run the same. So in a way, for the developer it's going to feel more limiting, but it makes it easier to run in many places.

What about you, Joel? I think we've touched on this a little, but is there anything you want to add to Danny's response?

For us, the choice of WDL had more to do with how it could be applied, and that gets to the Terra story.
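The file-plumbing constraint described above can be sketched in WDL itself. This two-step example is hypothetical (the task names, tools, and image tags are illustrative, not from the episode), but it shows the point: tasks exchange File handles rather than shared paths, so the engine decides where data physically lives between steps.

```wdl
version 1.0

# Hypothetical step 1: read trimming in its own container.
task trim {
  input { File reads }
  command <<<
    fastp -i ~{reads} -o trimmed.fastq.gz
  >>>
  output { File trimmed = "trimmed.fastq.gz" }
  runtime { docker: "staphb/fastp:0.23.2" }  # assumed image tag
}

# Hypothetical step 2: assembly, in a different container.
task assemble {
  input { File reads }
  command <<<
    spades.py -s ~{reads} -o asm
  >>>
  output { File contigs = "asm/contigs.fasta" }
  runtime { docker: "staphb/spades:3.15.5" }  # assumed image tag
}

workflow trim_and_assemble {
  input { File raw_reads }
  call trim { input: reads = raw_reads }
  # trim.trimmed is a handle, not a path: an engine may localize it
  # from a temp dir on a laptop or from a cloud bucket on Terra; the
  # task itself never assumes where it lives.
  call assemble { input: reads = trim.trimmed }
  output { File contigs = assemble.contigs }
}
```

Because neither task names an absolute path or a shared mount, the same file can run unchanged under different engines and backends.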
Even though for us it is WDL, if it had been any workflow description language, our choice was essentially driven by the environment that we could run it in to assist public health. And quite often the path you go down is determined by who you're collaborating with. When Theiagen Genomics started, one of our first partners was the Massachusetts Department of Public Health, for their regional bioinformatics resource, where Danny and the Broad also had workforce development work to do. We started interacting a lot, and that was the same time containers were blossoming and people were just starting to get to workflow description languages. Even though WDL and Terra and all of that had been around for a while, sometimes you have to be ready to hear things; there's a timing and a rightness to things happening. And that happened, probably from the summer of 2019 into the summer of 2020, in interacting with the Broad Institute, interacting with Danny, learning more about the way they were performing their bioinformatics, and then recognizing how that could be applied to public health.

That's all the time we have today. I'd like to thank our guests, Joel and Danny, and we'll see you next time. Will you be able to join us in a couple of weeks? Absolutely. Sure thing. Really appreciate it. See you then.