Hello and thank you for listening to the MicroBinfie podcast. Here, we discuss topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is the head of informatics at the Quadram Institute in Norwich, UK. Andrew is the director of technical innovation for Theiagen in Cambridge, UK. I am Dr. Lee Katz, and I am a senior bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.

Hi! Today, we have the privilege of introducing Wytamma Wirth, the brilliant author behind the WriteThe GPT package. With a unique blend of creativity and technical expertise, Wytamma has crafted a powerful tool that empowers bioinformaticians and unlocks the potential of language generation. Wytamma focused on histopathology in a past life, but now works at the Doherty Institute on microbial pathogens. Our hosts today are Nabil and myself; if Andrew gets out of his meeting, he can join us soon. Join us as we delve into the mind of this extraordinary innovator.

Just to get us started, I know that you had a totally different career in a past life: histopathology. Can you describe it quickly? And then, how did you transition into this totally different field?

Yeah, hi. Thanks so much for having me. I really appreciate it, and I'm happy to be here. I did my PhD in histopathology, on a viral pathogen in freshwater turtles in North Queensland, at James Cook University. I spent a lot of time in the lab and a lot of time in the field, and with the other third of my life I procrastinated and did a bunch of web development. Through that, I started getting some ideas around programming, and it really interested me. But you can't really change your mind when you're a couple of years into a PhD; it's hard to say, "Oh, I actually want to do something completely different." So I stuck through it and finished the PhD. Then a friend of mine introduced me to Sebastian Duchene, a researcher at the Doherty Institute who works in phylodynamics, and he gave me a job there. I started doing a lot of tool development for them: I built a bunch of different packages and did a bunch of different analyses in phylodynamics. About a year and a half after that, I joined the Microbiological Diagnostic Unit as a bioinformatician, or kind of an academic specialist. So now I work with Torsten, and I do a lot of COVID work, but also general pipeline development, analyses, and tool development.

So the way I understand it, at the Doherty you're basically on the same floor with Torsten, but there's kind of a split or something. Are you on his side of the floor or the other side?

Yeah, I'm on Torsten's side of the floor. I think there are nine levels or something like that. The Doherty is mainly a bunch of immunologists, so the microbiologists stay on our floor, and we do a lot of service work out of that.
But yeah, I'm on Torsten's side of things, which is more the service side at the moment, but with plans to transition back to more research-y stuff pretty soon.

That's awesome. So we wanted to talk to you a little bit about this package that, for us, just came out of nowhere. Basically, LLMs surprised us at the beginning of this year; that was a major surprise. And then on top of that, you surprised us with this package called WriteThe, and that's W-R-I-T-E-T-H-E. What does it do, if you can summarize it?

Yeah. The name is kind of a pun on Read the Docs, and documentation was the initial application I saw for these LLMs. I was spending a lot of time copying code into ChatGPT and getting it to write the right documentation or tests for that code. Then, I think I was at the pub, talking to someone about how this is a good, solid use case where you can constrain these models. So I developed this tool, WriteThe, which is basically a command-line interface that lets you pass files to a large language model so that it can write documentation, write tests, or convert the files to different formats. It standardizes that process, to reduce the amount of copying and pasting you do. You can say "write the docs" and give it a directory for a project, and it will write docstrings for all of the functions and classes in that project. It then uses the MkDocs documentation generator to build an API reference and a documentation website that you can host on GitHub. So with two commands you can very quickly bootstrap the documentation for a project, and you can also use it to scaffold out a lot of the testing, writing tests for a project as well.

And it's probably still a bit confusing to everybody how you get into this. What's the engine behind it? Is it ChatGPT, or what is it?

Yeah, it's ChatGPT; it uses the OpenAI API. Basically, it has an f-string-like template, a pre-written prompt that says: I want you to write documentation for this code; here's a short snippet showing an example of how to do that; and here's the response structure that I want you to return. Then it adds your code to that template, so everything is formatted into this consistent prompt, and when you send it to GPT you get the expected result back in a consistent format each time. A lot of other applications doing this kind of thing use JSON, but I actually decided to use YAML, because I thought curly braces were hard for LLMs for some reason. So it tries to get the model to format its response as YAML, then extracts that response, parses it with a YAML parser, and uses that to add the documentation to the code.

Wait, so if I look through your code, I'm going to find a print statement with an actual prompt in it?

Yeah. If you go to the write_the directory... sorry, yeah, it's in commands. Each command has its own prompt associated with it. So if you go to docs, yeah.
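For readers following along at home, here is a minimal sketch of the kind of template being described: clear instructions, one worked example, and a fixed YAML response format. The wording, names, and YAML layout below are illustrative only, not WriteThe's actual prompt.

```python
# A sketch of a pre-written f-string-style prompt template (assumed,
# not WriteThe's actual prompt): instructions, a few-shot example, and
# a fixed YAML response structure, with the user's code appended.

EXAMPLE_CODE = "def add(a, b):\n    return a + b"

EXAMPLE_RESPONSE = """add:
  description: Add two numbers together.
  args:
    a: The first number to add.
    b: The second number to add.
  returns: The sum of a and b.
  example: add(1, 2)
"""

def build_docs_prompt(code: str) -> str:
    """Format source code into a consistent documentation prompt."""
    return (
        "Write docstrings for the functions in the code below.\n"
        "Respond in YAML using the structure shown in the example.\n\n"
        f"Code:\n{EXAMPLE_CODE}\n\nResponse:\n{EXAMPLE_RESPONSE}\n"
        f"Code:\n{code}\n\nResponse:\n"  # ends mid-pattern so the model completes it
    )
```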
And there's the prompt. We're sharing screens and looking through this; I just have to see this.

Yeah, so that's the prompt. Basically, what you do is this thing called prompt engineering, which is what you do with these large language models to try to get them to produce the results that you want. You give the model clear, detailed instructions to start with, and I use two techniques here. The first is called few-shot learning: I give it an example, an undocumented add function, and I say I want you to return it in this YAML format, with a description of the function, descriptions of the arguments, the return values, and an example of how to use it. The second is essentially code completion: you condition the model so that the next token it is going to produce is highly likely. You write, "here's the code, and here's the formatted docstring for this code," insert that into the prompt, and the model essentially continues producing results in that pattern.

I love that.

So yeah, it's all based around prompt engineering, which is this term people are coming up with for trying to make the models produce the results that you want. And this kind of task works because it's a consistent output: docstrings have a very consistent format. Even though a lot of the content changes, it's still structured, so you can predict what the output will be, which lets you parse it, validate it, and insert it back into the code. All the documentation generation and test generation is barely any part of the project, really. The project is handling the responses from the LLM: validating the response, extracting docstrings, and inserting docstrings. The LLM does the generation, but it's the tooling around it that makes sure this happens consistently every time you run it, so that you can use it in an automated fashion, versus having to go to ChatGPT, copy something in, and give it the right context every time. This just templates that for you, in a very structured way.

How do you verify that the docstring it's generated is actually valid?

The model returns the docstring in YAML format, so basically I just extract that docstring and check whether it is valid YAML. If it's valid YAML, then I'll insert it. There is some variation; for example, some functions don't have return types, so the return type might not be in there. You could do more validation on top of that, saying this is how I expect the YAML to be structured: each argument should have a description and a type, something like that. You could do that in more detail, but basically I just trust that ChatGPT, the OpenAI models, are very good at producing these responses consistently with this prompt. Yeah.
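To make that validation step concrete, here is a rough sketch of what "check whether it is valid YAML" could look like in practice. This is an assumed illustration, not WriteThe's actual implementation; the function name is hypothetical.

```python
# Accept the model's reply only if it parses as YAML; reject anything
# else rather than inserting a malformed docstring into the code.
import yaml  # PyYAML

def extract_docstrings(llm_response: str) -> dict:
    """Parse a YAML reply into {function_name: docstring_fields}."""
    try:
        parsed = yaml.safe_load(llm_response)
    except yaml.YAMLError:
        return {}  # not valid YAML: discard the response
    return parsed if isinstance(parsed, dict) else {}
```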
I mean, what's, what's, I mean, we're looking at the example, the template where it's sort of saying, add two numbers together. That's your, your part of your prompt that he's saying, well, you know, lay it out this way and here's a sample, add two numbers together. I'm curious, what's some of the more complicated methods that you've, or functions you've fed to it to generate doc strings? And is it able to read the data that you've fed to it? Or is it able to read some, some of the spaghetti, some of the spaghetti code someone might write? Yeah. If you, so if, if you go back to the main page, there's an example. So basically I use write the docs to write the docs for the write the docs command. So it's like gets very metal, but there's a link to it. If you scroll down on the first page there. And so it's quite a complicated function. That first one, right. So on the GitHub, there's some, and we'll put the link to the GitHub in the show notes. There are real world examples and we're talking about the first example, which is like write the docs to write the docs for the write the docs command. Yeah. Yeah. So that's sort of like, do you want to click into that Lee? Yeah. So it becomes distracting myself. No worries. Yeah. So it becomes quite meta. So for example, in the examples of that, right, the doc string, it has to call itself and return a formatted doc string. So it, and it, it, and chat GBT is able to recognize that and create the correct example doc string to sort of say, you know, given this example function, here's the sort of example result I will return. So yeah, it becomes quite meta because it's writing the doc strings about itself, which is, and providing examples about itself. But yeah, so like it, do you want to click into that link? Leave the first link on real world examples. Right here. Yep. Yeah. Sorry. Go ahead. So that doc string there with the arguments sort of describes, so yeah, describes all the sort of different values and what effects they have. It adds notes in to say like, if particular arguments are set, what the results of that will be, and then provides an example and the returns return type. And then, yeah. So in the example, it, yeah, it sort of says like if you call this function on a, on some Python code that has an ad function in it, it will return, provide an example result of that. And so, yeah, the whole sort of library is documented that way, like using itself. And then I've gone through and like in that list of examples, it's just a bunch of random repos that I've seen that don't have documentation. So I just run this one, write the docs command on it. And it seems, and then I go through and read it and it's like, Oh yeah, this will kind of make sense. Like, and so I've made a bunch of PRS to open source projects to add documentation to them, which has just taken me, you know, a minute or so to generate documentation across because it, it sends all the requests to the open AI in parallel. So you can document the source code essentially, you know, one function at the time it takes to run one to document one function. And so, yeah, it takes no time to do that. And maybe like a couple of cents. So the, the, the open AI API is does cost money to run that. So you do need an API key to use this tool, but yeah, it's, it's like, if you're just generating this documentation for like a project, you're really doing that once off or something like that. So really it's a few cents and it saves you a lot of time. Amazing. Yeah. 
If you go to the website — you might have to go to it from the main page on the repo...

Can I ask which model it uses by default?

It just uses GPT-3.5, but there's a flag to use GPT-4.

And have you tried the 16K version yet?

No, I haven't. I was looking at that, and as far as I can tell, it's just available for the chat API. OpenAI has a couple of different API endpoints, and I'm using the completion API for this project; I don't think that model is available for completion. But it would be really good, because it increases the token limit, so you can fit more stuff in and give the model better context. If the code base or the file you're running it on is too big, WriteThe will split it up into chunks to fit everything into the context so that it can still get responses. It tries to be a little bit smart about what it sends to the endpoint, to reduce the number of tokens that you use, and there's a bunch of configuration around that.

So it might not be able to handle someone's monolithic thousand-line Python script?

It will do it, but it will break it up. You might have to refactor the script to have more functions, but essentially it can send one function at a time, because the docstrings belong on the functions. If it's a massive script that has a bunch of functions in it, it can break them up and send them one at a time. But if it's just a bunch of unstructured code, then there are other issues at that point, I guess.

No, no, definitely. And that's cheap and it saves you time. Yeah, exactly.

That's good. That's a good segue to the next question that we actually had ready to go: how do you keep up with that rapid progress? I don't even know what the 16K is, and I only kind of know what GPT-4 is, but there are all these innovations. How do you keep going with all that?

The GPT channel on the Slack is helpful, mainly. What I've been finding useful is Twitter and GitHub — following the right people on GitHub to see what projects they're starring — and then a lot of YouTube. I've seen a bunch of people start YouTube channels all based around large language models, particularly the open-source large language models. It's an incredibly fast-moving space, and you feel left behind constantly. You're trying to figure out, is it worth investing time in learning this framework if it's just going to be gone, or playing with this application, because a bunch of stuff is appearing and then dying off or getting replaced. It's all still very early days.

Have you listened to the This Day in AI podcast, which is also from some Australians, two Australian brothers?

I've seen it, but I haven't actually listened to it. Have you?

Yeah, I listen to it every week. That's how I keep up, because I just can't spend all that time reading Slack and Twitter and whatnot; there's too much.

Yeah. I think using ChatGPT to process documentation and summarize things has been really useful as well.
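Returning briefly to the chunking behavior described above, here is a toy version of that strategy: split a module into one chunk per top-level function so each request fits the model's context window. This is an assumed sketch, not WriteThe's actual code.

```python
import ast

def split_into_functions(source: str) -> list[str]:
    """Return the source text of each top-level function in a module."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
```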
So like there's like a lot of complicated documentation or terminology or like new things. So like sort of using it to help understand like things in the field has been really useful. Yeah. I mean, one of the things I wanted to ask is how, so now that, that you've been using it and using it in a, in a real world setting, like this is a real difficult task of writing documentation and writing tests and so on. I'm curious, what's your feeling on at the moment, what are the best use cases and what are some of the limitations of these technologies? Yeah, it's a really good question. I think there's a, I feel like there's a lot of hype around things. Like there are a lot of people who will say like, you know, you can 10 X everything you do. And like, this will change your life. And, and I haven't really seen that happen yet, but I think there's obviously fantastic tools and they're revolutionizing a lot of things. And it really sort of depends at like where we are on the curve of like the the quote unquote intelligence of like these models. Like if it's just ramping up, you know, in a couple of years, none of this will really matter. But in the, if it's, if it's like around some sort of plateau or it's like, okay, this is going to take, it's not increasing that much. Then I think, yeah, it's a lot of these tools. Yeah. So I think, yeah. Does that make sense? No, definitely. But in terms of what your use cases, so what are some of the use cases we've, we've discussed writing documentation. What are some others? are available, I mean, you've got the convert function as well in this, what does that do? Yeah. So that's a tool that Robert Petit, like create a PR to add to Writer. And so basically it's a sub command in Writer that changes, converts any arbitrary file into another file. So just based, so the way it is, the way it works is you say, write the convert and then you give an input file. So it might be, you know, script.py. And then you have an output file and it says script.ts. And then it will convert that file from Python into TypeScript. And so basically it has a prompt, like a well-formatted prompt similar to the docs. And then it extracts the file types using the extensions of the files to sort of infer the file type. You can use a, you can use a flag to specify the format if you don't want to use the extension. But yeah, basically you can really easily convert between different file types. And so, you know, you could use it to convert between English and French or whatever. But yeah, so I've mainly been using that to convert between different programming languages. And we ended up using it to write a package called phylo.js, which is a TypeScript phylogenetic tree manipulation library. So there aren't really many or any good libraries for doing tree manipulation, like phylogenetic tree manipulation in TypeScript. That's like a standalone NPM installable package. Everything seems to just be tied to tree visualization. So you can't get tree manipulation without visualization. So it's a very large dependency. So we talked to Tim Vaughn, who wrote ICtree, which is like a tree visualization library. And he's, he has open sourced MIT licensed that library and said that, yeah, we could take the, the tree manipulation parts of that. And then I used write the convert to convert that JavaScript into sort of modern TypeScript. And then we published that package and, you know, it took us a day or something like that to produce like a super nice battle tested. 
Yeah. I'm really curious if it can convert some old-timer's stodgy Perl code into sleek Python. I'm kidding. Solid.

Okay, can I ask a bit of a side question? Why are you working in academia and not in industry, earning a million dollars a year? And would you be interested in leaving academia to earn lots of money?

To be honest, if I wasn't getting paid to do this, I think I would still do it. So it's not a money thing, really. I really love working in academia. I like hanging out with interesting people, I really like hanging out at universities, and I like talking jargon with people; it's one of my favorite things to have coded messages and very technical conversations with people. I think that's kind of hard to find outside of academia, and universities are super nice places just to be around. So I really enjoy what I'm doing, and I don't have any plans of moving to the private sector yet. We'll see how funding and stuff goes.

Until you need to buy a house.

Yeah, that's true. The mortgage dictates a lot of things.

Was that your pitch, Andrew, to recruit him?

No, it's just that sometimes the right person is doing the right thing at the right time. I look back at some opportunities I passed up, and I can see what you're doing; maybe you are the right person at the right time to do this kind of AI stuff. You're doing very deep work, whereas other people are waving their hands around, wishing they could do it, and maybe you could earn a boatload of money doing this in industry. I'm not asking you to change anything; I'm just saying it's a very useful skill that you have, and you're going to be underpaid in academia.

Yeah. Like I said, I would do it for free. During my PhD we got, whatever, $10,000 a year or something like that, and I managed to survive and had a fairly good time, to be honest; it wasn't a horrible experience for me. I really just like hanging out in academia.

And that's all we have time for this week, folks. Thank you very much to Wytamma for telling us all about WriteThe, and we'll be back shortly for the next episode, where we continue this conversation. Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.