Hello and thank you for listening to the MicroBinfie podcast. Here, we discuss topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is the head of informatics at the Quadram Institute in Norwich, UK. Andrew is the director of technical innovation for Theiagen in Cambridge, UK. I am Dr. Lee Katz, and I am a senior bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.

Hi! Today, we have the privilege of introducing Wytamma Wirth, the brilliant author behind the WriteThe GPT package. With a unique blend of creativity and technical expertise, Wytamma has crafted a powerful tool that empowers bioinformaticians and unlocks the potential of language generation. Wytamma focused on histopathology in a past life, but now works at the Doherty Institute on microbial pathogens. Our hosts today are Nabil and myself; if Andrew gets out of his meeting, he can join us soon. Join us as we delve into the mind of this extraordinary innovator.

Just to get us started, I know that you had a totally different career in a past life: histopathology. Can you describe it quickly? And then, how did you transition into this totally different field?

Yeah, hi. Thanks so much for having me. I really appreciate it, and I'm happy to be here. I did my PhD in histopathology, on a viral pathogen in freshwater turtles in North Queensland, at James Cook University. I spent a lot of time in the lab and a lot of time in the field, and with the other third of my life I procrastinated and did a bunch of web development. Through that, I started getting some ideas around programming, and it really interested me. But you can't really change your mind when you're a couple of years into a PhD; it's hard to say, "Oh, I actually want to do something completely different." So I stuck through it and finished the PhD. Then a friend of mine introduced me to Sebastian Duchene, a researcher at the Doherty Institute who works in phylodynamics, and he gave me a job there. I started doing a lot of tool development for them: I built a bunch of different packages and did a bunch of different analyses in phylodynamics. About a year and a half after that, I joined the Microbiological Diagnostic Unit as a bioinformatician, or kind of an academic specialist. So now I work with Torsten, and I do a lot of COVID work, but also general pipeline development, analyses, and tool development.

So the way I understand it, at the Doherty you're basically on the same floor with Torsten, but there's kind of a split or something. Are you on his side of the floor or the other side?

Yeah, I'm on Torsten's side of the floor. I think there are nine levels or something like that. The Doherty is mainly a bunch of immunologists, so the microbiologists stay on our floor, and we do a lot of service work out of that.
But yeah, I'm on Torsten's side of things, which is more the service side at the moment, but with plans to transition back to more research-y stuff pretty soon.

That's awesome. So we wanted to talk to you a little bit about this package that, for us, just came out of nowhere. Basically, LLMs surprised us at the beginning of this year; that was a major surprise. And then on top of that, you surprised us with this package called WriteThe, and that's W-R-I-T-E-T-H-E. What does it do, if you can summarize it?

Yeah. The name is kind of a pun on Read the Docs, and documentation was the initial application I saw for these LLMs. I was spending a lot of time copying code into ChatGPT and getting it to write the right documentation or tests for that code. Then, I think I was at the pub, talking to someone about how this is a good, solid use case where you can constrain these models. So I developed this tool, WriteThe, which is basically a command-line interface that lets you pass files to a large language model so that it can write documentation, write tests, or convert the files to different formats. It standardizes that process, to reduce the amount of copying and pasting you do. You can say "write the docs" and give it a directory for a project, and it will write docstrings for all of the functions and classes in that project. It then uses the MkDocs documentation generator to build an API reference and a documentation website that you can host on GitHub. So with two commands you can very quickly bootstrap the documentation for a project, and you can also use it to scaffold out a lot of the testing, writing tests for a project as well.

And it's probably still a bit confusing to everybody how you get into this. What's the engine behind it? Is it ChatGPT, or what is it?

Yeah, it's ChatGPT; it uses the OpenAI API. Basically, it has an f-string-like template, a pre-written prompt that says: I want you to write documentation for this code; here's a short snippet showing an example of how to do that; and here's the response structure that I want you to return. Then it adds your code to that template, so everything is formatted into this consistent prompt, and when you send it to GPT you get the expected result back in a consistent format each time. A lot of other applications doing this kind of thing use JSON, but I actually decided to use YAML, because I thought curly braces were hard for LLMs for some reason. So it tries to get the model to format its response as YAML, then extracts that response, parses it with a YAML parser, and uses that to add the documentation to the code.

Wait, so if I look through your code, I'm going to find a print statement with an actual prompt in it?

Yeah. If you go to the write_the directory... sorry, yeah, it's in commands. Each command has its own prompt associated with it. So if you go to docs, yeah.
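For readers following along at home, here is a minimal sketch of the kind of template being described: clear instructions, one worked example, and a fixed YAML response format. The wording, names, and YAML layout below are illustrative only, not WriteThe's actual prompt.

```python
# A sketch of a pre-written f-string-style prompt template (assumed,
# not WriteThe's actual prompt): instructions, a few-shot example, and
# a fixed YAML response structure, with the user's code appended.

EXAMPLE_CODE = "def add(a, b):\n    return a + b"

EXAMPLE_RESPONSE = """add:
  description: Add two numbers together.
  args:
    a: The first number to add.
    b: The second number to add.
  returns: The sum of a and b.
  example: add(1, 2)
"""

def build_docs_prompt(code: str) -> str:
    """Format source code into a consistent documentation prompt."""
    return (
        "Write docstrings for the functions in the code below.\n"
        "Respond in YAML using the structure shown in the example.\n\n"
        f"Code:\n{EXAMPLE_CODE}\n\nResponse:\n{EXAMPLE_RESPONSE}\n"
        f"Code:\n{code}\n\nResponse:\n"  # ends mid-pattern so the model completes it
    )
```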
And there's the prompt. We're sharing screens and looking through this; I just have to see this.

Yeah, so that's the prompt. Basically, what you do is this thing called prompt engineering, which is what you do with these large language models to try to get them to produce the results that you want. You give the model clear, detailed instructions to start with, and I use two techniques here. The first is called few-shot learning: I give it an example, an undocumented add function, and I say I want you to return it in this YAML format, with a description of the function, descriptions of the arguments, the return values, and an example of how to use it. The second is essentially code completion: you condition the model so that the next token it is going to produce is highly likely. You write, "here's the code, and here's the formatted docstring for this code," insert that into the prompt, and the model essentially continues producing results in that pattern.

I love that.

So yeah, it's all based around prompt engineering, which is this term people are coming up with for trying to make the models produce the results that you want. And this kind of task works because it's a consistent output: docstrings have a very consistent format. Even though a lot of the content changes, it's still structured, so you can predict what the output will be, which lets you parse it, validate it, and insert it back into the code. All the documentation generation and test generation is barely any part of the project, really. The project is handling the responses from the LLM: validating the response, extracting docstrings, and inserting docstrings. The LLM does the generation, but it's the tooling around it that makes sure this happens consistently every time you run it, so that you can use it in an automated fashion, versus having to go to ChatGPT, copy something in, and give it the right context every time. This just templates that for you, in a very structured way.

How do you verify that the docstring it's generated is actually valid?

The model returns the docstring in YAML format, so basically I just extract that docstring and check whether it is valid YAML. If it's valid YAML, then I'll insert it. There is some variation; for example, some functions don't have return types, so the return type might not be in there. You could do more validation on top of that, saying this is how I expect the YAML to be structured: each argument should have a description and a type, something like that. You could do that in more detail, but basically I just trust that ChatGPT, the OpenAI models, are very good at producing these responses consistently with this prompt. Yeah.
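To make that validation step concrete, here is a rough sketch of what "check whether it is valid YAML" could look like in practice. This is an assumed illustration, not WriteThe's actual implementation; the function name is hypothetical.

```python
# Accept the model's reply only if it parses as YAML; reject anything
# else rather than inserting a malformed docstring into the code.
import yaml  # PyYAML

def extract_docstrings(llm_response: str) -> dict:
    """Parse a YAML reply into {function_name: docstring_fields}."""
    try:
        parsed = yaml.safe_load(llm_response)
    except yaml.YAMLError:
        return {}  # not valid YAML: discard the response
    return parsed if isinstance(parsed, dict) else {}
```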
I mean, what's, what's, I mean, we're looking at the example, the template where it's sort of saying, add two numbers together. That's your, your part of your prompt that he's saying, well, you know, lay it out this way and here's a sample, add two numbers together. I'm curious, what's some of the more complicated methods that you've, or functions you've fed to it to generate doc strings? And is it able to read the data that you've fed to it? Or is it able to read some, some of the spaghetti, some of the spaghetti code someone might write? Yeah. If you, so if, if you go back to the main page, there's an example. So basically I use write the docs to write the docs for the write the docs command. So it's like gets very metal, but there's a link to it. If you scroll down on the first page there. And so it's quite a complicated function. That first one, right. So on the GitHub, there's some, and we'll put the link to the GitHub in the show notes. There are real world examples and we're talking about the first example, which is like write the docs to write the docs for the write the docs command. Yeah. Yeah. So that's sort of like, do you want to click into that Lee? Yeah. So it becomes distracting myself. No worries. Yeah. So it becomes quite meta. So for example, in the examples of that, right, the doc string, it has to call itself and return a formatted doc string. So it, and it, it, and chat GBT is able to recognize that and create the correct example doc string to sort of say, you know, given this example function, here's the sort of example result I will return. So yeah, it becomes quite meta because it's writing the doc strings about itself, which is, and providing examples about itself. But yeah, so like it, do you want to click into that link? Leave the first link on real world examples. Right here. Yep. Yeah. Sorry. Go ahead. So that doc string there with the arguments sort of describes, so yeah, describes all the sort of different values and what effects they have. It adds notes in to say like, if particular arguments are set, what the results of that will be, and then provides an example and the returns return type. And then, yeah. So in the example, it, yeah, it sort of says like if you call this function on a, on some Python code that has an ad function in it, it will return, provide an example result of that. And so, yeah, the whole sort of library is documented that way, like using itself. And then I've gone through and like in that list of examples, it's just a bunch of random repos that I've seen that don't have documentation. So I just run this one, write the docs command on it. And it seems, and then I go through and read it and it's like, Oh yeah, this will kind of make sense. Like, and so I've made a bunch of PRS to open source projects to add documentation to them, which has just taken me, you know, a minute or so to generate documentation across because it, it sends all the requests to the open AI in parallel. So you can document the source code essentially, you know, one function at the time it takes to run one to document one function. And so, yeah, it takes no time to do that. And maybe like a couple of cents. So the, the, the open AI API is does cost money to run that. So you do need an API key to use this tool, but yeah, it's, it's like, if you're just generating this documentation for like a project, you're really doing that once off or something like that. So really it's a few cents and it saves you a lot of time. Amazing. Yeah. 
If you go to the website — you might have to go to it from the main page on the repo...

Can I ask which model it uses by default?

It just uses GPT-3.5, but there's a flag to use GPT-4.

And have you tried the 16K version yet?

No, I haven't. I was looking at that, and as far as I can tell, it's just available for the chat API. OpenAI has a couple of different API endpoints, and I'm using the completion API for this project; I don't think that model is available for completion. But it would be really good, because it increases the token limit, so you can fit more stuff in and give the model better context. If the code base or the file you're running it on is too big, WriteThe will split it up into chunks to fit everything into the context so that it can still get responses. It tries to be a little bit smart about what it sends to the endpoint, to reduce the number of tokens that you use, and there's a bunch of configuration around that.

So it might not be able to handle someone's monolithic thousand-line Python script?

It will do it, but it will break it up. You might have to refactor the script to have more functions, but essentially it can send one function at a time, because the docstrings belong on the functions. If it's a massive script that has a bunch of functions in it, it can break them up and send them one at a time. But if it's just a bunch of unstructured code, then there are other issues at that point, I guess.

No, no, definitely. And that's cheap and it saves you time. Yeah, exactly.

That's good. That's a good segue to the next question that we actually had ready to go: how do you keep up with that rapid progress? I don't even know what the 16K is, and I only kind of know what GPT-4 is, but there are all these innovations. How do you keep going with all that?

The GPT channel on the Slack is helpful, mainly. What I've been finding useful is Twitter and GitHub — following the right people on GitHub to see what projects they're starring — and then a lot of YouTube. I've seen a bunch of people start YouTube channels all based around large language models, particularly the open-source large language models. It's an incredibly fast-moving space, and you feel left behind constantly. You're trying to figure out, is it worth investing time in learning this framework if it's just going to be gone, or playing with this application, because a bunch of stuff is appearing and then dying off or getting replaced. It's all still very early days.

Have you listened to the This Day in AI podcast, which is also from some Australians, two Australian brothers?

I've seen it, but I haven't actually listened to it. Have you?

Yeah, I listen to it every week. That's how I keep up, because I just can't spend all that time reading Slack and Twitter and whatnot; there's too much.

Yeah. I think using ChatGPT to process documentation and summarize things has been really useful as well.
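Returning briefly to the chunking behavior described above, here is a toy version of that strategy: split a module into one chunk per top-level function so each request fits the model's context window. This is an assumed sketch, not WriteThe's actual code.

```python
import ast

def split_into_functions(source: str) -> list[str]:
    """Return the source text of each top-level function in a module."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
```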
So like there's like a lot of complicated documentation or terminology or like new things. So like sort of using it to help understand like things in the field has been really useful. Yeah. I mean, one of the things I wanted to ask is how, so now that, that you've been using it and using it in a, in a real world setting, like this is a real difficult task of writing documentation and writing tests and so on. I'm curious, what's your feeling on at the moment, what are the best use cases and what are some of the limitations of these technologies? Yeah, it's a really good question. I think there's a, I feel like there's a lot of hype around things. Like there are a lot of people who will say like, you know, you can 10 X everything you do. And like, this will change your life. And, and I haven't really seen that happen yet, but I think there's obviously fantastic tools and they're revolutionizing a lot of things. And it really sort of depends at like where we are on the curve of like the the quote unquote intelligence of like these models. Like if it's just ramping up, you know, in a couple of years, none of this will really matter. But in the, if it's, if it's like around some sort of plateau or it's like, okay, this is going to take, it's not increasing that much. Then I think, yeah, it's a lot of these tools. Yeah. So I think, yeah. Does that make sense? No, definitely. But in terms of what your use cases, so what are some of the use cases we've, we've discussed writing documentation. What are some others? are available, I mean, you've got the convert function as well in this, what does that do? Yeah. So that's a tool that Robert Petit, like create a PR to add to Writer. And so basically it's a sub command in Writer that changes, converts any arbitrary file into another file. So just based, so the way it is, the way it works is you say, write the convert and then you give an input file. So it might be, you know, script.py. And then you have an output file and it says script.ts. And then it will convert that file from Python into TypeScript. And so basically it has a prompt, like a well-formatted prompt similar to the docs. And then it extracts the file types using the extensions of the files to sort of infer the file type. You can use a, you can use a flag to specify the format if you don't want to use the extension. But yeah, basically you can really easily convert between different file types. And so, you know, you could use it to convert between English and French or whatever. But yeah, so I've mainly been using that to convert between different programming languages. And we ended up using it to write a package called phylo.js, which is a TypeScript phylogenetic tree manipulation library. So there aren't really many or any good libraries for doing tree manipulation, like phylogenetic tree manipulation in TypeScript. That's like a standalone NPM installable package. Everything seems to just be tied to tree visualization. So you can't get tree manipulation without visualization. So it's a very large dependency. So we talked to Tim Vaughn, who wrote ICtree, which is like a tree visualization library. And he's, he has open sourced MIT licensed that library and said that, yeah, we could take the, the tree manipulation parts of that. And then I used write the convert to convert that JavaScript into sort of modern TypeScript. And then we published that package and, you know, it took us a day or something like that to produce like a super nice battle tested. 
Yeah. I'm really curious if it can convert some old-timer's stodgy Perl code into sleek Python. I'm kidding. Solid.

Okay, can I ask a bit of a side question? Why are you working in academia and not in industry, earning a million dollars a year? And would you be interested in leaving academia to earn lots of money?

To be honest, if I wasn't getting paid to do this, I think I would still do it. So it's not a money thing, really. I really love working in academia. I like hanging out with interesting people, I really like hanging out at universities, and I like talking jargon with people; it's one of my favorite things to have coded messages and very technical conversations with people. I think that's kind of hard to find outside of academia, and universities are super nice places just to be around. So I really enjoy what I'm doing, and I don't have any plans of moving to the private sector yet. We'll see how funding and stuff goes.

Until you need to buy a house.

Yeah, that's true. The mortgage dictates a lot of things.

Was that your pitch, Andrew, to recruit him?

No, it's just that sometimes the right person is doing the right thing at the right time. I look back at some opportunities I passed up, and I can see what you're doing; maybe you are the right person at the right time to do this kind of AI stuff. You're doing very deep work, whereas other people are waving their hands around, wishing they could do it, and maybe you could earn a boatload of money doing this in industry. I'm not asking you to change anything; I'm just saying it's a very useful skill that you have, and you're going to be underpaid in academia.

Yeah. Like I said, I would do it for free. During my PhD we got, whatever, $10,000 a year or something like that, and I managed to survive and had a fairly good time, to be honest; it wasn't a horrible experience for me. I really just like hanging out in academia.

And that's all we have time for this week, folks. Thank you very much to Wytamma for telling us all about WriteThe, and we'll be back shortly for the next episode, where we continue this conversation. Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.