Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is the head of informatics at the Quadram Institute in Norwich, UK. And Andrew is the director of technical innovation for Theiagen in Cambridge, UK. I am Dr. Lee Katz, and I'm a senior bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.

We're back again this week with Wytamma Wirth from the University of Melbourne to continue our conversation on his new software, write-the. So we'll just get right into it. What else is planned, if you care to share? What other use cases do you think are possible for this or similar tools?

Yeah, so there's a bunch of really useful tools and libraries developing around LLM package and tool development. One of them is using vector stores and vector databases to help expand the memory limitations of these LLMs. Basically, like I was saying before, if you can't fit a massive monolithic Python script into the context, you have to truncate it, which means the LLM really doesn't know about the earlier parts. But what you can do is use these vector databases or vector stores to record the specific parts of the code base that are useful, and add them to the context when they're required. So you might be able to do this summarization across a project to be able to do things like refactor it. You can sort of say, okay, this is the project.
These are all of the function definitions from the project, or this is a summarization of what's happening in this area. And when it needs to find something with a similar sort of relation, it can use that vector data store to look up the related content and add it into context. So that's something I'm looking at with write-the at the moment: being able to do code-base-wide operations, things like refactoring and optimizations across the whole code base.

And you could teach it to create a UML diagram. That would be great, because I hated doing those.

What is a UML?

Oh, UML. Do they still do that anymore? It's the Unified Modeling Language. It's these, what would you call them, Andrew?

It's like a flowchart. If you want to design very complex technical systems, you have to draw them out, and it's just a standard way of drawing them and saying this is where information goes.

Yeah. So all the attributes or variables and the methods, and you map it out: these are in this class, these are in that class, this interface relates to this, and all of that stuff. And they're a pain in the butt to make.

Right. Yeah, I'm sure that's something that would be possible to do. It actually reminds me: the main library I'm using to interface with the language model API is LangChain, which is an application-layer library for building prompts and creating agents that do specific things using LLMs. And there's another library built on top of that called, I think it's called Langflow, which essentially lets you use flow diagrams to create these large language model applications. So you can link different components together. So you might have an agent that does a specific thing.
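The vector-store lookup described above — embed chunks of the code base, then pull back the most similar ones when building a prompt — can be sketched in plain Python. This is an illustrative toy (a bag-of-words count stands in for real model embeddings, and the code chunks are invented), not write-the's implementation:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. A real system would use a
    # dense vector from an embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The "vector store": each code chunk is kept alongside its embedding.
chunks = [
    "def load_fasta(path): parse a fasta file into records",
    "def align_reads(reads, reference): map reads to the reference genome",
    "def write_report(results): render an html summary report",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, k: int = 1) -> list:
    # Rank stored chunks by similarity to the query; the top hits are what
    # you would splice into the LLM's context window.
    q = embed(query)
    ranked = sorted(store, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(retrieve("parse fasta records"))
```

The key point is that only the retrieved chunk, not the whole code base, has to fit in the model's context.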
It has access to particular tools, and you can build out this flow-diagram graph network that shows all the relationships between the different packages and tools that the large language model can use.

I've started to see a bit of that with ChatGPT on GPT-4 and the plugins. A simple use case I've seen is you ask it to pick up some information from the internet, or some specialized piece of knowledge, for an agent, and it formats it back into text, and the regular model just does some sort of summary for you. A common example is asking it, what are the best restaurants around me? And it goes to a website, pulls down all of the restaurants that are around you, the booking links and the opening hours, and then formats and summarizes them for you.

Yeah. So LangChain and other libraries like Haystack provide an interface that essentially lets you define new tools that the language model can interact with. Like, I made one the other day that could write and execute Perl scripts. I gave that to an agent, and the agent was then able to create some code and execute that code, which is not very safe; you shouldn't let it execute arbitrary code. But you can define any sort of tool that you want using these interfaces and libraries, and they provide a nice structured way of doing that.

Are you trying to implement Torsten as an AI?

Yeah, exactly. It's only a matter of time. But I think for small tasks and stuff like that, there could be some sort of potential like that. Part of the problem with a lot of these autonomous agents, like AutoGPT and BabyAGI, is that they're just way too broad. Their goal is to be an artificial general intelligence, which is like a super intelligence.
And the technology just isn't there yet. But if you can constrain them to doing specific tasks, and that's kind of what I was saying about write-the, it's very constrained. Even if you're using one of these autonomous agents, if you can constrain it to a very specific bioinformatics task, maybe you can get better results out of it that way, where it doesn't get lost along the way as to what it's actually doing. I've had some luck doing stuff like that. But I think it's still a little while off before Torsten gets replaced.

I wouldn't say replaced, maybe augmented.

Augmented, yeah, digitalized.

So going back to your earlier question, Nabil: for a UML, does it output Mermaid-style stuff? I've been really getting into Mermaid myself. You can diagram out in plain text what your flowchart is going to be, and then you can visualize it on the mermaid.live site. Is that something you're looking at?

I haven't looked at it, but I imagine it's something that could be done. You basically need some description of the project that you can then start doing these refactoring operations on, so maybe that's a good way to represent the project.

I mean, traditionally you go to the end of the project and you make the effort of producing the UML diagram so that you, as the programmer, can figure out, based on the relationships, things like: oh, there's this function call that keeps happening over here, and it doesn't make sense that it's in this class, it should be somewhere else. That sort of organization. Or you realize that you've basically got the same function over and over again in different parts of the code.
So that should actually be broken down into a separate generic class. If the thing is just interpreting the code and doing it for you, you kind of don't need that output.

It might be useful just to generate it, though, right? You could have some sort of write-the uml command.

Yeah, I mean, traditionally, even from the early 2000s, there's always been software, similar to autodoc or any sort of documentation generation, that would take the code and sketch this out for you. But you still had to munge it a bit, you know, so it'd be nice if one of these things could just smooth that over for you and it was more polished by the time you got it. Maybe these days they're really, really good; I haven't done it since undergrad.

Yeah, I think that's where these large language models can really shine: doing this data-augmentation stuff. I use it a lot for things like, if I have some JSON object and I want to convert it to a different type, you can just paste that into ChatGPT and have it convert stuff really easily, versus having to write some weird awk command. Just for little things like that, I think it can help a lot. So if somewhere along that process you can insert an LLM that has a very well-defined task, where it's "you have to polish up this thing and that's all you're allowed to do", and the inputs and outputs are very tightly constrained, then I think you can get a lot of benefit from integrating those into a system like that.

So I derailed us just a little bit, but you also had a question, Nabil, about what autonomous agents you could use.
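The "very well-defined task" pattern described above can be sketched without any framework: a registry of named tools the model is allowed to call, each with a narrow, tightly constrained input and output. This is a hand-rolled illustration, not LangChain's actual API, and the tool name and payload are invented:

```python
import json

# Registry of tools an agent may call. Each entry pairs a description
# (which would be shown to the LLM in its prompt) with a plain function.
TOOLS = {}

def tool(name: str, description: str):
    def register(fn):
        TOOLS[name] = {"description": description, "run": fn}
        return fn
    return register

@tool("json_to_tsv", "Convert a flat JSON array of objects into TSV text.")
def json_to_tsv(payload: str) -> str:
    # The well-defined task: reshape data, nothing else.
    rows = json.loads(payload)
    header = sorted(rows[0])
    lines = ["\t".join(header)]
    lines += ["\t".join(str(row[h]) for h in header) for row in rows]
    return "\n".join(lines)

def dispatch(call: dict) -> str:
    # Run a model-proposed call such as {"tool": ..., "input": ...}.
    return TOOLS[call["tool"]]["run"](call["input"])

out = dispatch({"tool": "json_to_tsv", "input": '[{"id": 1, "name": "spades"}]'})
print(out)
```

Because the model can only pick a registered tool and supply its input, it cannot wander off and execute arbitrary code — the constraint discussed earlier.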
And it makes me think: could you make a subcommand like write-the diagram, or even, for one of our favorite journals right now, a write-the joss? And by the way, I don't want to overshadow the amazing GPT paper that Andrew and co-authors wrote earlier, but just to focus us on this: could you make a write-the joss command that writes the markdown and makes your whole publication?

I think you can, to some extent, right? You can get most of the way there. Especially if you're doing some sort of software-announcement paper and you have reasonable documentation from your code base, or you've got documented code, maybe using write-the to generate the documentation. You can provide some examples and say, this is kind of what we're doing, and I'm sure ChatGPT or another language model can generate some sort of introduction and background that seems reasonable. It's just that the code-base-wide operations are holding it back a little bit, so that it can understand everything across the code base. But that's really a matter of context, and how you summarize and represent things.

Yeah, I think one thing in our discipline is that we're pretty good with writing the code. We're happy writing the code; we don't need the robot to do that. But when it comes to actually writing the paper, we're rubbish. So even if the input was: here's the code base, here are the docstrings that we've checked and they're fine.
And then here are ten bullet points about what this actually does. Take all of that and write the thing, the intro. We'll give you a couple of lines about what the motivation is, but try to make us sound intelligent when we say, oh yeah, we wrote this read mapper, but we didn't realize someone had already done it. Can you write that in a way that doesn't make us sound like idiots? That would be fantastic.

Yeah, there's definitely an opportunity for that. Probably ChatGPT itself, or one of these chat applications, is really good for it, because you can go back and forth with it and say, oh, actually, change this wording, or what if this changed? But maybe the scaffolding is the thing: a lot of these journals use templates anyway, so you have a starting place there, and I'm sure you can do a lot with that.

Yeah. I mean, software announcements have a certain flow to them. And again, it's thinking like, yeah, the motivation was, I didn't realize this was already in bedtools. We're quite good at bashing out the text, but making it formal is a difficulty for a lot of people.

I'm also curious, speaking of that, about translating between languages, because you sort of mentioned that with the convert function. Have you done much of that yourself? Because people who are not native English speakers, their English is good, but they always say it's a lot more effort to write this formal, flowery academic prose. They're like, look, I could just bash this out in French in ten minutes, but I have to write it in English.
So this is going to take me half a day. So have you tried any of that, converting between human languages rather than programming languages?

No, I haven't really tried anything like that. I mean, I definitely see the utility there; it would be extremely useful. I've seen a lot of these projects with their GitHub documentation written in every single language imaginable, because they used one of these language models to write the docs in different languages. So you can support people who are non-native speakers, where they have access to the documentation in a language that they want. So yeah, I think that would be a really good use case, and definitely extremely powerful as something people can use to ease those communication barriers where they happen. Obviously it has to work, really. If it's changing your meaning, then that's not good. But if it's faithfully translating the text, then it's definitely a really awesome use case.

I was just wondering: your argument handling and your help outputs are amazing. How on earth do you do that? What library do you use?

Oh yeah, so that's Typer, which is by Sebastián, I forget what his last name is, but he wrote FastAPI. So it's these two big Python libraries: FastAPI, which I think has become the gold standard for writing RESTful APIs in Python, and Typer, a package for writing command-line interfaces that's built on top of Click. It lets you do a lot of this fancy stuff using function decorators, and it does all the rich text output and colour highlighting and different stuff like that.

That's pretty good. Sorry, I was a bit quiet there; I've been running write-the on my code, and it's really awesome.
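To give a sense of why Typer's help output is so polished for so little code: the type hints, docstring, and option help text below are all Typer needs to generate rich, coloured `--help` pages. This is a minimal sketch (the command and default are invented for illustration, not write-the's actual interface):

```python
import typer
from typer.testing import CliRunner

app = typer.Typer()

@app.command()
def docs(
    file: str,
    model: str = typer.Option("gpt-3.5-turbo", help="Model to use (hypothetical default)."),
):
    """Add docstrings to FILE (placeholder logic for illustration)."""
    typer.echo(f"Would document {file} with {model}")

# Exercise the CLI in-process rather than from a shell; the help page is
# generated automatically from the annotations and docstring above.
runner = CliRunner()
result = runner.invoke(app, ["example.py"])
print(result.output)
```

Running `python cli.py --help` on such a script prints a formatted usage panel with the argument, the option, and their descriptions, with no extra code.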
Like, Jesus Christ. Yeah, I'm blown away. Well done.

Oh, great.

Yeah, it definitely saved me some time. I haven't run the tests yet, so hopefully it still compiles and runs and all that, but I trust you.

Did you run the mkdocs command? So if you run write-the mkdocs on the same code, this is a utility command that doesn't actually use an LLM. It's just some scripts that I put together, but basically it will generate the template for MkDocs, a Markdown website with the Material theme, which then uses all the docstrings that you've written to auto-generate an API reference. So you get these rich-text, searchable docs with dark mode. It's just one command, and it'll also create a GitHub Actions script so that you can click one thing and it will deploy to GitHub Pages. So it really streamlines the process of generating documentation.

If you go up a little bit to the top... yeah, sorry, on the right-hand side, where the URL is, the write-the.wytamma.com link. Click on that one. So this is documentation that's auto-generated. If you click on the hamburger menu, up on the top left, there's a reference section. If you click on that, you can see this is all auto-generated from the docstrings. It formats everything: it says what the input parameters are, what the return types are, if there are any side effects. This is all the stuff that ChatGPT generated, but then it uses this library to essentially parse that and turn it into HTML. So you have a website, and it's fully indexed and searchable as well. And it's all running on GitHub Pages.
So it's free for open-source repositories.

We are rapidly running out of excuses for why we don't have documentation.

Yeah, no, it's really, really good. So this is MkDocs, similar to Javadoc: a sort of web reference manual, a human-readable thing that goes through all of the functions and describes them. That's what we're looking at. It has the function, the inputs, the description, a table of the parameters, a table of the return values and all the types, all the information about it, which is what you commonly see in most of these. And it is very much boilerplate stuff that is intuited from the code itself. And when people are using your code and can see it on the GitHub repository, it is useful to know these sorts of caveats. I can see this being really, really powerful for APIs, actual web APIs.

Yeah, like you said, it's very stock-standard stuff, but there's so little activation energy to doing it. You might as well have a website that's searchable that people can access for your project; it's literally just one command, so why not? That means people can go to it, they can find it, and all the API reference is searchable. And then you can add other things on top of it. It currently doesn't generate tutorials or anything like that; it will just give you the API reference and add a README file to the front page. But it's super simple to do. So yeah, it's a little bonus feature, I guess.

Did I hear you say that this exact function does not use an LLM, though?

Yep.

How is this possible?
So there's a library that essentially goes through the repo and extracts the docstrings. The docstrings that were generated with the LLM, it goes through and extracts those, and because they're in a consistent format, they're all in Google docstring format, which is kind of like a weird YAML format, you can then use that to essentially populate the template that the framework uses.

So you're dabbling with Torsten's Perl, if I can just outright say that. Are you making Perl documentation, like that really clunky perldoc? Are you able to look at that and even parse it?

I haven't tried that, but it might be possible. Currently I think the parser is limited to Python files, so the documentation generation is limited to Python files. But GitHub has, I don't know if you've seen this feature that's on GitHub now: if you go to a project and you open some code there, it can identify all of the tokens. If you click on a function name, it'll highlight it all through GitHub. They've open-sourced the library they do that with, which essentially enables you to parse the concrete syntax tree of a file to be able to say which elements within the program relate to each other. That library is open source now. So yeah, the symbols thing on the right-hand side: you can see what symbols are in the file, and if you click on something, it'll select them and show you where they are throughout the file and across different files in the project. And so basically, my plan is to integrate that into write-the.
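The no-LLM extraction step described above, walking a Python file and pulling out docstrings, can be done with the standard library's `ast` module. This is a generic sketch of the idea (the example function is invented), not write-the's actual parser:

```python
import ast

# A small Python source file with a Google-style docstring.
source = '''
def gc_content(seq: str) -> float:
    """Return the GC fraction of a DNA sequence.

    Args:
        seq: DNA string.

    Returns:
        Fraction of bases that are G or C.
    """
    return sum(base in "GCgc" for base in seq) / len(seq)
'''

tree = ast.parse(source)
# Collect the docstring attached to every function definition in the file;
# a docs generator then renders these into HTML reference pages.
docstrings = {
    node.name: ast.get_docstring(node)
    for node in ast.walk(tree)
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
}
print(docstrings["gc_content"].splitlines()[0])
```

Because the docstrings follow one consistent format, the Args and Returns sections can be parsed mechanically and dropped into the site template, with no model call needed.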
So then it will work better, or work with other languages, because it's a language-agnostic library that lets you parse everything so that you can extract the particular parts. Because you need to identify where the docstrings are, and the functions, and where the functions need to go, and that sort of thing.

Amazing. Well, you might've lost the two people who program in Perl who listen to this podcast. I'm sorry.

Yeah, well, you can run write-the convert and change to Python. It's all good.

There we go. Run write-the convert, then write-the docs, then write-the mkdocs, then write the tests, and your project's all good. We can tie this up here, and there are a lot of more generic questions we can split into a second episode, I guess, if you have the appetite for that. But yeah, let's close here; it's been a while. So, Wytamma, what time is it over there right now?

It's pretty late, 12:30 at night.

Yeah, thank you so much for joining us.

No worries.

I know it's so hard to organize this stuff across different time zones. Australia always seems to lose somehow. I think we're the odd ones out. I'm sorry.

No, that's all good.

Between Atlanta and Melbourne, we're on opposite sides of the world, and I just really appreciate you joining us. It's been a pleasure talking about GPT and the write-the package. Thanks for joining us, and I hope to see you next time, or maybe we'll do something to get you back sooner rather than later.

Thank you.

Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics group.
The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.