Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is the head of informatics at the Quadram Institute in Norwich, UK. And Andrew is the director of technical innovation for Theiagen in Cambridge, UK. I am Dr. Lee Katz, and I'm a senior bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.

We're back again this week with Wytamma Wirth from the University of Melbourne to continue our conversation on his new software, write-the. So we'll just get right into it. What else is planned, if you care to share? What other use cases do you think are possible for this or similar tools?

Yeah, so there's a bunch of really useful tools and libraries developing around LLM package and tool development. One of them is using vector stores and vector databases to help expand the memory limitations of these LLMs. Basically, like I was saying before, if you can't fit a massive monolithic Python script into the context, you have to truncate it, which means the LLM really doesn't know about the earlier parts. But what you can do is use these vector databases or vector stores to record the specific parts of the code base that are useful, and add them to the context when they're required. So you might be able to do this summarization across a project to be able to do things like refactor it. You can sort of say, okay, this is the project.
These are all of the function definitions from the project, or this is a summarization of what's happening in this area. And when it needs to find something with a similar sort of relation, it can use that vector data store to look up the related content and add it into context. So that's something I'm looking at with write-the at the moment: being able to do code-base-wide operations, things like refactoring and optimizations across the whole code base.

And you could teach it to create a UML diagram. That would be great, because I hated doing those.

What is a UML?

Oh, UML. Do they still do that anymore? It's the Unified Modeling Language. It's these, what would you call them, Andrew?

It's like a flowchart. If you want to design very complex technical systems, you have to draw them out, and it's just a standard way of drawing them and saying this is where information goes.

Yeah. So all the attributes or variables and the methods, and you map it out: these are in this class, these are in that class, this interface relates to this, and all of that stuff. And they're a pain in the butt to make.

Right. Yeah, I'm sure that's something that would be possible to do. It actually reminds me: the main library I'm using to interface with the language model API is LangChain, which is an application-layer library for building prompts and creating agents that do specific things using LLMs. And there's another library built on top of that called, I think it's called Langflow, which essentially lets you use flow diagrams to create these large language model applications. So you can link different components together. So you might have an agent that does a specific thing.
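The vector-store lookup described above — embed chunks of the code base, then pull back the most similar ones when building a prompt — can be sketched in plain Python. This is an illustrative toy (a bag-of-words count stands in for real model embeddings, and the code chunks are invented), not write-the's implementation:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": bag-of-words counts. A real system would use a
    # dense vector from an embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# The "vector store": each code chunk is kept alongside its embedding.
chunks = [
    "def load_fasta(path): parse a fasta file into records",
    "def align_reads(reads, reference): map reads to the reference genome",
    "def write_report(results): render an html summary report",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query: str, k: int = 1) -> list:
    # Rank stored chunks by similarity to the query; the top hits are what
    # you would splice into the LLM's context window.
    q = embed(query)
    ranked = sorted(store, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(retrieve("parse fasta records"))
```

The key point is that only the retrieved chunk, not the whole code base, has to fit in the model's context.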
It has access to particular tools, and you can build out this flow-diagram graph network that shows all the relationships between the different packages and tools that the large language model can use.

I've started to see a bit of that with ChatGPT on GPT-4 and the plugins. A simple use case I've seen is you ask it to pick up some information from the internet, or some specialized piece of knowledge, for an agent, and it formats it back into text, and the regular model just does some sort of summary for you. A common example is asking it, what are the best restaurants around me? And it goes to a website, pulls down all of the restaurants that are around you, the booking links and the opening hours, and then formats and summarizes them for you.

Yeah. So LangChain and other libraries like Haystack provide an interface that essentially lets you define new tools that the language model can interact with. Like, I made one the other day that could write and execute Perl scripts. I gave that to an agent, and the agent was then able to create some code and execute that code, which is not very safe; you shouldn't let it execute arbitrary code. But you can define any sort of tool that you want using these interfaces and libraries, and they provide a nice structured way of doing that.

Are you trying to implement Torsten as an AI?

Yeah, exactly. It's only a matter of time. But I think for small tasks and stuff like that, there could be some sort of potential like that. Part of the problem with a lot of these autonomous agents, like AutoGPT and BabyAGI, is that they're just way too broad. Their goal is to be an artificial general intelligence, which is like a super intelligence.
And the technology just isn't there yet. But if you can constrain them to doing specific tasks, and that's kind of what I was saying about write-the, it's very constrained. Even if you're using one of these autonomous agents, if you can constrain it to a very specific bioinformatics task, maybe you can get better results out of it that way, where it doesn't get lost along the way as to what it's actually doing. I've had some luck doing stuff like that. But I think it's still a little while off before Torsten gets replaced.

I wouldn't say replaced, maybe augmented.

Augmented, yeah, digitalized.

So going back to your earlier question, Nabil: for a UML, does it output Mermaid-style stuff? I've been really getting into Mermaid myself. You can diagram out in plain text what your flowchart is going to be, and then you can visualize it on the mermaid.live site. Is that something you're looking at?

I haven't looked at it, but I imagine it's something that could be done. You basically need some description of the project that you can then start doing these refactoring operations on, so maybe that's a good way to represent the project.

I mean, traditionally you go to the end of the project and you make the effort of producing the UML diagram so that you, as the programmer, can figure out, based on the relationships, things like: oh, there's this function call that keeps happening over here, and it doesn't make sense that it's in this class, it should be somewhere else. That sort of organization. Or you realize that you've basically got the same function over and over again in different parts of the code.
So that should actually be broken down into a separate generic class. If the thing is just interpreting the code and doing it for you, you kind of don't need that output.

It might be useful just to generate it, though, right? You could have some sort of write-the uml command.

Yeah, I mean, traditionally, even from the early 2000s, there's always been software, similar to autodoc or any sort of documentation generation, that would take the code and sketch this out for you. But you still had to munge it a bit, you know, so it'd be nice if one of these things could just smooth that over for you and it was more polished by the time you got it. Maybe these days they're really, really good; I haven't done it since undergrad.

Yeah, I think that's where these large language models can really shine: doing this data-augmentation stuff. I use it a lot for things like, if I have some JSON object and I want to convert it to a different type, you can just paste that into ChatGPT and have it convert stuff really easily, versus having to write some weird awk command. Just for little things like that, I think it can help a lot. So if somewhere along that process you can insert an LLM that has a very well-defined task, where it's "you have to polish up this thing and that's all you're allowed to do", and the inputs and outputs are very tightly constrained, then I think you can get a lot of benefit from integrating those into a system like that.

So I derailed us just a little bit, but you also had a question, Nabil, about what autonomous agents you could use.
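The "very well-defined task" pattern described above can be sketched without any framework: a registry of named tools the model is allowed to call, each with a narrow, tightly constrained input and output. This is a hand-rolled illustration, not LangChain's actual API, and the tool name and payload are invented:

```python
import json

# Registry of tools an agent may call. Each entry pairs a description
# (which would be shown to the LLM in its prompt) with a plain function.
TOOLS = {}

def tool(name: str, description: str):
    def register(fn):
        TOOLS[name] = {"description": description, "run": fn}
        return fn
    return register

@tool("json_to_tsv", "Convert a flat JSON array of objects into TSV text.")
def json_to_tsv(payload: str) -> str:
    # The well-defined task: reshape data, nothing else.
    rows = json.loads(payload)
    header = sorted(rows[0])
    lines = ["\t".join(header)]
    lines += ["\t".join(str(row[h]) for h in header) for row in rows]
    return "\n".join(lines)

def dispatch(call: dict) -> str:
    # Run a model-proposed call such as {"tool": ..., "input": ...}.
    return TOOLS[call["tool"]]["run"](call["input"])

out = dispatch({"tool": "json_to_tsv", "input": '[{"id": 1, "name": "spades"}]'})
print(out)
```

Because the model can only pick a registered tool and supply its input, it cannot wander off and execute arbitrary code — the constraint discussed earlier.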
And it makes me think: could you make a subcommand like write-the diagram, or even, for one of our favorite journals right now, a write-the joss? And by the way, I don't want to overshadow the amazing GPT paper that Andrew and co-authors wrote earlier, but just to focus us on this: could you make a write-the joss command that writes the markdown and makes your whole publication?

I think you can, to some extent, right? You can get most of the way there. Especially if you're doing some sort of software-announcement paper and you have reasonable documentation from your code base, or you've got documented code, maybe using write-the to generate the documentation. You can provide some examples and say, this is kind of what we're doing, and I'm sure ChatGPT or another language model can generate some sort of introduction and background that seems reasonable. It's just that the code-base-wide operations are holding it back a little bit, so that it can understand everything across the code base. But that's really a matter of context, and how you summarize and represent things.

Yeah, I think one thing in our discipline is that we're pretty good with writing the code. We're happy writing the code; we don't need the robot to do that. But when it comes to actually writing the paper, we're rubbish. So even if the input was: here's the code base, here are the docstrings that we've checked and they're fine.
And then here are ten bullet points about what this actually does. Take all of that and write the thing, the intro. We'll give you a couple of lines about what the motivation is, but try to make us sound intelligent when we say, oh yeah, we wrote this read mapper, but we didn't realize someone had already done it. Can you write that in a way that doesn't make us sound like idiots? That would be fantastic.

Yeah, there's definitely an opportunity for that. Probably ChatGPT itself, or one of these chat applications, is really good for it, because you can go back and forth with it and say, oh, actually, change this wording, or what if this changed? But maybe the scaffolding is the thing: a lot of these journals use templates anyway, so you have a starting place there, and I'm sure you can do a lot with that.

Yeah. I mean, software announcements have a certain flow to them. And again, it's thinking like, yeah, the motivation was, I didn't realize this was already in bedtools. We're quite good at bashing out the text, but making it formal is a difficulty for a lot of people.

I'm also curious, speaking of that, about translating between languages, because you sort of mentioned that with the convert function. Have you done much of that yourself? Because people who are not native English speakers, their English is good, but they always say it's a lot more effort to write this formal, flowery academic prose. They're like, look, I could just bash this out in French in ten minutes, but I have to write it in English.
So this is going to take me half a day. So have you tried any of that, converting between human languages rather than programming languages?

No, I haven't really tried anything like that. I mean, I definitely see the utility there; it would be extremely useful. I've seen a lot of these projects with their GitHub documentation written in every single language imaginable, because they used one of these language models to write the docs in different languages. So you can support people who are non-native speakers, where they have access to the documentation in a language that they want. So yeah, I think that would be a really good use case, and definitely extremely powerful as something people can use to ease those communication barriers where they happen. Obviously it has to work, really. If it's changing your meaning, then that's not good. But if it's faithfully translating the text, then it's definitely a really awesome use case.

I was just wondering: your argument handling and your help outputs are amazing. How on earth do you do that? What library do you use?

Oh yeah, so that's Typer, which is by Sebastián, I forget what his last name is, but he wrote FastAPI. So it's these two big Python libraries: FastAPI, which I think has become the gold standard for writing RESTful APIs in Python, and Typer, a package for writing command-line interfaces that's built on top of Click. It lets you do a lot of this fancy stuff using function decorators, and it does all the rich text output and colour highlighting and different stuff like that.

That's pretty good. Sorry, I was a bit quiet there; I've been running write-the on my code, and it's really awesome.
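To give a sense of why Typer's help output is so polished for so little code: the type hints, docstring, and option help text below are all Typer needs to generate rich, coloured `--help` pages. This is a minimal sketch (the command and default are invented for illustration, not write-the's actual interface):

```python
import typer
from typer.testing import CliRunner

app = typer.Typer()

@app.command()
def docs(
    file: str,
    model: str = typer.Option("gpt-3.5-turbo", help="Model to use (hypothetical default)."),
):
    """Add docstrings to FILE (placeholder logic for illustration)."""
    typer.echo(f"Would document {file} with {model}")

# Exercise the CLI in-process rather than from a shell; the help page is
# generated automatically from the annotations and docstring above.
runner = CliRunner()
result = runner.invoke(app, ["example.py"])
print(result.output)
```

Running `python cli.py --help` on such a script prints a formatted usage panel with the argument, the option, and their descriptions, with no extra code.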
Like, Jesus Christ. Yeah, I'm blown away. Well done.

Oh, great.

Yeah, it definitely saved me some time. I haven't run the tests yet, so hopefully it still compiles and runs and all that, but I trust you.

Did you run the mkdocs command? So if you run write-the mkdocs on the same code, this is a utility command that doesn't actually use an LLM. It's just some scripts that I put together, but basically it will generate the template for MkDocs, a Markdown website with the Material theme, which then uses all the docstrings that you've written to auto-generate an API reference. So you get these rich-text, searchable docs with dark mode. It's just one command, and it'll also create a GitHub Actions script so that you can click one thing and it will deploy to GitHub Pages. So it really streamlines the process of generating documentation.

If you go up a little bit to the top... yeah, sorry, on the right-hand side, where the URL is, the write-the.wytamma.com link. Click on that one. So this is documentation that's auto-generated. If you click on the hamburger menu, up on the top left, there's a reference section. If you click on that, you can see this is all auto-generated from the docstrings. It formats everything: it says what the input parameters are, what the return types are, if there are any side effects. This is all the stuff that ChatGPT generated, but then it uses this library to essentially parse that and turn it into HTML. So you have a website, and it's fully indexed and searchable as well. And it's all running on GitHub Pages.
So it's free for open-source repositories.

We are rapidly running out of excuses for why we don't have documentation.

Yeah, no, it's really, really good. So this is MkDocs, similar to Javadoc: a sort of web reference manual, a human-readable thing that goes through all of the functions and describes them. That's what we're looking at. It has the function, the inputs, the description, a table of the parameters, a table of the return values and all the types, all the information about it, which is what you commonly see in most of these. And it is very much boilerplate stuff that is intuited from the code itself. And when people are using your code and can see it on the GitHub repository, it is useful to know these sorts of caveats. I can see this being really, really powerful for APIs, actual web APIs.

Yeah, like you said, it's very stock-standard stuff, but there's so little activation energy to doing it. You might as well have a website that's searchable that people can access for your project; it's literally just one command, so why not? That means people can go to it, they can find it, and all the API reference is searchable. And then you can add other things on top of it. It currently doesn't generate tutorials or anything like that; it will just give you the API reference and add a README file to the front page. But it's super simple to do. So yeah, it's a little bonus feature, I guess.

Did I hear you say that this exact function does not use an LLM, though?

Yep.

How is this possible?
So there's a library that essentially goes through the repo and extracts the docstrings. The docstrings that were generated with the LLM, it goes through and extracts those, and because they're in a consistent format, they're all in Google docstring format, which is kind of like a weird YAML format, you can then use that to essentially populate the template that the framework uses.

So you're dabbling with Torsten's Perl, if I can just outright say that. Are you making Perl documentation, like that really clunky perldoc? Are you able to look at that and even parse it?

I haven't tried that, but it might be possible. Currently I think the parser is limited to Python files, so the documentation generation is limited to Python files. But GitHub has, I don't know if you've seen this feature that's on GitHub now: if you go to a project and you open some code there, it can identify all of the tokens. If you click on a function name, it'll highlight it all through GitHub. They've open-sourced the library they do that with, which essentially enables you to parse the concrete syntax tree of a file to be able to say which elements within the program relate to each other. That library is open source now. So yeah, the symbols thing on the right-hand side: you can see what symbols are in the file, and if you click on something, it'll select them and show you where they are throughout the file and across different files in the project. And so basically, my plan is to integrate that into write-the.
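The no-LLM extraction step described above, walking a Python file and pulling out docstrings, can be done with the standard library's `ast` module. This is a generic sketch of the idea (the example function is invented), not write-the's actual parser:

```python
import ast

# A small Python source file with a Google-style docstring.
source = '''
def gc_content(seq: str) -> float:
    """Return the GC fraction of a DNA sequence.

    Args:
        seq: DNA string.

    Returns:
        Fraction of bases that are G or C.
    """
    return sum(base in "GCgc" for base in seq) / len(seq)
'''

tree = ast.parse(source)
# Collect the docstring attached to every function definition in the file;
# a docs generator then renders these into HTML reference pages.
docstrings = {
    node.name: ast.get_docstring(node)
    for node in ast.walk(tree)
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
}
print(docstrings["gc_content"].splitlines()[0])
```

Because the docstrings follow one consistent format, the Args and Returns sections can be parsed mechanically and dropped into the site template, with no model call needed.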
So then it will work better, or work with other languages, because it's a language-agnostic library that lets you parse everything so that you can extract the particular parts. Because you need to identify where the docstrings are, and the functions, and where the functions need to go, and that sort of thing.

Amazing. Well, you might've lost the two people who program in Perl who listen to this podcast. I'm sorry.

Yeah, well, you can run write-the convert and change to Python. It's all good.

There we go. Run write-the convert, then write-the docs, then write-the mkdocs, then write the tests, and your project's all good. We can tie this up here, and there are a lot of more generic questions we can split into a second episode, I guess, if you have the appetite for that. But yeah, let's close here; it's been a while. So, Wytamma, what time is it over there right now?

It's pretty late, 12:30 at night.

Yeah, thank you so much for joining us.

No worries.

I know it's so hard to organize this stuff across different time zones. Australia always seems to lose somehow. I think we're the odd ones out. I'm sorry.

No, that's all good.

Between Atlanta and Melbourne, we're on opposite sides of the world, and I just really appreciate you joining us. It's been a pleasure talking about GPT and the write-the package. Thanks for joining us, and I hope to see you next time, or maybe we'll do something to get you back sooner rather than later.

Thank you.

Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics group.
The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.