Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil Ali Khan and Professor Andrew Page. Nabil is the Head of Informatics at the Quadram Institute in Norwich, UK, and Andrew is the Director of Technical Innovation for Theiagen in Cambridge, UK. I am Dr. Lee Katz, and I am a Senior Bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.

You all might have heard about the new artificial intelligence tools that have come out this year. We are especially amazed by the generative pre-trained transformer category of AI, also known as GPT. We're going to chat about ChatGPT, Bing, Bard, and the like. We use these tools for several general cases, such as generating boilerplate programs, making draft documents or emails, and summarizing documents. What have you guys been using it for?

Well, I've been using it an awful lot in various different guises, but in my day-to-day, actually, I use it a huge amount to write code using VS Code and GitHub Copilot. It's just phenomenal: when you're writing a bioinformatics script, it can look at everything you've done, look at the context, and then suggest exactly the thing you want. Recently, when I've started writing classes in bioinformatics programs, I'll just write a big text description of, you know, this is what I want to do, here are the files that go in, and here's an example, maybe three or four lines of each input file. Then I just press tab, and it goes and generates a first stab at the code. And actually, it's often about 90% right. You do have to go in and refine it, add things, and fix some mistakes, whatever. But generally, all the boilerplate stuff goes really, really well, and it's really made my life a lot easier when it comes to coding. And things like, I can never remember how to use pandas, say, in Python, because you have to use things quite a lot to commit them to memory. All I have to do now is just type a little comment about what I want to do, and there you go, the correct syntax is there with the correct variable names (something like the little sketch below). It's just phenomenal. You have to try it, particularly Copilot, which, I guess, is GPT-4 under the hood. It's just mind-blowing when you start using it, and it saves you so much time. I think it's probably sped up my coding by about 20 or 30%, straight off, no bother. It's also got other neat little features, like you can just say, explain this line of code, and it won't just explain it, it'll explain it in the context of all the other code you've written. So it knows what you've been trying to do, and it gives you a relevant explanation, not just a generic one, which is really cool. And I've literally, about five minutes ago, had an editorial accepted for MGen, that is, Microbial Genomics, on the ethics of using AI in microbial genomics research, which is coming out soon. Hopefully, by the time this podcast comes out, that article will be out.
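To give a flavour of that comment-driven workflow, here is roughly the kind of pandas snippet Copilot fills in from a one-line comment. The file name, column names, and threshold are made up for illustration; Copilot would infer the real ones from your project.

```python
import pandas as pd

# Keep only assemblies with at least 30x coverage and write them back out.
# (File and column names here are hypothetical.)
df = pd.read_csv("assembly_metadata.tsv", sep="\t")
filtered = df[df["coverage"] >= 30]
filtered.to_csv("assembly_metadata.filtered.tsv", sep="\t", index=False)
print(f"Kept {len(filtered)} of {len(df)} assemblies")
```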
And so what we did actually was we wrote the article at a recent hackathon. We wrote the article with ChatGPT, so it's from its own mind. And then we put a little human-written, artisanal paragraph at the front to introduce the topic and give our little opinion on it. But, you know, it's mostly AI-generated, which is really scary, because it's actually really good. There's a companion episode on my other podcast, Research Pages, as well, and that in itself was all AI-created. ChatGPT created the entire episode on the ethics of using AI in academic research. Then I fed that into ElevenLabs: I cloned my voice and my partner's voice, trained it on our different podcast episodes, and put the script into ElevenLabs to generate the actual speech, text-to-speech. Then I wrapped all of that up as a podcast episode, and I got ChatGPT to suggest an episode title and do show notes and all that kind of jazz. It just speeds things up phenomenally, and so it's very, very useful for anyone who uses a computer. That's my initial introduction over, so I'll pass it over to you guys.

I think the keyword that Andrew mentioned was the word boilerplate, and that's, for me, where this thing really excels. I've been tinkering with different things. Issues I have are things like, I have a table with dates in different formats, and I ask it, can you make these all ISO format, so year-month-day? You've got a grab bag of American ones, or ones that are actually correct, or dates written out in words, like Tuesday, 2nd of May, whatever it is. And it does a good job of converting that across (see the sketch below), something that would take me quite some time to do manually, and quite some time even to do programmatically, to parse that text. So that's one of the use cases I've found that it does fairly well, just in tinkering with it. It's something where I go, yes, this is a task I really, really don't want to do, and it's great, this thing can just do it for me and I don't have to worry about it. Another use case I've been tinkering with is feeding it very short summaries, say an abstract of a paper, and asking, can you condense this down, make it even shorter? So instead of 300 words, make it, say, 100 words, or maybe just three lines long. And it's able to do that fairly well. Not perfect, but fairly well. I've also been playing around with it, I don't rely on this, but I've dipped back and forth in tandem when I've been replying to, say, someone asking me how to use SSH keys. I already have documentation on how to use SSH keys, so I send that. But then I go, what would ChatGPT do? So I ask it how you do it, and it writes a pretty good explanation, a pretty good tutorial on how to do that. But that is not a sophisticated thing to do; there is plenty of existing material out there that explains it. So that's, to me, a boilerplate tutorial that I would otherwise have to prepare for someone. Definitely, anything that's more complicated, it doesn't do very well. It has a wonderful habit of lying when it doesn't know, and if you're not careful, it sounds very, very convincing. So one thing I've been doing is feeding it stuff for Dungeons and Dragons, or asking it about computer games. I asked it, come up with the best route: how do I get from Caldera to Ebonheart in Morrowind, which is a 20-year-old video game.
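Going back to that date-cleaning example for a moment, here is a minimal sketch of the sort of thing these tools suggest, using the third-party python-dateutil package. The example dates and the dayfirst policy are our own assumptions, and genuinely ambiguous dates still need a human decision.

```python
from dateutil import parser  # pip install python-dateutil

messy = ["May 2, 2023", "02/05/2023", "Tuesday, 2nd of May 2023", "2023-05-02"]

# dayfirst=True treats 02/05/2023 as 2 May; flip it if your data is American-style.
iso_dates = [parser.parse(d, dayfirst=True).date().isoformat() for d in messy]
print(iso_dates)  # ['2023-05-02', '2023-05-02', '2023-05-02', '2023-05-02']
```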
So I'm like, here's something a bit esoteric, can you solve that? And it gives a very good explanation, and if you know anything about the game, you go, okay, you go to the Mages Guild and you can use this, and you can use a silt strider. But those don't actually go directly to that place. So the information is wrong. It sounds very convincing, but it's wrong. And often I've found that with a lot of things that are a bit more complicated: it just says the wrong thing. And I think Andrew wants to comment on that.

Yeah, so at the most recent Food Safety and Microbial Genomics Bioinformatics Hackathon, I can never remember the names of these things, which myself and Nabil ran recently in Cambridge, I built a script, or rather ChatGPT made a script, which would take in just the name of a bioinformatics tool, construct a prompt, and then generate a podcast episode (there's a rough sketch of that kind of pipeline below). That was then sent to ElevenLabs to turn the text into speech, which is actually really, really good; ElevenLabs is really good at producing nice, natural speech. It then generates show notes and a title, and I upload that to a special kind of test podcast called MicroBinfie Bites, which I presume will go away because no one has really listened to it. But the idea was basically, can we automatically generate reviews of bioinformatics tools? And the answer is yeah, but it's not always accurate. In fact, I had to tell it in the prompt, don't output anyone's name, don't output any institution, basically no facts, because it will get them horribly wrong all the time, like blatantly wrong, but it'll be so confidently incorrect that it is worrying, you know. And I can see, because it is so easy to generate all this stuff, we're going to be awash with ChatGPT or GPT-generated content quite soon, because it's so easy to automatically generate podcasts, YouTube videos, and text for websites. It won't necessarily be correct, but it'll look correct, and so you're not necessarily going to know what's real and what's not. And so it's going to be dangerous. We're on the beach, we can see the water flowing out, and we know a tidal wave of information is coming that's just going to absolutely slam down on top of us. So we're going to have to be very, very mindful that we are going to have a lot of very low-quality information that seems high quality; it's the wolf in sheep's clothing. So anyway, that little podcast was just an aside, a test, and it seems to work. And it does have its uses, because a lot of the material we write has no audio version, so if someone is visually impaired, they can't very easily access it without doing a lot of reading via screen readers or whatever, and that's a challenge. So it has a use, but I don't think we're there yet. We're nearly there, but there does seem to need to be a human in the middle making sure that these things are actually sane.

Yeah, there's a lot to unpack here, actually. Again, you guys have gone over a lot of different use cases, and I think maybe I'll take a beat to go back for a second. One thing that you both basically said you use it for in common is boilerplate stuff.
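As a rough illustration of the tool-review pipeline Andrew described: the sketch below builds a prompt from a tool name and asks a model for a draft script. The model name, prompt wording, and output handling are placeholders, and the resulting text would still go off to ElevenLabs for text-to-speech and, crucially, to a human for fact-checking.

```python
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

client = OpenAI()

def draft_episode(tool_name: str) -> str:
    """Build a prompt from a tool name and ask the model for a short episode script."""
    prompt = (
        f"Write a short, conversational podcast script reviewing the bioinformatics "
        f"tool '{tool_name}'. Do not mention any people or institutions by name, "
        f"and avoid stating specific facts you are not sure of."
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder; use whichever model you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    script = draft_episode("SKESA")
    # The script text would then go to a text-to-speech service such as ElevenLabs.
    with open("episode_draft.txt", "w") as handle:
        handle.write(script)
```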
And I definitely heard coding. So maybe I can just go back to that and slow us down a little bit on that one. How does somebody listening to this go to GitHub Copilot or VS Code and start using that predictive text to make some code? And if somebody is an avid Perl user, for example, could that person use it to quickly learn another language?

So Lee just suggested something which I think we should try out live. He's going to pick up a normal Perl script, a very short, basic one, right, and I'm going to try it here on ChatGPT, and you can try it too. We'll both just do a quick hackathon and ask these things to convert it to Python, and we'll assess whether it does a good job or not. We know both languages well enough between us that we can just assess it without even needing to run it. Actually, I'd be interested in taking it from Perl to Python and back to Perl and seeing what happens. Yeah, yeah, let's do that, let's close the circle and see how it goes.

So in order to get to the boilerplate stuff that we were just talking about, and figure out if we can use ChatGPT, or any GPT, to read the code and spit it back out, we're going to try a little translation party. I pulled up a script under my lskScripts repo. We're going to see if we can translate it to Python and then back to Perl again and see if it does the same thing. This script in particular takes a non-standard FASTA file where all of the entries are separated by a pipe inside of the sequence, and it puts it into standard FASTA format with each piece as its own entry (a rough sketch of that kind of script is below).

Oh, this is writing some very, very nice Python, actually. Proper argparse and everything like that, you know. Are you doing it in ChatGPT or VS Code? I'm doing it in ChatGPT-4. Yep. Yeah. So the original script from Lee, and we'll put the link in the show notes, does have a usage statement, as old scripts should, and it asks for an output, so that's already specified, and it seems like it's ported that across to argparse, which is a standard library in Python for doing that. And it's using, well, mine is using Biopython as well. Yeah, mine's decided to use Biopython, which is great. It's kept the comments: Lee has some comments at the beginning that say, oh, I'm the author, and what the script does, and it's carried that across into the Python. Not mine. Mine has gotten rid of all authorship, and it's just like, Lee, sorry, you're gone. It's frozen now. It creates the outdir, so if the output path does not exist, it uses os.makedirs to create the outdir parameter, which is a thing you kind of need to do in Python. It's using Biopython to guess the sequence format. Then it reads in the file, reads each record from the Biopython object, and takes the seq ID, replacing the pipe with nothing. Then for the output, if one is specified, it creates a FASTA file; if not, it gets ready to write directly to standard out. And then it splits the sequence on each vertical pipe and writes those out with the corrected header, closes the file, and it's fine. So that looks pretty good. I'm happy, and I've taken the Python and turned it into Perl again. I have to say, it's some very well-written Perl. Okay, so I've just sent a copy of the Python back to Lee.
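For listeners who want a reference point, here is a rough, plain-Python sketch of that kind of converter. It is not Lee's actual script and skips the Biopython format-guessing the generated version had; it simply reads a FASTA file whose sequences contain pipe-separated entries and writes each piece out as its own record. The header suffix scheme is our own choice.

```python
#!/usr/bin/env python3
"""Split pipe-delimited FASTA entries into separate records (illustrative sketch)."""
import argparse
import sys

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        else:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)

def main():
    arg_parser = argparse.ArgumentParser(description="Split pipe-delimited FASTA entries")
    arg_parser.add_argument("fasta", help="input FASTA with '|' between entries in the sequence")
    arg_parser.add_argument("-o", "--output", help="output FASTA (default: stdout)")
    args = arg_parser.parse_args()

    out = open(args.output, "w") if args.output else sys.stdout
    with open(args.fasta) as handle:
        for header, seq in read_fasta(handle):
            # One new record per pipe-separated chunk.
            for i, chunk in enumerate(seq.split("|"), start=1):
                out.write(f">{header}_part{i}\n{chunk}\n")
    if args.output:
        out.close()

if __name__ == "__main__":
    main()
```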
So he can have a look at whether it is indeed doing what his script does. I don't know fully, because it's a non-standard FASTA file as the input, so I don't know if it's doing exactly what he's expecting, but he can have a look at that. But Andrew, you're happy with the Perl as it's come back out the other side? It looks all right at first glance. Obviously I didn't write it; I just pasted it in there. It actually looks very neat, to be honest. It might put us out of business, you know, if it's writing code this good. It's extremely similar to the original. I don't mind if it takes this over; I've got better things to do than write these kinds of scripts anymore. Yeah, it is kind of mad. I wonder, could you make it even better? Like, could you say, turn this into a one-liner? Try it and see if we can figure that much out. I'm assuming if you give it something where it actually has to comprehend what it's doing, it's not going to do a good job.

So it tells me at the end: please note that the conversion is done to the best of my ability, but there still might be slight differences or areas that need further adaptation based on specific requirements or the version of Python. That's very nice. It's covering its own ass. It might not do exactly what's expected. I am quite impressed. The thing I can see immediately is it's using argparse fairly intelligently. So it's taken the Perl code and created a corresponding Python version of it, and it's done it well. I think it's great. I guess you're still assessing the Python code, or the Perl code, yourself to see if it really is doing what it's supposed to do? I mean, I'm not going to know until I run it. I'm probably not going to run it, but it looks right. Have you seen the Perl one-liner that just popped in there? It seems to have done an interesting job. Maybe we could obfuscate all of our code doing this. Well, you do that with JavaScript already, right? You minimize or minify code. We could do that here. I don't know why you'd want to. Yeah, I wonder, could you go Perl, Python, Rust, Go, you know, go around like ten different languages and then all the way back to Perl, and see what it looks like at the very end?

Well, why don't we do that? Why don't we ask it to generate the initial script, actually, because it should be able to do that. What's something really simple? How about this: in Python, translate DNA into protein. Let's see if it can do that, right? So I'm going to give that as a prompt, and we'll see how it does. And then I'll just ask it, take the code above and convert it into Rust. I'm using a simpler example because I don't know Rust at all, so I'd need a simpler example to verify that it's doing what it's supposed to. So can I follow along with you? You're just saying, make a program to do what again? I'm just saying, in Python, translate DNA into protein.

So, sorry, in the chat I just posted the prompt that I put in when I started in VS Code. I just popped this kind of prompt in at the top of the file, which is just a text description of what I want the class to do, plus an example of the different headers and files and some example data. And then from that, it went and generated the full class and basically 95% of the actual code.
And you feel like it does this right? I mean, this is a task you're familiar with, so you think it's doing what it's supposed to? Yeah, absolutely. It's super, super good. And we're talking, I mean, I'll read out your initial instruction. This is saying: a class which takes in two spreadsheets, an assembly metadata spreadsheet and a GAMBIT results file, and then compares them with pandas. It's not that specific; you're not writing pseudocode. And then it's just transcribing that into the correct syntax. You're giving a fairly high-level conceptual instruction. Yeah. And then the rest of it is, I tell it why: so it can work out how good GAMBIT is at predicting species. Then the columns are joined on the accession number, like a GCA accession, which is the first column in both spreadsheets, and the species column from the assembly metadata spreadsheet needs to be compared to the equivalent predicted name column in the other file. And then I say the spreadsheets are in the following formats, and I give, you know, the assembly metadata spreadsheet file format, then the headers, and then four lines of real content. And then similarly, I give the other file, then the header and some content. And from that, it has enough information to go and make the code for you (a rough sketch of that kind of class is below). Which is just phenomenal, because it was a lot quicker for me to type that stuff than it is to type code that works.

Is it a little scary? You don't know if it's going to mess up the join on one nuanced row or something. You do have to read through it and check it, but it's a lot easier to scan through a file and check something than it is to go and write it from scratch. And, you know, as a human typing on a keyboard, I'll make mistakes, or I'll forget things, or I'll have to look things up. So it gets it mostly right. But if you don't know how to program, then that's scary, because you won't know how to actually sanity-check all of this. Me, having done it 20 times, or 200 times, I know exactly what needs to be there and where, and what it will look like ultimately, so I can spot errors. But if you're coming at this as a freshly minted undergrad or PhD student, and you think that this is going to write everything for you perfectly each time? Well, no, it'll get most of the way there, but it won't be fully correct.

Okay, so if I wanted to take this from ChatGPT, the free version, into VS Code, what do I have to do? Do I have to get an API key or what? So you sign up on GitHub. I pay $10 a month for GitHub Copilot, and that's super cheap. And then in VS Code, which is just phenomenal as an IDE, you just add an extension, and it gets you to log into your GitHub account. From there on, it just has this teeny little icon down in the corner, and it's always working. As you type, you'll see a little kind of spinning, thinking thing, and it goes away, and it will suggest things. Then you'll see in gray text what its suggestions are, and you just press tab to accept. And it goes a bit further, because there's GitHub Copilot Chat, which is waiting-list only at the moment. So you have to first of all pay for Copilot, and then you can join the waitlist for Chat. And that integrates even more; it's vastly better. It gives you basically ChatGPT chat functionality within the editor.
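Going back to that GAMBIT prompt for a moment, here is a minimal sketch of the kind of class Copilot tends to produce from a description like Andrew's. The file layouts and column names (species, predicted_species) are assumptions for illustration, not the real spreadsheet headers.

```python
"""Compare an assembly metadata spreadsheet against a GAMBIT results file.

Both tables are joined on the accession (e.g. GCA_...) in their first column, and
the metadata 'species' column is compared to the GAMBIT 'predicted_species' column
to see how often the prediction agrees. Column names here are illustrative.
"""
import pandas as pd

class GambitComparison:
    def __init__(self, metadata_path: str, gambit_path: str):
        self.metadata = pd.read_csv(metadata_path, sep="\t")
        self.gambit = pd.read_csv(gambit_path, sep="\t")

    def compare(self) -> pd.DataFrame:
        # Join on the accession column, which is the first column of both tables.
        merged = self.metadata.merge(
            self.gambit,
            left_on=self.metadata.columns[0],
            right_on=self.gambit.columns[0],
            how="inner",
        )
        merged["agrees"] = merged["species"] == merged["predicted_species"]
        return merged

    def summary(self) -> float:
        """Fraction of assemblies where GAMBIT's call matches the metadata species."""
        merged = self.compare()
        return merged["agrees"].mean() if len(merged) else 0.0

# Usage (hypothetical file names):
# comparison = GambitComparison("assembly_metadata.tsv", "gambit_results.tsv")
# print(f"Agreement: {comparison.summary():.1%}")
```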
And then if it suggests some code, there's a little button, you know, to insert it where your cursor is in your text file, so you can start to code directly. It knows the context of what you're writing. So when I write Python, I use lots and lots of classes spread out across different files, and it understands all of those within a project. You can get it to explain lines of code, and you can get it to fix errors: when you get errors, it can fix them for you, which is just shocking. But mostly what I use it for is, as I'm typing, I type in a comment, you know, this method will do blah, blah, and then it'll write the method, suggesting a reasonable method name and reasonable things in and out. And sometimes it links to other methods which you haven't written yet, and then next time, when you go and press tab, it'll write those as well. Now, it does get things confused and wrong sometimes, and you do have to guide it, but it is very good. And a lot of coding, you know, is just bashing the keyboard. It's also very good for writing tests, because it knows the kind of standard things you need to test, you know, boundaries and whatnot, or whether files exist, so it's very good at testing those. And again, it writes lots of unit tests very quickly, because unit tests are boring; usually they're a typing challenge, or a copy-and-paste challenge, rather than anything else. So it does really speed that up, and that then improves the quality of your code. You can get it to explain a line of code in great detail, given its context, and you can get it to write comments for you. So you can go the opposite way, you know: given some code, write some comments and tell me what it does. It's just super cool.

So while we've been talking, I gave ChatGPT the prompt: in Python, translate DNA into protein. And what it's done is it said, well, to do this, you can use the Biopython library, and it gave me some example code. It imports the library, it's written a method, and it does a very simple thing of initiating a sequence object with the DNA sequence as input, using the correct IUPAC flag to use the right alphabet. It uses the translate method to translate it, and it returns the protein sequence. And it's got a little example down the bottom of how it would work. And as far as I can read, having done this, as Andrew says, hundreds and hundreds of times, the syntax and the code are basically correct out of the box; it will just do that (see the sketch below). I asked it to then translate this into equivalent Rust code. I don't know Rust, but based on the syntax of the code it's generated, it looks pretty convincing. What do you think?

I read the code; it looks good. Yeah, just wow. Every single line of Rust, I might remember it, but I'd have to go look it up too. And just like Andrew was saying earlier with pandas, I have to look up stuff; I don't remember how certain libraries look. I'm going to ask it to do the same thing in JavaScript, which I could actually interpret. Okay, it's generating it now. Yeah, that looks right. Yeah. So all this stuff that we have to look up, instead of searching 15 different Stack Overflow pages, it just produces it for you. I think that's the thing. Yeah.
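The snippet described above amounts to something like the sketch below. Note that current Biopython releases no longer take an IUPAC alphabet argument on the Seq object, so this is slightly simpler than the older examples ChatGPT sometimes reproduces.

```python
from Bio.Seq import Seq  # pip install biopython

def translate_dna(dna: str) -> str:
    """Translate a coding DNA sequence into its protein sequence."""
    # to_stop=True stops at the first stop codon; drop it to keep '*' characters.
    return str(Seq(dna).translate(to_stop=True))

if __name__ == "__main__":
    print(translate_dna("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))  # MAIVMGR
```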
It's this boilerplate stuff that you're going to Stack Overflow for anyway. So maybe you forget how a list comprehension works in Python, and you just tell it, do that, and it'll just do that. But that's because it's trained on that sort of material, right? It's probably read something along the lines of Stack Overflow, so it can tailor it to your problem and regurgitate that back to you. But if you then ask it a question that hasn't been answered before, it doesn't know what to do. It doesn't. And it lies. It tries, because it's been trained to give a response. I don't think it's ever allowed to say, hey, I don't know. Unless it's a case where they've put ethical constraints in, the machine is never allowed to say, oh, I don't feel confident about this. And that's sort of where the problem lies with it.

Have you come across some code where it's like, I won't do that? Do you have to do some jailbreaking? I've seen people ask it. I mean, one of the things it says it can't do is translate into different languages; I think they might have changed it, but early on it had a problem with that. It said, I don't want to do that, I've been trained on English, that's what I know. And I think if you ask it to write code that's unethical, so, I don't know, you make it do something racist, I don't know what racist code would look like, but it doesn't want to do that. It's that sort of thing. But these are constraints that have been set artificially on top of it by the developers to help it avoid doing things. So it says, I'm not going to do it, because I just don't feel comfortable doing it. I'm assuming if you ask it to write malicious code, it won't, if you tell it the code is malicious. Like, write me a deliberately malicious script that deletes everyone's files on their computer. It should hopefully say no. I'm thinking of that virus, you know: can you give me code to make centrifuges slightly unbalanced, so that we don't know if our nuclear program is working or not. Yeah, there we go. So I just gave it this prompt: write me a malicious script to delete everyone's data. And it says: I'm sorry, I cannot assist with creating or promoting malicious scripts as it goes against my ethical guidelines. My purpose is to provide helpful and responsible information to users. So there you go. I'm giving it a leading question, because I'm deliberately asking it to write me a malicious script, so it's pretty easy for it to say no. If I did it covertly enough, I'm sure you could squeeze it out. But then you can do that with anything and sort of break it.

Right, so that's all the time we have for this episode of the MicroBinfie podcast. We've been talking about new artificial intelligence tools, particularly ChatGPT, and we've been doing a sort of live demo of how it would generate code for bioinformatics applications. Hopefully you've enjoyed this episode, and we'll see you next time. Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at MicroBinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group.
The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.