Hello, and thank you for listening to the MicroBinfie podcast. Here we discuss topics in microbial bioinformatics, and we hope we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is the head of informatics at the Quadram Institute in Norwich, UK. Andrew is the director of technical innovation for Theiagen in Cambridge, UK. I am Dr. Lee Katz, and I'm a senior bioinformatician at the Centers for Disease Control and Prevention in Atlanta in the United States.

In our previous episode we talked about artificial intelligence and did some basic live demos of how it would solve bioinformatics problems. We were mostly just playing around with it, but in this episode we want to delve into the literature that's slowly trickling in about these kinds of systems, how they can be applied, and how consistently they perform on the various tasks we encounter in microbial bioinformatics and genomics.

In the last episode we basically gave very simple toy exercises to ChatGPT and saw how it performed. There's already some literature where people have been trying this more systematically. There's a paper from Piccolo et al., available as a preprint, and we'll have it in the show notes. Its title is "Many bioinformatics programming tasks can be automated with ChatGPT". In the study they take about 180 exercises of the kind you would have in a bioinformatics course and feed them to ChatGPT, exactly as you would write them on an exam: here's the exercise, write the solution. Because these are programming exercises, they can then feed the output into the standard automated testing suite you would use for assessing programming tasks, like you would for any student assessment, and see how well ChatGPT does (there's a rough sketch of what that kind of automated checking looks like at the end of this segment). So it's basically: if ChatGPT were a bioinformatics student, how well would it do on an exam? In the preprint, ChatGPT managed to solve 139 of the exercises, roughly 75%, on its first attempt. For some of the ones it got wrong, they went back and gave it a bit more prompting; in some cases they suggest the exercise wasn't specific enough, or the model got stuck. With more prompting, they say, it was able to resolve the vast majority of the questions they presented to it.

So what that says to me is that I'm going to have to change the way I hire people. All these technical tests you might give people in advance, we just cannot do them anymore, because those are simple problems you can solve in a few minutes. We're going to have to physically sit beside someone and say, right, off you go, can you solve this problem without actually using ChatGPT? Do you have the fundamental knowledge there, or are you just a copy-and-paste jockey who is going to paste the question we've given you into ChatGPT and send it back to us two seconds later?
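To make the setup described above concrete, here is a minimal, hypothetical sketch of that kind of automated check: a generated script is imported and the graded function is run against known test cases. The file name, the gc_content exercise, and the test values are all invented for illustration; this is not the authors' actual harness.

```python
import importlib.util

def load_solution(path, func_name):
    """Import a generated script from disk and return the named function."""
    spec = importlib.util.spec_from_file_location("solution", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, func_name)

def grade(path):
    # Suppose the exercise asked for a gc_content(seq) function.
    gc_content = load_solution(path, "gc_content")
    tests = [("ATGC", 0.5), ("GGGG", 1.0), ("ATAT", 0.0)]
    return all(abs(gc_content(seq) - expected) < 1e-9
               for seq, expected in tests)

print(grade("chatgpt_answer.py"))  # True only if every test case passes
```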
So it is going to change things a lot. It's going to make life a lot harder. There's going to be a transition period as well, where some people are using this and some people are not. I think we're the early adopters, but there are going to be a lot of other people out there who don't necessarily use it, and that's going to cause a bit of an issue, I think. Anyway, yes, we have to change everything we do.

So what do you think about the analogy I've heard before, that this is a disruptive technology like the calculator, and it's legitimate for people to use a calculator in an interview or something like that? Is it legitimate for someone to use it to answer a question that's really simple, like calculating GC content?

When I'm hiring someone, I want to know that they actually know what they're doing, because they're going to need that same knowledge to assess what ChatGPT says and figure out whether it's accurate or not. On a simple problem, that's fine, but on more complex problems they do have to have that base level of knowledge. Even with people using calculators in school, you still teach them things from first principles. You don't just hand them a calculator when they're five years old and say, off you go. You teach them progressively, all the way through: this is what it means to estimate, to check whether something is right or wrong, and you build up from there. It's only at the end that you give them a calculator and say, now you know the basics, you're fine, you can use this for most things, and if you ever need to, you can always go back to pen and paper and work it out. So it is a great tool, but we do need to ensure that students and employees and scientists in general fundamentally know what they're doing, so they can look at something and go, that doesn't look right.

Well, for me, when I was doing software engineering in undergrad, most of our exams were pen and paper in a closed room with invigilators. We were doing programming with pen and paper. The exercise was: here's a problem, write the pseudocode, or design the data structure, or work out the big-O complexity, whatever it was. We weren't able to touch a computer; we didn't have the internet or anything. Back then you might have thought, I'm doing an exam on programming, why don't I just look up the answer? But you couldn't, and that kind of assessment is resilient to ChatGPT, resilient to any technology, because you're there doing it with pen and paper. I see what you mean about disruptive technology, but that's not how I was tested on my programming knowledge in my undergraduate.

A lot of marks for courses are now assigned through project work, work done outside an exam. That is going to be a problem, because you won't necessarily know whether the coursework was done by ChatGPT or not, particularly if someone goes to extra lengths to obfuscate it. I guess there will be clear signs, like if they include text that says "as an AI chatbot", blah, blah, blah,
and they're too stupid to even read what's come out. Or if you have a student who's clearly at the bottom of the class and suddenly they're producing something that looks very, very high quality; students don't normally go from the very bottom to the very top in one step. There are minor little tells as well. Maybe I shouldn't give this away, but ChatGPT is more American in its phrasing and spelling, so you can tell from the spellings, or you can tell from the style.

Yeah. I've seen code where people have mixed different indentation styles within a project, and it's very clear: you're using tabs in this file and spaces in that file, four spaces here and eight spaces there. Well, you've probably copied and pasted this from different places, just based on the style people use, or mixing and matching camel case with other naming conventions. So if you're stupid enough to have to rely on this stuff because you don't understand it, you can be caught by other means, just by a human applying their own intelligence and going, that doesn't look right. But for how much longer, I don't know.

When I was doing my undergraduate, every single programming subject had the rule that you had to pass the final exam; you had to get 50% on the final exam to pass the course. I always wondered why, and now I think I know. It's exactly what Andrew's saying: that's the final check. Even if they fudge everything in the assignments, copy and paste from everyone, plagiarise, and don't understand what they're doing, they will get caught out on the final. They won't be able to perform well enough to pass it.

That's awesome. There are these sorts of tricks. At the moment, based on the code we're seeing, it has a syntax, a style, or it makes a style, that you can detect. One thing I've noticed with essays is that these programs always say "in conclusion, something, something, something", and you think, yeah, nobody writes "in conclusion, da, da, da"; it's boring. But then it's very hard to prove that the student did it; that's the issue. How do you systematically demonstrate that the student has done that?

When I've talked to people who are involved in teaching, though, they're not too worried. They always come back and say, well, we just change the assessment. We lower the weighting of those assignments, or we make the closed-book exam a mandatory pass, or they even say, you know what, this is a good opportunity to bring back some old-school assessment methods, like making students stand up and give an oral viva on the subject to show they know what they're talking about. I think you can figure out pretty quickly whether someone knows what they're talking about if you make them stand up and actually explain the matter to you. There's no ChatGPT to help you there.

So in terms of the disruptive-technology question, is this the next calculator?
I don't think it's going to break formal assessment like that, but the job-hiring side is a problem; how do you assess that? I see in Table 1 there are some tests it couldn't pass in this preprint.

Yeah. Coming back to the preprint, where they basically gave it the exam questions for an intro-to-bioinformatics course, there were things it wasn't able to do. They highlighted that out of the 180-odd exercises they gave it, there were five it just couldn't solve, no matter how many attempts it got. They prompted it something like ten times, over and over, to see if it would eventually figure itself out, and there were five things it just could not do. I'll pick one out. The exercise was to find words that start with a vowel, in an assignment about regular expressions, and the prompt was to find words in a biological text that begin with a vowel. So it's just reading, just comprehending text; there's no real programming involved, and it was not able to do it. It couldn't write regular expressions to solve that. It had problems trying to handle extra spaces and punctuation marks in the test data. When you think about it, that makes sense: to write that sort of regular expression, you need to actually understand the kind of input that's coming in. If you can't conceptualise the space of different inputs you could get and write a regex around it, you're going to have a problem. (There's a quick sketch of one way to tackle that exercise at the end of this segment.)

So for all the Perl people out there, there's hope for you yet; you won't get replaced by ChatGPT.

Oh yeah, regular expressions. I suppose it's because most people are pretty bad at them too, so there's probably not that much training data out there to feed it.

Yeah. Also, I've been teaching that intro-to-bioinformatics course for, I guess, two semesters now. Do I need to be worried? Is it enough that the students are bound by an ethics code, like "don't cheat on this"? Or do I need to write more sophisticated multiple choice?

It depends on whether they have a computer available.

They do have a computer.

Then that's a problem. It would be hard to trip up an AI with multiple choice like that; the answers would be fairly obvious to it. Yeah, it's a hard one. And being bound by an ethics code has never stopped anyone from trying to get a leg up.

I'll have to think more on that. Do you want to move on in the articles you have here?

I think that's mainly it. The last thing they comment on in the preprint that's interesting is that other people in the past have tested other large language models, so it's not something magical that this particular implementation can do. Things like AlphaCode, or OpenAI's Codex, which I think was a precursor, were also able to solve these kinds of problems; not these specific exercises, but this is a long-standing thing. That's what people who have been in the field tell me: this doesn't do anything that hasn't been in large language models for some time.
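As an aside, here is a minimal sketch of the vowel-words exercise described above, assuming the intent is simply to pull out words that begin with a vowel while tolerating punctuation and extra whitespace; the sample sentence is made up for illustration.

```python
import re

text = "The  actin gene encodes an essential, evolutionarily conserved protein."

# \b is a word boundary, [AEIOUaeiou] matches a leading vowel, and \w* takes
# the rest of the word. Anchoring on word boundaries (rather than splitting on
# single spaces) is what copes with punctuation and runs of whitespace.
vowel_words = re.findall(r"\b[AEIOUaeiou]\w*", text)
print(vowel_words)  # ['actin', 'encodes', 'an', 'essential', 'evolutionarily']
```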
What I think the magic here is, is that they've made it so incredibly accessible, because it's just a web page with a prompt: you can ask it whatever you like, feed it whatever input, and off it goes. That's where people can now start playing around with it, and I suppose that's really the real revolution here. That was November 2022, when they released ChatGPT to the public.

One of the other things they point out, which is quite an interesting result, is that the number of lines of code the AI generates is comparable; it corresponds more or less to the instructor's best answer. So it's not only solving the questions in the sense of producing the right output when you run the code against the various test inputs; the shape of the code, at least in terms of the number of lines, is comparable to what you'd expect. That matches what we found in the previous episode: it's able to solve the problem, but it's also able to write good code. That's what I've noticed; it writes decent code. It doesn't do something strange, like you ask it to evaluate something and it writes a hundred if statements, something really dumb that would produce the correct output but is semantically just wrong, case after case after case, over and over.

I did find that the longer the block of code it produces, the more likely it is to get into a loop and start repeating itself. I had a case where it was just doing some simple stats calculations, counting up numbers and working out percentages, and it kept producing the same block of maybe ten lines over and over again. It had forgotten what it was doing, because the output was so long, and it just got into this loop, which is kind of interesting.

That's an interesting point. In this particular preprint, none of the answers required long code; the instructor's best implementations were never more than about 30 or 40 lines. These are very, very short scripts, and the fact that it solves them so well, that it performs so well, is probably due to that: it is doing something very simple. Thirty lines is not a complex task. An average script is longer than that. If I give someone a task, I'm expecting a script of around a hundred lines, and some of the scripts we've been looking at run to two or three hundred lines. They're still fairly trivial tasks, but they're that bit more complicated. I'm not sure, if we asked those sorts of questions of the AI, whether it would be able to perform, and it probably would not.

Do you think that with increasing prompt sizes and the number of tokens you can put in, that will basically just be solved overnight? Because at the moment, what is it, you can put in about 4,000 tokens, so roughly 3,000 words at a time. They're talking about GPT-4, where some people have access to 32,000 tokens, and I know other people in a research context are doing something like a hundred thousand tokens.
And so if the prompt size just gets bigger and bigger, maybe we might actually get better and better code, because you can squeeze the whole thing in rather than squeezing a smaller block into memory.

It depends on the problem. What we're talking about here is ChatGPT as a general language model. If you go back and do that kind of fine-tuning, you can probably get much better performance, and I've seen claims of that. You see all these web pages and tools, people pointing out "oh, this is better than ChatGPT, it's been optimised for writing programs", and things like that. There are other models out there. So in time, yes, I think these will be fine-tuned and we'll probably get better performance.

Although, are these other models really better? ChatGPT was trained on something like 10,000 high-powered GPU cards, A100s or something like that, I don't even know if that's the right model number, but phenomenally expensive NVIDIA cards; they used 10,000 of them and each one probably costs $10,000 as well. These are billion-dollar calculations that people are undertaking just to create one model, which obviously then has a compounding effect, because the model is worth even more. But not everyone can just stump up a billion. Even Google, with Bard, has tried to jump into this space and seems not to have done very well initially; they're still lagging behind in terms of their AI. I think a lot of the problems there are because they have something like twenty different teams all trying to compete internally with each other to fix AI, because suddenly OpenAI and Microsoft have gotten a jump on them.

The jury's still out for me. I've seen people talk about this, saying it's specialised so it's better, and I'm not convinced. It's early days; we'll have to see how it goes.

I think the big thing is going to be when you have these models that don't need an internet connection. If you can have something like GPT on, say, a sequencing instrument, on a Nanopore device or something like that, you could potentially do very different things. You're opening up a whole new world of potential applications. If it doesn't need the internet, it can work much faster, in near real time, on whatever device it's on. And in terms of actually processing data, maybe you start sequencing something and it guesses the rest of the molecule, or guesses the species, based on what it's seen, because it is, after all, predicting the next word or whatever comes next.

Here's a pro tip if anyone from Illumina is listening; here's a free idea. Why don't you get the machine to autocorrect my sample sheet when it's wrong, instead of just telling me, hey, this doesn't work? That would be fantastic, if we could use the AI just to fix the number of commas in the sample sheet, or whatever the format problem is, or to tell me that my indexes are garbage or something like that.

Or when you've done 10x of sequencing, it goes and makes up the other 1000x, and then it's like, oh wow, I've got a great sequencing run here.

That's what I'm worried about, honestly. I'm already not happy with some of the read-correction stuff, where it tells you these are the raw reads, and you're like, no, these aren't the raw reads, you've gone back and done some shenanigans.
I'm not happy about that. So to what you're suggesting: no, I really hope they don't go that way, where it starts making up data.

I would hope base calling will drastically improve as well, because effectively a lot of this is similar stuff. A lot of what we read in bacteria are words written in amino acids, and that is a language. I wonder whether a large language model like this could actually have a huge positive effect on read correction, because it will know: I've got these six amino acids, so the seventh one is probably going to be this, because it usually is; and look, it looks like there's an error there, maybe I'll correct that to what it probably should be. You could actually end up with some really, really cool stuff coming out the other end.

There's a very interesting blog post from Stephen Wolfram, not addressing this specifically, but about language, which is what most of these models are trying to model, where he explains what ChatGPT is doing and why it works. One of the interesting take-homes at the end is that language is structured; it's not random, and there are trends. When you think about it, there obviously are certain tokens, certain words, that tend to follow one another, and that makes language predictable, something a machine can predict if it sees enough of it. For him, one of the conclusions is that we tend to think of language as very fluid, but it's not; it's a very structured, stodgy thing. And that is obviously going to apply to the genetic code too, because that is also a structured, stodgy thing, which could be interpreted using these sorts of approaches.

Since large language models are basically just predicting what the next word is, it makes me think of Markov models, like next-next-next-gen Markov models. Do you think they could therefore be used for things Markov models have been used for, like gene prediction?

I guess so. I'm still trying to figure out what the secret sauce is here that's got everyone so excited, versus what people have been doing in the past, beyond predicting the next token. The trick with these models is this idea of temperature. You have these things that just predict the next likely word; you can feed them a bunch of text and train a model. The problem is that if you're generating language that way, it gets boring very quickly, because the model just keeps picking the same words, and it does that thing Andrew was talking about where it starts to repeat itself, saying the same thing over and over. It hits a local maximum and doesn't break out of it. So they have this idea of temperature, which adds a component of randomness to break out of that and do something more creative, go off on a tangent. If you took that same concept into gene prediction, or any kind of genetics problem we have, you'd certainly think all of that would also be applicable.
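For listeners who want to see the temperature idea in code, here is a minimal sketch of temperature-scaled sampling over a toy vocabulary; the vocabulary and scores are invented purely for illustration and are not tied to any particular model.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Sample the index of the next token from unnormalised scores.

    Low temperature -> almost always the single most likely token
    (repetitive output); higher temperature flattens the distribution
    so less likely tokens get picked and the text wanders more.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

# Toy "next word" choices and scores, invented for illustration.
vocab = ["expressed", "present", "absent", "duplicated"]
logits = [2.0, 1.0, 0.5, 0.1]

for t in (0.2, 1.0, 2.0):
    picks = [vocab[sample_next_token(logits, t)] for _ in range(8)]
    print(f"temperature {t}: {picks}")
```

At low temperature the same word dominates, which is the repetitive behaviour described in the conversation; turning the temperature up is what lets the model break out of that loop.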
One fun idea would be: I have a gene, here are the sequences I've seen a million times, all these examples of this ortholog family. Why don't you predict for me all the possible alleles? Predict new, future alleles that would make sense, based on some constraints. Here's an antimicrobial resistance gene; what are all the configurations that are possible? You wouldn't be able to do that just from the sequence, you'd have to take the protein structure into account as well, but again, if we're talking about something that is structured and has rules, it would be predictable.

Well, I think that's all we have time for, folks. This has been a great little discussion on ChatGPT, its uses, and how it may impact our future lives. So stay tuned; this is a very fast-moving space. Hopefully we'll see you next time, and we haven't been replaced by machines. Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.