Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S.

Hello, and welcome to the MicroBinfie podcast. All three of us are here today, Nabil, Andrew, and me, Lee. We're here to talk about what languages to learn when you get into bioinformatics. We previously talked about this on episode 13, but it's such an important topic that we felt that we should update and expand on it, and we might have even changed our minds. So Andrew, what is your view on this?

So I was giving a talk recently to some PhD students, and one of them asked, what language should I learn? And then after I've learned that language, what should I learn next? And it's actually a really, really good question, and I know it's very controversial as well because everyone has a totally different opinion on this. My personal opinion, you know, if you don't listen to the rest of the podcast: learn Python, then R, then something like C, C++, C#, or Java, and then you can learn the rest. So, you know, there you go. You can turn off the podcast now. You've heard everything. I think some of us will have a slightly different opinion. But just to expand on it maybe, my general opinion is that you should start off with a scripting language. Like you need something within your repertoire of things to do.
So you're going to find things like Python very, very useful as a quick scripting language. But then you also need a different type of language for doing basically munging data and doing stats and making pretty figures. And that's where R comes into its own. And then finally, if you're doing like really heavy duty programming, you're going to need a statically typed language like C++, C#, Java, or if you really like torturing yourself, C. And, you know, that fills a different gap. And for me, I would say these are the languages you should start off with and learn to an advanced level first, before you move on to maybe other things like, say, trendy languages like Rust or Haskell or Go or Ruby. There's a different one every year, you know, that's on trend. It's fine if you are an established programmer or established computer scientist, you know, you can jump in and out of languages because you've probably done a dozen of them already. But if you're just starting out learning this stuff, it's probably not the best idea to jump into one of these more obscure, trendy languages as a first try. And you can get easily misled as well, because people will be very enthusiastic about, you know, say, things like Rust or Haskell. But actually, you know, it possibly does limit what you can do in bioinformatics if you learn something that isn't very commonly used. So that's my opinion. What do you think, Nabil?

I'm a little bit more agnostic and I would say it depends. If you're trying to learn a language, like you're a PhD student and you don't have a formal course that you're applying for, you're going to need to learn off tutorials and you're going to need to learn off people who are in your environment around you. So you should look to who is in your space and what are they using and who's willing to help you and what do they know? Because once you get the concepts around programming, so what is a conditional? What is a loop?
What are primitives? How does that all fit together to do something in any language? You'll be able to transfer it to other languages. It's just a question of understanding the syntax. There are some that are quite different in terms of the paradigm, but for the most part, with a lot of the ones that Andrew's talked about, if you know one, it'll help you learn another and so on. So yeah, I'm not too fussed about it. It depends on what your environment is and what you're trying to achieve as well. In bioinformatics you don't see it too much, but some people might really want to do web development, or a web application as a bioinformatics tool that's online, that you do stuff with. Then Python can do that. JavaScript might be more appropriate for that. Depends on your project. So, you know, you'd sort of start towards what your final goal is as well. It's sort of an individual decision. What about you, Lee?

Yeah, you guys are making a lot of sense on all that. I think I started off on, actually, you guys might not even know this. I started with C, and I quickly forgot about that because I learned that in a course in college and went to PHP, which is also nauseating. And when I got to grad school, they taught us Perl. That was the language for bioinformatics. That was the latest and greatest in 2004. I think if I were in a vacuum, I would actually say start with Perl because I love it. But we don't work in a vacuum. We work with teams. We work on GitHub, which is inherently collaborative. And if you're going to do general purpose, Python. But after Python, I mean, there are like a whole host of languages you can go to for other reasons. Python is a sort of good generic language that everyone is using. But I mean, I totally agree with you guys. If you're going to get into something like stats, go with R. If you're going to get into, you know, there are lots of other important languages to learn.
Like if you are going to get into web development, JavaScript is the language. That's sort of my general overview.

Yeah, I'm curious, you brought up the thing of how you started, and I think all of us actually had different trajectories. So Andrew, I think you went, what did you start with? What was your order? Like X, Y, Z?

You threw me off with Z. I started off as a junior in college with a course in C, and then I dropped it for a little while. And then I wanted to get into web development with PHP. So I learned basically what they called LAMP at the time, and we don't have to go into that whole thing. That's a whole Pandora's box.

But you started with C, then PHP. So Andrew, what about you?

I started with C++, then Java, and MATLAB, Maple, Prolog, OCaml. I did software engineering as an undergrad, and then eventually into Perl, which I really enjoyed actually, and PHP. And yeah, I did a bit of LAMP as well, you know, it was all on trend for the time. Full stack developers, they call it now. And gosh, and then when I went out into the working world, I actually started off with Ruby, doing Rails. And then I went more into Perl, and then back again into C, and then into, God, Python. Like you know, once you learn one or two languages, it becomes easier and easier to learn more and more and more.

And so mine was different again. I started with Java, undergrad was a lot of Java. And then they sprinkle in MATLAB, and R, and a little bit of C, but mostly Java throughout. Little bit of SQL kind of thing. And then when I did my first research project, as my bioinformatics started out, I started out in Perl. I did one year of solid Perl, because that was what the lab did. And you know, I'm not just going to come in and just say, do something else. So it was one year of Perl, and then I managed to switch it over to Python after that, after I was more familiar and could be trusted.
But I guess there's a big difference between just learning the basics of a language, and then really, really getting into the depths of it.

Yeah, yeah.

Frameworks, and the libraries, and then all the little quirks, you know, that takes so long to get into. And people don't necessarily appreciate it. They might do a little course, and they might know, okay, well, I know an if statement and a for loop. But actually really getting deep into it, and fully understanding a language, it takes a lot of time. You know, it takes a year or two to get to a proper in-depth level. And that's where you're converting from another language into that language, you know? Like, you're already an expert in programming. It does take time. And it is something you need to appreciate, that you can't just think you can do a one-week course in Python, Perl, and Java, and then be an expert in all of them.

No, definitely. I think we're talking on a timeframe of at least a decade, right, when we're mentioning all these languages between the three of us.

My God, we're old.

Yeah. And the two years is if you're experienced in it. Like, I already knew PHP, which has a lot of roots with Perl, and learning Perl itself after PHP, it still took me, I would say, like five years to become really good at it. It takes a long time to learn a language.

Absolutely. And then there's other things that I don't really consider languages, but I suppose they are, like say SQL and bash scripting. You just kind of use them all the time, and I know they are languages, but I wouldn't even consider them myself personally to be languages that you go out and explicitly learn. They're kind of things you pick up.

Things like, well, SQL in particular, and I guess maybe bash to an extent, it's a different mode of working.
It's actually quite different to programming really, like what you're doing, how you're solving your problem, and getting your head around that mindset is useful, but it's, yeah, it's a bit different. It's a different thing to programming.

I would say like, I think that shell scripting is a totally legit programming language. It's just not something I would use day to day. Like it kind of makes system calls the top level priority, and every language has its top level priority, right? Like when you're just doing function after function, maybe Python or Perl is kind of, okay, Python is your thing. Like if your top level thing is functions, like actual mathematical functions and statistics, then R is great. And that's also in part because there are established libraries in R to do those things. A totally new language that just came around probably doesn't have all the bioinformatics libraries that you want.

And then of course, there's all the web stuff, like HTML and Markdown and a bit of JavaScript and VB. These are things that people will often employ and don't necessarily realize they're kind of languages in their own right. But they're useful and it is probably something you're going to learn and pick up as you go along. Although one important thing to note is that JavaScript and Java are not the same thing. Just like Paris and Paris Hilton are not the same thing.

Yeah. I think that they actually wanted their language to become very popular. So they named JavaScript after Java, if I remember right.

Yep. They're totally different. Like one is no typing whatsoever, you know, and it's just this free form language and the other is quite different.

Yeah. Yeah. Half the jokes on the programming humor subreddit are about JavaScript. Like it's just such a weird language.

It's become a bit better now. The old stuff is terrible. If you had to do stuff with the jQuery kind of stuff, it was just terrible.
Trying to figure out what the hell was going on. Your scope was leaky. You're calling variables from like, you know, a hundred lines ago. Why is it here? That was very difficult coming from a Java background, working in JavaScript. You're just like, what is going on?

It's pure unadulterated chaos. It's just Mad Max kind of programming languages.

Yeah. So using JavaScript, like it depends. I think that does depend on which framework you're using. So you mentioned jQuery. I was on a different one when I was doing it called Prototype. That was a little less mainstream. I've dabbled with React a little bit. And now with the new sort of ECMAScript 6 standards and things like that, it's a lot better. And I feel like it's more in line with other languages. So I feel more comfortable with it. So, you know, if you asked me like four years ago, would you suggest a beginner learn JavaScript? I'd be like, no. Now I'd be more inclined to be like, yeah, okay. They can give it a go.

So Nabil, if I gave you a task, you know, here's a file and go and pull some stats out from it. What language would be kind of your go-to without even thinking?

I'd have to use R these days. What about you?

I'm sorry, but my answer is Perl. R is a horrible language. I hate R so much. So are you a fan of R, Andrew?

I have done R. I don't like R because I do it so infrequently that every time I come back to it, I have to kind of relearn a lot of it. And so what I've been doing is actually getting other people to, you know, give me their code and I can modify it, which is, you know, obviously easier than actually going and remembering how exactly to do things in R. Because in R the syntax is quite different, and just getting your mind around like vector programming and all that kind of stuff, it just takes time.
I don't have headspace for that. So for me, my go-to, say for a quick script, will always be Perl or sometimes Python. If I think I might reuse it again, I'll write in Python straight off, but with Python generally you need more lines of code, whereas with Perl you need vastly fewer.

True. You did a few, I think I remember in Roary, you have a few scripts in there from R, is that right?

Do I? Oh, I think it definitely outputs stuff for R. Definitely. I can't remember. Have I put any R in there? Hopefully not because that'd be painful for people. What about you, Nabil?

I can't deny that ggplot and ggtree and a lot of these libraries for R are incredibly useful, and you can't really get the same visualization yourself unless you draw it from scratch. And I absolutely hate the fact that I love the output, because the syntax is difficult to work with. Not only is the language itself not familiar to me, like with other ones that I know, but you also have separate paradigms of approaching the same problem in R. For instance, you can write something in the native R language, which will be in one particular form. And then you have things like the dplyr logic, which is more functional and has a chain of calls that you apply to your data frame to produce a particular output. So it's like, you know, data frame dot mutate dot filter dot da da da da da. That's like a different thing you can do. And you can do both and you can mix that as much as you want. You can do native, you can do this, you can do something else. There's no consistency between the libraries. Like with one you'll say something dot save, and then you'll say something dot save fig. And then you'll say something dot something, something. And then you're like, what, why is this? None of it is consistent because it's just random bits put together.
So all my feeling towards R is perfectly encapsulated by this quote from Paul Agapow, who says, R's ultimate problem is the sum of its small madnesses. And it is precisely that. There are so many little things that undercut what you're trying to do, and trying to keep track of it is an absolute nightmare as a language, I find.

And this is why Python is so good, because there is one way to do it and that's it. It's not like R and Perl where they encourage you to have 200 different ways of doing the same thing and different conventions.

Well, there's list comprehensions, which is a different thing, which I think is fun in Python, but I get a little annoyed with list comprehensions as well, although I use them all the time in Python.

I love list comprehensions. They're so useful.

I know, but then it's like, you can do it in two ways then. And in the back of my mind, I'm like, this is fine, but don't start going down that R route where you're going to give me a third way of doing it, yet another way of doing it. Like I'm glad that they're tidying up, say, things like string interpolation, right? For Python, because there was a point in that Python 2, Python 3 time where you could do just basic printing, formatting of strings, in like four different ways. And like, why is this? Just do it one way. You don't need four different ways to do it. That gets my hackles up. And I think that's my upbringing on the Java side, because Java is quite strict with that sort of thing.

But then you'd be spending 200 lines of code, you know, just to read in a CSV file.

Yep. But then at least I know what type I'm getting.

I think that's why people went to Python and that's why it's so mainstream. It's like, there's one way to do it.
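To make the "two ways" and "four ways" remarks concrete, here is a minimal Python sketch. The read sequences and the sample name are made up for illustration, and the four formatting styles shown are simply four that exist in the standard language and library; they are not necessarily the exact four the speaker had in mind.

```python
from string import Template

# Hypothetical toy data, purely for illustration.
reads = ["ACGT", "GGCCTA", "AT"]

# Way 1: an explicit for loop that collects the length of each read.
lengths_loop = []
for read in reads:
    lengths_loop.append(len(read))

# Way 2: the same result written as a list comprehension.
lengths_comprehension = [len(read) for read in reads]

# Four ways of building the same formatted string.
name, count = "sample1", len(reads)
percent_style = "%s has %d reads" % (name, count)      # printf-style operator
format_method = "{} has {} reads".format(name, count)  # str.format
f_string = f"{name} has {count} reads"                 # f-string (Python 3.6+)
template_style = Template("$name has $count reads").substitute(
    name=name, count=count
)                                                      # string.Template
```

All of these produce identical results, which is the source of the complaint: the choice between them is a matter of convention rather than capability.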
Well, there might be like two ways or three ways, but if you're going to code in Python, the style of the language and the method that you do something is so rigorous that it makes it a lot more collaborative. I think that's why we've arrived here.

Python rides that rail really well. I think it doesn't stray too much into ambiguity, but it doesn't burden you with extra verbosity in terms of trying to specify type all the time and things like that.

And it is actually straightforward to get into as well. I know the other day, one of the lab techs was playing around with an Opentrons robot. So these are like open source robots you can buy very, very cheaply and you can program them with Python scripts. And the person was just able to jump straight in, make some Python scripts, and then away you go. And they were programming a liquid handling robot within a few hours, which is a phenomenal thing to do.

Yeah, definitely. And for a long time there have been things like microcontrollers that you can use Python to link into as well. So Python has this pervasiveness. It's not particularly good at anything.

I will point out that, you know, Python, when you want to start doing multithreading, it is awful. And you're like, why is this so difficult? So that's where you have to use C, and you're starting to go into, you know, the depths of, I don't know, multithreading with AVX, you know, splitting up loops and all this kind of jazz. That's where you really need to employ someone, you know, who really knows their stuff, if you want to get the absolute most out of a machine. And then, you know, if you want to go one level even further than that, you know, you've got GPU programming like CUDA and stuff like that. And that's, you know, an even higher level, because it's just mind-blowing how to get the most out of a GPU, you know, that's difficult.
And for most mathematicians, it's probably going to be a bridge too far, you know, to jump straight into, but fair play to those who do.

Yeah. I'll bring up another thing. For me, I like Python as a starting language rather than R, let's say, coming back to the point we were making about the underlying concepts of how you approach programming as a problem, regardless of the language. Just the concept of how you take a task and break it into a set of steps that you feed into a stupid machine that doesn't know what it's doing. Python is better at helping you get your head around those fundamentals. And I think R doesn't guide you. So I've seen a lot of R code that has some pretty nasty bad habits in how it's written. It's not modular. It's not easy to maintain. It's pretty rubbish with naming conventions and things like that. There's no guidance with that sort of thing. And Python is a little bit more strict, especially if you start building and linting from the beginning. There is a style guide, for instance, for Python, the PEP 8 style guide, that tells you, this is the convention of how you name things. This is the convention of how you approach particular tasks in Python. And so that I find is better because it helps you with the fundamentals. And the fundamentals for me is the key thing. If you know the fundamentals, you can program in any language.

So maybe a controversial question there, right, for you, Lee. But what language should someone not learn if they're just kind of starting out, if they're early career?

I would definitely go with the trendy languages for don't start with those. Not because they're bad or anything. They're probably good. But because you're getting into bioinformatics, which is already cutting edge. And you want to have something that has established code, established libraries.
You don't want to be the one to write the first FASTA parser in this language. And by the way, take it from me. I wrote bio.php. I wrote a bio.js before they did bio.js on that paper. Like it was not fun to do. And you want to use someone else's stuff. So let's say you're getting into the newest language. It will be hard to read FASTA files. Harder to read FASTQ files. Really hard to read SAM, VCF. You don't want to get into that stuff. So the ones that you don't want to learn: trendy. Trendy languages. Go ahead.

I guess it's easy to read a FASTA file, one that you produce. But reading FASTA files or FASTQ files consistently that are produced by thousands of people around the world and building a library for that, that's a much bigger problem and much harder to solve. Because you might say, oh, well, a FASTQ file, sure, it's four lines. I'll just read it in. Until you get like a single Nanopore read, which is a million bases long, or the FASTQ file is split up into more than four lines. And then you find, oh God, you know, now I have to consider all these other different things.

I've written GenBank parsers in three different languages and I would never. And they don't work. Like they work for what I was doing. But as a general solution, they don't work. As Andrew's saying, don't do it. Pick a language where you don't have to do that.

And then for me, I would say avoid languages like, I don't know, PHP. You're probably not going to get very far in bioinformatics with that kind of language.

Yeah, like what is employable, I suppose, in the bioinformatics space? Because obviously PHP and web stuff is a bit specialist, because generally we don't do web applications at the moment.

I would say, right, if you want to earn lots of money, right, the best languages to learn are the really trendy ones. However, it's a double-edged sword because those languages, you know, come and go like the changing tides.
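Going back to the FASTQ point above, here is a minimal sketch of the "sure, it's four lines, I'll just read it in" approach in Python. The function name `read_fastq_four_line` and the example records are hypothetical, and the reader deliberately assumes exactly four lines per record, so it rejects a file whose sequence is wrapped over several lines. That limitation is the problem described in the conversation; for real data an established parser (for example, Biopython's `SeqIO`) is the safer choice.

```python
import io
from itertools import islice


def read_fastq_four_line(handle):
    """Yield (name, sequence, quality), assuming exactly four lines per record."""
    while True:
        # Take the next four lines; an empty list means end of file.
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, separator, quality = record
        # A wrapped (multi-line) record fails here, because the third
        # line is more sequence rather than the '+' separator.
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"not a four-line FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield header[1:], sequence, quality


# Hypothetical two-record file held in memory.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGG\n+\n##\n")
records = list(read_fastq_four_line(example))
```

The sketch works for the tidy input it was written for and fails loudly on anything else, which is roughly the difference between a personal script and a general library.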
If you want to have, you know, an actual career, you should probably learn something that's more established, as Lee says, say like Python or whatever, because there's more job opportunities. Because you'll always get labs and always get companies that really want the trendy stuff because, you know, they want to wear polo necks or whatever, to be the Tim Cooks of the world, you know, these days, and you've got to be able to keep up with that.

I think one thing as well, in terms of learning, is the new, shiny, trendy languages are usually constantly changing from release to release, from version to version. So when you go and look up tutorials, right, even if they're six months old or one year old, they might not actually work. And then because you don't know programming very well, you won't actually be able to figure out why. And you won't be able to figure out, oh, it's just because they're using version two and I'm using version three, and it's different between them. You won't have that in your mind's eye. And that makes it really, really difficult to get your head around trendy languages.

I will mention, I read a report once upon a time that told me that the highest-paying languages to learn, like the languages where, if you know them and you get employed to write them, these pay the most, are COBOL and FORTRAN. In terms of, those are really trendy languages. I mean, COBOL is one of the first programming languages ever produced.

Yeah, go into that. Why would the older languages pay more?

Because they're legacy. The first thing is they're a pain to work with, because they're old and janky. It's like, you know, when you go and, I don't know, play video games from the 1980s, they're fun, but there's no UI, there's no help things for you. There's no tutorial. The second thing is that there's a lot of legacy systems that run on it.
Like if you think about railways, or you think about critical hospital software, things like that, they're running on FORTRAN. Old companies are using old COBOL databases as well, as legacy systems. And they really, really want to keep those going. And it's too difficult to migrate them to something else. So yeah, they pay a lot of money. Not relevant for bioinformatics, but just a fun thing: you're paid according to the urgency of the task, I suppose, as well.

So going into a different type of language, databasing, which is not really a programmatic language, right? It's just like inserting and retrieving data, or that's what the top level priority is. What do you guys feel about SQL?

I've found it useful for my own work. I've used SQL a lot more than I thought I would when I originally learned it. It is very handy to understand how to catalog and retrieve and search data in databases. There are other ways of getting the job done. There are interfaces that allow you to interact with databases so that you don't have to strictly learn the markup yourself as well. I think it is useful. It's useful to understand how data can be structured, because everything is data. You need, like, an understanding of the concepts. What is a primary key? What is a unique field? How do you map relationships between two tables of data? Conceptually, that's useful to know. And SQL is the implementation of it. So it's good like that, even if you don't strictly use it directly in your project, because you're not going to make a database. But I'm biased because I found it useful. Maybe you guys never had that click for you.

Well, I mean, I love SQL actually. I had to use it a lot in grad school and I continue to use it. And I get in trouble sometimes because I use SQLite, which is a little bit slower, but I find it useful because it's portable.
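The concepts just mentioned (a primary key, a unique field, and a relationship between two tables) can be sketched with SQLite through Python's standard `sqlite3` module. This is only an illustration under assumed names: the `sample` and `assembly` tables, their columns, and the values are all hypothetical and do not come from any database discussed in the episode.

```python
import sqlite3

# An in-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default

conn.executescript(
    """
    CREATE TABLE sample (
        sample_id INTEGER PRIMARY KEY,   -- primary key: one row per sample
        name      TEXT NOT NULL UNIQUE   -- unique field: no duplicate names
    );
    CREATE TABLE assembly (
        assembly_id INTEGER PRIMARY KEY,
        sample_id   INTEGER NOT NULL REFERENCES sample(sample_id),
        n_contigs   INTEGER NOT NULL
    );
    """
)

conn.executemany(
    "INSERT INTO sample (sample_id, name) VALUES (?, ?)",
    [(1, "isolate_A"), (2, "isolate_B")],
)
conn.executemany(
    "INSERT INTO assembly (sample_id, n_contigs) VALUES (?, ?)",
    [(1, 42), (1, 40), (2, 117)],
)

# The relationship between the two tables is followed with a join.
rows = conn.execute(
    """
    SELECT sample.name, assembly.n_contigs
    FROM sample
    JOIN assembly ON assembly.sample_id = sample.sample_id
    ORDER BY sample.name, assembly.n_contigs
    """
).fetchall()

# The UNIQUE constraint rejects a second sample with the same name.
try:
    conn.execute("INSERT INTO sample (name) VALUES (?)", ("isolate_A",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Because the whole database lives in a single file (or in memory, as here), the example needs no server, which is the portability point made about SQLite above.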
One thing I find really interesting in this area, this databasing language area, is that basically you're looking at SQL as the dominant language. And it's not going away unless you're looking at some big data databases like Hadoop. Like SQL is the language, and it's kind of down to the nitty gritty. Like, do you use SQL flavored or do you use MySQL flavored or what flavor? I actually find it kind of comforting. It's very streamlined across the field.

So I've done quite a lot of SQL. Like back in the day when I was doing my PhD, I actually made a search engine for like a social network and that had like tens of millions of rows in each table. And so we had to do optimization on just an insane level, you know? So like everything had to be absolutely nailed down because you didn't really have a big server, just a small, you know, cheap server. And we had huge amounts of data in a SQL database and had to do, you know, multi-table joins and all the indexing and, you know, indexing of indexes and all this kind of jazz. Like there's actually a huge amount of skill in going from, say, a small little toy database to something with vast quantities of data. And it takes a lot of skill, I think, actually to create those kinds of SQL queries in such a way that the underlying lookups will work in milliseconds instead of seconds, you know? And that means a lot when you're building web pages.

Yeah. And it's funny that I haven't done it to that extent, but I've had to think about that kind of problem. And then now in regular life, you sort of think, oh, I need to look through something. I need to cross-check a set of this versus that. But I should look in the table that has fewer elements first, because I'm not going to read through this whole, you know, thousand. So, you know, if I'm looking for something in one book versus another, look in the book that has 10 pages versus the one that has a thousand pages, because I find the relevant stuff there.
And then I can cross-check it in the other, because otherwise I have to read the full 1000 page thing. You have those funny moments where you learn this optimization problem, and then you start trying to solve everything with that. So I find it useful. It's a useful skill generally to understand how data works and how you manipulate it and how you just navigate it and get things out of it. I even find it cropping up all the time. And you understand why tools or other software or whatever behave the way they do. I don't know where that leaves you though, in terms of someone coming to me and saying, what language should I learn? That's just more of a life lesson on databases. Everything is a database these days.

Well, in computer science, everything is a network in my opinion. So if we just learn the fundamentals of networks and maths, we'll be sorted, and maybe a Turing machine thrown in.

One tip I give people is that in Python, everything is a dictionary. I've heard in Perl, everything is a hash. In Java, everything is an object.

That's all the time we have for today. We've been talking about what programming languages to learn. So what do you all think? Write to us on Twitter with your stories on how you got into programming and bioinformatics. Have a great conversation and see you next time. Bye.