Hello, and thank you for listening to the MicroBinfie podcast. Here we will be
discussing topics in microbial bioinformatics. We hope that we can give you some
insights, tips, and tricks along the way. There's so much information we all
know from working in the field, but nobody writes it down. There is no manual,
and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My
co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew
and Nabil work in the Quadram Institute in Norwich, UK, where they work on
microbes in food and the impact on human health. I work at the Centers for Disease
Control and Prevention and am an adjunct member at the University of Georgia in
the U.S. Hello, and welcome to the MicroBinfie podcast. All three of us are
here today, Nabil, Andrew, and me, Lee. We're here to talk about what languages
to learn when you get into bioinformatics. We previously talked about this on
episode 13, but it's such an important topic that we felt that we should update
and expand on it, and we might have even changed our minds. So Andrew, what is
your view on this? So I was giving a talk recently to some PhD students, and one
of them asked, what language should I learn? And then after I've learned that
language, what should I learn next? And it's actually a really, really good
question, and I know it's very controversial as well because everyone has a
totally different opinion on this. My personal opinion, you know, if you don't
listen to the rest of the podcast, learn Python, then R, then something like C,
C++, C Sharp, and Java, and then you can learn the rest. So, you know, there you
go. You can turn off the podcast now. You've heard everything. I think some of
us will have a slightly different opinion. But just to expand on it maybe, my
general opinion is that you should start off with a scripting language. Like you
need something within your repertoire of things to do. So you're going to find
things like Python very, very useful as a quick scripting language. But then you
also need a different type of language for doing basically munging data and
doing stats and making pretty figures. And that's where R comes into its own.
And then finally, if you're doing like really heavy duty programming, you're
going to need a statically typed language like C++, C Sharp, Java, or if you
really like torturing yourself, C. And, you know, that fills a different gap.
And for me, I would say these are the languages you should start off with and
learn to an advanced level first, before you move on to maybe other things like
say, trendy languages like Rust or Haskell or Go or Ruby. There's a different
one every year, you know, that's on trend. It's fine if you are an established
programmer or established computer scientist, you know, you can jump in and out
of languages because you've probably done a dozen of them already. But if you're
just starting out learning this stuff, it's probably not the best idea to jump
into one of these more obscure, trendy languages as a first try. And you can get
easily misled as well, because people will be very enthusiastic about, you know,
say things like Rust or Haskell. But actually they're, you know, it possibly
does limit what you can do in bioinformatics if you learn something that isn't
very commonly used. So that's my opinion. What do you think Nabil? I'm a little
bit more agnostic and I would say, if you're trying, it depends, if you're
trying to learn a language, like you're a PhD student and you don't have a
formal course that you're applying for, you're going to need to learn off
tutorials and you're going to need to learn off people who are in your
environment around you. So you should look to who is in your space and what are
they using and who's willing to help you and what do they know? Because once you
get the concepts around programming, so what is a conditional? What is a loop?
What are primitives? How does that all put together to do something in any
language? You'll be able to transfer it to other languages. It's just a question
of understanding the syntax. There are some that are quite different in terms of
the paradigm, but for the most part, if you know, so a lot of the ones that
Andrew's talked about, you can kind of, if you know one, it'll help you learn
another and so on. So yeah, I'm not too fussed about it. It depends on, on what
your environment is and what you're trying to achieve as well. Some people might
really, in bioinformatics, you don't see it too much, but some people might
really want to do web development or a web application as a bioinformatics tool,
one that's online that you do stuff with. Then Python can do that. JavaScript
might be more appropriate for that. Depends on your project. So, you know, you'd
sort of start towards what your final goal is as well. It's sort of an
individual decision. What about you, Lee? Yeah, you guys are making a lot of
sense on all that. I think I started off on, actually, you guys might not even
know this. I started with C and, and I quickly forgot about that because I
learned that in a course in college and went to PHP, which is also nauseating.
And when I got to grad school, they taught us Perl. That was the language for
bioinformatics. That was the latest and greatest in 2004. I think if I were in a
vacuum, I would actually say start with Perl because I love it. But we don't
work in a vacuum. We work with teams. We work on GitHub, which is inherently
collaborative. And if you're going to do general purpose, Python. But after
Python, I mean, there are like a whole host of languages you can go to for other
reasons. Python is a sort of good generic language that everyone is, is using.
But I mean, I totally agree with you guys. If you're going to get into something
like stats, go with R. If you're going to get into, you know, there are lots of
other important languages to learn. Like if you are going to get into web
development, JavaScript is the language. That's sort of my general overview.
Yeah, I'm curious, you brought up the thing of how you started, and I think all
of us actually had different trajectories. So Andrew, I think you went, what did
you start with? What was your order? Like X, Y, Z? You threw me off with Z. I
started off as a junior in college with a course in C, and then I dropped it for
a little while. And then I wanted to get into web development with PHP. So I
learned basically what they called LAMP at the time, and we don't have to go
into that whole thing. That's a whole Pandora's box. But you started with C,
then PHP. So Andrew, what about you? I started with C++, then Java, and Matlab,
Maple, Prolog, Caml. I did software engineering as an undergrad, and then
eventually into Perl, which I really enjoyed actually, and PHP. And yeah, I did
a bit of LAMP as well, you know, it was all on trend for the time. Full stack
developers, they call it now. And gosh, and then when I went out into the
working world, I actually started off with Ruby, doing Rails. And then I went
more into Perl, and then back again into C, and then into God, Python. Like you
know, once you learn one or two languages, it becomes easier and easier to learn
more and more and more. And so mine was different again. I started with Java,
that was undergrad was a lot of Java. And then they sprinkle in Matlab, and R,
and a little bit of C, but mostly Java throughout. Little bit of SQL kind of
thing. And then when I did my first research project, as my informatics started
out, I started out in Perl. I did one year of solid Perl, because that was what
the lab did. And you know, I'm not just going to come in and just say, do
something else. So it was one year of Perl, and then I managed to switch it over
to Python after that, after I was more familiar and could be trusted. But I
guess there's a big difference between just learning the basics of a language,
and then really, really getting into the depths of it. Yeah, yeah. Frameworks,
and the libraries, and then all the little quirks, you know, that takes so long
to get into. And people don't necessarily appreciate it. They might do a little
course, and they might know, okay, well, I know an if statement and a for loop.
But actually really getting deep into it, and fully understanding a language, it
takes
a lot of time. You know, it takes a year or two to get to a proper in-depth
level. And that's from where you're converting from another language into that
language, you know? You're like, you're already an expert in programming. It
does take time. And it is something you need to appreciate that you can't just
think you can do a one-week course in Python, Perl, and Java, and then be an
expert in all of them. No, definitely. I'm talking, I think we're talking on a
timeframe of at least a decade, right, when we're mentioning all these languages
for between the three of us. My God, we're old. Yeah. And the two years was, is
if you're experienced in it, like, I already knew PHP, which has a lot of roots
with Perl, and learning Perl itself after PHP. It still took me, I would say
like five years to become really good at it. It takes a long time to learn a
language. Absolutely. And then there's other things that I don't really consider
languages, but they, I suppose they are like say SQL and bash scripting. You
just kind of use them all the time, but like, and I know they are languages, but
I wouldn't even consider them myself personally to be languages that you go out and
explicitly learn, more kind of things you pick up. Things like, well, SQL in
particular, and I guess maybe bash to an extent, it's a different mode of
working. It's actually quite different to programming really, like what you're
doing, how you're solving your problem and getting your head around that mindset
is useful, but it's, yeah, it's a bit different. It's a different, different
thing to programming. I would say like, I think that shell scripting is, is a
totally legit programming language. It's just not something I would use day to
day. Like it, it kind of makes system calls like the top level priority and
every language has their top level priority, right? Like when you're just doing
like function after function, maybe Python or Perl is kind of, okay, Python is
your, is your thing. Like if your top level thing is functions, like actual
mathematical functions and statistics, then R is great. And then that's also in
part because there are established libraries in R to do those things. A totally
new language that just came around probably doesn't have all the bioinformatics
libraries that you want. And then of course, there's all the web stuff, like
HTML and Markdown and a bit of JavaScript and VB. These are things that people
will often employ and don't necessarily realize they're kind of languages and in
their own right. But they're useful and it is probably something you're going to
learn and pick up as you go along. Although one important thing to note is that
JavaScript and Java are not the same thing. Just like Paris and Paris Hilton are
not the same thing. Yeah. I think that they actually, they wanted their language
to become very popular. So they named JavaScript after Java, if I remember
right. Yep. They're totally different. Like one is no typing whatsoever, you
know, and it's just this free form language and the other is quite different.
Yeah. Yeah. Half the jokes on the programming humor subreddit are about
JavaScript. Like it's just such a weird language. It's become a bit better now.
The old stuff is terrible. If you had to do stuff with the jQuery kind of stuff
was just, it was terrible. Trying to figure out what the hell was going on. Your
scope was leaky. You didn't, you're calling variables from like, you know, a
hundred lines ago. Why is it here? That, that took me, that was, that was very
difficult coming from a job, coming from a Java background, working in
JavaScript. You just, you're like, what is going on? It's pure unadulterated
chaos. It's just Mad Max kind of programming languages. Yeah. So using
JavaScript, like it depends. I think that does depend on which framework you're
using. So you mentioned jQuery. I was on a different one when I was doing it
called Prototype. That was a little less mainstream. I've done, I've dabbled
with React a little bit. And, and now with the new sort of ES6 standards and
things like that, it's a lot better. And I feel like it's more in line with
other languages. So I feel more comfortable with it. So it, I, you know, I would
be more comfortable, you know, if you asked me like four years ago, would you
suggest a beginner to learn JavaScript? I'd be like, no, now I'd be more
inclined to be like, yeah, okay. They can give it a go. So Nabil, if I gave you
a task, you know, here's a file and gonna go pull some stats out from it. What
language would be kind of your go-to without even thinking? I'd have to use R
these days. What about you? I'm sorry, but my answer is Perl. R is a horrible
language. I hate R so much. So are you, are you a fan of R, Andrew? I have done
R. I don't like R because every time I do it so infrequently that every time I
come back to it, I have to kind of relearn a lot of it. And so what I've been
doing is actually getting other people to, you know, give me their code and I
can modify it, which is, you know, obviously easier than actually going and
remembering how exactly to do things in R. Because R is, the syntax is quite
different and just getting your mind around like vector programming and, and all
that, all that kind of stuff, it just takes time. I don't have headspace for
that. So for me, that my go-to will always be, say for a quick, quick script
will be Perl or sometimes Python, depending on if, if I think I might reuse it
again, I'll, I'll write in Python straight off, but Python generally, you need
more lines of code, whereas Perl you need vastly less. True. You, you did a few,
I think I remember in Roary, you, you have a few scripts in there from R, is that
right? Do I? Oh, I think it definitely outputs. It outputs stuff for R.
Definitely. I can't remember. Have I put any R in there? Hopefully not because
that'd be painful for people. What about you Nabil? I can't deny that ggplot and
ggtree and a lot of these tools are very, these, these libraries for R are
incredibly useful and you can't really get the same visualization yourself
unless you draw it by, from scratch. And I absolutely hate that fact that I love
the output because the syntax is, is difficult to work with. Not only do you
have the, the language itself isn't familiar to me, like with other ones that I
know, but you also have separate paradigms of approaching the same problem in R.
For instance, you can write something in the native R language, which will be in
one particular form. And then you have things like the dplyr logic, which is
more functional and more sort of has a chain of callbacks that you apply to your
data frame to produce a particular output. So it's like, you know, data frame
dot mutate dot filter dot da da da da da. That's like a different thing you can
do. And you can do both and you can mix that as much as you want. You can do
native, you can do this, you can do something else. You can, there's no sense
between the libraries. Like the libraries will have, one will say, you say
something dot save, and then you'll say something dot save fig. And then you'll
say something dot something, something. And then you're like, what, why is, and
it makes sense. None of it is consistent because it's just random bits put
together. So all of what I, all my feeling towards R is perfectly encapsulated
by this quote from, from Paul Agapow who says, R's ultimate problem is the sum
of its small madnesses. And it is precisely that. There are so many little
things that undercut what you're trying to do and trying to keep track of it. It
is an absolute nightmare as a language I find. And this is why Python is so good
because there is one way to do it and that's it. It's not like R and Perl where
they encourage you to have 200 different ways of doing the same thing and
different conventions. Well, there's list comprehensions, which is a different
thing, which I think is fun in Python, but I get a little annoyed with list
comprehensions as well, although I use them all the time in Python. I love, I
love list comprehensions. They're so useful. I know, but then it's like, you can
do it in two ways then. And I'm like, this in the back of my mind, I'm like,
this is fine, but don't start going into that R route where you're going to give
me yet another third way of doing it, yet another way of doing it. Like I'm glad
that they're tidying up, say things like string interpolation, right? For
Python, because there was a point in that Python 2, Python 3, you could do just
basic printing of script, formatting of strings in like four different ways. And
like, why is this, just do it one way. You don't need four different ways to do
it. That gets my hackles up. And I think that's my upbringing on a Java side
because Java is quite strict with that sort of thing. But then you'd be spending
200 lines of code, you know, just to read in a CSV file. Yep. But then I, at
least I know what type I'm getting. I think that's why, that's why people went
to Python and that's why it's so mainstream. It's like, there's one way to do
it. Well, there might be like two ways or three ways, but like, if you're going
to code in Python, like the style of the language and the method that you do
something is like so rigorous that it makes it a lot more collaborative. I think
that's why we've arrived here. Python rides that rail really well. I think it
doesn't stray too much in the ambiguity, but it doesn't burden you with, with
extra verbosity in terms of trying to specify type all the time and things like
that. And it is actually straightforward to get into as well. I know the other
day, one of the lab techs was playing around with an OpenTrons robot. So these
are like open source robots you can buy very, very cheaply and you can program
them with Python scripts. And the person was just able to jump straight in, make
some Python scripts, and then away you go. And they were programming a liquid
handling robot within a few hours, which is a phenomenal thing to do. Yeah,
definitely. And it's been a long time for things like microcontrollers that you
can use Python as well to link into those. So Python has this, it has this
pervasiveness. It's not particularly good at anything. I will point out that,
you know, Python, when you want to start doing multi-threading, it is awful. And
you're like, why is this so difficult? So that's where you have C, and
you're starting to go into, you know, the depths of, I don't know, multi-
threading with AVX, you know, splitting up loops and all this kind of jazz, like
that, that's where you really need to employ someone, you know, who really knows
their stuff, if you want to get the absolute most out of a machine. And then,
you know, if you want to go even at one level, even further than that, you know,
you've got GPU programming like CUDA and stuff like that. And that's, you know,
an even higher level because that's just mind-blowing how to get the most out of
a GPU, you know, that's difficult. And for most bioinformaticians, it's probably
going to be a bridge too far, you know, to jump straight into, but fair play to
do. Yeah. I would also, I'll bring up another thing that for me, I like Python
as a starting language rather than R, let's say, because coming back to the
point we were saying about the underlying concepts of how you approach
programming as a problem, regardless of the language, but just the concept of
how do you take a task and break it into these set of steps that you feed into a
stupid machine that doesn't know what you, what it's doing. Python is better at
helping you get your head around those fundamentals. And I think R doesn't guide
you. So I've seen a lot of R code that has some pretty nasty bad habits in how
they're written, things that they're not modular. They're not, they're not easy
to maintain. They're pretty rubbish with, with naming conventions and things
like that. There's no like guidance with that sort of thing. And Python is a
little bit more strict, especially if you start building and linting from the
beginning. There is a style guide, for instance, for Python, there's a PEP 8
style guide that tells you like, this is a convention of how you name things.
This is the convention of how you approach particular tasks in Python. And so
that I find is better because it helps you with the fundamentals. And the
fundamentals for me is the key thing. If you know the fundamentals, you can
program in any language. So maybe a controversial question there, right, for
you, Lee. But what language should someone not learn if they're just kind of
starting out, if they're early career? I would definitely go with the trendy
languages for don't start with those. Not because they're bad or anything.
They're probably good. But because you're getting into bioinformatics, which is
already cutting edge. And you want to have something that has established code,
established libraries. You don't want to be the one to write the first FASTA
parser in this language. And by the way, take it from me. I wrote bio.php. I
wrote a bio.js before they did bio.js on that paper. Like it was not fun to do.
And you want to use someone else's stuff. So let's say you're getting into the
newest language. It will be hard to read FASTA files. Harder to read FASTQ
files. Really hard to read SAM, VCF. You don't want to get into that stuff. So
the ones that you don't want to learn, trendy. Trendy languages. Go ahead. I
guess it's easy to read a FASTA file, one that you produce. But reading FASTA
files or FASTQ files consistently that are produced by thousands of people
around the world and building a library for that, that's a much bigger problem
and much harder to solve. Because you might say, oh, well, a FASTQ file, sure,
it's four lines. I'll just read it in. Until you get like a single Nanopore
read, which is a million bases long, or the FASTQ file is split up into more
than four lines. And then you find, oh God, you know, now I have to consider all
these other different things. I've written GenBank parsers in three different
languages and I would never. And they don't work. Like they work for what I was
doing. But as a general solution, they don't work. As Andrew's saying, don't do
it. Pick a language where you don't have to do that. And then for me, I would
say avoid languages like, I don't know, PHP. It's probably not going to be,
you're not going to get very far in bioinformatics with that kind of language.
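
To make the earlier FASTQ point concrete, here is a minimal Python sketch of a
reader that does not assume four lines per record. The name read_fastq is made
up for illustration and does not come from any library. It reads the quality
string by length rather than looking for the next '@', because '@' is also a
legal quality character. It is a toy under those assumptions, which is rather
the point of the discussion: for real data, an established parser is the safer
choice.

```python
from typing import Iterable, Iterator, Tuple


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from FASTQ text, wrapped or not.

    Hypothetical helper for illustration only.
    """
    it = iter(lines)
    for line in it:
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines between records
        if not line.startswith("@"):
            raise ValueError(f"expected '@' header, got: {line[:20]!r}")
        name = line[1:]

        # The sequence may span several lines; it ends at the '+' separator.
        seq_parts = []
        for line in it:
            line = line.rstrip("\r\n")
            if line.startswith("+"):
                break
            seq_parts.append(line)
        else:
            raise ValueError(f"record {name!r} has no '+' separator")
        seq = "".join(seq_parts)

        # The quality may be wrapped too, so keep reading until it is as
        # long as the sequence instead of stopping at a line starting '@'.
        qual = ""
        while len(qual) < len(seq):
            nxt = next(it, None)
            if nxt is None:
                raise ValueError(f"record {name!r} is truncated")
            qual += nxt.rstrip("\r\n")
        if len(qual) != len(seq):
            raise ValueError(f"record {name!r}: quality/sequence length mismatch")
        yield name, seq, qual
```

Calling list(read_fastq(open_file_handle)) works the same way on a classic
four-line file and on one where sequence and quality are wrapped over several
lines.
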
Yeah, like what's, what is employable, I suppose, in the bioinformatics space?
Because obviously PHP and web stuff is, it's a bit specialist because generally
we don't do web applications at the moment. I would say, right, if you want to
earn lots of money, right, the best languages to learn are the really trendy
ones. However, it's a double-edged sword because those languages, you know, come
and go like the changing tides. If you want to have, you know, an actual career,
you should probably learn something that's more established, as Lee says, say
like Python or whatever, because there's more job opportunities. Because you'll
always get labs and always get companies that really want the trendy stuff
because they want to, you know, they want to wear polo necks or whatever to be
the Tim Cooks of the world, you know, these days, and you've got to be
able to keep up with that. I think one thing as well, in terms of learning, is
the new, shiny, trendy languages are usually constantly changing from release to
release, from version to version. So not only are you struggling, so when you go
and look up tutorials, right, and if there's, even if they're six months old or
one year old, they might not actually work. And then because you don't know
programming very well, you won't actually be able to figure out why. And you
won't be able to figure out, oh, it's just because they're using version two and
I'm using version three. And it's different between, you won't have that in your
mind's eye. And that makes it really, really difficult to get your head around
trendy languages. I will mention, I've read a report once upon a time that told
me that the highest paying languages to learn, like the languages, if you
know them and you get employed to write them, these pay the most, is COBOL and
FORTRAN. In terms of, those are really trendy languages. I mean, COBOL is one of
the first programming languages ever produced. Yeah, go into that. Why would the
older languages pay more? Because they're legacy. The first thing is,
they're a pain to work with, because they're old and janky. It's like, you know,
when you go and, I don't know, play video games from the 1980s, they're fun, but
like, there's no like UI, there's no like help things for you. There's no
tutorial. The second thing is that there's a lot of legacy systems that run on
it. Like if you think about railways, like, or you think about critical hospital
software, things like that, they're running on FORTRAN. Old like companies are
using old, old COBOL databases as well as legacy systems. And they really,
really want to keep those going. And it's too difficult to migrate them to
something else. So yeah, they pay a lot of money. Not relevant for
bioinformatics, but just a fun thing of you're paid to the urgency of the task,
I suppose, as well. So going into a different type of language, databasing,
which is not really a programmatic language, right? It's just like inserting and
retrieving data, or that's what, that's what the top level priority is. What do
you guys feel about SQL? I found it useful for myself in the work that I
do. I tend, I've used SQL a lot more than I thought I would when I originally
learned it. It is very handy to understand how to catalog and retrieve and
search data to databases. It's not, unless you are someone, there are other ways
of getting the job done. There are interfaces that allow you to interact with
databases that you don't have to strictly learn the markup yourself as well. I
think it is useful. It's useful to understand how data can be structured because
everything is data. You need like understanding the concept. What is a primary
key? What is a unique field? How do you map relationships between two tables of
data? Conceptually, that's useful to know. And SQL is the implementation of it.
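
As a rough illustration of those three ideas (primary key, unique field,
relationship between two tables), here is a small sketch using Python's
built-in sqlite3 module. The table names, column names, and values are all
invented for the example and are not from any real project.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two related, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript(
        """
        CREATE TABLE sample (
            sample_id INTEGER PRIMARY KEY,   -- primary key: one row, one id
            name      TEXT NOT NULL UNIQUE   -- unique field: no duplicate names
        );
        CREATE TABLE assembly (
            assembly_id INTEGER PRIMARY KEY,
            sample_id   INTEGER NOT NULL REFERENCES sample(sample_id),
            n_contigs   INTEGER NOT NULL
        );
        """
    )
    conn.executemany(
        "INSERT INTO sample (sample_id, name) VALUES (?, ?)",
        [(1, "isolate_A"), (2, "isolate_B")],
    )
    conn.executemany(
        "INSERT INTO assembly (sample_id, n_contigs) VALUES (?, ?)",
        [(1, 42), (2, 117)],
    )
    conn.commit()
    return conn


def contigs_per_sample(conn: sqlite3.Connection) -> list:
    """Join the two tables on the shared key to map the relationship."""
    return conn.execute(
        """
        SELECT s.name, a.n_contigs
        FROM sample AS s
        JOIN assembly AS a ON a.sample_id = s.sample_id
        ORDER BY s.name
        """
    ).fetchall()
```

Inserting a second sample named isolate_A raises sqlite3.IntegrityError,
which is the unique constraint doing its job.
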
So it's, it's good like that. Even if you don't strictly use it directly in your
project, because you're not going to make a database. But I'm biased because I
found it useful. Maybe you guys never found that, that had that click for you.
Well, I mean, I love SQL actually. I had to use it a lot in grad school and I
continue to use it. And I get in trouble sometimes because I use SQLite, which
is a little bit slower, but I find it useful because it's portable. One thing I
find really interesting in this area, this databasing language area, is that
basically you're looking at SQL as a language is like the dominant language. And
it's not going away unless you're looking at some big data databases like
Hadoop. Like SQL is like the language and it's kind of like down to the nitty
gritty. Like, do you use SQL flavored or do you use MySQL flavored or what
flavor? I actually find it kind of comforting. It's very streamlined across the
field. So I've done quite a lot of SQL. Like back in the day when I was doing my
PhD, I actually made a search engine for like a social network and that had like
tens of millions of rows in each table. And so we had to do optimization on just
an insane level, you know? So like everything had to be absolutely nailed down
because you didn't really have a big server. A small, you know, cheap server.
And we had huge amounts of data in SQL database and having to do, you know,
multi-table joins and all the indexing and, you know, indexing of indexes and
all this kind of jazz. Like there's actually a huge amount of skill going from,
say, a small little toy database to something with vast quantities of data. And
it takes a lot of skill, I think, actually to create those kinds of SQL queries
in such a way that the underlying lookups will work in milliseconds instead of
seconds, you know? And that means a lot when you're building web pages. Yeah.
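
A small sketch of the indexing point, again with Python's sqlite3 module: the
same lookup is planned as a full table scan before an index exists and as an
index search afterwards. The profile table, its columns, and the index name
are invented for the example, and the exact wording of the plan text differs
between SQLite versions, so the sketch only looks for the index name in it.

```python
import sqlite3


def query_plan(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> str:
    """Return SQLite's plan for a query as one string (detail is the last column)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(str(row[-1]) for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profile (profile_id INTEGER PRIMARY KEY, username TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO profile (username, city) VALUES (?, ?)",
    ((f"user{i}", f"city{i % 100}") for i in range(10_000)),
)

lookup = "SELECT profile_id FROM profile WHERE username = ?"

# Without an index, SQLite has to walk every row to find the match.
plan_before = query_plan(conn, lookup, ("user42",))

# With an index on the searched column, it can jump straight to the row.
conn.execute("CREATE INDEX idx_profile_username ON profile (username)")
plan_after = query_plan(conn, lookup, ("user42",))

print("before:", plan_before)
print("after: ", plan_after)
```

On a toy table the difference is invisible, but on tens of millions of rows
this is the gap between a scan and a direct lookup that the speaker describes.
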
And it's funny that I haven't done it to that extent, but I've had to think
about that kind of problem. And then now in regular life, you sort of think, oh,
I need to look through something. I need to cross-check a set of this versus
that. But I should look in the table that has less elements first, because I'm
not going to read through like this whole like, you know, thousand. So, you
know, if I'm looking for something in one book versus another, look in the book
that has 10 pages versus the one that has a thousand pages, because I find the
relevant stuff there. And then I can cross-check it in the other, because
otherwise I have to read the full 1000 page thing. You have those funny moments
where this optimization problem, and then you start trying to solve everything
with that. So I find it useful. It's a useful, it's a useful skill generally to
understand how data works and how you manipulate it and how you just navigate it
and get things out of it. Even I find it propping up all the time. And you
understand why tools or other software or whatever behave the way they do. I
don't know. I don't know where that leaves you though, in terms of someone
coming to me and saying, what language should I learn? That's just more of a
life lesson on databases. Everything is a database these days. Well, in computer
science, everything is a network in my opinion. So if we just learn the
fundamentals of networks and maths, we'll be sorted, and maybe a Turing
machine is thrown in. One tip I give people is that in Python, everything is a
dictionary. I've heard in Perl, everything is a hash. In Java, everything is an
object. That's all the time we have for today. We've been talking about what
programming languages to learn. So what do you all think? Write to us on Twitter
with your stories on how you got into programming and bioinformatics. Have a
great conversation and see you next time. Bye.