The bioinformatics software we really need?


About the Author

Originally a laboratory scientist, PM stumbled into the world of computational biology in the dark days when Perl, CGI and C++ were tools of choice. Since then, he has lived and worked in Australia, South Africa and the UK on topics ranging from macroevolution through phylogenetics to computational embryology. Currently he works for the Health Protection Agency (UK), chasing and tracing pathogens and their evolution.


Software drives bioinformatics. We use it every day to analyze data. We use it to build other tools. In our everyday work, software is often the rate-limiting step.
In that case, why is so much bioinformatics software so bad? Why is it so cumbersome or so user-hostile? Why does it have strange and arbitrary limitations? Why does it make the things we do every day so hard?
It’s a rhetorical question. As an author of scientific software myself, I know how thankless the task is. It’s hard, often unrecognised or unappreciated* and you open yourself up to providing indefinite and free support to users** for years. So take this as a wishlist, constructive criticism and a call for alternatives.
* Here I’m thinking of the time a job interviewer described writing software as “merely functional”. Which makes me wonder if he would describe a billionaire as “merely wealthy” or a corpse as “merely dead”.
** Both users who know what they are doing and those who don’t.

Sequence and alignment editing

Surely this is one of the most common tasks in modern genomics. Yet it’s a surprisingly common complaint that bioinformaticists can’t find a sequence editor they’re happy with. Not that there aren’t quite a few choices around, but each seems to have its own quirks and shortcomings. I’ve never quite been able to grok Geneious, and it’s proven near-impossible to use on a laptop screen. Day-to-day, I probably use BioEdit the most. It’s old and looks it, runs only on Windows and has a mish-mash of commands and analyses organised in a way that has never made sense to me. There are a variety of other contenders, but each has its own problems (including using peculiar formats, see below).

User-friendly population genetics software

Recently, I had to do some simple pop gen analysis and turned to some of the more popular GUI packages. There followed a week of frustration as I struggled with data input, finding some (common and simple) functions in one program and some in another. One program produced obviously erroneous results while another simply stopped working every once in a while.
There is an answer to this: use the R package adegenet. But what if you’re not comfortable with R and the command line?

Let me use the format I want …

Converting between different sequence and alignment formats is a (mostly) solved task: the BioFoo libraries have shown that. So why do so many programs insist on dealing only in one or two? Why does program X use Fasta and program Y Stockholm? And why does program Z insist on using Clustal, meaning that I have to worry about taxon name length and whether a long informative identifier will be abbreviated into a meaningless short one?
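To show how little is involved, here is a minimal sketch of a Fasta-to-Stockholm conversion that keeps identifiers at full length. It uses only the Python standard library and is not how any of the BioFoo libraries do it; the function names read_fasta and to_stockholm are my own, and it assumes the input is already an alignment (all sequences the same length) with unique identifiers.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into an ordered {identifier: sequence} dict.

    Hypothetical helper: the identifier is the first whitespace-delimited
    token of the header, kept at full length (no truncation).
    """
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("FASTA header with no identifier")
            name = parts[0]
            if name in records:
                raise ValueError(f"duplicate identifier: {name}")
            records[name] = ""
        elif name is None:
            raise ValueError("sequence data before the first header")
        else:
            records[name] += line  # sequences may span several lines
    return records


def to_stockholm(records: Dict[str, str]) -> str:
    """Write aligned records as a minimal, non-interleaved Stockholm block."""
    if not records:
        raise ValueError("no records to write")
    if len({len(seq) for seq in records.values()}) > 1:
        raise ValueError("sequences differ in length; not an alignment")
    # Pad names to a common width so the sequence columns line up.
    width = max(len(name) for name in records)
    body = [f"{name.ljust(width)}  {seq}" for name, seq in records.items()]
    return "\n".join(["# STOCKHOLM 1.0", *body, "//"]) + "\n"


if __name__ == "__main__":
    fasta = [
        ">Homo_sapiens_mitochondrion_isolate_42 made-up example record",
        "ACGT-",
        "ACG",
        ">Pan_troglodytes",
        "ACGTTACG",
    ]
    print(to_stockholm(read_fasta(fasta)), end="")
```

The long identifier survives intact because nothing in the writer imposes a name-length limit. A real converter would also have to carry annotation lines and handle interleaved input, which this sketch ignores.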

… but not that one

Bioinformatics is plagued by old formats that are limited, ill-defined and just plain get in the way. For example, Nexus is used by a whole host of programs, but comes in a variety of flavours and interpretations. This means that you have to edit files from one flavour of the format to another in order to get them to work in different programs. Surely we could all use PhyloXML or NeXML instead?

More books and documentation on BioFoo

I know that documentation is a thankless task. But some parts of the Bio[Java|Perl|Python] libraries are described only as an API. This became apparent to me when I had to teach the libraries to students. What does this module do and why does it do it that way? Uh …

Phylogeny visualization and editing

Again, here we have a number of plausible candidates. FigTree offers a huge number of display options and produces nice, publication-ready figures. But the absence of any documentation, the inability to save (and later reopen) a tree visualization and the occasional refusal to draw some obvious and seemingly achievable combination have made me look elsewhere. But the story is the same everywhere, and meanwhile trees keep getting bigger and the things we need to do with them get more complicated …

Java scripting bioinformatics

BioJava is a good solid library and there are many other Java libraries that would be useful for bioinformatics. I’m less enthused about Java itself, but there are so many new and interesting scripting languages running on the JVM that combining one of these with BioJava could open up a whole swath of opportunities. So why hasn’t anyone done this?
(Actually, I have just heard of ScalaBio. While Scala can be used for fast and correct compiled computation, it has a scripting form as well. So this might be promising.)

Better use of NGS data

A recent blog posting by Art Wuster asked if it was cheaper to resequence a genome than to store the original NGS data for reanalysis later. (Answer: nearly but not quite.) Which brings up a good point: most NGS runs are assembled, analysed in a fairly shallow way and then forgotten about. We need more tools to look at variant data, examine quality and errors, and combine sequencing data from different technologies. We’re generating terabytes of data and extracting only megabytes of meaning.

Statistics libraries

As great as R is, as wide as the selection of libraries is, it’s still … R. It still has the slightly odd syntax and slightly weird behaviour. It’s also incongruous to do all your data production and analysis in one language (be it Perl, Python, Ruby …) and then have to import it into R to do a few graphs. So why hasn’t anyone done “R” in a “proper” language? Why don’t we have a statistics IDE using Python or Ruby with library installation from within and built-in visualization tools?
Actually, there are a few packages like this: Spyder, PythonXY etc. My take is that they are largely still for geeks: installation involves multiple compilation steps and ensuring the right prerequisites are there. It’s not a plug-and-play experience. And module installation is still depressingly manual. But, also in the Python world, stats packages are improving with things like Pandas and statsmodels.
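For what it’s worth, the very simplest summaries don’t need a trip to R at all. Here is a minimal sketch that stays in the language that produced the data, using only Python’s standard statistics module. The reads are made up, and the gc_content and summarise helpers are my own names for illustration; this is nowhere near a replacement for a proper stats package.

```python
from statistics import mean, stdev
from typing import Dict, List


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a sequence (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def summarise(values: List[float]) -> Dict[str, float]:
    """Sample size, mean and sample standard deviation of the values."""
    if len(values) < 2:
        raise ValueError("need at least two values for a standard deviation")
    return {"n": len(values), "mean": mean(values), "stdev": stdev(values)}


if __name__ == "__main__":
    # Made-up toy sequences, purely for illustration.
    reads = {"read_1": "GGCC", "read_2": "ATAT", "read_3": "ATGC"}
    gc_values = [gc_content(seq) for seq in reads.values()]
    print(summarise(gc_values))
```

Anything beyond this (model fitting, proper graphics) is where the lack of a friendly, integrated environment starts to bite.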