Sunday, July 4, 2010

Building a Simple Application Using 'libsequence'

Folks,

As I discuss in my programming language comparison post, for performance, C++ is a natural choice. Calculating the simplest set of summary statistics on a set of data may take 30 seconds when using my beloved Python, but just a few seconds using C++. When needing to crunch through thousands or even tens of thousands of datasets, this makes a HUGE difference!

Of course, if writing the script to do so in Python takes less than 20 minutes, writing the equivalent C++ program can easily take several hours or even the whole day depending on your memory gremlins. A lot of the development time can be saved by making use of existing libraries. One of these that is particularly relevant to phylogeography is Kevin Thornton's libsequence. This library implements a broad range of summary statistics calculations in C++, and also features a fairly robust FASTA-format parser.

Using this library, I wrote a simple program to calculate Fst statistics from a FASTA file in less than an hour, and a lot of this time was spent in just dealing with parsing/processing the user options and arguments. As I note in the comments in the post, however, for a production-grade application, I would rather use the much more robust and flexible command-line parser that I wrote for my Ginkgo application, which would cut down development time on this aspect of the program by a dramatic amount.

In addition, it took me a little while longer than I anticipated to figure out how to get the whole thing to compile and link using gcc. As such, I also thought that I would share a simple general recipe for compiling and linking a libsequence application using gcc.

Because blogger sucks at code layout (or because I suck at getting blogger to layout code correctly), the actual sample code and build instructions are actually presented in a post on my personal site instead of here.

I have also written up a "autoconf" and "automake" project skeleton that automates most of the build process. This can be found in an attachment to the post mentioned above (direct link here).