Folks,
As I discuss in my programming language comparison post, C++ is a natural choice when performance matters. Calculating the simplest set of summary statistics on a set of data may take 30 seconds when using my beloved Python, but just a few seconds using C++. When you need to crunch through thousands or even tens of thousands of datasets, this makes a HUGE difference!
Of course, where writing a script to do this in Python takes less than 20 minutes, writing the equivalent C++ program can easily take several hours or even a whole day, depending on your memory gremlins. A lot of that development time can be saved by making use of existing libraries. One that is particularly relevant to phylogeography is Kevin Thornton's libsequence. This library implements a broad range of summary statistics calculations in C++, and also features a fairly robust FASTA-format parser.
Using this library, I wrote a simple program to calculate Fst statistics from a FASTA file in less than an hour, and much of that time was spent just parsing and processing the user options and arguments. As I note in the comments in the post, however, for a production-grade application I would rather use the much more robust and flexible command-line parser that I wrote for my Ginkgo application, which would dramatically cut down development time on this aspect of the program.
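To give a rough feel for what such a calculation involves, here is a minimal sketch in plain standard C++. It deliberately does not use libsequence, and it is not the program described above: every name in it (`FastaRecord`, `read_fasta`, `parse_fasta_text`, `hudson_fst`) is made up for illustration. It assumes aligned sequences of equal length, exactly two populations, and the simple estimator Fst = 1 - Hw/Hb, where Hw is the unweighted average of the two within-population mean pairwise differences and Hb is the mean pairwise difference between populations.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record type for this sketch (not a libsequence class).
struct FastaRecord {
    std::string name;
    std::string seq;
};

// Very small FASTA reader: '>' starts a record, other lines extend its sequence.
std::vector<FastaRecord> read_fasta(std::istream& in) {
    std::vector<FastaRecord> records;
    std::string line;
    while (std::getline(in, line)) {
        if (!line.empty() && line.back() == '\r') line.pop_back();  // tolerate CRLF
        if (line.empty()) continue;
        if (line[0] == '>') {
            records.push_back({line.substr(1), ""});
        } else if (!records.empty()) {
            records.back().seq += line;
        }
    }
    return records;
}

// Convenience wrapper for parsing FASTA text held in a string.
std::vector<FastaRecord> parse_fasta_text(const std::string& text) {
    std::istringstream in(text);
    return read_fasta(in);
}

// Number of sites at which two aligned sequences differ.
std::size_t count_differences(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());  // assumption: sequences are aligned
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) ++d;
    }
    return d;
}

// Mean pairwise differences among sequences of one population (0 if < 2 sequences).
double mean_within(const std::vector<std::string>& pop) {
    std::size_t total = 0, pairs = 0;
    for (std::size_t i = 0; i < pop.size(); ++i) {
        for (std::size_t j = i + 1; j < pop.size(); ++j) {
            total += count_differences(pop[i], pop[j]);
            ++pairs;
        }
    }
    return pairs == 0 ? 0.0 : static_cast<double>(total) / pairs;
}

// Mean pairwise differences between sequences drawn from different populations.
double mean_between(const std::vector<std::string>& p1,
                    const std::vector<std::string>& p2) {
    std::size_t total = 0, pairs = 0;
    for (const std::string& a : p1) {
        for (const std::string& b : p2) {
            total += count_differences(a, b);
            ++pairs;
        }
    }
    return pairs == 0 ? 0.0 : static_cast<double>(total) / pairs;
}

// Fst = 1 - Hw/Hb; returns NaN when Hb is zero (the ratio is undefined).
double hudson_fst(const std::vector<std::string>& p1,
                  const std::vector<std::string>& p2) {
    const double hw = 0.5 * (mean_within(p1) + mean_within(p2));
    const double hb = mean_between(p1, p2);
    if (hb == 0.0) return std::numeric_limits<double>::quiet_NaN();
    return 1.0 - hw / hb;
}
```

How sequences get assigned to populations is left to the caller here: you would read the records with `read_fasta`, sort their `seq` fields into two vectors by whatever naming convention your data uses, and pass those to `hudson_fst`. A real program would also need to handle gaps, missing data, and more than two populations, which is exactly the kind of work a library like libsequence saves you.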
In addition, it took me a little longer than I anticipated to figure out how to get the whole thing to compile and link using gcc. As such, I thought I would also share a simple general recipe for compiling and linking a libsequence application.
Because Blogger sucks at code layout (or because I suck at getting Blogger to lay out code correctly), the sample code and build instructions are presented in a post on my personal site instead of here.
I have also written up an "autoconf" and "automake" project skeleton that automates most of the build process. This can be found in an attachment to the post mentioned above (direct link here).
Sunday, July 4, 2010
Hello,
You might be interested in the Bio++ set of libraries, which contains classes that deal with alignments and trees, and functions to compute population genetics summary statistics, phylogenetic likelihood, ML or neighbor-joining trees, etc.
Website:
http://biopp.univ-montp2.fr/