Sunday, July 4, 2010

Building a Simple Application Using 'libsequence'

Folks,

As I discuss in my programming language comparison post, for performance, C++ is a natural choice. Calculating the simplest set of summary statistics on a set of data may take 30 seconds when using my beloved Python, but just a few seconds using C++. When needing to crunch through thousands or even tens of thousands of datasets, this makes a HUGE difference!

Of course, if writing the script to do so in Python takes less than 20 minutes, writing the equivalent C++ program can easily take several hours or even the whole day depending on your memory gremlins. A lot of the development time can be saved by making use of existing libraries. One of these that is particularly relevant to phylogeography is Kevin Thornton's libsequence. This library implements a broad range of summary statistics calculations in C++, and also features a fairly robust FASTA-format parser.

Using this library, I wrote a simple program to calculate Fst statistics from a FASTA file in less than an hour, and much of that time was spent just parsing and processing the user options and arguments. As I note in the comments in the post, however, for a production-grade application, I would rather use the much more robust and flexible command-line parser that I wrote for my Ginkgo application, which would dramatically cut development time on this aspect of the program.

In addition, it took me a little while longer than I anticipated to figure out how to get the whole thing to compile and link using gcc. As such, I also thought that I would share a simple general recipe for compiling and linking a libsequence application using gcc.

Because Blogger sucks at code layout (or because I suck at getting Blogger to lay out code correctly), the sample code and build instructions are presented in a post on my personal site instead of here.

I have also written up an "autoconf"/"automake" project skeleton that automates most of the build process. This can be found as an attachment to the post mentioned above (direct link here).

Sunday, June 13, 2010

Evolution 2010 Fast Approaching

It's hard to believe, but Evolution 2010 in Portland, OR is right around the corner. According to their website, it will be the largest Evolution meetings ever, with > 1,800 registrants and > 1,050 talks. Judging from the recently posted program, the meeting will have quite a lot to offer to those interested in all things phylogenetic. I count 6 sessions on "phylogenetic theory", a whopping 13 sessions on "phylogeography", 12 sessions on "phylogenetics & diversification", and lots of other related topics. Unfortunately, many of these sessions overlap on Saturday and Sunday. I hope everyone arrives rested and with a large coffee mug! (I'm serious about the mugs. The organizers are pushing to reduce waste at the meeting and are asking everyone to bring their own reusable beverage containers.)

On a side note, I will be acting as a mentor on Saturday for the Undergraduate Diversity program. Part of the program's purpose is to help participating undergraduates network by meeting researchers in the field. So, if you see me on Saturday, please stop for a quick hello and introductions.

Looking forward to seeing everyone who can make it!

Friday, May 21, 2010

Transferring Support Values Between Trees

Papers reporting phylogenetic trees often label some point estimate of the phylogeny (e.g., an ML tree) with support values derived from several methods (e.g., posterior probabilities, ML bootstraps, and MP bootstraps). Perhaps I'm merely ignorant of the available software, but I have never been able to find a way to easily transfer multiple support values from various consensus trees to a point estimate without having to recalculate the consensus values from the original collections of trees. This problem is particularly acute when the consensus trees differ in topology.

As general Python programming practice, and as a way to learn to use Jeet and Mark's great DendroPy library, I decided to take a crack at fixing this problem. I wrote a Python script, creatively titled transferBranchLabels.py, that takes a single point estimate of the phylogeny (w/ or w/o branch lengths) and an arbitrary number of trees with support values as command-line arguments. It then returns the point estimate (retaining branch lengths, if originally present) with all the support values labeling the internal nodes. If a node is present in the point estimate (target tree) but not in one of the support trees, the node is labeled with '-'. The output tree simply includes all the support values separated by some delimiter (I've used '/' for now). TreeView X doesn't like that format, but FigTree handles it just fine.
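In case it helps to see the core idea, here is a bare-bones sketch of the clade-matching logic. This is pure Python with trees hand-encoded as sets of leaf labels, not the actual script, which parses Newick trees via DendroPy; the function and variable names here are just for illustration:

```python
# Sketch of the clade-matching idea behind transferBranchLabels.py.
# Each internal node is represented by the frozenset of leaf labels
# it subtends; each support tree is a {clade: support value} map.

def transfer_supports(target_clades, support_trees, missing="-", delimiter="/"):
    """For each clade in the target tree, look it up in each support
    tree's map; absent clades get the `missing` marker. Returns a
    {clade: joined label} map."""
    labels = {}
    for clade in target_clades:
        values = [str(tree.get(clade, missing)) for tree in support_trees]
        labels[clade] = delimiter.join(values)
    return labels

# Example: a 4-taxon target tree with two internal clades, and two
# support trees (say, posterior probabilities and ML bootstraps),
# one of which lacks the second clade.
target = [frozenset("AB"), frozenset("ABC")]
posterior = {frozenset("AB"): 0.99, frozenset("ABC"): 0.87}
bootstrap = {frozenset("AB"): 95}  # clade ABC not in this consensus

print(transfer_supports(target, [posterior, bootstrap]))
```

The real work, of course, is in mapping each internal node of a parsed tree to its leaf set, which is what DendroPy handles.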

Here's an example run of the program:


Note that the program reports how many of the clades in each support tree were (not) found in the target tree. Here are the corresponding trees from FigTree:


Upper: Original ML tree. Lower: Labeled ML tree.


Upper: Consensus tree from a posterior distribution. Lower: Consensus tree from ML bootstrapping.

If this would be a useful utility for you, feel free to download it and give it a try. Let me know how it goes.

Friday, December 11, 2009

Who's Hungry for DIM SUM?

Kevin Savidge, Emily McTavish and I have been working on a program for the simulation of demography and individual migration events on a continuous landscape. We've decided to call it DIM SUM (Demography and Individual Migration Simulated Using a Markov chain). DIM SUM was primarily written by Kevin as an undergraduate research project.

DIM SUM is a stand-alone Java program that takes one main XML input file, as well as files specifying carrying capacity and borders across an arbitrary landscape. These secondary files can be provided as numeric matrices or as images. If images are used, particular color channels correspond to the different input variables. Because it can take images as input, you can output your favorite map from GIS software (with some color corresponding to how suitable a given pixel of habitat may be) and simulate directly on that landscape! It also uses great circle distances, in case the landscape on which you're simulating has a large latitudinal span (since two lines of longitude are closer to one another near the poles than at the equator).
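For the curious, the standard haversine formula is one common way to compute such great-circle distances. This is just a sketch of the general idea; I haven't checked exactly which formula DIM SUM uses internally:

```python
from math import radians, sin, cos, asin, sqrt

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Haversine great-circle distance between two (lat, lon) points
    given in decimal degrees, on a sphere of the given radius."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))

# One degree of longitude spans far less distance near the poles
# (~19 km at 80 degrees N) than at the equator (~111 km):
print(great_circle_km(0, 0, 0, 1))
print(great_circle_km(80, 0, 80, 1))
```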

As output, DIM SUM provides trees of ancestor-descendant relationships, the locations of all individuals, and can also plot all individuals directly on the landscape. Here's a simple example using the Hawaiian islands, where all land area is specified as equally good habitat. In this example, we've used four separate starting populations of size 1, with each population of origin tagged with a different color. Note how the population starting on the island with the greatest expanse of suitable habitat (the Big Island) begins to dominate the populations on the other islands. Here's the population after 5 generations with a population size of 21...




...after 25 generations with a population size of 2,288...


...after 200 generations with a population size of 2,635...



...and after 800 generations with a population size of 2,636...



Our original motivation for creating such a simulator was to generate data for the testing of phylogeographic methods (although DIM SUM has many other potential uses). In particular, these types of simulations should provide a more evenhanded test of coalescent-based methods and allow the testing of methods that model migration on a continuous landscape (Lemmon and Lemmon, 2008, Syst. Biol., 57(4): 544-561). One very nice feature of continuous landscape methods is that they allow you to infer the geographic locations (think latitude and longitude) of ancestors. In fact, you can plot a likelihood surface directly on the landscape! To do this, you sequentially constrain the location for the ancestor of interest to a series of locations across the landscape, calculate the maximum likelihood score, and then use this matrix of scores to approximate the surface.
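The grid-scan procedure just described can be sketched generically. The `score_fn` below is a hypothetical stand-in for the constrained maximum-likelihood optimization that a program like PhyloMapper actually performs; here I substitute a toy function just to show the scan-and-surface idea:

```python
def likelihood_surface(score_fn, lats, lons):
    """Grid-scan sketch of the ancestral-location surface described
    above: constrain the ancestor to each (lat, lon) cell in turn and
    record the maximized log-likelihood score for that constraint."""
    return [[score_fn(lat, lon) for lon in lons] for lat in lats]

# Toy stand-in score with a peak at (2, 3) on a 5 x 5 grid; a real
# score_fn would rerun the constrained ML analysis at each cell.
toy = lambda lat, lon: -((lat - 2) ** 2 + (lon - 3) ** 2)
surface = likelihood_surface(toy, range(5), range(5))
best = max((v, i, j) for i, row in enumerate(surface)
                     for j, v in enumerate(row))
print(best)
```

The resulting matrix of scores is exactly what you would contour to plot the surface (and its 95% confidence envelope) on the landscape.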

DIM SUM can generate data suitable for testing such a method, since the true location of the ancestor is known and can be compared to the inferred likelihood surface. We performed some simple simulations to try out the Lemmons' program (PhyloMapper) for inferring ancestral locations. In the plots below, the true location of the ancestor is marked with a "*".

Here's a surface for a simulation in which we allowed a population to expand from a single individual in the middle of the range for 5,000 generations and then double in size for another 5,000 generations. The area in red gives the 95% confidence envelope...



...and here's a case where we allowed both the population and range sizes to double (the range was originally constrained to the left half of the landscape)...



Note that there is some bias when migration occurs more often in one direction (when the range expands), but it wasn't substantial enough to reject the true location in this case. Given the new perspective on phylogeographic data that it provides, I hope PhyloMapper sees more widespread use and development.

If you're interested in using DIM SUM, you can download it at http://code.google.com/p/bio-dimsum. Please let us know what you think!

P.S. Count yourself lucky if you can find an undergrad who can write a program like this in under a semester!

Saturday, October 3, 2009

An Idiosyncratic Analogical Overview of Some Programming Languages from an Evolutionary Biologist's Perspective

R

R is like a microwave oven. It is capable of handling a wide range of pre-packaged tasks, but can be frustrating or inappropriate when trying to do even simple things that are outside of its (admittedly vast) library of functions. Ever tried to make toast in a microwave? There has been a push to start using R for simulations and phylogenetic analysis, and I am actually rather ambivalent about how I feel about this. On the one hand, I would much rather see an open-source platform like R used than some proprietary commercial platform such as Mathematica or Matlab. On the other hand, I do not think that R is the most suitable platform for the full spectrum of these applications. There are some serious limitations on its capability and performance when handling large datasets (mainly for historical design reasons), and, to be frank, I find many aspects of its syntax and idiom quite irritating. I primarily use it for what it was designed for: numerical and statistical computing, as well as data visualization and plotting, and in this context I am quite happy with it. For almost anything else, I look elsewhere. R has an excellent and very friendly community of people actively developing and supporting it, and I might change my view as R evolves. But, as things stand, it simply is not my first choice for the majority of my programming needs.

Python

If R is like a microwave oven, then Python is a full-fledged modern kitchen. You can produce almost anything you want, from toast to multi-course banquets, but you probably need to do some extra work relative to R. With R, if you want an apple pie, all you need to do is pick up a frozen one from a supermarket and heat it up, and you'll be feasting in a matter of minutes. With Python, you will need to knead your own dough, cook your own apple filling, etc., but you will get your apple pie, and, to be honest, programming in Python is such a pleasure that you will probably enjoy the extra work. And what happens if you instead want a pie with strawberries and dark chocolate and a hint of chili in the filling? With R, if you cannot find an appropriate instant pie in your supermarket, you are out of luck, or you might be in for a very painful adventure in trying to mash together some chimeric concoction that will look horrible and taste worse. But with Python, any pie you can dream of is completely within reach, and probably will not take too much more effort than plain old apple pie. From data wrangling and manipulation (prepping data files, converting file formats, relabeling sequences, etc. etc.) to pipelining workflows, and even to many types of analyses and simulations, Python is the ideal choice for the majority of tasks that an evolutionary biologist carries out.

C++

If R is like a microwave, and Python is like a modern kitchen, then C++ is like an antique Victorian kitchen in the countryside, far, far, far away from any supermarket. You want an apple pie? With C++, you can consider yourself lucky if you do not actually have to grow the apples and harvest the wheat yourself. You will certainly need to mill and sift the flour, churn the butter, pluck the apples, grind the spices, etc. And none of this "set the oven to 400°" business: you will need to chop up the wood for the fuel, start the fire and keep tending the heat while it is baking. You will eventually get the apple pie, but you are going to have to work very, very, very hard to get it, and most of it will be tedious and painful work. And you will probably have at least one or two bugs in the pie when all is done. More likely than not, memory bugs ...

Stepping out of the cooking/kitchen analogy, if I had to point out the single aspect of programming in C++ that makes it such a pain, I would say "memory management". Much of the time programming in C++ is spent coding up the overhead involved in managing memory (in the broadest sense, from declaration, definition and initialization of stack variables, to allocation and deallocation of heap variables, to tracking and maintaining both types through their lifecycles), and even more is spent in tracking down bugs caused by memory leaks. The Standard Template Library certainly helps, and I've come to find it indispensable, but it still exacts its own cost in terms of overhead and its own price in terms of chasing down memory leaks and bugs.

For an example of the overhead, compare looping through elements of a container in Python:

for i in data:
    print i

vs. C++ using the Standard Template Library:

for (std::vector<long>::const_iterator i = data.begin();
     i != data.end();
     ++i) {
    std::cout << *i << std::endl;
}

And for an example of an insidious memory bug even with the Standard Template Library, consider this: what might happen sometimes to a pointer that you have been keeping to an element in a vector, when some part of your code appends a new element to the vector? If the append triggers a reallocation, every element gets moved and your pointer is left dangling. It can be quite ugly.

So what does all that extra work and pain get you?

Performance.

When it comes to performance, C++ rocks. My initial attempt at a complex phylogeography simulator was in Python. It took me a week to get it working. I could manage population sizes of about 1,000 on a 1 GB machine, and it could complete 10,000 generations in about a week. I rewrote it in C++. Instead of a week, it took me two and a half months to get it to the same level of complexity. When completed, however, it could manage population sizes of over 1 million on a 1 GB machine, and run 2.5 million generations in 24 hours.

After that experience, when I am thinking of coding up something that might be computationally intensive or push the memory limits of my machines, the language that comes to mind is C++. More likely than not, however, I would probably still try to code up the initial solution in Python, and only turn to C++ when it becomes clear that Python's performance is not up to the task.

Java

Java, like Python, is a modern kitchen, allowing for a full range of operations with all the sanity-preserving conveniences and facilities (e.g., garbage-collection/memory-management). But it is a large, industrial kitchen, with an executive chef, sous chefs, and a full team of chefs de partie running things. And so, while you can do everything from making toast to multi-course meals, even the simplest tasks take a certain minimal investment of organization and overhead. At the end of the day, for many simpler things, such as scrambled eggs and toast, you would get just as good results a lot more quickly using Python.

I think that Java is a really nice language, and I do like its idioms and syntax, which, by design, enforce many good programming practices. It is also probably much more suited for enterprise-level application development than Python. But I find it very top-heavy for the vast majority of things that I do, and the extra investment in programming overhead that it imposes (think: getters and setters) does not buy me any performance benefit at all. As a result, I have not found a need to code a single application in Java since I started using Python several years ago.


Thursday, October 1, 2009

msBayes Basics Part VI: Tolerances and Local-Linear Regression Weighting of the Posteriors

This is the sixth of a multi-part series on msBayes basics. In the previous part, I discussed how you might run "msReject" to carry out rejection sampling on your samples from the prior, to obtain samples from the posterior. One crucial setting for this previous step was the choice of tolerance, and this is something I discuss in detail here.

First, let us consider why there is a need for tolerances. In general, the more summary statistics we use, the better our ability to discriminate between models and correctly approximate the posterior becomes (as long as we are reasonable about the summary statistics we use). But, at the same time, the more difficult it becomes to simulate a sample from the prior whose summary statistics exactly equal those calculated on our observed data. The variation in some of the nuisance parameters that we are trying to integrate out, as well as stochasticity in the processes that we are trying to model, make it difficult to generate a sample with identical summary statistics even under a perfect (correct) model, and even with just a few summary statistics. So we allow some "jiggle room": we say that we will accept any sample from the prior that comes close to the observed data in terms of the summary statistics, even if they are not exactly equal. This jiggle room is what we call the tolerance. The more summary statistics we use, the larger the tolerance we will need in order to accept simulations at a reasonable rate.
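In code, this "accept anything within the tolerance" rule amounts to something like the following sketch (a toy Python illustration of the general ABC rejection idea, not msBayes's actual implementation):

```python
import math

def abc_reject(prior_samples, obs_stats, tolerance):
    """Basic ABC rejection sketch: accept a prior draw into the
    posterior if the Euclidean distance between its summary
    statistics and the observed summary statistics is within
    `tolerance`. Each sample is a (params, stats) pair."""
    accepted = []
    for params, stats in prior_samples:
        dist = math.sqrt(sum((s - o) ** 2 for s, o in zip(stats, obs_stats)))
        if dist <= tolerance:
            accepted.append(params)
    return accepted

# Toy example: observed stats are (0, 0); only the nearby draw is kept.
draws = [({"theta": 0.1}, (0.05, -0.02)),
         ({"theta": 0.9}, (1.2, 0.8))]
print(abc_reject(draws, (0.0, 0.0), tolerance=0.1))
```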

However, as the tolerance gets larger and larger, we introduce more and more distortion into our approximation. This is because all samples accepted within the tolerance zone are weighted equally in the posterior: a sample far from the observed data in terms of summary-statistic distance contributes just as much to the posterior as one very close, as long as both are within the tolerance zone. In the limit, as our tolerance goes to infinity, we will accept all samples from the prior, and thus the direction of bias due to large tolerances is to pull our posteriors toward the prior.

As it happens, recent work by Beaumont et al. provides a solution to this problem. They use local-linear regression to weight the posteriors so that samples with summary statistics closer to the observed summary statistics are given a greater weight than those further away. With this, the analysis becomes relatively insensitive to the choice of tolerance, as long as the number of posterior samples is large enough to allow accurate regression. This is vividly illustrated in the data below, which shows the results of rejection sampling at different tolerances, comparing the mean of the raw posterior distribution of a parameter vs. the mean of the weighted posterior distribution of the same parameter:

TOL.     NUM.     UNWEIGHTED           WEIGHTED
1e-05       500   0.148147697394790    0.188623205810913
2e-05      1000   0.178475544544545    0.138229444728950
1e-04      5000   0.217240549309862    0.126658030148325
2e-04     10000   0.235878721472147    0.130784297340281
0.001     50000   0.308834221344427    0.123652684056336
0.002    100000   0.355224938539385    0.123306496530203

As can be seen, both the range and variance of the regression-weighted parameter distribution across different tolerances are an order of magnitude less than the unweighted parameter distribution (0.0653167 vs. 0.2070772, and 0.0006326896 vs. 0.006153922, respectively). The posterior sample size under the first tolerance setting, 1e-05, is 500, and this small size probably led to a relatively inaccurate regression, as is evidenced by the mean of the weighted distribution associated with it being rather far (> 7 s.d. units) from all the other values. Considering posterior sizes of 1000 or more, we find that the variance of the means of the weighted distributions vs. the unweighted is 3.843462e-05 vs. 0.005126309, and, regardless of our choice of the tolerance, we can safely place the mean between 0.12 and 0.14 (a difference of 0.02), while the mean for the unweighted parameter ranges from 0.18 to 0.36 (a difference of 0.18).

This informal analysis demonstrates that the local linear regression weighting approach of Beaumont et al. is remarkably effective in making the estimates insensitive to our choice of tolerances by correcting for the bias toward the prior introduced by large tolerances. This allows us to use larger tolerances in our analysis, which in turn allows us to use more summary statistics and yet collect samples from the posterior at a reasonable rate. Of course, the regression weighting only applies to estimates of continuous parameter values. When doing model selection (such as the numbers of divergence times), then this method is not suitable. In the case of numbers of divergence times, the best approach might be to use the regression to weight the variance and estimate of divergence times in your data (Var(t) and E(t)), and use the ratio of this (Omega = Var(t)/E(t)) to assess support for a single divergence time.
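For the intuition behind the adjustment, here is a one-dimensional sketch. The real method of Beaumont et al. is multivariate, regressing the parameter on all the summary statistics at once; the Epanechnikov kernel weights below follow their paper, but everything else is simplified, and it assumes some spread among the accepted distances:

```python
def regression_adjust(thetas, stats, obs_stat, delta):
    """1-D sketch of the Beaumont et al. (2002) local-linear
    adjustment: keep draws within distance `delta` of the observed
    summary statistic, weight them by an Epanechnikov kernel on that
    distance, fit a weighted linear regression of the parameter on
    the distance, and shift each accepted value onto the observed
    point (theta* = theta - beta * (s - s_obs))."""
    pts = [(t, s - obs_stat) for t, s in zip(thetas, stats)
           if abs(s - obs_stat) < delta]
    w = [1 - (d / delta) ** 2 for _, d in pts]  # Epanechnikov weights
    sw = sum(w)
    dbar = sum(wi * d for wi, (_, d) in zip(w, pts)) / sw
    tbar = sum(wi * t for wi, (t, _) in zip(w, pts)) / sw
    beta = (sum(wi * (d - dbar) * (t - tbar) for wi, (t, d) in zip(w, pts))
            / sum(wi * (d - dbar) ** 2 for wi, (_, d) in zip(w, pts)))
    return [t - beta * d for t, d in pts]

# A perfectly linear parameter/statistic relationship is adjusted
# exactly onto the value implied by the observed statistic:
adjusted = regression_adjust([1.0, 2.0, 3.0], [1.0, 2.0, 3.0],
                             obs_stat=2.0, delta=2.0)
print(adjusted)
```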

The story does not quite end here, however. A further complication lies in the way that msBayes carries out rejection sampling. In all the discussion so far, we have referred to the process of accepting samples from the prior that come within some specified distance (the tolerance) of the observed data. One would imagine carrying out simulations from the prior and, for each simulation, accepting the sample into the posterior only if the Euclidean distance between the summary statistics of the simulated data and the observed data falls within some specified tolerance distance. In this scheme, we would run the simulations from the prior continuously, building our samples from the posterior as we go along, until our posterior samples reach some reasonable size.

However, in implementing the rejection sampling, msBayes does things a little differently. As has been covered in painful detail, we generate samples from the prior of a specified size before we even begin the rejection sampling. The samples are then ranked in terms of the Euclidean distance of their summary statistics to the summary statistics of the observed data, and a pre-specified proportion of those samples closest in distance to the observed data is accepted into the posterior. It is confusing that this proportion is also called a tolerance, because it is not the same tolerance as discussed above. Rather, depending on the summary statistic values of the specific samples that happen to get accepted, an effective tolerance in the previous sense is implied or induced. For example, if we say that we will accept 0.02 of the 100000 prior samples into the posterior, then the 2000 samples out of the 100000 closest in summary statistic distance will constitute our sample from the posterior, irrespective of the actual absolute distances. The tolerance in the sense discussed previously will then be given by the largest distance between the summary statistics of the 2000 samples in the posterior and the summary statistics of the observed data.
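A small sketch of this rank-and-keep scheme, including the induced tolerance (again a toy Python illustration, not msBayes's actual code):

```python
import math

def accept_closest_fraction(samples, obs_stats, fraction):
    """msBayes-style "tolerance": rank prior samples by Euclidean
    summary-statistic distance to the observed data and keep the
    closest `fraction` of them, whatever the absolute distances turn
    out to be. Each sample is a (params, stats) pair. Returns the
    accepted samples and the induced tolerance (the largest accepted
    distance)."""
    def dist(stats):
        return math.sqrt(sum((s - o) ** 2 for s, o in zip(stats, obs_stats)))
    ranked = sorted(samples, key=lambda ps: dist(ps[1]))
    n_keep = max(1, int(round(fraction * len(ranked))))
    accepted = ranked[:n_keep]
    induced_tolerance = dist(accepted[-1][1])
    return accepted, induced_tolerance

# Toy run: keep the closest 40% of five prior draws.
draws = [({"t": i}, (s,)) for i, s in enumerate([0.1, 3.0, 0.5, 2.0, 10.0])]
accepted, tol = accept_closest_fraction(draws, (0.0,), fraction=0.4)
print([p["t"] for p, _ in accepted], tol)
```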

So where does this all leave us with regards to choice of tolerances? All I can offer here are some general habits that I practice, based on my experiences and understanding, which may or may not be applicable or appropriate in the context of your analyses (in other words, YMMV). I think that the most important guideline would be to have a posterior sample size of at least 1000. I personally would probably not be too comfortable with less than a million samples from the prior, and prefer to work with prior sizes of 10 million or greater. Thus, I generally tend to have my "tolerance" (sensu msBayes) on the order of 1e-4 to 1e-5 or so, and ensure that my prior sample size is large enough so this choice of "tolerance" results in posteriors of 1000 samples or more.

In the next post, I will provide an R script based on one that Mike Hickerson kindly sent to me to help carry out the local linear regression-based weighting of your posterior values, and discuss how to carry out this procedure.

References:


@article{beaumont2002approximate,
title={{Approximate Bayesian computation in population genetics}},
author={Beaumont, M.A. and Zhang, W. and Balding, D.J.},
journal={Genetics},
volume={162},
number={4},
pages={2025--2035},
year={2002},
publisher={Genetics Soc America},
url={http://www.genetics.org/cgi/content/full/162/4/2025}
}

@article{hickerson2006test,
title={{Test for simultaneous divergence using approximate Bayesian computation}},
author={Hickerson, M.J. and Stahl, E.A. and Lessios, H.A. and Crandall, K.},
journal={Evolution},
volume={60},
number={12},
pages={2435--2453},
year={2006},
publisher={BioOne}
}

@article{hickerson2007msbayes,
title={{msBayes: Pipeline for testing comparative phylogeographic histories using hierarchical approximate Bayesian computation}},
author={Hickerson, M.J. and Stahl, E. and Takebayashi, N.},
journal={BMC bioinformatics},
volume={8},
number={1},
pages={268},
year={2007},
publisher={BioMed Central Ltd}
}

Wednesday, September 30, 2009

msBayes Basics V: Rejection Sampling to Extract the Posteriors from Your Priors

This is the fifth of a multi-part series on msBayes basics. Here I will describe how to use programs distributed as part of the msBayes package to carry out rejection sampling on your simulations from the prior, and supply a Python script that makes the practical running of this a little simpler and less error-prone.

At this stage of the proceedings, you will have a large collection of samples simulated from the prior, in the form of a file of tab-delimited values. The first row will consist of column or field names, and each subsequent row will consist of two sets of fields that describe each sample: a set of parameters drawn from their prior distribution, and a set of summary statistics calculated on data simulated using those parameters. You should also have a file of summary statistics calculated on your observed data. Using the Python script to regularize columns between files, as discussed in a previous post, the columns in both the simulated and observed summary statistic files should now be compatible.

The next step is to run "msReject" on your samples from the prior. The call to this program requires that you specify, in order:
  • the path to your summary statistics file
  • the path to your file of samples from the prior
  • the sampling tolerance
  • a list of columns (1-based column index numbers) which identify the summary statistic columns
For example, the following:
$ msReject obs_sum_stats.txt priors.txt 0.2 6 7 8 9 10 11 12
asks "msReject" to carry out rejection sampling on the data in "priors.txt", using the summary statistics given in "obs_sum_stats.txt" as the observed summary statistics and a tolerance of "0.2", with columns 6 through 12 taken to be summary-statistic columns (and the remaining columns assumed to be prior/parameter columns).

It can be a little bit tedious, not to mention error-prone, to specify the summary statistic columns by hand when invoking "msReject". To this end, I've written a Python script that automatically identifies the summary statistic columns in the file and composes the proper call to "msReject". By default it prints the command to the standard output, to show you what it is doing, and then invokes the command itself. Alternatively, it can be run in a "no execute" mode, where it just prints the correct command and exits, so that you can copy the command and use it in your own scripts.
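The column identification itself is straightforward, since msBayes names its parameter columns with a "PRI." prefix (as you can see in the sample output below). Here is a sketch of how such a wrapper might split the header and compose the call; the file names and variable names are made up for illustration:

```python
def split_columns(header_fields, param_prefix="PRI."):
    """Separate parameter columns from summary-statistic columns in
    an msBayes prior file header: parameter columns carry a "PRI."
    prefix, everything else is a summary statistic. Returns 1-based
    column indices, as msReject expects."""
    params, stats = [], []
    for idx, name in enumerate(header_fields, start=1):
        (params if name.startswith(param_prefix) else stats).append(idx)
    return params, stats

# Hypothetical header and file names, just to show the mechanics:
header = ["PRI.Psi", "PRI.E.t", "pi.b.1", "pi.b.2", "wattTheta.1"]
params, stats = split_columns(header)
cmd = "msReject obs.txt priors.txt 0.01 " + " ".join(map(str, stats))
print(cmd)
```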

The following example shows it in action, running "msReject" with a tolerance of 0.01 and directing the results to the file "posterior_accepted_samples.txt":

$ run-reject.py obs_sum_stats priors1.txt 0.01 > posterior_accepted_samples.txt
Following columns identified as PARAMETERS:
[1]:"PRI.numTauClass"
[2]:"PRI.Psi"
[3]:"PRI.var.t"
[4]:"PRI.E.t"
[5]:"PRI.omega"

Following columns identified as SUMMARY STATISTICS:
[6]:"pi.b.1"
[7]:"pi.b.2"
.
.
(etc.)
.
.
[106]:"ShannonsIndex.Pop2.5"
[107]:"ShannonsIndex.Pop2.6"


Command to be executed:
msReject obs_sum_stats priors1.txt 0.01 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82
83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
101 102 103 104 105 106 107

Running with PID: 10927.
Completed with exit status: 0.

Once this process is done, you will have effectively sampled the posterior probability distribution of your parameters. The posterior probability of your model is approximately the proportion of times it has been sampled in the posterior distribution. So, for example, the proportion of samples with 1 divergence time gives the approximate posterior probability of a single distinct divergence time between every population pair in your dataset. I have a small Python script that I use to calculate the proportions of the various parameters, but you can also use R or Perl or what-have-you.
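The proportion calculation itself is a one-liner in any language; here is a sketch of the idea in Python (the PRI.Psi values below are made up for illustration):

```python
from collections import Counter

def model_posterior(psi_values):
    """Approximate the posterior probability of each model (here,
    each number of divergence times, Psi) as its proportion among
    the accepted posterior samples."""
    counts = Counter(psi_values)
    total = len(psi_values)
    return {psi: n / total for psi, n in counts.items()}

# E.g., the PRI.Psi column values of eight accepted samples:
print(model_posterior([1, 1, 2, 1, 3, 1, 2, 1]))
```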

One issue that I have not discussed in detail is the choice of tolerance and how it affects the analysis. As it turns out, your analysis can be quite sensitive to the choice of tolerance, and there are steps that you can take to correct for this. I will discuss the issue of tolerances in future posts, as well as provide an R script that weights the posterior probability of your parameters to reduce their sensitivity to your choice of tolerance.