Friday, December 11, 2009

Who's Hungry for DIM SUM?

Kevin Savidge, Emily McTavish and I have been working on a program for the simulation of demography and individual migration events on a continuous landscape. We've decided to call it DIM SUM (Demography and Individual Migration Simulated Using a Markov chain). DIM SUM was primarily written by Kevin as an undergraduate research project.

DIM SUM is a stand-alone Java program that takes one main XML input file, as well as files specifying carrying capacity and borders across an arbitrary landscape. These secondary files can be provided as numeric matrices or as images; if images are used, particular color channels correspond to the different input variables. Because it can take images as input, you can output your favorite map from GIS software (with some color corresponding to how suitable a given pixel of habitat may be) and simulate directly on that landscape! It also uses great-circle distances, in case the landscape on which you're simulating has a large latitudinal span (since a degree of longitude spans a shorter distance near the poles than at the equator).
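
For the curious, the great-circle calculation is presumably equivalent to the standard haversine formula; here is a quick Python version (my own sketch, not DIM SUM's code) in case you want to check distances on your own landscape:

import math

def great_circle_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    # haversine great-circle distance (km) between two points in decimal degrees
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# one degree of longitude covers about 111 km at the equator...
print(great_circle_km(0.0, 0.0, 0.0, 1.0))
# ...but only about half that at 60 degrees latitude
print(great_circle_km(60.0, 0.0, 60.0, 1.0))

Note how the same one-degree step in longitude covers roughly twice the distance at the equator as at 60 degrees latitude, which is exactly why naive "flat" distances would be misleading on a landscape with a large latitudinal span.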

As output, DIM SUM provides trees of ancestor-descendant relationships, the locations of all individuals, and can also plot all individuals directly on the landscape. Here's a simple example using the Hawaiian islands, where all land area is specified as equally good habitat. In this example, we've used four separate starting populations of size 1, with each population of origin tagged with a different color. Note how the population starting on the island with the greatest expanse of suitable habitat (the Big Island) begins to dominate the populations on the other islands. Here's the population after 5 generations with a population size of 21...




...after 25 generations with a population size of 2,288...


...after 200 generations with a population size of 2,635...



...and after 800 generations with a population size of 2,636...



Our original motivation for creating such a simulator was to generate data for the testing of phylogeographic methods (although DIM SUM has many other potential uses). In particular, these types of simulations should provide a more evenhanded test of coalescent-based methods and allow the testing of methods that model migration on a continuous landscape (Lemmon and Lemmon, 2008, Syst. Biol., 57(4): 544-561). One very nice feature of continuous landscape methods is that they allow you to infer the geographic locations (think latitude and longitude) of ancestors. In fact, you can plot a likelihood surface directly on the landscape! To do this, you sequentially constrain the location for the ancestor of interest to a series of locations across the landscape, calculate the maximum likelihood score, and then use this matrix of scores to approximate the surface.
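
In Python terms, the whole procedure boils down to a simple grid evaluation. The sketch below assumes a stand-in function, constrained_ml_score(lat, lon), representing whatever call re-runs your analysis (in PhyloMapper, say) with the ancestor fixed at a given location and returns the maximum likelihood score; it is an illustration of the idea, not working PhyloMapper code:

import numpy as np

def likelihood_surface(constrained_ml_score, lat_range, lon_range, n=50):
    # Evaluate the constrained maximum likelihood score over an n x n grid of
    # candidate ancestor locations; the resulting matrix approximates the surface.
    lats = np.linspace(lat_range[0], lat_range[1], n)
    lons = np.linspace(lon_range[0], lon_range[1], n)
    surface = np.empty((n, n))
    for i, lat in enumerate(lats):
        for j, lon in enumerate(lons):
            surface[i, j] = constrained_ml_score(lat, lon)
    return lats, lons, surface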

DIM SUM can generate data suitable for testing such a method: because the true location of the ancestor is known, it can be compared directly to the inferred likelihood surface. We performed some simple simulations to try out the Lemmons' program (PhyloMapper) for inferring ancestral locations. In the plots below, the true location of the ancestor is marked with a "*".

Here's a surface for a simulation in which we allowed a population to expand from a single individual in the middle of the range for 5,000 generations and then double in size for another 5,000 generations. The area in red gives the 95% confidence envelope...



...and here's a case where we allowed both the population and range sizes to double (the range was originally constrained to the left half of the landscape)...



Note that there is some bias when migration occurs more often in one direction (i.e., when the range expands), but in this case it was not substantial enough for the true location to be rejected. Given the new perspective on phylogeographic data that it provides, I hope PhyloMapper sees more widespread use and development.

If you're interested in using DIM SUM, you can download it at http://code.google.com/p/bio-dimsum. Please let us know what you think!

P.S. Count yourself lucky if you can find an undergrad who can write a program like this in under a semester!

Saturday, October 3, 2009

An Idiosyncratic Analogical Overview of Some Programming Languages from an Evolutionary Biologist's Perspective

R

R is like a microwave oven. It is capable of handling a wide range of pre-packaged tasks, but can be frustrating or inappropriate when trying to do even simple things that are outside of its (admittedly vast) library of functions. Ever tried to make toast in a microwave? There has been a push to start using R for simulations and phylogenetic analysis, and I am actually rather ambivalent about this. On the one hand, I would much rather see an open-source platform like R used than some proprietary commercial platform such as Mathematica or Matlab. On the other hand, I do not think that R is the most suitable choice for the full spectrum of these applications. There are some serious limitations on its capability and performance when handling large datasets (mainly for historical design reasons), and, to be frank, I find many aspects of its syntax and idiom quite irritating. I primarily use it for what it was designed for: numerical and statistical computing, as well as data visualization and plotting, and in this context I am quite happy with it. For almost anything else, I look elsewhere. R has an excellent and very friendly community of people actively developing and supporting it, and I might change my view as R evolves. But, as things stand, it simply is not my first choice for the majority of my programming needs.

Python

If R is like a microwave oven, then Python is a full-fledged modern kitchen. You can produce almost anything you want, from toast to multi-course banquets, but you probably need to do some extra work relative to R. With R, if you want an apple pie, all you need to do is pick up a frozen one from a supermarket and heat it up, and you'll be feasting in a matter of minutes. With Python, you will need to knead your own dough, cook your own apple filling, etc., but you will get your apple pie, and, to be honest, programming in Python is such a pleasure that you will probably enjoy the extra work. And what happens if you instead want a pie with strawberries and dark chocolate and a hint of chili in the filling? With R, if you cannot find an appropriate instant pie in your supermarket, you are out of luck, or you might be in for a very painful adventure in trying to mash together some chimeric concoction that will look horrible and taste worse. But with Python, any pie you can dream of is completely within reach, and probably will not take too much more effort than plain old apple pie. From data wrangling and manipulation (prepping data files, converting file formats, relabeling sequences, etc.) to pipelining workflows, and even to many types of analyses and simulations, Python is the ideal choice for the majority of tasks that an evolutionary biologist carries out.

C++

If R is like a microwave, and Python is like a modern kitchen, then C++ is like an antique Victorian kitchen in the countryside, far, far, far away from any supermarket. You want an apple pie? With C++, you can consider yourself lucky if you do not actually have to grow the apples and harvest the wheat yourself. You will certainly need to mill and sift the flour, churn the butter, pluck the apples, grind the spices, etc. And none of this "set the oven to 400°" business: you will need to chop the wood for fuel, start the fire, and keep tending the heat while it is baking. You will eventually get the apple pie, but you are going to have to work very, very, very hard to get it, and most of it will be tedious and painful work. And you will probably have at least one or two bugs in the pie when all is done. More likely than not, memory bugs ...

Stepping out of the cooking/kitchen analogy, if I had to point out the single aspect of programming in C++ that makes it such a pain, I would say "memory management". Much of the time programming in C++ is spent coding up the overhead involved in managing memory (in the broadest sense, from declaration, definition and initialization of stack variables, to allocation and deallocation of heap variables, to tracking and maintaining both types through their lifecycles), and even more is spent in tracking down bugs caused by memory leaks. The Standard Template Library certainly helps, and I've come to find it indispensable, but it still exacts its own cost in terms of overhead and its own price in terms of chasing down memory leaks and bugs.

For an example of the overhead, compare looping through elements of a container in Python:

for i in data:
    print i

vs. C++ using the Standard Template Library:

for (std::vector<long>::const_iterator i = data.begin();
        i != data.end();      
        ++i) {
    std::cout << *i << std::endl;
}

And for an example of an insidious memory bug even with the Standard Template Library, consider this: what might happen to a pointer that you have been keeping to an element in a vector when some other part of your code appends a new element to that vector? If the append triggers a reallocation, the vector's contents move, and your pointer is left dangling. It can be quite ugly.

So what does all that extra work and pain get you?

Performance.

When it comes to performance, C++ rocks. My initial attempt at a complex phylogeography simulator was in Python. It took me a week to get it working. I could manage population sizes of about 1,000 on a 1 GB machine, and it could complete 10,000 generations in about a week. I rewrote it in C++. Instead of a week, it took me two and a half months to get it to the same level of complexity. When completed, however, it could manage population sizes of over 1 million on a 1 GB machine, and run 2.5 million generations in 24 hours.

After that experience, when I am thinking of coding up something that might be computationally intensive or push the memory limits of my machines, the language that comes to mind is C++. More likely than not, however, I would probably still try to code up the initial solution in Python, and only turn to C++ when it becomes clear that Python's performance is not up to the task.

Java

Java, like Python, is a modern kitchen, allowing for a full range of operations with all the sanity-preserving conveniences and facilities (e.g., garbage-collection/memory-management). But it is a large, industrial kitchen, with an executive chef, sous chefs, and a full team of chefs de partie running things. And so, while you can do everything from making toast to multi-course meals, even the simplest task takes a certain minimal investment of organization and overhead. At the end of the day, for many simpler things, such as scrambled eggs and toast, you would get results just as good a lot quicker using Python.

I think that Java is a really nice language, and I do like its idioms and syntax, which, by design, enforce many good programming practices. It is also probably much better suited for enterprise-level application development than Python. But I find it very top-heavy for the vast majority of things that I do, and the extra investment in programming overhead that it imposes (think: getters and setters) does not buy me any performance benefit at all. As a result, I have not found a need to code a single application in Java since I started using Python several years ago.


Thursday, October 1, 2009

msBayes Basics Part VI: Tolerances and Local-Linear Regression Weighting of the Posteriors

This is the sixth of a multi-part series on msBayes basics. In the previous part, I discussed how you might run "msReject" to carry out rejection sampling on your samples from the prior, to obtain samples from the posterior. One crucial setting for this previous step was the choice of tolerance, and this is something I discuss in detail here.

First, let us consider why there is a need for tolerances. In general, the more summary statistics we use, the better our ability to discriminate between models and correctly approximate the posterior becomes (as long as we are reasonable about the summary statistics we use). But, at the same time, the more difficult it becomes to simulate a sample from the prior that results in summary statistics exactly equal to the summary statistics calculated on our observed data. The variation in some of the nuisance parameters that we are trying to integrate out, as well as stochasticity in the processes that we are trying to model, make it difficult to generate a sample with identical summary statistics even under a perfect (correct) model and with only a few summary statistics. So we allow some "wiggle room": we say that we will accept any sample from the prior that comes close to the observed data in terms of the summary statistics, even if they are not exactly equal. And this wiggle room is what we call the tolerance. The more summary statistics we use, the larger the tolerance we will need in order to accept simulations at a reasonable rate.

However, as the tolerance gets larger and larger, we begin introducing more and more distortion into our approximation. This is because all samples accepted within the tolerance zone are weighted equally in the posterior, so a sample far from the observed data in terms of summary statistic distance contributes just as much to the posterior as one very close to it, as long as both are within the tolerance zone. In the limit, as our tolerance goes to infinity, we will accept all samples from the prior, and thus the direction of the bias due to large tolerances is to pull our posteriors toward the prior.

As it happens, recent work by Beaumont et al. provides a solution to this problem. They use local-linear regression to weight the posteriors so that samples with summary statistics closer to the observed summary statistics are given greater weight than those further away. With this, the analysis becomes relatively insensitive to the choice of tolerance, as long as the number of posterior samples is large enough to allow an accurate regression. This is vividly illustrated in the data below, which show the results of rejection sampling at different tolerances, comparing the mean of the raw posterior distribution of a parameter vs. the mean of the weighted posterior distribution of the same parameter:

TOLERANCE   POSTERIOR N   UNWEIGHTED MEAN     WEIGHTED MEAN
1e-05       500           0.148147697394790   0.188623205810913
2e-05       1000          0.178475544544545   0.138229444728950
1e-04       5000          0.217240549309862   0.126658030148325
2e-04       10000         0.235878721472147   0.130784297340281
0.001       50000         0.308834221344427   0.123652684056336
0.002       100000        0.355224938539385   0.123306496530203

As can be seen, both the range and the variance of the regression-weighted means across the different tolerances are an order of magnitude less than those of the unweighted means (ranges of 0.0653167 vs. 0.2070772, and variances of 0.0006326896 vs. 0.006153922, respectively). The posterior sample size under the first tolerance setting, 1e-05, is only 500, and this small size probably led to a relatively inaccurate regression, as is evidenced by the mean of the weighted distribution associated with it lying rather far (> 7 s.d. units) from all the other values. Considering posterior sizes of 1000 or more, we find that the variance of the means of the weighted distributions vs. the unweighted is 3.843462e-05 vs. 0.005126309, and, regardless of our choice of tolerance, we can safely place the mean between 0.12 and 0.14 (a difference of 0.02), while the mean for the unweighted parameter ranges from 0.18 to 0.36 (a difference of 0.18).

This informal analysis demonstrates that the local-linear regression weighting approach of Beaumont et al. is remarkably effective in making the estimates insensitive to our choice of tolerance, by correcting for the bias toward the prior introduced by large tolerances. This allows us to use larger tolerances in our analysis, which in turn allows us to use more summary statistics and yet collect samples from the posterior at a reasonable rate. Of course, the regression weighting only applies to estimates of continuous parameter values. When doing model selection (such as choosing among numbers of divergence times), this method is not suitable. In the case of numbers of divergence times, the best approach might be to use the regression to weight the variance and mean of the divergence times in your data (Var(t) and E(t)), and use their ratio (Omega = Var(t)/E(t)) to assess support for a single divergence time.
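
For the impatient, here is a bare-bones numpy sketch of the idea behind the Beaumont et al. adjustment, namely an Epanechnikov-weighted local-linear regression of the parameter on the summary statistic differences. It is not the R script mentioned below, and it ignores any standardization of the summary statistics, so treat it as an illustration rather than a drop-in tool:

import numpy as np

def regression_adjust(params, stats, obs_stats):
    # params:    1-D array of an accepted (posterior) parameter's values
    # stats:     2-D array (samples x summary statistics) for those accepted samples
    # obs_stats: 1-D array of the observed summary statistics
    diffs = stats - obs_stats                        # summary statistics centered on the observed
    dists = np.sqrt((diffs ** 2).sum(axis=1))        # Euclidean distances to the observed
    weights = 1.0 - (dists / dists.max()) ** 2       # Epanechnikov kernel (up to a constant)
    X = np.column_stack([np.ones(len(params)), diffs])
    Xw = X * weights[:, np.newaxis]
    # weighted least squares: solve (X'WX) b = X'W theta
    b = np.linalg.solve(np.dot(X.T, Xw), np.dot(Xw.T, params))
    adjusted = params - np.dot(diffs, b[1:])         # shift each sample to the observed point
    return adjusted, weights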

The story does not quite end here, however. A further complication is the way that msBayes carries out rejection sampling. In all the discussion so far, we have referred to the process of accepting samples from the prior that come within some specified distance (the tolerance) of the observed data. One would imagine carrying out simulations from the prior and, for each simulation, accepting the sample into the posterior only if the Euclidean distance between the summary statistics of the simulated data and those of the observed data falls within some specified tolerance. In this scheme, we would run the simulations from the prior continuously, building our samples from the posterior as we go along, until our posterior sample reaches some reasonable size.

However, in implementing the rejection sampling, msBayes does things a little differently. As has been covered in painful detail, we generate a sample from the prior of a specified size before we even begin the rejection sampling. The samples are then ranked in terms of the Euclidean distance of their summary statistics to the summary statistics of the observed data, and a pre-specified proportion of those samples closest in distance to the observed data are accepted into the posterior. It is confusing that this proportion is also called a tolerance, because it is not the same tolerance as discussed above. Rather, depending on the summary statistic values of the specific samples that happen to get accepted, an effective tolerance in the previous sense is implied or induced. For example, if we say that we will accept 0.02 of 100000 prior samples into the posterior, then the 2000 samples out of the 100000 that are closest in summary statistic distance will constitute our sample from the posterior, irrespective of their actual absolute distances. The tolerance in the sense discussed previously is then given by the largest distance between the summary statistics of the 2000 samples in the posterior and the summary statistics of the observed data.
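
Stripped of all file handling (and of any rescaling of the statistics that the real program may do), this msReject-style acceptance rule amounts to something like the following sketch, which also returns the induced tolerance:

import numpy as np

def accept_closest_fraction(stats, obs_stats, fraction):
    # Keep the given fraction of prior samples whose summary statistics lie closest
    # (in Euclidean distance) to the observed summary statistics; the largest accepted
    # distance is the induced tolerance in the earlier sense.
    dists = np.sqrt(((stats - obs_stats) ** 2).sum(axis=1))
    n_keep = max(1, int(round(fraction * len(dists))))
    accepted = np.argsort(dists)[:n_keep]
    return accepted, dists[accepted].max()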

So where does this all leave us with regard to the choice of tolerances? All I can offer here are some general habits that I practice, based on my experience and understanding, which may or may not be applicable or appropriate in the context of your analyses (in other words, YMMV). I think that the most important guideline is to have a posterior sample size of at least 1000. I personally would probably not be too comfortable with fewer than a million samples from the prior, and prefer to work with prior sizes of 10 million or greater. Thus, I generally tend to have my "tolerance" (sensu msBayes) on the order of 1e-4 to 1e-5 or so, and ensure that my prior sample size is large enough that this choice of "tolerance" results in posteriors of 1000 samples or more.

In the next post, I will provide an R script based on one that Mike Hickerson kindly sent to me to help carry out the local linear regression-based weighting of your posterior values, and discuss how to carry out this procedure.

References:


@article{beaumont2002approximate,
title={{Approximate Bayesian computation in population genetics}},
author={Beaumont, M.A. and Zhang, W. and Balding, D.J.},
journal={Genetics},
volume={162},
number={4},
pages={2025--2035},
year={2002},
publisher={Genetics Soc America},
url={http://www.genetics.org/cgi/content/full/162/4/2025}
}

@article{hickerson2006test,
title={{Test for simultaneous divergence using approximate Bayesian computation}},
author={Hickerson, M.J. and Stahl, E.A. and Lessios, H.A. and Crandall, K.},
journal={Evolution},
volume={60},
number={12},
pages={2435--2453},
year={2006},
publisher={BioOne}
}

@article{hickerson2007msbayes,
title={{msBayes: Pipeline for testing comparative phylogeographic histories using hierarchical approximate Bayesian computation}},
author={Hickerson, M.J. and Stahl, E. and Takebayashi, N.},
journal={BMC Bioinformatics},
volume={8},
number={1},
pages={268},
year={2007},
publisher={BioMed Central Ltd}
}

Wednesday, September 30, 2009

msBayes Basics V: Rejection Sampling to Extract the Posteriors from Your Priors

This is the fifth of a multi-part series on msBayes basics. Here I will describe how to use programs distributed as part of the msBayes package to carry out rejection sampling on your simulations from the prior, and supply a Python script that makes the practical running of this a little simpler and less error-prone.

At this stage of the proceedings, you will have a large collection of samples simulated from the prior, in the form of a file of tab-delimited values. The first row will consist of column or field names, and each subsequent row will consist of two sets of fields that describe each sample: a set of parameters drawn from their prior distribution, and a set of summary statistics calculated on data simulated using those parameters. You should also have a file of summary statistics calculated on your observed data. If you have used the Python script discussed in a previous post to regularize the columns between files, the columns in the simulated and observed summary statistic files should now be compatible.

The next step is now to run "msReject" on your samples from the prior. The call to this program requires that you specify, in order:
  • the path to your summary statistics file
  • the path to your file of samples from the prior
  • the sampling tolerance
  • a list of columns (1-based column index numbers) which identify the summary statistic columns
For example, the following:
$ msReject obs_sum_stats.txt priors.txt 0.2 6 7 8 9 10 11 12
asks "msReject" to carrying out rejection sampling on the data in "priors.txt", using the summary statistics given in "obs_sum_stats.txt" as the observed summary statistics, and a tolerance of "0.2", with columns 6 through 12 taken to be summary statistic columns (and the remaining assumed to be prior/parameter columns).

It can be a little bit tedious, not to mention error-prone, to have to specify the summary statistic columns when invoking "msReject". To this end, I've written a Python script that automatically identifies the summary statistic columns in the file and composes the proper call to "msReject". By default it prints the command to the standard output, to show you what it is doing, and then invokes the command itself. Alternatively, it can be run in a "no execute" mode, where it just prints the correct command and exits, so that you can copy the command and use it in your own scripts.

The following example shows it in action, running "msReject" with a tolerance of 0.01 and directing the results to the file "posterior_accepted_samples.txt":

$ run-reject.py obs_sum_stats priors1.txt 0.01 > posterior_accepted_samples.txt
Following columns identified as PARAMETERS:
[1]:"PRI.numTauClass"
[2]:"PRI.Psi"
[3]:"PRI.var.t"
[4]:"PRI.E.t"
[5]:"PRI.omega"

Following columns identified as SUMMARY STATISTICS:
[6]:"pi.b.1"
[7]:"pi.b.2"
.
.
(etc.)
.
.
[106]:"ShannonsIndex.Pop2.5"
[107]:"ShannonsIndex.Pop2.6"


Command to be executed:
msReject obs_sum_stats priors1.txt 0.01 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82
83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
101 102 103 104 105 106 107

Running with PID: 10927.
Completed with exit status: 0.

Once this process is done, you will have effectively sampled the posterior probability distribution of your parameters. The posterior probability of your model is approximately the proportion of times it has been sampled in the posterior distribution. So, for example, the proportion of samples with 1 divergence time gives the approximate posterior probability of a single distinct divergence time between every population pair in your dataset. I have a small Python script that I use to calculate the proportions of the various parameters, but you can also use R or Perl or what-have-you.
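
That script is nothing elaborate; something along the following lines will do the job (a minimal sketch, assuming the posterior file keeps the same tab-delimited, header-first layout as the file of samples from the prior):

import csv
from collections import Counter

def column_proportions(path, column="PRI.Psi"):
    # Tally one column of a tab-delimited sample file (header row of column
    # names required) and return each value's proportion of the total.
    counts = Counter()
    with open(path) as src:
        for row in csv.DictReader(src, delimiter="\t"):
            counts[row[column]] += 1
    total = float(sum(counts.values()))
    return dict((value, n / total) for value, n in counts.items())

# e.g., the approximate posterior probabilities of the number of divergence times:
# print(column_proportions("posterior_accepted_samples.txt", "PRI.Psi"))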

One issue that I have not discussed in detail is the choice of tolerance and how that affects the analysis. As it turns out, your analysis can be quite sensitive to the choice of tolerance, and there are steps that you can take to correct for this. I will discuss the issue of tolerances in future posts, as well as provide an R script that weights the posterior probability of your parameters to reduce their sensitivity to your choice of tolerance.

Sunday, September 20, 2009

msBayes Basics IV: Preparing Your Files for Sampling from the Posterior

This is the fourth of a multi-part series on msBayes basics. Here I will discuss some glitches/incompatibilities in the msBayes pipeline that you have to watch out for when going into the rejection sampling stage. I will also present a Python script that fixes the incompatibilities in the pipeline and makes the process easier.

The rejection sampling step requires two files: samples from the prior and summary statistics calculated on the observed data.

Our samples from the prior, as generated by the script "msBayes.pl", are given as a file of tab-delimited values, with the first row of the file giving the field names. The first several fields consist of the parameter values drawn from the prior, while the remaining fields are the summary statistics calculated on data simulated with those parameter values. Parameter fields can be identified by the leading "PRI." prefix in their names (e.g., "PRI.numTauClass", "PRI.Psi", "PRI.var.t"). The remaining fields are the summary statistics.
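
This naming convention is handy, because it means a script can sort the columns out for itself just by looking at the header row; in Python, for example:

def split_columns(header):
    # separate parameter columns from summary statistic columns by the "PRI." prefix
    params = [name for name in header if name.startswith("PRI.")]
    sum_stats = [name for name in header if not name.startswith("PRI.")]
    return params, sum_stats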

The summary statistics on the observed data are generated using the program "obsSumStats.pl". This is invoked with a single argument, the path to the master configuration file (discussed here), and by default writes the results to the standard output, but we will redirect this to a file "obs_sum_stats":
$ obsSumStats.pl msbayes_master > obs_sum_stats
The file "obs_sum_stats" is also a tab-delimited value file, and consists of two rows, a header row and a data row, with the latter comprising the summary statistics calculated on the observed data.

Now here is the problem with the msBayes pipeline: the observed summary statistics files generated by "obsSumStats.pl" and the file of simulations from the prior generated by "msBayes.pl" are incompatible!

In particular, the file of simulations from the prior has a number of columns that are not present in the file of observed summary statistics. The file of observed summary statistics has 106 columns, with the first 4 columns being parameters and the remaining 102 columns being the summary statistics of interest. However, the file of simulations from the prior has a varying number of parameter columns, contingent upon your hyper-prior model settings, followed by the 102 columns of summary statistics. For example, if you do not constrain the number of divergence times ("numTauClasses = 0"), the simulations from the prior will have 107 columns, with the first 5 columns being parameters and the remaining 102 columns being summary statistics. On the other hand, if you constrain the run to have only 1 divergence time ("numTauClasses = 1"), then the simulations from the prior will have 109 columns, with the first 7 columns being parameters and the remaining 102 columns being summary statistics.

When dealing with just one file of observed summary statistics, and files of simulated samples from the prior generated under the exact same model, fixing this incompatibility is trivial (IF you are aware of it; I was not the first few times that I ran msBayes, and because my results did not look unreasonable, it was only when I started to look at the files in detail that I realized what was happening). However, the moment you begin doing anything more complex (such as runs under different hyper-priors for the number of distinct divergence times), it becomes really quite complicated and tedious.

UPDATE Sept 22 2009: It appears that "acceptRej.pl" not only correctly handles, but actually requires, that the file of summary statistics produced by "obsSumStats.pl" be one column short in relation to the file of simulations from the prior, i.e. 106 columns (of which the first 4 are parameters) instead of 107 (of which the first 5 are parameters). It is not clear to me at this point how it handles the situation where the file of simulations from the prior has > 107 columns, e.g., if the number of divergence times is constrained. One way or another, though, to use "msReject", you will still want the column patterns consistent across all your files.

To this end, I've written a Python script that greatly simplifies the task of regularizing columns across multiple files. It depends crucially on a header row of column names as the first line of each file. You will pass it the path to a reference or "model" file that has the column pattern you want to duplicate across all the source files. For each of the remaining source files you specify, it will examine the header row to identify the positions and names of the dummy columns it needs to insert to make the column pattern identical to that of the reference or model file. It then prints out the columns of data from the source file, inserting columns of dummy values as needed; if the column pattern of the source file already matches, it simply outputs the data rows as they are given.

If you pass it multiple files as input, the data rows (i.e., all lines from the second line onwards) from all these files will be written out seamlessly, so that the final output consists of a single header row of column names followed by the data rows of all the source files, with dummy fields inserted into each row as necessary to maintain a consistent column pattern.
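
The core of the logic is simple enough to sketch in a few lines (this is just an illustration of the idea, not the script itself; it assumes the source columns are a subset of the model columns and appear in the same relative order, and the dummy value here is arbitrary):

def regularize_rows(model_header, src_header, src_rows, dummy="0"):
    # Pad each source row with dummy fields so that its columns line up with the
    # model file's header; returns the names of the inserted columns and the
    # padded rows.
    missing = [name for name in model_header if name not in src_header]
    padded = []
    for row in src_rows:
        fields = dict(zip(src_header, row))
        padded.append([fields.get(name, dummy) for name in model_header])
    return missing, padded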

For example, if you want your file of summary statistics calculated on your observed data (using "obsSumStats.pl") to match the column pattern of your file of simulations from the prior (generated using "msBayes.pl"), you would:
$ regularize-cols.py -m prior_sims obs_sum_stats > obs_sum_stats_regularized
(found 107 fields in model file)
-- Processing 1 of 1: obs_sum_stats
* Source file has 106 columns instead of 107.
* Columns to be inserted: [1]:"PRI.numTauClass"
As mentioned, the files of simulations from the prior generated by "msBayes.pl" have different numbers of columns, depending on the simulation model hyper-parameter settings. Since "regularize-cols.py" concatenates the data rows of all files passed to it as arguments, you can use it to merge simulations from the prior from different runs of "msBayes.pl" into a single large file for rejection sampling, and it will inspect each file individually to ensure that the column patterns are identical to the model file's. For example, the following takes 4 files containing data generated under two different models and concatenates the data into a single file, with dummy fields inserted as necessary to ensure uniformity and consistency in columns:
$ regularize-cols.py -m priors2_1 priors1_1 priors1_2 priors2_1 priors2_2 > priors_all
(found 111 fields in model file)
-- Processing 1 of 4: priors1_1
* Source file has 109 columns instead of 111.
* Columns to be inserted: [3]:"PRI.Tau.2" [5]:"PRI.Psi.2"
-- Processing 2 of 4: priors1_2
* Source file has 109 columns instead of 111.
* Columns to be inserted: [3]:"PRI.Tau.2" [5]:"PRI.Psi.2"
-- Processing 3 of 4: priors2_1
* Source file has same number of columns as model file.
* No columns to be inserted.
-- Processing 4 of 4: priors2_2
* Source file has same number of columns as model file.
* No columns to be inserted.
Things to note regarding the operation of the script:
  • Let me repeat this with emphasis: the script depends crucially on a header row of column names as the first line of each file!
  • It only inserts columns into rows of the source files so that their column pattern matches the reference or model file. It does not delete any columns. Thus, the model file should have a column pattern that is a superset of all the column patterns across the source files. You can use the UNIX command "head -1 <FILENAME> | awk '{print NF}'" to quickly count the columns in a file.
  • It does not re-order columns in the source files; indeed, its logic depends on the column order of the source files matching the order of columns in the model file (sans the missing columns, of course).
  • All the above conditions are met perfectly by the output files produced by msBayes, so as long as you do not modify these files directly, or are careful to maintain the above conditions if you do modify the files, there should be no problem in the execution of the script.
So, to summarize, as things stand when I write this, the msBayes pipeline generates files of varying column patterns. You will need to ensure that your column patterns are identical across and within files before actually carrying out the rejection sampling procedure. The script given here will automate this task for you, removing the tedium and reducing the risk of errors.

In my next post, I will discuss the actual invocation and running of "msReject", and present a Python script that makes the call to "msReject" simpler by identifying the summary statistic columns for you.

msBayes Basics III: Approaches to Sampling from the Posterior

This is the third of a multi-part series on msBayes basics. Here I will provide a brief overview of the ways you can carry out rejection sampling on your simulations from the prior using the programs distributed with msBayes. Later posts will describe the procedure in detail.

In the previous step, you accumulated a large number of parameter values sampled from the prior. Each sample of parameter values is associated with a set of summary statistics calculated on data simulated using those parameter values. In the current step, a specified proportion of these samples, namely those whose summary statistics are closest to the summary statistics calculated on the observed data, will be retained, while the rest will be rejected. Once you have completed this, you will have transformed your samples from the prior into samples from the posterior. It should be noted that this approach of generating posterior samples from the prior is only one variant of many, and is the approach adopted by the programs supplied with msBayes. An excellent summary of some alternative approaches, along with references, can be seen here.

I realize that up to this point I have not discussed the actual simulation model of msBayes at all, nor the summary statistics. I definitely think it would be interesting and useful to delve into the former at some point, but for now I am going to simply refer you to the papers that describe the models in detail, here and here, as well as just include a figure illustrating the model (as given in the original paper):



The summary statistics, and, just as importantly, the metric used to calculate distances between different sets of summary statistics, are also definitely worthy of discussion. In fact, improvement of existing and development of new generalized summary statistics and corresponding distance metrics are where I think (and hope!) we will see the most exciting progress in Approximate Bayesian Computation in phylogeographical analysis in the near future. Right now, msBayes employs as its core summary statistics a suite of over 100 traditional population genetic statistics (such as pi, Watterson's theta, Tajima's D, Shannon's Index, etc.) that have been tested in simulation studies. For its distance metric, msBayes uses the simple multivariate Euclidean distance between the summary statistics as a first step. For continuous parameter values, the posterior samples are then regressed onto the differences between the simulated and the observed summary statistics.

So, returning to our msBayes pipeline "walk-through", how do we go from our samples from the prior to samples from the posterior? We have a number of different options. msBayes itself provides two approaches, one using a Perl script, "acceptReject.pl", which drives an R script that carries out the business, and the other a C program, "msReject".

The Perl+R approach, using "acceptReject.pl", automatically does the local linear-regression weighting for you, and produces nice, fancy PDFs of the posterior probability distribution of the number of distinct divergence times. I do not recommend it, for two reasons. The first is that the R analysis is extremely slow and a phenomenal memory hog; I had to switch to using 16 GB machines to process some files, and even then I was pushing the envelope. The second is that while the PDF plots of the posterior probability distribution of the number of divergence times are very nice and informative, I would far rather have the actual samples from the posterior. With those, not only is it possible to generate any of a wide variety of plots illustrating the posterior probability distribution of any recorded parameter, I can also summarize the values in a number of other ways, or use the samples in other analyses.

The "msReject" approach carries out rejection sampling on the simulations from the prior using Euclidean distances. The program is extremely fast and has a relatively miniscule memory footprint. It routinely handles the processing of 50 million sample files without any trouble. However, it does not carry out the locally-weighted linear regression of the posterior values, and you will have to do this yourself separately. Nonetheless, I highly recommend using this program over the Perl script.

In the next post, I will go over in detail how you generate summary statistics on your observed data, provide a Python script that regularizes the file so that it is compatible with the samples simulated from the prior, and provide another Python script that identifies the summary statistic columns in the file and calls on msReject with the appropriate options to correctly generate a sample from the posterior.

Saturday, September 19, 2009

msBayes Basics II: Data Assembly, Configuration File Preparation, and Generating Samples from the Prior

This is the second of a multi-part series on msBayes basics, and here I will discuss the preparation of the input files to the pipeline, the creation of the configuration file to control the simulations from the prior, as well as the running of the simulator itself.

Data assembly is probably one of the more tedious parts of the whole ABC pipeline, and I only provide a brief overview. Much more detailed and very helpful information can be found in the msBayes documentation.

The basic sequence data files themselves are pretty straightforward: for each species or taxon distributed across the barrier, you need to provide a single FASTA file with all the sequences listed in order (i.e., with sequences from population A listed first, followed by sequences from population B). That's the easy part. The part that requires a bit of work is collecting the information required to flesh out the configuration file. This includes a distinct sequence evolution model for each population pair. The msBayes simulator needs to generate realistic genetic sequence differences between individuals from each population under a standard molecular evolution model (HKY), and thus requires that you provide estimates of the transition-transversion rate ratio and nucleotide base frequencies for each population pair. So, for example, you might want to use GARLI to infer a maximum likelihood tree under an HKY model for the sequences of each population pair, or alternatively, bring your favorite tree into PAUP* along with the sequences and ask PAUP* to provide ML estimates of the HKY parameters.

For the actual format of the master configuration file, please see the msBayes documentation, but basically, for each population pair, you need to provide the total number of sequences, the number of sequences from population A, the number of sequences from population B, the HKY parameters, the length of the sequence, and the path to the FASTA file.

I would highly recommend that you also go the extra step and embed the msBayes run parameters in the configuration file as well, rather than rely on the interactive queries. This is necessary if you want to run the simulations remotely on a cluster through a scheduler, but, far more importantly, it provides a permanent record of the simulation settings, allowing you to track down mistakes or reproduce the same conditions.

Here is an example of a complete simulation configuration file, where the run settings, hyper-priors, priors, and observed sequence data are all specified:


numLoci = 1
lowerTheta = 0.01
upperTheta = 75.0
upperTau = 10.0
numTauClasses = 0
upperMig = 0.0
upperRec = 0.0
upperAncPopSize = 0.5
constrain = 0
subParamConstrain = 000000000
reps = 5000000

BEGIN SAMPLE_TBL
100 7 93 3.71 2356 0.2 0.3 0.1 a1-a2.fas
16 4 12 4.08 2559 0.3 0.2 0.2 b1-b2.fas
10 5 5 4.08 1356 0.3 0.2 0.2 c1-c2.fas
29 8 21 4.09 2367 0.2 0.2 0.1 d1-d2.fas
11 2 9 4.13 1372 0.3 0.2 0.1 e1-e2.fas
69 38 31 4.13 1372 0.2 0.2 0.1 f1-f2.fas
END SAMPLE_TBL



Once you have assembled your data files and created the simulation configuration file, the next step is to actually run the msBayes simulator to generate samples from the prior. This can be as simple as:



$ msbayes.pl -c msbayes_master -o prior_samples


This calls on "msbayes.pl", a Perl script that is part of the msBayes distribution, passing it our master configuration file "msbayes_master" as an argument and asking it to direct its results to a file called "prior_samples".

"msbayes.pl" is the primary driver for the simulations from the prior. It wraps up a whole bunch of complexity: generation of hyper-parameters from their prior distribution, the generation of the actual model parameters to be passed to the underlying coalescent-based sequence simulator (using "msprior"), calling the coalescent-based sequence simulator ("msDQH") to simulate a sample of summary statistics based on the generated model parameters, assembly of a row of data that couples together the set of generated parameter values and the corresponding summary statistics simulated based on this values, and the output of this row as a tab-delimited line of data. This gets repeated again and again, for the number of replicates specified in the master configuration file ("reps = 5000000" in the example file above), to build up a large sample drawn from the parameter space according to the given prior distributions.

All this is a lot of work, but is extremely computationally efficient: the CPU and memory footprint of a single process running is almost negligible, and you can run multiple simultaneous simulations without much overhead even on a single middle-of-the-road machine. Which leads to my recommendation here: run multiple processes to generate multiple files with samples from the prior simultaneously, concatenating these files before going to the next step in the ABC pipeline. So, for example, if you want 10 million samples from the prior, you can run 10 simultaneous processes, each drawing 1 million samples from the prior; concatenating the files then gives you 10 million samples in 1/10th the time.
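
Here is one way to do that from Python (just a sketch; a shell loop or your cluster's scheduler will do equally well, set "reps" in the configuration file to the per-run number of samples, and do make sure that independent runs are seeded differently):

import subprocess

def run_parallel_priors(n_procs, config="msbayes_master", prefix="prior_samples"):
    # launch several independent msbayes.pl runs at once and wait for all of them
    procs = []
    for i in range(n_procs):
        out_file = "%s_%d" % (prefix, i)
        procs.append(subprocess.Popen(["msbayes.pl", "-c", config, "-o", out_file]))
    for p in procs:
        p.wait()

# run_parallel_priors(10)
# Each output file starts with its own header row, so keep only one header
# when concatenating the pieces.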

Once you have your samples from the prior, the next step is to carry out rejection sampling on this data to provide samples from the posterior. And this is where the going gets pretty interesting ...

You see, to be honest, I do not think my posts describing the procedures up to this stage have really provided all that much more new or useful information than already existed in the documentation. I think that the operation of the msBayes ABC pipeline up to this point has been smooth and relatively trouble-free, and has proceeded as expected based on the documentation and help files. However, going beyond the generation of samples from the prior is where I ran into a number of difficulties, gotchas, glitches, and trip-me-up's. My next post will discuss these issues, share my experiences, lessons learned and solutions, as well as offer some scripts that I wrote to patch up the "leaks" and unexpected incompatibilities between different parts of the pipeline.

msBayes Basics I: Introduction

I've been working quite a bit with msBayes recently, and thought I would share some of the stuff that I've learnt, with an emphasis on the practical side of things. Over a series of posts, I am going to summarize the steps required to carry out a basic msBayes analysis, provide some "lessons learned" and tips for each of these steps, as well as provide some supplemental Python scripts that I find greatly facilitate the flow of data from one stage of the pipeline to another. This first post serves as a general introduction to the approach and the program, and a general overview of the procedures involved in an msBayes analysis.

Most of you are probably already quite familiar with this method of analysis, but for those of you that are not, perhaps a brief explanation may be in order. msBayes is a suite of tools that plug into a pipeline for Approximate Bayesian Computation (ABC) phylogeographic analysis, estimation and hypothesis testing. The ABC approach is a computationally efficient approach to complex estimation and inference problems. At its most basic, it works as follows:
  1. A set of summary statistics is calculated on an observed data set.
  2. Multiple sets of data are simulated using parameters drawn from a prior distribution.
  3. The same set of summary statistics is calculated for each of these simulated data sets.
  4. The parameters that result in simulated data sets with summary statistics that are equal to or within some threshold distance of the summary statistics of the observed data set are recorded.
  5. The posterior probability of the parameter values is given by the proportion of those values in the final set of accepted parameters.
Each of these steps involves multiple levels of complexity and a smorgasbord of variant approaches, of course, from the full nuts-and-bolts of the simulation model to various flavours of importance/rejection sampling (step 4), but the core concept remains the same: simulate the data under a range of parameter values, and keep those parameter values that result in simulated data sufficiently close to the observed data. This approach is extremely attractive: for many complex problems and questions, data can be simulated relatively easily, whereas sometimes even formulating the corresponding likelihood statement can be pretty challenging. Even if the likelihood can be formulated, computing it tends to be extremely expensive in terms of CPU power and time as well as memory, while simulating data is relatively cheap as far as those resources go. However, be warned: the ABC approach does exact a cost in storage space; all those samples from the prior have to be retained until they can be processed, and I have had analyses that walk into the realm of 50-60 GB.
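
To make the recipe concrete, here is a toy rejection sampler in Python. There is nothing phylogeographic about it: it just estimates the mean of a normal distribution, using the sample mean as the lone summary statistic, and the numbered comments map onto the steps above:

import random

def toy_abc_rejection(observed, n_sims=100000, tolerance=0.05):
    obs_stat = sum(observed) / float(len(observed))          # step 1: observed summary statistic
    accepted = []
    for _ in range(n_sims):
        mu = random.uniform(-10.0, 10.0)                      # step 2: draw from the prior...
        sim = [random.gauss(mu, 1.0) for _ in observed]       # ...and simulate data under it
        sim_stat = sum(sim) / float(len(sim))                 # step 3: same summary statistic
        if abs(sim_stat - obs_stat) <= tolerance:             # step 4: keep it if close enough
            accepted.append(mu)
    return accepted                                           # approximate posterior sample (step 5)

# observed = [random.gauss(3.0, 1.0) for _ in range(50)]
# posterior = toy_abc_rejection(observed)
# The mean of `posterior` should sit close to 3.0, and the proportion of accepted
# values falling in any interval approximates that interval's posterior probability.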

msBayes can be used to answer a broad variety of questions using ABC, but a major one, and the one which I am currently using it for, is to test the hypothesis of simultaneous divergence amongst a set of taxa co-distributed across a geographical barrier. Specifically, I am asking: "given a bunch of sequences of pairs of populations of different species on either side of a barrier, what is the posterior probability of a single distinct divergence time between each pair?"

Using msBayes to answer this question involves the following steps:
  1. Assemble the observed data files.
  2. Create the configuration file to drive the simulations from the prior.
  3. Run the simulator to produce a large number of samples from the prior.
  4. Calculate the summary statistics on the observed data.
  5. Regularize the columns / fields in the summary statistics file and the simulated data.
  6. Carry out the acceptance/rejection sampling on the prior samples to produce samples from the posterior.
  7. Summarize the posterior probability distribution of discrete parameter values (such as number of divergence times) by assessing the proportion of those values in the posterior samples.
  8. Weight the posterior samples of continuous parameters using local linear regression before summarizing their distribution.
Each of these steps will be covered in detail in future posts, and I will also be providing some scripts that I have found really useful in helping to carry out some of these steps.

Thursday, July 9, 2009

Welcome to geodendron!

Now that this blog has been up for several weeks, it's probably time to post something. It's become increasingly apparent (at least in my opinion) that the time is ripe for some serious testing of phylo- and biogeographic methods. Hopefully this blog will serve as a forum for thoughts and developments related to (i) approaches to simulating appropriate benchmark datasets, (ii) simulation studies to test existing methods, and (iii) the development of new analytical tools.

I envision a few posts each week on published papers or unpublished thoughts, each of which will spark comments and discourse on the post's topic. Not that this should need to be pointed out, but just in case: all discussions on this blog should remain civil and respectful in tone.

Jeet and I realized the potential usefulness of this blog when we discovered that we were duplicating each other's efforts in writing generalized simulation programs for phylogeographic data. Roughly, both of our programs relax coalescent assumptions (among other things) and allow individuals to move on a continuous landscape. My collaborators and I are in the process of writing up an application note about our program and are trying to figure out what simple simulations would be good for illustrating the utility of continuous-landscape simulations over existing coalescent-based simulations.

So, to kick things off, what are your thoughts on this? What are some simple scenarios where relaxing the coalescent assumption might be particularly important?