Sunday, September 20, 2009

msBayes Basics III: Approaches to Sampling from the Posterior

This is the third of a multi-part series on msBayes basics. Here I will provide a brief overview of the ways you can carry out rejection sampling on your simulations from the prior using the programs distributed with msBayes. Later posts will describe the procedure in detail.

In the previous step, you accumulated a large number of parameter values sampled from the prior. Each sample of parameter values is associated with a set of summary statistics calculated on data simulated under those parameter values. In the current step, a specified proportion of these samples, those whose summary statistics are closest to the summary statistics calculated on the observed data, will be retained, while the remaining samples will be rejected. Once this is done, you will have transformed your samples from the prior into samples from the posterior. It should be noted that this approach of generating posterior samples from the prior is only one of many variants, and is the one adopted by the programs supplied with msBayes. An excellent summary of some alternative approaches, along with references, can be seen here.
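To make the idea concrete, here is a minimal sketch of this kind of rejection sampling in Python with NumPy. The function name, the standardization of each statistic by its standard deviation, and the default retention proportion are my own illustrative choices, not msBayes code:

```python
import numpy as np

def rejection_sample(prior_params, prior_stats, obs_stats, proportion=0.001):
    """Keep the fraction of prior samples whose summary statistics are
    closest (in Euclidean distance) to the observed summary statistics."""
    # Standardize each summary statistic so that no single statistic
    # dominates the distance purely because of its scale.
    sd = prior_stats.std(axis=0)
    sd[sd == 0.0] = 1.0  # guard against constant columns
    z_prior = prior_stats / sd
    z_obs = obs_stats / sd

    # Euclidean distance from every simulated sample to the observed data.
    dists = np.sqrt(((z_prior - z_obs) ** 2).sum(axis=1))

    # Retain the specified proportion of closest samples.
    n_keep = max(1, int(round(proportion * len(dists))))
    keep = np.argsort(dists)[:n_keep]
    return prior_params[keep], dists[keep]
```

The retained parameter rows are then an approximate sample from the posterior, with the retention proportion playing the role of the tolerance.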

I realize that up to this point I have not discussed the actual simulation model of msBayes at all, nor the summary statistics. I definitely think it would be interesting and useful to delve into the former at some point, but for now I am going to simply refer you to the papers that describe the models in detail, here and here, as well as just include a figure illustrating the model (as given in the original paper):

[Figure: the msBayes simulation model, as illustrated in the original paper]

The summary statistics, and, just as importantly, the metric used to calculate distances between different sets of summary statistics, are also definitely worthy of discussion. In fact, the improvement of existing, and development of new, generalized summary statistics and corresponding distance metrics are where I think (and hope!) we will see the most exciting progress in Approximate Bayesian Computation in phylogeographical analysis in the near future. Right now, msBayes employs as its core summary statistics a suite of over 100 traditional population genetic statistics (such as pi, Watterson's theta, Tajima's D, Shannon's Index, etc.) that have been tested in simulation studies. For its distance metric, msBayes uses the simple multivariate Euclidean distance between the summary statistics as a first step. For continuous parameter values, the retained posterior samples are then adjusted by a regression onto the deviation of the simulated summary statistics from the observed ones, locally weighted by that distance.
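The regression-adjustment step can be sketched as follows, in the style of Beaumont et al. (2002): each retained parameter is regressed on the summary-statistic deviations from the observed data using Epanechnikov kernel weights, and the fitted linear trend is then subtracted. The function name, the choice of bandwidth argument, and the least-squares mechanics below are my own illustration, not msBayes's R code:

```python
import numpy as np

def regression_adjust(accepted_params, accepted_stats, obs_stats, dists, delta):
    """Local linear regression adjustment of accepted ABC samples.

    accepted_params : (n, p) retained parameter values
    accepted_stats  : (n, d) their summary statistics
    obs_stats       : (d,)   observed summary statistics
    dists           : (n,)   distances used in the rejection step
    delta           : kernel bandwidth (e.g. the largest accepted distance)
    """
    # Deviations of each accepted sample's statistics from the observed values.
    X = accepted_stats - obs_stats
    # Epanechnikov-style weights: closer samples get more influence.
    w = 1.0 - (dists / delta) ** 2
    w[w < 0.0] = 0.0
    # Weighted least squares via sqrt-weight rescaling, with an intercept column.
    Xd = np.column_stack([np.ones(len(X)), X])
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(sw * Xd, sw * accepted_params, rcond=None)
    # Subtract the linear trend, effectively projecting every sample onto
    # the observed summary statistics (where the deviation X is zero).
    return accepted_params - X @ beta[1:]
```

The effect is to tighten the retained sample around parameter values consistent with the observed data, compensating for the finite tolerance of the rejection step.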

So, returning to our msBayes pipeline "walk-through", how do we go from our samples from the prior to samples from the posterior? We have a number of different options. msBayes itself provides two approaches: one uses a Perl script, "acceptReject.pl", which drives an R script that does the actual work; the other is a C program, "msReject".

The Perl+R approach, using "acceptReject.pl", automatically performs the local linear-regression weighting for you and produces nice, fancy PDFs of the posterior probability distribution of the number of distinct divergence times. I do not recommend it, for two reasons. The first is that the R analysis is extremely slow and a phenomenal memory hog: I had to switch to machines with 16 GB of memory to process some files, and even then I was pushing the envelope. The second is that while the PDF plots of the posterior probability distribution of the number of divergence times are very nice and informative, I would far rather have the actual samples from the posterior. With these, not only is it possible to generate any of a wide variety of plots illustrating the posterior probability distribution of any recorded parameter, I can also summarize the values in a number of other ways, or use the samples in other analyses.

The "msReject" approach carries out rejection sampling on the simulations from the prior using Euclidean distances. The program is extremely fast and has a relatively minuscule memory footprint: it routinely handles files of 50 million samples without any trouble. However, it does not carry out the locally-weighted linear regression of the posterior values, and you will have to do this yourself separately. Nonetheless, I highly recommend this program over the Perl script.

In the next post, I will go over in detail how to generate summary statistics on your observed data. I will also provide a Python script that regularizes the resulting file so that it is compatible with the samples simulated from the prior, and another Python script that identifies the summary-statistic columns in the file and calls msReject with the appropriate options to correctly generate a sample from the posterior.

1 comment:

  1. Nice posts Jeet. I look forward to a discussion of the local linear regression step, as it is still pretty fuzzy in my mind what it does and why it's necessary.
