Saturday, September 19, 2009

msBayes Basics I: Introduction

I've been working quite a bit with msBayes recently, and thought I would share some of the stuff that I've learnt, with an emphasis on the practical side of things. Over a series of posts, I am going to summarize the steps required to carry out a basic msBayes analysis, provide some "lessons learned" and tips for each of these steps, as well as provide some supplemental Python scripts that I find greatly facilitate the flow of data from one stage of the pipeline to another. This first post serves as a general introduction to the approach and the program, and a general overview of the procedures involved in an msBayes analysis.

Most of you are probably already quite familiar with this method of analysis, but for those of you that are not, perhaps a brief explanation may be in order. msBayes is a suite of tools that plug into a pipeline for Approximate Bayesian Computation (ABC) phylogeographic analysis, estimation and hypothesis testing. The ABC approach is a computationally efficient approach to complex estimation and inference problems. At its most basic, it works as follows:
  1. A set of summary statistics is calculated on an observed data set.
  2. Multiple sets of data are simulated using parameters drawn from a prior distribution.
  3. The same set of summary statistics is calculated for each of these simulated data sets.
  4. The parameters that result in simulated data sets with summary statistics that are equal to or within some threshold distance of the summary statistics of the observed data set are recorded.
  5. The posterior probability of the parameter values is given by the proportion of those values in the final set of accepted parameters.
Each of these steps involves multiple levels of complexities and a smorgasbord of variant approaches, of course, from the full nuts-and-bolts of the simulation model to various flavours of importance/rejection sampling (step 4), but the core concept remains the same: simulate the data under a range of parameter values, and keep those parameter values that result in simulated data sufficiently close to the observed data. This approach is extremely attractive: for many complex problems and questions, data can be simulated relatively easily, whereas sometimes even formulating the corresponding likelihood statement can be pretty challenging. Even if the likelihood can be formulated, computing it tends to be extremely expensive in terms of CPU power and time as well as memory, while simulating data is relatively cheap as far as those resources go. However, be warned: the ABC approach does exact a cost in storage space; all those samples from the prior have to be retained until they can be processed, and I have had analyses that walk into the realm of 50-60 GB.
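To make the recipe above concrete, here is a minimal toy ABC rejection sampler in Python: it estimates the mean of a normal distribution using the sample mean as the (single) summary statistic. The prior range, sample sizes, and tolerance are all invented for illustration, and none of this is msBayes code.

```python
import random

# Toy ABC rejection sampler, following steps 1-5 above. All numbers here
# (prior range, number of draws, tolerance) are made up for illustration.

random.seed(1)

# 1. Observed data and its summary statistic.
observed = [random.gauss(3.0, 1.0) for _ in range(100)]

def summary(data):
    """Single summary statistic: the sample mean."""
    return sum(data) / len(data)

obs_stat = summary(observed)

accepted = []
for _ in range(20000):
    mu = random.uniform(-10.0, 10.0)               # 2. draw from the prior
    simulated = [random.gauss(mu, 1.0) for _ in range(100)]
    sim_stat = summary(simulated)                  # 3. same summary statistic
    if abs(sim_stat - obs_stat) < 0.1:             # 4. accept if close enough
        accepted.append(mu)

# 5. The accepted values approximate the posterior for mu.
posterior_mean = sum(accepted) / len(accepted)
```

The accepted values of mu cluster around the true mean of 3.0, even though no likelihood was ever written down.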

msBayes can be used to answer a broad variety of questions using ABC, but a major one, and the one which I am currently using it for, is to test the hypothesis of simultaneous divergence amongst a set of taxa co-distributed across a geographical barrier. Specifically, I am asking "given a bunch of sequences from pairs of populations of different species on either side of a barrier, what is the posterior probability of a single divergence time shared across all the pairs?"

Using msBayes to answer this question involves the following steps:
  1. Assemble the observed data files.
  2. Create the configuration file to drive the simulations from the prior.
  3. Run the simulator to produce a large number of samples from the prior.
  4. Calculate the summary statistics on the observed data.
  5. Regularize the columns / fields in the summary statistics file and the simulated data.
  6. Carry out the acceptance/rejection sampling on the prior samples to produce samples from the posterior.
  7. Summarize the posterior probability distribution of discrete parameter values (such as number of divergence times) by assessing the proportion of those values in the posterior samples.
  8. Weight posterior samples of continuous parameter using local linear regression before summarizing their distribution.
Each of these steps will be covered in detail in future posts, and I will also be providing some scripts that I found really useful in helping to carry out some of these steps.
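As a taste of what step 7 amounts to: once you have your posterior samples, the posterior probability of a discrete parameter value (say, the number of distinct divergence times, often called Psi) is just its relative frequency among those samples. A minimal sketch, with invented sample values:

```python
from collections import Counter

# Sketch of step 7: posterior probability of a discrete parameter value is
# its relative frequency among the accepted (posterior) samples. The values
# below are invented for illustration, not output from a real analysis.

accepted_psi = [1, 1, 2, 1, 3, 1, 2, 1, 1, 2]  # Psi from each accepted sample

counts = Counter(accepted_psi)
total = len(accepted_psi)
posterior = {psi: n / total for psi, n in counts.items()}
# posterior[1] == 0.6, i.e. the estimate of P(Psi = 1 | data) is 0.6
```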

6 comments:

  1. You characterize ABC as "computationally efficient." I suppose that is in the eye of the beholder, but I'd generally consider importance sampling methods to be quite inefficient unless the importance distribution is pretty similar to the posterior that you want to sample.

    "Even if the likelihood can be formulated, computing it tends to be extremely expensive in terms of CPU power and time as well as memory, while simulating data is relatively cheap as far as those resources go." Are there examples for which we can calculate the likelihood, but this type of simulation is actually more efficient than the likelihood calcs? I guess it is hard to answer. We have cases in which there are so many continuous parameters that integration via grid-style parameter sweeps is impractical. Then it is a matter of comparing MCMC to ABC. Poorly tuned MCMC could certainly be a lot worse than ABC. My gut instinct is that reasonably tuned MCMC would almost always "beat" ABC.

    In my view, the real advantage of ABC is its flexibility. It is often easy to simulate a complex scenario, but designing rev-jump MCMC moves to sample a more complex model space can be tedious. It might pay off if you were going to repeat the analysis many times, but in most cases you could get the answer faster by going the ABC route (and most of the time it takes would be the computer crunching numbers rather than your precious programmer time).


    That is my 2 cents worth...

  2. Mark,

    I agree that the greatest strength of ABC is the ability to estimate posterior probabilities of really complex scenarios which may be very difficult to formulate in terms of a probabilistic model, but might be relatively easy to formulate in terms of, for example, a mechanistic model.

    The way msBayes does the rejection sampling is by simply retaining a specified proportion of the samples from the prior that come closest to the observed data in terms of the distance between summary statistics. This approach, I think, allows you to have really vague priors and still get results in a reasonable period of time. But if your priors are mostly (or totally) off, or you simply do not sample from the priors sufficiently, the "efficiency" in terms of time and resources used to get results is going to be the same, but your results will be rather misleading, right?

    Another way to go about this is to keep on simulating from the prior, accepting only samples that come within some threshold distance, until you have accumulated a specified number of samples. It seems that vague priors here, coupled with a reasonably strict threshold, may mean that it takes an inordinately long time to get the target posterior sample size. And if your priors are off, you will never get there. So here it is difficult to make the argument for efficiency. But at least you will recognize that you have bad priors. What do you think?

  3. Jeet,

    Just to clarify: the threshold that is used in the selection of models for the posterior uses Euclidean distance as its metric, but it simply takes the best models from your prior, according to the proportion you specify. So if all your models are way off, you still get a posterior that minimizes Euclidean distance, but that doesn't mean your model is good or close to the truth.

    I've been a little uneasy about presenting the posterior probabilities of individual models based on a single threshold. I see variation in my simulations where I might believe model 0 is by far the best at a threshold of 0.001. But if I raise that to 0.005, I start to see model 2 creeping into the posterior. So as a bit of a copout, I've simply decided to present the posterior probabilities of individual models as a function of the threshold used. It would be nice to have some insights as to what a reasonable threshold is, but I'm not sure how one would generate those insights.

    Sorry, a bit off the topic of efficiency...

  4. Hi Jake,

    Maybe you should explore the other method of rejection sampling, i.e., when you keep simulating from the prior until you have accumulated a pre-determined number of samples that are within some distance threshold of your observed data?

    It is a bit confusing to me how the word "threshold" (and "tolerance") is used in different ways under the two different rejection sampling schemes. For example, in the alternate rejection sampling scheme that I suggested in the previous paragraph, the term "threshold" refers to the minimum Euclidean distance in summary statistics for a sample to be accepted. In contrast, in the approach to rejection sampling used by msBayes, the tolerance/threshold refers to the proportion of samples from the prior that will be accepted (after they have been ranked by their distances to the observed data).

    Are there terms to distinguish between the two rejection sampling schemes? i.e., Rejection-sampling-based-on-a-fixed-proportion-of-the-prior vs. Rejection-sampling-based-on-only-accepting-samples-within-some-distance-from-the-observed?
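    The distinction is perhaps easiest to pin down in code. A sketch, where each sample is a (parameters, distance-to-observed) pair; the function names here are mine, purely for illustration, and do not come from msBayes:

```python
# Two rejection schemes on the same pool of prior samples, each sample being
# a (parameters, distance-to-observed-summary-statistics) pair.

def accept_top_fraction(samples, fraction):
    """msBayes-style: rank all prior samples by distance and keep the
    closest `fraction` of them, whatever their absolute distances are."""
    ranked = sorted(samples, key=lambda s: s[1])
    k = max(1, int(len(ranked) * fraction))
    return ranked[:k]

def accept_within_distance(samples, epsilon):
    """Alternative scheme: keep only samples whose distance falls below an
    absolute threshold; may keep nothing at all if the priors are way off."""
    return [s for s in samples if s[1] < epsilon]

samples = [("a", 0.9), ("b", 0.1), ("c", 0.5), ("d", 0.05)]
accept_top_fraction(samples, 0.5)     # [("d", 0.05), ("b", 0.1)]
accept_within_distance(samples, 0.2)  # [("b", 0.1), ("d", 0.05)]
```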

  5. Good idea. Do you know how to implement it?

  6. Jake,

    The hacky way to do this might be to use msBayes.pl to generate one sample from the prior, then calculate the Euclidean distance to the observed summary statistics to decide whether to keep or reject this sample, appending it to a growing file of samples from the posterior if you decide to keep it. Lather. Rinse. Repeat. Should be trivial to write a Python script to automate this process for a given threshold. But I suspect a major performance hit because of all the file opening and closing going on. A better way would be to hack the msBayes.pl script directly, but this will involve Perl. I personally would rather spend a few days coding up a complete solution from scratch to replicate the full functionality of msBayes.pl in Python than dip into Perl.
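    Something like this, as a very rough sketch. Here draw_prior_sample() is a stub standing in for actually invoking msBayes.pl (via subprocess, say) and parsing out one sample's parameters and summary statistics; only the shape of the accumulate-until-N loop is meant to be illustrative, and every number below is made up:

```python
import random

# Sketch of the accumulate-until-N rejection loop described above. In a real
# pipeline, draw_prior_sample() would run msBayes.pl once and parse its
# output; here it is a toy stub so the loop structure stands on its own.

random.seed(7)

OBS_STAT = 0.0      # placeholder observed summary statistic
EPSILON = 0.05      # absolute distance threshold for acceptance
TARGET = 50         # desired number of posterior samples
MAX_DRAWS = 100000  # bail out eventually: with bad priors we'd loop forever

def draw_prior_sample():
    """Stub for one msBayes.pl run: a parameter drawn from the prior plus
    a toy 'simulated summary statistic' computed from it."""
    theta = random.uniform(-1.0, 1.0)
    stat = theta + random.gauss(0.0, 0.02)
    return theta, stat

posterior = []
draws = 0
while len(posterior) < TARGET and draws < MAX_DRAWS:
    theta, stat = draw_prior_sample()
    draws += 1
    if abs(stat - OBS_STAT) < EPSILON:
        posterior.append(theta)
```

    With priors that put mass near the truth, the loop fills up quickly; with priors that are way off, it runs into the MAX_DRAWS cap, which is exactly the "at least you will recognize that you have bad priors" behaviour discussed above.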
