Showing posts with label approximate Bayesian Computation. Show all posts

Thursday, October 1, 2009

msBayes Basics Part VI: Tolerances and Local-Linear Regression Weighting of the Posteriors

This is the sixth of a multi-part series on msBayes basics. In the previous part, I discussed how you might run "msReject" to carry out rejection sampling on your samples from the prior, to obtain samples from the posterior. One crucial setting for this previous step was the choice of tolerance, and this is something I discuss in detail here.

First, let us consider why there is a need for tolerances. In general, the more summary statistics we use, the better our ability to discriminate between models and correctly approximate the posterior (as long as we are reasonable about the summary statistics we choose). But, at the same time, the more difficult it becomes to simulate a sample from the prior that results in summary statistics exactly equal to those calculated on our observed data. The variation in some of the nuisance parameters that we are trying to integrate out, as well as stochasticity in the processes that we are trying to model, make it difficult to generate a sample with identical summary statistics even under a perfect (correct) model, and even with only a few summary statistics. So we allow some "wiggle room": we say that we will accept any sample from the prior that comes close to the observed data in terms of the summary statistics, even if they are not exactly equal. And this wiggle room is what we call the tolerance. The more summary statistics we use, the larger the tolerance we will need to accept simulations at a reasonable rate.

However, as the tolerance gets larger and larger, we introduce more and more distortion into our approximation. This is because all samples accepted within the tolerance zone are weighted equally in the posterior. So a sample far from the observed data in terms of summary statistic distance will contribute the same proportion to the posterior as one very close to it, as long as both fall within the tolerance zone. In the limit, as our tolerance goes to infinity, we accept all samples from the prior, and thus the direction of bias due to large tolerances is to pull our posteriors toward the prior.

As it happens, recent work by Beaumont et al. provides a solution to this problem. They use local-linear regression to weight the posterior samples so that samples with summary statistics closer to the observed summary statistics are given greater weight than those further away. With this, the analysis becomes relatively insensitive to the choice of tolerance, as long as the number of posterior samples is large enough to allow an accurate regression. This is vividly illustrated in the data below, which show the results of rejection sampling at different tolerances, comparing the mean of the raw posterior distribution of a parameter vs. the mean of the weighted posterior distribution of the same parameter:

TOL.     NUM.     UNWEIGHTED           WEIGHTED
1e-05       500   0.148147697394790    0.188623205810913
2e-05      1000   0.178475544544545    0.138229444728950
1e-04      5000   0.217240549309862    0.126658030148325
2e-04     10000   0.235878721472147    0.130784297340281
0.001     50000   0.308834221344427    0.123652684056336
0.002    100000   0.355224938539385    0.123306496530203

As can be seen, both the range and variance of the regression-weighted parameter means across different tolerances are an order of magnitude less than those of the unweighted parameter means (0.0653167 vs. 0.2070772, and 0.0006326896 vs. 0.006153922, respectively). The posterior sample size under the first tolerance setting, 1e-05, is only 500, and this small size probably led to a relatively inaccurate regression, as is evidenced by the mean of the weighted distribution associated with it being rather far (> 7 s.d. units) from all the other values. Considering posterior sizes of 1000 or more, we find that the variance of the means of the weighted distributions vs. the unweighted is 3.843462e-05 vs. 0.005126309, and, regardless of our choice of tolerance, we can safely place the mean between 0.12 and 0.14 (a difference of 0.02), while the mean for the unweighted parameter ranges from 0.18 to 0.36 (a difference of 0.18).

This informal analysis demonstrates that the local linear regression weighting approach of Beaumont et al. is remarkably effective in making the estimates insensitive to our choice of tolerance by correcting for the bias toward the prior introduced by large tolerances. This allows us to use larger tolerances in our analysis, which in turn allows us to use more summary statistics and yet collect samples from the posterior at a reasonable rate. Of course, the regression weighting only applies to estimates of continuous parameter values. When doing model selection (such as choosing among numbers of divergence times), this method is not suitable. In the case of numbers of divergence times, the best approach might be to use the regression to weight the variance and mean of divergence times in your data (Var(t) and E(t)), and use their ratio (Omega = Var(t)/E(t)) to assess support for a single divergence time.
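The Beaumont et al. adjustment can be sketched in a few lines of NumPy. This is my own illustrative reimplementation, not code from msBayes or the original paper's R scripts; the Epanechnikov kernel weights and the regression adjustment follow the method's published description, but the function name and the quantile-based acceptance radius are my choices here, and this should be treated as a sketch rather than a drop-in replacement.

```python
import numpy as np

def loclinear_adjust(params, stats, obs_stats, tol_quantile=0.1):
    """Local-linear regression adjustment in the spirit of Beaumont et al. (2002).

    params:    (n,) parameter values sampled from the prior
    stats:     (n, k) summary statistics simulated under those parameters
    obs_stats: (k,) summary statistics of the observed data
    Returns the adjusted accepted parameter values and their kernel weights.
    """
    # Euclidean distance of each simulated sample to the observed data
    dist = np.sqrt(((stats - obs_stats) ** 2).sum(axis=1))
    # Acceptance radius: keep the closest tol_quantile fraction of samples
    delta = np.quantile(dist, tol_quantile)
    keep = dist <= delta
    d = dist[keep]
    S = stats[keep] - obs_stats          # centered summary statistics
    th = params[keep]
    # Epanechnikov kernel: samples closer to the observed data count more
    w = 1.0 - (d / delta) ** 2
    # Weighted least squares: th ~ a + S @ b
    X = np.column_stack([np.ones(S.shape[0]), S])
    XtW = X.T * w
    beta = np.linalg.solve(XtW @ X, XtW @ th)
    # Slide each accepted value along the regression line to S = obs
    th_adj = th - S @ beta[1:]
    return th_adj, w
```

A weighted mean or weighted density estimate of `th_adj` (using `w`) then summarizes the adjusted posterior.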

The story does not quite end here, however. A further complication lies in the way that msBayes carries out rejection sampling. In all the discussion so far, we have referred to the process of accepting samples from the prior that come within some specified distance (the tolerance) of the observed data. One would imagine carrying out simulations from the prior and, for each simulation, accepting the sample into the posterior only if the Euclidean distance between the summary statistics of the simulated data and the observed data falls within some specified tolerance distance. In this scheme, we would run the simulations from the prior continuously, building our samples from the posterior as we go along, until our posterior sample reaches some reasonable size. However, in implementing the rejection sampling, msBayes does things a little differently. As has been covered in painful detail, we generate samples from the prior of a specified size before we even begin the rejection sampling. The samples are then ranked by the Euclidean distance of their summary statistics to the summary statistics of the observed data, and a pre-specified proportion of those samples closest in distance to the observed data is accepted into the posterior. It is confusing that this proportion is also called a tolerance, because it is not the same tolerance as discussed above. Rather, depending on the summary statistic values of the specific samples that happen to get accepted, an effective tolerance in the previous sense is implied or induced. For example, if we say that we will accept 0.02 of 100000 prior samples into the posterior, then the 2000 samples out of the 100000 closest in summary statistic distance will constitute our sample from the posterior, irrespective of the actual absolute distances.
The tolerance in the sense discussed previously will then be given by the largest distance between the summary statistics of the 2000 samples in the posterior and the summary statistics of the observed data.
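This rank-and-keep scheme is easy to mimic. Here is a small NumPy sketch (my own illustration, not msBayes code) that returns both the accepted sample indices and the implied absolute tolerance just described, i.e. the largest accepted distance:

```python
import numpy as np

def reject_by_proportion(stats, obs_stats, proportion):
    """msReject-style rejection: keep the fraction `proportion` of prior
    samples whose summary statistics are closest (in Euclidean distance)
    to the observed values, regardless of the absolute distances."""
    dist = np.sqrt(((stats - obs_stats) ** 2).sum(axis=1))
    n_keep = max(1, int(round(proportion * len(dist))))
    idx = np.argsort(dist)[:n_keep]          # indices of the closest samples
    implied_tolerance = dist[idx].max()      # largest accepted distance
    return idx, implied_tolerance
```

Note that the implied tolerance is a by-product of the prior sample, not a setting: rerunning with a different prior sample of the same size will generally induce a different absolute tolerance.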

So where does this all leave us with regard to the choice of tolerances? All I can offer here are some general habits that I practice, based on my experience and understanding, which may or may not be applicable or appropriate in the context of your analyses (in other words, YMMV). I think that the most important guideline is to have a posterior sample size of at least 1000. I personally would not be comfortable with fewer than a million samples from the prior, and prefer to work with prior sizes of 10 million or greater. Thus, I generally tend to set my "tolerance" (sensu msBayes) on the order of 1e-4 to 1e-5 or so, and ensure that my prior sample size is large enough that this choice of "tolerance" results in posteriors of 1000 samples or more.
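Since the msBayes "tolerance" is a proportion, the prior size needed for a target posterior size is simple arithmetic. A trivial helper (hypothetical, just to make the bookkeeping explicit):

```python
import math

def min_prior_size(target_posterior, proportion):
    """Smallest prior sample size such that accepting the given
    proportion of samples yields at least `target_posterior`
    posterior samples."""
    return math.ceil(target_posterior / proportion)
```

For example, a target of 1000 posterior samples at a "tolerance" of 1e-4 implies 10 million prior samples, which matches the habits described above.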

In the next post, I will provide an R script based on one that Mike Hickerson kindly sent to me to help carry out the local linear regression-based weighting of your posterior values, and discuss how to carry out this procedure.

References:


@article{beaumont2002approximate,
title={{Approximate Bayesian computation in population genetics}},
author={Beaumont, M.A. and Zhang, W. and Balding, D.J.},
journal={Genetics},
volume={162},
number={4},
pages={2025--2035},
year={2002},
publisher={Genetics Soc America},
url={http://www.genetics.org/cgi/content/full/162/4/2025}
}

@article{hickerson2006test,
title={{Test for simultaneous divergence using approximate Bayesian computation}},
author={Hickerson, M.J. and Stahl, E.A. and Lessios, H.A. and Crandall, K.},
journal={Evolution},
volume={60},
number={12},
pages={2435--2453},
year={2006},
publisher={BioOne}
}

@article{hickerson2007msbayes,
title={{msBayes: Pipeline for testing comparative phylogeographic histories using hierarchical approximate Bayesian computation}},
author={Hickerson, M.J. and Stahl, E. and Takebayashi, N.},
journal={BMC bioinformatics},
volume={8},
number={1},
pages={268},
year={2007},
publisher={BioMed Central Ltd}
}

Sunday, September 20, 2009

msBayes Basics III: Approaches to Sampling from the Posterior

This is the third of a multi-part series on msBayes basics. Here I will provide a brief overview of the ways you can carry out rejection sampling on your simulations from the prior using the programs distributed with msBayes. Later posts will describe the procedure in detail.

In the previous step, you have accumulated a large number of parameter values sampled from the prior. Each sample of parameter values is associated with a set of summary statistics calculated on data simulated using those parameter values. In the current step, a specified proportion of these samples with summary statistics that are closest to the summary statistics calculated on the observed data will be retained, while the other samples will be rejected. Once you have completed this, you will have transformed your samples from the prior into samples from the posterior. It should be noted that this approach of generating posterior samples from the prior is only one variant of many, and is the approach adopted by the programs supplied with msBayes. An excellent summary of some alternative approaches, along with references, can be seen here.

I realize that up to this point I have not discussed the actual simulation model of msBayes at all, nor the summary statistics. I definitely think it would be interesting and useful to delve into the former at some point, but for now I am going to simply refer you to the papers that describe the models in detail, here and here, as well as just include a figure illustrating the model (as given in the original paper):



The summary statistics and, just as importantly, the metric used to calculate distances between different sets of summary statistics are also definitely worthy of discussion. In fact, improvement of existing summary statistics and development of new generalized ones, along with corresponding distance metrics, are where I think (and hope!) we will see the most exciting progress in Approximate Bayesian Computation in phylogeographical analysis in the near future. Right now, msBayes employs a suite of over 100 traditional population genetic statistics (such as pi, Watterson's theta, Tajima's D, Shannon's Index, etc.) that have been tested in simulation studies as its core summary statistics. For its distance metric, msBayes uses the simple multivariate Euclidean distance between the summary statistics as a first step. For continuous parameter values, the posterior samples are then regressed onto the distance between the simulated and observed summary statistics.
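For concreteness, here is what the Euclidean distance between a simulated and an observed summary-statistic vector looks like in NumPy. The optional `scale` argument (dividing each statistic by, say, its standard deviation across the prior sample) is a common ABC normalization that I have added for illustration; the plain, unscaled distance is what is described above for msBayes:

```python
import numpy as np

def summary_distance(sim_stats, obs_stats, scale=None):
    """Euclidean distance between two summary-statistic vectors.
    If `scale` is given, each statistic is divided by its scale first,
    so that statistics on very different scales contribute comparably."""
    d = np.asarray(sim_stats, dtype=float) - np.asarray(obs_stats, dtype=float)
    if scale is not None:
        d = d / np.asarray(scale, dtype=float)
    return float(np.sqrt((d ** 2).sum()))
```

With over 100 statistics, the choice of whether and how to scale them can matter a great deal, which is part of why the distance metric is such a promising area for improvement.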

So, returning to our msBayes pipeline "walk-through", how do we go from our samples from the prior to samples from the posterior? We have a number of different options. msBayes itself provides two approaches, one using a Perl script, "acceptReject.pl", which drives an R script that carries out the business, and the other a C program, "msReject".

The Perl+R approach, using "acceptReject.pl", automatically does the local linear-regression weighting for you, and produces nice and fancy PDFs of the posterior probability distribution of the number of distinct divergence times. I do not recommend it, for two reasons. The first is that the R analysis is extremely slow and a phenomenal memory hog: I had to switch to 16 GB machines to process some files, and even then I was pushing the envelope. The second is that while the PDF plots of the posterior probability distribution of the number of divergence times are very nice and informative, I would far rather have the actual samples from the posterior. With these, not only is it possible to generate any of a wide variety of plots illustrating the posterior probability distribution of any recorded parameter value, I can also summarize the values in a number of other ways, or use the samples in other analyses.

The "msReject" approach carries out rejection sampling on the simulations from the prior using Euclidean distances. The program is extremely fast and has a relatively minuscule memory footprint: it routinely handles files of 50 million samples without any trouble. However, it does not carry out the local linear regression weighting of the posterior values, and you will have to do this yourself separately. Nonetheless, I highly recommend using this program over the Perl script.

In the next post, I will go over in detail how you generate summary statistics on your observed data, provide a Python script that regularizes the file so that it is compatible with the samples simulated from the prior, and provide another Python script that identifies the summary statistic columns in the file and calls on msReject with the appropriate options to correctly generate a sample from the posterior.

Saturday, September 19, 2009

msBayes Basics II: Data Assembly, Configuration File Preparation, and Generating Samples from the Prior

This is the second of a multi-part series on msBayes basics, and here I will discuss the preparation of the input files to the pipeline, the creation of the configuration file to control the simulations from the prior, as well as the running of the simulator itself.

Data assembly is probably one of the more tedious parts of the whole ABC pipeline, and I only provide a brief overview. Much more detailed and very helpful information can be found in the msBayes documentation.

The basic sequence data files themselves are pretty straightforward: for each species or taxon distributed across the barrier, you need to provide a single FASTA file with all the sequences listed in order (i.e., with sequences from population A listed first, followed by sequences from population B). That's the easy part. The part that requires a bit of work is collecting the information required to flesh out the configuration file. This includes a distinct sequence evolution model for each population pair. The msBayes simulator needs to generate realistic genetic sequence differences between individuals from each population under a standard molecular evolution model (HKY), and thus requires that you provide estimates of the transition-transversion rate ratio and nucleotide base frequencies for each population pair. So, for example, you might want to use GARLI to infer a maximum likelihood tree under an HKY model for the sequences of each population pair, or, alternatively, bring your favorite tree into PAUP* along with the sequences and ask PAUP* to provide ML estimates of the HKY parameters.

For the actual format of the master configuration file, please see the msBayes documentation, but basically, for each population pair, you need to provide the total number of sequences, the number of sequences from population A, the number of sequences from population B, the HKY parameters, the length of the sequence, and the path to the FASTA file.

I would highly recommend that you also go the extra step and embed the msBayes run parameters in the configuration file as well, rather than rely on the interactive queries. This is necessary if you want to run the simulations remotely on a cluster through a scheduler, but far more importantly, it provides a permanent record of the simulation settings so as to allow you to track down mistakes or reproduce the same conditions.

Here is an example of a complete simulation configuration file, where the run settings, hyper-priors, priors, and observed sequence data are all specified:


numLoci = 1
lowerTheta = 0.01
upperTheta = 75.0
upperTau = 10.0
numTauClasses = 0
upperMig = 0.0
upperRec = 0.0
upperAncPopSize = 0.5
constrain = 0
subParamConstrain = 000000000
reps = 5000000

BEGIN SAMPLE_TBL
100 7 93 3.71 2356 0.2 0.3 0.1 a1-a2.fas
16 4 12 4.08 2559 0.3 0.2 0.2 b1-b2.fas
10 5 5 4.08 1356 0.3 0.2 0.2 c1-c2.fas
29 8 21 4.09 2367 0.2 0.2 0.1 d1-d2.fas
11 2 9 4.13 1372 0.3 0.2 0.1 e1-d2.fas
69 38 31 4.13 1372 0.2 0.2 0.1 f1-f2.fas
END SAMPLE_TBL



Once you have assembled your data files and created the simulation configuration file, the next step is to actually run the msBayes simulator to generate samples from the prior. This can be as simple as:



$ msbayes.pl -c msbayes_master -o prior_samples


This calls on "msbayes.pl", a Perl script that is part of the msBayes distribution, passing it our master configuration file "msbayes_master" as an argument and asking it to direct its results to a file called "prior_samples".

"msbayes.pl" is the primary driver for the simulations from the prior. It wraps up a whole bunch of complexity: generation of hyper-parameters from their prior distribution, generation of the actual model parameters to be passed to the underlying coalescent-based sequence simulator (using "msprior"), calling the coalescent-based sequence simulator ("msDQH") to simulate a sample of summary statistics based on the generated model parameters, assembly of a row of data that couples the set of generated parameter values with the corresponding summary statistics simulated based on these values, and the output of this row as a tab-delimited line of data. This gets repeated again and again, for the number of replicates specified in the master configuration file ("reps = 5000000" in the example file above), to build up a large sample drawn from the parameter space according to the given prior distributions.

All this is a lot of work, but is extremely computationally efficient: the CPU and memory footprint of a single process running is almost negligible, and you can run multiple simultaneous simulations without much overhead even on a single middle-of-the-road machine. Which leads to my recommendation here: run multiple processes to generate multiple files with samples from the prior simultaneously, concatenating these files before going to the next step in the ABC pipeline. So, for example, if you want 10 million samples from the prior, you can run 10 simultaneous processes, each drawing 1 million samples from the prior; concatenating the files then gives you 10 million samples in 1/10th the time.

Once you have your samples from the prior, the next step is to carry out rejection sampling on this data to provide samples from the posterior. And this is where the going gets pretty interesting ...

You see, to be honest, I do not think my posts describing the procedures up to this stage have really provided much more new or useful information than already existed in the documentation. I think that the operation of the msBayes ABC pipeline up to this point has been smooth and relatively trouble-free, and has proceeded as expected based on the documentation and help files. However, going beyond the generation of samples from the prior is where I ran into a number of difficulties, gotchas, glitches, and trip-me-ups. My next post will discuss these issues, share my experiences, lessons learned, and solutions, as well as offer some scripts that I wrote to patch up the "leaks" and unexpected incompatibilities between different parts of the pipeline.

msBayes Basics I: Introduction

I've been working quite a bit with msBayes recently, and thought I would share some of the stuff that I've learnt, with an emphasis on the practical side of things. Over a series of posts, I am going to summarize the steps required to carry out a basic msBayes analysis, provide some "lessons learned" and tips for each of these steps, as well as provide some supplemental Python scripts that I find greatly facilitate the flow of data from one stage of the pipeline to another. This first post serves as a general introduction to the approach and the program, and a general overview of the procedures involved in an msBayes analysis.

Most of you are probably already quite familiar with this method of analysis, but for those of you that are not, perhaps a brief explanation may be in order. msBayes is a suite of tools that plug into a pipeline for Approximate Bayesian Computation (ABC) phylogeographic analysis, estimation and hypothesis testing. The ABC approach is a computationally efficient approach to complex estimation and inference problems. At its most basic, it works as follows:
  1. A set of summary statistics is calculated on an observed data set.
  2. Multiple sets of data are simulated using parameters drawn from a prior distribution.
  3. The same set of summary statistics is calculated for each of these simulated data sets.
  4. The parameters that result in simulated data sets with summary statistics that are equal to or within some threshold distance of the summary statistics of the observed data set are recorded.
  5. The posterior probability of the parameter values is given by the proportion of those values in the final set of accepted parameters.
Each of these steps involves multiple levels of complexity and a smorgasbord of variant approaches, of course, from the full nuts-and-bolts of the simulation model to various flavours of importance/rejection sampling (step 4), but the core concept remains the same: simulate the data under a range of parameter values, and keep those parameter values that result in simulated data sufficiently close to the observed data. This approach is extremely attractive: for many complex problems and questions, data can be simulated relatively easily, whereas sometimes even formulating the corresponding likelihood statement can be pretty challenging. Even if the likelihood can be formulated, computing it tends to be extremely expensive in terms of CPU power, time, and memory, while simulating data is relatively cheap as far as those resources go. However, be warned: the ABC approach does exact a cost in storage space; all those samples from the prior have to be retained until they can be processed, and I have had analyses that walk into the realm of 50-60 GB.
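To make the five steps above concrete, here is a self-contained toy example (entirely my own, unrelated to msBayes' coalescent model): ABC rejection to estimate the mean of a normal distribution, with the sample mean as the single summary statistic and a fixed absolute tolerance.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Observed" data: 50 draws from a normal with (unknown, to us) mean 2
observed = rng.normal(2.0, 1.0, size=50)
s_obs = observed.mean()                     # step 1: summary statistic

accepted = []
for _ in range(20000):                      # step 2: simulate from the prior
    mu = rng.uniform(-5.0, 5.0)             # prior on the parameter
    sim = rng.normal(mu, 1.0, size=50)
    s_sim = sim.mean()                      # step 3: same summary statistic
    if abs(s_sim - s_obs) < 0.1:            # step 4: accept within tolerance
        accepted.append(mu)

# step 5: the accepted values approximate the posterior
posterior_mean = np.mean(accepted)
```

The accepted values of `mu` cluster tightly around the observed sample mean, which is exactly the behavior the full msBayes pipeline exploits, with a far richer model and many more summary statistics.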

msBayes can be used to answer a broad variety of questions using ABC, but a major one, and the one which I am currently using it for, is to test the hypothesis of simultaneous divergence amongst a set of taxa co-distributed across a geographical barrier. Specifically, I am asking "given a bunch of sequences of pairs of populations of different species on either side of a barrier, what is the posterior probability of a single distinct divergence time shared by all the pairs?"

Using msBayes to answer this question involves the following steps:
  1. Assemble the observed data files.
  2. Create the configuration file to drive the simulations from the prior.
  3. Run the simulator to produce a large number of samples from the prior.
  4. Calculate the summary statistics on the observed data.
  5. Regularize the columns / fields in the summary statistics file and the simulated data.
  6. Carry out the acceptance/rejection sampling on the prior samples to produce samples from the posterior.
  7. Summarize the posterior probability distribution of discrete parameter values (such as number of divergence times) by assessing the proportion of those values in the posterior samples.
  8. Weight posterior samples of continuous parameters using local linear regression before summarizing their distribution.
Each of these steps will be covered in detail in future posts, and I will also be providing some scripts that I found really useful in helping to carry out some of these steps.