Thursday, October 1, 2009

msBayes Basics Part VI: Tolerances and Local-Linear Regression Weighting of the Posteriors

This is the sixth of a multi-part series on msBayes basics. In the previous part, I discussed how you might run "msReject" to carry out rejection sampling on your samples from the prior, to obtain samples from the posterior. One crucial setting for this previous step was the choice of tolerance, and this is something I discuss in detail here.

First, let us consider why tolerances are needed at all. In general, the more summary statistics we use, the better our ability to discriminate between models and to approximate the posterior accurately (provided we are reasonable about which summary statistics we use). At the same time, however, it becomes more and more difficult to simulate a sample from the prior whose summary statistics exactly equal those calculated on our observed data. Variation in the nuisance parameters that we are trying to integrate out, as well as stochasticity in the processes that we are trying to model, make it difficult to generate a sample with identical summary statistics even under a perfect (correct) model, and even with only a few summary statistics. So we allow some wiggle room: we accept any sample from the prior whose summary statistics come close to those of the observed data, even if they are not exactly equal. This wiggle room is what we call the tolerance. The more summary statistics we use, the larger the tolerance we will need in order to accept simulations at a reasonable rate.
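
To make this concrete, here is a bare-bones sketch in R of rejection using such a distance-based tolerance. This is not msBayes code; the names "prior.params", "prior.stats", and "obs.stats" are simply placeholders for a matrix of simulated parameter values, the matrix of their corresponding summary statistics, and the vector of summary statistics calculated on the observed data:

    ## Minimal sketch of distance-tolerance rejection sampling (not msBayes code).
    ## prior.params: matrix of simulated parameter values (one row per prior sample)
    ## prior.stats:  matrix of the corresponding summary statistics
    ## obs.stats:    vector of summary statistics calculated on the observed data
    reject.by.distance <- function(prior.params, prior.stats, obs.stats, tolerance) {
        # Euclidean distance between each simulated sample's summary statistics
        # and the observed summary statistics
        dists <- sqrt(rowSums(sweep(prior.stats, 2, obs.stats)^2))
        accepted <- dists <= tolerance   # keep anything within the tolerance
        list(params = prior.params[accepted, , drop = FALSE],
             stats = prior.stats[accepted, , drop = FALSE],
             dists = dists[accepted])
    }

(In practice, the summary statistics are usually rescaled or standardized before the distances are calculated, so that no single statistic dominates the Euclidean distance.)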

However, as the tolerance gets larger and larger, we introduce more and more distortion into our approximation. This is because all samples accepted within the tolerance zone are weighted equally in the posterior: a sample far from the observed data in terms of summary-statistic distance contributes just as much to the posterior as one very close to it, as long as both fall within the tolerance zone. In the limit, as our tolerance goes to infinity, we accept every sample from the prior, and thus the bias introduced by large tolerances is to pull our posteriors toward the prior.

As it happens, the work of Beaumont et al. (2002) provides a solution to this problem. They use local-linear regression to weight the posterior samples so that samples with summary statistics closer to the observed summary statistics are given greater weight than those further away. With this adjustment, the analysis becomes relatively insensitive to the choice of tolerance, as long as the number of posterior samples is large enough to allow an accurate regression. This is vividly illustrated in the data below, which show the results of rejection sampling at different tolerances, comparing the mean of the raw (unweighted) posterior distribution of a parameter with the mean of the regression-weighted posterior distribution of the same parameter:

Tolerance   Posterior size   Unweighted mean      Weighted mean
1e-05       500              0.148147697394790    0.188623205810913
2e-05       1000             0.178475544544545    0.138229444728950
1e-04       5000             0.217240549309862    0.126658030148325
2e-04       10000            0.235878721472147    0.130784297340281
0.001       50000            0.308834221344427    0.123652684056336
0.002       100000           0.355224938539385    0.123306496530203

As can be seen, both the range and the variance of the regression-weighted means across the different tolerances are an order of magnitude smaller than those of the unweighted means (range: 0.0653167 vs. 0.2070772; variance: 0.0006326896 vs. 0.006153922). The posterior sample size under the first tolerance setting, 1e-05, is only 500, and this small size probably led to a relatively inaccurate regression, as is evidenced by the mean of the weighted distribution for that setting lying rather far (> 7 s.d. units) from all the other values. Considering posterior sizes of 1000 or more, the variance of the means of the weighted distributions vs. the unweighted is 3.843462e-05 vs. 0.005126309, and, regardless of our choice of tolerance, we can safely place the mean between 0.12 and 0.14 (a spread of 0.02), while the mean for the unweighted parameter ranges from 0.18 to 0.36 (a spread of 0.18).
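
For the record, the figures quoted above can be recomputed directly from the table with a few lines of R:

    ## Unweighted and regression-weighted posterior means from the table above
    unweighted <- c(0.148147697394790, 0.178475544544545, 0.217240549309862,
                    0.235878721472147, 0.308834221344427, 0.355224938539385)
    weighted   <- c(0.188623205810913, 0.138229444728950, 0.126658030148325,
                    0.130784297340281, 0.123652684056336, 0.123306496530203)
    diff(range(weighted))     # range of weighted means:      ~0.0653
    diff(range(unweighted))   # range of unweighted means:    ~0.2071
    var(weighted)             # variance of weighted means:   ~0.000633
    var(unweighted)           # variance of unweighted means: ~0.00615
    var(weighted[-1])         # dropping the 500-sample posterior: ~3.84e-05
    var(unweighted[-1])       #                                    ~0.00513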

This informal analysis demonstrates that the local-linear regression weighting approach of Beaumont et al. is remarkably effective in making the estimates insensitive to our choice of tolerance, by correcting for the bias toward the prior that large tolerances introduce. This allows us to use larger tolerances in our analyses, which in turn allows us to use more summary statistics and still collect samples from the posterior at a reasonable rate. Of course, the regression weighting only applies to estimates of continuous parameter values; it is not suitable for model selection (such as choosing among numbers of divergence times). In the case of the number of divergence times, the best approach might be to use the regression to weight the estimates of the variance and mean of the divergence times in your data (Var(t) and E(t)), and use their ratio (Omega = Var(t)/E(t)) to assess support for a single divergence time.
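
For those curious about the mechanics of the weighting, here is a bare-bones sketch in R of the local-linear regression adjustment for a single continuous parameter, in the spirit of Beaumont et al. (2002). Again, this is not msBayes code, nor the script I will post next time; "post.params", "post.stats", and "obs.stats" are placeholders for the accepted parameter values, their summary statistics, and the observed summary statistics:

    ## Bare-bones sketch of the Beaumont et al. (2002) local-linear regression
    ## adjustment for a single continuous parameter (placeholder names, not
    ## msBayes code).
    regression.adjust <- function(post.params, post.stats, obs.stats) {
        x <- sweep(post.stats, 2, obs.stats)       # stats centered on the observed
        dists <- sqrt(rowSums(x^2))                # distances to the observed data
        delta <- max(dists)                        # the induced distance-tolerance
        w <- 1 - (dists / delta)^2                 # Epanechnikov kernel weights
        fit <- lm(post.params ~ x, weights = w)    # weighted local-linear regression
        beta <- coef(fit)[-1]                      # slopes (drop the intercept)
        # shift each accepted value to what it would have been had its summary
        # statistics matched the observed summary statistics exactly
        adjusted <- post.params - as.vector(x %*% beta)
        list(adjusted = adjusted, weights = w,
             mean = weighted.mean(adjusted, w))
    }

In practice, parameters are often log- or logit-transformed before the regression so that the adjusted values stay within their natural bounds, and then back-transformed afterwards.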

The story does not quite end here, however. A further complication lies in the way that msBayes carries out the rejection sampling. In the discussion so far, we have referred to the process of accepting samples from the prior that come within some specified distance (the tolerance) of the observed data. One would imagine carrying out simulations from the prior and, for each simulation, accepting the sample into the posterior only if the Euclidean distance between the summary statistics of the simulated data and those of the observed data falls within some specified tolerance. In this scheme, we would run the simulations from the prior continuously, building our sample from the posterior as we go along, until it reaches some reasonable size.

However, in implementing the rejection sampling, msBayes does things a little differently. As has been covered in painful detail, we generate a prior sample of a pre-specified size before we even begin the rejection sampling. The samples are then ranked by the Euclidean distance between their summary statistics and the summary statistics of the observed data, and a pre-specified proportion of those samples closest to the observed data is accepted into the posterior. It is confusing that this proportion is also called a tolerance, because it is not the same tolerance as discussed above. Rather, depending on the summary-statistic values of the specific samples that happen to be accepted, an effective tolerance in the previous sense is implied or induced. For example, if we say that we will accept 0.02 of 100000 prior samples into the posterior, then the 2000 samples (out of the 100000) that are closest in summary-statistic distance will constitute our sample from the posterior, irrespective of their actual absolute distances. The tolerance in the sense discussed previously is then given by the largest distance between the summary statistics of those 2000 posterior samples and the summary statistics of the observed data.
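
A bare-bones sketch of this proportion-based acceptance (again with placeholder names, not actual msBayes code) makes the distinction clear; note that the implied distance-tolerance is simply the largest distance among the accepted samples:

    ## Sketch of msBayes-style acceptance by proportion rather than by distance.
    ## prior.params, prior.stats, and obs.stats are placeholders as before.
    reject.by.proportion <- function(prior.params, prior.stats, obs.stats, proportion) {
        dists <- sqrt(rowSums(sweep(prior.stats, 2, obs.stats)^2))
        n.keep <- ceiling(proportion * nrow(prior.stats))
        keep <- order(dists)[seq_len(n.keep)]       # indices of the closest samples
        list(params = prior.params[keep, , drop = FALSE],
             stats = prior.stats[keep, , drop = FALSE],
             induced.tolerance = max(dists[keep]))  # the implied distance-tolerance
    }

With the example above, calling reject.by.proportion(prior.params, prior.stats, obs.stats, 0.02) on 100000 prior samples would keep the 2000 closest samples, whatever their absolute distances from the observed data happen to be.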

So where does this all leave us with regard to the choice of tolerances? All I can offer here are some general habits that I practice, based on my experience and understanding, which may or may not be applicable or appropriate in the context of your analyses (in other words, YMMV). I think the most important guideline is to have a posterior sample size of at least 1000. I personally would not be comfortable with fewer than a million samples from the prior, and prefer to work with prior sizes of 10 million or greater. Thus, I generally set my "tolerance" (sensu msBayes) on the order of 1e-4 to 1e-5, and ensure that my prior sample size is large enough that this choice of "tolerance" results in a posterior of 1000 samples or more.
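
To put some numbers on this: with a prior of 10 million samples, a "tolerance" (sensu msBayes) of 1e-4 yields 10,000,000 × 1e-4 = 1,000 posterior samples, while 1e-5 yields only 100; to keep a posterior of 1,000 samples at a "tolerance" of 1e-5, the prior would need to be 100 million samples.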

In the next post, I will provide an R script, based on one that Mike Hickerson kindly sent me, for carrying out the local-linear regression weighting of your posterior values, and I will discuss how to use it.

References:


@article{beaumont2002approximate,
title={{Approximate Bayesian computation in population genetics}},
author={Beaumont, M.A. and Zhang, W. and Balding, D.J.},
journal={Genetics},
volume={162},
number={4},
pages={2025--2035},
year={2002},
publisher={Genetics Soc America},
url={http://www.genetics.org/cgi/content/full/162/4/2025}
}

@article{hickerson2006test,
title={{Test for simultaneous divergence using approximate Bayesian computation}},
author={Hickerson, M.J. and Stahl, E.A. and Lessios, H.A. and Crandall, K.},
journal={Evolution},
volume={60},
number={12},
pages={2435--2453},
year={2006},
publisher={BioOne}
}

@article{hickerson2007msbayes,
title={{msBayes: Pipeline for testing comparative phylogeographic histories using hierarchical approximate Bayesian computation}},
author={Hickerson, M.J. and Stahl, E. and Takebayashi, N.},
journal={BMC Bioinformatics},
volume={8},
number={1},
pages={268},
year={2007},
publisher={BioMed Central Ltd}
}

6 comments:

  1. Is it possible that by generating such a large prior and using a very small tolerance you're actually ending up with a point estimate of the truth, rather than a full posterior distribution that includes some of the landscape which is relatively far from the best estimate, but still plausible?

  2. Hi Jake,

    I suppose it is possible, but I do not think that this is happening here, as my 50 million sample prior consists of 10 independent 5 million sample simulations combined, and each of those individual 5 million sample priors results in comparable posterior estimates. I find it difficult to see having too much data as a bad thing. I guess one thing to try might be to see how the results change as we hold our posterior sample size fixed but keep increasing the prior (and decreasing the tolerance to maintain the same posterior sample size), and see if the results converge on the MLE.

  3. Having a big prior and a big posterior are very different things. As you say, it's the size of the tolerance in relation to these that could cause a point estimate, I suppose. But, it may be that you're dealing with so many summary statistics that it would be impossible to keep hitting a point estimate anyway.

  4. Jake,

    Look at equation (1) of the Beaumont 2002 paper, and the text just preceding. In the ideal case, you will be accepting samples from the prior in proportion to their likelihood. No matter how small your tolerance is (in the proper sense of the maximum distance threshold between summary statistics), as long as you sample the posterior sufficiently (i.e., your posterior size is big enough), then it does not matter how large your prior sample sizes are.

    Obviously there is a relationship between all these components. If your tolerance is very small, then you will need a very large sample from the prior for you to get a decent sample from the posterior. Nothing surprising or controversial there. My point is as long as your posterior is large enough (e.g., > 1000), then I do not see a problem with huge prior sample sizes.

    To put it in more concrete terms, imagine two sets of samples from the prior: a 10 million sample prior vs. a 1000 million sample prior. I am suggesting that, given good summary statistics, if you were to extract a 1000-sample posterior from *both*, the posterior extracted from the 1000 million sample prior is not going to be more of a "point" estimate, or less representative of the full distribution, than the posterior extracted from the 10 million sample prior. In both, you would expect the models and parameters to be sampled in proportion to their (weighted) likelihoods.

  5. It occurs to me that we may be using the term "tolerance" in different senses. I am using it in the sense of the "maximum distance between the summary statistics of the prior sample and the summary statistics of the data", i.e., the "distance-tolerance". What I said above holds for the tolerance in this sense. If you are using it in the sense of the "proportion of prior samples to accept", i.e., the "proportion-tolerance", then this implies a particular distance-tolerance. As your prior sample size gets larger, the same proportion-tolerance will probably induce or imply smaller and smaller distance-tolerances. At this point, whether or not your posterior estimates reduce to an MLE point estimate will depend on the qualities of your summary statistics: how well they discriminate between your models while at the same time ignoring the noise from your nuisance parameters.

  6. Yes, I was thinking of the 'proportion tolerance'.
