Wednesday, September 30, 2009

msBayes Basics V: Rejection Sampling to Extract the Posteriors from Your Priors

This is the fifth of a multi-part series on msBayes basics. Here I will describe how to use programs distributed as part of the msBayes package to carry out rejection sampling on your simulations from the prior, and supply a Python script that makes running this step a little simpler and less error-prone.

At this stage of the proceedings, you will have a large collection of samples simulated from the prior, in the form of a file of tab-delimited values. The first row will consist of column or field names, and each subsequent row will consist of two sets of fields that describe each sample: a set of parameters drawn from their prior distribution, and a set of summary statistics calculated on data simulated using those parameters. You should also have a file of summary statistics calculated on your observed data. Using the Python script to regularize columns between files, as discussed in a previous post, the columns in both the simulated and observed summary statistic files should now be compatible.
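As a quick sanity check before running the rejection step, it can help to peek at the header row of these tab-delimited files. The following is a minimal sketch (the file path "priors.txt" is just an example, and msBayes itself does not require this step):

```python
def read_column_names(path):
    """Return the column names from the first (header) row of a
    tab-delimited file, such as an msBayes prior-samples file."""
    with open(path) as f:
        header = f.readline().rstrip("\n")
    return header.split("\t")
```

Printing the result for both the simulated and observed summary statistic files lets you confirm by eye that the columns line up after regularization.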

The next step is to run "msReject" on your samples from the prior. The call to this program requires that you specify, in order:
  • the path to your summary statistics file
  • the path to your file of samples from the prior
  • the sampling tolerance
  • a list of columns (1-based column index numbers) which identify the summary statistic columns
For example, the following:
$ msReject obs_sum_stats.txt priors.txt 0.2 6 7 8 9 10 11 12
asks "msReject" to carry out rejection sampling on the data in "priors.txt", using the summary statistics given in "obs_sum_stats.txt" as the observed summary statistics, and a tolerance of "0.2", with columns 6 through 12 taken to be summary statistic columns (and the remaining columns assumed to be prior/parameter columns).

Specifying the summary statistic columns by hand when invoking "msReject" can be a little tedious, not to mention error-prone. To this end, I've written a Python script that automatically identifies the summary statistic columns in the file and composes the proper call to "msReject". By default it prints the command to the standard output, to show you what it is doing, and then invokes the command itself. Alternatively, it can be run in a "no execute" mode, where it just prints the correct command and exits, so that you can copy the command and use it in your own scripts.

The following example shows it in action, running "msReject" with a tolerance of 0.01 and directing the results to the file "posterior_accepted_samples.txt":

$ run-reject.py obs_sum_stats priors1.txt 0.01 > posterior_accepted_samples.txt
Following columns identified as PARAMETERS:
[1]:"PRI.numTauClass"
[2]:"PRI.Psi"
[3]:"PRI.var.t"
[4]:"PRI.E.t"
[5]:"PRI.omega"

Following columns identified as SUMMARY STATISTICS:
[6]:"pi.b.1"
[7]:"pi.b.2"
.
.
(etc.)
.
.
[106]:"ShannonsIndex.Pop2.5"
[107]:"ShannonsIndex.Pop2.6"


Command to be executed:
msReject obs_sum_stats priors1.txt 0.01 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82
83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
101 102 103 104 105 106 107

Running with PID: 10927.
Completed with exit status: 0.
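The core of such a script can be sketched as follows. This is an illustrative reimplementation, not the script itself; it assumes (as in the msBayes output shown above) that parameter columns are prefixed with "PRI." and that everything else is a summary statistic:

```python
def build_msreject_command(obs_path, priors_path, tolerance):
    """Compose an msReject invocation, identifying summary statistic
    columns as those whose names do not start with "PRI."."""
    with open(priors_path) as f:
        columns = f.readline().rstrip("\n").split("\t")
    # msReject expects 1-based column index numbers
    stat_cols = [str(i + 1) for i, name in enumerate(columns)
                 if not name.startswith("PRI.")]
    return ["msReject", obs_path, priors_path, str(tolerance)] + stat_cols
```

The returned list can be printed for inspection, or passed to something like subprocess.call() to execute it directly.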

Once this process is done, you will have effectively sampled the posterior probability distribution of your parameters. The posterior probability of your model is approximately the proportion of times it has been sampled in the posterior distribution. So, for example, the proportion of samples with 1 divergence time gives the approximate posterior probability of a single distinct divergence time between every population pair in your dataset. I have a small Python script that I use to calculate the proportions of the various parameters, but you can also use R or Perl or what-have-you.
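The proportion calculation itself is simple. Here is a minimal sketch; the column name "PRI.Psi" (the number of divergence times) is an assumption based on the msBayes output shown earlier:

```python
from collections import Counter

def posterior_proportions(path, column_name):
    """Return {value: proportion} for one column of a tab-delimited
    file of accepted posterior samples (e.g. "PRI.Psi")."""
    with open(path) as f:
        columns = f.readline().rstrip("\n").split("\t")
        idx = columns.index(column_name)
        counts = Counter(line.rstrip("\n").split("\t")[idx] for line in f)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}
```

Calling this on "posterior_accepted_samples.txt" with "PRI.Psi" would give the approximate posterior probability of each number of divergence times.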

One issue that I have not discussed in detail is the choice of tolerance and how it affects the analysis. As it turns out, your analysis can be quite sensitive to the choice of tolerance, and there are steps that you can take to mitigate this. I will discuss the issue of tolerances in future posts, as well as provide an R script that weights the posterior probability of your parameters to reduce their sensitivity to your choice of tolerance.
