Wednesday, September 30, 2009

msBayes Basics V: Rejection Sampling to Extract the Posteriors from Your Priors

This is the fifth of a multi-part series on msBayes basics. Here I will describe how to use programs distributed as part of the msBayes package to carry out rejection sampling on your simulations from the prior, and supply a Python script that makes the practical running of this step simpler and less error-prone.

At this stage of the proceedings, you will have a large collection of samples simulated from the prior, in the form of a file of tab-delimited values. The first row will consist of column or field names, and each subsequent row will consist of two sets of fields that describe each sample: a set of parameters drawn from their prior distribution, and a set of summary statistics calculated on data simulated using those parameters. You should also have a file of summary statistics calculated on your observed data. Using the Python script to regularize columns between files, as discussed in a previous post, the columns in both the simulated and observed summary statistic files should now be compatible.

The next step is now to run "msReject" on your samples from the prior. The call to this program requires that you specify, in order:
  • the path to your summary statistics file
  • the path to your file of samples from the prior
  • the sampling tolerance
  • a list of columns (1-based column index numbers) which identify the summary statistic columns
For example, the following:
$ msReject obs_sum_stats.txt priors.txt 0.2 6 7 8 9 10 11 12
asks "msReject" to carry out rejection sampling on the data in "priors.txt", using the summary statistics given in "obs_sum_stats.txt" as the observed summary statistics, and a tolerance of "0.2", with columns 6 through 12 taken to be summary statistic columns (and the remaining columns assumed to be prior/parameter columns).
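The rejection criterion itself is simple: rank the prior samples by the Euclidean distance between their summary statistics and the observed ones, and retain the closest fraction of them (the tolerance). Here is a minimal Python sketch of the idea; this is an illustration of the logic, not the actual "msReject" code, and the function name and signature are my own invention:

```python
# Illustrative sketch of proportion-based rejection sampling
# (NOT the msReject implementation).
import math

def reject(obs_stats, prior_rows, stat_cols, tolerance):
    """obs_stats: observed summary statistic values.
    prior_rows: rows (lists of floats) from the prior samples file.
    stat_cols: 0-based indexes of the summary statistic columns.
    tolerance: proportion of samples to accept (e.g. 0.2)."""
    def distance(row):
        # Euclidean distance between this row's summary statistics
        # and the observed summary statistics
        return math.sqrt(sum((row[i] - o) ** 2
                             for i, o in zip(stat_cols, obs_stats)))
    ranked = sorted(prior_rows, key=distance)
    n_keep = int(len(prior_rows) * tolerance)
    return ranked[:n_keep]
```

The retained rows, parameters and all, constitute your (unweighted) sample from the posterior.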

Needing to specify the summary statistic columns by hand when invoking "msReject" can be a little bit tedious, not to mention error-prone. To this end, I've written a Python script that automatically identifies the summary statistic columns in the file and composes the proper call to "msReject". By default it prints the command to the standard output, to show you what it is doing, and then invokes the command itself. Alternatively, it can be run in a "no execute" mode, where it just prints the correct command and exits, so that you can copy the command and use it in your own scripts.
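The core of the column-identification logic can be sketched in a few lines, since (as discussed in an earlier post) parameter columns in msBayes output carry a leading "PRI." in their names, and everything else is a summary statistic. This is a simplified sketch, not the actual script, and the function name is hypothetical:

```python
# Sketch: identify summary statistic columns by the absence of the
# "PRI." prefix used by msBayes for parameter columns, and compose
# the corresponding msReject invocation. (Not the actual run-reject.py.)

def compose_msreject_command(obs_path, priors_path, tolerance):
    with open(priors_path) as f:
        header = f.readline().split()
    # 1-based column indexes of the summary statistic columns
    stat_cols = [str(i) for i, name in enumerate(header, start=1)
                 if not name.startswith("PRI.")]
    return " ".join(["msReject", obs_path, priors_path,
                     str(tolerance)] + stat_cols)
```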

The following example shows it in action, running "msReject" with a tolerance of 0.01 and directing the results to the file "posterior_accepted_samples.txt":

$ run-reject.py obs_sum_stats priors1.txt 0.01 > posterior_accepted_samples.txt
Following columns identified as PARAMETERS:
[1]:"PRI.numTauClass"
[2]:"PRI.Psi"
[3]:"PRI.var.t"
[4]:"PRI.E.t"
[5]:"PRI.omega"

Following columns identified as SUMMARY STATISTICS:
[6]:"pi.b.1"
[7]:"pi.b.2"
.
.
(etc.)
.
.
[106]:"ShannonsIndex.Pop2.5"
[107]:"ShannonsIndex.Pop2.6"


Command to be executed:
msReject obs_sum_stats priors1.txt 0.01 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82
83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
101 102 103 104 105 106 107

Running with PID: 10927.
Completed with exit status: 0.

Once this process is done, you will have effectively sampled the posterior probability distribution of your parameters. The posterior probability of your model is approximately the proportion of times it has been sampled in the posterior distribution. So, for example, the proportion of samples with 1 divergence time gives the approximate posterior probability of a single distinct divergence time between every population pair in your dataset. I have a small Python script that I use to calculate the proportions of the various parameters, but you can also use R or Perl or what-have-you.
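Such a summarizing script need not be anything fancy. Here is a sketch of the basic tallying, assuming the tab-delimited layout with a header row described above; the function name is my own, and "PRI.Psi" (the number of divergence times) is taken as the example column:

```python
# Sketch: posterior probability of each value of a discrete parameter,
# as the proportion of accepted samples carrying that value.
from collections import Counter

def posterior_proportions(path, column="PRI.Psi"):
    with open(path) as f:
        header = f.readline().rstrip("\n").split("\t")
        idx = header.index(column)
        counts = Counter(line.rstrip("\n").split("\t")[idx]
                         for line in f if line.strip())
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}
```

For example, if half the accepted samples have "PRI.Psi" equal to 1, the approximate posterior probability of a single divergence time is 0.5.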

One issue that I have not discussed in detail is the choice of tolerance and how it affects the analysis. As it turns out, your analysis can be quite sensitive to the choice of tolerance, and there are steps that you can take to correct for this. I will discuss the issue of tolerances in future posts, as well as provide an R script that weights the posterior probability of your parameters to reduce their sensitivity to your choice of tolerance.

Sunday, September 20, 2009

msBayes Basics IV: Preparing Your Files for Sampling from the Posterior

This is the fourth of a multi-part series on msBayes basics. Here I will discuss some glitches/incompatibilities in the msBayes pipeline that you have to watch out for when going into the rejection sampling stage. I will also present a Python script that fixes the incompatibilities and makes the process easier.

The rejection sampling step requires two files: samples from the prior and summary statistics calculated on the observed data.

Our samples from the prior as generated by the script "msBayes.pl" are given as a file of tab-delimited values, with the first row of the file giving the field names. The first several fields consist of the parameter values drawn from the prior, while the remaining fields are the summary statistics calculated on data simulated with those parameter values. Parameter fields can be identified on the leading "PRI." associated with their names (e.g., "PRI.numTauClass", "PRI.Psi", "PRI.var.t"). The remaining fields are the summary statistics.

The summary statistics on the observed data are generated using the program "obsSumStats.pl". This is invoked with a single argument, the path to the master configuration file (discussed here), and by default writes the results to the standard output, but we will redirect this to a file "obs_sum_stats":
$ obsSumStats.pl msbayes_master > obs_sum_stats
The file "obs_sum_stats" is also a tab-delimited value file, and consists of two rows, a header row and a data row, with the latter comprising the summary statistics calculated on the observed data.

Now here is the problem with the msBayes pipeline: the observed summary statistics files generated by "obsSumStats.pl" and the file of simulations from the prior generated by "msBayes.pl" are incompatible!

In particular, the simulations from the prior file has a number of columns that are not present in the observed summary statistics. The file of observed summary statistics has 106 columns, with the first 4 columns being parameters and the remaining 102 columns being the summary statistics of interest. However, the file of simulations from the prior has a varying number of parameter columns, contingent upon your hyper-prior model settings, before being followed by the 102 columns of summary statistics. For example, if you do not constrain the number of divergence times ("numTauClasses = 0"), the simulations from the prior will have 107 columns, with the first 5 columns being parameters and the remaining 102 columns being summary statistics. On the other hand, if you constrain the run to have only 1 divergence time ("numTauClasses = 1"), then the simulations from the prior will have 109 columns, with the first 7 columns being parameters and the remaining 102 columns being summary statistics.

When dealing with just one file of observed summary statistics, and files of simulated samples from the prior generated under the exact same model, fixing this incompatibility is trivial (IF you are aware of it; I was not the first few times that I ran msBayes, and because my results were not obviously unreasonable, it was only when I started to look at the files in detail that I realized what was happening). However, the moment you begin doing anything more complex (such as runs under different hyper-priors for the number of distinct divergence times), it becomes really quite complicated and tedious.

UPDATE Sept 22 2009: It appears that "acceptRej.pl" not only correctly handles, but actually requires, that the file of summary statistics produced by "obsSumStats.pl" be one column short relative to the file of simulations from the prior, i.e. 106 columns (of which the first 4 are parameters) instead of 107 (of which the first 5 are parameters). It is not clear to me at this point how it handles the situation when the file of simulations from the prior has > 107 columns, e.g., if the number of divergence times is constrained. One way or another, though, to use "msReject" you will still want the column patterns consistent across all your files.

To this end, I've written a Python script that greatly simplifies the task of regularizing columns across multiple files. It depends crucially on a header row of column names as the first line of each file. You will pass it the path to a reference or "model" file that has a column pattern that you want to duplicate across all the source files. For each of the remaining source files you specify, it will examine the header row to identify the positions and names of dummy columns that it will need to insert to make the column pattern identical to the reference or model file. It then prints out the columns of data from the source file, inserting columns of dummy values as needed. If the column pattern of the source file matches, then it simply outputs the data rows as they are given. If you pass it multiple files as input, the data rows (i.e., all lines from the second line onwards) from all these files will be written out seamlessly, so that the final output will consist of a single header row of column names followed by the data rows of all the source files, with dummy fields inserted into each row as necessary so as to have a consistent column pattern.
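The core insertion logic can be sketched roughly as follows. This is a simplification of what the script does, not the script itself, and it relies on the same assumption noted below: the source columns appear in the same relative order as the model columns:

```python
# Sketch: pad a source row with dummy values so that its column
# pattern matches a reference ("model") header. Assumes the source
# header is an ordered subset of the model header.

def regularize_row(model_header, source_header, source_row, dummy="0"):
    out, j = [], 0
    for name in model_header:
        if j < len(source_header) and source_header[j] == name:
            out.append(source_row[j])   # column present in source
            j += 1
        else:
            out.append(dummy)           # column missing: insert dummy
    return out
```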

For example, if you want your file of summary statistics calculated on your observed data (using "obsSumStats.pl") to match the column pattern of your file of simulations from the prior (generated using "msBayes.pl"), you would:
$ regularize-cols.py -m prior_sims obs_sum_stats > obs_sum_stats_regularized
(found 107 fields in model file)
-- Processing 1 of 1: obs_sum_stats
* Source file has 106 columns instead of 107.
* Columns to be inserted: [1]:"PRI.numTauClass"
As mentioned, the files of simulations from the prior generated by "msBayes.pl" have different numbers of columns, depending on the simulation model hyper-parameter settings. Since "regularize-cols.py" concatenates the data rows of all files passed to it as arguments, you can use it to merge simulations from the priors from different runs of "msBayes.pl" into a single large file for rejection sampling, and it will inspect each file individually to ensure that column patterns are identical to the model file. For example, the following takes 4 files containing data generated under two different models and concatenates the data into a single file with dummy fields inserted as necessary to ensure uniformity and consistency in columns:
$ regularize-cols.py -m priors2_1 priors1_1 priors1_2 priors2_1 priors2_2 > priors_all
(found 111 fields in model file)
-- Processing 1 of 4: priors1_1
* Source file has 109 columns instead of 111.
* Columns to be inserted: [3]:"PRI.Tau.2" [5]:"PRI.Psi.2"
-- Processing 2 of 4: priors1_2
* Source file has 109 columns instead of 111.
* Columns to be inserted: [3]:"PRI.Tau.2" [5]:"PRI.Psi.2"
-- Processing 3 of 4: priors2_1
* Source file has same number of columns as model file.
* No columns to be inserted.
-- Processing 4 of 4: priors2_2
* Source file has same number of columns as model file.
* No columns to be inserted.
Things to note regarding the operation of the script:
  • Let me repeat this with emphasis: the script depends crucially on a header row of column names as the first line of each file!
  • It only inserts columns into rows of the source files so that their column pattern matches the reference or model file. It does not delete any columns. Thus, the model file should have a column pattern that is a superset of all the column patterns across the source files. You can use the UNIX command "head <FILENAME> | awk '{print NF}'" to quickly count columns in files.
  • Not only does the script not re-order columns in the source files, its logic actually depends on the column order of the source files matching the order of columns in the model file (sans missing columns, of course).
  • All the above conditions are met perfectly by the output files produced by msBayes, so as long as you do not modify these files directly, or are careful to maintain the above conditions if you do modify the files, there should be no problem in the execution of the script.
So, to summarize, as things stand when I write this, the msBayes pipeline generates files of varying column patterns. You will need to ensure that your column patterns are identical across and within files before actually carrying out the rejection sampling procedure. The script given here will automate this task for you, removing the tedium and reducing the risk of errors.

In my next post, I will discuss the actual invocation and running of "msReject", and present a Python script that makes the call to "msReject" simpler by identifying the summary statistic columns for you.

msBayes Basics III: Approaches to Sampling from the Posterior

This is the third of a multi-part series on msBayes basics. Here I will provide a brief overview of the ways you can carry out rejection sampling on your simulations from the prior using the programs distributed with msBayes. Later posts will describe the procedure in detail.

In the previous step, you accumulated a large number of parameter values sampled from the prior. Each sample of parameter values is associated with a set of summary statistics calculated on data simulated using those parameter values. In the current step, a specified proportion of these samples with summary statistics that are closest to the summary statistics calculated on the observed data will be retained, while the other samples will be rejected. Once you have completed this, you will have transformed your samples from the prior into samples from the posterior. It should be noted that this approach of generating posterior samples from the prior is only one variant of many, and is the approach adopted by the programs supplied with msBayes. An excellent summary of some alternative approaches, along with references, can be seen here.

I realize that up to this point I have not discussed the actual simulation model of msBayes at all, nor the summary statistics. I definitely think it would be interesting and useful to delve into the former at some point, but for now I am going to simply refer you to the papers that describe the models in detail, here and here, as well as just include a figure illustrating the model (as given in the original paper):



The summary statistics, and, just as importantly, the metric used to calculate distances between different sets of summary statistics, are also definitely worthy of discussion. In fact, improvement of existing and development of new generalized summary statistics and corresponding distance metrics are where I think (and hope!) we will see the most exciting progress in Approximate Bayesian Computation in phylogeographical analysis in the near future. Right now, msBayes employs a suite of over 100 traditional population genetic statistics (such as pi, Watterson's theta, Tajima's D, Shannon's Index, etc.) that have been tested in simulation studies as its core summary statistics. For its distance metric, msBayes uses the simple multivariate Euclidean distance between the summary statistics as a first step. For continuous parameter values, the posterior samples are then regressed onto the distance between the simulated data and the observed data summary statistics.

So, returning to our msBayes pipeline "walk-through", how do we go from our samples from the prior to samples from the posterior? We have a number of different options. msBayes itself provides two approaches: one using a Perl script, "acceptRej.pl", which drives an R script that carries out the business, and the other a C program, "msReject".

The Perl+R approach, using "acceptRej.pl", automatically does the local linear-regression weighting for you, and produces nice and fancy PDFs of the posterior probability distribution of the number of distinct divergence times. I do not recommend it, for two reasons. The first is that the R analysis is extremely slow and a phenomenal memory hog: I had to switch to 16G machines to process some files, and even then I was pushing the envelope. The second is that while the PDF plots of the posterior probability distribution of the number of divergence times are very nice and informative, I would far rather have the actual samples from the posterior. With those, not only is it possible to generate any of a wide variety of plots illustrating the posterior probability distribution of any recorded parameter value, I can also summarize the values in a number of other ways, or use the samples in other analyses.

The "msReject" approach carries out rejection sampling on the simulations from the prior using Euclidean distances. The program is extremely fast and has a relatively minuscule memory footprint; it routinely handles files of 50 million samples without any trouble. However, it does not carry out the locally-weighted linear regression of the posterior values, and you will have to do this yourself separately. Nonetheless, I highly recommend using this program over the Perl script.
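For reference, the standard locally-weighted linear regression adjustment (Beaumont et al. 2002) that you would then apply yourself can be sketched with numpy as follows. This is an illustrative version of the general technique, not msBayes code or the R script I will present later; the function name and interface are my own:

```python
# Sketch of the Beaumont et al. (2002) regression adjustment:
# regress accepted parameter values on the deviation of their
# summary statistics from the observed ones, with Epanechnikov
# kernel weights, then shift each value to its predicted value
# at zero distance.
import numpy as np

def regression_adjust(params, stats, obs_stats):
    """params: (n,) accepted parameter values; stats: (n, k) their
    summary statistics; obs_stats: (k,) observed statistics."""
    diffs = stats - obs_stats
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    delta = dists.max() or 1.0
    weights = 1.0 - (dists / delta) ** 2      # Epanechnikov kernel
    # weighted least squares via row-scaling and lstsq
    X = np.column_stack([np.ones(len(params)), diffs])
    W = np.sqrt(weights)
    beta, *_ = np.linalg.lstsq(X * W[:, None], params * W, rcond=None)
    # adjusted value = observed-distance-zero prediction + residual
    return params - diffs @ beta[1:]
```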

In the next post, I will go over in detail how you generate summary statistics on your observed data, provide a Python script that regularizes the file so that it is compatible with the samples simulated from the prior, and provide another Python script that identifies the summary statistic columns in the file and calls on msReject with the appropriate options to correctly generate a sample from the posterior.

Saturday, September 19, 2009

msBayes Basics II: Data Assembly, Configuration File Preparation, and Generating Samples from the Prior

This is the second of a multi-part series on msBayes basics, and here I will discuss the preparation of the input files to the pipeline, the creation of the configuration file to control the simulations from the prior, as well as the running of the simulator itself.

Data assembly is probably one of the more tedious parts of the whole ABC pipeline, and I only provide a brief overview. Much more detailed and very helpful information can be found in the msBayes documentation.

The basic sequence data files themselves are pretty straightforward: for each species or taxon distributed across the barrier, you need to provide a single FASTA file with all the sequences listed in order (i.e., with sequences from population A listed first, followed by sequences from population B). That's the easy part. The part that requires a bit of work is collecting the information required to flesh out the configuration file. This includes a distinct sequence evolution model for each population pair. The msBayes simulator needs to generate realistic genetic sequence differences between individuals from each population under a standard molecular evolution model (HKY), and thus requires that you provide estimates of the transition-transversion rate ratio and nucleotide base frequencies for each population pair. So, for example, you might want to use GARLI to infer a maximum likelihood tree under an HKY model for the sequences of each population pair, or alternatively, bring your favorite tree into PAUP* along with the sequences and ask PAUP* to provide ML estimates of the HKY parameters.

For the actual format of the master configuration file, please see the msBayes documentation, but basically for each population pair you need to provide the total number of sequences, the number of sequences from population A, the number of sequences from population B, the HKY parameters, the length of the sequence, and the path to the FASTA file.

I would highly recommend that you also go the extra step and embed the msBayes run parameters in the configuration file as well, rather than rely on the interactive queries. This is necessary if you want to run the simulations remotely on a cluster through a scheduler, but far more importantly, it provides a permanent record of the simulation settings so as to allow you to track down mistakes or reproduce the same conditions.

Here is an example of a complete simulation configuration file, where the run settings, hyper-priors, priors, and observed sequence data are all specified:


numLoci = 1
lowerTheta = 0.01
upperTheta = 75.0
upperTau = 10.0
numTauClasses = 0
upperMig = 0.0
upperRec = 0.0
upperAncPopSize = 0.5
constrain = 0
subParamConstrain = 000000000
reps = 5000000

BEGIN SAMPLE_TBL
100 7 93 3.71 2356 0.2 0.3 0.1 a1-a2.fas
16 4 12 4.08 2559 0.3 0.2 0.2 b1-b2.fas
10 5 5 4.08 1356 0.3 0.2 0.2 c1-c2.fas
29 8 21 4.09 2367 0.2 0.2 0.1 d1-d2.fas
11 2 9 4.13 1372 0.3 0.2 0.1 e1-e2.fas
69 38 31 4.13 1372 0.2 0.2 0.1 f1-f2.fas
END SAMPLE_TBL

Once you have assembled your data files and created the simulation configuration file, the next step is to actually run the msBayes simulator to generate samples from the prior. This can be as simple as:

$ msbayes.pl -c msbayes_master -o prior_samples

This calls on "msbayes.pl", a Perl script that is part of the msBayes distribution, passing it our master configuration file "msbayes_master" as an argument and asking it to direct its results to a file called "prior_samples".

"msbayes.pl" is the primary driver for the simulations from the prior. It wraps up a whole bunch of complexity: generation of hyper-parameters from their prior distribution, the generation of the actual model parameters to be passed to the underlying coalescent-based sequence simulator (using "msprior"), calling the coalescent-based sequence simulator ("msDQH") to simulate a sample of summary statistics based on the generated model parameters, assembly of a row of data that couples together the set of generated parameter values and the corresponding summary statistics simulated based on these values, and the output of this row as a tab-delimited line of data. This gets repeated again and again, for the number of replicates specified in the master configuration file ("reps = 5000000" in the example file above), to build up a large sample drawn from the parameter space according to the given prior distributions.

All this is a lot of work, but is extremely computationally efficient: the CPU and memory footprint of a single process running is almost negligible, and you can run multiple simultaneous simulations without much overhead even on a single middle-of-the-road machine. Which leads to my recommendation here: run multiple processes to generate multiple files with samples from the prior simultaneously, concatenating these files before going to the next step in the ABC pipeline. So, for example, if you want 10 million samples from the prior, you can run 10 simultaneous processes, each drawing 1 million samples from the prior; concatenating the files then gives you 10 million samples in 1/10th the time.
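If you script the final merge yourself rather than simply using "cat", remember that each file carries its own header row, and only the first should be kept. A quick sketch (the function name is hypothetical, and it assumes all the input files already share an identical column pattern, per the regularization discussed in a later post in this series):

```python
# Sketch: concatenate several prior-sample files from parallel
# msbayes.pl runs, keeping only the first file's header row.
# Assumes all files have identical column patterns.
def concatenate_priors(paths, out_path):
    with open(out_path, "w") as out:
        for k, path in enumerate(paths):
            with open(path) as f:
                header = f.readline()
                if k == 0:
                    out.write(header)  # header from the first file only
                for line in f:
                    out.write(line)    # data rows from every file
```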

Once you have your samples from the prior, the next step is to carry out rejection sampling on this data to produce samples from the posterior. And this is where the going gets pretty interesting ...

You see, to be honest, I do not think my posts describing the procedures up to this stage have really provided all that much more new or useful information than already existed in the documentation. I think that the operation of the msBayes ABC pipeline up to this point has been smooth and relatively trouble-free, and has proceeded as expected based on the documentation and help files. However, going beyond the generation of samples from the prior is where I ran into a number of difficulties, gotchas, glitches, and trip-me-up's. My next post will discuss these issues, share my experiences, lessons learned and solutions, as well as offer some scripts that I wrote to patch up the "leaks" and unexpected incompatibilities between different parts of the pipeline.

msBayes Basics I: Introduction

I've been working quite a bit with msBayes recently, and thought I would share some of the stuff that I've learnt, with an emphasis on the practical side of things. Over a series of posts, I am going to summarize the steps required to carry out a basic msBayes analysis, provide some "lessons learned" and tips for each of these steps, as well as provide some supplemental Python scripts that I find greatly facilitate the flow of data from one stage of the pipeline to another. This first post serves as a general introduction to the approach and the program, and a general overview of the procedures involved in an msBayes analysis.

Most of you are probably already quite familiar with this method of analysis, but for those of you that are not, perhaps a brief explanation may be in order. msBayes is a suite of tools that plug into a pipeline for Approximate Bayesian Computation (ABC) phylogeographic analysis, estimation and hypothesis testing. The ABC approach is a computationally efficient approach to complex estimation and inference problems. At its most basic, it works as follows:
  1. A set of summary statistics is calculated on an observed data set.
  2. Multiple sets of data are simulated using parameters drawn from a prior distribution.
  3. The same set of summary statistics is calculated for each of these simulated data sets.
  4. The parameters that result in simulated data sets with summary statistics that are equal to or within some threshold distance of the summary statistics of the observed data set are recorded.
  5. The posterior probability of the parameter values is given by the proportion of those values in the final set of accepted parameters.
Each of these steps involves multiple levels of complexity and a smorgasbord of variant approaches, of course, from the full nuts-and-bolts of the simulation model to various flavours of importance/rejection sampling (step 4), but the core concept remains the same: simulate the data under a range of parameter values, and keep those parameter values that result in simulated data sufficiently close to the observed data. This approach is extremely attractive: for many complex problems and questions, data can be simulated relatively easily, whereas sometimes even formulating the corresponding likelihood statement can be pretty challenging. Even if the likelihood can be formulated, computing it tends to be extremely expensive in terms of CPU power and time as well as memory, while simulating data is relatively cheap as far as those resources go. However, be warned: the ABC approach does exact a cost in storage space; all those samples from the prior have to be retained until they can be processed, and I have had analyses that walk into the realm of 50-60 GB.
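To make the steps above concrete, here is a toy end-to-end example: estimating the mean of a normal distribution by ABC, using the sample mean as the summary statistic. Everything here (the model, the prior, the acceptance threshold) is invented purely for illustration and has nothing to do with msBayes itself:

```python
# Toy ABC rejection sampler: infer the mean of a normal distribution.
import random

random.seed(1)
# Step 1: summary statistic (sample mean) of the "observed" data
observed = [random.gauss(4.0, 1.0) for _ in range(50)]
obs_stat = sum(observed) / len(observed)

accepted = []
for _ in range(20000):
    mu = random.uniform(0.0, 10.0)                   # step 2: draw from prior
    sim = [random.gauss(mu, 1.0) for _ in range(50)]  # step 2: simulate data
    sim_stat = sum(sim) / len(sim)                   # step 3: same statistic
    if abs(sim_stat - obs_stat) < 0.1:               # step 4: within threshold
        accepted.append(mu)

# Step 5: the accepted values approximate the posterior of mu
posterior_mean = sum(accepted) / len(accepted)
```

The accepted values of `mu` cluster around the true value of 4.0, and their relative frequencies approximate the posterior distribution.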

msBayes can be used to answer a broad variety of questions using ABC, but a major one, and the one which I am currently using it for, is to test the hypothesis of simultaneous divergence amongst a set of taxa co-distributed across a geographical barrier. Specifically, I am asking "given a bunch of sequences of pairs of populations of different species on either side of a barrier, what is the posterior probability a single distinct divergence time between each pair?"

Using msBayes to answer this question involves the following steps:
  1. Assemble the observed data files.
  2. Create the configuration file to drive the simulations from the prior.
  3. Run the simulator to produce a large number of samples from the prior.
  4. Calculate the summary statistics on the observed data.
  5. Regularize the columns / fields in the summary statistics file and the simulated data.
  6. Carry out the acceptance/rejection sampling on the prior samples to produce samples from the posterior.
  7. Summarize the posterior probability distribution of discrete parameter values (such as number of divergence times) by assessing the proportion of those values in the posterior samples.
  8. Weight posterior samples of continuous parameter using local linear regression before summarizing their distribution.
Each of these steps will be covered in detail in future posts, and I will also be providing some scripts that I found really useful in helping to carry out some of these steps.