Saturday, September 19, 2009

msBayes Basics II: Data Assembly, Configuration File Preparation, and Generating Samples from the Prior

This is the second of a multi-part series on msBayes basics, and here I will discuss the preparation of the input files to the pipeline, the creation of the configuration file to control the simulations from the prior, as well as the running of the simulator itself.

Data assembly is probably one of the more tedious parts of the whole ABC pipeline, and I only provide a brief overview here. Much more detailed and very helpful information can be found in the msBayes documentation.

The basic sequence data files themselves are pretty straightforward: for each species or taxon distributed across the barrier, you need to provide a single FASTA file with all the sequences listed in order (i.e., with sequences from population A listed first, followed by sequences from population B). That's the easy part. The part that requires a bit of work is collecting the information required to flesh out the configuration file, which includes a distinct sequence evolution model for each population pair. The msBayes simulator needs to generate realistic genetic sequence differences between individuals from each population under a standard molecular evolution model (HKY), and thus requires that you provide estimates of the transition-transversion rate ratio and nucleotide base frequencies for each population pair. So, for example, you might use GARLI to infer a maximum likelihood tree under an HKY model for the sequences of each population pair, or, alternatively, bring your favorite tree into PAUP* along with the sequences and ask PAUP* to provide ML estimates of the HKY parameters.
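If you go the PAUP* route, a command block along these lines should do the job; treat this as a sketch and check the exact option names against the PAUP* documentation (in PAUP*, "nst=2" with unequal base frequencies corresponds to the HKY model):

```text
begin paup;
    set criterion=likelihood;
    lset nst=2 tratio=estimate basefreq=estimate;
    lscores 1;
end;
```

With your tree loaded, the "lscores" output will report the estimated transition-transversion ratio and base frequencies, which you can then copy into the configuration file.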

For the actual format of the master configuration file, please see the msBayes documentation, but basically, for each population pair, you need to provide the total number of sequences, the number of sequences from population A, the number of sequences from population B, the HKY parameters, the length of the sequence, and the path to the FASTA file.

I would highly recommend that you also go the extra step and embed the msBayes run parameters in the configuration file, rather than relying on the interactive queries. This is necessary if you want to run the simulations remotely on a cluster through a scheduler, but, far more importantly, it provides a permanent record of the simulation settings, allowing you to track down mistakes or reproduce the same conditions.

Here is an example of a complete simulation configuration file, where the run settings, hyper-priors, priors, and observed sequence data are all specified:


numLoci = 1
lowerTheta = 0.01
upperTheta = 75.0
upperTau = 10.0
numTauClasses = 0
upperMig = 0.0
upperRec = 0.0
upperAncPopSize = 0.5
constrain = 0
subParamConstrain = 000000000
reps = 5000000

BEGIN SAMPLE_TBL
100 7 93 3.71 2356 0.2 0.3 0.1 a1-a2.fas
16 4 12 4.08 2559 0.3 0.2 0.2 b1-b2.fas
10 5 5 4.08 1356 0.3 0.2 0.2 c1-c2.fas
29 8 21 4.09 2367 0.2 0.2 0.1 d1-d2.fas
11 2 9 4.13 1372 0.3 0.2 0.1 e1-e2.fas
69 38 31 4.13 1372 0.2 0.2 0.1 f1-f2.fas
END SAMPLE_TBL



Once you have assembled your data files and created the simulation configuration file, the next step is to actually run the msBayes simulator to generate samples from the prior. This can be as simple as:



$ msbayes.pl -c msbayes_master -o prior_samples


This calls on "msbayes.pl", a Perl script that is part of the msBayes distribution, passing it our master configuration file "msbayes_master" as an argument and asking it to direct its results to a file called "prior_samples".

"msbayes.pl" is the primary driver for the simulations from the prior. It wraps up a whole bunch of complexity: generation of hyper-parameters from their prior distribution, the generation of the actual model parameters to be passed to the underlying coalescent-based sequence simulator (using "msprior"), calling the coalescent-based sequence simulator ("msDQH") to simulate a sample of summary statistics based on the generated model parameters, assembly of a row of data that couples together the set of generated parameter values and the corresponding summary statistics simulated based on this values, and the output of this row as a tab-delimited line of data. This gets repeated again and again, for the number of replicates specified in the master configuration file ("reps = 5000000" in the example file above), to build up a large sample drawn from the parameter space according to the given prior distributions.

All this is a lot of work, but it is extremely efficient computationally: the CPU and memory footprint of a single running process is almost negligible, and you can run multiple simultaneous simulations without much overhead even on a single middle-of-the-road machine. Which leads to my recommendation here: run multiple processes simultaneously to generate multiple files of samples from the prior, and concatenate these files before going on to the next step in the ABC pipeline. So, for example, if you want 10 million samples from the prior, you can run 10 simultaneous processes, each drawing 1 million samples; concatenating the files then gives you 10 million samples in roughly 1/10th the time.
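Launching the runs themselves can be as simple as starting several "msbayes.pl" processes in the background, each with its own "-o" output file. For the concatenation step, a plain "cat" works if the sample files are pure tab-delimited data; if your version of the pipeline writes a header row of column names, naive concatenation would scatter header lines through the combined file. Here is a small Python helper (my own sketch, not part of msBayes) that merges several prior-sample files while keeping at most one header:

```python
def merge_prior_files(in_paths, out_path, has_header=False):
    """Concatenate tab-delimited prior-sample files into one file.

    If has_header is True, the first line of the first file is kept,
    and the first line of every subsequent file is skipped, so the
    combined file contains a single header row.
    """
    with open(out_path, "w") as out:
        for i, path in enumerate(in_paths):
            with open(path) as src:
                for j, line in enumerate(src):
                    if has_header and i > 0 and j == 0:
                        continue  # drop the repeated header row
                    out.write(line)
```

Set has_header=False to get the behavior of a plain "cat" when the files contain nothing but data rows.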

Once you have your samples from the prior, the next step is to carry out rejection sampling on these data to produce samples from the posterior. And this is where the going gets pretty interesting ...

You see, to be honest, I do not think my posts describing the procedures up to this stage have really provided much new or useful information beyond what already exists in the documentation. The operation of the msBayes ABC pipeline up to this point has been smooth and relatively trouble-free, and has proceeded as expected based on the documentation and help files. Going beyond the generation of samples from the prior, however, is where I ran into a number of difficulties, gotchas, glitches, and trip-me-ups. My next post will discuss these issues, share my experiences, lessons learned, and solutions, as well as offer some scripts that I wrote to patch up the "leaks" and unexpected incompatibilities between different parts of the pipeline.
