Sunday, September 20, 2009

msBayes Basics IV: Preparing Your Files for Sampling from the Posterior

This is the fourth of a multi-part series on msBayes basics. Here I will discuss some glitches/incompatibilities in the msBayes pipeline that you have to watch out for when going into the rejection sampling stage. I will also present a Python script that fixes these incompatibilities and makes the process easier.

The rejection sampling step requires two files: samples from the prior and summary statistics calculated on the observed data.

Our samples from the prior, as generated by the script "msBayes.pl", are given as a file of tab-delimited values, with the first row of the file giving the field names. The first several fields consist of the parameter values drawn from the prior, while the remaining fields are the summary statistics calculated on data simulated with those parameter values. Parameter fields can be identified by the leading "PRI." in their names (e.g., "PRI.numTauClass", "PRI.Psi", "PRI.var.t"); all remaining fields are the summary statistics.
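To make the layout concrete, here is a minimal Python sketch (not part of msBayes itself) that splits a header row into parameter fields and summary-statistic fields using the "PRI." prefix; the example header below is abbreviated and the stat names are illustrative:

```python
# Split the header of an msBayes prior-samples file into parameter
# fields (names starting with "PRI.") and summary-statistic fields.
def split_fields(header_line):
    fields = header_line.rstrip("\n").split("\t")
    params = [f for f in fields if f.startswith("PRI.")]
    stats = [f for f in fields if not f.startswith("PRI.")]
    return params, stats

header = "PRI.numTauClass\tPRI.Psi\tPRI.var.t\tpi.b\tpi.w"
params, stats = split_fields(header)
# params -> ['PRI.numTauClass', 'PRI.Psi', 'PRI.var.t']
# stats  -> ['pi.b', 'pi.w']
```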

The summary statistics on the observed data are generated using the program "obsSumStats.pl". This is invoked with a single argument, the path to the master configuration file (discussed here), and by default writes the results to the standard output, but we will redirect this to a file "obs_sum_stats":
$ obsSumStats.pl msbayes_master > obs_sum_stats
The file "obs_sum_stats" is also a tab-delimited value file, and consists of two rows, a header row and a data row, with the latter comprising the summary statistics calculated on the observed data.
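Since the file is just a header row and a data row, it is straightforward to load into a name-to-value mapping if you want to inspect it; a small sketch (the column names and values below are made up for illustration):

```python
# Read a two-row tab-delimited file (header row + data row) into a
# dict mapping each column name to its observed value.
def read_obs_sum_stats(lines):
    header = lines[0].rstrip("\n").split("\t")
    data = lines[1].rstrip("\n").split("\t")
    return dict(zip(header, data))

lines = ["pi.b\tpi.w", "0.012\t0.034"]
print(read_obs_sum_stats(lines))  # -> {'pi.b': '0.012', 'pi.w': '0.034'}
```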

Now here is the problem with the msBayes pipeline: the observed summary statistics files generated by "obsSumStats.pl" and the file of simulations from the prior generated by "msBayes.pl" are incompatible!

In particular, the file of simulations from the prior has a number of columns that are not present in the file of observed summary statistics. The file of observed summary statistics has 106 columns, with the first 4 columns being parameters and the remaining 102 columns being the summary statistics of interest. However, the file of simulations from the prior has a varying number of parameter columns, contingent upon your hyper-prior model settings, before being followed by the 102 columns of summary statistics. For example, if you do not constrain the number of divergence times ("numTauClasses = 0"), the simulations from the prior will have 107 columns, with the first 5 columns being parameters and the remaining 102 columns being summary statistics. On the other hand, if you constrain the run to have only 1 divergence time ("numTauClasses = 1"), then the simulations from the prior will have 109 columns, with the first 7 columns being parameters and the remaining 102 columns being summary statistics.

When dealing with just one file of observed summary statistics, and files of simulated samples from the prior generated under the exact same model, then fixing this incompatibility is trivial (IF you are aware of it; I was not the first few times that I ran msBayes, and because my results were not scarily unreasonable, it was only when I started to look at the files in detail that I realized what was happening). However, the moment that you begin doing anything more complex (such as runs under different hyper-priors for numbers of distinct divergence times), then it becomes really quite complicated and tedious.

UPDATE Sept 22 2009: It appears that "acceptRej.pl" not only correctly handles, but actually requires, that the file of summary statistics produced by "obsSumStats.pl" be one column short relative to the file of simulations from the prior, i.e. 106 columns (of which the first 4 are parameters) instead of 107 (of which the first 5 are parameters). It is not clear to me at this point how it handles the situation when the file of simulations from the prior has > 107 columns, e.g., if the number of divergence times is constrained. One way or another, though, to use "msReject" you will still want the column patterns consistent across all your files.

To this end, I've written a Python script that greatly simplifies the task of regularizing columns across multiple files. It depends crucially on a header row of column names as the first line of each file. You will pass it the path to a reference or "model" file that has the column pattern that you want to duplicate across all the source files. For each of the remaining source files you specify, it will examine the header row to identify the positions and names of dummy columns that it will need to insert to make the column pattern identical to that of the reference or model file. It then prints out the columns of data from the source file, inserting columns of dummy values as needed. If the column pattern of the source file already matches, then it simply outputs the data rows as they are given. If you pass it multiple files as input, the data rows (i.e., all lines from the second line onwards) from all of these files will be written out seamlessly, so that the final output consists of a single header row of column names followed by the data rows of all the source files, with dummy fields inserted into each row as necessary to maintain a consistent column pattern.
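The core idea can be sketched in a few lines of Python. This is not the actual "regularize-cols.py", just an illustration of the logic described above, with made-up field names and a dummy value of "0"; it assumes (as msBayes output guarantees) that the source columns appear in the same relative order as in the model file:

```python
DUMMY = "0"

def regularize_rows(model_fields, src_lines):
    """Yield data rows from src_lines padded to the model's column pattern.

    model_fields: list of column names from the model file's header.
    src_lines: iterable of tab-delimited lines; the first is the header.
    """
    src_iter = iter(src_lines)
    src_fields = next(src_iter).rstrip("\n").split("\t")
    # Positions (in the model pattern) of columns the source file lacks.
    # Inserting in increasing index order keeps later positions correct.
    missing = [i for i, name in enumerate(model_fields)
               if name not in src_fields]
    for line in src_iter:
        row = line.rstrip("\n").split("\t")
        for i in missing:
            row.insert(i, DUMMY)
        yield "\t".join(row)

model = ["PRI.numTauClass", "PRI.Psi", "pi.b"]
src = ["PRI.Psi\tpi.b", "0.4\t0.01"]
print(list(regularize_rows(model, src)))  # -> ['0\t0.4\t0.01']
```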

For example, if you want your file of summary statistics calculated on your observed data (using "obsSumStats.pl") to match the column pattern of your file of simulations from the prior (generated using "msBayes.pl"), you would:
$ regularize-cols.py -m prior_sims obs_sum_stats > obs_sum_stats_regularized
(found 107 fields in model file)
-- Processing 1 of 1: obs_sum_stats
* Source file has 106 columns instead of 107.
* Columns to be inserted: [1]:"PRI.numTauClass"
As mentioned, the files of simulations from the prior generated by "msBayes.pl" have different numbers of columns, depending on the simulation model hyper-parameter settings. Since "regularize-cols.py" concatenates the data rows of all files passed to it as arguments, you can use it to merge simulations from the priors from different runs of "msBayes.pl" into a single large file for rejection sampling, and it will inspect each file individually to ensure that its column pattern is identical to that of the model file. For example, the following takes 4 files containing data generated under two different models and concatenates the data into a single file, with dummy fields inserted as necessary to ensure uniformity and consistency in columns:
$ regularize-cols.py -m priors2_1 priors1_1 priors1_2 priors2_1 priors2_2 > priors_all
(found 111 fields in model file)
-- Processing 1 of 4: priors1_1
* Source file has 109 columns instead of 111.
* Columns to be inserted: [3]:"PRI.Tau.2" [5]:"PRI.Psi.2"
-- Processing 2 of 4: priors1_2
* Source file has 109 columns instead of 111.
* Columns to be inserted: [3]:"PRI.Tau.2" [5]:"PRI.Psi.2"
-- Processing 3 of 4: priors2_1
* Source file has same number of columns as model file.
* No columns to be inserted.
-- Processing 4 of 4: priors2_2
* Source file has same number of columns as model file.
* No columns to be inserted.
Things to note regarding the operation of the script:
  • Let me repeat this with emphasis: the script depends crucially on a header row of column names as the first line of each file!
  • It only inserts columns into rows of the source files so that their column pattern matches the reference or model file. It does not delete any columns. Thus, the model file should have a column pattern that is a superset of all the column patterns across the source files. You can use the UNIX command "head <FILENAME> | awk '{print NF}'" to quickly count columns in files.
  • It does not re-order columns in the source files; in fact, its logic depends on the column order of each source file matching the order of columns in the model file (sans missing columns, of course).
  • All the above conditions are met perfectly by the output files produced by msBayes, so as long as you do not modify these files directly, or are careful to maintain the above conditions if you do modify the files, there should be no problem in the execution of the script.
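If you want a quick sanity check after regularizing, a Python equivalent of the awk one-liner above is to count the tab-delimited columns in each file's header row and confirm they all agree (the function below operates on header lines, so the strings shown are just stand-ins for the first line of each file):

```python
# Count tab-delimited columns in a header line, and check that a
# set of header lines all share the same column count.
def column_count(header_line):
    return len(header_line.rstrip("\n").split("\t"))

def consistent(header_lines):
    return len({column_count(h) for h in header_lines}) == 1

headers = ["a\tb\tc", "x\ty\tz"]  # e.g., first line of each file
print(consistent(headers))  # -> True
```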
So, to summarize: as things stand at the time of writing, the msBayes pipeline generates files with varying column patterns. You will need to ensure that your column patterns are identical across and within files before actually carrying out the rejection sampling procedure. The script given here will automate this task for you, removing the tedium and reducing the risk of errors.

In my next post, I will discuss the actual invocation and running of "msReject", and present a Python script that makes the call to "msReject" simpler by identifying the summary statistic columns for you.
