





Methods and issues in phylogeography and biogeography.
R is like a microwave oven. It is capable of handling a wide range of pre-packaged tasks, but can be frustrating or inappropriate when you try to do even simple things that are outside its (admittedly vast) library of functions. Ever tried to make toast in a microwave? There has been a push to start using R for simulations and phylogenetic analysis, and I am actually rather ambivalent about how I feel about this. On the one hand, I would much rather see an open-source platform like R used than some proprietary commercial platform such as Mathematica or Matlab. On the other hand, I do not think that R is the most suitable choice for the full spectrum of these applications. There are some serious limitations on its capability and performance when handling large datasets (mainly for historical design reasons), and, to be frank, I find many aspects of its syntax and idiom quite irritating. I primarily use it for what it was designed for: numerical and statistical computing, as well as data visualization and plotting, and in this context I am quite happy with it. For almost anything else, I look elsewhere. R has an excellent and very friendly community of people actively developing and supporting it, and I might change my view as R evolves. But, as things stand, it simply is not my first choice for the majority of my programming needs.
If R is like a microwave oven, then Python is a full-fledged modern kitchen. You can produce almost anything you want, from toast to multi-course banquets, but you probably need to do some extra work relative to R. With R, if you want an apple pie, all you need to do is pick up a frozen one from a supermarket and heat it up, and you'll be feasting in a matter of minutes. With Python, you will need to knead your own dough, cook your own apple filling, etc., but you will get your apple pie, and, to be honest, programming in Python is such a pleasure that you will probably enjoy the extra work. And what happens if you instead want a pie with strawberries and dark chocolate and a hint of chili in the filling? With R, if you cannot find an appropriate instant pie in your supermarket, you are out of luck, or you might be in for a very painful adventure in trying to mash together some chimeric concoction that will look horrible and taste worse. But with Python, any pie you can dream of is completely within reach, and probably will not take too much more effort than plain old apple pie. From data wrangling and manipulation (prepping data files, converting file formats, relabeling sequences, etc.) to pipelining workflows, and even to many types of analyses and simulations, Python is the ideal choice for the majority of tasks that an evolutionary biologist carries out.
If R is like a microwave, and Python is like a modern kitchen, then C++ is like an antique Victorian kitchen in the countryside, far, far, far away from any supermarket. You want an apple pie? With C++, you can consider yourself lucky if you do not actually have to grow the apples and harvest the wheat yourself. You will certainly need to mill and sift the flour, churn the butter, pluck the apples, grind the spices, etc. And none of this "set the oven to 400°" business: you will need to chop up the wood for the fuel, start the fire, and keep tending the heat while it is baking. You will eventually get the apple pie, but you are going to have to work very, very, very hard to get it, and most of it will be tedious and painful work. And you will probably have at least one or two bugs in the pie when all is done. More likely than not, memory bugs ...
Stepping out of the cooking/kitchen analogy, if I had to point out the single aspect of programming in C++ that makes it such a pain, I would say "memory management". Much of one's time programming in C++ is spent coding up the overhead involved in managing memory (in the broadest sense, from declaration, definition, and initialization of stack variables, to allocation and deallocation of heap variables, to tracking and maintaining both types through their lifecycles), and even more is spent tracking down bugs caused by memory leaks. The Standard Template Library certainly helps, and I've come to find it indispensable, but it still exacts its own cost in terms of overhead and its own price in terms of chasing down memory leaks and bugs.
For an example of the overhead, compare looping through elements of a container in Python:
for i in data:
    print(i)
vs. C++ using the Standard Template Library:
for (std::vector<long>::const_iterator i = data.begin();
     i != data.end();
     ++i) {
    std::cout << *i << std::endl;
}
And for an example of an insidious memory bug even with the Standard Template Library, consider this: what might happen to a pointer that you have been keeping to an element of a vector when some other part of your code appends a new element to that vector? It can be quite ugly.
So what does all that extra work and pain get you?
Performance.
When it comes to performance, C++ rocks. My initial attempt at a complex phylogeography simulator was in Python. It took me a week to get it working. I could manage population sizes of about 1000 on a 1 GB machine, and it could complete 10000 generations in about a week. I rewrote it in C++. Instead of a week, it took me two and a half months to get it to the same level of complexity. When completed, however, it could manage population sizes of over 1 million on a 1 GB machine, and run 2.5 million generations in 24 hours.
After that experience, when I am thinking of coding up something that might be computationally intensive or push the memory limits of my machines, the language that comes to mind is C++. More likely than not, however, I would still try to code up the initial solution in Python, and only turn to C++ when it becomes clear that Python's performance is not up to the task.
Java, like Python, is a modern kitchen, allowing for a full range of operations with all the sanity-preserving conveniences and facilities (e.g., garbage collection/memory management). But it is a large, industrial kitchen, with an executive chef, sous chefs, and a full team of chefs de partie running things. And so, while you can do everything from making toast to multi-course meals, even the simplest task takes a certain minimal investment of organization and overhead. At the end of the day, for many simpler things, such as scrambled eggs and toast, you would get just as good results a lot quicker using Python.
I think that Java is a really nice language, and I do like its idioms and syntax, which, by design, enforce many good programming practices. It is also probably much more suited for enterprise-level application development than Python. But I find it very top-heavy for the vast majority of things that I do, and the extra investment in programming overhead that it imposes (think: getters and setters) does not buy me any performance benefit at all. As a result, I have not found a need to code a single application in Java since I started using Python several years ago.
TOL. NUM. UNWEIGHTED WEIGHTED
1e-05 500 0.148147697394790 0.188623205810913
2e-05 1000 0.178475544544545 0.138229444728950
1e-04 5000 0.217240549309862 0.126658030148325
2e-04 10000 0.235878721472147 0.130784297340281
0.001 50000 0.308834221344427 0.123652684056336
0.002 100000 0.355224938539385 0.123306496530203
@article{beaumont2002approximate,
title={{Approximate Bayesian computation in population genetics}},
author={Beaumont, M.A. and Zhang, W. and Balding, D.J.},
journal={Genetics},
volume={162},
number={4},
pages={2025--2035},
year={2002},
publisher={Genetics Soc America},
url={http://www.genetics.org/cgi/content/full/162/4/2025}
}
@article{hickerson2006test,
title={{Test for simultaneous divergence using approximate Bayesian computation}},
author={Hickerson, M.J. and Stahl, E.A. and Lessios, H.A. and Crandall, K.},
journal={Evolution},
volume={60},
number={12},
pages={2435--2453},
year={2006},
publisher={BioOne}
}
@article{hickerson2007msbayes,
title={{msBayes: Pipeline for testing comparative phylogeographic histories using hierarchical approximate Bayesian computation}},
author={Hickerson, M.J. and Stahl, E. and Takebayashi, N.},
journal={BMC Bioinformatics},
volume={8},
number={1},
pages={268},
year={2007},
publisher={BioMed Central Ltd}
}
$ msReject obs_sum_stats.txt priors.txt 0.2 6 7 8 9 10 11 12
This asks "msReject" to carry out rejection sampling on the data in "priors.txt", using the summary statistics given in "obs_sum_stats.txt" as the observed summary statistics and a tolerance of "0.2", with columns 6 through 12 taken to be summary-statistic columns (and the remaining columns assumed to be prior/parameter columns).
$ run-reject.py obs_sum_stats priors1.txt 0.01 > posterior_accepted_samples.txt
Following columns identified as PARAMETERS:
[1]:"PRI.numTauClass"
[2]:"PRI.Psi"
[3]:"PRI.var.t"
[4]:"PRI.E.t"
[5]:"PRI.omega"
Following columns identified as SUMMARY STATISTICS:
[6]:"pi.b.1"
[7]:"pi.b.2"
.
.
(etc.)
.
.
[106]:"ShannonsIndex.Pop2.5"
[107]:"ShannonsIndex.Pop2.6"
Command to be executed:
msReject obs_sum_stats priors1.txt 0.01 6 7 8 9 10
11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82
83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
101 102 103 104 105 106 107
Running with PID: 10927.
Completed with exit status: 0.
$ obsSumStats.pl msbayes_master > obs_sum_stats
The file "obs_sum_stats" is also a tab-delimited file, consisting of two rows: a header row and a data row, the latter comprising the summary statistics calculated on the observed data.
$ regularize-cols.py -m prior_sims obs_sum_stats > obs_sum_stats_regularized
(found 107 fields in model file)
-- Processing 1 of 1: obs_sum_stats
* Source file has 106 columns instead of 107.
* Columns to be inserted: [1]:"PRI.numTauClass"
As mentioned, the files of simulations from the prior generated by "msBayes.pl" can have different numbers of columns, depending on the simulation model hyper-parameter settings. Since "regularize-cols.py" concatenates the data rows of all files passed to it as arguments, you can use it to merge simulations from the priors of different runs of "msBayes.pl" into a single large file for rejection sampling; it will inspect each file individually to ensure that its column pattern matches that of the model file. For example, the following takes four files containing data generated under two different models and concatenates the data into a single file, with dummy fields inserted as necessary to ensure uniformity and consistency in columns:
$ regularize-cols.py -m priors2_1 priors1_1 priors1_2 priors2_1 priors2_2 > priors_all
Things to note regarding the operation of the script:
(found 111 fields in model file)
-- Processing 1 of 4: priors1_1
* Source file has 109 columns instead of 111.
* Columns to be inserted: [3]:"PRI.Tau.2" [5]:"PRI.Psi.2"
-- Processing 2 of 4: priors1_2
* Source file has 109 columns instead of 111.
* Columns to be inserted: [3]:"PRI.Tau.2" [5]:"PRI.Psi.2"
-- Processing 3 of 4: priors2_1
* Source file has same number of columns as model file.
* No columns to be inserted.
-- Processing 4 of 4: priors2_2
* Source file has same number of columns as model file.
* No columns to be inserted.
You can use "head <FILENAME> | awk '{print NF}'" to quickly count columns in files.
numLoci = 1
lowerTheta = 0.01
upperTheta = 75.0
upperTau = 10.0
numTauClasses = 0
upperMig = 0.0
upperRec = 0.0
upperAncPopSize = 0.5
constrain = 0
subParamConstrain = 000000000
reps = 5000000
BEGIN SAMPLE_TBL
100 7 93 3.71 2356 0.2 0.3 0.1 a1-a2.fas
16 4 12 4.08 2559 0.3 0.2 0.2 b1-b2.fas
10 5 5 4.08 1356 0.3 0.2 0.2 c1-c2.fas
29 8 21 4.09 2367 0.2 0.2 0.1 d1-d2.fas
11 2 9 4.13 1372 0.3 0.2 0.1 e1-e2.fas
69 38 31 4.13 1372 0.2 0.2 0.1 f1-f2.fas
END SAMPLE_TBL
$ msbayes.pl -c msbayes_master -o prior_samples