Monday, October 18, 2010

Bayesian NCPA?

Hi all,
I just read Three roads diverged? Routes to phylogeographic inference by Erik W. Bloomquist, Philippe Lemey, and Marc A. Suchard, out next month in TREE and I'm confused! They review some interesting work in phylogeography, but also talk up a "Bayesian NCPA" method. However, my understanding of the method they discuss is that although it takes into account uncertainty in reconstruction of the haplotype network, it doesn't modify the inference key at all, which seemed to me to be the most problematically inscrutable part of NCPA. Have any of you folks looked at this? Is there anything fresh there?
As well- what are your impressions of the method they describe as "spatial diffusion"? My understanding from the Lemey et al 2009 paper is that it is akin to reconstructing location as a continuous character on the phylogeny under a variety of models. Have any of you tried this method? What are your thoughts?
Thanks!
Emily

Thursday, October 14, 2010

Sandwalk: Philosophers, Science, and Creationism

I don't entirely disagree with this, but...in what sense is "the supernatural did it" an explanation at all? If this is a tough question, then so is "proving naturalism couldn't do it" is also a tough question. Criticizing some particular scientific theory != no possible natural explanation is possible. It is easy to see how reasonable people might say this stuff gets beyond mere science.

Sunday, July 4, 2010

Building a Simple Application Using 'libsequence'

Folks,

As I discuss in my programming language comparison post, for performance, C++ is a natural choice. Calculating the simplest set of summary statistics on a set of data may take 30 seconds when using my beloved Python, but just a few seconds using C++. When needing to crunch through thousands or even tens of thousands of datasets, this makes a HUGE difference!

Of course, if writing the script to do so in Python takes less than 20 minutes, writing the equivalent C++ program can easily take several hours or even the whole day depending on your memory gremlins. A lot of the development time can be saved by making use of existing libraries. One of these that is particularly relevant to phylogeography is Kevin Thornton's libsequence. This library implements a broad range of summary statistics calculations in C++, and also features a fairly robust FASTA-format parser.

Using this library, I wrote a simple program to calculate Fst statistics from a FASTA file in less than an hour, and a lot of this time was spent in just dealing with parsing/processing the user options and arguments. As I note in the comments in the post, however, for a production-grade application, I would rather use the much more robust and flexible command-line parser that I wrote for my Ginkgo application, which would cut down development time on this aspect of the program by a dramatic amount.

In addition, it took me a little while longer than I anticipated to figure out how to get the whole thing to compile and link using gcc. As such, I also thought that I would share a simple general recipe for compiling and linking a libsequence application using gcc.

Because blogger sucks at code layout (or because I suck at getting blogger to layout code correctly), the actual sample code and build instructions are actually presented in a post on my personal site instead of here.

I have also written up a "autoconf" and "automake" project skeleton that automates most of the build process. This can be found in an attachment to the post mentioned above (direct link here).

Sunday, June 13, 2010

Evolution 2010 Fast Approaching

It's hard to believe, but Evolution 2010 in Portland, OR is right around the corner. According to their website, it will be the largest Evolution meetings ever with > 1,800 registrants and > 1,050 talks. Judging from the recently posted program, the meeting will have quite a lot to offer to those interested in all things phylogenetic. I count 6 sessions on "phylogenetic theory", a whopping 13 sessions on "phylogeography", 12 sessions on "phylogenetics & diversification", and lots of other related topics. Unfortunately, many of these sessions are overlapping on Saturday and Sunday. I hope everyone arrives rested and with a large coffee mug! (I'm serious about the mugs. The organizers are pushing to reduce waste at the meeting and are asking everyone to bring their own reusable beverage containers.)

On a side note, I will be acting as a mentor on Saturday for the Undergraduate Diversity program. Part of the program's purpose is to help participating undergraduates network by meeting researchers in the field. So, if you see me on Saturday, please stop for a quick hello and introductions.

Looking forward to seeing everyone who can make it!

Friday, May 21, 2010

Transferring Support Values Between Trees

Papers reporting phylogenetic trees often label some point estimate of the phylogeny (e.g., an ML tree) with support values derived from several methods (e.g., posterior probabilities, ML bootstraps, and MP bootstraps). Perhaps I'm merely ignorant of the available software, but I have never been able to find a way to easily transfer multiple support values from various consensus trees to a point estimate without having to recalculate the consensus values from the original collections of trees. This problem is particularly acute when the consensus trees differ in topology.

As general Python programming practice, as well as a way to learn to use Jeet and Mark's great Dendropy library, I decided to take a crack at fixing this problem. I wrote a python script, creatively titled transferBranchLabels.py, that takes a single point estimate of the phylogeny (w/ or w/o branch lengths) and an arbitrary number of trees with support values as command-line arguments. It then returns the point estimate (retaining branch lengths, if originally present) with all the support values labeling the internal nodes. If a node is present in the point estimate (target tree) but not in one of the support trees, the node is labeled with '-'. The output tree simply includes all the support values separated by some delimeter (I've used '/' for now). TreeView X doesn't like that format, but FigTree handles it just fine.

Here's an example run of the program:


Note that the program reports how many of the clades in each support tree were (not) found in the target tree. Here are the corresponding trees from FigTree:


Upper: Original ML tree. Lower: Labeled ML tree.


Upper: Consensus tree from a posterior distribution. Lower: Consensus tree from ML bootstrapping.

If this would be a useful utility for you, feel free to download it and give it a try. Let me know how it goes.