Tuesday, May 24, 2011

2011 and admixture!



Hi all!
It has been a long hiatus, but I just heard a talk that I though you all would find interesting, and Evolution is coming up, so I thought I would post!
I am in Okinawa at OIST for a workshop on genomics, with a focus on linkage and recombination. Today we had talks from Gil McVean and Simon Meyers, both of which were exciting, but it was Simon's work that I found particularly applicable to questions I (we, perhaps?) are interested in. He was involved in the development of HAPMIX, described in Price et al. 2008, a program for inferring ancestry of segments of chromosomes in admixed populations. I am planning to apply it to my longhorn data set, and I'll let you know how it goes, but it appears to have some advantages over using site likelihoods in STRUCTURE to estimate ancestry of alleles. Downside might be that it could require more dense genomic data than I have.


He also presented as yet unpublished work (I think in collaboration with Daniel Falush and Garrett Hellenthal) on using inference of patterns of introgression between populations to estimate levels and times of admixture, without ever actually estimating the blocks. Not clear on exactly how it works... Because he is doing this on human populations these correlate with fascinating historical events. Using a 28 year estimate for generation time, he was able to place fairly precisely times of admixture events which were corroborated by known historical migrations. Reconstructing Spanish introgression into native North American populations in the 1700s isn't terribly exciting, but inferring European ancestry from 400 BC in the Kalash people of northern Pakistan, who have an oral history of being descended from the armies of Alexander the Great is pretty cool. I'm hoping I can apply these methods to my cattle SNP data. I have already been claiming that Moors brought African cattle to Spain, but having dates to back it up make it lot more plausible...
Anyhow- I think HAPMIX might be a really useful resource, whose existence had slipped past me, and I look forwar to this new method being published.



Hope to see some of you folks in Norman next month. I am really looking forward to the next-gen data in phylogeography symposium!

Monday, October 18, 2010

Bayesian NCPA?

Hi all,
I just read Three roads diverged? Routes to phylogeographic inference by Erik W. Bloomquist, Philippe Lemey, and Marc A. Suchard, out next month in TREE and I'm confused! They review some interesting work in phylogeography, but also talk up a "Bayesian NCPA" method. However, my understanding of the method they discuss is that although it takes into account uncertainty in reconstruction of the haplotype network, it doesn't modify the inference key at all, which seemed to me to be the most problematically inscrutable part of NCPA. Have any of you folks looked at this? Is there anything fresh there?
As well- what are your impressions of the method they describe as "spatial diffusion"? My understanding from the Lemey et al 2009 paper is that it is akin to reconstructing location as a continuous character on the phylogeny under a variety of models. Have any of you tried this method? What are your thoughts?
Thanks!
Emily

Thursday, October 14, 2010

Sandwalk: Philosophers, Science, and Creationism

I don't entirely disagree with this, but...in what sense is "the supernatural did it" an explanation at all? If this is a tough question, then so is "proving naturalism couldn't do it" is also a tough question. Criticizing some particular scientific theory != no possible natural explanation is possible. It is easy to see how reasonable people might say this stuff gets beyond mere science.

Sunday, July 4, 2010

Building a Simple Application Using 'libsequence'

Folks,

As I discuss in my programming language comparison post, for performance, C++ is a natural choice. Calculating the simplest set of summary statistics on a set of data may take 30 seconds when using my beloved Python, but just a few seconds using C++. When needing to crunch through thousands or even tens of thousands of datasets, this makes a HUGE difference!

Of course, if writing the script to do so in Python takes less than 20 minutes, writing the equivalent C++ program can easily take several hours or even the whole day depending on your memory gremlins. A lot of the development time can be saved by making use of existing libraries. One of these that is particularly relevant to phylogeography is Kevin Thornton's libsequence. This library implements a broad range of summary statistics calculations in C++, and also features a fairly robust FASTA-format parser.

Using this library, I wrote a simple program to calculate Fst statistics from a FASTA file in less than an hour, and a lot of this time was spent in just dealing with parsing/processing the user options and arguments. As I note in the comments in the post, however, for a production-grade application, I would rather use the much more robust and flexible command-line parser that I wrote for my Ginkgo application, which would cut down development time on this aspect of the program by a dramatic amount.

In addition, it took me a little while longer than I anticipated to figure out how to get the whole thing to compile and link using gcc. As such, I also thought that I would share a simple general recipe for compiling and linking a libsequence application using gcc.

Because blogger sucks at code layout (or because I suck at getting blogger to layout code correctly), the actual sample code and build instructions are actually presented in a post on my personal site instead of here.

I have also written up a "autoconf" and "automake" project skeleton that automates most of the build process. This can be found in an attachment to the post mentioned above (direct link here).

Sunday, June 13, 2010

Evolution 2010 Fast Approaching

It's hard to believe, but Evolution 2010 in Portland, OR is right around the corner. According to their website, it will be the largest Evolution meetings ever with > 1,800 registrants and > 1,050 talks. Judging from the recently posted program, the meeting will have quite a lot to offer to those interested in all things phylogenetic. I count 6 sessions on "phylogenetic theory", a whopping 13 sessions on "phylogeography", 12 sessions on "phylogenetics & diversification", and lots of other related topics. Unfortunately, many of these sessions are overlapping on Saturday and Sunday. I hope everyone arrives rested and with a large coffee mug! (I'm serious about the mugs. The organizers are pushing to reduce waste at the meeting and are asking everyone to bring their own reusable beverage containers.)

On a side note, I will be acting as a mentor on Saturday for the Undergraduate Diversity program. Part of the program's purpose is to help participating undergraduates network by meeting researchers in the field. So, if you see me on Saturday, please stop for a quick hello and introductions.

Looking forward to seeing everyone who can make it!

Friday, May 21, 2010

Transferring Support Values Between Trees

Papers reporting phylogenetic trees often label some point estimate of the phylogeny (e.g., an ML tree) with support values derived from several methods (e.g., posterior probabilities, ML bootstraps, and MP bootstraps). Perhaps I'm merely ignorant of the available software, but I have never been able to find a way to easily transfer multiple support values from various consensus trees to a point estimate without having to recalculate the consensus values from the original collections of trees. This problem is particularly acute when the consensus trees differ in topology.

As general Python programming practice, as well as a way to learn to use Jeet and Mark's great Dendropy library, I decided to take a crack at fixing this problem. I wrote a python script, creatively titled transferBranchLabels.py, that takes a single point estimate of the phylogeny (w/ or w/o branch lengths) and an arbitrary number of trees with support values as command-line arguments. It then returns the point estimate (retaining branch lengths, if originally present) with all the support values labeling the internal nodes. If a node is present in the point estimate (target tree) but not in one of the support trees, the node is labeled with '-'. The output tree simply includes all the support values separated by some delimeter (I've used '/' for now). TreeView X doesn't like that format, but FigTree handles it just fine.

Here's an example run of the program:


Note that the program reports how many of the clades in each support tree were (not) found in the target tree. Here are the corresponding trees from FigTree:


Upper: Original ML tree. Lower: Labeled ML tree.


Upper: Consensus tree from a posterior distribution. Lower: Consensus tree from ML bootstrapping.

If this would be a useful utility for you, feel free to download it and give it a try. Let me know how it goes.

Friday, December 11, 2009

Who's Hungry for DIM SUM?

Kevin Savidge, Emily McTavish and I have been working on a program for the simulation of demography and individual migration events on a continuous landscape. We've decided to call it DIM SUM (Demography and Individual Migration Simulated Using a Markov chain). DIM SUM was primarily written by Kevin as an undergraduate research project.

DIM SUM is a stand-alone Java program that takes one main XML input file, as well as files specifying carrying capacity and borders across an arbitrary landscape. These secondary files can be provided as numeric matrices or as images. If images, particular color channels correspond to the different input variables. Because it can take images as input, you can output your favorite map from GIS software (with some color corresponding to how suitable a given pixel of habitat may be) and simulate directly on that landscape! It also uses great circle distances, in case the landscape on which you're simulating has a large latitudinal span (since two lines of latitude are closer to one another near the poles as compared to the equator).

As output, DIM SUM provides trees of ancestor-descendant relationships, the locations of all individuals, and can also plot all individuals directly on the landscape. Here's a simple example using the Hawaiian islands, where all land area is specified as equally good habitat. In this example, we've used four separate starting populations of size 1, with each population of origin tagged with a different color. Note how the population starting on the island with the greatest expanse of suitable habitat (the Big Island) begins to dominate the populations on the other islands. Here's the population after 5 generations with a population size of 21...




...after 25 generations with a population size of 2,288...


...after 200 generations with a population size of 2,635...



...and after 800 generations with a population size of 2,636...



Our original motivation for creating such a simulator was to generate data for the testing of phylogeographic methods (although DIM SUM has many other potential uses). In particular, these types of simulations should provide a more evenhanded test of coalescent-based methods and allow the testing of methods that model migration on a continuous landscape (Lemmon and Lemmon, 2008, Syst. Biol., 57(4): 544-561). One very nice feature of continuous landscape methods is that they allow you to infer the geographic locations (think latitude and longitude) of ancestors. In fact, you can plot a likelihood surface directly on the landscape! To do this, you sequentially constrain the location for the ancestor of interest to a series of locations across the landscape, calculate the maximum likelihood score, and then use this matrix of scores to approximate the surface.

DIM SUM can generate data suitable for testing such a method, by comparing the true location of the ancestor to the inferred likelihood surface. We performed some simple simulations to try out the Lemmons' program (PhyloMapper) for inferring ancestral locations. In the plots below, the true location of the ancestor is a marked with a "*".

Here's a surface for a simulation in which we allowed a population to expand from a single individual in the middle of the range for 5,000 generations and then double in size for another 5,000 generations. The area in red gives the 95% confidence envelope...



...and here's a case where we allowed both the population and range sizes to double (the range was originally constrained to the left half of the landscape)...



Note that there is some bias when migration occurs more often in one direction (when the range expands), but it wasn't substantial enough to reject the true location in this case. Given the new perspective on phylogeographic data that it provides, I hope PhyloMapper sees more widespread use and development.

If you're interested in using DIM SUM, you can download it at http://code.google.com/p/bio-dimsum. Please let us know what you think!

P.S. Count yourself lucky if you can find an undergrad who can write a program like this in under a semester!