<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-2531775326563628423</id><updated>2011-10-03T17:10:39.181-07:00</updated><category term='C++'/><category term='msbayes'/><category term='Python'/><category term='Java'/><category term='ABC'/><category term='programming'/><category term='R'/><category term='approximate Bayesian Computation'/><title type='text'>geodendron</title><subtitle type='html'>Methods and issues in phylogeography and biogeography.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>Jeet Sukumaran</name><uri>http://www.blogger.com/profile/04874973927331407332</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_bSNhphjTeao/S23tnWOSfNI/AAAAAAAAABI/FhFQDr3WBZI/s1600-R/f83bd77d036b0a5d1a595c697a26ef6c.png'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>17</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-865423833904902291</id><published>2011-07-06T14:54:00.000-07:00</published><updated>2011-07-06T15:22:20.533-07:00</updated><title type='text'>Scientific Reputation</title><content type='html'>&lt;a href="http://1.bp.blogspot.com/-cOxylBMLuEc/ThTf5LJeOHI/AAAAAAAAAEY/RSgLkcZ-qKk/s1600/reputationFortuneCookie.jpg" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 288px; height: 216px;" src="http://1.bp.blogspot.com/-cOxylBMLuEc/ThTf5LJeOHI/AAAAAAAAAEY/RSgLkcZ-qKk/s400/reputationFortuneCookie.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5626368007830648946" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;a href="http://4.bp.blogspot.com/-F0dXseRotRg/ThTd0rMTCbI/AAAAAAAAAEQ/nGdyu_pf8TM/s1600/reputationFortuneCookie.jpg" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;br /&gt;&lt;/a&gt;A recent &lt;a href="http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002108"&gt;editorial&lt;/a&gt; in PLoS Computational Biology on building and maintaining a good scientific reputation just came to my attention. Much of this is probably already known and practiced by most, but it never hurts to see it again and consider it closely.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-865423833904902291?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/865423833904902291/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2011/07/scientific-reputation.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/865423833904902291'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/865423833904902291'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2011/07/scientific-reputation.html' title='Scientific Reputation'/><author><name>Jeremy Brown</name><uri>http://www.blogger.com/profile/03105958793710069493</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='http://1.bp.blogspot.com/_KkgJhpfFV3g/Sj2QLpX3rrI/AAAAAAAAAAM/C-cQ9k-WqEo/S220/n7938591_48119704_6896.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-cOxylBMLuEc/ThTf5LJeOHI/AAAAAAAAAEY/RSgLkcZ-qKk/s72-c/reputationFortuneCookie.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-3570275203440507689</id><published>2011-06-28T09:43:00.000-07:00</published><updated>2011-06-28T09:45:06.320-07:00</updated><title type='text'>Ascertainment bias data sets?</title><content type='html'>Hi folks,&lt;br /&gt;I'm working on a simulation study looking at the effects of using ascertained SNP data for phylogenetic and phylogeographic reconstruction. I'm basing my simulations on real data sets. I have a few but am looking for more! &lt;br /&gt;   Have any of you seen studies recently that use single nucleotide loci that were selected based on being polymorphic is some ascertainment panel? Mostly SNP chip based data sets are what I was thinking of, but there are others too.&lt;br /&gt;For example, &lt;a href="http://www.pnas.org/content/106/44/18644.short"&gt;Decker et al. 2009&lt;/a&gt; used marker loci that are known to be polymorphic in cattle to reconstruct divergences up to 29 mya across bovidae.&lt;br /&gt;SNP chip data are cheap and easy to get- I want to figure out how far you can push using them!&lt;br /&gt;Send me an email! ejmctavish [at] utexas.edu&lt;br /&gt;Thanks,&lt;br /&gt;ejm&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/-POvlA5llOoA/TgoDpMzIlKI/AAAAAAAAADc/6HekgMroj7I/s1600/Decker2009Phy.tiff"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 279px; height: 320px;" src="http://1.bp.blogspot.com/-POvlA5llOoA/TgoDpMzIlKI/AAAAAAAAADc/6HekgMroj7I/s320/Decker2009Phy.tiff" border="0" alt=""id="BLOGGER_PHOTO_ID_5623311091070702754" /&gt;&lt;/a&gt;&lt;br /&gt;Fig. 1 from Decker et al 2009. Strict consensus cladogram (no branch lengths) of 17 most parsimonious trees based on 40,843 SNP genotypes. *, Denotes paraphyletic group.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-3570275203440507689?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/3570275203440507689/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2011/06/ascertainment-bias-data-sets.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/3570275203440507689'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/3570275203440507689'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2011/06/ascertainment-bias-data-sets.html' title='Ascertainment bias data sets?'/><author><name>ejm</name><uri>http://www.blogger.com/profile/05515770572245690005</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_lQgViSPYRzU/SndfPT6EE7I/AAAAAAAAAAM/Ya-hPvwI0T4/S220/EmilyMcTavish.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-POvlA5llOoA/TgoDpMzIlKI/AAAAAAAAADc/6HekgMroj7I/s72-c/Decker2009Phy.tiff' height='72' width='72'/><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-1988447542456270437</id><published>2011-05-24T07:34:00.000-07:00</published><updated>2011-05-24T07:54:47.843-07:00</updated><title type='text'>2011 and admixture!</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/-cf_eBVYQnA4/Tdu3AFeKUyI/AAAAAAAAAC8/TDTst4Vt1W4/s1600/IMG_3690.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 240px; height: 320px;" src="http://4.bp.blogspot.com/-cf_eBVYQnA4/Tdu3AFeKUyI/AAAAAAAAAC8/TDTst4Vt1W4/s320/IMG_3690.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5610278972916650786" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;Hi all!&lt;br /&gt;   It has been a long hiatus, but I just heard a talk that I though you all would find interesting, and Evolution is coming up, so I thought I would post!&lt;br /&gt;I am in Okinawa at OIST for a workshop on genomics, with a focus on linkage and recombination. Today we had talks from Gil McVean and &lt;a href="http://www.stats.ox.ac.uk/people/academic_staff/simon_myers2"&gt;Simon Meyers&lt;/a&gt;, both of which were exciting, but it was Simon's work that I found particularly applicable to questions I (we, perhaps?) are interested in. He was involved in the development of &lt;a href="http://www.stats.ox.ac.uk/~myers/software.html"&gt;HAPMIX&lt;/a&gt;, described in &lt;a href="http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000519"&gt;Price et al. 2008&lt;/a&gt;, a program for inferring ancestry of segments of chromosomes in admixed populations. I am planning to apply it to my longhorn data set, and I'll let you know how it goes, but it appears to have some advantages over using site likelihoods in STRUCTURE to estimate ancestry of alleles. Downside might be that it could require more dense genomic data than I have.&lt;br /&gt;   &lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-LAO5Y4T4jQw/Tdu-zXEYHxI/AAAAAAAAADE/oIGZd8famLo/s1600/journal.pgen.1000519.g002.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 219px;" src="http://3.bp.blogspot.com/-LAO5Y4T4jQw/Tdu-zXEYHxI/AAAAAAAAADE/oIGZd8famLo/s320/journal.pgen.1000519.g002.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5610287550395064082" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;He also presented as yet unpublished work (I think in collaboration with Daniel Falush and Garrett Hellenthal) on using inference of patterns of introgression between populations to estimate levels and times of admixture, without ever actually estimating the blocks. Not clear on exactly how it works... Because he is doing this on human populations these correlate with fascinating historical events. Using a 28 year estimate for generation time, he was able to place fairly precisely times of admixture events which were corroborated by known historical migrations. Reconstructing Spanish introgression into native North American populations in the 1700s isn't terribly exciting, but inferring European ancestry from 400 BC in the Kalash people of northern Pakistan, who have an oral history of being descended from the armies of Alexander the Great is pretty cool. I'm hoping I can apply these methods to my cattle SNP data. I have already been claiming that Moors brought African cattle to Spain, but having dates to back it up make it lot more plausible...&lt;br /&gt;Anyhow- I think HAPMIX might be a really useful resource, whose existence had slipped past me, and I look forwar to this new method being published.&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/-2MEbA0acn1c/Tdu_ZxmWhVI/AAAAAAAAADM/TPfyBnNeIEs/s1600/alexander%2Bthe%2Bgreat.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 308px; height: 300px;" src="http://3.bp.blogspot.com/-2MEbA0acn1c/Tdu_ZxmWhVI/AAAAAAAAADM/TPfyBnNeIEs/s320/alexander%2Bthe%2Bgreat.jpg" border="0" alt=""id="BLOGGER_PHOTO_ID_5610288210351916370" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;   Hope to see some of you folks in Norman next month. I am really looking forward to the next-gen data in phylogeography symposium!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-1988447542456270437?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/1988447542456270437/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2011/05/2011-and-admixture.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/1988447542456270437'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/1988447542456270437'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2011/05/2011-and-admixture.html' title='2011 and admixture!'/><author><name>ejm</name><uri>http://www.blogger.com/profile/05515770572245690005</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_lQgViSPYRzU/SndfPT6EE7I/AAAAAAAAAAM/Ya-hPvwI0T4/S220/EmilyMcTavish.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/-cf_eBVYQnA4/Tdu3AFeKUyI/AAAAAAAAAC8/TDTst4Vt1W4/s72-c/IMG_3690.jpg' height='72' width='72'/><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-6066879094056240253</id><published>2010-10-18T14:51:00.000-07:00</published><updated>2010-10-18T15:00:50.166-07:00</updated><title type='text'>Bayesian NCPA?</title><content type='html'>Hi all,&lt;br /&gt;I just read  &lt;a href="http://www.sciencedirect.com/science?_ob=ArticleURL&amp;amp;_udi=B6VJ1-512FTGV-1&amp;amp;_user=108429&amp;amp;_coverDate=11%2F30%2F2010&amp;amp;_rdoc=1&amp;amp;_fmt=high&amp;amp;_orig=search&amp;amp;_origin=search&amp;amp;_sort=d&amp;amp;_docanchor=&amp;amp;view=c&amp;amp;_searchStrId=1503505192&amp;amp;_rerunOrigin=scholar.google&amp;amp;_acct=C000059713&amp;amp;_version=1&amp;amp;_urlVersion=0&amp;amp;_userid=108429&amp;amp;md5=f14eab0a8b2863e4b18d768d5e858655&amp;amp;searchtype=a"&gt;Three roads diverged? Routes to phylogeographic inference&lt;/a&gt; by Erik W. Bloomquist, Philippe Lemey, and Marc A. Suchard, out next month in TREE and I'm confused! They review some interesting work in phylogeography, but also talk up a "Bayesian NCPA" method. However, my understanding of the method they discuss is that although it takes into account uncertainty in reconstruction of the haplotype network, it doesn't modify the inference key at all, which seemed to me to be the most problematically inscrutable part of NCPA. Have any of you folks looked at this? Is there anything fresh there?&lt;br /&gt;As well- what are your impressions of the method they describe as "spatial diffusion"? My understanding from the Lemey et al 2009 paper is that it is akin to reconstructing location as a continuous character on the phylogeny under a variety of models. Have any of you tried this method? What are your thoughts?&lt;br /&gt;Thanks!&lt;br /&gt;Emily&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-6066879094056240253?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/6066879094056240253/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2010/10/bayesian-ncpa.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/6066879094056240253'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/6066879094056240253'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2010/10/bayesian-ncpa.html' title='Bayesian NCPA?'/><author><name>ejm</name><uri>http://www.blogger.com/profile/05515770572245690005</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='29' height='32' src='http://3.bp.blogspot.com/_lQgViSPYRzU/SndfPT6EE7I/AAAAAAAAAAM/Ya-hPvwI0T4/S220/EmilyMcTavish.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-5538021781956707045</id><published>2010-10-14T11:41:00.000-07:00</published><updated>2010-10-14T11:41:02.610-07:00</updated><title type='text'>Sandwalk: Philosophers, Science, and Creationism</title><content type='html'>I don't entirely disagree with this, but...in what sense is "the supernatural did it" an explanation at all?  If this is a tough question, then so is "proving naturalism couldn't do it" is also a tough question.  Criticizing some particular scientific theory != no possible natural explanation is possible.  It is easy to see how reasonable people might say this stuff gets beyond mere science.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-5538021781956707045?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='related' href='http://sandwalk.blogspot.com/2010/10/philosophers-science-and-creationism.html' title='Sandwalk: Philosophers, Science, and Creationism'/><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/5538021781956707045/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2010/10/sandwalk-philosophers-science-and.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/5538021781956707045'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/5538021781956707045'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2010/10/sandwalk-philosophers-science-and.html' title='Sandwalk: Philosophers, Science, and Creationism'/><author><name>NickM</name><uri>http://www.blogger.com/profile/04765417807335152285</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-4955207721979050989</id><published>2010-07-04T00:26:00.000-07:00</published><updated>2010-07-04T00:47:45.138-07:00</updated><title type='text'>Building a Simple Application Using 'libsequence'</title><content type='html'>Folks,&lt;br /&gt;&lt;br /&gt;As I discuss in my &lt;a href="http://geodendron.blogspot.com/2009/10/idiosyncratic-overview-of-some.html"&gt;programming language comparison&lt;/a&gt; post, for performance, C++ is a natural choice. Calculating the simplest set of summary statistics on a set of data may take 30 seconds when using my beloved Python, but just a few seconds using C++. When needing to crunch through thousands or even tens of thousands of datasets, this makes a HUGE difference!&lt;br /&gt;&lt;br /&gt;Of course, if writing the script to do so in Python takes less than 20 minutes, writing the equivalent C++ program can easily take several hours or even the whole day depending on your memory gremlins. A lot of the development time can be saved by making use of existing libraries. One of these that is particularly relevant to phylogeography is Kevin Thornton's &lt;a href="http://molpopgen.org/software/libsequence.html"&gt;libsequence&lt;/a&gt;. This library implements a broad range of summary statistics calculations in C++, and also features a fairly robust FASTA-format parser. &lt;br /&gt;&lt;br /&gt;Using this library, I wrote a simple program to calculate Fst statistics from a FASTA file in less than an hour, and a lot of this time was spent in just dealing with parsing/processing the user options and arguments. As I note in the comments in the post, however, for a production-grade application, I would rather use the much more robust and flexible &lt;a href="http://github.com/jeetsukumaran/Ginkgo/blob/master/ginkgocc/cmdopt.cpp"&gt;command-line parser&lt;/a&gt; that I wrote for my &lt;a href="http://phylo.bio.ku.edu/ginkgo/"&gt;Ginkgo&lt;/a&gt; application, which would cut down development time on this aspect of the program by a dramatic amount.&lt;br /&gt;&lt;br /&gt;In addition, it took me a little while longer than I anticipated to figure out how to get the whole thing to compile and link using &lt;a href="http://gcc.gnu.org/"&gt;gcc&lt;/a&gt;. As such, I also thought that I would share a simple general recipe for compiling and linking a &lt;a href="http://molpopgen.org/software/libsequence.html"&gt;libsequence&lt;/a&gt; application using &lt;a href="http://gcc.gnu.org/"&gt;gcc&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Because blogger sucks at code layout (or because I suck at getting blogger to layout code correctly), the actual sample code and build instructions are actually presented &lt;a href="http://jeetworks.org/node/72"&gt;in a post on my personal site&lt;/a&gt; instead of here. &lt;br /&gt;&lt;br /&gt;I have also written up a "&lt;a href="http://www.gnu.org/software/autoconf/"&gt;autoconf&lt;/a&gt;" and "&lt;a href="http://www.gnu.org/software/automake/"&gt;automake&lt;/a&gt;" project skeleton that automates most of the build process. This can be found in an attachment to &lt;a href="http://jeetworks.org/node/72"&gt;the post&lt;/a&gt; mentioned above (direct link &lt;a href="http://jeetworks.org/sites/default/files/sample-libsequence-prog-build.tgz"&gt;here&lt;/a&gt;).&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-4955207721979050989?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/4955207721979050989/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2010/07/building-simple-application-using.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/4955207721979050989'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/4955207721979050989'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2010/07/building-simple-application-using.html' title='Building a Simple Application Using &apos;libsequence&apos;'/><author><name>Jeet Sukumaran</name><uri>http://www.blogger.com/profile/04874973927331407332</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_bSNhphjTeao/S23tnWOSfNI/AAAAAAAAABI/FhFQDr3WBZI/s1600-R/f83bd77d036b0a5d1a595c697a26ef6c.png'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-1264767014787045225</id><published>2010-06-13T16:00:00.001-07:00</published><updated>2010-06-13T16:23:58.441-07:00</updated><title type='text'>Evolution 2010 Fast Approaching</title><content type='html'>&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_KkgJhpfFV3g/TBVjUMU_PpI/AAAAAAAAADg/d_m1W-rLhrg/s1600/evolution2010_banner.png"&gt;&lt;img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 320px; height: 72px;" src="http://3.bp.blogspot.com/_KkgJhpfFV3g/TBVjUMU_PpI/AAAAAAAAADg/d_m1W-rLhrg/s320/evolution2010_banner.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5482397319951236754" /&gt;&lt;/a&gt;&lt;div&gt;It's hard to believe, but &lt;a href="http://www.evolutionsociety.org/SSE2010/"&gt;Evolution 2010&lt;/a&gt; in Portland, OR is right around the corner.  According to &lt;a href="http://www.evolutionsociety.org/SSE2010/"&gt;their website&lt;/a&gt;, it will be the largest Evolution meetings ever with &gt; 1,800 registrants and &gt; 1,050 talks.  Judging from the recently posted program, the meeting will have quite a lot to offer to those interested in all things phylogenetic. I count 6 sessions on "phylogenetic theory", a whopping 13 sessions on "phylogeography", 12 sessions on "phylogenetics &amp;amp; diversification", and lots of other related topics.  Unfortunately, many of these sessions are overlapping on Saturday and Sunday.  I hope everyone arrives rested and with a large coffee mug!  (I'm serious about the mugs.  The organizers are pushing to reduce waste at the meeting and are asking everyone to bring their own reusable beverage containers.)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;On a side note, I will be acting as a mentor on Saturday for the Undergraduate Diversity program.  Part of the program's purpose is to help participating undergraduates network by meeting researchers in the field.  So, if you see me on Saturday, please stop for a quick hello and introductions.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Looking forward to seeing everyone who can make it!&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-1264767014787045225?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/1264767014787045225/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2010/06/evolution-2010-fast-approaching.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/1264767014787045225'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/1264767014787045225'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2010/06/evolution-2010-fast-approaching.html' title='Evolution 2010 Fast Approaching'/><author><name>Jeremy Brown</name><uri>http://www.blogger.com/profile/03105958793710069493</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='http://1.bp.blogspot.com/_KkgJhpfFV3g/Sj2QLpX3rrI/AAAAAAAAAAM/C-cQ9k-WqEo/S220/n7938591_48119704_6896.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_KkgJhpfFV3g/TBVjUMU_PpI/AAAAAAAAADg/d_m1W-rLhrg/s72-c/evolution2010_banner.png' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-1090771104535005720</id><published>2010-05-21T14:14:00.000-07:00</published><updated>2010-05-21T14:58:33.446-07:00</updated><title type='text'>Transferring Support Values Between Trees</title><content type='html'>&lt;div style="text-align: left;"&gt;Papers reporting phylogenetic trees often label some point estimate of the phylogeny (e.g., an ML tree) with support values derived from several methods (e.g., posterior probabilities, ML bootstraps, and MP bootstraps). Perhaps I'm merely ignorant of the available software, but I have never been able to find a way to easily transfer multiple support values from various consensus trees to a point estimate without having to recalculate the consensus values from the original collections of trees. This problem is particularly acute when the consensus trees differ in topology.&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As general Python programming practice, as well as a way to learn to use Jeet and Mark's great &lt;a href="http://packages.python.org/DendroPy/"&gt;Dendropy&lt;/a&gt; library, I decided to take a crack at fixing this problem. I wrote a python script, creatively titled &lt;a href="http://webspace.utexas.edu/jmb2239/transferBranchLabels.py"&gt;transferBranchLabels.py&lt;/a&gt;, that takes a single point estimate of the phylogeny (w/ or w/o branch lengths) and an arbitrary number of trees with support values as command-line arguments. It then returns the point estimate (retaining branch lengths, if originally present) with all the support values labeling the internal nodes. If a node is present in the point estimate (target tree) but not in one of the support trees, the node is labeled with '-'.  The output tree simply includes all the support values separated by some delimeter (I've used '/' for now).  TreeView X doesn't like that format, but FigTree handles it just fine.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here's an example run of the program:&lt;/div&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 0); "&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_KkgJhpfFV3g/S_b98vX5rfI/AAAAAAAAACw/v8czvStyr7I/s1600/programExecution.jpg"&gt;&lt;img src="http://4.bp.blogspot.com/_KkgJhpfFV3g/S_b98vX5rfI/AAAAAAAAACw/v8czvStyr7I/s400/programExecution.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5473841617066110450" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 227px; " /&gt;&lt;/a&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Note that the program reports how many of the clades in each support tree were (not) found in the target tree.  Here are the corresponding trees from FigTree:&lt;/div&gt;&lt;div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_KkgJhpfFV3g/S_cAo8s1c4I/AAAAAAAAADA/C9JBQPKZ_fo/s1600/example1.jpg"&gt;&lt;img src="http://2.bp.blogspot.com/_KkgJhpfFV3g/S_cAo8s1c4I/AAAAAAAAADA/C9JBQPKZ_fo/s400/example1.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5473844575581074306" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 267px; height: 400px; " /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Upper: Original ML tree. Lower: Labeled ML tree.  &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_KkgJhpfFV3g/S_cA_J65wII/AAAAAAAAADQ/YrACqvCiYuw/s1600/example2.jpg"&gt;&lt;img src="http://3.bp.blogspot.com/_KkgJhpfFV3g/S_cA_J65wII/AAAAAAAAADQ/YrACqvCiYuw/s400/example2.jpg" border="0" alt="" id="BLOGGER_PHOTO_ID_5473844957086859394" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 267px; height: 400px; " /&gt;&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Upper: Consensus tree from a posterior distribution.  Lower: Consensus tree from ML bootstrapping.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If this would be a useful utility for you, feel free to &lt;a href="http://webspace.utexas.edu/jmb2239/transferBranchLabels.py"&gt;download&lt;/a&gt; it and give it a try.  Let me know how it goes.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-1090771104535005720?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/1090771104535005720/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2010/05/transferring-support-values-between.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/1090771104535005720'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/1090771104535005720'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2010/05/transferring-support-values-between.html' title='Transferring Support Values Between Trees'/><author><name>Jeremy Brown</name><uri>http://www.blogger.com/profile/03105958793710069493</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='http://1.bp.blogspot.com/_KkgJhpfFV3g/Sj2QLpX3rrI/AAAAAAAAAAM/C-cQ9k-WqEo/S220/n7938591_48119704_6896.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_KkgJhpfFV3g/S_b98vX5rfI/AAAAAAAAACw/v8czvStyr7I/s72-c/programExecution.jpg' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-3161616414837717196</id><published>2009-12-11T16:13:00.000-08:00</published><updated>2009-12-11T17:41:31.826-08:00</updated><title type='text'>Who's Hungry for DIM SUM?</title><content type='html'>&lt;div style="text-align: left;"&gt;Kevin Savidge, Emily McTavish and I have been working on a program for the simulation of demography and individual migration events on a continuous landscape.  We've decided to call it DIM SUM (&lt;u&gt;D&lt;/u&gt;emography and &lt;u&gt;I&lt;/u&gt;ndividual &lt;u&gt;M&lt;/u&gt;igration &lt;u&gt;S&lt;/u&gt;imulated &lt;u&gt;U&lt;/u&gt;sing a &lt;u&gt;M&lt;/u&gt;arkov chain).  DIM SUM was primarily written by Kevin as an undergraduate research project.&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;DIM SUM is a stand-alone Java program that takes one main XML input file, as well as files specifying carrying capacity and borders across an arbitrary landscape.  These secondary files can be provided as numeric matrices or as images.  If images, particular color channels correspond to the different input variables.  Because it can take images as input, you can output your favorite map from GIS software (with some color corresponding to how suitable a given pixel of habitat may be) and simulate directly on that landscape!  It also uses great circle distances, in case the landscape on which you're simulating has a large latitudinal span (since two lines of latitude are closer to one another near the poles as compared to the equator).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As output, DIM SUM provides trees of ancestor-descendant relationships, the locations of all individuals, and can also plot all individuals directly on the landscape.  Here's a simple example using the Hawaiian islands, where all land area is specified as equally good habitat.  In this example, we've used four separate starting populations of size 1, with each population of origin tagged with a different color.  Note how the population starting on the island with the greatest expanse of suitable habitat (the Big Island) begins to dominate the populations on the other islands.  Here's the population after 5 generations with a population size of 21...&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 298px;" src="http://3.bp.blogspot.com/_KkgJhpfFV3g/SyLmGVEdyEI/AAAAAAAAABc/Z1gtEK7jocE/s400/Hawaii_1.gif" border="0" alt="" id="BLOGGER_PHOTO_ID_5414142698462234690" /&gt;&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;...after 25 generations with a population size of 2,288...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 298px;" src="http://4.bp.blogspot.com/_KkgJhpfFV3g/SyLmSH6j5uI/AAAAAAAAABk/BJQMOa5XSEY/s400/Hawaii_2.gif" border="0" alt="" id="BLOGGER_PHOTO_ID_5414142901089461986" /&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;...after 200 generations with a population size of 2,635...&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;img src="http://3.bp.blogspot.com/_KkgJhpfFV3g/SyLmwEMCB_I/AAAAAAAAABs/KVLyeTLnQhc/s400/Hawaii_3.gif" border="0" alt="" id="BLOGGER_PHOTO_ID_5414143415485073394" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 298px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;...and after 800 generations with a population size of 2,636...&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;img src="http://4.bp.blogspot.com/_KkgJhpfFV3g/SyLnHtv9VwI/AAAAAAAAAB0/94uPLKkU9lc/s400/Hawaii_4.gif" border="0" alt="" id="BLOGGER_PHOTO_ID_5414143821778605826" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 400px; height: 298px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;Our original motivation for creating such a simulator was to generate data for the testing of phylogeographic methods (although DIM SUM has many other potential uses).  In particular, these types of simulations should provide a more evenhanded test of coalescent-based methods and allow the testing of methods that model migration on a continuous landscape (Lemmon and Lemmon, 2008, Syst. Biol., 57(4): 544-561).  One very nice feature of continuous landscape methods is that they allow you to infer the geographic locations (think latitude and longitude) of ancestors.  In fact, you can plot a likelihood surface directly on the landscape!  To do this, you sequentially constrain the location for the ancestor of interest to a series of locations across the landscape, calculate the maximum likelihood score, and then use this matrix of scores to approximate the surface.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;DIM SUM can generate data suitable for testing such a method, by comparing the true location of the ancestor to the inferred likelihood surface.  We performed some simple simulations to try out the Lemmons' program (PhyloMapper) for inferring ancestral locations.   In the plots below, the true location of the ancestor is a marked with a "*".&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;Here's a surface for a simulation in which we allowed a population to expand from a single individual in the middle of the range for 5,000 generations and then double in size for another 5,000 generations.  The area in red gives the 95% confidence envelope...&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;img src="http://3.bp.blogspot.com/_KkgJhpfFV3g/SyLpffZBflI/AAAAAAAAAB8/RGdebaCD4f8/s400/pop.double.surface.gif" border="0" alt="" id="BLOGGER_PHOTO_ID_5414146429264428626" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 323px; height: 306px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;...and here's a case where we allowed both the population and range sizes to double (the range was originally constrained to the left half of the landscape)...&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;span class="Apple-style-span" style="color: rgb(0, 0, 238); -webkit-text-decorations-in-effect: underline; "&gt;&lt;img src="http://3.bp.blogspot.com/_KkgJhpfFV3g/SyLqB6dHUKI/AAAAAAAAACE/mHbP_l1zxHk/s400/range.pop.double.surface.gif" border="0" alt="" id="BLOGGER_PHOTO_ID_5414147020644896930" style="display: block; margin-top: 0px; margin-right: auto; margin-bottom: 10px; margin-left: auto; text-align: center; cursor: pointer; width: 323px; height: 306px; " /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: center;"&gt;&lt;span class="Apple-style-span"  style="color:#0000EE;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;Note that there is some bias when migration occurs more often in one direction (when the range expands), but it wasn't substantial enough to reject the true location in this case.  Given the new perspective on phylogeographic data that it provides, I hope PhyloMapper sees more widespread use and development.&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;If you're interested in using DIM SUM, you can download it at http://code.google.com/p/bio-dimsum.  Please let us know what you think!&lt;/div&gt;&lt;div style="text-align: left;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="text-align: left;"&gt;P.S. Count yourself lucky if you can find an undergrad who can write a program like this in under a semester!&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-3161616414837717196?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/3161616414837717196/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2009/12/whos-hungry-for-dim-sum.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/3161616414837717196'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/3161616414837717196'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2009/12/whos-hungry-for-dim-sum.html' title='Who&apos;s Hungry for DIM SUM?'/><author><name>Jeremy Brown</name><uri>http://www.blogger.com/profile/03105958793710069493</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='http://1.bp.blogspot.com/_KkgJhpfFV3g/Sj2QLpX3rrI/AAAAAAAAAAM/C-cQ9k-WqEo/S220/n7938591_48119704_6896.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_KkgJhpfFV3g/SyLmGVEdyEI/AAAAAAAAABc/Z1gtEK7jocE/s72-c/Hawaii_1.gif' height='72' width='72'/><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-7185024834219562805</id><published>2009-10-03T00:00:00.000-07:00</published><updated>2009-10-05T10:28:38.454-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='Java'/><category scheme='http://www.blogger.com/atom/ns#' term='programming'/><category scheme='http://www.blogger.com/atom/ns#' term='C++'/><category scheme='http://www.blogger.com/atom/ns#' term='Python'/><category scheme='http://www.blogger.com/atom/ns#' term='R'/><title type='text'>An Idiosyncratic Analogical Overview of Some Programming Languages from an Evolutionary Biologist's Perspective</title><content type='html'>&lt;div class="content"&gt;&lt;h3&gt;&lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;&lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; is like a microwave oven. It is capable of handling a wide range of pre-packaged tasks, but can be frustrating or inappropriate when trying to do even simple things that are outside of its (admittedly vast) library of functions. Ever tried to make toast in a microwave? There has been a push to start using &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;  for simulations and phylogenetic analysis, and I am actually rather ambiguous about how I feel about this. On the one hand, I would much rather an open source platform &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; be used than some proprietary commercial platform such as &lt;a href="http://www.wolfram.com/"&gt;Mathematica&lt;/a&gt; or &lt;a href="http://www.mathworks.com/"&gt;Matlab&lt;/a&gt;. On the other hand, I do not think that &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; is the most suitable for the full spectrum of these applications. There are some serious limitations on its capability and performance when handling large datasets (mainly for historical design reasons), and, to be frank, I find many aspects of its syntax and idiom quite irritating. I primarily use it for what it was designed for: numerical and statistical computing, as well as data visualization and plotting, and in this context I am quite happy with it. For almost anything else, I look elsewhere. &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; has an excellent and very friendly community of people actively developing and supporting it, and I might change my view as &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; evolves. But, as things stand, it simply is not my first choice for the majority of my programming needs.&lt;br /&gt;&lt;/p&gt;&lt;h3&gt;&lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;If &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; is like a microwave oven, then &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt; is a full-fledged modern kitchen. You can produce almost anything you want, from toast to multi-course banquets, but you probably need to do some extra work relative to &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;. With &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, if you want an apple pie, all you need to do is pick up a frozen one from a supermarket and heat it up, and you'll be feasting in a matter of minutes. With &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt;, you will need to knead your own dough, cook your own apple filling, etc., but you will get your apple pie, and, to be honest, programming in &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt; is such a pleasure that you will probably enjoy the extra work. And what happens if you instead want a pie with strawberries and dark chocolate and a hint of chili in the filling? With &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;, if you cannot find an appropriate instant pie in your supermarket, you are out of luck, or you might be in for a very painful adventure in trying to mash together some chimeric concoction that will look horrible and taste worse. But with &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt;,  any pie you can dream of is completely within reach, and probably will not take too much more effort than plain old apple pie. From data wrangling and manipulation (prepping data files, converting file formats, relabeling sequences, etc. etc.) to pipelining workflows, and even to many types analyses and simulations, &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt; is the ideal choice for the majority of tasks that an evolutionary biologist carries out.&lt;br /&gt;&lt;/p&gt;&lt;h3&gt;&lt;a href="http://en.wikipedia.org/wiki/C%2B%2B"&gt;C++&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;If &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt;  is like a microwave, and &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt; is like modern kitchen, then C++ is like an antique Victorian kitchen in the countryside, far, far, far away from any supermarket. You want an apple pie? With C++, you can consider yourself lucky if you do not actually have to grow the apples and harvest the wheat yourself. You will certainly need to mill and sift the flour, churn the butter, pluck the apples, grind the spices, etc. And none of this "set the oven to 400°" business: you will need to chop up the wood for the fuel, start the fire and keep tending the heat while it is baking. You will eventually get the apple pie, but you are going to have to work very, very, very hard to get it, and most of it will be tedious and painful work. And you will probably have at least one or two bugs in the pie when all is done. More likely than not, memory bugs ...&lt;br /&gt;&lt;/p&gt;&lt;p&gt;Stepping out of the cooking/kitchen analogy, if I had to point out the single aspect of programming in &lt;a href="http://en.wikipedia.org/wiki/C%2B%2B"&gt;C++&lt;/a&gt; that makes it such a pain, I would say "memory management". Much of the time programming in &lt;a href="http://en.wikipedia.org/wiki/C%2B%2B"&gt;C++&lt;/a&gt; is spent coding up the overhead involved in managing memory (in the broadest sense, from declaration, definition and initialization of stack variables, to allocation and deallocation of heap variables, to tracking and maintaining both types through their lifecycles), and even more is spent in tracking down bugs caused by memory leaks. The &lt;a href="http://www.cplusplus.com/reference/stl/"&gt;Standard Template Library&lt;/a&gt; certainly helps, and I've come to find it indispensable, but it still exacts its own cost in terms of overhead and its own price in terms of chasing down memory leaks and bugs.&lt;/p&gt;&lt;p&gt;For an example of the overhead, compare looping through elements of a container in &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt;:&lt;/p&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; background-color: rgb(238, 238, 238); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;for i in data:&lt;br /&gt;    print i&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;vs. &lt;a href="http://en.wikipedia.org/wiki/C%2B%2B"&gt;C++&lt;/a&gt; using the &lt;a href="http://www.cplusplus.com/reference/stl/"&gt;Standard Template Library&lt;/a&gt;:&lt;/p&gt;&lt;pre style="border: 1px dashed rgb(153, 153, 153); padding: 5px; overflow: auto; background-color: rgb(238, 238, 238); color: black; font-family: Andale Mono,Lucida Console,Monaco,fixed,monospace; font-size: 12px; line-height: 14px; width: 100%;"&gt;&lt;code&gt;for (std::vector&amp;lt;long&amp;gt;::const_iterator i = data.begin();&lt;br /&gt;        i != data.end();      &lt;br /&gt;        ++i) {&lt;br /&gt;    std::cout &amp;lt;&amp;lt; *i &amp;lt;&amp;lt; std::endl;&lt;br /&gt;}&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;And for an example of insiduous memory bug even with the &lt;a href="http://www.cplusplus.com/reference/stl/"&gt;Standard Template Library&lt;/a&gt;, consider this: what &lt;i&gt;might&lt;/i&gt; happen &lt;i&gt;sometimes&lt;/i&gt; to a pointer that you have been keeping to an element in a vector, when some part of your code appends a new element to the vector? It can be quite ugly.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;So what does all that extra work and pain get you?&lt;/p&gt;&lt;p&gt;Performance.&lt;/p&gt;&lt;p&gt;When it comes to performance, &lt;a href="http://en.wikipedia.org/wiki/C%2B%2B"&gt;C++&lt;/a&gt; rocks. My initial attempt at a complex phylogeography simulator was in &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt;. It took me a week to get it working. I could manage population sizes of about 1000 on a 1G machine, and it could complete 10000 generations in about a week. I rewrote it in C++. Instead of a week, it took me two and a half months to get it to the same level of complexity. When completed, however, it could manage population sizes of over 1 million on a 1 G machine, and run 2.5 million generations in 24 hours.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;After that experience, when I am thinking of coding up something that might be computationally intensive or push the memory limits of my machines, the language that comes to mind is  &lt;a href="http://en.wikipedia.org/wiki/C%2B%2B"&gt;C++&lt;/a&gt;. More likely than not, however, I would probably still try to code up the initial solution in &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt;, and only turn to &lt;a href="http://en.wikipedia.org/wiki/C%2B%2B"&gt;C++&lt;/a&gt; when it becomes clear that &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt;'s performance is not up to the task.&lt;br /&gt;&lt;/p&gt;&lt;h3&gt;&lt;a href="http://www.java.com/"&gt;Java&lt;/a&gt;&lt;/h3&gt;&lt;p&gt;&lt;a href="http://www.java.com/"&gt;Java&lt;/a&gt;, like &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt;, is a modern kitchen, allowing for a full range of operations with all the sanity-preserving conveniences and facilities (e.g., garbage-collection/memory-management). But it is a large, industrial kitchen, with an executive chef, sous chefs, and a full team of chefs de partie running things. And so, while you can do everything from making toast to multi-course meals, even the simplest tasks takes a certain minimal investment of organization and overhead. At the end of the day, for many simpler things, such as scrambled eggs and toast, you would get just as good results a lot quicker using &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt;.&lt;br /&gt;&lt;/p&gt;&lt;p&gt;I think that &lt;a href="http://www.java.com/"&gt;Java&lt;/a&gt; is really nice language, and I do like its idioms and syntax, which, by design, enforces many good programming practices. It is also probably much more suited for enterprise-level application development than &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt;. But I find it very top-heavy for the vast majority of things that I do, and the extra investment in programming overhead that it imposes (think: getters and setters) does not buy me any performance benefit at all. As a result, I have not found a need to code a single application in &lt;a href="http://www.java.com/"&gt;Java&lt;/a&gt; since I started using &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt; several years ago.&lt;/p&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-7185024834219562805?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/7185024834219562805/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2009/10/idiosyncratic-overview-of-some.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/7185024834219562805'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/7185024834219562805'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2009/10/idiosyncratic-overview-of-some.html' title='An Idiosyncratic Analogical Overview of Some Programming Languages from an Evolutionary Biologist&apos;s Perspective'/><author><name>Jeet Sukumaran</name><uri>http://www.blogger.com/profile/04874973927331407332</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_bSNhphjTeao/S23tnWOSfNI/AAAAAAAAABI/FhFQDr3WBZI/s1600-R/f83bd77d036b0a5d1a595c697a26ef6c.png'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-1758044034938036073</id><published>2009-10-01T21:15:00.000-07:00</published><updated>2009-10-02T00:51:11.126-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ABC'/><category scheme='http://www.blogger.com/atom/ns#' term='msbayes'/><category scheme='http://www.blogger.com/atom/ns#' term='approximate Bayesian Computation'/><title type='text'>msBayes Basics Part VI: Tolerances and Local-Linear Regression Weighting of the Posteriors</title><content type='html'>This is the sixth of a &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-i-introduction.html"&gt;multi-part series&lt;/a&gt; on &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; basics. In the &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-v-rejection-sampling-to.html"&gt;previous part&lt;/a&gt;, I discussed how you might run "msReject" to carry out rejection sampling on your samples from the prior, to obtain samples from the posterior. One crucial setting for this previous step was the choice of tolerance, and this is something I discuss in detail here.&lt;br /&gt;&lt;br /&gt;First, let us consider why there is a need for tolerances. In general, the more summary statistics we use, the better our ability to discriminate between models and correctly approximate the posterior becomes (as long as we are reasonable about the summary statistics we use). But, at the same time, the more difficult it becomes to simulate a sample from the priors that result in summary statistics that equal the summary statistics calculated on our observed data. The variation in some of the nuisance parameters that we are trying to integrate out, as well as stochasticity in the processes that we are trying to model, make it difficult to generate a sample with identical summary statistics even under a perfect (correct) model, even with a few summary statistics. So we allow some "jiggle room": we say that we will accept any sample from the prior that comes &lt;span style="font-style: italic;"&gt;close&lt;/span&gt; to the observed in terms of the summary statistic, even if they are not exactly equal. And this jiggle room is what we call the tolerance. The more summary statistics that we use, the larger the tolerance that we will need to allow us to accept simulations at a reasonable rate.&lt;br /&gt;&lt;br /&gt;However, as the tolerance gets larger and larger, we begin introducing more and more distortion in our approximation. This is because all samples accepted within the tolerance zone are weighted equally in the posterior. So a sample far from the observed in terms of summary statistic distance will be contribute the same proportion to the posterior as one very close, as long as they are both within the tolerance zone. In the limit, as our tolerance goes to infinity, we will accept &lt;span style="font-style: italic;"&gt;all&lt;/span&gt; samples from the prior, and thus the direction of bias due to large tolerances is to weight our posteriors toward the prior.&lt;br /&gt;&lt;span style="visibility: visible;" id="main"&gt;&lt;span style="visibility: visible;" id="search"&gt;&lt;br /&gt;As it happens, recent work by &lt;a href="http://www.genetics.org/cgi/content/full/162/4/2025"&gt;Beaumont et al.&lt;/a&gt; provides a solution to this problem. They use local-linear regression to weight the posteriors so that samples with summary statistics closer to the observed summary statistics are given a greater weight than those further away. With this, the analysis becomes relatively insensitive to choice of tolerances, as long as the number of posterior samples are large enough to allow accurate regression. This is vividly illustrated in the data below, which shows the results of rejection sampling at different tolerances, comparing the mean of the raw posterior distribution of a parameter vs. the mean of the weighted posterior distribution of the same parameter:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;TOL.    NUM.    UNWEIGHTED           WEIGHTED&lt;br /&gt;1e-05   500     0.148147697394790    0.188623205810913&lt;br /&gt;2e-05  1000    0.178475544544545    0.138229444728950&lt;br /&gt;1e-04   5000    0.217240549309862    0.126658030148325&lt;br /&gt;2e-04   10000   0.235878721472147    0.130784297340281&lt;br /&gt;0.001   50000   0.308834221344427    0.123652684056336&lt;br /&gt;0.002   100000  0.355224938539385    0.123306496530203&lt;br /&gt;&lt;/pre&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;As can be seen, both the range and variance of the regression-weighted parameter distribution across different tolerances are an order of magnitude less than the unweighted parameter distribution (0.0653167 vs. 0.2070772, and 0.0006326896 vs. 0.006153922, respectively). The posterior samples size under the first tolerance setting, 1e-05, is 500, and this small size probably led to a relatively inaccurate regression, as is evidence by the mean of the weighted distribution associated with it being rather far (&gt; 7 s.d. units) from all the other values. Considering posterior sizes of 1000 or more, we find that the variance of the means of the weighted distributions vs. the unweighted is 3.843462e-05 vs. 0.005126309, and, regardless of our choice of the tolerance, we can safely place the mean between 0.12 and 0.14 (a difference of 0.02), while the mean for the unweighted parameter ranges from 0.18 to 0.36 (a difference of 0.18).&lt;br /&gt;&lt;br /&gt;This informal analysis demonstrates that the local linear regression weighting approach of &lt;span style="visibility: visible;" id="main"&gt;&lt;span style="visibility: visible;" id="search"&gt;&lt;a href="http://www.genetics.org/cgi/content/full/162/4/2025"&gt;Beaumont et al.&lt;/a&gt;  is remarkably effective in making the estimates insenstive to our choice of tolerances by correcting for the bias toward the prior introduced by large tolerances. This allows us to use larger tolerances&lt;/span&gt;&lt;/span&gt; in our analysis, which in turn allows us to use more summary statistics and yet collect samples from the posterior at a reasonable rate. Of course, the regression weighting only applies to estimates of continuous parameter values. When doing model selection (such as the numbers of divergence times), then this method is not suitable. In the case of numbers of divergence times, the best approach might be to use the regression to weight the variance and estimate of divergence times in your data (Var(t) and E(t)), and use the ratio of this (Omega = Var(t)/E(t)) to assess support for a single divergence time.&lt;br /&gt;&lt;br /&gt;The story does not quite end here, however. A further complication is in the way that msBayes carries out rejection sampling. In all the discussion so far, we have referred to the process of accepting samples from the prior that come within some specified distance (the tolerance) of the observed. One would imagine carrying out simulations from the prior, and, for each simulation, accepting the sample into the posterior only if the Euclidean distance between the summary statistics of the simulated data and the observed data fall within some specified tolerance distance. In this scheme, we would run the simulations from the prior continuously, building our samples from the posterior as we go along, until our posterior samples reach some reasonable size. However, in implementing the rejection sampling, msBayes does things a little differently. As has been covered in painful detail, we generate samples from the prior of a specified size before we even begin the rejection sampling. The samples are then ranked in terms of the Euclidean distance of their summary statistics to the summary statistics of the observed data, and pre-specified proportion of those samples closest in distance to the observed data will be accepted into the posterior. It is confusing that this proportion is also called a &lt;span style="font-style: italic;"&gt;tolerance&lt;/span&gt;, because it is not the same tolerance as discussed above. Rather, depending on the summary statistics values of the specific samples that happen to get accepted, an effective tolerance in the previous sense is implied or induced. For example, if we say that we will accept 0.02 of the 100000 prior samples into the posterior, then the 2000 samples out of the 100000 closest in the summary statistic distance will constitute our sample from the posterior, irrespective of the actual absolute distances. The tolerance in the sense discussed previously will then be given by the largest distance between the summary statistics of  the 2000 samples in the posterior and the summary statistics of the observed data.&lt;br /&gt;&lt;br /&gt;So where does this all leave us with regards to choice of tolerances? All I can offer here are some general habits that I practice, based on my experiences and understanding, which may or may not be applicable or appropriate in the context of your analyses (in other words, YMMV). I think that the most important guideline would be to have  a posterior sample size at least 1000 or larger. I personally would probably not be too comfortable with less than a million samples from the prior, and prefer to work with prior sizes of 10 million or greater. Thus, I generally tend to have my "tolerance" (&lt;span style="font-style: italic;"&gt;sensu&lt;/span&gt; msBayes) on the order of 1e-4 to 1e-5 or so, and ensure that my prior sample size is large enough so this choice of "tolerance" results in posteriors that are 1000 samples or more.&lt;br /&gt;&lt;br /&gt;In the next post, I will provide an R script based on one that &lt;a href="http://qcpages.qc.cuny.edu/Biology/fac_stf/hickerson.php"&gt;Mike Hickerson&lt;/a&gt; kindly sent to me to help carry out the local linear regression-based weighting of your posterior values, and discuss how to carry out this procedure.&lt;br /&gt;&lt;br /&gt;References:&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;&lt;br /&gt;@article{beaumont2002approximate,&lt;br /&gt;title={{Approximate Bayesian computation in population genetics}},&lt;br /&gt;author={Beaumont, M.A. and Zhang, W. and Balding, D.J.},&lt;br /&gt;journal={Genetics},&lt;br /&gt;volume={162},&lt;br /&gt;number={4},&lt;br /&gt;pages={2025--2035},&lt;br /&gt;year={2002},&lt;br /&gt;publisher={Genetics Soc America}&lt;br /&gt;url={http://www.genetics.org/cgi/content/full/162/4/2025}&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;@article{hickerson2006test,&lt;br /&gt;title={{Test for simultaneous divergence using approximate Bayesian computation}},&lt;br /&gt;author={Hickerson, M.J. and Stahl, E.A. and Lessios, HA and Crandall, K.},&lt;br /&gt;journal={Evolution},&lt;br /&gt;volume={60},&lt;br /&gt;number={12},&lt;br /&gt;pages={2435--2453},&lt;br /&gt;year={2006},&lt;br /&gt;publisher={BioOne}&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;@article{hickerson2007msbayes,&lt;br /&gt;title={{msBayes: Pipeline for testing comparative phylogeographic histories using hierarchical approximate Bayesian computation}},&lt;br /&gt;author={Hickerson, M.J. and Stahl, E. and Takebayashi, N.},&lt;br /&gt;journal={BMC bioinformatics},&lt;br /&gt;volume={8},&lt;br /&gt;number={1},&lt;br /&gt;pages={268},&lt;br /&gt;year={2007},&lt;br /&gt;publisher={BioMed Central Ltd}&lt;br /&gt;}&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-1758044034938036073?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/1758044034938036073/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2009/10/msbayes-basics-part-vi-tolerances-and.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/1758044034938036073'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/1758044034938036073'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2009/10/msbayes-basics-part-vi-tolerances-and.html' title='msBayes Basics Part VI: Tolerances and Local-Linear Regression Weighting of the Posteriors'/><author><name>Jeet Sukumaran</name><uri>http://www.blogger.com/profile/04874973927331407332</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_bSNhphjTeao/S23tnWOSfNI/AAAAAAAAABI/FhFQDr3WBZI/s1600-R/f83bd77d036b0a5d1a595c697a26ef6c.png'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-6495174609015166165</id><published>2009-09-30T12:34:00.000-07:00</published><updated>2009-10-01T22:57:58.675-07:00</updated><title type='text'>msBayes Basics V: Rejection Sampling to Extract the Posteriors from Your Priors</title><content type='html'>This is the fifth of a &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-i-introduction.html"&gt;multi-part series&lt;/a&gt; on &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; basics. Here I will describe how to use programs distributed as part of the &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; package to carry out rejection sampling on your simulations from the prior, and supply &lt;a href="http://jeetworks.org/runreject"&gt;a Python script&lt;/a&gt; that makes some of the practical running of this a little bit less error-prone and simpler.&lt;br /&gt;&lt;br /&gt;At this stage of the proceedings, you will have a large collection of samples simulated from the prior, in the form of a file of &lt;a href="http://en.wikipedia.org/wiki/Delimiter-separated_values"&gt;tab-delimited values&lt;/a&gt;. The first row will consist of column or field names, and each subsequent row will consist of two sets of fields that describe each sample: a set of parameters drawn from their prior distribution, and a set of summary statistics calculated on data simulated using those parameters. You should also have a file of summary statistics calculated on your observed data. Using the &lt;a href="http://jeetworks.org/programs/regularizecols"&gt;Python script to regularize columns&lt;/a&gt; between files, as discussed in a &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-iv-preparing-your-files.html"&gt;previous post&lt;/a&gt;, the columns in both the simulated and observed summary statistic files should now be compatible.&lt;br /&gt;&lt;br /&gt;The next step is now to run "msReject" on your samples from the prior. The call to this program requires that you specify, in order:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;the path to your summary statistics file&lt;/li&gt;&lt;li&gt;the path to your file of samples from the prior&lt;/li&gt;&lt;li&gt;the sampling &lt;span style="font-style: italic;"&gt;tolerance&lt;/span&gt;&lt;br /&gt;&lt;/li&gt;&lt;li&gt;a list of columns (1-based column index numbers) which identify the summary statistic columns&lt;/li&gt;&lt;/ul&gt;For example, the following:&lt;br /&gt;&lt;pre style="background: rgb(0, 0, 0) none repeat scroll 0% 0%; -moz-background-clip: border; -moz-background-origin: padding; -moz-background-inline-policy: continuous; color: rgb(0, 255, 0);"&gt;$ msReject obs_sum_stats.txt priors.txt 0.2 6 7 8 9 10 11 12&lt;br /&gt;&lt;/pre&gt;asks "msReject" to carrying out rejection sampling on the data in "priors.txt", using the summary statistics given in "obs_sum_stats.txt" as the observed summary statistics, and a tolerance of "0.2", with columns 6 through 12 taken to be summary statistic columns (and the remaining assumed to be prior/parameter columns).&lt;br /&gt;&lt;br /&gt;It can be a little bit tedious, not to mention error-prone, needing to specify the summary statistic columns when invoking "msReject". To this end, I've written &lt;a href="http://jeetworks.org/runreject"&gt;a Python script&lt;/a&gt; that automatically identifies the summary statistic columns in the file and composes the proper call to "msReject". By default it prints the command to the standard output, to show you what it is doing, and then invokes the command itself. Alternatively, it can be run in an "no execute" mode, where it just prints the correct command and exits, so that you can copy the command and use it in your own scripts.&lt;br /&gt;&lt;br /&gt;The following example shows it in action, running "msReject" with a tolerance of 0.01 and directing the results to the file "posterior_accepted_samples.txt"&lt;br /&gt;&lt;br /&gt;&lt;pre style="background: rgb(0, 0, 0) none repeat scroll 0% 0%; -moz-background-clip: border; -moz-background-origin: padding; -moz-background-inline-policy: continuous; color: rgb(0, 255, 0);"&gt;$run-reject.py obs_sum_stats priors1.txt 0.01 &gt; posterior_accepted_samples.txt&lt;br /&gt;Following columns identified as PARAMETERS:&lt;br /&gt;[1]:"PRI.numTauClass"&lt;br /&gt;[2]:"PRI.Psi"&lt;br /&gt;[3]:"PRI.var.t"&lt;br /&gt;[4]:"PRI.E.t"&lt;br /&gt;[5]:"PRI.omega"&lt;br /&gt;&lt;br /&gt;Following columns identified as SUMMARY STATISTICS:&lt;br /&gt;[6]:"pi.b.1"&lt;br /&gt;[7]:"pi.b.2"&lt;br /&gt;.&lt;br /&gt;.&lt;br /&gt;(etc.)&lt;br /&gt;.&lt;br /&gt;.&lt;br /&gt;[106]:"ShannonsIndex.Pop2.5"&lt;br /&gt;[107]:"ShannonsIndex.Pop2.6"&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Command to be executed:&lt;br /&gt; msReject obs_sum_stats priors1.txt 0.02 6 7 8 9 10&lt;br /&gt;11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28&lt;br /&gt;29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46&lt;br /&gt;47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64&lt;br /&gt;65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82&lt;br /&gt;83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100&lt;br /&gt;101 102 103 104 105 106 107&lt;br /&gt;&lt;br /&gt;Running with PID: 10927.&lt;br /&gt;Completed with exit status: 0.&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Once this process is done, you will have effectively sampled the posterior probability distribution of your parameters. The posterior probability of your model is the approximately the proportion of times it has been sampled in the posterior distribution. So, for example, the proportion of samples with 1 divergence time gives the approximate posterior probability of a single distinct divergence time between every population pair in your dataset. I have a small Python script that I use to calculate the proportions of the various parameters, but you can also use R or Perl or what-have-you.&lt;br /&gt;&lt;br /&gt;One issue that I have not discussed in detail is the choice of tolerance and how that affects the analysis. As it turns out, this your analysis can be quite sensitive to the choice of tolerance, and there are steps that you can take to correct this. I will discuss the issue of tolerances in future posts, as well as provide an R script that weights the posterior probability of your parameters to reduce their sensitivity to your choice of tolerance.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-6495174609015166165?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/6495174609015166165/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2009/09/msbayes-basics-v-rejection-sampling-to.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/6495174609015166165'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/6495174609015166165'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2009/09/msbayes-basics-v-rejection-sampling-to.html' title='msBayes Basics V: Rejection Sampling to Extract the Posteriors from Your Priors'/><author><name>Jeet Sukumaran</name><uri>http://www.blogger.com/profile/04874973927331407332</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_bSNhphjTeao/S23tnWOSfNI/AAAAAAAAABI/FhFQDr3WBZI/s1600-R/f83bd77d036b0a5d1a595c697a26ef6c.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-4049736920355559236</id><published>2009-09-20T19:16:00.000-07:00</published><updated>2009-09-30T17:25:36.877-07:00</updated><title type='text'>msBayes Basics IV: Preparing Your Files for Sampling from the Posterior</title><content type='html'>This is the fourth of a &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-i-introduction.html"&gt;multi-part series&lt;/a&gt; on &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; basics. Here I will discuss some glitches/incompatibilties in the msBayes pipeline when going into the rejection sampling stage that you have to watch out for. I will also present a Python script that fixes the incompatibilities in the pipeline and make the process easier.&lt;br /&gt;&lt;br /&gt;The rejection sampling step requires two files: samples from the prior and summary statistics calculated on the observed data.&lt;br /&gt;&lt;br /&gt;Our samples from the prior as generated by the script "msBayes.pl" are given as a file of &lt;a href="http://en.wikipedia.org/wiki/Delimiter-separated_values"&gt;tab-delimited values&lt;/a&gt;, with the first row of the file giving the field names. The first several fields consist of the parameter values drawn from the prior, while the remaining fields are the summary statistics calculated on data simulated with those parameter values. Parameter fields can be identified on the leading "PRI." associated with their names (e.g., "PRI.numTauClass", "PRI.Psi", "PRI.var.t"). The remaining fields are the summary statistics.&lt;br /&gt;&lt;br /&gt;The summary statistics on the observed data are generated using the program "obsSumStats.pl". This is invoked with a single argument, the path to the master configuration file (discussed &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-ii-data-assembly.html"&gt;here&lt;/a&gt;), and by default writes the results to the standard output, but we will redirect this to a file "obs_sum_stats":&lt;br /&gt;&lt;pre style="background: rgb(0, 0, 0) none repeat scroll 0% 0%; -moz-background-clip: border; -moz-background-origin: padding; -moz-background-inline-policy: continuous; color: rgb(0, 255, 0);"&gt;$ obsSumStats.pl msbayes_master &gt; obs_sum_stats&lt;/pre&gt;The file "obs_sum_stats" is also a &lt;a href="http://en.wikipedia.org/wiki/Delimiter-separated_values"&gt;tab-delimited value file&lt;/a&gt;, and consists of two rows, a header row and a data row, with the latter comprising the summary statistics calculated on the observed data.&lt;br /&gt;&lt;br /&gt;Now here is the problem with the msBayes pipeline: the observed summary statistics files generated by "obsSumStats.pl" and the file of simulations from the prior generated by "msBayes.pl" are incompatible!&lt;br /&gt;&lt;br /&gt;In particular, the simulations from the prior file has a number of columns that are not present in the observed summary statistics. The file of observed summary statistics has 106 columns, with the first 4 columns being parameters and the remaining 102 columns being the summary statistics of interest. However, the file of simulations from the prior has a varying number of parameter columns, contingent upon your hyper-prior model settings, before being followed by the 102 columns of summary statistics. For example, if you do not constrain the number of divergence times ("numTauClasses = 0"), the simulations from the prior will have 107 columns, with the first 5 columns being parameters and the remaining 102 columns being summary statistics. On the other hand, if you constrain the run to have only 1 divergence time ("numTauClasses = 1"), then the simulations from the prior will have 109 columns, with the first 7 columns being parameters and the remaining 102 columns being summary statistics.&lt;br /&gt;&lt;br /&gt;When dealing with just one file of observed summary statistics, and files of simulated samples from the prior generated under the exact same model, then fixing this incompatibility is trivial (&lt;span style="font-weight: bold; font-style: italic;"&gt;IF&lt;/span&gt; you are aware of it; I was not the first few times that I ran msBayes, and because my results were not scarily unreasonable, it was only when I started to look at the files in detail that I realized what was happening). However, the moment that you begin doing anything more complex (such as runs under different hyper-priors for numbers of distinct divergence times), then it becomes really quite complicated and tedious.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;UPDATE Sept 22 2009&lt;/span&gt;: &lt;span style="font-style: italic;"&gt;It appears that "acceptRej.pl" not only correctly handles, but actually requires, that the file of summary statistics produced by "obsSumStats.pl" has one column short in relation to the file of simulations from the prior, i.e. 106 columns (of which the first 4 are parameters) instead of 107 (of which the first 5 are parameters). It is not clear to me at this point how it handles the situation when the file of simulations from the prior has &gt; 107 columns, e.g., if the number of divergence times are constrained. One way or another though, to use "msReject", you will still want the column patterns consistent across all your files.&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To this end, I've written a &lt;a href="http://jeetworks.org/programs/regularizecols"&gt;Python script&lt;/a&gt; that greatly simplifies the task of regularizing columns across multiple files. It depends crucially on a header row of column names as the first line of each file. You will pass it the path to a reference or "model" file that has a column pattern that you want to duplicate across all the source files. For each of the remaining source files you specify, it will examine the header row to identify the positions and names of dummy columns that it will need to insert to make the column pattern identical to the reference or model file. It then prints out the columns of data from the source file, inserting columns of dummy values as needed. If the column pattern of the source file matches, then it simply outputs the data rows as they are given. If you pass it multiple files as input, the data rows (i.e., all lines from the second line onwards) from all these files will be written out seamlessly so that the final output will consist of a single header row of column names followed by the data rows of all the source files, with dummy fields inserted into each row as neccessary so as to have a consistent column pattern.&lt;br /&gt;&lt;br /&gt;For example, if you want your file of summary statistics calculated on your observed data (using "obsSumStats.pl") to match the column pattern of your file of simulations from the prior (generated using "msBayes.pl"), you would:&lt;br /&gt;&lt;pre style="background: rgb(0, 0, 0) none repeat scroll 0% 0%; -moz-background-clip: border; -moz-background-origin: padding; -moz-background-inline-policy: continuous; color: rgb(0, 255, 0);"&gt;$ regularize-cols.py -m prior_sims obs_sum_stats &gt; obs_sum_stats_regularized&lt;br /&gt;(found 107 fields in model file)&lt;br /&gt;-- Processing 1 of 1: obs_sum_stats&lt;br /&gt;* Source file has 106 columns instead of 107.&lt;br /&gt;* Columns to be inserted: [1]:"PRI.numTauClass"&lt;br /&gt;&lt;/pre&gt;As mentioned, the files of simulations from the prior generated by "msBayes.pl" have different numbers of columns, depending on the simulation model hyper-parameter settings. Since "regularize-cols.py" concatenates the data rows of all files passed to it as arguments, you can use it to merge simulations from the priors from different runs of "msBayes.pl" into a single large file for rejection sampling, and it will inspect each file individually to ensure that column patterns are identical to the model file. For example, the following takes 4 files containing data generated under two different models and concatenates the data into a single file with dummy fields inserted as neccessary to ensure uniformity and consistency in columns:&lt;br /&gt;&lt;pre style="background: rgb(0, 0, 0) none repeat scroll 0% 0%; -moz-background-clip: border; -moz-background-origin: padding; -moz-background-inline-policy: continuous; color: rgb(0, 255, 0);"&gt;$ regularize-cols.py -m priors2_1 priors1_1 prior1_2 priors2_1 priors2_2 &gt; priors_all&lt;br /&gt;(found 111 fields in model file)&lt;br /&gt;-- Processing 1 of 4: priors1_1&lt;br /&gt;* Source file has 109 columns instead of 111.&lt;br /&gt;* Columns to be inserted: [3]:"PRI.Tau.2" [5]:"PRI.Psi.2"&lt;br /&gt;-- Processing 2 of 4: priors1_2&lt;br /&gt;* Source file has 109 columns instead of 111.&lt;br /&gt;* Columns to be inserted: [3]:"PRI.Tau.2" [5]:"PRI.Psi.2"&lt;br /&gt;-- Processing 3 of 4: priors2_1&lt;br /&gt;* Source file has same number of columns as model file.&lt;br /&gt;* No columns to be inserted.&lt;br /&gt;-- Processing 4 of 4: priors2_2&lt;br /&gt;* Source file has same number of columns as model file.&lt;br /&gt;* No columns to be inserted.&lt;br /&gt;&lt;/pre&gt;Things to note regarding the operation of the script:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Let me repeat this with emphasis: &lt;span style="font-weight: bold;"&gt;the script depends crucially on a header row of column names as the first line of each file&lt;/span&gt;!&lt;/li&gt;&lt;li&gt;It only inserts columns into rows of the source files so that their column pattern matches the reference or model file. It does not delete any columns. Thus, the model file should have a column pattern that is a &lt;a href="http://en.wikipedia.org/wiki/Subset"&gt;superset&lt;/a&gt; of all the column patterns across the source files. You can use the UNIX command "&lt;code&gt;head &amp;lt;FILENAME&gt; | awk '{print NF}'&lt;/code&gt;" to quickly count columns in files.&lt;/li&gt;&lt;li&gt;It not only does not re-order columns in the source files, its logic depends on the column order of the source files matching the order of columns in the model file (sans missing columns, of course).&lt;/li&gt;&lt;li&gt;All the above conditions are met perfectly by the output files produced by &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt;, so as long as you do not modify these files directly, or are careful to maintain the above conditions if you do modify the files, there should be no problem in the execution of the script.&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;So, to summarize, as things stand when I write this, the msBayes pipeline generates files of varying column patterns. You will need to ensure that your column patterns are identical across and within files before actually carrying out the rejection sampling procedure. The script given &lt;a href="http://jeetworks.org/programs/regularizecols"&gt;here&lt;/a&gt; will automate this task for you, removing the tedium and reducing the risk of errors.&lt;br /&gt;&lt;br /&gt;In my next post, I will discuss the actual invocation and running of "msReject", and present a Python script that makes the call to "msReject" simpler by identifying the summary statistic columns for you.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-4049736920355559236?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/4049736920355559236/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2009/09/msbayes-basics-iv-preparing-your-files.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/4049736920355559236'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/4049736920355559236'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2009/09/msbayes-basics-iv-preparing-your-files.html' title='msBayes Basics IV: Preparing Your Files for Sampling from the Posterior'/><author><name>Jeet Sukumaran</name><uri>http://www.blogger.com/profile/04874973927331407332</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_bSNhphjTeao/S23tnWOSfNI/AAAAAAAAABI/FhFQDr3WBZI/s1600-R/f83bd77d036b0a5d1a595c697a26ef6c.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-4014758563510147972</id><published>2009-09-20T10:45:00.000-07:00</published><updated>2009-09-21T12:48:46.851-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ABC'/><category scheme='http://www.blogger.com/atom/ns#' term='msbayes'/><category scheme='http://www.blogger.com/atom/ns#' term='approximate Bayesian Computation'/><title type='text'>msBayes Basics III: Approaches to Sampling from the Posterior</title><content type='html'>This is the third of a &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-i-introduction.html"&gt;multi-part series&lt;/a&gt; on &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; basics. Here I will provide a brief overview of the ways you can carry out rejection sampling on your &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-ii-data-assembly.html"&gt;simulations from the prior&lt;/a&gt; using the programs distributed with &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt;. Later posts will describe the procedure in detail.&lt;br /&gt;&lt;br /&gt;In the previous step, you have accumulated a large number parameter values sampled from the prior. Each sample of parameter values is associated with a set of summary statistics calculated on data simulated using those parameter values. In the current step, a specified proportion of these samples with summary statistics that are closest to the summary statistics calculated on the observed data will be retained, while the other samples will be rejected. Once you have completed this, you will have transformed your samples from the prior to samples from the posterior. It should be noted that this approach of generating posterior samples from the prior is only one variant of many, and is the approach adopted by the programs supplied with &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt;. An excellent summary of some alternative approaches, along with references, can be seen &lt;a href="http://www.biomedcentral.com/1471-2156/10/35"&gt;here&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I realize that up to this point I have not discussed the actual simulation model of &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; at all, nor the summary statistics. I definitely think it would be interesting and useful to delve into the former at some point, but for now I am going to simply refer you to the papers that describe the models in detail, &lt;a href="http://www.bioone.org/doi/abs/10.1554/05-578.1"&gt;here&lt;/a&gt; and &lt;a href="http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1949838"&gt;here&lt;/a&gt;, as well as just include a figure illustrating the model (as given in the &lt;a href="http://www.bioone.org/doi/abs/10.1554/05-578.1"&gt;original paper&lt;/a&gt;):&lt;br /&gt;&lt;br /&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_bSNhphjTeao/SrZ9TvENNFI/AAAAAAAAAAM/fhCDIH5wFpM/s1600-h/i0014-3820-60-12-2435-f01.jpeg.jpg"&gt;&lt;img style="cursor: pointer; width: 320px; height: 226px;" src="http://4.bp.blogspot.com/_bSNhphjTeao/SrZ9TvENNFI/AAAAAAAAAAM/fhCDIH5wFpM/s320/i0014-3820-60-12-2435-f01.jpeg.jpg" alt="" id="BLOGGER_PHOTO_ID_5383628182573102162" border="0" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;The summary statistics, and, just as importantly, the metric used to calculate distances between different sets of summary statistics are also definitely worthy of discussion. In fact, improvement of existing and development of new generalized summary statistics and corresponding distance metrics are where I think (and hope!) we will see the most exciting progress in &lt;a href="http://en.wikipedia.org/wiki/Approximate_Bayesian_computation"&gt;Approximate Bayesian Computation&lt;/a&gt; in phylogeographical analysis in the near future. Right now, &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; employs a suite of over a 100 traditional population genetic statistics (such as pi, Watterson's theta, Tajima's D, Shannon's Index, etc.) that have been &lt;a href="http://www3.interscience.wiley.com/journal/118573798/abstract?CRETRY=1&amp;amp;SRETRY=0"&gt;tested in simulation studies&lt;/a&gt; as its core summary statistics. For its distance metric, &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; uses the simple multivariate Euclidean distance between the summary statistics as a first step. For continuous parameter values, the posterior samples are then regressed onto the distance between the simulated data and the observed data summary statistics.&lt;br /&gt;&lt;br /&gt;So, returning to our msBayes pipeline "walk-through", how do we go from our samples from the prior to samples to the posterior? We have a number of different options. &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; itself provides two approaches, one using a Perl script, "acceptReject.pl", which drives an &lt;a href="http://www.r-project.org/"&gt;R&lt;/a&gt; script that carries out the business, and the other a &lt;a href="http://en.wikipedia.org/wiki/C_%28programming_language%29"&gt;C&lt;/a&gt; program, "msReject".&lt;br /&gt;&lt;br /&gt;The Perl+R approach, using "acceptReject.pl", automatically does the local linear-regression weighting for you, and produces nice and fancy PDF's of the posterior probability distribution of the number of distinct divergence times for you. I do not recommend it for two reasons. The first reason is that the R analysis is extremely slow and is a phenomenal memory hog. I had to switch to using 16G machines to processes some files, and even then I was pushing the envelope. The second reason is that while the PDF plots of the posterior probability distribution of the number of divergence times are very nice and informative, I would far rather have the actual samples from the posterior. With this, not only will it be possible to generate any of a wide variety of plots that illustrate the posterior probability distribution of any recorded parameter value, I can also summarize the values in a number of other ways, or use the samples in other analyses.&lt;br /&gt;&lt;br /&gt;The "msReject" approach carries out rejection sampling on the simulations from the prior using Euclidean distances. The program is extremely fast and has a relatively miniscule memory footprint. It routinely handles the processing of 50 million sample files without any trouble. However, it does not carry out the locally-weighted linear regression of the posterior values, and you will have to do this yourself separately. Nonetheless, I highly recommend using this program over the Perl script.&lt;br /&gt;&lt;br /&gt;In the next post, I will go over in detail how you generate summary statistics on your observed data, provide a Python script that regularizes the file so that it is compatible with the samples simulated from the prior, and provide another Python script that identifies the summary statistic columns in the file and calls on msReject with the appropriate options to correctly generate a sample from the posterior.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-4014758563510147972?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/4014758563510147972/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2009/09/msbayes-basics-iii-approaches-to.html#comment-form' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/4014758563510147972'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/4014758563510147972'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2009/09/msbayes-basics-iii-approaches-to.html' title='msBayes Basics III: Approaches to Sampling from the Posterior'/><author><name>Jeet Sukumaran</name><uri>http://www.blogger.com/profile/04874973927331407332</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_bSNhphjTeao/S23tnWOSfNI/AAAAAAAAABI/FhFQDr3WBZI/s1600-R/f83bd77d036b0a5d1a595c697a26ef6c.png'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_bSNhphjTeao/SrZ9TvENNFI/AAAAAAAAAAM/fhCDIH5wFpM/s72-c/i0014-3820-60-12-2435-f01.jpeg.jpg' height='72' width='72'/><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-5419075818751559720</id><published>2009-09-19T23:04:00.000-07:00</published><updated>2009-09-20T10:21:17.441-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ABC'/><category scheme='http://www.blogger.com/atom/ns#' term='msbayes'/><category scheme='http://www.blogger.com/atom/ns#' term='approximate Bayesian Computation'/><title type='text'>msBayes Basics II: Data Assembly, Configuration File Preparation, and Generating Samples from the Prior</title><content type='html'>This is the second of a &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-i-introduction.html"&gt;multi-part series&lt;/a&gt; on &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; basics, and here I will discuss the preparation of the input files to the pipeline, the creation of the configuration file to control the simulations from the prior, as well as the running of the simulator itself.&lt;br /&gt;&lt;br /&gt;Data assembly is probably one of the more tedious parts of the whole &lt;a href="http://en.wikipedia.org/wiki/Approximate_Bayesian_computation"&gt;ABC&lt;/a&gt; &lt;a href="http://en.wikipedia.org/wiki/Pipeline_%28computing%29"&gt;pipeline&lt;/a&gt;, and I only provide a brief overview. Much more detailed and very helpful information can be found in the &lt;a href="http://msbayes.sourceforge.net/msbayes/README"&gt;msBayes documention&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;The basic sequence data files themselves are pretty straightfoward: for each species or taxon distributed across the barrier, you need to provide a single &lt;a href="http://en.wikipedia.org/wiki/FASTA_format"&gt;FASTA&lt;/a&gt; file with all the sequences listed in order (i.e., with sequences from population A listed first, followed by sequences from population B). That's the easy part. The part that requires a bit of work is collecting the information required to flesh out the configuration file. This includes a distinct sequence evolution model for each population pair. The msBayes simulator needs to generate realistic genetic sequence differences between individuals from each population under a standard molecular evolution model (&lt;a href="http://www.molecularevolution.org/mbl/resources/models/dnamodels.php"&gt;HKY&lt;/a&gt;), and thus requires that you provide estimates of the transition-transversion rate ratio and nucleotide base frequencies for each population pair.  So, for example, you might want to use &lt;a href="https://www.nescent.org/wg_garli/Main_Page"&gt;GARLI&lt;/a&gt; to infer a &lt;a href="http://bioinf.ncl.ac.uk/molsys/data/like.pdf"&gt;maximum likelihood&lt;/a&gt; &lt;a href="http://en.wikipedia.org/wiki/Phylogenetic_tree"&gt;tree&lt;/a&gt; under an &lt;a href="http://www.molecularevolution.org/mbl/resources/models/dnamodels.php"&gt;HKY model&lt;/a&gt; for the sequences of each population pair, or alternatively, bring your favorite tree into &lt;a href="http://paup.csit.fsu.edu/"&gt;PAUP*&lt;/a&gt; along with the sequences and ask &lt;a href="http://paup.csit.fsu.edu/"&gt;PAUP*&lt;/a&gt; to provide ML estimates of the &lt;a href="http://www.molecularevolution.org/mbl/resources/models/dnamodels.php"&gt;HKY&lt;/a&gt; parameters.&lt;br /&gt;&lt;br /&gt;For the actual format of the master configuration file, please see the &lt;a href="http://msbayes.sourceforge.net/msbayes/README"&gt;msBayes documention&lt;/a&gt;, but basically for each population pair, you need to provide the total number of sequences, the number of sequences from population A, the number of sequences from population B,  the &lt;a href="http://www.molecularevolution.org/mbl/resources/models/dnamodels.php"&gt;HKY&lt;/a&gt; parameters, the length of the sequence, and  information the path to the &lt;a href="http://en.wikipedia.org/wiki/FASTA_format"&gt;FASTA&lt;/a&gt; file.&lt;br /&gt;&lt;br /&gt;I would highly recommend that you also go the extra step and embed the msBayes run parameters in the configuration file as well, rather than rely on the interactive queries. This is neccessary if you want to run the simulations remotely on a cluster through a scheduler, but far more importantly, it provides a permanent record of the simulation settings so as to allow you to track down mistakes or reproduce the same conditions.&lt;br /&gt;&lt;br /&gt;Here is an example of a complete simulation configuration file, where the run settings, hyper-priors, priors, and observed sequence data are all specified:&lt;br /&gt;&lt;br /&gt;&lt;pre style="border: 1px solid ; margin: 0pt; padding: 0em; background: rgb(255, 255, 255) none repeat scroll 0% 0%; -moz-background-clip: border; -moz-background-origin: padding; -moz-background-inline-policy: continuous; color: rgb(0, 0, 0);"&gt;&lt;br /&gt;numLoci = 1&lt;br /&gt;lowerTheta = 0.01&lt;br /&gt;upperTheta = 75.0&lt;br /&gt;upperTau = 10.0&lt;br /&gt;numTauClasses = 0&lt;br /&gt;upperMig = 0.0&lt;br /&gt;upperRec = 0.0&lt;br /&gt;upperAncPopSize = 0.5&lt;br /&gt;constrain = 0&lt;br /&gt;subParamConstrain = 000000000&lt;br /&gt;reps = 5000000&lt;br /&gt;&lt;br /&gt;BEGIN SAMPLE_TBL&lt;br /&gt;100 7   93  3.71    2356    0.2 0.3 0.1 a1-a2.fas&lt;br /&gt;16  4   12  4.08    2559    0.3 0.2 0.2 b1-b2.fas&lt;br /&gt;10  5   5   4.08    1356    0.3 0.2 0.2 c1-c2.fas&lt;br /&gt;29  8   21  4.09    2367    0.2 0.2 0.1 d1-d2.fas&lt;br /&gt;11  2   9   4.13    1372    0.3 0.2 0.1 e1-d2.fas&lt;br /&gt;69  38  31  4.13    1372    0.2 0.2 0.1 f1-f2.fas&lt;br /&gt;END SAMPLE_TBL&lt;br /&gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Once you have assembled your data files and created the simulation configuration file, the next step is to actually run the msBayes simulator to generate samples from the prior. This can be as simple as:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre style="margin: 0pt; background: rgb(0, 0, 0) none repeat scroll 0% 0%; padding-bottom: 0em; -moz-background-clip: border; -moz-background-origin: padding; -moz-background-inline-policy: continuous; color: rgb(0, 255, 0);"&gt;&lt;br /&gt;$ msbayes.pl -c msbayes_master -o prior_samples&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;This calls on "msbayes.pl", a Perl script that is part of the msBayes distribution, passing it our master configuration file "msbayes_master" as an argument and asking it to direct its results to a file called "prior_samples".&lt;br /&gt;&lt;br /&gt;"msbayes.pl" is the primary driver for the simulations from the prior. It wraps up a whole bunch of complexity:  generation of hyper-parameters from their prior distribution, the generation of the actual model parameters to be passed to the underlying coalescent-based sequence simulator (using "msprior"), calling the coalescent-based sequence simulator ("msDQH") to simulate a sample of summary statistics based on the generated model parameters, assembly of a row of data that couples together the set of generated parameter values and the corresponding summary statistics simulated based on this values, and the output of this row as a tab-delimited line of data. This gets repeated again and again, for the number of replicates specified in the master configuration file ("reps = 5000000" in the example file above), to build up a large sample drawn from the parameter space according to the given prior distributions.&lt;br /&gt;&lt;br /&gt;All this is a lot of work, but is extremely computationally efficient: the CPU and memory footprint of a single process running is almost negligible, and you can run multiple simultaneous simulations without much overhead even on a single middle-of-the-road machine. Which leads to my recommendation here: run multiple processes to generate multiple files with samples from the prior simultaneously, concatenating these files before going to the next step in the &lt;a href="http://en.wikipedia.org/wiki/Approximate_Bayesian_computation"&gt;ABC&lt;/a&gt; &lt;a href="http://en.wikipedia.org/wiki/Pipeline_%28computing%29"&gt;pipeline&lt;/a&gt;. So, for example, if you want 10 million samples from the prior, you can run 10 simultaneous processes, each drawing 1 million samples from the prior; concatenating the files then gives you 10 million samples in 1/10th the time.&lt;br /&gt;&lt;br /&gt;Once you have you samples from the prior, the next step is to carry out rejection sampling on this data to provide samples from the posterior. And this is where the going gets pretty interesting ...&lt;br /&gt;&lt;br /&gt;You see, to be honest, I do not think my posts describing the procedures up to this stage have really provided all that much more new or useful information than already existed in the documentation. I think that the operation of the msBayes ABC pipeline up to this point has been smooth and relatively trouble-free, and has proceeded as expected based on the documentation and help files. However, going beyond the generation of samples from the prior is where I ran into a number of difficulties, gotchas, glitches, and trip-me-up's. My next post will discuss these issues, share my experiences, lessons learned and solutions, as well as offer some scripts that I wrote to patch up the "leaks" and unexpected incompatibilities between different parts of the pipeline.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-5419075818751559720?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/5419075818751559720/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2009/09/msbayes-basics-ii-data-assembly.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/5419075818751559720'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/5419075818751559720'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2009/09/msbayes-basics-ii-data-assembly.html' title='msBayes Basics II: Data Assembly, Configuration File Preparation, and Generating Samples from the Prior'/><author><name>Jeet Sukumaran</name><uri>http://www.blogger.com/profile/04874973927331407332</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_bSNhphjTeao/S23tnWOSfNI/AAAAAAAAABI/FhFQDr3WBZI/s1600-R/f83bd77d036b0a5d1a595c697a26ef6c.png'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-1136822106542715699</id><published>2009-09-19T21:57:00.000-07:00</published><updated>2009-10-01T23:14:46.090-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='ABC'/><category scheme='http://www.blogger.com/atom/ns#' term='msbayes'/><category scheme='http://www.blogger.com/atom/ns#' term='approximate Bayesian Computation'/><title type='text'>msBayes Basics I: Introduction</title><content type='html'>I've been working quite a bit with &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; recently, and thought I would share some of the stuff that I've learnt, with an emphasis on the practical side of things. Over a series of posts, I am going to summarize the steps required to carry out a basic &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; analysis, provide some "lessons learned" and tips for each of these steps, as well as provide some supplemental &lt;a href="http://www.python.org/"&gt;Python &lt;/a&gt;scripts that I find greatly facilitate the flow of data from one stage of the pipeline to another. This first post serves as a general introduction to the approach and the program, and a general overview of the procedures involved in an &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; analysis.&lt;br /&gt;&lt;br /&gt;Most of you are probably already quite familiar with this method of analysis, but for those of you that are not, perhaps a brief explanation may be in order.&lt;a href="http://msbayes.sourceforge.net/"&gt; msBayes&lt;/a&gt; is a suite of tools that plug into a &lt;a href="http://en.wikipedia.org/wiki/Pipeline_%28computing%29"&gt;pipeline&lt;/a&gt; for &lt;a href="http://en.wikipedia.org/wiki/Approximate_Bayesian_computation"&gt;Approximate Bayesian Computation&lt;/a&gt; (ABC) phylogeographic analysis, estimation and hypothesis testing. The ABC approach is a computationally efficient approach to complex estimation and inference problems. At its &lt;span style="font-style: italic;"&gt;most&lt;/span&gt; basic, it works as follows:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;A set of summary statistics is calculated on an observed data set.&lt;/li&gt;&lt;li&gt;Multiple sets of data are simulated using parameters drawn from a prior distribution.&lt;br /&gt;&lt;/li&gt;&lt;li&gt;The same same set of summary statistics is calculated for each of these simulated data sets.&lt;/li&gt;&lt;li&gt;The parameters that result in simulated data sets with summary statistics that are equal to or within some threshold distance of the summary statistics of the observed data set are recorded. &lt;/li&gt;&lt;li&gt;The posterior probability of the parameter values is given by the proportion of those values in the final set of accepted parameters.&lt;br /&gt;&lt;/li&gt;&lt;/ol&gt;Each of these steps involves multiple levels of complexities and and a smorgasbord of variant approaches, of course, from the full nuts-and-bolts of the simulation model to various flavours of importance/rejection sampling (step 4), but the core concept remains the same: simulate the data under a range of parameter values, and keep those parameter values that result in simulated data sufficiently close to the observed data. This approach is extremely attractive: for many complex problems and questions, data can be simulated relatively easily, whereas sometimes even formulating the corresponding likelihood statement can be pretty challenging. Even if the likelihood can be formulated, computing it tends to be extremely expensive in terms of CPU power and time as well as memory, while simulating data is relatively cheap as far as those resources go. However, be warned: the &lt;a href="http://en.wikipedia.org/wiki/Approximate_Bayesian_computation"&gt;ABC&lt;/a&gt; approach does exact a cost in storage space; all those samples from the prior have to be retained until they can be processed, and I have had analyses that walk into the realm of 50-60 GB.&lt;br /&gt;&lt;br /&gt;&lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; can be used to answer a broad variety of questions using &lt;a href="http://en.wikipedia.org/wiki/Approximate_Bayesian_computation"&gt;ABC&lt;/a&gt;, but a major one, and the one which I am currently using it for, is to test the hypothesis of &lt;a href="http://www.ncbi.nlm.nih.gov/pubmed/17263107"&gt;simultaneous divergence&lt;/a&gt; amongst a set of taxa co-distributed across a geographical barrier. Specifically, I am asking "given a bunch of sequences of pairs of populations of different species on either side of a barrier, what is the posterior probability a single distinct divergence time between each pair?"&lt;br /&gt;&lt;br /&gt;Using &lt;a href="http://msbayes.sourceforge.net/"&gt;msBayes&lt;/a&gt; to answer this question involves the following steps:&lt;br /&gt;&lt;ol&gt;&lt;li&gt;Assemble the observed data files.&lt;/li&gt;&lt;li&gt;Create the configuration file to drive the simulations from the prior.&lt;/li&gt;&lt;li&gt;Run the simulator to produce a large number of samples from the prior.&lt;/li&gt;&lt;li&gt;Calculate the summary statistics on the observed data.&lt;/li&gt;&lt;li&gt;Regularize the columns / fields in the summary statistics file and the simulated data.&lt;/li&gt;&lt;li&gt;Carry out the acceptance/rejection sampling on the prior samples to produce samples from the posterior.&lt;/li&gt;&lt;li&gt;Summarize the posterior probability distribution of discrete parameter values (such as number of divergence times) by assessing the proportion of those values in the posterior samples.&lt;/li&gt;&lt;li&gt;Weight posterior samples of continuous parameter using local linear regression before summarizing their distribution.&lt;/li&gt;&lt;/ol&gt;Each of these steps will be covered in detail in future posts, and I will also be providing some scripts that I found really useful in helping to carry out some of these steps:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Part II: &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-ii-data-assembly.html"&gt;Data Assembly, Configuration File Preparation, and Generating Samples from the Prior&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Part III: &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-iii-approaches-to.html"&gt;Approaches to Sampling from the Posterior&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Part IV: &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-iv-preparing-your-files.html"&gt;Preparing Your Files for Sampling from the Posterior&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Part V: &lt;a href="http://geodendron.blogspot.com/2009/09/msbayes-basics-v-rejection-sampling-to.html"&gt;Rejection Sampling to Extract the Posteriors from Your Priors&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Part VI: &lt;a href="http://geodendron.blogspot.com/2009/10/msbayes-basics-part-vi-tolerances-and.html"&gt;Tolerances and Local-Linear Regression Weighting of the Posteriors&lt;/a&gt;&lt;br /&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-1136822106542715699?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/1136822106542715699/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2009/09/msbayes-basics-i-introduction.html#comment-form' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/1136822106542715699'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/1136822106542715699'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2009/09/msbayes-basics-i-introduction.html' title='msBayes Basics I: Introduction'/><author><name>Jeet Sukumaran</name><uri>http://www.blogger.com/profile/04874973927331407332</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='32' src='http://3.bp.blogspot.com/_bSNhphjTeao/S23tnWOSfNI/AAAAAAAAABI/FhFQDr3WBZI/s1600-R/f83bd77d036b0a5d1a595c697a26ef6c.png'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-2531775326563628423.post-9211584805560375708</id><published>2009-07-09T10:08:00.000-07:00</published><updated>2009-07-09T10:35:37.147-07:00</updated><title type='text'>Welcome to geodendron!</title><content type='html'>Now that this blog has been up for several weeks, it's probably time to post something.  It's become increasingly apparent (at least in my opinion) that the time is ripe for some serious testing of phylo- and biogeographic methods.  Hopefully this blog will serve as a forum for thoughts and developments related to (i) approaches to simulating appropriate benchmark datasets, (ii) simulation studies to test existing methods, and (iii) the development of new analytical tools. &lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I envision a few posts each week on published papers or unpublished thoughts, each of which will spark comments and discourse on the post's topic.  Not that this should need to be pointed out, but just in case: &lt;b&gt;all discussions on this blog should remain civil and respectful in tone&lt;/b&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Jeet and I realized the potential usefulness of this blog when we discovered that we were duplicating each other's efforts in writing generalized simulation programs for phylogeographic data.  Roughly, both of our programs relax coalescent assumptions (among other things) and allow individuals to move on a continuous landscape.  My collaborators and I are in the process of writing up an application note about our program and are trying to figure out what simple simulations would be good for illustrating the utility of continuous-landscape simulations over existing coalescent-based simulations.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So, to kick things off, what are your thoughts on this?  What are some simple scenarios where relaxing the coalescent assumption might be particularly important?&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/2531775326563628423-9211584805560375708?l=geodendron.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://geodendron.blogspot.com/feeds/9211584805560375708/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://geodendron.blogspot.com/2009/07/welcome-to-geodendron.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/9211584805560375708'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/2531775326563628423/posts/default/9211584805560375708'/><link rel='alternate' type='text/html' href='http://geodendron.blogspot.com/2009/07/welcome-to-geodendron.html' title='Welcome to geodendron!'/><author><name>Jeremy Brown</name><uri>http://www.blogger.com/profile/03105958793710069493</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='25' height='32' src='http://1.bp.blogspot.com/_KkgJhpfFV3g/Sj2QLpX3rrI/AAAAAAAAAAM/C-cQ9k-WqEo/S220/n7938591_48119704_6896.jpg'/></author><thr:total>3</thr:total></entry></feed>
