Sunday, 17 April 2016

Phylogenetics software: My own journey

This is just a personal note to conclude the ruminations on the popularity and user friendliness of phylogenetics software. Thinking back to my time as a student, and looking at my own publications, what software did I actually use myself, and why?

As I mentioned in the popularity through the years post, my very first exposure to phylogenetics software was through an undergraduate course in Systematic Botany in my third year at Göttingen University, which should have been in 1998/99. We had parsimony analysis explained to us and were shown how to use Hennig86. I remember thinking that its terminal interface was rather clunky, and the commands hard to intuit. Later in the same semester I also had a course in Phycology, and it is possible that we were shown PAUP, but if so then not in any depth.

The first time I had to do a professional phylogenetic analysis was for my Diplomarbeit, which would here be called an honours project, and one of the two resulting papers, which only came out in early 2005. According to the PDF, I used PAUP (already v4) to do parsimony analysis of the morphological data and distance analysis of the molecular (AFLP) data. Why? Well, there was no informed decision making going on, no careful comparison of the merits of the different programs available. Instead, this is simply what others were using and what the institute had licenses for. So you learned how to write a data file from an older student and ran the analysis on an old Mac, because that is how it was done.

Accordingly I also did not, at this stage at least, question the choice of methods. You analysed morphological data with parsimony, sequence data with likelihood, and restriction site type binary data with distance, because that is what everybody did and what the peer reviewers expected. I was rather proud of discovering Farris' successive reweighting approach for myself though, so it is not as if I didn't do my own method-finding.

The next few years I published the various papers resulting from my doctorate project and continued to use PAUP for parsimony and distance analysis, but would usually get Treecon to compute the distance matrix because PAUP didn't do the desired distance metric for AFLPs.

With a paper that came out in 2008 I added Bayesian phylogenetics to my repertoire, for the first time using MrBayes. The approach I took will be extremely familiar to many colleagues, and it is something that I still see today in many papers I peer-review: Bayesian analysis in MrBayes, parsimony analysis in PAUP, and then show the Bayesian summary tree with support values from both analyses on the branches.

Again I did not do that because I was suddenly convinced of the merits of Bayesian phylogenetics but merely because more experienced people recommended I should use it. My understanding or thinking at that stage could comfortably be summarised as follows: Bayesian phylogenetics is the flashy new thing that reviewers like, it gives better support values in cases where there are few decisive characters (short branches), but it is slow and arcane; parsimony can be misled under certain circumstances but it is fast and easy to understand; so do both, and if they agree - which they usually do - you are happy.

During my second postdoc 2009-2010 I met and co-authored with a very dedicated Bayesian, and this is probably what finally motivated me to pay a bit more attention to the philosophical / methodological quarrels between the adherents of the various approaches. I learned a lot about the logic behind Bayesian phylogenetics and started to appreciate its advantages, although it has to be said that he failed to convince me entirely.

In one memorable conversation I played the familiar "but how do we know all these priors are realistic" card. His answer was, interestingly, not the usual Bayesian claim that every method has its prior assumptions, which non-Bayesian methods hide away while Bayesianism has the advantage of making them explicit. Instead, his answer was the equally popular "if the data are strong a wrong prior won't matter" card. It did not then occur to me to reply that this would, if true, defeat the entire purpose of Bayesian analysis.
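For readers who have not met the "strong data swamp the prior" argument before, here is a minimal sketch of what it means. It is a toy Beta-binomial example and has nothing to do with phylogenetic models as such; the two priors, the observed proportion of 0.7 and the function names are all made up for illustration.

```python
def posterior_mean(prior_a, prior_b, successes, trials):
    """Posterior mean of a proportion under a Beta(prior_a, prior_b) prior
    after observing `successes` out of `trials` (conjugate update)."""
    return (prior_a + successes) / (prior_a + prior_b + trials)


# Two deliberately different, hypothetical priors:
# a flat one and one that strongly favours low proportions.
FLAT_PRIOR = (1.0, 1.0)
SKEWED_PRIOR = (2.0, 20.0)


def prior_gap(trials):
    """Absolute difference between the two posterior means when 70% of
    `trials` observations are successes (hypothetical data)."""
    successes = trials * 7 // 10
    flat = posterior_mean(*FLAT_PRIOR, successes, trials)
    skewed = posterior_mean(*SKEWED_PRIOR, successes, trials)
    return abs(flat - skewed)


if __name__ == "__main__":
    # The disagreement between the priors shrinks as the data set grows.
    for n in (10, 100, 10000):
        print(n, round(prior_gap(n), 4))
```

With ten observations the two priors lead to clearly different estimates; with ten thousand they are nearly indistinguishable. That is the card my colleague played, and the same behaviour is what prompts the reply that the prior then contributes next to nothing.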

At any rate the collaboration, which resulted in a paper published in 2012, brought three novelties for me: using BEAST, using Mesquite, and doing species tree analyses both parsimony-based and Bayesian. The last of these was a very conscious decision based on learning about coalescent theory and incomplete lineage sorting.

Since moving to Australia, I have experimented with other species tree methods, but perhaps the most important tool I have picked up is RAxML. In this case the reasons are entirely pragmatic: We have done some work with very large trees derived from mining Genbank, usually to combine them with distribution data. It would simply take unrealistically long to use Bayesian approaches.

As I still have a soft spot for parsimony I have also thrown the same data sets at TNT, for the same speed-related reasons. But unfortunately I generally need a single dichotomous tree for downstream analyses, even if some branches may be very poorly supported, and parsimony analysis invariably produces several equally parsimonious trees. Consequently my use of TNT has not yet translated into publications.

So that is where I am now: Considerably more aware of the strengths and weaknesses of the different approaches than ten years ago, but still not a partisan of any school. I remain as unimpressed by philosophical arguments against statistical methods as by statistical arguments against the distinctly non-statistical method of parsimony. I will happily and pragmatically use all of them, even the distance methods that I may sometimes sound a bit dismissive of. And to be honest, I find exclusive adherents of any particular method a bit odd. These are all tools! I wouldn't tweet "hammer 4 Life" and "boycott journals that insist on using screwdrivers" either.

My current preferences are BEAST for smaller datasets and especially multi-gene species trees, as it seems to be fairly flexible and robust to data that throw other statistical methods off; RAxML for large supermatrix datasets; TNT for quick data exploration especially on Linux at home or while travelling; and PAUP for its richness in more obscure analyses that the others, as special-purpose software, don't necessarily provide.

Just as with the methods it seems like a good idea to be flexible and pragmatic about the software.
