Friday, 1 April 2016

The popularity of phylogenetic programs over the years

Some time ago a colleague commented on this blog that "Bayesian (MrBayes/Beast) analyses have become almost stand-alone standard". My feeling is that the situation is much more mixed, with many palaeontologists still using parsimony, people with very big datasets using RAxML, some fields sticking to MEGA, and so on.

I got curious: What is the actual market share of different phylogenetics programs? How has it developed over time?

The following are the programs I considered:

Hennig86: A parsimony analysis program from the 1980s. It cost money and could not really do a lot, but it was the first program I was exposed to in an undergraduate course.

Phylip: This package offers parsimony, likelihood and distance. I never understood its popularity as I find its interface very off-putting, but I guess traditionally the advantage over PAUP was that it was available for free.

PAUP: PAUP was the first program I used seriously and it was generally popular in my institute during my Diplom and Doctorate time. It offers parsimony, likelihood and distance analyses, but it is rather slow with likelihood and had to be paid for. A new version is in the works, with its command line version to be made available for free.

MEGA: Not quite comparable to the others as it is also used by many people for sequence alignment instead of phylogenetic analysis. But it does offer likelihood, parsimony and distance and is freely available. As mentioned above, it seems to be very popular in some specific areas of systematics. For example I know a nematode researcher who told me that virtually all his colleagues use it, whereas I would at the moment not be able to recall a single botanist who does so. One of its attractions might be the GUI environment and the combination of alignment and phylogenetic analysis under one roof.

Mesquite: Likewise a program that is used for much more than phylogenetic analysis, although in this case it is more downstream analyses in evolutionary biology rather than sequence alignment. Freely available, but I am not entirely sure what phylogenetic tree searches it can do beyond parsimony. The options are a bit confusing, to be honest. I have included it here because I use it for character tracing and parsimony species trees.

TNT: Freely available, can only do parsimony. It is ludicrously fast even for large datasets, but as most molecular phylogeneticists have moved to model-based methods it ironically appears to be used in large part by palaeontologists with fairly small datasets. That being said, I see anecdotally that it may be picked up a bit more by at least a subset of the new generation of people working with huge genomic data and a tolerance for scripting.

PhyML: Can only do likelihood. I have never used it as a stand-alone version; my only exposure to it is through SeaView.

RAxML: Can only do likelihood but is optimised for doing it really, really fast. Kind of the TNT for likelihood (or vice versa). Freely available, more and more important with big datasets, but perhaps not entirely the most user-friendly option.

GARLI: Likelihood analysis software with a claim to high speed. I have never used it but am afraid that it tried to compete for the exact same market niche as RAxML at around the same time and lost (see numbers below).

MrBayes: The first Bayesian phylogenetics tool I ever used, back in my doctorate time. Freely available, not very user-friendly, produces phylograms. At least that is true for the versions I am familiar with.

BEAST: The other well-known Bayesian phylogenetics software package, also freely available. Uses a coalescent approach and therefore produces ultrametric trees. As far as I can tell it is faster than MrBayes and also relatively user-friendly thanks to its GUI. (Which is called BEAUTi. Aren't bioinformaticians' jokes hilarious?)

I may have forgotten some, especially if they had only a small niche in the 1990s and then disappeared, but I feel the list catches all the important players in the field.

I used SCOPUS to query the number of papers each year that cited the software in question. Because some names, especially BEAST, MEGA, Mesquite and TNT, are probably too unspecific and could produce lots of false positives, I combined them with the author name, e.g. by using the search term "TNT Goloboff" instead of just TNT. MrBayes, for example, I felt confident enough to query on its own.

The first graph shows the absolute number of references.


So what do we see? I am a bit worried about that massive peak for MEGA around 2011 being partly false positives, but the others make sense.

The 1990s and first half of the 2000s were the time of Phylip and PAUP. Both started to be widely used in the mid-1990s and peaked around 2004/5. Their decline afterwards coincides beautifully with the rise of PhyML. Although I am speculating here, it is possible that this represents users who did likelihood analyses switching over because they found PhyML to be either faster or more user-friendly. In contrast to Phylip, PAUP recovered in absolute terms until 2010. It declined again after that year in the face of a diversifying competition but still has a large pool of users.

Although MrBayes started its meteoric rise a few years earlier, the time since about 2007 is what I would consider the era of Bayesian phylogenetics as it started to overtake PAUP. Shortly thereafter BEAST took off, apparently making the first dent in MrBayes use only in the past year. Entirely parallel to BEAST the likelihood phylogenetics tool RAxML is growing in popularity for analyses where speed is of the essence.

There is a small market for the parsimony software TNT that does not appear to be impressed by the availability of other options.

One problem with absolute numbers is that phylogenetic software is used much more widely now than it was in the 1990s, so that it is harder to see what exactly was going on then. Another way of expressing these trends is as 'market share', in other words what percentage of references in the year were to each program.
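For anyone who wants to redo this calculation, here is a minimal Python sketch of the 'market share' idea: for each year, divide each program's reference count by the total across all programs. The function name and the counts in the example are made up purely for illustration; they are not the SCOPUS numbers behind the graphs.

```python
def market_share(counts_by_year):
    """Turn {year: {program: count}} into {year: {program: percent}}.

    The share is the percentage of that year's references (summed over
    all programs considered) that went to each program.
    """
    shares = {}
    for year, counts in counts_by_year.items():
        total = sum(counts.values())
        if total == 0:
            # No references at all that year: report zero for everyone.
            shares[year] = {program: 0.0 for program in counts}
        else:
            shares[year] = {
                program: 100.0 * count / total
                for program, count in counts.items()
            }
    return shares


# Hypothetical counts, NOT real data.
example_counts = {
    1995: {"PAUP": 30, "Phylip": 10},
    2010: {"PAUP": 50, "Phylip": 25, "MrBayes": 125},
}
example_shares = market_share(example_counts)
```

Note that a paper citing two programs is counted once for each, so the percentages describe shares of references, not of papers or users.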


Here we see that Hennig86 was relatively popular in the early 1990s, reaching as much as 10% of references in 1991. It has flatlined since about the turn of the century. The shifts between PAUP and Phylip are interesting to see. The same goes for the rather impressive-looking dent in the market share of MrBayes after 2008, which was entirely invisible on the graph for absolute numbers because the market as a whole grew so much.

To what degree are changes in popularity perhaps driven by method preference? In the following, I have presented market share of the programs by the methods they offer: red for parsimony, yellow for likelihood, appropriately orange for both, and blue for Bayesian.
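The grouping by method can be sketched in Python roughly as follows. The category labels follow the program descriptions above; Mesquite is deliberately left out of the mapping because its category is not spelled out here, so anything not in the mapping falls into a hypothetical "unclassified" bucket. The function name and the example counts are assumptions for illustration only.

```python
from collections import defaultdict

# Method offered by each program, as described in the list above.
# "both" means parsimony and likelihood in one package.
METHOD = {
    "Hennig86": "parsimony",
    "TNT": "parsimony",
    "PhyML": "likelihood",
    "RAxML": "likelihood",
    "GARLI": "likelihood",
    "Phylip": "both",
    "PAUP": "both",
    "MEGA": "both",
    "MrBayes": "Bayesian",
    "BEAST": "Bayesian",
}


def share_by_method(counts, method_of=METHOD):
    """Sum one year's reference counts per method and return percentages."""
    totals = defaultdict(int)
    for program, count in counts.items():
        totals[method_of.get(program, "unclassified")] += count
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {method: 100.0 * n / grand_total for method, n in totals.items()}


# Hypothetical counts for a single year, NOT real data.
example = share_by_method({"TNT": 10, "PAUP": 10, "MrBayes": 15, "BEAST": 5})
```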


This makes it look very much as if there is a trend towards specialised software in the last few years, contrasting with the big, flexible packages of earlier years. Although Bayesian phylogenetics is clearly on the rise, this graph also shows that it is still far from the "stand-alone standard". In fact the share has hardly risen above what it was in 2007.

It also has to be noted that for many colleagues in my area it was standard operating procedure in the last few years to do a Bayesian analysis with MrBayes, a parsimony analysis in PAUP, and then perhaps also a likelihood analysis in PAUP on the same dataset. They then generally showed the Bayesian tree decorated with its own support values and the bootstrap values from likelihood and parsimony. I know because these "they" include me and many papers I have reviewed.

Unfortunately, none of these graphs show well what is going on for the less popular programs. So here is a graph that shows their waxing and waning in relative terms. The maximum value of one is their highest ever market share (in fraction of the references in a year), so that other years show where they were relative to the biggest share they ever managed to capture in their existence.
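The rescaling itself is simple: divide each program's share in every year by the highest share that program ever reached, so its best year becomes exactly one. A minimal Python sketch, with a made-up function name and made-up shares that are not the values plotted here:

```python
def relative_to_peak(shares_by_year):
    """Turn {year: {program: share}} into {program: {year: share / peak}}.

    'peak' is the largest share that program reached in any year, so each
    program's best year maps to 1.0.
    """
    # First pass: find every program's highest ever share.
    peaks = {}
    for shares in shares_by_year.values():
        for program, share in shares.items():
            peaks[program] = max(peaks.get(program, 0.0), share)

    # Second pass: express every year relative to that peak.
    scaled = {}
    for year, shares in shares_by_year.items():
        for program, share in shares.items():
            peak = peaks[program]
            scaled.setdefault(program, {})[year] = share / peak if peak > 0 else 0.0
    return scaled


# Hypothetical shares in percent, NOT real data.
example_scaled = relative_to_peak({
    2000: {"A": 10.0, "B": 40.0},
    2001: {"A": 20.0, "B": 20.0},
})
```

Because every program is scaled against itself, the curves show when each one peaked, but their heights can no longer be compared between programs.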


Peak market share: PAUP in 1988, Hennig86 in 1991, Phylip in 1994. Note however that before 1996 hardly counts because the overall market size was minuscule until then; it would perhaps be more appropriate to locate the peaks in popularity for PAUP in 2002 and for Phylip in 1999, but the one for Hennig86 looks right either way. Moving on, peak market share for MrBayes and PhyML in 2007, GARLI in 2009, MEGA in 2012 (if that isn't noise in my search). The others - Mesquite, BEAST, RAxML, and even parsimony-only TNT - have actually achieved their highest ever market shares in the past year.

That is another signal for two developments: Usage of phylogenetics software is still growing overall, and a market that was long dominated by two to three multi-purpose programs is increasingly atomised into many niche programs with their distinct user communities.

-------

Updated 6 April 2016, and again 10 April 2016: As this has suddenly found so much unexpected interest, I felt I should do a bit better. Below are three new graphs.


This is a stacked plot of absolute references now taking into account six additional programs:

POY: An unusual program that I had never heard of before, suggested by a commenter. It accepts unaligned data and allows parsimony or likelihood analyses. Interestingly, searching for "POY Wheeler" produces 90% false positives (engineering etc.). I added the term "phylogenetic" to the search key to get only the correct references.

Treecon: Bit of an embarrassment that I didn't put it in right away as I have used it myself, but it only occurred to me after the comment on distance methods. As the previous sentence implies, it does only distance.

NONA: Another specialised parsimony software that bridges between Hennig86 of the early 1990s and today's TNT. Never used it myself.

PAML: As the ML in the name indicates, this is specialised likelihood phylogenetics software, and perhaps the biggest player I had been missing. Never used it myself.

Tree-puzzle: Likewise offers only likelihood, never used it myself.

Treefinder: Was in the news last year when its author forbade scientists in the USA and much of Europe to use the program because he disagrees with various policies, in particular immigration. Likelihood only, never used it myself.

In addition I have redone the MEGA references after I found that I got about 10% false positives there, but it doesn't really change much in the big picture. No idea, by the way, why overall references are going downhill after 2012. Maybe I am missing some very new software that I am not yet aware of?


Here is the same in percentages. Again note that the idea of a market share is a bit misleading given that many of the papers in question may have used two competing programs for the same study. I just reviewed a paper that cited PAUP, RAxML and MrBayes although the trees from the third are the only ones shown.


Finally, the methods graph redone to reflect that PAUP originally offered only parsimony. I am given to understand that version 4.0 was actually the first to add likelihood. As far as I can tell it appears to have come out in 1998, but I am happy to be corrected (the one I bought was already 4.0b10).

This graph says nothing about what methods were used from the orange block, although obviously everybody who cited a program from the other blocks would have used that method.

Note how the addition of NONA and Treecon really makes a difference here, with NONA having served the same community that now uses TNT. The likelihood-only sector is also larger after the addition of three more likelihood programs.
