Friday, July 18, 2008

“Weighting” trees with TreeFitter

TreeFitter [1] is a program for matching phylogenies with associations, used in biogeographic and co-evolutionary studies. It has some problems, but since it seems that TreeFitter is moving to open source, maybe these problems can be solved! Here I address some problems produced by the 'tree weighting' procedure. Analogous problems are found in some phylogenetic methods and programs, and I address them in passing (their consequences are fully discussed in [2]).

Like many biogeographic programs [3, 4, 5], TreeFitter only deals with fully dichotomous trees. To overcome this limitation, it implements a weighting of trees: you can enter all the dichotomous trees found in the phylogenetic analysis and assign a fractional weight to each one. For example, if you found four trees, each one would be weighted by 1/4. This is in effect a majority rule consensus. Ronquist, who is a defender of Bayesian methods, sees the weighting of trees as a positive feature because it covers the 'uncertainty' of the analysis [6]; the weights would then express the 'confidence' (support, in a more relaxed version) of each clade. But contrary to intuitive expectations, majority rule consensus has nothing to do with the support of a particular clade. Instead, it favors ambiguous topologies! [2, 7].
Take this example (after [2] and [7]):
#nexus

ptree new1 weight=0.143 (1,((6,(7,(8,9))),(10,(4,(3,(2,5))))));
range new1 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree new2 weight=0.143 (1,((6,(7,(8,9))),(2,(3,(4,(5,10))))));
range new2 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree new3 weight=0.143 (1,((6,(7,(8,9))),(2,(3,(5,(4,10))))));
range new3 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree new4 weight=0.143 (1,((6,(7,(8,9))),(2,(3,(10,(4,5))))));
range new4 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree new5 weight=0.143 (1,((6,(7,(8,9))),(2,((3,10),(4,5)))));
range new5 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree new6 weight=0.143 (1,((6,(7,(8,9))),(2,(10,(3,(4,5))))));
range new6 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree new7 weight=0.143 (1,((6,(7,(8,9))),((2,10),(3,(4,5)))));
range new7 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree indie1 (1,((6,(7,(8,9))),(4,(3,(2,5)))));
range indie1 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i;

ptree indie2 (1,((6,(7,(8,9))),(2,(3,(4,5)))));
range indie2 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i;

htree host1 (a,((f,(g,(h,i))),(d,(c,(b,e)))));
htree host2 (a,((f,(g,(h,i))),(b,(c,(d,e)))));
The ambiguity is caused by taxon '10' of data set 'new', which jumps among several positions in the tree. '10' has no influence on the selected tree, because it is the product of a dispersal in all topologies. '10' is unstable in all the topologies that include (4,5), so the majority rule gives more weight to that topology than to the alternative topology, in which the position of '10' is not ambiguous. In this case both 'tree islands' have the same evidential weight (by the way, that is the reason to prefer strict consensus over other consensus methods!). But when weights are applied, the topology showing (4,5) is preferred because there are more trees with that clade, so a topology ends up preferred because of its lack of resolution!
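To make the arithmetic behind this explicit, here is a minimal Python sketch (not part of TreeFitter). It removes taxon '10' from each of the seven 'new' trees and adds up the weights of the resulting skeletons. The helper names `parse` and `prune` are my own, and I assume an exact weight of 1/7 per tree instead of the rounded 0.143 used in the input file.

```python
from collections import defaultdict
from fractions import Fraction

# The seven 'new' trees and the two 'indie' trees from the example above.
NEW_TREES = [
    "(1,((6,(7,(8,9))),(10,(4,(3,(2,5))))))",
    "(1,((6,(7,(8,9))),(2,(3,(4,(5,10))))))",
    "(1,((6,(7,(8,9))),(2,(3,(5,(4,10))))))",
    "(1,((6,(7,(8,9))),(2,(3,(10,(4,5))))))",
    "(1,((6,(7,(8,9))),(2,((3,10),(4,5)))))",
    "(1,((6,(7,(8,9))),(2,(10,(3,(4,5))))))",
    "(1,((6,(7,(8,9))),((2,10),(3,(4,5)))))",
]
INDIE1 = "(1,((6,(7,(8,9))),(4,(3,(2,5)))))"
INDIE2 = "(1,((6,(7,(8,9))),(2,(3,(4,5)))))"


def parse(newick):
    """Parse a parenthesised tree into nested frozensets (child order ignored)."""
    pos = 0

    def node():
        nonlocal pos
        if newick[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while newick[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return frozenset(children)
        start = pos
        while pos < len(newick) and newick[pos].isdigit():
            pos += 1
        return newick[start:pos]          # a terminal label

    return node()


def prune(tree, taxon):
    """Remove one terminal and collapse any node left with a single child."""
    if isinstance(tree, str):
        return None if tree == taxon else tree
    kept = [c for c in (prune(child, taxon) for child in tree) if c is not None]
    return kept[0] if len(kept) == 1 else frozenset(kept)


# Assumption: every tree gets exactly 1/7 (the file rounds this to 0.143).
weights = defaultdict(Fraction)
for newick in NEW_TREES:
    weights[prune(parse(newick), "10")] += Fraction(1, len(NEW_TREES))

for name, newick in (("indie1", INDIE1), ("indie2", INDIE2)):
    print(name, "skeleton weight:", weights[parse(newick)])
```

With these inputs, the six trees new2 to new7 all reduce to the indie2 skeleton, so it collects 6/7 of the weight against 1/7 for the indie1 skeleton, although the only difference among those six trees is the position of '10'.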

At first glance it seems that this case can be solved by weighting whole islands instead of individual trees, but within each island it is possible to have the same problems as in the first example, and in more complex cases identifying 'topology' islands becomes difficult, and maybe impossible if there are several combinations in independent clades!

Solutions?

Of course, the best solution to the problem of multiple trees in TreeFitter, without using weights, is a new version that deals with polytomous trees. I guess that the resistance to polytomous trees arises because they 'imply' simultaneous speciation. I do not hold that kind of idea ;), and I see no reason to think that polytomous trees support such an interpretation. Even if that were the interpretation, it would be better than weighting (by the way, if you think that polytomous trees imply multiple speciation, then weighting implies fractional speciation! I think that is a more problematic idea than 'multiple' speciation!).

But until a new version of TreeFitter, or a similar program, arrives, it is necessary to find a solution to the problem of multiple trees. I am not happy with the solution that I propose here, but I have no other idea, so here I go...

Use an Adams consensus to detect the terminals/clades that produce the multiple trees, and remove them from the analysis, so that you keep the stable part of the topology. I do not like removing evidence, but it seems safer than relying on the biased solution of tree weights. Some may want to re-run the analysis, but I think it is preferable to use the reduced tree, as it is based directly on the whole evidence; the effect of the data removal, I hope, is then minor, as the stable part of the original topology is conserved. Also, TNT [11, 12] has several tools to identify unstable taxa, so an analysis of the trees with TNT provides several ways to find stable topologies.
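As a rough illustration of what 'detecting the jumping terminal' means, the sketch below uses a brute-force stand-in: it drops each terminal in turn from the seven 'new' trees of the example and counts how many distinct topologies remain. This is neither an Adams consensus nor the method implemented in TNT; it is only a simple check I am assuming for illustration, and the function names `clades` and `without` are hypothetical.

```python
# The seven 'new' trees from the example above.
NEW_TREES = [
    "(1,((6,(7,(8,9))),(10,(4,(3,(2,5))))))",
    "(1,((6,(7,(8,9))),(2,(3,(4,(5,10))))))",
    "(1,((6,(7,(8,9))),(2,(3,(5,(4,10))))))",
    "(1,((6,(7,(8,9))),(2,(3,(10,(4,5))))))",
    "(1,((6,(7,(8,9))),(2,((3,10),(4,5)))))",
    "(1,((6,(7,(8,9))),(2,(10,(3,(4,5))))))",
    "(1,((6,(7,(8,9))),((2,10),(3,(4,5)))))",
]


def clades(newick):
    """Return the set of clades (frozensets of terminal labels) of a tree."""
    found, stack, label = set(), [], ""
    for ch in newick:
        if ch.isdigit():
            label += ch
            continue
        if label:                       # a terminal label has just ended
            stack[-1].add(label)
            label = ""
        if ch == "(":
            stack.append(set())
        elif ch == ")":
            clade = frozenset(stack.pop())
            found.add(clade)
            if stack:                   # pass the terminals up to the parent
                stack[-1] |= clade
    return found


def without(tree_clades, taxon):
    """Topology after dropping one terminal: clades that still have 2+ members."""
    return frozenset(c - {taxon} for c in tree_clades if len(c - {taxon}) > 1)


all_clades = [clades(newick) for newick in NEW_TREES]
taxa = sorted(max(all_clades[0], key=len), key=int)   # root clade = all terminals

# Number of distinct topologies left after removing each terminal in turn.
distinct = {t: len({without(c, t) for c in all_clades}) for t in taxa}

for taxon in taxa:
    print(f"without {taxon}: {distinct[taxon]} distinct topologies")
print("candidate unstable terminal:", min(distinct, key=distinct.get))
```

In this data set, removing '10' leaves only two topologies out of seven, far fewer than removing any other terminal, which is the same terminal that the text identifies as the source of the ambiguity.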

A second problem is that in TreeFitter extinctions have a cost greater than 0 [6], so removing terminals could increase the extinction value, but I think that this is a minor problem compared with giving weight to ambiguous data. Actually, we can think of the cost increase from the new 'extinctions' as the penalty for the ambiguity of the data.

Weighting trees in other methods

As far as I know, there are other phylogenetic methods/programs that weight trees and suffer from the same flaws as tree weighting under TreeFitter.

First is the majority rule consensus. It seems that the main reason to prefer majority rule consensus is that it provides results with more resolution. But as the examples show [2, 7], the resolution created by a majority rule tree is coupled more with the ambiguity generated by a particular topology skeleton. Maybe a more interesting solution could be the use of a minority rule consensus (note that supported clades always appear, as their frequency is 100%); an advantage is that it at least acknowledges that the number of instances of a clade is not related to the 'support' of the clade. A comparison between the majority and minority rule consensus can show the parts of the tree that are somewhat unstable. But I think that this kind of analysis is better performed using a combination of a strict consensus and an Adams consensus [8], or several of the tools from TNT [11, 12].
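The gap between clade frequency and support can be checked directly on the seven 'new' trees of the TreeFitter example. The following is a small Python sketch, under the assumption that all trees count equally, which tallies how often each clade occurs and then lists the clades kept by a majority rule (more than half of the trees) and by a strict consensus (all of the trees). The `clades` helper is my own, not taken from any program.

```python
from collections import Counter

# The seven 'new' trees from the TreeFitter example.
NEW_TREES = [
    "(1,((6,(7,(8,9))),(10,(4,(3,(2,5))))))",
    "(1,((6,(7,(8,9))),(2,(3,(4,(5,10))))))",
    "(1,((6,(7,(8,9))),(2,(3,(5,(4,10))))))",
    "(1,((6,(7,(8,9))),(2,(3,(10,(4,5))))))",
    "(1,((6,(7,(8,9))),(2,((3,10),(4,5)))))",
    "(1,((6,(7,(8,9))),(2,(10,(3,(4,5))))))",
    "(1,((6,(7,(8,9))),((2,10),(3,(4,5)))))",
]


def clades(newick):
    """Return the set of clades (frozensets of terminal labels) of a tree."""
    found, stack, label = set(), [], ""
    for ch in newick:
        if ch.isdigit():
            label += ch
            continue
        if label:                       # a terminal label has just ended
            stack[-1].add(label)
            label = ""
        if ch == "(":
            stack.append(set())
        elif ch == ")":
            clade = frozenset(stack.pop())
            found.add(clade)
            if stack:                   # pass the terminals up to the parent
                stack[-1] |= clade
    return found


n = len(NEW_TREES)
freq = Counter(clade for newick in NEW_TREES for clade in clades(newick))

majority = {clade for clade, count in freq.items() if 2 * count > n}
strict = {clade for clade, count in freq.items() if count == n}

for clade, count in sorted(freq.items(), key=lambda item: -item[1]):
    labels = ",".join(sorted(clade, key=int))
    print(f"{count}/{n}  ({labels})")
```

Here the clade (4,5) occurs in four of the seven trees and enters the majority rule consensus, while (2,5) occurs in only one; the strict consensus keeps neither, which is the behaviour argued for above.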

Under Bayesian analysis, the frequency of clades is interpreted as a 'posterior probability' (a more catholic interpretation might be clade support). Bayesian analysis differs from a typical consensus in that it builds a consensus from trees with different optimality values. But all explored topologies are taken as they are found, and branches are never collapsed, so they are subject to the problems of majority rule consensus [2, 7]. Thus, in cases where a particular terminal/clade produces ambiguity, the final topology favors the ambiguous topology. This can produce some illogical results when data sets, even ones with a single optimal tree, are not very decisive [9] and have several trees of near-optimal fit. In that case, it is possible that the method prefers the ambiguous topologies from the sub-optimal trees over the topology of the optimal one, even with high probability values! I recommend reading [2] for a full criticism of Bayesian analysis.

Majority rule consensus is also used in another way: to measure support by resampling (such as jackknife or bootstrap). In this case the use of majority rule does reflect the amount of support for each clade. First, a strict consensus tree is built from each resampled matrix analyzed; then the final majority rule consensus shows the number of times each supported clade appears. In this case there is no bias for or against a particular clade, because the trees from each resampled matrix are collapsed. This is not the case for PAUP [2, 10], because in PAUP the trees are weighted within each resampled search; the majority rule consensus is then a consensus of several individual, uncollapsed trees, and so it is fully prone to the ambiguity problems of majority rule consensus.
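The difference between the two ways of counting can be seen with a toy case. The sketch below uses invented data: two hypothetical resampling replicates, each given as a list of equally optimal trees, with every tree written directly as a set of clades. It compares clade frequencies counted over per-replicate strict consensus trees with frequencies counted over the pooled, uncollapsed trees. It does not reproduce the actual weighting scheme of any program.

```python
from collections import Counter
from fractions import Fraction


def fs(*taxa):
    """Shorthand for a clade written as a frozenset of terminal numbers."""
    return frozenset(taxa)


# Hypothetical replicates (invented for illustration).
replicates = [
    [   # replicate 1: three optimal trees that differ only in where '10' sits
        {fs(4, 5), fs(3, 4, 5), fs(3, 4, 5, 10)},
        {fs(4, 5), fs(4, 5, 10), fs(3, 4, 5, 10)},
        {fs(4, 5), fs(3, 10), fs(3, 4, 5, 10)},
    ],
    [   # replicate 2: a single optimal tree with a different skeleton
        {fs(2, 5), fs(2, 3, 5), fs(2, 3, 4, 5)},
    ],
]


def strict_consensus(trees):
    """Clades present in every tree of one replicate."""
    return set.intersection(*trees)


# Collapsed counting: each replicate contributes one strict consensus tree.
collapsed_counts = Counter(
    clade for trees in replicates for clade in strict_consensus(trees)
)
collapsed = {c: Fraction(k, len(replicates)) for c, k in collapsed_counts.items()}

# Pooled counting: every uncollapsed tree from every replicate is one vote.
pooled_trees = [tree for trees in replicates for tree in trees]
pooled_counts = Counter(clade for tree in pooled_trees for clade in tree)
pooled = {c: Fraction(k, len(pooled_trees)) for c, k in pooled_counts.items()}

for clade in (fs(4, 5), fs(2, 5)):
    print(sorted(clade), "collapsed:", collapsed[clade], "pooled:", pooled[clade])
```

In this toy case each skeleton is found in one of the two replicates, and the collapsed count reports 1/2 for both (4,5) and (2,5); pooling the uncollapsed trees gives 3/4 to (4,5) only because its replicate happened to yield more trees.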

[1] Ronquist, F. 2001. TreeFitter, program and documentation. Available at: http://www.ebc.uu.se/systzoo/research/treefitter/treefitter.html
[2] Goloboff, P.A., Pol, D. 2005. Parsimony and Bayesian phylogenetics. In: Albert, V.A. Ed. Parsimony, phylogeny, and genomics. Oxford univ., Oxford, pp. 148-159.
[3] Page, R. D. M. 1993. Component 2.0, program and documentation. Available at: http://taxonomy.zoology.gla.ac.uk/rod/cpw.html
[4] Page, R. D. M. 1994. TreeMap 1.0, program and documentation. Available at: http://taxonomy.zoology.gla.ac.uk/rod/treemap.html
[5] Ronquist, F. 1996. DiVa, program and documentation. Available at: http://www.ebc.uu.se/systzoo/research/diva/diva.html
[6] Ronquist, F. 2003. Parsimony analysis of coevolving species associations. In: Page, R.D.M. Ed. Tangled trees. Chicago Univ., Chicago, pp. 22-64.
[7] Sharkey, M.J., Leathers, J.W. 2001. Majority does not rule: the trouble with majority-rule consensus trees. Cladistics 17: 282-284. doi: 10.1006/clad.2001.0174
[8] Kearney, M. 2002. Fragmentary taxa, missing data, and ambiguity: mistaken assumptions and conclusions. Systematic Biology 51: 369-381. doi: 10.1080/10635150252899824
[9] Goloboff, P.A. 1991. Homoplasy and the choice among cladograms. Cladistics 7: 215-232. doi: 10.1111/j.1096-0031.1991.tb00035.x
[10] Swofford, D. 1998. PAUP, program and documentation, Sinauer, Sunderland (USA).
[11] Goloboff, P.A., Farris, J.S., Nixon, K.C. 2008. TNT, program and documentation. Available at: http://www.zmuc.dk/public/phylogeny/TNT/
[12] Goloboff, P.A., Farris, J.S., Nixon, K.C. 2008. TNT, a free program for phylogenetic analysis. Cladistics, in press. doi: 10.1111/j.1096-0031.2008.00217.x

6 comments:

Mike Keesey said...

Why couldn't there be such a thing as simultaneous speciation? Take a population, split it into three subsets simultaneously and wait an appropriate number of generations....

Salva said...

Multiple speciation is, of course, possible. But in a cladogram, a polytomy does not provide evidence for multiple speciation. The polytomy can be caused by real multiple speciation or by data ambiguity, and you never know which. So the safer idea is to treat polytomies as ambiguity.

Mike Keesey said...

Good point.

Anonymous said...

Hi .. would the removal of terminals after building the Adams consensus be based on some criterion??

Regards, Juliana

Salva said...

Hi Juliana

The removal procedure is the one described by Kearney [8]. What she does is compare the Adams consensus with the strict consensus, and in that way she identifies the "jumping" terminals.

But I think it is better to use the method implemented in TNT, which is more direct (removing taxa to improve the resolution of the consensus).

Of course, the ideal would be not to use TF xD, or to have something like TF that takes polytomies into account!

Anonymous said...

Hi!!

If one does not have the data matrix, only the topologies, and wants to build the consensus from them, then one could not use TNT, so the most feasible option would be to use Kearney's methodology.

Juliana