Computer cladistics / ¡Cladística a la lata!: 2008

Wednesday, November 12, 2008

Core coding

These days, some of the ideas I have for my project are not turning out the way I wanted, so I have been programming some computational details instead... As the subject is closely related to computers but has little to do with cladistics (apart, of course, from being part of my project xD), I left it on my personal blog :P.

English version:

How can Java (and several other languages) programmers live without pointers?

Monday, November 03, 2008

Hennig XXVII: “Live” blogging, day 4

It was difficult to get up early after the wild party, but even more difficult not to fall asleep, not because the presentations were boring, but because the whole meeting schedule caught up with me!

Torbjørn Ekrem tries to produce “robust backbone trees” to perform phylogenetic analyses, but he uses a form of character elimination, so I don't like his idea.

Pancho Prevosti works on a simultaneous analysis of otters, using morphology, molecules, and fossils.

Torsten Dikow shows a phylogenetic analysis of Asilidae (robber flies) in the context of the available fossils of the group. And Jonathan Liria uses morphometric data, analyzed with TNT, to search for the phylogeny of some particular Culicidae mosquitoes. For the moment, I feel really suspicious about the use of “warps” or “PCA” as characters!

Gabriel Ruá explores several k values in order to choose one, and analyzes a clade of Asteraceae. But I think that his use of a broad range of k values made his inferences somewhat useless.

Tim Crowe analyzes a big data set to evaluate the position of some African quails; he shows that many identifications based solely on feather colors are plainly wrong. But I like his presentation of several calls of these quails.

Using the large data set, Norberto Giannini examines some phylogenetic hypotheses about mammals, and finds some interesting consequences! I like the way in which he presents the results, and both the good and the problematic points of the study.

Gitte Petersen shows a weird behavior in her molecular data, and she connects it with paralogy and mRNA editing. Although I think that there are better ways to attack the problem, I think that her exploration of the results was wonderful!

Then Kevin Nixon shows how some character codings used for the origin of seed plants are highly problematic, because they are based on a preconception and a chain of errors in the description of some fossil material.

After lunch, Fabián Michelangeli gives a talk about a Melastomataceae clade. He uses molecular results, but also includes a mapping of an initial set of morphological characters that he and his co-authors have been working on, and shows how this evidence, still to be added, fits the molecular results. I like the detailed work on his fruit characters! Along the same Melastomataceae line, Renato Goldenberg talks about the preliminary results of a more inclusive project on this group of plants.

PS. Although published a few days after the meeting, this post was written on the last day of the meeting. I did not publish it earlier because after the last talk I, like everyone, went out to celebrate the end ;).

PS2. As part of the organization of the meeting, I want to thank everyone for coming!! :D

Friday, October 31, 2008

Hennig XXVII: “Live” blogging, day 3 (banquet)

Of course, one of the most anticipated things of the Hennig meeting was the banquet, and the banquet speech :)!

Steve Farris announced the student awards, which went to two Argentinians, both of them friends :). Sebastián Barrionuevo, “el negro”, won the Rosen Award for the best poster. Santi Catalano won the Hennig Award for the best talk (about the use of landmarks in phylogenies).

The banquet speech was given as a pair. Jyrki Muona was originally assigned to give the speech, but he was unable to come, so Kevin Nixon took his spot, talking about the different kinds of trees. It was EXTREMELY funny xD. As the meeting was held jointly with the “Reunión Argentina de Cladística y Biogeografía”, Julián Faivovich gave his talk in Spanish ;)... Of course, the talk was aimed, without any doubt, at the Argentinian audience, but the background of the meeting itself was enough to have a lot of fun!! It was GREAT!

My congratulations to both speakers :D!!

Thursday, October 30, 2008

Hennig XXVII: “Live” blogging, day 3

Today was a highly theoretical day ;)... Most of the talks were methodological, with some scattered practical works.

Ward Wheeler, in a line similar to Grant and Kluge, argues that “objective support”, like Bremer support or likelihood ratios, is different from “average support”. His main argument is that objective support is a better measure than average support, because it is based on a direct comparison of the optimality criterion... I do not agree xD.

Next, Pablo Goloboff showed that the common argument against weighting (that a character that performs poorly in one clade is underweighted in another clade in which it has low homoplasy) is a problematic one, and that parsimony, as we know it, implies homogeneous weighting across the whole cladogram.

John Wenzel did not seem to succeed in showing a different way to approach consensus trees. I think that agreement subtrees, the method that he defends, are not as good as the reduced consensus that can be found with TNT.

In a talk that was interesting from a philosophical and statistical point of view, Chris Randle showed that, as currently implemented, Bayesian analysis is not Bayesian, because of the impossibility of implementing a real definition of clade priors.

After the coffee break, Steve Farris gave an entertaining and clever talk about some misrepresentations of ideas of support by Grant & Kluge and, of course, reaffirmed the masterful conclusion of his 1983 classic: parsimony is the minimization of ad hoc hypotheses of homoplasy. I'm very happy to have seen the one who gave shape to current numerical cladistics (and, I think, the major contributor to the theoretical development of phylogenetics in general!).

Then came a bunch of papers based on Pablo's implementation of continuous characters, using Farris' optimization, applied to Opiliones. But the most interesting contribution was from Santiago Catalano, who showed that landmark data can be viewed as a generalization of Sankoff's parsimony!

The afternoon started with a presentation of the possibilities of EOL.org (Encyclopedia of Life) by Torsten Dikow. Actually, apart from being as wonderful as Wikipedia, I do not see any application for EOL... (see Page's blog!)

Matthew Yoder shows some wonderful ways to work with open source in the development of his software for multi-author phylogenetic studies, the server-based mx.

Then Rasmus Hovmöller shows some interesting work to understand the spread of avian influenza A; unfortunately, external problems were an obstacle to enjoying his results.

Federico López gives a talk about applying conservation indexes to conservation in Amazonia. I'm quite suspicious of that kind of index (although it is also bad, I think that Faith's PD is far better than the Vane-Wright indexes!).

Norberto Giannini shows a new way to treat the correlation of characters (the “comparative method”) in a truly phylogenetic way. The method is excellent, and I think it is a real improvement in that field!

Fernando Noll showed a beautiful work on behavioral data for Meliponini bees, which includes oviposition and nest architecture.

Jeffrey Skevington uses dragonflies from Fiji, and he tries to explain the origin of sexual bias in these beautiful insects. Juan Larraín shows his molecular analysis of a group of mosses, and compares his results with a preliminary set of morphological characters.

Martín Ramírez gives an excellent talk about the usefulness of ontologies for phylogenetic analysis! I feel that ontologies are an important step for the maintainability of morphological data (and their subsequent usage), but I think that, although wonderful, their re-use, especially by authors external to the original work, seems to be difficult (at least as it is currently done).

To finish the day, Jonathan Liria talks about k selection using some of my old TNT scripts xD...

Wednesday, October 29, 2008

Hennig XXVII: “Live” blogging, day 2

Today the meeting started with a highly molecular morning. Gonzalo Giribet presented a symposium about new methods for “phylogenomics” (organized by him, Ward Wheeler, and Jyrki Muona). Three talks were about the use of gene order, inversions, and recombination in the context of phylogenetic analysis. All of the presented analyses are POY-oriented ;). It is sad that Andrés Varón, a Colombian working with POY, was unable to attend.

The most interesting one was the talk by Gonzalo, who insists on the usefulness of morphological data, and presented some new analyses of his EST data for Metazoa (published in Dunn et al. 2008) together with more than 200 morphological characters for metazoans, assembled with a network of experts. His analysis shows that morphology provides evidence for grouping at all levels of the tree. He was very suspicious about “groundplan” coding. I always prefer exemplar coding, but sometimes information available from terminals that were not directly analyzed (for example, several developmental data) can provide an excellent source of information.

Prosanta Chakrabarty tries to test sexual selection in a group of luminous fishes. I was impressed by the diversity of those fishes, but I think that questions about selection pressures cannot be answered in the way shown by Prosanta (or in any way!).

Next, there were two nice works on curculionids. In the first, Analía Lanteri showed a particular group of broad-nosed weevils; then Adriana Marvaldi showed her most recent advances in the understanding of the whole morphology (and phylogeny) of Curculionoidea, and how much of the sequence data recently assembled for those beetles is highly congruent with the morphological results.

The afternoon talks are more interesting to me, because they are about biogeography ;) --I only have 3 interests: parsimony methodology, morphological phylogenetics, and REAL biogeography xD--.

In the first one, Peter Hovenkamp shows a very interesting parallel between ideas from phytosociology and endemism. He finds that many of the implementations of the ideas of Josias Braun-Blanquet (or the European school of phytosociology) did not really implement those methods! He thinks that NDM (of Szumik and Goloboff, 2004) can be a better tool for phytosociological ecology!

Another wonderful talk was given by Claudia Szumik, who made an analysis of endemism in northern Argentina. The good thing about the study is that it includes data taken directly from the experts on each group, with several data collected by Claudia and her co-authors!

Loló (Dolores Casagranda) presents a comparison between NDM and PAE. I worked on that talk, so I feel that I would give a biased report, so I pass xD.

Erika Parada, a student from my undergrad university, talks about her analysis of the northern Andes. She uses TreeFitter, a program that I do not like a bit :P, because it has several problems. I think she did great work, but unfortunately results from TF are, for me, doubtful! :P

I did not like the talk by Dalton Amorim, but the discussion that followed it was very interesting, with James Liebherr giving several strong (and clever!) points against Dalton's ideas.

Tuesday, October 28, 2008

Hennig XXVII: “Live” blogging, day 1

Yesterday several people came to the reception. I talked with several nice people; it was a very cool afternoon :D!

Today the meeting started properly. The site, San Javier, is wonderful: it is on top of a mountain, just in front of Tucumán, so you can see the whole plain that extends to the east, including, of course, the city of Tucumán. Excellent place!

There were several talks; some of them were somewhat difficult to follow (for me at least xD), but overall they were really nice. I really, really liked the talk by Cecilia Kopuchian about the phylogenetics of Furnariidae (Aves). She combines pictures of her characters with her results in a wonderful way, so even if you know nothing about the group (like me), you learn several things about it, and you are always on the subject of the talk!

For the paleontologists, Diego “el caco” Pol gives a talk about some fossil crocodiles from Argentina and Africa, Mesosuchia (I hope I remember right xD), which were the last surviving taxa of non-modern crocs.

Julián Faivovich gives a molecular talk about the phylogenetics of Hylidae and, to keep the talk interesting, he tries to put his work into some biogeographical framework (a “one taxon” approach, but he tries to make some predictions with his data). I think his results can be very interesting for POY users, as he found a (manual) trick to speed up searches.

In the afternoon, James Liebherr gives a talk about Blackburnia, a beetle from Hawaii; he uses living taxa and several “semi” fossil taxa found in a cave. Jim speaks somewhat slowly, but I love the depth of his work!

Camilo Mattoni shows the development of a scorpion data set, from the use of some general morphological data sets, to the specific morphology of the Bothriuridae, to different molecular markers, to a simultaneous analysis.

Louise Crowley works on the morphology (and mol. seqs) of a very hard group (also in the literal sense xD): oysters. I can't believe how many chars you can find in a single shell!

And of course, Santi and Marcos talk about the redefinition of large :).

Tuesday, October 21, 2008

Next week: A summit of cladistics

The Hennig meeting will be next week! Here everyone is working on several details :).

I do not know if Hotel Sol has a WiFi connection; if it does, I will be “live blogging” about some talks ;)... if not, then I hope to post a review next week xD.

The official web site of the meeting is www.hennig27.com.ar; there you can find the program, it looks really nice!! :D

Friday, July 18, 2008

“Weighting” trees with TreeFitter

TreeFitter [1] is a program for matching phylogenies with associations, used in biogeographic and co-evolutionary studies. It has some problems, but as it seems that TreeFitter is moving to open source, maybe these problems can be solved! Here I address some problems produced by the 'tree weighting' procedure. Analogous problems are found in some phylogenetic methods and programs, and I address them in passing (their consequences are fully discussed in [2]).

Like many biogeographic programs [3, 4, 5], TreeFitter only deals with perfectly dichotomous trees. To overcome this limitation, it implements a weighting of trees: you can input all the dichotomous trees found in the phylogenetic analysis and assign a fractional weight to each tree. For example, if you found four trees, each one would be weighted by 1/4. This is in fact a majority rule consensus. Ronquist, who is a defender of Bayesian methods, sees the weighting of trees as a positive characteristic, as it covers the 'uncertainty' of the analysis [6]; it would then express the 'confidence' (support, in a more relaxed version) in each clade. But contrary to intuitive expectations, majority rule consensus has nothing to do with the support of a given clade. Instead, it favors ambiguous topologies! [2, 7].
Take this example (after [2] and [7]):

#nexus

ptree new1 weight=0.143 (1,((6,(7,(8,9))),(10,(4,(3,(2,5))))));
range new1 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree new2 weight=0.143 (1,((6,(7,(8,9))),(2,(3,(4,(5,10))))));
range new2 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree new3 weight=0.143 (1,((6,(7,(8,9))),(2,(3,(5,(4,10))))));
range new3 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree new4 weight=0.143 (1,((6,(7,(8,9))),(2,(3,(10,(4,5))))));
range new4 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree new5 weight=0.143 (1,((6,(7,(8,9))),(2,((3,10),(4,5)))));
range new5 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree new6 weight=0.143 (1,((6,(7,(8,9))),(2,(10,(3,(4,5))))));
range new6 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree new7 weight=0.143 (1,((6,(7,(8,9))),((2,10),(3,(4,5)))));
range new7 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i, 10:g;

ptree indie1 (1,((6,(7,(8,9))),(4,(3,(2,5)))));
range indie1 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i;

ptree indie2 (1,((6,(7,(8,9))),(2,(3,(4,5)))));
range indie2 1:a, 2:b, 3:c, 4:d, 5:e, 6:f, 7:g, 8:h, 9:i;

htree host1 (a,((f,(g,(h,i))),(d,(c,(b,e)))));
htree host2 (a,((f,(g,(h,i))),(b,(c,(d,e)))));

The ambiguity is caused by taxon '10' of the data set 'new', which jumps to several positions in the tree. '10' has no influence on the selected tree, because it is the product of a dispersal in all topologies. '10' is unstable in all topologies that include (4,5), so the majority rule gives more weight to that topology than to the alternative topology, in which the position of '10' is not ambiguous. In this case both 'tree islands' have the same evidential weight (by the way, that is the reason to prefer the strict consensus over other consensus methods!). But when weights are applied, the topology showing (4,5) is preferred, as there are more trees with that clade; that is, one topology is preferred only because of its lack of resolution!
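The imbalance between the two islands can be checked with a small script. This is only a sketch, not part of TreeFitter: the helper names (`parse_newick`, `prune`, `canonical`) are made up for the illustration, and the parser assumes plain Newick strings with labels only. It prunes taxon '10' from the seven 'new' trees and counts how many of them fall on each of the two 'indie' skeletons.

```python
from collections import Counter

def parse_newick(text):
    """Parse a label-only Newick string into nested tuples of strings."""
    text = text.strip().rstrip(";")
    pos = 0
    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1                      # skip "("
            children = [node()]
            while text[pos] == ",":
                pos += 1                  # skip ","
                children.append(node())
            pos += 1                      # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]
    return node()

def prune(tree, taxon):
    """Remove one terminal and suppress any node left with a single child."""
    if isinstance(tree, str):
        return None if tree == taxon else tree
    kept = [c for c in (prune(child, taxon) for child in tree) if c is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

def canonical(tree):
    """Order-independent text form, so that (4,5) and (5,4) compare equal."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(sorted(canonical(c) for c in tree)) + ")"

# The seven 'new' trees and the two 'indie' trees from the example above.
NEW_TREES = [
    "(1,((6,(7,(8,9))),(10,(4,(3,(2,5))))))",
    "(1,((6,(7,(8,9))),(2,(3,(4,(5,10))))))",
    "(1,((6,(7,(8,9))),(2,(3,(5,(4,10))))))",
    "(1,((6,(7,(8,9))),(2,(3,(10,(4,5))))))",
    "(1,((6,(7,(8,9))),(2,((3,10),(4,5)))))",
    "(1,((6,(7,(8,9))),(2,(10,(3,(4,5))))))",
    "(1,((6,(7,(8,9))),((2,10),(3,(4,5)))))",
]
INDIE1 = "(1,((6,(7,(8,9))),(4,(3,(2,5)))))"
INDIE2 = "(1,((6,(7,(8,9))),(2,(3,(4,5)))))"

# Count the skeletons that remain once the floating taxon is pruned.
skeletons = Counter(canonical(prune(parse_newick(t), "10")) for t in NEW_TREES)
island1 = skeletons[canonical(parse_newick(INDIE1))]
island2 = skeletons[canonical(parse_newick(INDIE2))]
print(f"indie1 skeleton: {island1} of {len(NEW_TREES)} trees")
print(f"indie2 skeleton: {island2} of {len(NEW_TREES)} trees")
```

With equal weights of 0.143 per tree, the skeleton in which '10' floats collects six sevenths of the total weight, although the evidence for the two skeletons is the same.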

At first sight it seems that this case can be solved by weighting the whole islands instead of each tree, but within each island it is possible to have the same problems as in the first example, and in more complex cases identifying the 'topology' islands becomes difficult, and maybe impossible, if there are several combinations in independent clades!

Solutions?

Of course, the best solution to the problem of multiple trees in TreeFitter without using weights is a new version that deals with polytomous trees. I guess that the resistance to polytomous trees is because they 'imply' simultaneous speciation. I do not hold that kind of idea ;), and I have no reason to think that polytomous trees support such an interpretation. Even if that is the interpretation, it is better than weighting (by the way, if you think that polytomous trees imply multiple speciation, then weighting implies fractional speciation! I think that is a more problematic idea than 'multiple' speciation!).

But before a new version of TreeFitter--or a similar program--arrives, it is necessary to find a solution to the problem of multiple trees. I am not happy with the solution that I propose here, but I have no other idea, so here I go...

Use an Adams consensus to detect the terminals/clades that produce the multiple trees, and remove them from the analysis, so you keep the stable part of the topology. I do not like the removal of evidence, but it seems safer than relying on the biased solution of tree weights. Maybe some would want to re-run the analysis, but I think it is preferable to use the reduced tree, as it is based directly on the whole evidence; the effect of data removal is then, I hope, minor, as the stable part of the original topology is conserved. Also, TNT [11, 12] has several tools to identify moving taxa, so an analysis of the trees with TNT provides several ways to find stable topologies.
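As a rough illustration of what detecting the moving terminal means, here is a brute-force stand-in. It is neither an Adams consensus nor one of the TNT tools: it simply prunes each terminal in turn and counts how many distinct topologies remain among the seven 'new' trees. All function names are hypothetical helpers written for this sketch, and the approach only makes sense for small sets of trees.

```python
def parse_newick(text):
    """Parse a label-only Newick string into nested tuples of strings."""
    text = text.strip().rstrip(";")
    pos = 0
    def node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1
            children = [node()]
            while text[pos] == ",":
                pos += 1
                children.append(node())
            pos += 1
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]
    return node()

def prune(tree, taxon):
    """Remove one terminal and suppress any node left with a single child."""
    if isinstance(tree, str):
        return None if tree == taxon else tree
    kept = [c for c in (prune(child, taxon) for child in tree) if c is not None]
    if not kept:
        return None
    return kept[0] if len(kept) == 1 else tuple(kept)

def canonical(tree):
    """Order-independent text form of a tree."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(sorted(canonical(c) for c in tree)) + ")"

def terminals(tree):
    """Set of terminal labels of a tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(terminals(child) for child in tree))

def distinct_after_pruning(trees, taxon):
    """Number of different topologies left when one terminal is removed."""
    return len({canonical(prune(tree, taxon)) for tree in trees})

# The seven 'new' trees from the TreeFitter example.
trees = [parse_newick(s) for s in [
    "(1,((6,(7,(8,9))),(10,(4,(3,(2,5))))))",
    "(1,((6,(7,(8,9))),(2,(3,(4,(5,10))))))",
    "(1,((6,(7,(8,9))),(2,(3,(5,(4,10))))))",
    "(1,((6,(7,(8,9))),(2,(3,(10,(4,5))))))",
    "(1,((6,(7,(8,9))),(2,((3,10),(4,5)))))",
    "(1,((6,(7,(8,9))),(2,(10,(3,(4,5))))))",
    "(1,((6,(7,(8,9))),((2,10),(3,(4,5)))))",
]]

scores = {taxon: distinct_after_pruning(trees, taxon) for taxon in terminals(trees[0])}
most_unstable = min(scores, key=scores.get)
print(most_unstable, scores[most_unstable])
```

In this data set, removing '10' leaves only two topologies (the two islands), while removing any other terminal leaves at least five, so '10' is the terminal to drop before giving the reduced trees to TreeFitter.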

A second problem is that in TreeFitter extinctions have a cost greater than 0 [6], so removing terminals could increase the extinction value, but I think that this is a minor problem compared with giving weight to ambiguous data. Actually, we can think of the cost increase due to the new 'extinctions' as the penalty for the ambiguity of the data.

Weighting trees in other methods

As far as I know, there are other phylogenetic methods/programs that weight trees, and they suffer from the same flaws as tree weighting under TreeFitter.

The first is the majority rule consensus. It seems that the main reason to prefer the majority rule consensus is that it provides results with more resolution. But, as the examples show [2, 7], the resolution created by a majority rule tree is coupled more with the ambiguity generated by a particular topology skeleton. Maybe a more interesting solution could be the use of a minority rule consensus (note that supported clades always appear, as their frequency is 100%); an advantage is that at least it acknowledges that the number of instances of a clade is not related to the 'support' of the clade. A comparison between the majority and minority rule consensus can show the parts of the tree that are somewhat unstable. But I think that this kind of analysis is better performed using a combination of a strict consensus and an Adams consensus [8], or several of the tools from TNT [11, 12].

Under Bayesian analysis, the frequency of clades is recognized as a 'posterior probability' (a more catholic interpretation might be clade support). Bayesian analysis differs from a typical consensus because it makes a consensus from trees with different optimality values. But all explored topologies are taken as they are found, and branches are never collapsed, so they are subject to the problems of the majority rule consensus [2, 7]. Then, in cases where a particular terminal/clade is producing ambiguity, the final topology favors the ambiguous topology. This can produce some illogical results when the data set, even with a single optimal tree, is not very decisive [9], with several trees of near-optimal fit. In that case, it is possible that the method prefers the ambiguous topologies from the sub-optimal trees over the topology of the optimal one, even with high probability values! I recommend that you read [2] for a full criticism of Bayesian analysis.

There is also another form in which the majority rule consensus is used: for measuring support using resampling (like jackknife or bootstrap). In this case the use of majority rule reflects the amount of support for each clade. First, a strict consensus tree is built from each resampled matrix analyzed; then the final majority rule consensus shows the number of times in which a supported clade appears. In this case there is no bias against or for a particular clade, because the trees from each resampled matrix are collapsed. This is not the case for PAUP [2, 10], because in PAUP the trees are weighted in each resampled search, so the majority rule consensus is a majority rule consensus of several individual, un-collapsed trees, and it is fully prone to the ambiguity problems of the majority rule consensus.
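The difference between the two ways of counting can be sketched with toy data. Everything here is hypothetical: the two 'replicates' are invented trees written as nested tuples, and the functions are not the code of TNT or PAUP, only an illustration of collapsing each replicate to its strict consensus versus giving each un-collapsed tree a weight of 1/n within its replicate.

```python
from collections import Counter

def clades(tree):
    """Set of non-trivial clades (frozensets of terminals) of a nested-tuple tree."""
    found = set()
    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(walk(child) for child in node))
        found.add(members)
        return members
    root = walk(tree)
    found.discard(root)          # the clade with all terminals is uninformative
    return found

def collapsed_frequencies(replicates):
    """Clade frequencies when each replicate is reduced to its strict consensus."""
    counts = Counter()
    for trees in replicates:
        strict = set.intersection(*(clades(t) for t in trees))
        counts.update(strict)
    return {clade: n / len(replicates) for clade, n in counts.items()}

def weighted_frequencies(replicates):
    """Clade frequencies when every tree counts 1/n within its replicate."""
    counts = Counter()
    for trees in replicates:
        for tree in trees:
            for clade in clades(tree):
                counts[clade] += 1 / len(trees)
    return {clade: n / len(replicates) for clade, n in counts.items()}

# Hypothetical resampling output: replicate 1 finds a single tree,
# replicate 2 finds three equally optimal trees (terminal "d" floats).
replicates = [
    [("out", ("a", ("b", ("c", "d"))))],
    [("out", ("a", ("b", ("c", "d")))),
     ("out", ("a", (("b", "d"), "c"))),
     ("out", ("a", (("b", "c"), "d")))],
]

collapsed = collapsed_frequencies(replicates)
weighted = weighted_frequencies(replicates)
cd = frozenset({"c", "d"})
print("clade (c,d) collapsed:", collapsed.get(cd, 0.0))
print("clade (c,d) weighted: ", round(weighted.get(cd, 0.0), 3))
```

In this toy case the clade (c,d) is supported by only one of the two replicates, so the collapsed count gives 0.5; the weighted count also credits it with one third of the ambiguous replicate, raising it to about 0.667.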

[1] Ronquist, F. 2001. TreeFitter, program and documentation. Available at: http://www.ebc.uu.se/systzoo/research/treefitter/treefitter.html
[2] Goloboff, P.A., Pol, D. 2005. Parsimony and Bayesian phylogenetics. In: Albert, V.A. Ed. Parsimony, phylogeny, and genomics. Oxford univ., Oxford, pp. 148-159.
[3] Page, R. D. M. 1993. Component 2.0, program and documentation. Available at: http://taxonomy.zoology.gla.ac.uk/rod/cpw.html
[4] Page, R. D. M. 1994. TreeMap 1.0, program and documentation. Available at: http://taxonomy.zoology.gla.ac.uk/rod/treemap.html
[5] Ronquist, F. 1996. DiVa, program and documentation. Available at: http://www.ebc.uu.se/systzoo/research/diva/diva.html
[6] Ronquist, F. 2003. Parsimony analysis of coevolving species associations. In: Page, R.D.M. Ed. Tangled trees. Chicago Univ., Chicago, pp. 22-64.
[7] Sharkey, M.J., Leathers, J.W. 2001. Majority does not rule: the trouble with majority-rule consensus trees. Cladistics 17: 282-284. doi: 10.1006/clad.2001.0174
[8] Kearney, M. 2002. Fragmentary taxa, missing data, and ambiguity: mistaken assumptions and conclusions. Systematic biology 51: 369-381. doi: 10.1080/10635150252899824
[9] Goloboff, P.A. 1991. Homoplasy and the choice among cladograms. Cladistics 7: 215-232. doi: 10.1111/j.1096-0031.1991.tb00035.x
[10] Swofford, D. 1998. PAUP, program and documentation, Sinauer, Sunderland (USA).
[11] Goloboff, P.A., Farris, J.S., Nixon, K.C. 2008. TNT, program and documentation. Available at: http://www.zmuc.dk/public/phylogeny/TNT/
[12] Goloboff, P.A., Farris, J.S., Nixon, K.C. 2008. TNT, a free program for phylogenetic analysis. Cladistics, in press. doi: 10.1111/j.1096-0031.2008.00217.x

Wednesday, May 14, 2008

Hennig 27, Oct 28-31 (2008), Tucumán, Argentina

www.hennig27.com.ar

Friday, April 18, 2008

Vouchers, types and specimens

Prompted by the closure of the Utrecht Herbarium, Chris Taylor wrote two interesting posts about the nature of type specimens and vouchers (of molecular studies). In both cases Chris remarks on the importance of a comparison specimen to resolve some taxonomic problems.

In a more implicit way, the post about vouchers points out that molecular tools can be very important but, in one way or another, to make a proper 'molecular taxonomy' you need to rely on a proper morphological (classic) taxonomy.

In a post that I made some time ago, Chris pointed out some of those remarks to me, and actually I think that my dislike of type specimens is not about the types themselves, but about the bad practice around them; specifically, that many specialists from the old days (and maybe some current ones) proposed hundreds of new species based on telegraphic descriptions.

The other thing about type specimens that I do not like is that there seems to be a 'single specimen taxonomy'; I think that every specimen in a collection is equally valuable. For me, more important than the type specimen information is the list of 'examined specimens', which usually includes individuals of different sexes and ontogenetic stages.

But I think that the new generation of taxonomists is moving in the right direction. First, there are new approaches intended to store and provide access to information on a specimen basis; I especially like [1]. Even if the data are not made public (I prefer public data, but I understand that a taxonomist wants to keep his/her data after a long work of several years, for at least the same amount of time!), it allows a quick reexamination of the material.

The second point, which sadly is not very popular in third world countries (or even in first world ones), is that every, I repeat every, taxonomic contribution should be made under a rigorous phylogenetic framework. That is: character discussion, published data matrix, and a classification based on a phylogenetic tree.

Character discussions allow a more objective definition of the examined characters, as well as a terminology that is coherent within the study, because a phylogenetic character must be the same character across several species.

The data matrix shows in an easily readable form (especially if you have a phylogenetic data editor; there are several free ones on the net) the characters examined for a particular terminal (that is, a set of examined specimens), which shrinks the usually lengthy (and boring to read) description section of the paper. That section can be reserved for characters that are not used in the cladistic analysis (for example, colors, measurements, proportions), for explanations of particular character scorings, for biological information (such as distribution and ecological aspects), for links to figures/pictures (electronic or in print), and for the list of examined specimens. When a character is scored, the author of the matrix is saying that he/she saw that character in the species; the matrix also includes the list of characters not seen (usually typed as '?'). Note that the matrix is an excellent way to fuse the information stored with the several specimens examined, so linking data from the specimens and from the matrix is direct [1].

Finally, a cladogram-based classification allows the information from the analysis to be kept tightly bound to the taxonomy. If someone proposes a new genus, the cladogram will show that what differs from the other genera is really a different group, and not a highly apomorphic group inside a previously established genus. A classification needs evidence that supports each proposed taxon, and the cladogram is the well-known way to find evidence of grouping.

[1] Ramírez, M. J. et al. 2007. Linking of digital images to phylogenetic data matrices using a morphological ontology. Systematic Biology 56: 283-294. doi: 10.1080/10635150701313848

Wednesday, April 16, 2008

We can't get characters, but we can get states

Another piece for the seminars, this time about phylogenetics.

Ramírez, M. J. 2007. Homology as a parsimony problem: a dynamic homology approach for morphological data. Cladistics 23: 588-612. DOI: 10.1111/j.1096-0031.2007.00162.x

I read it shortly after it was posted as an early online version, and I did not want to talk about it, but as it was proposed for the seminar, here is my own view of the paper.

Homology for some morphological structures is sometimes straightforward within a group but, as we move to more inclusive scopes, the interpretation becomes blurred. For example, we know well that the legs of insects are all the same legs, and we also know that jointed legs are homologous within all Arthropoda, but which is the equivalent of the second pair of insect legs in Myriapoda? In vertebrates, the homology of cranial bones is nearly direct within each 'class', but the comparison of the cranial bones of fishes (especially the fossil ones) with the cranial bones of other groups is fairly complicated.

So Martín [Ramírez] gives us a two-step way to deal with such cases. The first step is a formalization of the classic way to deal with characters: a comparison of the possible states and their implications. But he highlights that the whole decision should be made in a context that evaluates several possible alternatives and sets a specific cost for each one, as a way to choose among them; the most parsimonious one is preferred. In this vein, his work is very similar to Agnarsson and Coddington [1], and in my opinion easier to grasp.

But in [1] you make the choice and then go on to the standard cladistic analysis. Martín does not make the decision; he wants the simultaneous analysis to select the best possible arrangements, in a framework directly derived from molecular 'dynamic homology' [2, 3]. Under a strict dynamic framework each topology would indicate a specific arrangement for the morphology but, as Martín notes, in a direct difference from DNA, not all arrangements are valid. So he limits the scope to a set of previously defined morphological 'alignments' and chooses the most parsimonious one.

Although Martín's description of the problem is more adequate than [1], Agnarsson and Coddington are far better in keeping homology decisions and parsimony analysis separate. When choosing homologous characters, the main objective is to find characters that are the same, and you can use several tools of morphological analysis to do it. If there are some doubts, then it seems better to leave the potential unions separate, or fused but with a lower weight than the other, well established characters [4]. You can use a particular weighting scheme to find the homologs, but it is not necessary to use the same one in the construction of the cladogram.

As is seen in every character discussion, you can have plenty of reasons to decide about a character (sometimes such a discussion includes what happens with alternative codings), but claiming that the choice was made because it fits the best cladogram found... it does not seem to be a good reason.

And it is not a good reason! Why? Because a character claim based solely on the cladogram is just like homoplasy: you can only speak about it because of the cladogram, so it is an ad hoc hypothesis [5]. 'Dynamic homology', in the molecular sense or in the morphological one proposed by Ramírez, is ad hoc in both cases. It is not a coincidence that Martín found that, under his method, the justification of parsimony as the minimization of ad hoc hypotheses is not easily followed, and that methods based directly on homoplasy, like implied weights [6], produce strange results.

I think that the paper has great value for its first part, which can be integrated with the proposal of [1]. But, as with most of the justifications of 'dynamic homology', Martín trades a fully coherent minimization of ad hoc hypotheses of homoplasy [5] for a 'minimization of steps'.

[1] Agnarsson, I., Coddington, J.A. 2007. Quantitative tests of primary homology. Cladistics 24: 51-61. DOI: 10.1111/j.1096-0031.2007.00168.x
[2] Wheeler, W.C. 1996. Optimization alignment: the end of multiple sequence alignment in phylogenetics? Cladistics 12: 1-9. DOI: 10.1111/j.1096-0031.1996.tb00189.x
[3] Wheeler, W.C. et al. 2006. Dynamic homology and phylogenetic systematics: a unified approach using POY. AMNH, New York. Freely available: http://research.amnh.org/scicomp/pdfs/wheeler/Wheeler_etal2006b.pdf
[4] Neff, N. 1986. A rational basis for a priori character weighting. Syst. Zool. 35: 110-123. JSTOR link: http://www.jstor.org/pss/2413295
[5] Farris, J.S. 1983. The logical basis of phylogenetic analysis. In: Advances in Cladistics, vol. 2 (Platnick, N.I., Funk, V.A., Eds.). Columbia, New York. Pp. 7-36.
[6] Goloboff, P.A. 1993. Estimating character weights during tree search. Cladistics 9: 83-91. DOI: 10.1111/j.1096-0031.1993.tb00209.x

Addendum
Of course I do not deny the role of previous analyses and of checking different alternative codings. They are part of the tools with which morphologists make their homology decisions.

Monday, April 14, 2008

C--

Well, this post is more about programming than about biogeography... but if you are interested, I wrote there about some of my recent experiences with C++...

Continue reading

Monday, March 31, 2008

Dispersion strikes back... but in the wrong direction

Just two weeks ago, a seminar on biogeography and phylogenetics started here in Tucumán. We were about 8 people talking about a biogeographic paper (next week it will be a phylogenetic one). So here are my own impressions of the paper (I am posting them late because I was out of town for the Holy Week break).

Sanmartín, I., van der Mark, P., Ronquist, F. 2008. Inferring dispersal: a Bayesian approach to phylogeny-based island biogeography, with special reference to the Canary Islands. Journal of Biogeography 35: 428-449. DOI: 10.1111/j.1365-2699.2008.01885.x

Personally I do not like Bayesian methods in phylogenetics, as they are based on several flawed principles [1], but I will try to focus on the other aspects of the framework proposed by Sanmartín et al. instead of attacking the Bayesian principles.

Like many others, Sanmartín et al. attack 'vicariance' biogeography on the grounds that it only counts vicariance and ignores dispersal. I think the story is somewhat different: vicariance biogeographers are fully aware of dispersal, but they cannot find grounds on which to base a general dispersal pattern. Vicariance, on the other hand, provides a fully explicable pattern that is shared by several unrelated organisms, so dispersal can only be addressed group by group. In the case of islands without any connection to land, however, vicariance cannot be used as a general explanation. But can dispersal provide one?

Sanmartín et al. argue that the answer is 'yes', but instead of showing why such a conclusion arises they jump to their own model of dispersal. The causal mechanism that would put several different organisms under the same model is never given, so their own assertion that dispersal can produce 'concerted' (i.e. general) patterns, which is the most interesting question they raise, is never supported. What is striking is that they themselves show several different models of dispersal proposed for Canarian taxa.

Moreover, the model developed is a symmetrical one. If there is in fact a common dispersal route, produced by an oceanic current for example, it can be masked by a model that presupposes that both dispersal directions are equally probable.
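To make the masking argument concrete, here is a minimal toy sketch in Python. It is not the model of Sanmartín et al.: the island labels, the event counts and both helper functions are invented for illustration, and the "rates" are just proportions of events. It only shows that forcing one shared rate per pair of islands erases any directional signal.

```python
# Hypothetical counts of dispersal events between two islands.
# The numbers are invented for illustration only.
EVENTS = {("A", "B"): 9, ("B", "A"): 1}

def asymmetric_rates(counts):
    """One value per direction: the proportion of all events going that way."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

def symmetric_rates(counts):
    """One shared value per island pair, as a symmetric model would assume."""
    total = sum(counts.values())
    rates = {}
    for (src, dst), n in counts.items():
        back = counts.get((dst, src), 0)
        # Both directions are pooled and then split equally.
        rates[(src, dst)] = (n + back) / (2 * total)
    return rates

asym = asymmetric_rates(EVENTS)
sym = symmetric_rates(EVENTS)
print("asymmetric:", asym)
print("symmetric: ", sym)
```

With these invented counts the directional version keeps the strong A to B bias, while the symmetric version gives both directions the same value, so a route driven by a current would be invisible.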

Then they jump into the model frenzy: they are so immersed in the power of model-based approaches that they raise 'new questions' that were irrelevant before their model. The most striking one is the 'carrying capacity' of their model. Carrying capacity is a term borrowed from ecology, and in this case it is more or less equated with raw island richness. This parameter, which seems to originate from the 'island biogeography' of Wilson, may be of interest to ecologists, but its relevance to the search for common biogeographical patterns is never clearly shown (and I do not think it has any empirical meaning). Yet now it becomes an important question: most of the discussion in Sanmartín et al. is about carrying capacities, not about dispersal models, which taxa best fit the model, dispersal rates for different groups, or common routes of dispersal. Sanmartín et al. turn parameter calibration into the research answer, losing the important questions along the way.

When presenting the results, Sanmartín et al. contradict many of their initial reasons for choosing a likelihood model. The first is that the likelihood of each model allows a selection among models, but as the model (and its parameters) with the best likelihood seems to be illogical, they move on to sub-optimal models. They claim that there is not enough data, but when is there enough data? How can I select a model guided only by whether its results are unexpected? Why not limit the model to certain values if we are afraid of illogical answers? I think that the results of Sanmartín et al. actually undermine their claims.

Sadly, the paper is more like a speech from a politician on campaign: all about the promise of wonderful results that likelihood models might open up, while the actual use of the model in the empirical case is disappointing.

If a dispersal biogeography based on 'concerted' patterns of dispersal is to be used, its users should answer several questions: (1) Why expect a 'common' dispersal pattern? (2) Is their model adequate for the desired answer? (3) Is the pattern only spatial (i.e. a common dispersal route) or spatial and temporal (i.e. a common dispersal route, at roughly the same time)? (4) What about the taxa that depart from the selected model? Question (1) is never answered by Sanmartín et al.; the answer to question (2) is a clear no (the symmetrical model directly contradicts their aim); maybe they could give an answer to question (3), but they are more interested in parameter estimation than in biological questions; and question (4) is never answered, even though they accept that several different models have been proposed for different taxa in the Canaries.

[1] Goloboff, P.A., Pol, D. 2005. Parsimony and Bayesian phylogenetics. In: Albert, V.A. Ed. Parsimony, phylogeny, and genomics. Oxford univ., Oxford, pp. 148-159.

Friday, February 29, 2008

A new look!

It is really nice! The page of the Willi Hennig Society has a new design, far better than the old one :)... I hope some new features (like data matrices, additional data, comments on papers, an open manuscript repository, blogs from leading cladists...) will come soon, but the first step is cool!

Another page was launched, but so far I cannot see it :P. It is the Encyclopedia Of Life (EOF, a horrible acronym, I read it as 'End Of File'). I read the review by Rod Page (see also the review by Chris Taylor), and I hope that many of the problems shown by Rod will be solved soon (they have 50 million US$ for 5 years!!)

And the site for the XXVII WHS meeting (in Tucumán, my new home town :D!) will be launched soon, stay tuned! ;)

Wednesday, January 30, 2008

Taxonomy: it's time to change!

I just read a post by Christopher Taylor about a 'taxonomic problem' with Drosophila melanogaster. The curiosity of the case prompted me to write this post, which I have had in mind for a long time.

One of the main advantages of a classification is information retrieval. Biological classification allows us to find the relationships and characters of a given species. There is a set of laws to manage the classification, known as the codes (there are different codes for animals, plants, bacteria, etc.).

In this age of the internet and massive databases, one could think that taxonomy is ready to make a direct jump, but unfortunately that does not seem to be the case. The codes are really old, and many taxonomists are afraid that changing the rules would collapse the system.

I do not think that taxonomists should leave it to data designers to propose a new system, but I think that a deep collaboration, coupled with some changes in the structure of the codes, will speed up taxonomic research. The main objective of taxonomists is to describe and classify species, not to be lawyers!

I identify some problems that I think are the most important.

Linnean names

Linnean names are nice; they provided some form of order in a time of great explorations around the world (18th-19th centuries), when Europeans were surprised by the diversity they found. Many things have changed since that time. Now we have several phylogenetic methods, which show how arbitrary a clade name can be (I like this post about the subject), and of course the tree of life has far more divisions than there are Linnean ranks. Now we have databases to store names and search algorithms to find them quickly.

A rank-free taxonomy seems more adequate for storing our information about the phylogeny of species. That is not an embrace of the 'phylocode', as I prefer a taxonomy based on specific characters, or better, on a combination of characters and topology, instead of a topology-based taxonomy (note that apomorphy-based definitions are also based on topology). Also, ranked names have a special utility: they can serve as landmarks when browsing the tree of life.

The nested nature of biological classification allows a rank-free taxonomy without the pains imagined by [1], and using “landmarks” helps in searching (and writing) abstracts and titles of papers and, of course, in searching for a particular clade, as you can see using GenBank. Of course, genus and species can be (and I think should be) obligatory. That allows continuity with Linnean taxonomy.
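As a rough illustration of how nesting plus a few landmark names is enough for browsing, here is a small Python sketch. The slice of the tree is deliberately simplified (many intermediate clades are skipped), and the set of names chosen as landmarks, like the function names, is only a hypothetical choice of mine.

```python
# Child -> parent links for a deliberately simplified slice of the tree of life.
PARENT = {
    "Drosophila melanogaster": "Drosophila",
    "Drosophila": "Drosophilidae",
    "Drosophilidae": "Diptera",
    "Diptera": "Insecta",
    "Insecta": "Arthropoda",
}

# Hypothetical choice of names kept as landmarks; no ranks are stored anywhere.
LANDMARKS = {"Diptera", "Insecta", "Arthropoda"}

def lineage(taxon):
    """Return the ancestors of a taxon, from nearest to most inclusive."""
    ancestors = []
    while taxon in PARENT:
        taxon = PARENT[taxon]
        ancestors.append(taxon)
    return ancestors

def nearest_landmark(taxon):
    """Return the first landmark found when walking up the tree, or None."""
    for ancestor in lineage(taxon):
        if ancestor in LANDMARKS:
            return ancestor
    return None

print(lineage("Drosophila melanogaster"))
print(nearest_landmark("Drosophila melanogaster"))
```

The nesting alone gives the full lineage of any name, and the landmark lookup gives a familiar name to put in a title or a search box.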

If Linnean names are mostly landmarks, then many of the laws for synonyms, and so on, need to be changed toward a more practical usage.

Synonyms

Perhaps the most annoying characteristics of taxonomy are synonymy, rank changes, and several related problems. These problems are really a burden of the current codes, and can be fixed by changing the laws, without harming the current classification.

Synonyms are really bad for databases, as the same entity is labeled with two different names or (worse!) the same name is applied to different entities; in other cases, the ranges of the synonyms seem to overlap.
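The two problems are easy to see in a toy table of names. The sketch below is in Python; the names, the numeric entity identifiers and the function names are all invented for illustration.

```python
from collections import defaultdict

# (name, entity id) pairs; both names and ids are hypothetical.
RECORDS = [
    ("Name a", 1),
    ("Name b", 1),  # second name for entity 1: a synonym
    ("Name c", 2),
    ("Name c", 3),  # same name for another entity: a homonym
    ("Name d", 4),
]

def synonyms(records):
    """Entities labeled with more than one name."""
    names_by_entity = defaultdict(set)
    for name, entity in records:
        names_by_entity[entity].add(name)
    return {e: sorted(n) for e, n in names_by_entity.items() if len(n) > 1}

def homonyms(records):
    """Names applied to more than one entity."""
    entities_by_name = defaultdict(set)
    for name, entity in records:
        entities_by_name[name].add(entity)
    return {n: sorted(e) for n, e in entities_by_name.items() if len(e) > 1}

print(synonyms(RECORDS))
print(homonyms(RECORDS))
```

Neither the name nor the entity can serve as a unique key in such a table, which is exactly what makes a lookup by name unreliable.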

Of course, no matter what phylocoders say, the same problem applies to their 'phylogenetic' names: as each new revision is published, the names continue but with a completely different meaning. At least in traditional taxonomy it is possible to reject some names.

It seems better to relax some of the naming rules, so that a new classification is clearly different from the original one. If a huge family is discovered to be massively paraphyletic, I think there is no point in allowing the original name to survive; it is simply a brutally wrong name.

For example, “reptiles” for a long time included many stem amniotes, therapsids (“mammal-like reptiles”), anapsids (such as turtles), lepidosaurs (lizards and snakes) and crocodiles. They are the amniotes, excluding birds and mammals. I cannot see a reason to retain that ugly name. The solution is simple: synonymize it with its monophyletic equivalent (Amniota), and never use it again. That way a modern paper cannot be confused with the old ones. The only valid use of the name would be if someone showed that the original reptiles are monophyletic.

Names can be used instead to show different possible classifications. For example, the different arrangements of Arthropoda receive different names: Atelocerata (myriapods, like centipedes, and insects) vs. Pancrustacea (crustaceans and insects), Mandibulata (myriapods, crustaceans and insects) vs. Schizoramia (crustaceans and arachnids). These names identify different entities, each one attached to a different phylogenetic proposal (note that for phylocoders, Atelocerata, Pancrustacea and Mandibulata can be the same entity!).

Types

The reptile and arthropod examples are possible because there are no types fixing the names of those groups. At lower levels, family names and genus names are ruled by the typification of names, which brings more confusion than solutions to current research.

Here is an example: Lygaeidae was a large family of bugs (Insecta: Heteroptera). For a long time it was believed to be paraphyletic [2], but this was only demonstrated by Henry [3]. He proposed a whole new classification of lygaeids, elevating no fewer than 7 subfamilies to family rank and restricting the meaning of Lygaeidae to only 3 subfamilies. Is the new Lygaeidae the same as that of, say, 20 or 30 years ago? Of course not. So typification created name stability (as topology does for phylocoders), but at the cost of losing the utility of the name.

As in the case of reptiles, there is no Lygaeidae any more. Any new reference in the literature to 'Lygaeidae' only creates confusion with the original meaning of the group (which is a synonym of the superfamily Lygaeoidea). Of course you can use 'Lygaeidae sensu Henry', but that is only a clumsy (and error-prone) way of giving a new name.

Another nice example is provided by the previously mentioned post by Chris. It is about Drosophila. In this case, usage can go against taxonomic practice. For me, the solution is simple: no more Drosophila. But there is a huge number of users of the name who surely do not care about taxonomy, and whose publications outnumber the taxonomic publications on Drosophila, and that is when a commission can rule. The practical solution is to keep Drosophila for the molecular people. Solutions in cases of conflict would then be guided by practical options, rather than by some old described type.

This, of course, allows classification changes to be made without 'using the types' (as long as the original characters are examined!). And it frees taxonomists from depending on some poorly known species (or even specimen) to name genera and families.

Speaking about types...

Type specimens have a particular property: they are kept in the first world, but they were collected in the third world. The reason for this is historical, but the consequences are felt more acutely today. Museums are measured by their number of 'type specimens', and there are particular policies for borrowing those specimens (only one loan at a time, certifications, curator's permission...). Also, some taxonomists, especially in the distant past, simply named a huge number of new species, giving only a superficial description (some colors, some illustrations of genital parts) and basing the whole 'description' on a type species designation. This old practice continues in several obscure papers in the third world.

So typification, although it seems a reasonable way to be objective, is more harmful. I guess that types will never go away, but there are now several nice prospects for getting free of loan policies. Approaches like [4], with a great emphasis on characters and images, can change the situation. Researchers far away from the type specimens can see high-quality pictures of several parts of the specimen. A side consequence is that the concept of type gets lost: it is impossible to photograph every part from a single specimen, and it is possible that the specimen ends up destroyed, so the new typification would be more responsible, as it would be based on several different specimens.

Moreover, de-typification increases the value of general collections, which lies more in the quantity and quality (e.g. fresh specimens) of the material available than in a particular specimen collected in 1816, saved from a fire in 1874, badly damaged by poor curation in 1903...

In fact, there is a lot of phylogenetic work done without type specimens: the bulk of molecular phylogenies and, I guess, several morphological ones. I think they do it in a very objective and testable way. If they can live without types, why can't classical taxonomists?

The data matrix

A second question, following from the previous section, is how non-typified research can be objective. The answer is that, instead of focusing on a particular specimen, phylogeneticists use a data matrix of taxa and characters.

A type-free taxonomy enforces the use of well-delimited characters; it is the only way to show the reality of a new designation. Look at some recent revisions with a phylogenetic analysis and compare them with a revision without one (for examples of both, see the publications of the AMNH). The character matrix allows a quick examination of several characters: it is possible to see which state each character has in each taxon. New technologies (see [4]) link specimens, characters and images for each cell entry. By default a matrix provides a multi-entry key, so the identification tools are better.
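Here is a minimal Python sketch of how a taxon-by-character matrix doubles as a multi-entry key. The taxa, characters and states are invented, "?" stands for a missing entry (treated here as compatible with any observation, which is my own assumption), and `identify` is a hypothetical helper, not a function from any existing program.

```python
# Hypothetical character matrix: taxon -> {character: state}.
MATRIX = {
    "Taxon A": {"wings": "present", "antennae": "long", "tarsal segments": "3"},
    "Taxon B": {"wings": "present", "antennae": "short", "tarsal segments": "3"},
    "Taxon C": {"wings": "absent", "antennae": "long", "tarsal segments": "?"},
}

def identify(matrix, observed):
    """Return the taxa compatible with the observed states.

    Characters can be given in any order and in any number,
    which is what makes the matrix a multi-entry key.
    """
    matches = []
    for taxon, states in matrix.items():
        compatible = all(
            states.get(character, "?") in (state, "?")
            for character, state in observed.items()
        )
        if compatible:
            matches.append(taxon)
    return sorted(matches)

print(identify(MATRIX, {"antennae": "long"}))
print(identify(MATRIX, {"wings": "present", "antennae": "long"}))
```

Unlike a dichotomous key, you can start from whatever character is visible in the specimen at hand and narrow the list down as more characters are scored.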

There are some nice things about using a character matrix. The first is an increasing interest in providing well-defined characters [4]. As characters are used for phylogenetic analysis, they have to be stricter. Other features, like color patterns or length measurements, would be restricted to a simpler description. Another advantage is that it provides a quick classification of a new species.

A nice real example is provided by dinosaur paleontologists. They publish quick, short reports in high-profile journals (like Nature or Science) with brief descriptions, but as they have a large database of characters, several points of the anatomy of the newly described fossil are immediately 'published', long before the detailed description appears in a more specialized journal.

Thinking on databasing

Of course, using a character matrix has a direct consequence for storage: well-defined characters and images enter smoothly into a database [4].

I think that the new challenges of the 'biodiversity crisis', as well as of the 'taxonomic crisis', can be met by thinking in terms of data storage. How can we store the data more efficiently? How can we link taxonomic and publication data? How would changes in our knowledge of phylogeny affect previously published data, and how can the harm be minimized?

It is important for a new taxonomy to keep the great advances made since Linnaeus's time; starting from scratch is clearly the wrong solution. But taxonomists should also be able to make some concessions in their practice, and update it to the new data architecture of the world.

It is time for taxonomy to become a useful discussion about actual data. Facing the massive extinction that humans are producing around the world, it seems weird that a taxonomic study should begin by searching for old papers from the 18th century whose only utility is that they provide a name; the descriptions, characters and other things in those papers are of low value. (By the way, taxonomy is the only field of science that continues to use such old data. For historians, old texts are the source of investigation; for taxonomists it is a more lawyer-like activity of searching for an 'old case'. Catalogs are nice curiosities, and surely valuable for historians, but what is their actual value for taxonomists? They are important only because they point to the papers that establish a name.)

If taxonomy is the main objective, then useful data storage is the main objective. Bookkeeping and law courts are not part of knowing biodiversity.

References
[1] Dominguez, E., Wheeler, Q. 1997. Taxonomic stability is ignorance. Cladistics 13: 367-372. doi: 10.1111/j.1096-0031.1997.tb00325.x
[2] Schuh, R. T., Slater, J. A. 1995. True Bugs of the World. Cornell Univ. New York
[3] Henry, T. J. 1997. Phylogenetic analysis of family groups within the infraorder Pentatomomorpha (Hemiptera: Heteroptera), with emphasis on the Lygaeoidea. Annals of the Entomological Society of America 90: 275-301
[4] Ramírez, M. J. et al. 2007. Linking of digital images to phylogenetic data matrices using a morphological ontology. Systematic Biology 56: 283-294. doi: 10.1080/10635150701313848

Friday, January 18, 2008

DataTube? I can't wait!!

I just read wonderful news at WiredScience: Google will be hosting open scientific data on the web [http://research.google.com]! WiredScience says that the interface will be similar to YouTube's, with annotations and comments.

I can't wait to see many morphological matrices and morphological pics! I think that the excellent proposal of Ramírez et al. [1] can be coupled with that project :).

[1] Ramírez, M. J. et al. 2007. Linking of digital images to phylogenetic data matrices using a morphological ontology. Systematic Biology 56: 283-294. doi: 10.1080/10635150701313848
