Computer cladistics / ¡Cladística a la lata!: abril 2009

martes, abril 28, 2009

Filogenia de 73060 eucariotas

Finalmente, el Behemot vio la luz. Nuestro paper de un análisis de parsimonia de 73060 especies de eucariotas (y 7800 caracteres mol+morf) acaba de ser puesto en linea en Cladistics [doi: 10.1111/j.1096-0031.2009.00255.x].

Pablo hizo un trabajo brutal optimizando cada cosa en las búsquedas de arboles con TNT. Y todos aquí trabajaron intensamente para poder manejar toda esa cantidad de datos!

Al principio me sorprendió la enorme exactitud de los árboles encontrados, porque el set de datos esta lleno de entradas faltantes. Además, me puse muy contento porque al incluir los datos morfológicos, aún a esta escala tan grande, los resultados fueron mejores que usando únicamente datos moleculares!

Hace uno meses, esto fue posteado en dechronization:

[Cassy Dunn] hizo un llamado convincente acerca de la necesidad de técnicas análiticas revolucionarias ahora que entramos en una era en que los alcances de la computación serán más limitantes que la disponibilidad de datos.

Yo creo que nuestro trabajo muestra exactamente lo contrario: nuestras herramientas de búsqueda son lo suficientemente buenas, pero no hay datos suficientes (el conjunto de datos más grande es SSU con 20000 especies, y solo un puñado de genes tiene más de 10000 especies).

Otra lección?.... No necesitamos los super-árboles!

lunes, abril 27, 2009

A phylogeny of 73060 eukaryotes

Versión en español

Finally, the behemoth has seen the light :). Our paper with a parsimony analysis of 73060 eukariotic species (and 7800 mol+morf characters) was just published (as “online early”) in Cladistics [doi:10.1111/j.1096-0031.2009.00255.x].

Pablo does a wonderful work optimizing every aspect of the tree-searches in TNT. And all of the guys worked really hard to manage that amount of data!

At first I was surprised with the high accuracy of the trees founded, because the data set is full of missing entries. Also, I fill happy because the inclusion of morphological data, even at this huge scale, produce better results than molecules alone!

Just few months ago, this was posted in dechronization:

He [Cassey Dunn] makes a convincing case for the idea that a revolution in analytical techniques will be needed as we enter an era during which computational capabilities will be more limiting than data availability.

I think our study shows exactly the inverse: that our actual search capabilities are good enough, but we do not have sufficient data (the largest gene set is SSU with 20000 species, and a handful of genes has more than 10000 species).

The second lesson?... We do not need super-trees!

viernes, abril 24, 2009

Algorithms for phylogenetics 0a: Bitfields

Versión en español

Intro

I want to start a series of post about the algorithms used in phylogenetic analyses. I feel a bit disappointed with the entry on computational phylogenetics (and related subjects) in wikipedia, that are somewhat biased towards model-based methods and "bioinformatics", with a poor representation of algorithms for phylogenetics.

There are two outstanding examples of algorithms for phylogenetics on the web. The book of Wheeler et al. [1], and, in a similar vein, a manuscript by DeLaet [2]. But their presentation of the algorithms is somewhat general, that is be nice for educational purpouse, but in some cases, far away from the actual implementation in computer software.

As the presentation of the algorithms in [1] and [2] is excellent, I want to focus on the implementation. The idea is to go from simpler to more complex algorithms (as far as I can comprehend them xD).

I want to acknowledge Pablo Goloboff. He always have the time to discuss with me about algorithms (and any topic on methodological phylogenetics) :D. Of course, any error in my implementations are my own errors!

As this series progress, I will post the source code on google source or sourceforge, if someone wants to check it, requires an knowledge of C programming language. Thanks to Rubén who showme some tips with the HTML :).

Bitfields

Characters are a basic component of phylogenetic analysis. For simplicity assume that we only have a single character in each terminal. Each character can be stored into a variable, assigning to them 0, 1, 2... as indicated by the character state. This idea is implicit in the original description of Wagner's "groundpland divergence" [3], and in Farris' formalization [4, see 5]. Then operations between characters would be operations of ranges. But it is more simpler to think about characters as a set of states [4, 6]. Then operations between characters would be set operations. This set has a peculiar property: they are perfectly boolean, that is, a taxon has or has not a particular state. This can be translated in two positive consequences: (i) character states can be stored as a bitfield: (ii) character operations are bifield operations.

Computers store numbers as binary numbers. A bitfields use this characteristic to represent a set of positioned bits. This example shows it better:

State    binary representation       "Number" (human)
0        0001                        1
1        0010                        2
2        0100                        4
3        1000                        8
0, 1     0011                        3
1, 3     0101                        5

As can be seen, each state has his on bit position, and the states combination is just the union of its states. Many programming languages include operations to work directly on bits (not, and, or and xor), then the translation from set operations to bit operations is direct.

In this series I will use char to store character states. In C, char has 8 bits, then only 8 states can be stored. I define the bitfield with a typedef, so changing the number of bits on the bitfield (and then the number of states) is easy.

#define BITS_ON_STATE 8
#define ALL_BITS_ON 255
typedef char STATES;

This small example reads the caracters of a taxon, from a file (in), and store it into an array of characters (chars). The number of characters is nc, and the functions getc (in stdio.h) reads a single character from the file, SkipSpaces ignores the blank characters, and isdigit (ctype.h) detects if the character is a number.

for (j = 0; j < nc; ++ j) {
error = SkipSpaces (in);
if (error != NO_ERROR) return error;
c = getc (in);
if (isdigit (c))
  chars [j] = 1 < < (c - '0');
else if ((c == '?') || (c == '-'))
  chars [j] = ALL_BITS_ON;
else return BAD_FORMAT;
}

The operation chars [j] 1 < < (c – '0') substract the ASCII value of '0' (30) from the ASCII value stored in c (a number, so it is form 30 to 39). The resulting value is used to shift the bits of 1. For example if a '0' is readed the operations are:

chars [j] = 1 < < (30 – 30)
chars [j] = 1 < < 0
     [0000 0001 < < 0 = 0000 0001]
chars [j] = 1

If a 5 is readed

chars [j] = 1 < < (35 – 30)
chars [j] = 1 < < 5
      [0000 0001 < < 5 = 0010 0000]
chars [j] = 32

The 1 is shifted five bits to the left.

Note that inapplicables and unknowns are coded with all bits on.

This work nicely with non additive characters. For additive characters, it is possible to “recode it” as several binary characters [7][8], then it is possible to use the character as independent binary characters.

There are more sophisticated usages of bitfields, packing several characters in a single variable (for example, eight 4-bit characters can be stored into a singel 32-bits variable) [8][9]. This packing allows an speed up during searches, because several characters can be examined simultaneously.

References
[1] Wheeler, W. et al. 2006. Dynamic homology and phylogenetic systematics: a unified approach using POY. American Mus. of Natural History, published in cooperation with NASA. online: http://research.amnh.org/scicomp/pdfs/wheeler/Wheeler_etal2006b.pdf
[2] DeLaet, J. 2005. Pseudocode for some tree search algorithms in phylogenetics. Manuscript online: http://www.plantsystematics.org/publications/jdelaet./algora.pdf
[3] Wagner, W.H. 1961. Problems in the classification of ferns. In: Recent advances in botany. Toronto Univ. Press. pp. 841-844.
[4] Farris, J.S. 1970. Methods for computing Wagner trees. Syst. Zool. 19: 83-92.
[5] Goloboff, P.A. et al. 2006. Continuos characters analyzed as such. Cladistics 22: 589-601. doi: 10.1111/j.1096-0031.2006.00122.x
[6] Fitch, W.M. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20: 406-416.
[7] Farris, J.S. et al. 1970. A numerical approach to phylogenetic systematics. Syst. Zool. 19: 172-189.
[8] Moilanen, A. 1999. Searching for most parsimonious trees with simulated evolutionary optimization. Cladistics 15: 39-50. doi: 10.1111/j.1096-0031.1999.tb00393.x
[9] Goloboff, P.A. 2002. Optimization of polytomies: state and parallel set operations. Mol Phyl. Evol. 22: 269-275. doi: 10.1006/mpev.2001.1049

Algoritmos en filogenia 0a: Campos de bits

English version

Intro

Quiero iniciar una serie de posts sobre los algoritmos usados en análisis filogenético. El motivo principal fue ver el árticulo sobre “filogenética computacional” (y los tópicos de sistemática) en wikipedia, que pues me parecen algo sesgados hacía los métodos basados en modelos, y en general los algoritmos de filogenia no están muy representados.

Hay un par de ejemplos muy buenos en i-net de algoritmos de filogenia, uno es el libro de Wheeler et al. [1], y en una vena similar, un manuscrito de DeLaet [2]. Sin embargo, los algoritmos que presentan son más bien generales, lo cual es muy bueno desde el punto de vista didáctico, pero en varios casos, alejado de lo que hacen los programas en la actualidad!

Como esas fuentes son muy buenas, espero enfocarme más en el aspecto de la implementación de los algoritmos. La idea es pasar de los más sencillos a los más complejos (en la medida en la que yo pueda comprenderlos xD).

Antes de empezar, pues quiero agradecer a Pablo Goloboff, que siempre ha estado dispuesto a discutir de algoritmos conmigo :). Por supuesto, los errores que tengan mis implementaciones son mis errores!

A medida que avance, iré posteando el código fuente en google code o sourceforge, por si desean examinarlo, requiere que sepan algo de C. ~~El HTML tiene problemas con los símbolos mayor-que y menor-que, por los que los mostrare en estos posts como texto, en vez de usar el símbolo~~. Gracias a Rubén por el tip del HTML :)!

Campos de bits

Un componente básico del análisis filogenéticos son los caracteres. Asumamos que solo tenemos un carácter para cada terminal. Uno bien podría guardar cada estado en una variable, por ejemplo asignandole 0, 1, 2... según corresponda. Una idea así esta implícita en la descripción inicial del “groundpland divergence” de Wagner [3], y más o menos, en la formalización de Farris [4, ver 5]. Las operaciones entre caracteres serían operaciones de intervalos. Pero es mucho más sencillo ver a los caracteres como un conjunto de estados [4, 6]. Las operaciones entre caracteres serian operaciones de conjuntos. Esos conjuntos tienen además una característica particular: son perfectamente booleanos, es decir, un taxon tiene o no un determinado estado. Esto tiene dos consecuencias positivas: (i) podemos guardar el conjunto de estados como un campo de bits; (ii) las operaciones entre caracteres son operaciones de los campos de bits.

Las compus guardan los datos como números binarios. En un campo de bits, usamos esa particularidad para representar conjuntos de bits. Un ejemplo, aclara mejor la situación:

Estado    Representación binaria    "Número" (humano)
0        0001                        1
1        0010                        2
2        0100                        4
3        1000                        8
0, 1     0011                        3
1, 3     0101                        5

Como se ve, cada estado individual tiene su propio bit, y la combinación de estados es simplemente la unión los estados. Muchos lenguajes de programación incorporan operaciones para trabajar directamente con bits (not, and, or y xor) con lo cual la traducción de operaciones de conjuntos a operaciones de bits es directa.

En esta serie voy a utilizar un char para guardar los estados de carácter. En C, char es de 8 bits. Yo lo defino con un typedef, de manera que después es posible cambiar el número de bits en el bitfield.

#define BITS_ON_STATE 8
#define ALL_BITS_ON 255
typedef char STATES;

En este pequeño ejemplo, se leen los caractes de un taxon, desde un archivo (in), y se guardan en un array de caracteres (chars). El número de caracteres es nc, y se usan las funciones getc para leer caracteres de texto, SkipSpaces para ignorar los espacios de texto, y isdigit para reconocer si el carácter de texto es un número.

for (j = 0; j < nc; ++ j) {
error = SkipSpaces (in);
if (error != NO_ERROR) return error;
c = getc (in);
if (isdigit (c))
 chars [j] = 1 < < (c - '0');
else if ((c == '?') || (c == '-'))
 chars [j] = ALL_BITS_ON;
else return BAD_FORMAT;
}

En la operación chars [j] = 1 < < (c – '0') se resta el valor ASCII de '0' (30) al valor ASCII guardado en c (es un número, va de 30 a 39). El valor de esa resta lo usamos para correr los bits de 1. Así por ejemplo, si se lee un '0', la operación seria:

chars [j] = 1 < < (30 – 30)
chars [j] = 1 < < 0
    [0000 0001 < < 0 = 0000 0001]
chars [j] = 1

Si leemos un 5

chars [j] = 1 < < (35 – 30)
chars [j] = 1 < < 5
     [0000 0001 < < 5 = 0010 0000]
chars [j] = 32

Corrimos el 1 cinco bits a la izquierda.

Observen que los no aplicables o desconocidos, se codifican con todos los bits encendidos.

Esto funciona bastante bien para caracteres no aditivos. Para caracteres aditivos, es posible "recodificarlos" como varios caracteres binarios [7][8], una vez recodificados es posible utilizarlos como si fueran caracteres no aditivos independientes.

Hay formas más sofisticas de usar los campos de bits, uniendo varios caracteres en una sola variable (por ejemplo, en una variable de 32 bits, se pueden usar 8 caracteres de 4 bits, como nucleotidos) [8][9]. Ese empaquetado permite una evaluación muchisimo más rápida de los caracteres durante las búsquedas, puesto que en cada operación se pueden revisar simultáneamente varios caracteres.

Referencias
[1] Wheeler, W. et al. 2006. Dynamic homology and phylogenetic systematics: a unified approach using POY. American Mus. of Natural History, published in cooperation with NASA. online: http://research.amnh.org/scicomp/pdfs/wheeler/Wheeler_etal2006b.pdf
[2] DeLaet, J. 2005. Pseudocode for some tree search algorithms in phylogenetics. Manuscrito online: http://www.plantsystematics.org/publications/jdelaet./algora.pdf
[3] Wagner, W.H. 1961. Problems in the classification of ferns. En: Recent advances in botany. Toronto Univ. Press. pp. 841-844.
[4] Farris, J.S. 1970. Methods for computing Wagner trees. Syst. Zool. 19: 83-92.
[5] Goloboff, P.A. et al. 2006. Continuos characters analyzed as such. Cladistics 22: 589-601. doi: 10.1111/j.1096-0031.2006.00122.x
[6] Fitch, W.M. 1971. Toward defining the course of evolution: minimum change for a specific tree topology. Syst. Zool. 20: 406-416.
[7] Farris, J.S. et al. 1970. A numerical approach to phylogenetic systematics. Syst. Zool. 19: 172-189.
[8] Moilanen, A. 1999. Searching for most parsimonious trees with simulated evolutionary optimization. Cladistics 15: 39-50. doi: 10.1111/j.1096-0031.1999.tb00393.x
[9] Goloboff, P.A. 2002. Optimization of polytomies: state and parallel set operations. Mol Phyl. Evol. 22: 269-275. doi: 10.1006/mpev.2001.1049

Computer cladistics / ¡Cladística a la lata!

martes, abril 28, 2009

Filogenia de 73060 eucariotas

lunes, abril 27, 2009

A phylogeny of 73060 eukaryotes

viernes, abril 24, 2009

Algorithms for phylogenetics 0a: Bitfields

Algoritmos en filogenia 0a: Campos de bits

Acerca de mí

Journals / Revistas

Links / Enlaces

Archivo del Blog

Etiquetas