Monday, November 26, 2007

A brief sketch of a program for phylogenetics

Last week was a good one for me: I received a grant to begin a graduate scholarship :) Since my main objective is tied to the development of a program, and given my own interest in phylogenetics and, to a lesser extent, in software engineering, I will try to show here how I think phylogenetics-related software could be implemented.

Do not take this as gospel! The software projects I have been involved in to date are small and short-term, so I usually take some (ugly) shortcuts that solve the immediate problem but leave little room for modification. The sketch here takes into account the possibility of further refinement of each part, so it is far more modular than my small projects.
Every program is composed of two parts. The first, the messiest and most boring part to write, is the user interface (UI). The second is the program itself (the core). In small projects you can mix both parts without much pain, but as the program grows, leaving them mixed makes it difficult to maintain: you would have to rewrite several parts of the code just to introduce a new input format.

Ideally you would make the core totally independent of the UI, but usually that is not possible (unless your program does nothing). The effort of building a well-designed communication layer between the UI and the core always pays off in the future.

I start my design with the core, thinking at every moment about how the core would communicate with a black-box UI. In phylogenetics you have two major parts: the matrix and the tree. Both are composed of smaller elements. The matrix is a list of labels, usually with some data attached (a taxon-character matrix, a taxon-distribution matrix). The tree is a topology in which each vertex is associated with the matrix labels.

You could think that the tree and the matrix are the same thing; after all, the tree is only a specific way to sort the terminals in the matrix (it seems that TreeFitter[1] works that way). But there is a difference: the terminals in the matrix are unique, whereas you can have many trees associated with the same matrix. Terminal 'a' is an independent node in each of two trees, but both nodes can point to the same element in the matrix.
Unlike terminals, which have fixed content, the internal nodes of a tree carry the same sort of data as terminals, but their content is variable and depends on the tree topology. So we could implement a single fixed matrix and one (or more) variable matrices; when a tree is no longer in use, its variable matrix can be discarded.
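
A minimal sketch of that layout, in Python. The language and every name here (Matrix, Node, Tree, internal_states) are assumptions made for illustration, not part of any existing program. The fixed matrix holds one row per unique terminal, and each tree keeps its own variable matrix for internal nodes:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Matrix:
    """Fixed matrix: one row of observed data per unique terminal label."""
    rows: dict = field(default_factory=dict)  # label -> list of states

    def add_terminal(self, label, states):
        if label in self.rows:
            raise ValueError(f"duplicate terminal: {label}")
        self.rows[label] = list(states)


@dataclass(eq=False)  # nodes are compared and hashed by identity
class Node:
    """A tree vertex; a terminal carries a label that points into the matrix."""
    label: Optional[str] = None
    children: list = field(default_factory=list)

    def is_terminal(self):
        return not self.children


class Tree:
    """One topology over a shared Matrix, plus its own variable matrix."""

    def __init__(self, matrix, root):
        self.matrix = matrix
        self.root = root
        self.internal_states = {}  # Node -> states; discarded with the tree

    def states(self, node):
        if node.is_terminal():
            return self.matrix.rows[node.label]   # fixed content
        return self.internal_states.get(node)     # variable content, may be None


matrix = Matrix()
matrix.add_terminal("a", "0101")
matrix.add_terminal("b", "0111")
matrix.add_terminal("c", "1100")

# Two different topologies share the same fixed matrix.
tree1 = Tree(matrix, Node(children=[Node("a"), Node(children=[Node("b"), Node("c")])]))
tree2 = Tree(matrix, Node(children=[Node(children=[Node("a"), Node("b")]), Node("c")]))
```

Terminal 'a' is a separate Node object in each tree, but both resolve to the same row of the matrix. Deleting a tree drops its internal_states and leaves the matrix untouched.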

To communicate between the core and the UI, at the moment I prefer string streams. They can be found in the standard library (although under slightly different names). That way the UI always translates between formats and sends the core specific strings, for example the tree in parenthetical format, so tree construction does not have to interact with files or console input.

In that way, the UI is a translation hub for the console, files, command dialogs, etc., that can be implemented for different systems (e.g. Windows, Linux, command line), while the core remains the same.
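
To make the split concrete, here is a small Python sketch with one core function and two interchangeable front ends. All names (load_matrix, from_file, from_typed_lines) and the one-terminal-per-line "label states" input format are hypothetical, chosen only for this example:

```python
import io


def load_matrix(stream):
    """Core: build {label: states} from lines of the form 'label states'."""
    matrix = {}
    for line in stream:
        if line.strip():                 # skip blank lines
            label, states = line.split()
            matrix[label] = states
    return matrix


def from_file(path):
    """File front end: the core never sees the path, only the open stream."""
    with open(path, encoding="utf-8") as handle:
        return load_matrix(handle)


def from_typed_lines(lines):
    """Console or dialog front end: wrap what the user typed in a string stream."""
    return load_matrix(io.StringIO("\n".join(lines)))


typed = from_typed_lines(["a 0101", "b 0111", "", "c 1100"])
```

Supporting a new system or input source then means writing another small front end; load_matrix does not change.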

[1] Ronquist, F. TreeFitter, program and documentation, available online at: http://www.ebc.uu.se/systzoo/research/treefitter/treefitter.html
