Thursday, December 20, 2007

Relational databases for phylogenetic data

A long time ago, when I was studying computer science, the only paradigm for databases was the relational database. Then the fast growth of the internet, more powerful computers, and popular search engines like Google put text-based databases on top. Text-based databases are the paradigm used to build some databases for phylogenetic data, such as GenBank and TreeBASE.

In text-based databases the emphasis is on somewhat standardized input files that keep all the data needed for the search. You surely know the files retrieved from GenBank, or the matrices and trees from TreeBASE: they have several fields, such as classification and the identification of the sequence or character. A string-matching algorithm is then used to find exact and similar matches to the user's query.

So the main line of research on a text-based database is the implementation of matching algorithms. Rod Page's blog iPhylo has several posts and links to papers and manuscripts (here, here) on that subject. But I think that text-based databases are the wrong approach.

I would point out the following:
  • Ambiguity: even if the upload form is based on a 'cut and paste' template (or a wizard), the uploader is left with the responsibility of filling it in adequately. As Page remarks, in many TreeBASE matrices the terminal names are not proper scientific names.
  • Irrelevant information: when you retrieve a file, it is usually full of information that you don't want, while other information is not provided. As searches are based on text matches, several notes, comments and other fields contain information whose only function seems to be to confuse the search algorithms (remember, they are string-matching algorithms!).
  • Formatting: because the database is text based, the compromise needed to make the files usable by several programs makes the entry fields rigid, and it is difficult to modify them or add new fields or information. For example, in GenBank geographic information is not mandatory (as far as I know), even for phylogeographic datasets! Geographic data and examined material seem impossible to implement in TreeBASE using the NEXUS format.
  • Taxonomy: as a consequence of the rigid format, searches on alternative taxonomies and synonyms cannot be implemented, or require searches on other databases.
  • Phylogeny: why not try a 'tree structure' search on TreeBASE?
String matching is a nice tool for searching for keywords around the net, or for some string structures, such as sequence matching in GenBank (I guess that this is the main reason to make it a text-based database!), but it seems problematic for taxonomy and phylogeny databases.

Relational databases are a different concept from text-based databases. They are based on several independent tables connected through key fields, in several cases using secondary tables to match keys from different tables. The search engine works on specific keys (for complex searches) and on specific fields from each table.

My own idea of the structure of the database, as a sketch based on my usual queries, is like this:

Of course, a proper database design would need several years of development, with tons of interviews with taxonomists, to yield a product that people around the world could use properly! (Although he has several posts endorsing string-matching databases, this post by Rod Page offers several nice ideas about the integrative work of a phylogenetic database; of course, there are several things that I don't like xD).

Here I explain some of the tables. The table 'taxon' stores the current nomenclature of a taxon name, whether a species or a supra-specific entity: its name and author, and it may include the diagnosis (linked to character entry!), a 'phylogenetic concept', a link to pictures, the type specimen (with a link to specimen!) and things like that. The tables 'synonyms' and 'classification' are secondary tables that store only a relationship between two taxa: synonyms (which could include the reason for the synonymy), and the next more inclusive taxon for a particular taxon (which might include the author of this inclusion relationship). As each taxon is independent, you could include as many synonyms as you know, or as many proposed classifications.

The 'specimen' table could store specific information about the examined material, links to pictures of the specimen, the locality where the specimen was collected, and so on.

There is also a character section. The table 'character' stores the name, description, and maybe the bibliography and pictures of the character. With 'character equivalences' it is possible to store equivalent characters used in other studies, allowing cross-references between studies with different taxonomic scopes. The 'character entry' table stores the specific information for a character and a taxon: it could be a single cell from a morphology matrix, or a specific sequence fragment.
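The tables described above can be sketched as a small SQLite schema. This is only a rough sketch under my own assumptions: every table and column name is hypothetical ('character_def' stands for the 'character' table), the taxon names are placeholders, and a real design would need far more fields.

```python
import sqlite3

# Hypothetical schema: all names are assumptions made for illustration.
SCHEMA = """
CREATE TABLE taxon (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    author TEXT,
    type_specimen_id INTEGER REFERENCES specimen(id)
);
CREATE TABLE specimen (
    id INTEGER PRIMARY KEY,
    taxon_id INTEGER REFERENCES taxon(id),
    locality TEXT
);
-- Secondary tables: each row is only a relationship between two taxa.
CREATE TABLE synonym (
    taxon_id INTEGER NOT NULL REFERENCES taxon(id),
    synonym_id INTEGER NOT NULL REFERENCES taxon(id),
    reason TEXT
);
CREATE TABLE classification (
    taxon_id INTEGER NOT NULL REFERENCES taxon(id),
    parent_id INTEGER NOT NULL REFERENCES taxon(id),
    author TEXT
);
CREATE TABLE character_def (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    description TEXT
);
CREATE TABLE character_equivalence (
    character_id INTEGER NOT NULL REFERENCES character_def(id),
    equivalent_id INTEGER NOT NULL REFERENCES character_def(id)
);
-- One row per character and taxon: a matrix cell or a sequence fragment.
CREATE TABLE character_entry (
    character_id INTEGER NOT NULL REFERENCES character_def(id),
    taxon_id INTEGER NOT NULL REFERENCES taxon(id),
    specimen_id INTEGER REFERENCES specimen(id),
    value TEXT NOT NULL
);
"""


def build_db():
    """Create an empty in-memory database with the sketched tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def synonyms_of(conn, taxon_id):
    """Return the names recorded as synonyms of a taxon, sorted by name."""
    rows = conn.execute(
        "SELECT t.name FROM synonym s JOIN taxon t ON t.id = s.synonym_id "
        "WHERE s.taxon_id = ? ORDER BY t.name",
        (taxon_id,),
    )
    return [name for (name,) in rows]


conn = build_db()
# Placeholder names, not real taxa.
conn.executemany(
    "INSERT INTO taxon (id, name, author) VALUES (?, ?, ?)",
    [(1, "Aus bus", "Author A"), (2, "Aus cus", "Author B"), (3, "Aus", "Author A")],
)
conn.execute("INSERT INTO synonym VALUES (1, 2, 'same type specimen')")
conn.execute("INSERT INTO classification VALUES (1, 3, 'Author A')")
print(synonyms_of(conn, 1))
```

Because synonyms and classifications live in their own tables, adding another synonym or a competing classification is just one more row, and the 'taxon' rows are not touched.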

This design could help with queries and actions at which text-based databases fail. Ambiguity is reduced because each entry is a single item: uploaders would have to identify the particular nature of their entries explicitly. Text-based databases could be developed with that standard in mind, but here it is possible to keep strict species naming, as many synonyms as you want, and specific character names. A curation process for the taxonomy would be possible without harming the rest of the data and, as phylogenetic results would be entered in the form of a classification, the gap between phylogeny and classification [1] would be reduced.

When you perform a search you could retrieve only the specific information that you want, for example all head characters from Hymenoptera. Using the equivalences, the characters could be more or less organized, and using the classification table you could retrieve head characters used in many different studies, alternative characters that could match under alternative classifications, or characters that could be present because they are scored for more inclusive groups (for example, a character used to define Hexapoda or Arthropoda): a true information retrieval system based on classification [2].
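A query like 'all head characters from Hymenoptera' could be written roughly as follows. This is a minimal sketch, assuming a hypothetical 'body_region' column on the character table and toy data that I made up; the table names are also my own. It walks the classification table downwards with a recursive query and returns the characters scored for any included taxon.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, reduced tables; 'body_region' is an assumed column.
conn.executescript("""
CREATE TABLE taxon (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE classification (taxon_id INTEGER, parent_id INTEGER);
CREATE TABLE character_def (id INTEGER PRIMARY KEY, name TEXT, body_region TEXT);
CREATE TABLE character_entry (character_id INTEGER, taxon_id INTEGER, value TEXT);
""")

# Toy data: the character names and scores are invented for the example.
conn.executemany("INSERT INTO taxon VALUES (?, ?)",
                 [(1, "Hymenoptera"), (2, "Apidae"), (3, "Apis"), (4, "Coleoptera")])
conn.executemany("INSERT INTO classification VALUES (?, ?)", [(2, 1), (3, 2)])
conn.executemany("INSERT INTO character_def VALUES (?, ?, ?)",
                 [(1, "antenna segments", "head"),
                  (2, "wing venation", "thorax"),
                  (3, "mandible shape", "head")])
conn.executemany("INSERT INTO character_entry VALUES (?, ?, ?)",
                 [(1, 3, "0"), (2, 3, "1"), (1, 2, "1"), (3, 4, "1")])

QUERY = """
WITH RECURSIVE clade(id) AS (
    -- start at the named taxon ...
    SELECT id FROM taxon WHERE name = ?
    UNION
    -- ... and add every taxon classified inside it
    SELECT c.taxon_id FROM classification c JOIN clade ON c.parent_id = clade.id
)
SELECT DISTINCT ch.name
FROM character_entry e
JOIN clade ON e.taxon_id = clade.id
JOIN character_def ch ON ch.id = e.character_id
WHERE ch.body_region = ?
ORDER BY ch.name
"""


def characters_in(conn, taxon_name, region):
    """Names of characters of one body region scored within a taxon."""
    return [name for (name,) in conn.execute(QUERY, (taxon_name, region))]


print(characters_in(conn, "Hymenoptera", "head"))
```

Walking the same table in the other direction (from a taxon up to its more inclusive groups) would give the characters scored for Hexapoda or Arthropoda that might also apply.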

As Nixon et al. [3] remark, a single file format is not a good basis for a database. A table structure is preferable, together with report tools that can produce output in different formats: for example, retrieving sequences in GenBank format, TNT format, or POY (FASTA) format, or a distribution file ready to use in NDM.
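The idea of report tools can be sketched as a set of small writer functions that take the same rows from the database. This is only an illustration under my own assumptions: the function names are hypothetical, the sequences are placeholders, and I show only FASTA and a plain tab-separated table, because GenBank, TNT and NDM writers would need their full format rules.

```python
def to_fasta(rows, width=60):
    """Write (name, sequence) rows as FASTA, wrapping sequence lines."""
    lines = []
    for name, seq in rows:
        # Spaces in names are replaced so the header stays a single token.
        lines.append(">" + name.replace(" ", "_"))
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def to_tsv(rows):
    """Write (name, sequence) rows as a tab-separated table."""
    return "\n".join(f"{name}\t{seq}" for name, seq in rows) + "\n"


# One writer per output format; new formats are added here,
# without changing how the data are stored.
WRITERS = {"fasta": to_fasta, "tsv": to_tsv}


def report(rows, fmt):
    """Format the same rows with the writer registered for 'fmt'."""
    try:
        writer = WRITERS[fmt]
    except KeyError:
        raise ValueError(f"unknown report format: {fmt}") from None
    return writer(rows)


# Placeholder rows, as they might come out of a 'character entry' query.
rows = [("Aus bus", "ACGTACGT"), ("Aus cus", "ACGAACGT")]
print(report(rows, "fasta"))
```

The point of the design is that the file format is only a view of the tables, so supporting another program means adding a writer, not reformatting the stored data.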

The classification table could help to retrieve studies that support or reject a particular classification: you could find the evidence for one grouping and for the alternative classification. Particular algorithms need to be developed to translate a tree structure into a query--Page has posted about the subject frequently ;)--. But it is more powerful than a text-based search because the database itself holds the information needed (the hierarchical classification) to perform the search.
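One simple way to turn a grouping into a query is sketched below. It is my own assumption of how it could work, not an established algorithm from any of the cited sources: each study is a list of hypothetical (child, parent) rows from the classification table, a study supports a group if one of its nodes includes exactly those terminals, and it rejects the group if one of its nodes overlaps the group without containing it or being contained in it.

```python
from collections import defaultdict


def terminal_sets(edges):
    """Map each internal node to the set of terminals it includes.

    'edges' are (child, parent) pairs from one study and must form a tree.
    """
    children = defaultdict(list)
    for child, parent in edges:
        children[parent].append(child)

    def collect(node):
        if node not in children:  # a terminal includes only itself
            return {node}
        found = set()
        for child in children[node]:
            found |= collect(child)
        return found

    return {node: frozenset(collect(node)) for node in list(children)}


def evidence(edges, group):
    """Classify a study as 'supports', 'rejects' or 'silent' for a group."""
    group = frozenset(group)
    sets = terminal_sets(edges).values()
    if group in sets:
        return "supports"
    for members in sets:
        overlaps = bool(members & group)
        nested = members <= group or group <= members
        if overlaps and not nested:
            return "rejects"
    return "silent"


# Toy studies with placeholder terminals A, B, C.
study_1 = [("A", "AB"), ("B", "AB"), ("AB", "root"), ("C", "root")]
study_2 = [("B", "BC"), ("C", "BC"), ("BC", "root"), ("A", "root")]
study_3 = [("A", "root"), ("B", "root"), ("C", "root")]

for name, study in [("study 1", study_1), ("study 2", study_2), ("study 3", study_3)]:
    print(name, evidence(study, {"A", "B"}))
```

A real version would also have to deal with studies that sample different terminals, which is where the synonym and classification tables would come in.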

I hope that some day I become rich (I doubt it) or receive a grant--I hope ;)--to perform this huge task, or at least that someone around the net has similar ideas. Until then the only path is to keep suffering with some of the 'taxonomic' and 'phylogenetic' databases around the net...

P.S. As you may have noted, a database like this puts some burden on the researchers who upload the data... but if you go on field trips for months, examine material for hours, day after day, and write reports and manuscripts, then uploading information is part of the whole research!

[1] Franz, N.M. 2005. On the lack of good scientific reasons for the growing phylogeny/classification gap. Cladistics 21: 495-500.
[2] Farris, J.S. 1979. The information content of the phylogenetic system. Systematic Zoology 28: 483-519.
[3] Nixon, K.C., Carpenter, J.M., Borgardt, S.J. 2001. Beyond NEXUS: universal cladistic data objects. Cladistics 17: S53-S59.
