Understanding and using the Keyword Clustering Machine at IFOM

What is the input data and what does this program do?



The input of this tool is a list of sequence identifiers (Accession Numbers work well, as do Affymetrix probeset identifiers and similar), each followed by a list of tab-separated keywords that describe the putative or known function of the sequence (see the picture above). Please note that uploads and submissions to this service are limited to 500 Kbytes, which roughly corresponds to 5000 annotated sequences.
You can easily obtain a table of this kind from the output of the EST Annotation Machine at IFOM, starting from a plain list of Accession Numbers. Alternatively, you can extract it from an Excel file or a similar source. This software can only cluster keywords originally extracted from the PIR and Swissprot sequence databases.
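As an illustration of the input layout described above, here is a minimal Python sketch that reads such a table. It assumes one sequence per line, with the identifier in the first tab-separated field and the keywords in the remaining fields. The function name parse_annotations and the sample identifiers are hypothetical and are not part of the service.

```python
def parse_annotations(text):
    """Parse 'identifier<TAB>keyword<TAB>keyword...' lines into a dict.

    Assumed layout (not confirmed by the service): one sequence per line,
    identifier first, keywords in the remaining tab-separated fields.
    """
    annotations = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split("\t")]
        if not fields[0]:
            continue  # skip blank lines and lines without an identifier
        # Keep the keywords as a set; empty fields are ignored.
        annotations[fields[0]] = {field for field in fields[1:] if field}
    return annotations


# Hypothetical sample data, for illustration only.
sample = "X12345\tKinase\tATP-binding\nY00001\n\nZ99999\tKinase\tTransferase\n"
annotations = parse_annotations(sample)
```

A sequence with no keywords, such as Y00001 here, ends up with an empty keyword set, which makes it easy to spot before submitting the table.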
The Keyword Clustering Machine tries to identify the Sequence Identifiers that share the largest number of keywords and groups them in a representation (a hierarchical tree) that has long been familiar to biologists. Sequences that share the largest number of keywords are placed in the same node, followed by sequences that share fewer keywords, and so on up to the root node, which will probably link completely unrelated sequences. Please note that sequences that are not annotated with any keyword are discarded from the output.
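The grouping idea can be sketched in a few lines of Python. This is not the code used by the service: the distance measure (Jaccard distance between keyword sets) and the naive average-linkage merging are assumptions chosen for illustration, and the names jaccard_distance and cluster are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b|; 0.0 means identical keyword sets."""
    return 1.0 - len(a & b) / len(a | b)


def cluster(annotations):
    """Naive average-linkage clustering; returns a nested tuple tree."""
    # Sequences without keywords are dropped, as the tool is described as doing.
    items = {seq_id: kws for seq_id, kws in annotations.items() if kws}
    # Each cluster is a pair: (subtree, list of member identifiers).
    clusters = [(seq_id, [seq_id]) for seq_id in sorted(items)]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            members_i, members_j = clusters[i][1], clusters[j][1]
            # Average pairwise distance between the two clusters.
            total = sum(
                jaccard_distance(items[a], items[b])
                for a in members_i
                for b in members_j
            )
            dist = total / (len(members_i) * len(members_j))
            if best is None or dist < best[0]:
                best = (dist, i, j)
        _, i, j = best
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0] if clusters else None


# Hypothetical example: A and B share two keywords, C shares none, D is empty.
example = {
    "A": {"Kinase", "ATP-binding", "Transferase"},
    "B": {"Kinase", "ATP-binding"},
    "C": {"Ribosomal protein"},
    "D": set(),
}
tree = cluster(example)
```

In this example A and B are joined first because they share the most keywords, C is attached only at the root, and D is discarded. The loop recomputes every pairwise distance at each step, which is fine for a small illustration but far slower than a dedicated clustering program.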

How do I interpret the output?
The output is a hierarchical tree: every leaf is a Sequence Identifier, followed by the list of keywords that belong to that Accession Number. You can also obtain the same output with a number attached to each keyword, representing the frequency of that keyword in the source database. The non-redundant source keyword set contains around 1500 keywords, obtained by merging the contents of the "Keyword" fields of PIR and Swissprot. You can also request the Phylip output format, so that you can use your preferred tree visualization tool. The output can be sent as a link to a web-based repository at IFOM (this will be erased after ten days), as a PC or Mac email attachment, or directly in the body of the email message.
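Phylip tree files are normally written in the Newick parenthesis notation. Assuming that is the format meant here, the following Python sketch shows how a nested tree of identifiers maps onto such a string. The nested-tuple tree and the function to_newick are hypothetical, and identifiers containing parentheses, commas or semicolons would need quoting that this sketch omits.

```python
def to_newick(tree):
    """Render a nested tuple of identifiers in Newick notation (no branch lengths)."""
    if isinstance(tree, tuple):
        # An internal node: render each child and wrap the group in parentheses.
        return "(" + ",".join(to_newick(child) for child in tree) + ")"
    # A leaf: the sequence identifier itself.
    return str(tree)


# Hypothetical tree: A and B share a node, C joins them only at the root.
newick = to_newick(("C", ("A", "B"))) + ";"
```

The resulting string can be saved to a file and opened in a tree viewer that accepts Newick input.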



If you compare the keywords that belong to Sequence Identifiers found on the same node of the tree, you can easily identify the keywords they have in common and get a clue about the possible shared function of these sequences. You could also compare these "keyword profiles" with gene expression profiles and see to what extent they overlap. The underlying idea is that functionally related genes may share the same expression dynamics and regulation. Indeed, this tool was created as an aid to the functional interpretation of microarray data.
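Finding the keywords shared by the sequences on one node amounts to a set intersection. A minimal Python sketch, assuming the annotations are held as a dictionary from identifier to keyword set; the function common_keywords and the sample data are hypothetical:

```python
def common_keywords(annotations, seq_ids):
    """Return the keywords shared by every listed sequence identifier."""
    keyword_sets = [annotations[seq_id] for seq_id in seq_ids]
    if not keyword_sets:
        return set()
    return set.intersection(*keyword_sets)


# Hypothetical annotations for two sequences found on the same node.
node_annotations = {
    "A": {"Kinase", "ATP-binding", "Transferase"},
    "B": {"Kinase", "ATP-binding"},
}
shared = common_keywords(node_annotations, ["A", "B"])
```

Here the shared keywords are Kinase and ATP-binding, which would suggest a common function for the two sequences.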

But how much can I rely on this representation?
It really depends on the keywords you associate with every sequence, i.e. on your starting vocabulary. If you use the EST Annotation Machine at IFOM as the source of your input, the keywords will be extracted from protein sequences that are annotated as similar to the query cDNA sequence. In this particular case, function is assigned via sequence similarity. If you use a different system to generate the association between sequence identifiers and keywords, this will affect the output. Remember that, for the moment, you can only use Swissprot and PIR keywords as input.

Who are the people involved?
The list is growing and we will add many more features and people! However, as of March 2002, this is the development team:

The distance calculation (CALCDIST) and tree building (QCLUST) programs were originally written by John Brzustowski
The clustering program incorporates a translation of Fionn Murtagh's O(n^2) clustering code from his R/Splus multiv package