
Daphnia tandem genes problem: Rationale for analysis

Tandemgenes, or 'Tandy', is software developed to address the problem of predicting genes among tandem duplicates. Tandem duplicates can be nearly identical (95+% identity) and very close together (within intron distances of each other), and they represent very interesting biology.

This Tandy software is available in alpha state (not fully documented or debugged) at http://eugenes.org/gmod/tandy/

Nature rarely gives us clear and regular signals, and tandem genes are one such case. Barring genome mis-assembly, these are clearly biological signals, yet many genome informatics tools have problems modelling them adequately. Any software that relies on gapped alignment, from BLAST and BLAT through GeneWise, Exonerate and similar tools, can get confused in regions with repeated high-identity exons. Because these aligners treat introns as gaps, they will often skip over one exon to link its identical twin into a gene model.

The Daphnia pulex genome (2007 release; http://wfleabase.org/) appears rich in tandem duplicate genes, perhaps as many as 20% of its genes (see this Daphnia duplication summary). This is on the order of 50% more, or up to twice as many, as in another duplicate-rich genome, C. elegans. Early estimates from protein homology to Daphnia suggested a high rate of tandem genes; however, this evidence was not enough to measure duplicate genes accurately.

Early gene predictions were less promising for finding duplicates. A breakthrough in evidence came with application of the PASA EST analysis pipeline (http://pasa.sourceforge.net/), which identified many problem areas in the initial gene predictions (from genewise, fgenesh, snap); many of these prediction problems appeared to lie in tandem gene regions. PASA also reported some very confused EST assemblies, spanning large regions with many interconnected EST-exons. A notable PASA confusion case turned out to be the Daphnia hemoglobin cluster of 8 genes.

These errors lead to a catch-22: one cannot assess gene duplications well without good gene models, but getting good gene models in genomes rich in tandem duplicates is itself a problem.

To address this, the tandy approach is to work with exons rather than genes or proteins, since gene predictors typically call exons at a much higher success rate than whole genes. Having one good exon set among a cluster of tandem genes can be as useful as having many duplicate sets, if one can locate the others. BLAT, BLAST or similar tools, run at the exon level without gapping, do not run into the same mis-alignment problem as gene-level alignments.
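As a rough illustration of exon-level matching without gaps, the Python sketch below slides a single exon along a scaffold and reports every window that meets an identity cutoff. It is not how Tandy, BLAT or BLAST work internally; the function name ungapped_hits and the toy sequences are hypothetical, and only the 95% identity figure comes from the text above.

```python
def ungapped_hits(scaffold, exon, min_identity=0.95):
    """Return (start, end, identity) for each ungapped window matching the exon.

    Hypothetical helper for illustration only: a brute-force scan,
    forward strand only, with no gaps allowed inside a match.
    """
    n = len(exon)
    hits = []
    for i in range(len(scaffold) - n + 1):
        window = scaffold[i:i + n]
        matches = sum(a == b for a, b in zip(window, exon))
        identity = matches / n
        if identity >= min_identity:
            hits.append((i, i + n, identity))
    return hits


# Toy data: one exon, then a near-identical tandem copy (one mismatch).
exon = "ATGGCCAAGTTCGACCTGAA"
copy = "ATGGCCAAGTTCGACCTGAT"
scaffold = "TTTT" + exon + "GGGGGGGG" + copy + "CCCC"

for start, end, identity in ungapped_hits(scaffold, exon, min_identity=0.9):
    print(start, end, round(identity, 2))
```

Because each exon is matched on its own, both copies are reported as separate hits; nothing in the scan can bridge from one copy to its twin the way a gapped, gene-level alignment might.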

After scanning the genome for all matches to all predicted exons, the core of tandy's algorithm is to mark runs of duplicate exons, then combine and split them into new gene models using a heuristic based on (a) intergene versus intron distances, (b) runs of exon sets (e.g. exons 1,2,3 of a gene model that are repeated), and (c) gene start/stop exons and strand inversions. The final output of tandy is a GFF feature file identifying the duplication regions, the gene models (match features) and the exon matches (match_part or HSP) per gene model. Duplicates are classed as near (<15Kb) or far (every other duplicate on the same scaffold/chromosome); the output also records the set of gene predictions included and several quality measures.
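The splitting heuristic can be pictured with a small Python sketch. This is a simplified illustration under stated assumptions, not Tandy's actual code: the ExonHit record, the function names and the MAX_INTRON_BP cutoff are hypothetical, and gene start/stop exons are ignored. Only the 15Kb near/far boundary is taken from the text.

```python
from dataclasses import dataclass

NEAR_BP = 15_000       # near/far boundary from the text (<15Kb)
MAX_INTRON_BP = 2_000  # hypothetical intron-size cutoff, chosen for the example


@dataclass
class ExonHit:
    start: int
    end: int
    strand: str   # "+" or "-"
    exon_no: int  # position of the source exon within its gene model


def split_into_copies(hits, max_intron=MAX_INTRON_BP):
    """Group exon hits on one scaffold into putative gene copies.

    A new copy starts when the gap exceeds the intron cutoff, the strand
    flips, or the exon numbering stops advancing (e.g. 1,2,3 then 1 again).
    """
    copies, current = [], []
    for hit in sorted(hits, key=lambda h: h.start):
        if current:
            prev = current[-1]
            gap = hit.start - prev.end
            step = hit.exon_no - prev.exon_no
            advancing = step > 0 if hit.strand == "+" else step < 0
            if gap > max_intron or hit.strand != prev.strand or not advancing:
                copies.append(current)
                current = []
        current.append(hit)
    if current:
        copies.append(current)
    return copies


def classify_neighbours(copies, near_bp=NEAR_BP):
    """Label each adjacent pair of copies as 'near' or 'far'."""
    labels = []
    for left, right in zip(copies, copies[1:]):
        distance = right[0].start - left[-1].end
        labels.append("near" if distance < near_bp else "far")
    return labels


hits = [
    ExonHit(1000, 1100, "+", 1), ExonHit(1300, 1400, "+", 2), ExonHit(1600, 1700, "+", 3),
    ExonHit(1900, 2000, "+", 1), ExonHit(2200, 2300, "+", 2), ExonHit(2500, 2600, "+", 3),
    ExonHit(30000, 30100, "+", 1), ExonHit(30300, 30400, "+", 2),
]
copies = split_into_copies(hits)
print([len(c) for c in copies], classify_neighbours(copies))
```

In this toy input the second copy begins only 200 bp after the first, well within intron distance, so the split is triggered by the exon numbering restarting at 1 rather than by the gap. That is the situation in which a gapped gene-level alignment would tend to join the two copies into one model.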

There are several inadequacies in the current release. The hoped-for clean set of duplicate regions and models is still some way off, but the results are very useful as additional evidence. Tandy seems to find duplicate regions well; however, it can currently produce several overlapping regions, based on different exon duplicate sets. Some of this is real biology (different sets of tandem genes in a region), and some is artifact. It is successful in (a) splitting spuriously joined gene predictions in tandem regions, and (b) finding extra duplicate genes (or pseudogenes) that predictors miss. It produces too many partial gene sets, and is weak in what it calls a gene: unlike most predictors, it does not measure start, stop and donor/acceptor values of the gene models. Tandy has been tested on the well-studied genomes of C. elegans and Dros. melanogaster, on some other new Drosophila genomes, and on Daphnia pulex.


Don Gilbert, June 2007, gilbertd@indiana.edu