Find now at http://wfleabase.org/prerelease/dpulex_jgi060905/

Genome assembly used: dpulex_jgi060905 (release date 2006/09/05)
Nine eukaryote proteome annotations (tBLASTn) [1]
  including Human, Mouse, Fruitfly, Worm, Arabidopsis, Yeast, others
Gene predictor (SNAP/homology with location, transcripts, proteins) [2]

[1] Model organism protein blast matches (HSP groups):

Source proteomes are taken from model organism databases.
  modDM = Drosophila melanogaster (flybase) proteome
  modMM = Mus musculus (MGI) proteome
  modCE = C. elegans (WormBase) proteome
  modSC = Sacc. cer. (SGD) proteome

Protein tBLASTn matches are grouped by HSP overlap.
Names are protein ID + HSP group.

ID Key:
  CG00000_G1 = primary Gene match (best hit),
  CG0000_G2 = secondary Gene match on same scaffold,
  CG0000_S[3..n] = tertiary and further matches of same protein on same scaffold as _G1,
  CG0000_o[1..n] = further matches on other scaffolds.

Matches at p <= 1e-3 are collected, secondary matches that mostly overlap better matches are removed.
HSPs are grouped by both protein fragment overlap (same part of protein) and target genome overlap.
HSP groups include several gene events: alternate splice exons in same gene, tandem and distant duplications, new genes composed of parts of several other genes, as well as computational artifacts.

See here for further details and use in Gene Ontology groups
D. Gilbert, may 06

Example HSP GFF:
  source field = MOD database; score = blast bitscore;
  Parent= only for HSPs (no ID) with ID as above.
  tkey = protein target HSP group (location subset);
  tloc = protein target location;
  align = no. aa residues aligned

##gff-version 3
scaffold_13340 modMM HSP 13833673 13833942 79.0 - . Parent=MGI:88491_G1;tkey=MGI:88491-HSP:23-120;tloc=23-120;align=98
scaffold_13340 modDM HSP 11004286 11004747 230 + . Parent=CG8236_G1;tkey=CG8236-PA-HSP:1-153;tloc=1-153;align=154

----------------------------------------------
[2] Gene Predictions:

2.
SNAP gene predictions with protein homology guidance (DGIL_SNO is the source tag).
This SNO version is of better quality than the (1) SNP version, by sensitivity and specificity statistics.
(Not done yet: 1. SNAP gene predictions with no homology guidance, pure ab-initio; DGIL_SNP is the source tag.)

Recent gene prediction qualities for twelve Drosophila genomes have been assessed and summarized here, forming the basis for these choices:
http://insects.eugenes.org/species/news/genome-summaries/genepredictions-compared.txt

SNAP (Korf 2004) guided by protein homology evidence is one of the best ab-initio predictors when (a) new genes are sought, and (b) there are no close relatives with an experimentally verified genome annotation. SNAP works well across the range of eukaryote genomes (plant to animal, small to big) with minimal homology data. The drawback is that SNAP overpredicts, but in a way that identifies gene-like features better than other predictors. SNAP with homology evidence produces a much closer gene mapping where there is homology, yet retains unique gene calls in non-homologous regions.
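
For readers who want to work with the HSP GFF records described in section [1], the following is a minimal Python sketch of how one line might be read and how the Parent ID suffix might be interpreted according to the ID key. The function names (parse_hsp_line, classify_group) and the labels "primary", "secondary", "same_scaffold" and "other_scaffold" are hypothetical, chosen only for this illustration; they are not part of the release. The sketch also assumes that fields are separated by whitespace and that attribute values contain no spaces, as in the example lines above.

```python
import re

# Suffix pattern from the ID key in section [1]: _G1, _G2, _S3.., _o1..
_SUFFIX = re.compile(r"([GSo])(\d+)")


def classify_group(group_id):
    """Split an HSP group ID such as 'CG8236_G1' into (protein_id, kind, rank).

    The 'kind' labels are shorthand invented for this sketch.
    """
    protein_id, _, suffix = group_id.rpartition("_")
    match = _SUFFIX.fullmatch(suffix)
    if not protein_id or match is None:
        raise ValueError(f"unrecognized HSP group ID: {group_id!r}")
    letter, rank = match.group(1), int(match.group(2))
    if letter == "G" and rank in (1, 2):
        kind = "primary" if rank == 1 else "secondary"
    elif letter == "S":
        kind = "same_scaffold"
    elif letter == "o":
        kind = "other_scaffold"
    else:
        raise ValueError(f"unexpected suffix in HSP group ID: {group_id!r}")
    return protein_id, kind, rank


def parse_hsp_line(line):
    """Parse one HSP line into a dict; return None for blank or '#' lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # Assumption: nine whitespace-separated columns, attributes last.
    fields = line.split(None, 8)
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = fields
    attributes = dict(
        item.strip().split("=", 1)
        for item in attr_text.split(";")
        if item.strip()
    )
    return {
        "seqid": seqid,
        "source": source,        # MOD database tag, e.g. modMM
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": float(score),   # blast bitscore
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


example = (
    "scaffold_13340 modMM HSP 13833673 13833942 79.0 - . "
    "Parent=MGI:88491_G1;tkey=MGI:88491-HSP:23-120;tloc=23-120;align=98"
)
record = parse_hsp_line(example)
print(record["source"], record["score"],
      classify_group(record["attributes"]["Parent"]))
```

Splitting the ID at the last underscore keeps protein IDs that themselves contain a colon or underscore intact. Real GFF3 files are tab-delimited and may carry escaped characters in attributes, so a dedicated GFF parser would be the safer choice for anything beyond quick inspection.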