EvidentialGene/daphnia/daphnia_magna/Genes/earlyaccess/

Please consider donating toward my costs to produce this Daphnia magna gene set. A limitation on this project is my need for funds to support this work. My substantial effort in genome information engineering and dissemination in 2013-2015 has been without salary. To enable use of this work and future Daphnia genome informatics, I am asking those with research budgets who use these D. magna genes to contribute funds to defray my contribution. These gene data will be published to databanks within a year for all to use freely.

-- Don Gilbert, gilbertd at indiana edu, 2014 November

      Name                                 Last modified       Size  Description

[DIR] Parent Directory                     22-Sep-2016 14:57      -
[TXT] About_dmagset7fin9b.txt              22-Nov-2016 17:35     8k
[   ] dmagset7finall9b.attr.gfi.gz         10-Oct-2014 16:22  12.2M
[   ] dmagset7finall9b.attr.tbl.gz         10-Oct-2014 16:19  12.6M
[   ] dmagset7finall9b.attr.xml.gz         10-Oct-2014 23:08  25.2M
[TXT] dmagset7finall9b.clonesource.tbl     09-Apr-2015 14:38   2.4M
[   ] dmagset7finalt9b.puban.aa.gz         12-Oct-2014 17:13  36.3M
[   ] dmagset7finalt9b.puban.cds.gz        12-Oct-2014 17:13  62.0M
[   ] dmagset7finalt9b.puban.gff.gz        26-Sep-2014 15:18  27.8M
[   ] dmagset7finalt9b.puban.mrna.gz       12-Oct-2014 17:13  82.0M
[   ] dmagset7finalt9c.puban.gff.gz        22-Nov-2016 17:05  27.8M
[   ] dmagset7finloc9b.puban.aa.gz         12-Oct-2014 17:11  10.4M
[   ] dmagset7finloc9b.puban.cds.gz        12-Oct-2014 17:10  16.2M
[   ] dmagset7finloc9b.puban.gff.gz        10-Oct-2014 16:34  11.3M
[   ] dmagset7finloc9b.puban.mrna.gz       12-Oct-2014 17:11  20.6M
[   ] dmagset7finloc9c.puban.gff.gz        22-Nov-2016 17:02  11.3M
[TXT] evg7finloc9_sum.txt                  05-Oct-2014 16:06    15k
[DIR] genomaps/                            18-Dec-2014 12:45      -


Daphnia magna gene set: evg7vose/pubset9b/    03 Oct 2014
Summarized in Genes/evg7finloc9_sum.txt

This set contains 29121 gene loci, with primary transcripts in "finloc9b", 
and 84898 alternate transcripts in "finalt9b".

As well, 17228 "culled" transcripts are part of this gene set, classified
as fragment or duplicate (i.e. artifactual copy of some locus transcript).
These culled transcripts however may contain some valid loci.
--------------------------------------------------------------------------------------

File sets by suffix: 

aa,cds,mrna : gene sequences in fasta format, proteins (.aa), coding sequences (.cds), and mRNA transcripts (.mrna)
  These sequence files contain primary (dmagset7finloc9b) or alternate (dmagset7finalt9b) transcript sequences.
  Culled dmag7finlfrag sequences are in separate files.
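
The sequence files can be read with any FASTA parser. What follows is a minimal sketch, not part of the gene set tools. It assumes only the generic FASTA layout (a ">" header line whose first word is the record ID, followed by sequence lines). The function names read_fasta and open_maybe_gzip and the demo records are made up for illustration, and the layout of attributes in the real header lines is not assumed here.

```python
import gzip
import io

def read_fasta(handle):
    """Yield (record_id, header_rest, sequence) from an open text handle."""
    record_id, rest, chunks = None, "", []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, rest, "".join(chunks)
            header = line[1:].split(None, 1)
            record_id = header[0] if header else ""
            rest = header[1] if len(header) > 1 else ""
            chunks = []
        elif record_id is not None:
            chunks.append(line.strip())
    if record_id is not None:
        yield record_id, rest, "".join(chunks)

def open_maybe_gzip(path):
    """Open a plain or gzip-compressed text file for reading."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")

# Made-up demo records; for a real file use e.g.
#   with open_maybe_gzip("dmagset7finloc9b.puban.aa.gz") as handle: ...
demo = io.StringIO(">exampleEVm1t1 made-up header text\nMKT\nLLV\n>exampleEVm2t1\nMA\n")
records = list(read_fasta(demo))
print(len(records), records[0][0], len(records[0][2]))
```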

gff: gene locations on Daphnia magna 2010.04 assembly of genome sequences (Genome/dmagna20100422assembly.summary)
  GFF files contain one mRNA row plus exon and CDS rows per mapped transcript, with a unique transcript ID,
  except for split-mapped transcripts, which have 2 mRNA rows at different locations with the same transcript ID,
  plus a "Split=1;" or "Split=2;" attribute (no more than 2 parts so far).
  GFF source column 2 is one of dmag7finlm dmag7finlalt dmag7finlfrag
  dmagset7finloc9b.puban.gff contains both dmag7finlm main location and dmag7finlfrag fragment transcript locations.
  dmagset7finalt9b.puban.gff contains only dmag7finlalt alternate transcripts.
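
Because split-mapped transcripts repeat the same transcript ID on 2 mRNA rows, a reader of these GFF files may want to group mRNA rows by ID rather than assume one row per transcript. The sketch below shows one way to do that. It assumes tab-separated GFF with "key=value;" attributes in column 9 and the transcript ID stored under the key "ID"; that key, the function names, and the sample rows (apart from the one location quoted in this document) are assumptions for illustration.

```python
from collections import defaultdict

def parse_gff_attributes(column9):
    """Turn 'ID=x;Split=1;' into {'ID': 'x', 'Split': '1'}."""
    attrs = {}
    for part in column9.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def mrna_rows_by_transcript(lines):
    """Group mRNA rows by transcript ID (assumed to be the 'ID' attribute)."""
    rows = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA":
            continue
        attrs = parse_gff_attributes(cols[8])
        rows[attrs.get("ID")].append({
            "scaffold": cols[0],
            "source": cols[1],          # dmag7finlm, dmag7finlalt or dmag7finlfrag
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
            "split": attrs.get("Split"),
        })
    return dict(rows)

# Sample rows: the first uses the location quoted in this document,
# the other two are made up to show a split-mapped transcript.
sample = [
    "scaffold00007\tdmag7finlm\tmRNA\t237926\t273051\t.\t+\t.\tID=Dapma7bEVm000001t1;\n",
    "scaffoldA\tdmag7finlm\tmRNA\t100\t900\t.\t+\t.\tID=exampleEVm1t1;Split=1;\n",
    "scaffoldB\tdmag7finlm\tmRNA\t50\t400\t.\t-\t.\tID=exampleEVm1t1;Split=2;\n",
]
grouped = mrna_rows_by_transcript(sample)
split_ids = [tid for tid, found in grouped.items() if len(found) > 1]
print(split_ids)
```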

attr.tbl,xml,gfi: gene attributes in 3 formats, for primary, alternate and culled transcripts.
  These attributes are repeated, in part, in GFF mRNA and Fasta header entries.
  The attr.xml is the same format displayed in your web browser via "Search Daphnia magna Genes", 
  using xml style sheets.

TODO problems, updates:
 CDS exons (gff) with -neg span, from tr>genome mapping errors (tr maps to shorter genome span), n=443
 See below Update 2016.nov.22, 9c.puban.gff.gz  corrections of -neg span

 Errors of both false positive and false negative locus calls exist in this gene set, i.e.,
 some called loci are instead alternate transcripts, and some called alternate transcripts are separate
 paralog loci. This is most common for true tandem duplicate loci, where distinguishing true and false is difficult.
 Also there are cases of mis-matched alternate/main transcripts.  

--------------------------------------------------------------------------------------
Attribute table columns in dmagset7finall9b.attr.tbl.gz, with values from one example row

column        example value
------------  -------------
transcriptID  Dapma7bEVm000001t1
geneID        Dapma7bEVm000001
isoform       1
quality       Class:Strong,Express:Strong,Homology:InparalogStrong,Intron:Strong,Map:Strong,Protein:complete
aaSize        2009
cdsSize       70%,6030/8560
Name          Down syndrome cell adhesion molecule protein (76%H)
oname         GH03113p (76%d)
groupname     same
ortholog      mayzebr:XP_004554022.1
paralog       Dapma7bEVm030496t1
genegroup     ARP7f_G184,5/43/10
Dbxref        human:UniRef50_O94856,dromel:FBgn0025878,CDD:238020,
intron        81%,29/36
express       100%,rx:4,notde
mapCover      100%
location      scaffold00007:237926-273051:+
pulexmap      dplxm:scaffold_7:977771-1012016:-
ref1          0
locusclass    keepmain,oneR0Rc1Rca
mvclocusids   MCG8.3.111,Dapma6vsEVm000072t2,CG207
oid           Dapma6tiEVm000291t1,Dapma6vtEVm000001t2,Dapma7aEVm000001t1,dmag4vel4ibxk45Loc2101t13
score         3150
------------------------------------------------------------------------------------------
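
The attribute table can be loaded into one record per transcript, keyed by the column names listed above. This is a sketch under two assumptions not confirmed here: that the file is tab-separated, and that its first line is the column header. The function name read_attr_table is made up, and the demo uses only the first six columns of the example row.

```python
import csv
import gzip
import io

def read_attr_table(handle):
    """Yield one dict per transcript row, keyed by the header column names."""
    # QUOTE_NONE keeps any quote characters in names as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row

# For the real file (assumed layout), something like:
#   with gzip.open("dmagset7finall9b.attr.tbl.gz", "rt", newline="") as handle:
#       for row in read_attr_table(handle): ...
header = ["transcriptID", "geneID", "isoform", "quality", "aaSize", "cdsSize"]
values = [
    "Dapma7bEVm000001t1", "Dapma7bEVm000001", "1",
    "Class:Strong,Express:Strong,Homology:InparalogStrong,"
    "Intron:Strong,Map:Strong,Protein:complete",
    "2009", "70%,6030/8560",
]
demo = io.StringIO("\t".join(header) + "\n" + "\t".join(values) + "\n")
rows = list(read_attr_table(demo))
print(rows[0]["geneID"], rows[0]["aaSize"])
```

All values arrive as strings; numeric columns such as aaSize need an explicit int() conversion.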

Attribute description
transcriptID, geneID, isoform : ID in three parts 
  "Dapma7bEVm000001 t 1" are gene identifier + transcript number (isoform)
quality
  A value string expressing the quality of the transcript, using Strong|Medium|Okay|Poor/Weak|None strength values
  Overall transcript quality is first, as "Class:Strong", followed by attribute parts,
  Express[ion], Homology, Intron (from mapping to genome), Map (to genome), Protein completeness + CDS/UTR qual.
  Strength values are related to percentage of measured quality, but cut offs are adjusted for each attribute, eg 
    Map : Strong >= 90%, Medium >= 80%, Poor >= 60%, None < 5%, map coverage
    Name: Homology >= 66%, Medium >= 33%, Weak >= 15% alignment
aaSize  
  Protein length
cdsSize
  CDS/Transcript length and percent: 70%,6030/8560
Name
  Best name derived from homology to other species, or from another source; "Uncharacterized" means no homology.
  Name suffix " (76%H)" is alignment percent and source key, "76%" align to "H"uman gene in Dbxref
oname
  Other name if different, from reference species (Daph pulex, Dromel, Human)
groupname
  Consensus gene family name from OrthoMCL analysis
ortholog
  Other species ortholog gene ID (best aligned of several)
paralog
  Same species paralog gene ID, as scored from OrthoMCL analysis (may be improved)
genegroup
  Gene family ID, counts of this species genes, all genes, and number of species (10 max) in orthoMCL family
Dbxref
  Database cross-ref IDs, from UniProt, species gene sets, and NCBI conserved domain (CDD)
intron
  Intron/exon count and percent from map to genome, and align with evidence read-introns,
  where 81%,29/36 means 36 exons have 29 read-introns aligned, or 81% evidence-introns
express
  Expression coverage from rna-assembly or rna-seq and est alignment for gene predictions, and
  transcript-read map score as RPKM, plus rough DE measure (anonymous treatments), eg.
  100%,rx:4,notde is 100% expression coverage (rna-assembled), with RPKM=4 for reads mapped back,
  with no DE among tested treatments.
mapCover
  Coverage of transcript mapped to genome as percent, plus percent identity when identity is below 99%,
  and Split-mapping annotation,
  e.g. "99%,i98%,SplitC3" has 99% total coverage, split among 2 locations on different scaffolds, with
  mapping identity of 98%
location
  location as scaffold:start-end:strand for mapping to Dmagna 2010 genome, eg. scaffold00007:237926-273051:+
  where NOPATH indicates no mapping (and mapCover = 0)
pulexmap
  location for mapping to Dpulex 2007 genome assembly, dplxm:scaffold_7:977771-1012016:-, or nodplxm for none
ref1
  Prior gene set reference ID, zero here
locusclass
  Locus/transcript classification values used in producing gene set, to be detailed
mvclocusids
  Additional locus/transcript identifiers, to be detailed,
    MCG8.3.111 is MCG cross-clone consensus alignment locus id/value, CG207 is CG consensus-genome-map id
oid
  Object IDs tracking original transcript assembly ID thru intermediate gene sets, e.g.
  original trasm=dmag4vel4ibxk45Loc2101t13, second stage clonal gene set(INB)=Dapma6tiEVm000291t1,
  third stage combined-clones gene set=Dapma6vtEVm000001t2
score
  Numeric weighted score of transcript qualities, not detailed
------------------
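
Several of the attribute values described above are compact strings that need a second parsing step. The sketch below handles four of them using the example values from this document: the transcript ID, the quality string, the "percent,count/total" form shared by cdsSize and intron, and the location. The function names are made up for illustration, and the patterns are inferred from those few examples only, so other rows may need looser handling.

```python
import re

def split_transcript_id(transcript_id):
    """'Dapma7bEVm000001t1' -> ('Dapma7bEVm000001', 1)."""
    match = re.fullmatch(r"(.+)t(\d+)", transcript_id)
    if match is None:
        raise ValueError(f"unexpected transcript ID: {transcript_id!r}")
    return match.group(1), int(match.group(2))

def parse_quality(quality):
    """'Class:Strong,Map:Strong' -> {'Class': 'Strong', 'Map': 'Strong'}."""
    return dict(item.split(":", 1) for item in quality.split(",") if ":" in item)

def parse_percent_fraction(value):
    """'70%,6030/8560' -> (70, 6030, 8560); the form used by cdsSize and intron."""
    match = re.fullmatch(r"(\d+)%,(\d+)/(\d+)", value)
    if match is None:
        raise ValueError(f"unexpected value: {value!r}")
    return tuple(int(group) for group in match.groups())

def parse_location(location):
    """'scaffold00007:237926-273051:+' -> ('scaffold00007', 237926, 273051, '+').

    Returns None for NOPATH (no mapping).
    """
    if location.startswith("NOPATH"):
        return None
    scaffold, span, strand = location.rsplit(":", 2)
    start, end = span.split("-")
    return scaffold, int(start), int(end), strand

gene_id, isoform = split_transcript_id("Dapma7bEVm000001t1")
quality = parse_quality(
    "Class:Strong,Express:Strong,Homology:InparalogStrong,"
    "Intron:Strong,Map:Strong,Protein:complete"
)
cds_pct, cds_len, tr_len = parse_percent_fraction("70%,6030/8560")
intron_pct, introns, exons = parse_percent_fraction("81%,29/36")
location = parse_location("scaffold00007:237926-273051:+")

# The listed percentages agree with the rounded ratios of the counts.
print(round(100 * cds_len / tr_len), round(100 * introns / exons))
```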
  
Update 2016.nov.22 
v9c gff corrects GFF formatting for -negative spans on CDS exons due to map errors,
and for CDS, exon, and mRNA rows lacking strand orientation (+/- in column 7).  Ambiguous strand
entries remain where map data are insufficient. These include about 500 loci, as well as the NOPATH (no location)
entries and the fragments (source = dmag7finlfrag in column 2), which should be ignored anyway.
Not all -negative spans were correctable; the uncorrected rows now carry a '#errspan' comment
prefix, so GFF parsers should ignore them.

  dmagset7finloc9c.puban.gff is dmagset7finloc9b.puban.gff with strand/negspan corrections.
  dmagset7finalt9c.puban.gff is dmagset7finalt9b.puban.gff with strand/negspan corrections.
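
A reader of the 9c GFF files may want to drop the rows this update says to ignore. The sketch below skips comment lines (which covers the '#errspan' rows), rows with source dmag7finlfrag, rows whose strand in column 7 is not + or -, and any row whose start exceeds its end. It assumes that an ambiguous strand is written as something other than + or - (for example "."), which is not stated above. The function name usable_gff_rows and all sample rows are made up.

```python
def usable_gff_rows(lines, skip_sources=("dmag7finlfrag",)):
    """Yield GFF column lists, dropping rows the update notes say to ignore."""
    for line in lines:
        # Comment lines include the '#errspan' rows that could not be corrected.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        if cols[1] in skip_sources:
            continue
        # Assumption: anything other than + or - in column 7 is ambiguous.
        if cols[6] not in ("+", "-"):
            continue
        # Guard against any remaining negative span (start > end).
        if int(cols[3]) > int(cols[4]):
            continue
        yield cols

# Made-up rows: an errspan comment, a fragment, an ambiguous strand, a good row.
sample = [
    "#errspan scaffoldA\tdmag7finlm\tCDS\t500\t400\t.\t+\t0\tParent=exampleEVm3t1\n",
    "scaffoldA\tdmag7finlfrag\tmRNA\t10\t90\t.\t+\t.\tID=exampleEVm4t1\n",
    "scaffoldA\tdmag7finlm\tmRNA\t100\t900\t.\t.\t.\tID=exampleEVm5t1\n",
    "scaffoldA\tdmag7finlm\tmRNA\t100\t900\t.\t+\t.\tID=exampleEVm6t1\n",
]
kept = [cols[8] for cols in usable_gff_rows(sample)]
print(kept)
```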
#--------------------