EvidentialGene/daphnia/daphnia_magna/Genes/earlyaccess/
Please consider donating toward my costs to produce this Daphnia magna gene set. A limitation on this project is my need for funds to support this work. My substantial effort in genome information engineering and dissemination in 2013-2015 has been without salary. To enable use of this work and future Daphnia genome informatics, I am asking those with research budgets who use these D. magna genes to contribute funds to defray my costs. These gene data will be published to databanks within a year for all to use freely.
-- Don Gilbert, gilbertd at indiana edu, 2014 November
Name Last modified Size Description
About_dmagset7fin9b.txt 22-Nov-2016 17:35 8k
dmagset7finall9b.attr.gfi.gz 10-Oct-2014 16:22 12.2M
dmagset7finall9b.attr.tbl.gz 10-Oct-2014 16:19 12.6M
dmagset7finall9b.attr.xml.gz 10-Oct-2014 23:08 25.2M
dmagset7finall9b.clonesource.tbl 09-Apr-2015 14:38 2.4M
dmagset7finalt9b.puban.aa.gz 12-Oct-2014 17:13 36.3M
dmagset7finalt9b.puban.cds.gz 12-Oct-2014 17:13 62.0M
dmagset7finalt9b.puban.gff.gz 26-Sep-2014 15:18 27.8M
dmagset7finalt9b.puban.mrna.gz 12-Oct-2014 17:13 82.0M
dmagset7finalt9c.puban.gff.gz 22-Nov-2016 17:05 27.8M
dmagset7finloc9b.puban.aa.gz 12-Oct-2014 17:11 10.4M
dmagset7finloc9b.puban.cds.gz 12-Oct-2014 17:10 16.2M
dmagset7finloc9b.puban.gff.gz 10-Oct-2014 16:34 11.3M
dmagset7finloc9b.puban.mrna.gz 12-Oct-2014 17:11 20.6M
dmagset7finloc9c.puban.gff.gz 22-Nov-2016 17:02 11.3M
evg7finloc9_sum.txt 05-Oct-2014 16:06 15k
genomaps/ 18-Dec-2014 12:45 -
Daphnia magna gene set: evg7vose/pubset9b/  03 Oct 2014
Summarized in Genes/evg7finloc9_sum.txt

This set contains 29121 gene loci, with primary transcripts in "finloc9b",
and 84898 alternate transcripts in "finalt9b". As well, 17228 "culled"
transcripts are part of this gene set, classified as fragment or duplicate
(i.e. artifactual copy of some locus transcript). These culled transcripts
however may contain some valid loci.

--------------------------------------------------------------------------------------
File sets by suffix:

aa,cds,mrna : gene sequences in fasta format, proteins (.aa), coding sequences (.cds),
    and mRNA transcripts (.mrna)
    These sequence files contain primary dmagset7finloc9b, or alternate
    (dmagset7finalt9b) transcript sequences.
    Culled dmag7finlfrag sequences are in separate files.

gff : gene locations on Daphnia magna 2010.04 assembly of genome sequences
    (Genome/dmagna20100422assembly.summary)
    GFF files contain one mRNA row plus exon, CDS rows, per mapped transcript,
    with unique transcript ID, except for split-mapped transcripts, which contain
    2 mRNA rows at different locations, with same transcript ID, plus "Split=1;"
    or Split=2 attribute (not yet more).
    GFF source column 2 is one of
        dmag7finlm  dmag7finlalt  dmag7finlfrag
    dmagset7finloc9b.puban.gff contains both dmag7finlm main location and
    dmag7finlfrag fragment transcript locations.
    dmagset7finalt9b.puban.gff contains only dmag7finlalt alternate transcripts.

attr.tbl,xml,gfi : gene attributes in 3 formats, for primary, alternate and
    culled transcripts.
    These attributes are repeated, in part, in GFF mRNA and Fasta header entries.
    The attr.xml is the same format displayed in your web browser via
    "Search Daphnia magna Genes", using xml style sheets.
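The GFF layout described above (one mRNA row per mapped transcript, two rows for split-mapped transcripts, source names in column 2) can be read with a short script. The following is only a sketch under stated assumptions: it assumes tab-separated GFF rows with `key=value;` attributes in column 9 and the transcript ID stored under an `ID` key, neither of which is spelled out in this README. The sample rows, the IDs `exampleT1` and `exampleFrag1`, the scaffold names `scaffoldA/B/C`, and the `skip_sources` parameter are hypothetical and exist only for illustration.

```python
import io
from collections import defaultdict

# Hypothetical sample rows for illustration only; not copied from the real files.
SAMPLE_GFF = (
    "##gff-version 3\n"
    "scaffold00007\tdmag7finlm\tmRNA\t237926\t273051\t.\t+\t.\tID=Dapma7bEVm000001t1\n"
    "scaffoldA\tdmag7finlm\tmRNA\t100\t900\t.\t+\t.\tID=exampleT1;Split=1\n"
    "scaffoldB\tdmag7finlm\tmRNA\t50\t400\t.\t-\t.\tID=exampleT1;Split=2\n"
    "scaffoldC\tdmag7finlfrag\tmRNA\t10\t200\t.\t.\t.\tID=exampleFrag1\n"
)


def parse_attributes(column9):
    """Turn 'ID=x;Split=1' into {'ID': 'x', 'Split': '1'} (assumed key=value syntax)."""
    attrs = {}
    for part in column9.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def mrna_locations(handle, skip_sources=("dmag7finlfrag",)):
    """Collect mRNA rows per transcript ID, skipping comments and culled fragments."""
    locations = defaultdict(list)
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue  # comment and directive lines carry no feature
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "mRNA" or cols[1] in skip_sources:
            continue
        attrs = parse_attributes(cols[8])
        transcript_id = attrs.get("ID")
        if transcript_id is None:
            continue
        locations[transcript_id].append(
            (cols[0], int(cols[3]), int(cols[4]), cols[6], attrs.get("Split"))
        )
    return dict(locations)


locs = mrna_locations(io.StringIO(SAMPLE_GFF))
# Transcripts with more than one mRNA row are the split-mapped ones.
split_ids = sorted(tid for tid, rows in locs.items() if len(rows) > 1)
print(split_ids)
```

For the real compressed files, the same function should work on a handle from `gzip.open(path, "rt")`, provided the assumptions about the attribute column hold.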
TODO problems, updates:

  CDS exons (gff) with -neg span, from tr>genome mapping errors (tr maps to
  shorter genome span), n=443.
  See below, Update 2016.nov.22: 9c.puban.gff.gz has corrections of -neg span.

  Errors of both false positive and false negative locus calls exist in this
  gene set, i.e., some called loci are instead alternate transcripts, and some
  called alternate transcripts are separate paralog loci. This is most common
  for true tandem duplicate loci, where distinguishing true and false is
  difficult. Also there are cases of mis-matched alternate/main transcripts.

--------------------------------------------------------------------------------------
Attribute table columns in dmagset7finall9b.attr.tbl.gz

transcriptID geneID isoform quality aaSize cdsSize Name oname groupname ortholog paralog genegroup Dbxref intron express mapCover location pulexmap ref1 locusclass mvclocusids oid score

Example row, shown one column per line:

  transcriptID  Dapma7bEVm000001t1
  geneID        Dapma7bEVm000001
  isoform       1
  quality       Class:Strong,Express:Strong,Homology:InparalogStrong,Intron:Strong,Map:Strong,Protein:complete
  aaSize        2009
  cdsSize       70%,6030/8560
  Name          Down syndrome cell adhesion molecule protein (76%H)
  oname         GH03113p (76%d)
  groupname     same
  ortholog      mayzebr:XP_004554022.1
  paralog       Dapma7bEVm030496t1
  genegroup     ARP7f_G184,5/43/10
  Dbxref        human:UniRef50_O94856,dromel:FBgn0025878,CDD:238020,
  intron        81%,29/36
  express       100%,rx:4,notde
  mapCover      100%
  location      scaffold00007:237926-273051:+
  pulexmap      dplxm:scaffold_7:977771-1012016:-
  ref1          0
  locusclass    keepmain,oneR0Rc1Rca
  mvclocusids   MCG8.3.111,Dapma6vsEVm000072t2,CG207
  oid           Dapma6tiEVm000291t1,Dapma6vtEVm000001t2,Dapma7aEVm000001t1,dmag4vel4ibxk45Loc2101t13
  score         3150

------------------------------------------------------------------------------------------
Attribute description

transcriptID, geneID, isoform :
    ID in three parts, "Dapma7bEVm000001 t 1", is gene identifier +
    transcript number (isoform)
quality
    A value string expressing quality of transcript, using
    Strong|Medium|Okay|Poor/Weak|None strength of quality values.
    Overall transcript quality is first, as "Class:Strong", followed by
    attribute parts, Express[ion], Homology, Intron (from mapping to genome),
    Map (to genome), Protein completeness + CDS/UTR qual.
    Strength values are related to percentage of measured quality, but
    cut offs are adjusted for each attribute, eg
      Map : Strong >= 90%, Medium >= 80%, Poor >= 60%, None < 5%, map coverage
      Name: Homology >= 66%, Medium >= 33%, Weak >= 15% alignment
aaSize
    Protein length
cdsSize
    CDS/Transcript length and percent: 70%,6030/8560
Name
    Best name derived from homology to other species, or other;
    "Uncharacterized" is for no homology.
    Name suffix " (76%H)" is alignment percent and source key,
    "76%" align to "H"uman gene in Dbxref
oname
    Other name if different, from reference species (Daph pulex, Dromel, Human)
groupname
    Consensus gene family name from OrthoMCL analysis
ortholog
    Other species ortholog gene ID (best aligned of several)
paralog
    Same species paralog gene ID, as scored from OrthoMCL analysis
    (may be improved)
genegroup
    Gene family ID, counts of this species genes, all genes, and number of
    species (10 max) in orthoMCL family
Dbxref
    Database cross-ref IDs, from UniProt, species gene sets, and NCBI
    conserved domain (CDD)
intron
    Intron/exon count and percent from map to genome, and align with evidence
    read-introns, where 81%,29/36 means 36 exons have 29 read-introns aligned,
    or 81% evidence-introns
express
    Expression coverage from rna-assembly or rna-seq and est alignment for
    gene predictions, and transcript-read map score as RPKM, plus rough DE
    measure (anonymous treatments), eg. 100%,rx:4,notde is 100% expression
    coverage (rna-assembled), with RPKM=4 for reads mapped back, with no DE
    among tested treatments.
mapCover
    Coverage of transcript mapped to genome as percent, plus percent-identity
    when below ident>=99%, and Split-mapping annotation. For example,
    "99%,i98%,SplitC3" has 99% total coverage, split among 2 locations on
    different scaffolds, with mapping identity of 98%
location
    location as scaffold:start-end:strand for mapping to Dmagna 2010 genome,
    eg. scaffold00007:237926-273051:+
    where NOPATH indicates no mapping (and mapCover = 0)
pulexmap
    location for mapping to Dpulex 2007 genome assembly,
    dplxm:scaffold_7:977771-1012016:-, or nodplxm for none
ref1
    Prior gene set reference ID, zero here
locusclass
    Locus/transcript classification values used in producing gene set,
    to be detailed
mvclocusids
    Additional locus/transcript identifiers, to be detailed;
    MCG8.3.111 is MCG cross-clone consensus alignment locus id/value,
    CG207 is CG consensus-genome-map id
oid
    Object IDs tracking original transcript assembly ID through intermediate
    gene sets, e.g.
      original trasm=dmag4vel4ibxk45Loc2101t13,
      second stage clonal gene set(INB)=Dapma6tiEVm000291t1,
      third stage combined-clones gene set=Dapma6vtEVm000001t2
score
    Numeric weighted score of transcript qualities, not detailed

------------------
Update 2016.nov.22

v9c gff corrects GFF formatting, for -negative spans on CDS exons due to
map errors, and for CDS, exons, mRNA lacking strand orientation (+/- in column 7).
There remain ambiguous strand entries, due to insufficient map data. These
include about 500 loci, as well as the NOPATH (no location) and the fragments
(source = dmag7finlfrag in column 2) that should be ignored anyway.
In the case of -negative spans, not all were correctable; those now have a
'#errspan' comment prefix, so GFF parsers should ignore them.

  dmagset7finloc9c.puban.gff is dmagset7finloc9b.puban.gff with strand/negspan corrections.
  dmagset7finalt9c.puban.gff is dmagset7finalt9b.puban.gff with strand/negspan corrections.

#--------------------
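Several attribute columns described above pack more than one value into a single field (quality, cdsSize, intron, location, pulexmap). The sketch below shows one possible way to unpack them, using the example row from this document. It assumes the table is tab-separated with fields in the listed column order, which this README does not state explicitly. The function names `parse_row`, `parse_quality`, `parse_percent_ratio` and `parse_location` are hypothetical helpers, not part of any EvidentialGene tool.

```python
import re

# Column order as listed for dmagset7finall9b.attr.tbl.gz.
COLUMNS = (
    "transcriptID geneID isoform quality aaSize cdsSize Name oname groupname "
    "ortholog paralog genegroup Dbxref intron express mapCover location pulexmap "
    "ref1 locusclass mvclocusids oid score"
).split()

# The example row from this document, joined with tabs (assumed separator).
EXAMPLE_ROW = "\t".join([
    "Dapma7bEVm000001t1", "Dapma7bEVm000001", "1",
    "Class:Strong,Express:Strong,Homology:InparalogStrong,Intron:Strong,"
    "Map:Strong,Protein:complete",
    "2009", "70%,6030/8560",
    "Down syndrome cell adhesion molecule protein (76%H)", "GH03113p (76%d)",
    "same", "mayzebr:XP_004554022.1", "Dapma7bEVm030496t1", "ARP7f_G184,5/43/10",
    "human:UniRef50_O94856,dromel:FBgn0025878,CDD:238020,",
    "81%,29/36", "100%,rx:4,notde", "100%",
    "scaffold00007:237926-273051:+", "dplxm:scaffold_7:977771-1012016:-",
    "0", "keepmain,oneR0Rc1Rca", "MCG8.3.111,Dapma6vsEVm000072t2,CG207",
    "Dapma6tiEVm000291t1,Dapma6vtEVm000001t2,Dapma7aEVm000001t1,"
    "dmag4vel4ibxk45Loc2101t13",
    "3150",
])

LOCATION_RE = re.compile(r"^(?:dplxm:)?([^:]+):(\d+)-(\d+):([+.\-])$")
PERCENT_RATIO_RE = re.compile(r"^(\d+)%,(\d+)/(\d+)$")


def parse_row(line):
    """Map one tab-separated table line onto the column names."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    return dict(zip(COLUMNS, fields))


def parse_quality(text):
    """'Class:Strong,Map:Strong' -> {'Class': 'Strong', 'Map': 'Strong'}."""
    result = {}
    for part in text.split(","):
        key, sep, value = part.partition(":")
        if sep:
            result[key] = value
    return result


def parse_percent_ratio(text):
    """'70%,6030/8560' -> (70, 6030, 8560); None if the field has another shape."""
    match = PERCENT_RATIO_RE.match(text)
    return tuple(int(g) for g in match.groups()) if match else None


def parse_location(text):
    """'scaffold:start-end:strand' -> tuple; None for NOPATH, nodplxm, or other text."""
    match = LOCATION_RE.match(text)
    if not match:
        return None
    scaffold, start, end, strand = match.groups()
    return scaffold, int(start), int(end), strand


record = parse_row(EXAMPLE_ROW)
print(parse_quality(record["quality"])["Class"])
print(parse_percent_ratio(record["cdsSize"]))
print(parse_location(record["location"]))
```

Returning None for NOPATH and nodplxm keeps unmapped transcripts easy to filter out before any coordinate arithmetic.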