Genome Coding Sequence Density

Eukaryote coding density (pdf)
Daphnia CDS density: singleton vs tandem regions (pdf)

Whole-genome gene density, calculated either from averages or as total CDS bases / total genome bases, is given below. By this statistic, Daphnia does not appear to differ from Drosophila. We know there are a large number of validated genes in this genome, and the maps show where these genes are packed in, especially where duplicates are found.

The whole-genome average doesn't show the distribution of gene density, or the higher density in some regions (and it doesn't account for the many more gaps in Daphnia). The two figures here address this. The first shows that Daphnia's density distribution is skewed away from the lower gene density of insects, toward the higher density of C. elegans. The second shows that Daphnia's gene duplicates lie in regions of higher gene density (not a big surprise).

In Daphnia regions with 2+ genes that include duplicates, about 35% of the region is coding. Where there are 2+ genes but no duplicates, only 10-20% is coding. Overall, the C. elegans distribution peaks at about 23% coding, Daphnia at about 18%, and insects at about 10%. The averages don't show this because of the broad tails on these distributions.
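To illustrate why an average and a distribution peak can disagree, here is a minimal Python sketch. It assumes fixed-size windows along one sequence, which is only one possible way to bin a genome and is not necessarily how the figures were made. The function names, the window size and the toy intervals are all hypothetical.

```python
from collections import Counter
from statistics import mean


def coding_fraction_per_window(cds, seq_len, window):
    """Fraction of each fixed-size window covered by CDS.

    cds: list of (start, end) 1-based inclusive intervals, assumed
    non-overlapping. Windowing is an assumption made for this sketch.
    """
    n_win = (seq_len + window - 1) // window
    covered = [0] * n_win
    for start, end in cds:
        pos = start
        while pos <= end:
            w = (pos - 1) // window                # window index of pos
            win_end = min((w + 1) * window, end)   # clip at window edge
            covered[w] += win_end - pos + 1
            pos = win_end + 1
    sizes = [min(window, seq_len - i * window) for i in range(n_win)]
    return [c / s for c, s in zip(covered, sizes)]


def peak_percent(fractions, bin_pct=5):
    """Lower edge (in percent) of the most populated histogram bin."""
    bins = Counter(int(round(f * 100)) // bin_pct for f in fractions)
    return max(bins, key=bins.get) * bin_pct


# Toy data (invented): most windows are ~20% coding, a few are dense.
toy_cds = [(1, 20), (101, 120), (201, 220), (301, 320),
           (401, 480), (501, 590), (651, 750)]
fractions = coding_fraction_per_window(toy_cds, seq_len=1000, window=100)
print("mean coding fraction:", round(mean(fractions), 3))
print("peak bin (percent):", peak_percent(fractions))
```

In this toy case a few dense windows pull the mean well above the most common window density, which is the kind of effect a broad tail has on an average.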

Whole genome coding sequence ratio
(distinct cds-exons including alternate transcripts)

# nematodes
celegans  wb176  : ntr=27049, n=124138, m=204.413, cds=25375492, tb=100241936, c/t=0.253
cbriggsae wb176  : ntr=19525, n=114373, m=210.568, cds=24083296, tb=108443721, c/t=0.222

# crustacean
daphnia JGI_V11  : ntr=30940, n=142754, m=211.28,  cds=30160786, tb=174233412, c/t=0.173
daphnia Gnomon   : ntr=37466, n=151668, m=237.45,  cds=36014074, tb=200738384, c/t=0.179

# drosophila
drosmel ncbi     : ntr=14560, n=55078,  m=404.47,  cds=22277417, tb=120231707, c/t=0.185
drosmel Gnomon   : ntr=20420, n=59534,  m=385.79,  cds=22967701, tb=129253983, c/t=0.178
drossec Gnomon   : ntr=25689, n=69851,  m=356.17,  cds=24878808, tb=138574395, c/t=0.180
drossim Gnomon   : ntr=19885, n=63858,  m=350.95,  cds=22410664, tb=142312176, c/t=0.157
drosyak Gnomon   : ntr=20302, n=67234,  m=370.24,  cds=24892648, tb=168514273, c/t=0.148
drosere Gnomon   : ntr=18662, n=61599,  m=376.01,  cds=23161570, tb=136287721, c/t=0.170

drosana Gnomon   : ntr=23784, n=74217,  m=376.16,  cds=27917668, tb=195136171, c/t=0.143
drospse Gnomon   : ntr=19259, n=65407,  m=375.92,  cds=24587759, tb=143281209, c/t=0.172
drosper Gnomon   : ntr=24696, n=75374,  m=355.43,  cds=26790450, tb=163411818, c/t=0.164
droswil Gnomon   : ntr=24920, n=73171,  m=365.77,  cds=26763491, tb=207912054, c/t=0.129
drosmoj Gnomon   : ntr=17950, n=63811,  m=368.24,  cds=23497700, tb=174752375, c/t=0.134
drosvir Gnomon   : ntr=18636, n=65573,  m=368.18,  cds=24142814, tb=179118823, c/t=0.135
drosgri Gnomon   : ntr=17922, n=64047,  m=359.51,  cds=23025399, tb=153703083, c/t=0.150

# other insects
anogam ncbi      : ntr=12444, n=48852,  m=357.99,  cds=17488695, tb=230175766, c/t=0.075
apismel ncbi     : ntr=9429,  n=70453,  m=234.09,  cds=16492577, tb=177733093, c/t=0.093
nasvit fgenesh   : ntr=26115, n=115968, m=282.68,  cds=32782779, tb=267327937, c/t=0.122
nasvit Gnomon    : ntr=28998, n=118889, m=264.06,  cds=31394613, tb=270065215, c/t=0.116

ntr = number of transcripts
n = number of distinct cds-exons
m = mean cds-exon length (cds/n)
cds = sum of all non-overlapping cds bases (including distinct alt-tr exons of the same gene)
tb = total bases that the gene cds bases span
c/t = ratio of cds to tb bases

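# Method: sum CDS bases from genome GFF files
# (csh/tcsh syntax; $gf is the path to a gzipped GFF file)
# ntr counts the lines that contain 'mRNA'.
# CDS rows are sorted by sequence, then start, then descending end. A CDS that
# overlaps the previous row on the same sequence is skipped; otherwise its length is added.
# tb adds up the end coordinate of the last CDS row on each sequence.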

set ntr=`gunzip -c $gf | grep -c 'mRNA'`
echo -n "CDSbases $gf : ntr=$ntr " ; gunzip -c $gf | grep 'CDS	' | \
sort -k1,1 -k4,4n -k5,5nr | perl -ne '($r,$s,$t,$b,$e)=split; if($lr and $lr ne $r) { $tb+=$le;} \
unless($r eq $lr and $b < $le and $e > $lb) { $n++; ($b,$e)=($e,$b) if($e<$b); $sb += 1+$e-$b;}  \
($lr,$lb,$le)=($r,$b,$e); END{ $m=$sb/$n; $tb += $le; $cb=$sb/$tb; \
printf "\tn=$n,\t m=%.2f,\t cds=$sb,\t tb=$tb,\t c/t=%.3f\n",$m,$cb;}'
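For readers who find the one-liner hard to follow, here is a rough Python re-expression of the same counting logic. It is a sketch, not the script that produced the table: the names `CdsStats` and `cds_stats` are invented here, it selects rows by the GFF type column rather than by grep pattern, and it swaps reversed coordinates before sorting rather than after.

```python
from typing import Iterable, NamedTuple


class CdsStats(NamedTuple):
    n: int        # distinct (non-overlapping) CDS exons counted
    cds: int      # summed bases of the counted CDS exons
    tb: int       # summed end coordinate of the last CDS row per sequence
    mean: float   # cds / n
    ratio: float  # cds / tb


def cds_stats(gff_lines: Iterable[str]) -> CdsStats:
    rows = []
    for line in gff_lines:
        if line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        # Assumption: filter on the type column (the one-liner greps 'CDS<tab>').
        if len(f) < 5 or f[2] != "CDS":
            continue
        start, end = int(f[3]), int(f[4])
        if end < start:
            start, end = end, start
        rows.append((f[0], start, end))

    # Same ordering as: sort -k1,1 -k4,4n -k5,5nr
    rows.sort(key=lambda r: (r[0], r[1], -r[2]))

    n = cds = tb = 0
    prev = None
    for seq, start, end in rows:
        if prev is not None and prev[0] != seq:
            tb += prev[2]  # close out the previous sequence
        overlaps = (prev is not None and prev[0] == seq
                    and start < prev[2] and end > prev[1])
        if not overlaps:
            n += 1
            cds += end - start + 1
        prev = (seq, start, end)  # updated even when the row was skipped
    if prev is not None:
        tb += prev[2]

    mean = cds / n if n else 0.0
    ratio = cds / tb if tb else 0.0
    return CdsStats(n, cds, tb, mean, ratio)
```

A possible call, assuming a gzipped GFF at a path of your choosing, would be `cds_stats(gzip.open(path, "rt"))` after `import gzip`. Note that, as in the one-liner, each row is compared only with the row immediately before it, so an overlapping exon is dropped rather than merged.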

Don Gilbert, February 2008
