
Comment on sequence normalization of tiling array expression

Comments on this paper: Assessing the need for sequence-based normalization in tiling microarray experiments. Thomas E. Royce, Joel S. Rozowsky and Mark B. Gerstein. Bioinformatics, Vol. 23 no. 8 2007, pages 988-997. doi:10.1093/bioinformatics/btm052

The authors have identified an important artifact in tile array signal detection: ubiquitous hybridization that varies with probe sequence content. They back this up with uncomplicated, usable R statistics for the methods presented. The other side of this is that gene structures have clear, large changes in sequence GC content: introns and intergenes have lower GC content than exons. For examples with Daphnia pulex, see http://wfleabase.org/genome-summaries/gene-structure/basecounts/

I was able to use these methods and generally reproduce the GC content effect of probe sequence for both Daphnia Nimblegen and Drosophila Affymetrix tile expression data. The plots look compelling: raw signal gives a higher signal for GC-rich probes. After normalization by sequence, that effect goes away.

Signals normalized this way do not differ grossly from the raw signals. However, one result of sequence content normalization appears to be fuzziness in detecting gene sequence structure (exon/intron/intergene boundaries). On the other hand, there are cases of signals in GC-poor regions being clarified with this normalization. Sequence normalization gave a 1% increase in sensitivity/specificity at detecting known EST exons with Daphnia/Nimblegen data. But for Drosophila/Affymetrix data, there was a 2% drop in exon sensitivity/specificity with normalized signal.

Don Gilbert, 2008 March, gilbertd@indiana.edu

I used the R source code from this paper's supplement at http://tiling.gersteinlab.org/sequence_effects/ (sequence_normalization_functions.R), both RLS with iteration and quantilenorm; the latter seems the better one. These methods have the value of being a clear and uncomplicated statistical approach to adjusting tile signals for ubiquitous hybridization effects of probe sequence.

What this paper doesn't address well is the biological correlation between sequence content and gene structure, and whether sequence normalization affects accurate discrimination of structures. The authors compare human RefSeq genes versus non-RefSeq regions (control) paired for GC content. This may not be enough to disentangle non-specific signal, due to greater hybridization to GC-rich probes, from true signal of transcribed regions that are GC rich.

Methods

Sample cases of Daphnia pulex scaffold_4 and Drosophila melanogaster chr4, both containing about 800 exons, were used. For Daphnia, these were only EST-validated exons. For Drosophila, these were reference release 5 exons. The Nimblegen data included 90,000 tiles of 50 bp length, overlapping every 25 bp, on this scaffold. The Affymetrix data (two studies: Manak et al., and modENCODE transcriptome data) included 20,000 non-overlapping tiles of 38 and 36 bp each. Tiles with signal above the median threshold that overlap exons were counted before and after normalization. This measures sensitivity (exons with tile expression/all exons) and specificity (1 - (high-signal tiles outside exons/all high-signal tiles)). Because one important use of tile expression is to detect gene structures (exon, intron and intergene regions), signal changes at exon/intron bounds were also measured as the difference in successive signals. Intergene regions were not used due to their lower certainty of annotation. Maps of gene structures, tile signal and GC content were viewed with GBrowse, which gave the first clue that sequence normalization was affecting clarity of gene structure detection.
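The sensitivity and specificity measures above can be sketched as follows. This is a minimal illustration, not the scripts actually used: the function names, the tuple layout for tiles and exons, and the half-open coordinate convention are all assumptions made for the example.

```python
from statistics import median

def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end

def exon_sens_spec(tiles, exons):
    """Sensitivity and specificity of high-signal tiles at detecting exons.

    tiles: list of (start, end, signal); exons: list of (start, end).
    Both layouts are hypothetical, chosen for this sketch.
    """
    # "High" means above the median of all tile signals.
    threshold = median(sig for _, _, sig in tiles)
    high = [(s, e) for s, e, sig in tiles if sig > threshold]
    if not high or not exons:
        raise ValueError("need at least one high-signal tile and one exon")

    # Exons covered by at least one high-signal tile.
    hit_exons = sum(
        1 for xs, xe in exons
        if any(overlaps(s, e, xs, xe) for s, e in high)
    )
    # High-signal tiles that touch no exon at all.
    outside = sum(
        1 for s, e in high
        if not any(overlaps(s, e, xs, xe) for xs, xe in exons)
    )
    sensitivity = hit_exons / len(exons)
    specificity = 1 - outside / len(high)
    return sensitivity, specificity
```

Running the same function on raw and on normalized signals for one scaffold gives the before/after comparison. The nested scans are quadratic, which is acceptable for a sketch but would want an interval index for whole-genome data.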

A third comparison was done with Drosophila data from two experiments using different, but overlapping tiles on the same chr4, Manak et al. (ref) and modENCODE transcriptome data. For this, probes that overlapped 25% to 75% were selected to provide variation in GC content at nearly the same locations. Differences in signal and GC content at paired locations could indicate if a correlation exists for sequence content effects within the same exon and intron gene structures.
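A rough sketch of this paired-tile comparison follows. Several details are assumptions for illustration only: overlap is measured relative to the shorter tile, the statistic is a Pearson correlation between GC difference and signal difference, and the tuple layout and function names are hypothetical.

```python
from math import sqrt

def gc_fraction(seq):
    """Fraction of G and C bases in a probe sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def overlap_fraction(a, b):
    """Shared length of half-open intervals a and b, divided by the shorter tile length."""
    shared = min(a[1], b[1]) - max(a[0], b[0])
    return max(shared, 0) / min(a[1] - a[0], b[1] - b[0])

def paired_differences(tiles_a, tiles_b, lo=0.25, hi=0.75):
    """Collect GC and signal differences for tile pairs overlapping lo..hi.

    tiles_a, tiles_b: lists of (start, end, sequence, signal), one list per experiment.
    """
    d_gc, d_sig = [], []
    for sa, ea, seq_a, sig_a in tiles_a:
        for sb, eb, seq_b, sig_b in tiles_b:
            if lo <= overlap_fraction((sa, ea), (sb, eb)) <= hi:
                d_gc.append(gc_fraction(seq_a) - gc_fraction(seq_b))
                d_sig.append(sig_a - sig_b)
    return d_gc, d_sig

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)
```

With `d_gc, d_sig = paired_differences(manak_tiles, modencode_tiles)` (both list names are placeholders), `pearson(d_gc, d_sig)` gives the paired correlation, to be compared against the per-experiment correlation of GC fraction and signal.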

Results

The quantile normalization and RLS methods generally reproduce the GC content effect of probe sequence reported by Royce and colleagues, for both Daphnia Nimblegen and Drosophila Affymetrix tile data. The plots look compelling (Fig 2 and Fig 3): raw signal gives a higher signal for GC-rich probes. After normalization by sequence, that effect goes away.
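To make the GC effect and its removal concrete, here is a deliberately simplified stand-in: probes are grouped by G+C count and each group's median is shifted to the overall median. This is not the quantilenorm or RLS code from the paper's supplement, which I ran unchanged in R; the function names and the (sequence, signal) layout are assumptions for the example.

```python
from collections import defaultdict
from statistics import median

def gc_count(seq):
    """Number of G and C bases in a probe sequence."""
    seq = seq.upper()
    return seq.count("G") + seq.count("C")

def median_by_gc(probes):
    """Median signal per GC count. probes: list of (sequence, signal)."""
    groups = defaultdict(list)
    for seq, sig in probes:
        groups[gc_count(seq)].append(sig)
    return {gc: median(vals) for gc, vals in sorted(groups.items())}

def center_by_gc(probes):
    """Shift each GC group so its median equals the overall median.

    Simplified illustration only: it removes the trend in group medians
    but, unlike a quantile method, leaves each group's spread untouched.
    """
    overall = median(sig for _, sig in probes)
    group_median = median_by_gc(probes)
    return [
        (seq, sig - group_median[gc_count(seq)] + overall)
        for seq, sig in probes
    ]
```

Calling `median_by_gc` before and after `center_by_gc` shows the rising trend of signal with GC count in the raw data and a flat line afterwards, which is the pattern seen in the plots.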

For detecting gene structures, the overlap of high scoring tiles with exons provides a measure of normalization results. There was a 1% increase in both sensitivity and specificity at detecting exons for Daphnia/Nimblegen using quantile-normalized signals. Drosophila data showed a 2% drop in sensitivity and specificity with quantile-normalized signal.

Using the partially overlapped tiles of two Drosophila experiments, differences in GC content had much lower correlation with signal level. For each experiment separately there is a 21% to 25% correlation of GC and signal strength. But for paired tiles this drops to a 2.5% correlation of GC and signal. This result, though suggesting major effects of gene structure relative to non-specific sequence content hybridization, is inconclusive.

More compelling evidence for use of raw signals at detecting gene structures is seen by looking at signal changes at exon/intron boundaries (Fig. 4). This shows that one effect of normalization is to obscure gene structure boundaries, which are often related to sequence content changes. Figures 1 and 5 provide gene map examples of this.

Possible problems with these normalizations for detecting gene structures became clear when I ran the robust least squares (RLS) method described in the paper on the Drosophila melanogaster Affymetrix data. It down-weighted exons and up-weighted introns so that the normalized signal became strongest for introns of several genes.

Figure 1. Example Drosophila gene with tile expression signals (raw, normalized)

This is a 3 exon gene that shows the effects well. The exons are obscured (reduced) for the RLS signal, and the introns are obscured (increased) for the QNorm signal, relative to Raw score (red boxes). The blue boxes show gains in UTR region. These correspond to GC changes at exon/intron/intergene bounds. These are plots as in Royce et al (Fig. 1) of average signal per base over probe sequence position:

Figure 2. Raw and normalized signal by base per probe sequence position

This same data is shown in terms of signals versus probe GC content, with the added distinction of exon (red) and intron (green) tile groupings. Average exon signal and GC values are above those of introns (see large dots), and this remains after normalizations, though correlation of GC and signals changes.

Figure 3. Signal strength versus GC content, with colored exon and intron groups

A way to see the effect of normalization on detecting gene structure changes is to look at serial changes (from tile to tile) at exon/intron bounds. These are plots of the signal distribution (raw and quantilenorm) at exon boundaries and non-boundary changes. The raw signal has a greater signal change at boundaries. The normalized signal shows little change at exon boundaries compared to non-boundary tiles.

Figure 4. Signal change distribution at exon/intron boundaries vs. within the same structure

The differences shown are statistically significant; raw signal is more likely to discriminate exon/intron bounds than sequence-normalized signal. Raw signal showed an average 0.747 change at boundaries, versus 0.428 change within exons or introns (t = -13.5, df = 1117, p-value < 2.2e-16). Quantile-normalized signal showed an average 0.149 change at boundaries, versus 0.127 change within structures (t = -3.68, df = 1168, p-value = 0.00024).
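The boundary comparison can be sketched like this. It rests on assumptions not spelled out above: a tile is assigned to an exon by its midpoint, the change is taken as the absolute difference between successive tiles, and the test is Welch's unequal-variance t-test (the default of R's t.test). Function names and data layout are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def in_exon(position, exons):
    """True if position falls in any half-open exon interval (start, end)."""
    return any(start <= position < end for start, end in exons)

def successive_changes(tiles, exons):
    """Split tile-to-tile signal changes into boundary and within-structure groups.

    tiles: list of (start, end, signal), sorted by start.
    A pair counts as a boundary when exactly one of the two tile midpoints is exonic.
    """
    boundary, within = [], []
    for (s1, e1, sig1), (s2, e2, sig2) in zip(tiles, tiles[1:]):
        first_exonic = in_exon((s1 + e1) / 2, exons)
        second_exonic = in_exon((s2 + e2) / 2, exons)
        change = abs(sig2 - sig1)
        (boundary if first_exonic != second_exonic else within).append(change)
    return boundary, within

def welch_t(a, b):
    """Welch t statistic and approximate degrees of freedom for mean(a) - mean(b)."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

The sign of t depends on argument order: `welch_t(within, boundary)` is negative when boundary changes are larger, matching the sign of the values reported above.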

While this paper has useful analyses, it doesn't address gene structure effects, and some further additions to these normalization methods would seem needed. For the case of using tile expression to detect gene structures, though, this leads to a dilemma, because gene structures and ubiquitous hybridization artifacts are both entangled with sequence content. If there is a way to combine sequence normalization with gene structure sequence changes, this would be a helpful analysis. I did see cases where the QNorm adjustment improved apparent gene-structure signal in low GC regions. See the right side of dpulex4-nimtilenorm2.png

By the way, for the Daphnia hemoglobin genes, the Qnorm method pulled out those odd, weak 1-tile signals before/behind the genes (red boxes) that I also found using a signal discontinuity analysis. See http://microbe.bio.indiana.edu:7182/data/dpx-augtrials/ These are in GC-poor regions:

Figure 5. Tile Signal normalization effects at two Daphnia hemoglobin genes