A simple method for statistical analysis of intensity differences in microarray-derived gene expression data
© Kamb and Ramaswami; licensee BioMed Central Ltd. 2001
Received: 21 July 2001
Accepted: 2 October 2001
Published: 2 October 2001
Microarray experiments offer a potent solution to the problem of making and comparing large numbers of gene expression measurements either in different cell types or in the same cell type under different conditions. Inferences about the biological relevance of observed changes in expression depend on the statistical significance of the changes. In lieu of many replicates with which to determine accurate intensity means and variances, reliable estimates of statistical significance remain problematic. Without such estimates, overly conservative choices for significance must be enforced.
A simple statistical method for estimating variances from microarray control data which does not require multiple replicates is presented. Comparison of datasets from two commercial entities using this difference-averaging method demonstrates that the standard deviation of the signal scales at a level intermediate between the signal intensity and its square root. Application of the method to a dataset related to the β-catenin pathway yields a larger number of biologically reasonable genes whose expression is altered than the ratio method.
The difference-averaging method enables determination of variances as a function of signal intensities by averaging over the entire dataset. The method also provides a platform-independent view of important statistical properties of microarray data.
Comparative gene expression using microarrays plays an increasingly important role in analysis of biological control mechanisms, phenotyping, cell classification, and a variety of other applications (see [1–3], for review). There are now several commercial purveyors of microarray equipment and reagents, as well as companies that perform experiments on a contract basis. The output of microarray experiments typically consists of intensity measurements that are manipulated by scaling, background subtraction and other correction procedures, the details of which are often proprietary. In the case of experiments performed by contract, computer files are returned to the customer which contain lists of sequences, matched intensities and, in some instances, intensity ratios compared to internal references.
Representation of intensity data as ratios has considerable value for biologists. Seldom are absolute levels of mRNA expression of interest. Rather, the relative changes in expression of individual genes between two samples are more informative. But the use of ratios to characterize differences may have drawbacks. For example, estimates of significance are more difficult to determine. In addition, potential improvements in estimates of high signals compared to low signals may not be adequately represented by a ratio. Thus, a conservative evaluation of confidence levels is called for, limiting useful information that may be extracted from the intensity data.
Analytical approaches that rely on signal subtraction may have certain advantages [4–6]. Variances for such difference values are the sums of the variances for the individual measurements. Therefore, a simple, general method to estimate variance at specific signal intensities may permit more effective data analysis. In the absence of replicates, the intensity distributions for individual genes (and, therefore, the distribution mean (μ) and variance (σ2)) are unknown. For an experiment that examines two hybridizations that involve the same RNA sample, each gene is matched with two intensities, S1 and S2. The μ's for the distributions of each S are not known and range widely, reflecting low or high gene expression. However, for properly handled data, μ for the difference, S1 - S2 = ΔS, should be approximately zero for all intensity levels. If the intensities are distributed normally, then the ΔS distribution can be used in principle to determine σ2 for a given signal (σ S 2) because σ S 2 = σΔS2/2; that is, the variance of the difference of two identical distributions is twice the variance of the individual function.
Here we present results from investigation of two commercial microarray platforms, the Affymetrix system and the Incyte Genomics system. We show that the control data for both platforms, after proprietary manipulation procedures, are well behaved using some statistical measures. We further show that intensity differences can be used to supply variance estimates of these differences in a simple way, without the need for multiple replicate datasets. These variances can be applied to non-control data to estimate p values for specific changes in gene expression. The method described here is intended to supplement more elaborate analytical procedures that depend on larger numbers of independent observations.
Reproducibility of intensity measurements
The two platforms were examined independently and all comparisons were limited to datasets within one platform. For the Affymetrix experiments, RNA samples from D. melanogastor heads were used; for Incyte experiments, human HEK293 cell line RNA (see Methods). Intensity measurements from the mRNA samples were analyzed using the data provided by the commercial groups; no scaling or background subtraction was performed other than the proprietary modifications to intensities carried out prior to distribution of the files to the customer.
Distributions of intensity differences
Variance of intensity difference as a function of signal intensity
In many cases, measurement accuracies (and the signficance of individual measurements) are related in a straightforward way to the magnitude of the signal. Photon counts are such a case, and it is expected that higher signal strengths (intensities) should have smaller percentage errors compared to weaker signals. If such measurement errors could be estimated, confidence values could be calculated for specific differences.
The fine structure of the signal vs. variance plots was also interesting. In both cases the plots were noisy, though the Affymetrix data was smoother than the Incyte data. Fine-structure patterns were not preserved among different experiments using one platform and, therefore, probably do not reflect any fundamental trend for a given platform (see Discussion). Quality of the Incyte data was arguably poorer than the Affymetrix data, based on the analysis of signal vs. σ2. However, other Incyte datasets displayed smoother behavior, though the general form of the intensity vs. σ2 data was similar (see Discussion).
Application of the method
Genes whose expression differs significantly from controls
In using intensity ratios for comparison of gene expression levels, a choice must be made about data presentation. In particular, it is necessary to confront a mathematical problem inherent in expression ratios. Consider a comparison of two RNA samples, A and B, and associated intensity values for RNAs, an and bn. If, after background subtraction, one RNA (bi) is undetectable while ai yields a measurable signal, then ai / bi approaches infinity. Moreover, for ai > bi, ai / bi > 1; but for ai < bi, 0 < ai / bi < 1, resulting in an asymmetric representation of comparative intensity values. These situations can be rectified in arbitrary ways; for instance, by setting backgrounds to a non-zero value and using a logarithm transform of the ratio. Alternatively, introduction of a discontinuity along with sign/ratio inversion (e.g., for ai > bi, use ai / bi; for ai < bi, use -bi / ai) solves the problem. Such considerations do not arise if intensity differences are used.
Several algorithms have been developed to analyze microarray data both in the private and public sectors. Some attention has been devoted to the problem of estimating backgrounds, scaling data for dataset merging, and determining statistical significance of intensity differences or ratios. The most sophisticated published treatments for determination of significance use Bayesian probability methods, maximum likelihood procedures, or multiparameter fitting to analyze samples of gene expression data [4–6]. As pointed out by others, the lack of replicates of individual sequence intensities blocks the most direct route to estimates of variance. However, certain statistical treatments can provide estimates for the means and variances of small numbers of replicates (e.g., 4 in the case of Long et al. ) within microarray datasets. However, collection of even a few repeats can be technically impractical or prohibitively expensive. Nevertheless, repetition is the most reliable way to collect statistical information and the strategy described here is intended to supplement, not replace, such replication experiments.
The method of microarray data analysis presented here is platform-independent and can be used to explore data quality and to estimate variances. The calculated variances provide a statistical basis for interpreting significance of intensity differences. The relationship between intensity values (S) and σ S 2 is discerned using an averaging procedure that groups sets of points of related intensities to estimate σΔS2, and hence, σ S 2. The use of all data together to fit a function argues for a high degree of robustness in the procedure that should resist fluctuations in the intensity measurements caused by noise. Furthermore, local averaging of intensity differences and calculation of σ S 2 coupled with a global data fit provides the most reliable estimates for σ S 2 as a function of signal, assuming that variance is mainly a result of signal intensity and is not otherwise sequence-specific. The general smoothness of the plots supports this view. Within a given platform, the σ2 plots have similar general shapes that can be fitted well by low order polynomials. Higher order polynomials yield better R2 values, but are probably not justified due to the noise in the data. A general function is desired, not one that fits the idiosyncratic noise in a specific pairwise comparison. Despite its presumed robustness, the difference-averaging method is expected to perform better with higher quality data as input.
The estimators derived from this type of analysis can be used to evaluate the significance of intensity differences in non-control datasets, because they relate the magnitude of the intensity value, S, reported in the data file to the standard deviation of an intensity distribution with μ = S (Fig. 7). For example, a given pair of signals corresponding to measurements for one sequence (e.g., a gene) can be compared statistically by computing the variance for each signal using the function derived from the fitted data (e.g., a cubic polynomial in Fig. 5). The variance of the difference (S1-S2) is simply σ12 + σ22. The observation that distributions of intensity differences were approximately Gaussian in form suggests that z values may provide reliable estimates for p values. Only two replicates are required. In the case of two-dye experimental platforms such as Incyte, inter-chip variances can be estimated in the manner shown here. Intra-chip variances can be estimated in a similar way, using two dyes on a single chip. The inter-chip estimators for σ2 provide a conservative statistical measure of significance if applied to inter-chip experiments and provide a justification for determining significance of differences in merged datasets. Due to its simplicity, the approach does not require a sophisticated understanding of statistical principles. Furthermore, the entire analytical procedure can be performed inside a spreadsheet application such as Microsoft Excel.
The two microarray platforms tested here use different types of sequence on the chip. Affymetrix employs sets of oligonucleotides to interrogate a specific RNA. Incyte uses a single spotted DNA of substantially greater length. In contrast to Incyte data, the final intensity measurement in the Affymetrix case is a function of individual intensities derived from the oligonucleotides. The data from Affymetrix chips were, at least superficially, well behaved. The data from both platforms, especially Incyte's, appear to include noise other than the counting statistics type. The scatter does not scale with σ2. This behavior was not restricted to the dataset that was the principal subject of the present study; all other Incyte datasets analyzed, including those with a much smoother appearance, displayed similar dependence (not shown). Such noise may originate from variability in the spotting or detection.
We have presented a simple analytical approach based on differences in signal intensities and averaging for analysis of microarray data which can be performed without advanced statistics or specialized software. This procedure provides insight into the properties of the data under consideration, as well as estimates of variances as a function of signal strength. Application of the method gives statistical support for a more aggressive interpretation of microarray intensity data.
Materials and Methods
RNA for the Affymetrix experiment consisted of poly(A)+ RNA isolated from heads from fruit flies that overexpressed Fos and Jun. RNA for the Incyte platform experiments was poly(A)+ RNA prepared from HEK293 human cells that expressed a mutant (S45Y) β-catenin oncogene [11, 12]. Other samples (e.g., head poly(A)+ RNA from fly heads that expressed dominant-negative Fos and Jun molecules; and poly(A)+ RNA from HEK293 cells that expressed either a cadherin or Tcf inhibitor of the β-catenin pathway were also collected and examined [[7, 8]; Kamb, unpublished].
All analytical procedures and graphing was performed using Microsoft Excel 2000; no other software packages or custom code was used.
Summary of Affymetrix expression data for the 4 experimental sets
We are grateful to P. Etter for help with the Affymetrix experiments and Dr. M. Pierce for preparing RNA for the Incyte experiments. This work was supported in part by grant #1RO1 DA 13337-01.
- Brown PO, Botstein D: Exploring the new world of the genome with DNA microarrays. Nat Genet. 1999, 21: 33-37. 10.1038/4462.View ArticleGoogle Scholar
- Lipshutz RJ, Fodor SP, Gingeras TR, Lockhart DJ: High density synthetic oligonucleotide arrays. Nat Genet. 1999, 21: 20-24. 10.1038/4447.View ArticleGoogle Scholar
- He YD, Friend SH: Microarrays-the 21st century divining rod?. Nat Med. 2001, 7: 658-659. 10.1038/89022.View ArticleGoogle Scholar
- Ideker T, Thorsson V, Siegel AF, Hood LE: Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. J Comput Biol. 2000, 7: 805-817. 10.1089/10665270050514945.View ArticleGoogle Scholar
- Kerr MK, Martin M, Churchill GA: Analysis of variance for gene expression microarray data. J Comput Biol. 2000, 7: 819-837. 10.1089/10665270050514954.View ArticleGoogle Scholar
- Long AD, Mangalam HJ, Chan BY, Tolleri L, Hatfield GW, Baldi P: Improved statistical inference from dna microarray data using analysis of variance and a Bayesian statistical framework. Analysis of gene expression in Escherichia coli K12. J Biol Chem. 2001, 276: 19937-19944. 10.1074/jbc.M010192200.View ArticleGoogle Scholar
- Behrens J, von Kries JP, Kuhl M, Bruhn L, Wedlich D, Grosschedl R, Birchmeier W: Functional interaction of beta-catenin with the transcription factor LEF-1. Nature. 1996, 382: 638-42. 10.1038/382638a0.View ArticleGoogle Scholar
- Sadot E, Simcha I, Shtutman M, Ben-Ze'ev A, Geiger B: Inhibition of beta-catenin-mediated transactivation by cadherin derivatives. Proc Natl Acad Sci U S A. 1998, 95: 15339-15344. 10.1073/pnas.95.26.15339.View ArticleGoogle Scholar
- Tetsu O, McCormick F: Beta-catenin regulates expression of cyclin D1 in colon carcinoma cells. Nature. 1999, 398: 422-426. 10.1038/18884.View ArticleGoogle Scholar
- Ferrell JE, Machleder EM: The biochemical basis of an all-or-none cell fate switch in Xenopus oocytes. Science. 1998, 280: 895-898. 10.1126/science.280.5365.895.View ArticleGoogle Scholar
- Baeg GH, Matsumine A, Kuroda T, Bhattacharjee RN, Miyashiro I, Toyoshima K, Akiyama T: The tumour suppressor gene product APC blocks cell cycle progression from G0/G1 to S phase. EMBO J. 1995, 14: 5618-5625.Google Scholar
- Munemitsu S, Albert I, Souza B, Rubinfeld B, Polakis P: Regulation of intracellular beta-catenin levels by the adenomatous polyposis coli (APC) tumor-suppressor protein. Proc Natl Acad Sci U S A. 1995, 92: 3046-3050.View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.