Keywords
RNA sequencing, data analysis, gene expression
This article is included in the Bioconductor gateway.
This article is included in the Bioinformatics gateway.
RNA sequencing, data analysis, gene expression
RNA-sequencing (RNA-seq) has become the primary technology for gene expression profiling and the detection of differentially expressed genes between two or more conditions is one of the most commonly asked questions using this approach. The edgeR1 and limma packages2 available from the Bioconductor project3 offer a well-developed suite of statistical methods for dealing with this type of analysis. In this article, we describe an edgeR - limma workflow for analysing RNA-seq data that takes gene-level counts as its input, and moves through pre-processing and exploratory data analysis before obtaining lists of differentially expressed (DE) genes and gene signatures. This analysis is enhanced through the use of interactive graphics from the Glimma package4, that allows for a more detailed exploration of the data at both the sample and gene-level than is possible using static R plots.
The experiment analysed in this workflow is from Sheridan et al. (2015)5 and consists of three cell populations (basal, luminal progenitor (LP) and mature luminal (ML)) sorted from the mammary glands of female virgin mice, each profiled in triplicate. RNA samples were sequenced across three batches on an Illumina HiSeq 2000 to obtain 100 base-pair single-end reads. The analysis outlined in this article assumes that reads obtained from an RNA-seq experiment have been aligned to an appropriate reference genome and summarised into counts associated with gene-specific regions. In this instance, reads were aligned to the mouse reference genome (mm10) using the R based pipeline available in the Rsubread package (specifically the align function6 followed by featureCounts7 for gene-level summarisation based on the in-built mm10 RefSeq-based annotation). Count data for these samples can be downloaded from the Gene Expression Omnibus (GEO) http://www.ncbi.nlm.nih.gov/geo/ using GEO Series accession number GSE63310. Further information on experimental design and sample preparation is also available from GEO under this accession number.
To get started with this analysis, download the file GSE63310_RAW.tar available online from http://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE63310&format=file, and extract the relevant files from this archive. Each of these text files contains the raw gene-level counts for a given sample. Note that our analysis only includes the basal, LP and ML samples from this experiment (see associated file names below).
files<- c("GSM1545535_10_6_5_11.txt", "GSM1545536_9_6_5_11.txt", "GSM1545538_purep53.txt","GSM1545539_JMS8-2.txt", "GSM1545540_JMS8-3.txt","GSM1545541_JMS8-4.txt", "GSM1545542_JMS8-5.txt","GSM1545544_JMS9-P7c.txt", "GSM1545545_JMS9-P8c.txt") read.delim(files[1], nrow=5)
## EntrezID GeneLength Count ## 1 497097 3634 1 ## 2 100503874 3259 0 ## 3 100038431 1634 0 ## 4 19888 9747 0 ## 5 20671 3130 1
Whilst each of the nine text files can be read into R separately and combined into a matrix of counts, edgeR offers a convenient way to do this in one step using the readDGE function. The resulting DGEList-object contains a matrix of counts with 27,179 rows associated with unique Entrez gene identifiers (IDs) and nine columns associated with the individual samples in the experiment.
library(limma) library(edgeR) x <- readDGE(files, columns=c(1,3)) class(x)
## [1] "DGEList" ## attr(,"package") ## [1] "edgeR"
dim(x)
## [1] 27179 9
If the counts from all samples were stored in a single file, the data can be read into R and then converted into a DGEList-object using the DGEList function.
For downstream analysis, sample-level information related to the experimental design needs to be associated with the columns of the counts matrix. This should include experimental variables, both biological and technical, that could have an effect on expression levels. Examples include cell type (basal, LP and ML in this experiment), genotype (wild-type, knock-out), phenotype (disease status, sex, age), sample treatment (drug, control) and batch information (date experiment was performed if samples were collected and analysed at distinct time points) to name just a few.
Our DGEList-object contains a samples data frame that stores both cell type (or group) and batch (sequencing lane) information, each of which consists of three distinct levels. Note that within x$samples, library sizes are automatically calculated for each sample and normalisation factors are set to 1. For simplicity, we remove the GEO sample IDs (GSM*) from the column names of our DGEList-object x.
samplenames <- substring(colnames(x), 12, nchar(colnames(x))) samplenames
## [1] "10_6_5_11" "9_6_5_11" "purep53" "JMS8-2" "JMS8-3" "JMS8-4" "JMS8-5" ## [8] "JMS9-P7c" "JMS9-P8c"
colnames(x) <- samplenames
group <- as.factor(c("LP", "ML", "Basal", "Basal", "ML", "LP", "Basal", "ML", "LP")) x$samples$group <- group lane <- as.factor(rep(c("L004","L006","L008"), c(3,4,2))) x$samples$lane <- lane x$samples
## files group lib.size norm.factors lane ## 10_6_5_11 GSM1545535_10_6_5_11.txt LP 32863052 1 L004 ## 9_6_5_11 GSM1545536_9_6_5_11.txt ML 35335491 1 L004 ## purep53 GSM1545538_purep53.txt Basal 57160817 1 L004 ## JMS8-2 GSM1545539_JMS8-2.txt Basal 51368625 1 L006 ## JMS8-3 GSM1545540_JMS8-3.txt ML 75795034 1 L006 ## JMS8-4 GSM1545541_JMS8-4.txt LP 60517657 1 L006 ## JMS8-5 GSM1545542_JMS8-5.txt Basal 55086324 1 L006 ## JMS9-P7c GSM1545544_JMS9-P7c.txt ML 21311068 1 L008 ## JMS9-P8c GSM1545545_JMS9-P8c.txt LP 19958838 1 L008
A second data frame named genes in the DGEList-object is used to store gene-level information associated with rows of the counts matrix. This information can be retrieved using organism specific packages such as Mus.musculus8 for mouse (or Homo.sapiens9 for human) or the biomaRt package10,11 which interfaces the Ensembl genome databases in order to perform gene annotation. The type of information that can be retrieved includes gene symbols, gene names, chromosome names and locations, Entrez gene IDs, Refseq gene IDs and Ensembl gene IDs to name just a few. biomaRt primarily works off Ensembl gene IDs, whereas Mus.musculus packages information from various sources and allows users to choose between many different gene IDs as the key. The Entrez gene IDs available in our dataset were annotated using the Mus.musculus package to retrieve associated gene symbols and chromosome information.
library(Mus.musculus) geneid <- rownames(x) genes <- select(Mus.musculus, keys=geneid, columns=c("SYMBOL", "TXCHROM"), keytype="ENTREZID") head(genes)
## ENTREZID SYMBOL TXCHROM ## 1 497097 Xkr4 chr1 ## 2 100503874 Gm19938 <NA> ## 3 100038431 Gm10568 <NA> ## 4 19888 Rp1 chr1 ## 5 20671 Sox17 chr1 ## 6 27395 Mrpl15 chr1
As with any gene ID, Entrez gene IDs may not map one-to-one to the gene information of interest. It is important to check for duplicated gene IDs and resolve them to ensure the gene order is consistent between our annotation and the counts in the DGEList-object.
dup <- genes$ENTREZID[duplicated(genes$ENTREZID)] genes[genes$ENTREZID %in% dup,][1:10,]
## ENTREZID SYMBOL TXCHROM ## 5360 100316809 Mir1906-1 chr12 ## 5361 100316809 Mir1906-1 chrX ## 7128 218963 Gm1821 chr11 ## 7129 218963 Gm1821 chr14 ## 9564 12228 Btg3 chr16 ## 9565 12228 Btg3 chr17 ## 11351 433182 Eno1b chr4 ## 11352 433182 Eno1b chr18 ## 11544 100217457 Snord58b chr14 ## 11545 100217457 Snord58b chr18
mat <- match(geneid, genes$ENTREZID) genes <- genes[mat,] genes[genes$ENTREZID %in% dup,][1:5,]
## ENTREZID SYMBOL TXCHROM ## 5360 100316809 Mir1906-1 chr12 ## 7128 218963 Gm1821 chr11 ## 9564 12228 Btg3 chr16 ## 11351 433182 Eno1b chr4 ## 11544 100217457 Snord58b chr14
The match-ing process above takes care of the duplication by keeping the first occurrence of each gene ID. Now that non-unique gene IDs have been resolved, the data frame of gene annotations is added to the data object:
x$genes <- genes
Now our data is packaged neatly in a DGEList-object containing raw count data with associated sample information and gene annotations.
x
## An object of class "DGEList" ## $samples ## files group lib.size norm.factors lane ## 10_6_5_11 GSM1545535_10_6_5_11.txt LP 32863052 1 L004 ## 9_6_5_11 GSM1545536_9_6_5_11.txt ML 35335491 1 L004 ## purep53 GSM1545538_purep53.txt Basal 57160817 1 L004 ## JMS8-2 GSM1545539_JMS8-2.txt Basal 51368625 1 L006 ## JMS8-3 GSM1545540_JMS8-3.txt ML 75795034 1 L006 ## JMS8-4 GSM1545541_JMS8-4.txt LP 60517657 1 L006 ## JMS8-5 GSM1545542_JMS8-5.txt Basal 55086324 1 L006 ## JMS9-P7c GSM1545544_JMS9-P7c.txt ML 21311068 1 L008 ## JMS9-P8c GSM1545545_JMS9-P8c.txt LP 19958838 1 L008 ## ## $counts ## Samples ## Tags 10_6_5_11 9_6_5_11 purep53 JMS8-2 JMS8-3 JMS8-4 JMS8-5 JMS9-P7c JMS9-P8c ## 497097 1 2 342 526 3 3 535 2 0 ## 100503874 0 0 5 6 0 0 5 0 0 ## 100038431 0 0 0 0 0 0 1 0 0 ## 19888 0 1 0 0 17 2 0 1 0 ## 20671 1 1 76 40 33 14 98 18 8 ## 27174 more rows … ## ## $genes ## ENTREZID SYMBOL TXCHROM ## 1 497097 Xkr4 chr1 ## 2 100503874 Gm19938 <NA> ## 3 100038431 Gm10568 <NA> ## 4 19888 Rp1 chr1 ## 5 20671 Sox17 chr1 ## 27174 more rows …
For differential expression and related analyses, gene expression is rarely considered at the level of raw counts since libraries sequenced at a greater depth will result in higher counts. Rather, it is common practice to transform raw counts onto a scale that accounts for such library size differences. Popular transformations include counts per million (CPM), log2-counts per million (log-CPM), reads per kilobase of transcript per million (RPKM), and fragments per kilobase of transcript per million (FPKM).
In our analyses, CPM and log-CPM transformations are used regularly although they do not account for feature length differences which RPKM and FPKM values do. Whilst RPKM and FPKM values can just as well be used, CPM and log-CPM values can be calculated using a counts matrix alone and will suffice for the type of comparisons we are interested in. Differential expression analyses look at gene expression differences between conditions, rather than comparing expression across multiple genes or drawing conclusions on absolute levels of expression. In other words, gene lengths remain constant for comparisons of interest and any observed differences are a result of changes in condition rather than changes in gene length.
Here raw counts are converted to CPM and log-CPM values using the cpm function in edgeR, where log-transformations use a prior count of 0.25 to avoid taking the log of zero. RPKM values are just as easily calculated as CPM values using the rpkm function in edgeR if gene lengths are available.
cpm <- cpm(x) lcpm <- cpm(x, log=TRUE)
All datasets will include a mix of genes that are expressed and those that are not expressed. Whilst it is of interest to examine genes that are expressed in one condition but not in another, some genes are unexpressed throughout all samples. In fact, 19% of genes in this dataset have zero counts across all nine samples.
table(rowSums(x$counts==0)==9)
## ## FALSE TRUE ## 22026 5153
Genes that are not expressed at a biologically meaningful level in any condition should be discarded to reduce the subset of genes to those that are of interest, and to reduce the number of tests carried out downstream when looking at differential expression. Upon examination of log-CPM values, it can be seen that a large proportion of genes within each sample is unexpressed or lowly-expressed (Figure 1A). Using a nominal CPM value of 1 (which is equivalent to a log-CPM value of 0) a gene is deemed to be expressed in a given sample if its transformed count is above this threshold, and unexpressed otherwise. Genes must be expressed in at least one group (or roughly any three samples across the entire experiment) to be kept for downstream analysis.
keep.exprs <- rowSums(cpm>1)>=3 x <- x[keep.exprs,, keep.lib.sizes=FALSE] dim(x)
## [1] 14165 9
Using this criterion, the number of genes is reduced to approximately half the number that we started with (14,165 genes, Figure 1B). Note that subsetting the entire DGEList-object removes both the counts as well as the associated gene information. Code to produce Figure 1 is given below.
library(RColorBrewer) nsamples <- ncol(x) col <- brewer.pal(nsamples, "Paired") par(mfrow=c(1,2)) plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.21), las=2, main="", xlab="") title(main="A. Raw data", xlab="Log-cpm") abline(v=0, lty=3) for (i in 2:nsamples){ den <- density(lcpm[,i]) lines(den$x, den$y, col=col[i], lwd=2) } legend("topright", samplenames, text.col=col, bty="n") lcpm <- cpm(x, log=TRUE) plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.21), las=2, main="", xlab="") title(main="B. Filtered data", xlab="Log-cpm") abline(v=0, lty=3) for (i in 2:nsamples){ den <- density(lcpm[,i]) lines(den$x, den$y, col=col[i], lwd=2) } legend("topright", samplenames, text.col=col, bty="n")
During the sample preparation or sequencing process, external factors that are not of biological interest can affect the gene expression levels of individual samples. For example, samples processed in the first batch of an experiment may have been sequenced more deeply than samples processed in a second batch. It is assumed that all samples should have a similar range and distribution of expression values. Normalisation is generally required to ensure that the expression distributions of each sample are similar across the entire experiment.
Any plot showing the per sample expression distributions, such as a density or boxplot, is useful in determining whether any samples are dissimilar to others. Distributions of log-CPM values are similar throughout all samples within this dataset (Figure 1B).
Nonetheless, normalisation by the method of trimmed mean of M-values (TMM)12 is performed using the calcNormFactors function in edgeR. The normalisation factors calculated here are used as a scaling factor for the library sizes. When working with DGEList-objects, these normalisation factors are automatically stored in x$samples$norm.factors. For this dataset the effect of TMM-normalisation is mild, as evident in the magnitude of the scaling factors, which are all relatively close to 1.
x <- calcNormFactors(x, method = "TMM") x$samples$norm.factors
## [1] 0.896 1.035 1.044 1.041 1.032 0.922 0.984 1.083 0.979
To give a better visual representation of the effects of normalisation, the data was duplicated then adjusted so that the counts of the first sample are reduced to 5% of their original values, and in the second sample they are inflated to be 5-times larger.
x2 <- x x2$samples$norm.factors <- 1 x2$counts[,1] <- ceiling(x2$counts[,1]*0.05) x2$counts[,2] <- x2$counts[,2]*5
Figure 2 shows the expression distribution of samples for unnormalised and normalised data, where distributions are noticeably different pre-normalisation and are similar post-normalisation. Here the first sample has a small TMM scaling factor of 0.05, whereas the second sample has a large scaling factor of 6.13 – neither values are close to 1.
par(mfrow=c(1,2)) lcpm <- cpm(x2, log=TRUE) boxplot(lcpm, las=2, col=col, main="") title(main="A. Example: Unnormalised data",ylab="Log-cpm") x2 <- calcNormFactors(x2) x2$samples$norm.factors
## [1] 0.0547 6.1306 1.2293 1.1705 1.2149 1.0562 1.1459 1.2613 1.1170 lcpm <- cpm(x2, log=TRUE) boxplot(lcpm, las=2, col=col, main="") title(main="B. Example: Normalised data",ylab="Log-cpm")
In our opinion, one of the most important exploratory plots to examine for gene expression analyses is the multidimensional scaling (MDS) plot, or similar. The plot shows similarities and dissimilarities between samples in an unsupervised manner so that one can have an idea of the extent to which differential expression can be detected before carrying out formal tests. Ideally, samples would cluster well within the primary condition of interest, and any sample straying far from its group could be identified and followed up for sources of error or extra variation. If present, technical replicates should lie very close to one another.
Such a plot can be made in limma using the plotMDS function. The first dimension represents the leading-fold-change that best separates samples and explains the largest proportion of variation in the data, with subsequent dimensions having a smaller effect and being orthogonal to the ones before it. When experimental design involves multiple factors, it is recommended that each factor is examined over several dimensions. If samples cluster by a given factor in any of these dimensions, it suggests that the factor contributes to expression differences and is worth including in the linear modelling. On the other hand, factors that show little or no effect may be left out of downstream analysis.
In this dataset, samples can be seen to cluster well within experimental groups over dimension 1 and 2, and then separate by sequencing lane (sample batch) over dimension 3 (Figure 3). Keeping in mind that the first dimension explains the largest proportion of variation in the data, notice that the range of values over the dimensions become smaller as we move to higher dimensions. Whilst all samples cluster by groups, the largest transcriptional difference is observed between basal and LP, and basal and ML over dimension 1. For this reason, it is expected that pairwise comparisons between cell populations will result in a greater number of DE genes for comparisons involving basal samples, and relatively small numbers of DE genes when comparing ML to LP. In other datasets, samples that do not cluster by their groups of interest may also show little or no evidence of differential expression in the downstream analysis.
To create the MDS plots, different colour groupings are assigned to factors of interest. Dimensions 1 and 2 are examined using the color grouping defined by cell types.
lcpm <- cpm(x, log=TRUE) par(mfrow=c(1,2)) col.group <- group levels(col.group) <- brewer.pal(nlevels(col.group), "Set1") col.group <- as.character(col.group) col.lane <- lane levels(col.lane) <- brewer.pal(nlevels(col.lane), "Set2") col.lane <- as.character(col.lane) plotMDS(lcpm, labels=group, col=col.group) title(main="A. Sample groups")
Dimensions 3 and 4 are examined using the colour grouping defined by sequencing lanes (batch).
plotMDS(lcpm, labels=lane, col=col.lane, dim=c(3,4)) title(main="B. Sequencing lanes")
Alternatively, the Glimma package offers the convenience of an interactive MDS plot where multiple dimensions can be explored. The glMDSPlot function generates an html page (that is opened in a browser if launch=TRUE) with an MDS plot in the left panel and a barplot showing the proportion of variation explained by each dimension in the right panel. Clicking on the bars of the bar plot changes the pair of dimensions plotted in the MDS plot, and hovering over the individual points reveals the sample label. The colour scheme can be changed as well to highlight cell population or sequencing lane (batch). An interactive MDS plot of this dataset can be found at http://bioinf.wehi.edu.au/folders/limmaWorkflow/glimma-plots/MDS-Plot.html.
library(Glimma) glMDSPlot(lcpm, labels=paste(group, lane, sep="_"), groups=x$samples[,c(2,5)], launch=FALSE)
In this study, it is of interest to see which genes are expressed at different levels between the three cell populations profiled. In our analysis, linear models are fitted to the data with the assumption that the underlying data is normally distributed. To get started, a design matrix is set up with both the cell population and sequencing lane (batch) information.
design <- model.matrix(~0+group+lane) colnames(design) <- gsub("group", "", colnames(design)) design
## Basal LP ML laneL006 laneL008 ## 1 0 1 0 0 0 ## 2 0 0 1 0 0 ## 3 1 0 0 0 0 ## 4 1 0 0 1 0 ## 5 0 0 1 1 0 ## 6 0 1 0 1 0 ## 7 1 0 0 1 0 ## 8 0 0 1 0 1 ## 9 0 1 0 0 1 ## attr(,"assign") ## [1] 1 1 1 2 2 ## attr(,"contrasts") ## attr(,"contrasts")$group ## [1] "contr.treatment" ## ## attr(,"contrasts")$lane ## [1] "contr.treatment"
For a given experiment, there are usually several equivalent ways to set up an appropriate design matrix. For example, ~0+group+lane removes the intercept from the first factor, group, but an intercept remains in the second factor lane. Alternatively, ~group+lane could be used to keep the intercepts in both group and lane. Understanding how to interpret the coefficients estimated in a given model is key here. We choose the first model for our analysis, as setting up model contrasts is more straight forward in the absence of an intercept for group. Contrasts for pairwise comparisons between cell populations are set up in limma using the makeContrasts function.
contr.matrix <- makeContrasts( BasalvsLP = Basal-LP, BasalvsML = Basal-ML, LPvsML = LP-ML, levels = colnames(design)) contr.matrix
## Contrasts ## Levels BasalvsLP BasalvsML LPvsML ## Basal 1 1 0 ## LP -1 0 1 ## ML 0 -1 -1 ## laneL006 0 0 0 ## laneL008 0 0 0
A key strength of limma’s linear modelling approach, is the ability accommodate arbitrary experimental complexity. Simple designs, such as the one in this workflow, with cell type and batch, through to more complicated factorial designs and models with interaction terms can be handled relatively easily. Where experimental or technical effects can be modelled using a random effect, another possibility in limma is to estimate correlations using duplicateCorrelation by specifying a block argument for both this function and in the lmFit linear modelling step.
It has been shown that for RNA-seq count data, the variance is not independent of the mean13 – this is true of raw counts or when transformed to log-CPM values. Methods that model counts using a Negative Binomial distribution assume a quadratic mean-variance relationship. In limma, linear modelling is carried out on the log-CPM values which are assumed to be normally distributed and the mean-variance relationship is accommodated using precision weights calculated by the voom function.
When operating on a DGEList-object, voom converts raw counts to log-CPM values by automatically extracting library sizes and normalisation factors stored in the object. For a matrix of counts, the method of normalisation can be specified within voom using the normalize.method (by default no normalisation is performed).
The mean-variance relationship of log-CPM values for this dataset is shown in Figure 4A. Typically, the “voom-plot” shows a decreasing trend between the means and variances resulting from a combination of technical variation in the sequencing experiment and biological variation amongst the replicate samples from different cell populations. Experiments with high biological variation usually result in flatter trends, where variance values plateau at high expression values. Experiments with low biological variation tend to result in sharp decreasing trends.
Moreover, the voom-plot provides a visual check on the level of filtering performed upstream. If filtering of lowly-expressed genes is insufficient, a drop in variance levels can be observed at the low end of the expression scale due to very small counts. If this is observed, one should return to the earlier filtering step and increase the expression threshold applied to the dataset.
Where sample-level variation is evident from earlier inspections of the MDS plot, the voomWithQualityWeights function can be used to simultaneously incorporate sample-level weights together with the abundance dependent weights estimated by voom14. For an example of this, see Liu et al. (2016)15.
v <- voom(x, design, plot=TRUE) v
## An object of class "EList"
## $genes
## ENTREZID SYMBOL TXCHROM
## 1 497097 Xkr4 chr1
## 6 27395 Mrpl15 chr1
## 7 18777 Lypla1 chr1
## 9 21399 Tcea1 chr1
## 10 58175 Rgs20 chr1
## 14160 more rows ...
##
## $targets
## files group lib.size norm.factors lane
## 10_6_5_11 GSM1545535_10_6_5_11.txt LP 29409426 0.896 L004
## 9_6_5_11 GSM1545536_9_6_5_11.txt ML 36528591 1.035 L004
## purep53 GSM1545538_purep53.txt Basal 59598629 1.044 L004
## JMS8-2 GSM1545539_JMS8-2.txt Basal 53382070 1.041 L006
## JMS8-3 GSM1545540_JMS8-3.txt ML 78175314 1.032 L006
## JMS8-4 GSM1545541_JMS8-4.txt LP 55762781 0.922 L006
## JMS8-5 GSM1545542_JMS8-5.txt Basal 54115150 0.984 L006
## JMS9-P7c GSM1545544_JMS9-P7c.txt ML 23043111 1.083 L008
## JMS9-P8c GSM1545545_JMS9-P8c.txt LP 19525423 0.979 L008
##
## $E
## Samples
## Tags 10_6_5_11 9_6_5_11 purep53 JMS8-2 JMS8-3 JMS8-4 JMS8-5 JMS9-P7c JMS9-P8c
## 497097 -4.29 -3.87 2.52 3.30 -4.48 -3.99 3.31 -3.20 -5.29
## 27395 3.88 4.40 4.52 4.57 4.32 3.79 3.92 4.35 4.13
## 18777 4.71 5.56 5.40 5.17 5.63 5.08 5.08 5.76 5.15
## 21399 4.78 4.74 5.37 5.13 4.85 4.94 5.16 5.04 4.99
## 58175 3.94 3.29 -1.77 -1.88 2.99 3.36 -2.11 3.14 3.52
## 14160 more rows ...
##
## $weights
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] 1.18 1.18 20.53 20.98 1.77 1.22 21.13 1.18 1.18
## [2,] 20.88 26.56 31.60 29.66 32.56 26.75 29.79 21.90 17.15
## [3,] 28.00 33.70 34.85 34.46 35.15 33.55 34.52 31.44 25.23
## [4,] 27.67 29.60 34.90 34.43 34.84 33.16 34.49 26.14 24.50
## [5,] 19.74 18.66 3.18 2.63 24.19 24.01 2.65 13.15 14.35
## 14160 more rows ...
##
## $design
## Basal LP ML laneL006 laneL008
## 1 0 1 0 0 0
## 2 0 0 1 0 0
## 3 1 0 0 0 0
## 4 1 0 0 1 0
## 5 0 0 1 1 0
## 6 0 1 0 1 0
## 7 1 0 0 1 0
## 8 0 0 1 0 1
## 9 0 1 0 0 1
## attr(,"assign")
## [1] 1 1 1 2 2
## attr(,"contrasts")
## attr(,"contrasts")$group
## [1] "contr.treatment"
##
## attr(,"contrasts")$lane
## [1] "contr.treatment"
Note that the other data frames stored within the DGEList-object that contain gene- and sample-level information, are retained in the EList-object v created by voom. The v$genes data frame is equivalent to x$genes, v$targets is equivalent to x$samples, and the expression values stored in v$E is analogous to x$counts, albeit on a transformed scale. In addition to this, the voom EList-object has a matrix of precision weights v$weights and stores the design matrix in v$design.
Linear modelling in limma is carried out using the lmFit and contrasts.fit functions which can be used for both microarray and RNA-seq data and fit a separate model to the expression values for each gene. Next, empirical Bayes moderation is carried out by the eBayes function which borrows information across all the genes to obtain more precise estimates of gene-wise variability16. The model’s residual variances are plotted against average expression values in Figure 4B. It can be seen from this plot that the variance is no longer dependent on the mean expression level.
vfit <- lmFit(v, design) vfit <- contrasts.fit(vfit, contrasts=contr.matrix) efit <- eBayes(vfit) plotSA(efit)
For a quick look at differential expression levels, the number of significantly up- and down-regulated genes can be summarised in a table. Significance is defined using an adjusted p-value cutoff that is set at 5% by default. For the comparison between expression levels in basal and LP, 4,127 genes are found to be down-regulated in basal relative to LP and 4,298 genes are up-regulated in basal relative to LP – a total of 8,425 DE genes. A total of 8,510 DE genes are found between basal and ML (4,338 down- and 4,172 up-regulated genes), and a total of 5,340 DE genes are found between LP and ML (2,895 down- and 2,445 up-regulated). The larger numbers of DE genes observed for comparisons involving the basal population are consistent with our observations from the MDS plots.
summary(decideTests(efit))
## BasalvsLP BasalvsML LPvsML
## -1 4127 4338 2895
## 0 5740 5655 8825
## 1 4298 4172 2445
Some studies require more than an adjusted p-value cut-off. For a stricter definition on significance, one may require log-fold-changes (log-FCs) to be above a minimum value. The treat method17 can be used to calculate p-values from empirical Bayes moderated t-statistics with a minimum log-FC requirement. The number of differentially expressed genes are reduced to a total of 3,135 DE genes for basal versus LP, 3,270 DE genes for basal versus ML, and 385 DE genes for LP versus ML when testing requires genes to have a log-FC that is significantly greater than 1 (equivalent to a 2-fold difference between cell types on the original scale).
tfit <- treat(vfit, lfc=1) dt <- decideTests(tfit) summary(dt)
## BasalvsLP BasalvsML LPvsML
## -1 1417 1512 203
## 0 11030 10895 13780
## 1 1718 1758 182
Genes that are DE in multiple comparisons can be extracted using the results from decideTests, where 0s represent genes that are not DE, 1s represent genes that are up-regulated, and -1s represent genes that are down-regulated. A total of 2,409 genes are DE in both basal versus LP and basal versus ML (Figure 5), twenty of which are listed below. The write.fit function can be used to extract and write results for all three comparisons to a single output file.
de.common <- which(dt[,1]!=0 & dt[,2]!=0) length(de.common)
## [1] 2409
head(tfit$genes$SYMBOL[de.common], n=20)
## [1] "Xkr4" "Rgs20" "Cpa6" "Sulf1" "Eya1"
## [6] "Msc" "Sbspon" "Pi15" "Crispld1" "Kcnq5"
## [11] "Ptpn18" "Arhgef4" "2010300C02Rik" "Aff3" "Npas2"
## [16] "Tbc1d8" "Creg2" "Il1r1" "Il18r1" "Il18rap"
vennDiagram(dt[,1:2], circle.col=c("turquoise", "salmon")) write.fit(tfit, dt, file="results.txt")
The top DE genes can be listed using topTreat for results using treat (or topTable for results using eBayes). By default topTreat arranges genes from smallest to largest adjusted p-value with associated gene information, log-FC, average log-CPM, moderated t-statistic, raw and adjusted p-value for each gene. The number of top genes displayed can be specified, where n=Inf includes all genes. Genes Cldn7 and Rasef are amongst the top DE genes for both basal versus LP and basal versus ML.
basal.vs.lp <- topTreat(tfit, coef=1, n=Inf) basal.vs.ml <- topTreat(tfit, coef=2, n=Inf) head(basal.vs.lp)
## ENTREZID SYMBOL TXCHROM logFC AveExpr t P.Value adj.P.Val
## 12759 12759 Clu chr14 -5.44 8.86 -33.4 3.99e-10 2.7e-06
## 53624 53624 Cldn7 chr11 -5.51 6.30 -32.9 4.50e-10 2.7e-06
## 242505 242505 Rasef chr4 -5.92 5.12 -31.8 6.06e-10 2.7e-06
## 67451 67451 Pkp2 chr16 -5.72 4.42 -30.7 8.01e-10 2.7e-06
## 228543 228543 Rhov chr2 -6.25 5.49 -29.5 1.11e-09 2.7e-06
## 70350 70350 Basp1 chr15 -6.07 5.25 -28.6 1.38e-09 2.7e-06
head(basal.vs.ml)
## ENTREZID SYMBOL TXCHROM logFC AveExpr t P.Value adj.P.Val
## 242505 242505 Rasef chr4 -6.51 5.12 -35.5 2.57e-10 1.92e-06
## 53624 53624 Cldn7 chr11 -5.47 6.30 -32.5 4.98e-10 1.92e-06
## 12521 12521 Cd82 chr2 -4.67 7.07 -31.8 5.80e-10 1.92e-06
## 71740 71740 Pvrl4 chr1 -5.56 5.17 -31.3 6.76e-10 1.92e-06
## 20661 20661 Sort1 chr3 -4.91 6.71 -31.2 6.76e-10 1.92e-06
## 15375 15375 Foxa1 chr12 -5.75 5.63 -28.3 1.49e-09 2.28e-06
To summarise results for all genes visually, mean-difference plots, which display log-FCs from the linear model fit against the average log-CPM values can be generated using the plotMD function, with the differentially expressed genes highlighted.
plotMD(tfit, column=1, status=dt[,1], main=colnames(tfit)[1], xlim=c(-8,13))
Glimma extends this functionality by providing an interactive mean-difference plot via the glMDPlot function. The output of this function is an html page, with summarised results in the left panel (similar to what is output by plotMD), and the log-CPM values from individual samples in the right panel, with a table of results below the plots (Figure 6). This interactive display allows the user to search for particular genes based on their Gene symbol, which is not possible in a static R plot. The glMDPlot function is not limited to mean-difference plots, with a default version allowing a data frame to be passed with the user able to select the columns of interest to plot in the left panel.
glMDPlot(tfit, coef=1, status=dt[,1], main=colnames(tfit)[1], counts=x$counts, samples=colnames(x), anno=x$genes, groups=group, id.column="ENTREZID", display.columns=c("SYMBOL", "ENTREZID"), search.by="SYMBOL", launch=FALSE)
The mean-difference plot generated by the command above is available online (see http://bioinf.wehi.edu.au/folders/limmaWorkflow/glimma-plots/MD-Plot.html). The interactivity provided by the Glimma package allows additional information to be presented in a single graphical window. Glimma is implemented in R and Javascript, with the R code generating the data which is converted into graphics using the Javascript library D3 (https://d3js.org), with the Bootstrap library handling layouts and Datatables generating the interactive searchable tables. This allows plots to be viewed in any modern browser, which is convenient for including them as linked files from an Rmarkdown report of the analysis.
Plots shown previously include either all of the genes that are expressed in any one condition (such as the Venn diagram of common DE genes or mean-difference plot) or look at genes individually (log-CPM values shown in right panel of the interactive mean-difference plot). Heatmaps allow users to look at the expression of a subset of genes. This can give useful insight into the expression of individual groups and samples without losing perspective of the overall study when focusing on individual genes, or losing resolution when examining patterns averaged over thousands of genes at the same time.
A heatmap is created for the top 100 DE genes (as ranked by adjusted p-value) from the basal versus LP contrast using the heatmap.2 function from the gplots package (Figure 7). The heatmap correctly clusters samples into cell type and rearranges the order of genes to form blocks of similar expression. From the heatmap, we observe that the expression of ML and LP samples are very similar for the top 100 DE genes between basal and LP.
library(gplots) basal.vs.lp.topgenes <- basal.vs.lp$ENTREZID[1:100] i <- which(v$genes$ENTREZID %in% basal.vs.lp.topgenes) mycol <- colorpanel(1000,"blue","white","red") heatmap.2(v$E[i,], scale="row", labRow=v$genes$SYMBOL[i], labCol=group,
col=mycol, trace="none", density.info="none", margin=c(8,6), lhei=c(2,10), dendrogram="column")
We finish this analysis with gene set testing by applying the camera method18 to the c2 gene signatures from the Broad Institute's MSigDB c2 collection19 that have been adapted for mouse and are available as Rdata objects from http://bioinf.wehi.edu.au/software/MSigDB/.
load(url("http://bioinf.wehi.edu.au/software/MSigDB/mouse_c2_v5p1.rdata")) idx <- ids2indices(Mm.c2,id=rownames(v)) cam.BasalvsLP <- camera(v,idx,design,contrast=contr.matrix[,1], inter.gene.cor=0.01) head(cam.BasalvsLP,5)
## NGenes Correlation Direction PValue ## LIM_MAMMARY_STEM_CELL_UP 739 0.01 Up 1.13e-18 ## LIM_MAMMARY_STEM_CELL_DN 630 0.01 Down 1.57e-15 ## ROSTY_CERVICAL_CANCER_PROLIFERATION_CLUSTER 163 0.01 Up 1.44e-13 ## SOTIRIOU_BREAST_CANCER_GRADE_1_VS_3_UP 183 0.01 Up 2.18e-13 ## LIM_MAMMARY_LUMINAL_PROGENITOR_UP 87 0.01 Down 6.73e-13 ## FDR ## LIM_MAMMARY_STEM_CELL_UP 5.36e-15 ## LIM_MAMMARY_STEM_CELL_DN 3.71e-12 ## ROSTY_CERVICAL_CANCER_PROLIFERATION_CLUSTER 2.26e-10 ## SOTIRIOU_BREAST_CANCER_GRADE_1_VS_3_UP 2.58e-10 ## LIM_MAMMARY_LUMINAL_PROGENITOR_UP 6.36e-10
cam.BasalvsML <- camera(v,idx,design,contrast=contr.matrix[,2], inter.gene.cor=0.01) head(cam.BasalvsML,5)
## NGenes Correlation Direction PValue ## LIM_MAMMARY_STEM_CELL_UP 739 0.01 Up 5.09e-23 ## LIM_MAMMARY_STEM_CELL_DN 630 0.01 Down 5.13e-19 ## LIM_MAMMARY_LUMINAL_MATURE_DN 166 0.01 Up 8.88e-16 ## LIM_MAMMARY_LUMINAL_MATURE_UP 180 0.01 Down 6.29e-13 ## ROSTY_CERVICAL_CANCER_PROLIFERATION_CLUSTER 163 0.01 Up 1.68e-12 ## FDR ## LIM_MAMMARY_STEM_CELL_UP 2.40e-19 ## LIM_MAMMARY_STEM_CELL_DN 1.21e-15 ## LIM_MAMMARY_LUMINAL_MATURE_DN 1.40e-12 ## LIM_MAMMARY_LUMINAL_MATURE_UP 7.43e-10 ## ROSTY_CERVICAL_CANCER_PROLIFERATION_CLUSTER 1.59e-09
cam.LPvsML <- camera(v,idx,design,contrast=contr.matrix[,3], inter.gene.cor=0.01) head(cam.LPvsML,5)
## NGenes Correlation Direction PValue FDR ## LIM_MAMMARY_LUMINAL_MATURE_UP 180 0.01 Down 8.50e-14 3.40e-10 ## LIM_MAMMARY_LUMINAL_MATURE_DN 166 0.01 Up 1.44e-13 3.40e-10 ## LIM_MAMMARY_LUMINAL_PROGENITOR_UP 87 0.01 Up 3.84e-11 6.05e-08 ## REACTOME_RESPIRATORY_ELECTRON_TRANSPORT 91 0.01 Down 2.66e-08 3.14e-05 ## NABA_CORE_MATRISOME 222 0.01 Up 4.43e-08 4.19e-05
The camera function performs a competitive test to assess whether the genes in a given set are highly ranked in terms of differential expression relative to genes that are not in the set. It uses limma’s linear model framework, taking both the design matrix and contrast matrix (if present) and accommodates the observational-level weights from voom in the testing procedure. After adjusting the variance of the resulting gene set test statistic by a variance inflation factor, that depends on the gene-wise correlation (which can be estimated from the data or specified by the user) and the size of the set, a p-value is returned and adjusted for multiple testing. Here we set inter.gene.cor to 0.01 for all gene sets which produces a less conservative test that performs well in many practical situations.
This experiment is the RNA-seq equivalent of Lim et al. (2010)20, who used Illumina microarrays to profile the same sorted cell populations, so it is reassuring to see the gene signatures from this earlier publication coming up at the top of the list for each contrast. We make a barcodeplot of the Lim et al. (2010) Mature Luminal gene sets (Up and Down) in the LP versus ML contrast. Note that these sets go in the opposite direction in our dataset due to our parameterization which compares LP against ML rather than the other way around (if the contrast were reversed, the directions would be consistent).
barcodeplot(efit$t[,3], index=idx$LIM_MAMMARY_LUMINAL_MATURE_UP, index2=idx$LIM_MAMMARY_LUMINAL_MATURE_DN, main="LPvsML")
This RNA-seq workflow makes use of various packages available from version 3.4 of the Bioconductor project, running on R21 version 3.3.0 or higher. Besides the packages highlighted in this article (limma, Glimma and edgeR) it requires a number of other software packages, including gplots22 and RColorBrewer and the gene annotation package Mus.musculus. This document was compiled using knitr23–25. Version numbers for all packages used are shown below. Code to perform this analysis is available from the Supplementary Materials website at http://bioinf.wehi.edu.au/folders/limmaWorkflow/ and as a Bioconductor workflow from http://www.bioconductor.org/help/workflows/.
sessionInfo()
## R version 3.3.0 (2016-05-03) ## Platform: x86_64-pc-linux-gnu (64-bit) ## Running under: CentOS release 6.4 (Final) ## ## locale: ## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8 ## [4] LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8 ## [7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C ## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C ## ## attached base packages: ## [1] parallel stats4 stats graphics grDevices utils datasets methods ## [9] base ## ## other attached packages: ## [1] gplots_3.0.1 Glimma_1.1.1 ## [3] RColorBrewer_1.1-2 Mus.musculus_1.3.1 ## [5] TxDb.Mmusculus.UCSC.mm10.knownGene_3.2.2 org.Mm.eg.db_3.3.0 ## [7] GO.db_3.3.0 OrganismDbi_1.14.0 ## [9] GenomicFeatures_1.24.0 GenomicRanges_1.24.0 ## [11] GenomeInfoDb_1.8.1 AnnotationDbi_1.34.3 ## [13] IRanges_2.6.0 S4Vectors_0.10.1 ## [15] Biobase_2.32.0 BiocGenerics_0.18.0 ## [17] edgeR_3.14.0 limma_3.28.5 ## ## loaded via a namespace (and not attached): ## [1] locfit_1.5-9.1 Rcpp_0.12.5 lattice_0.20-33 ## [4] gtools_3.5.0 Rsamtools_1.24.0 Biostrings_2.40.1 ## [7] plyr_1.8.3 chron_2.3-47 acepack_1.3-3.3 ## [10] RSQLite_1.0.0 evaluate_0.9 ggplot2_2.1.0 ## [13] highr_0.5.1 BiocInstaller_1.23.4 zlibbioc_1.18.0 ## [16] gdata_2.17.0 annotate_1.50.0 data.table_1.9.6 ## [19] R.utils_2.3.0 R.oo_1.20.0 rpart_4.1-10 ## [22] Matrix_1.2-6 splines_3.3.0 BiocParallel_1.6.2 ## [25] geneplotter_1.50.0 stringr_1.0.0 foreign_0.8-66 ## [28] RCurl_1.95-4.8 biomaRt_2.28.0 munsell_0.4.3 ## [31] rtracklayer_1.32.0 nnet_7.3-12 SummarizedExperiment_1.2.2 ## [34] gridExtra_2.2.1 Hmisc_3.17-4 XML_3.98-1.4 ## [37] GenomicAlignments_1.8.0 bitops_1.0-6 R.methodsS3_1.7.1 ## [40] grid_3.3.0 RBGL_1.48.0 xtable_1.8-2 ## [43] gtable_0.2.0 DBI_0.4-1 magrittr_1.5 ## [46] formatR_1.3 scales_0.4.0 KernSmooth_2.23-15 ## [49] graph_1.50.0 stringi_1.0-1 XVector_0.12.0 ## [52] genefilter_1.54.2 latticeExtra_0.6-28 Formula_1.2-1 ## [55] tools_3.3.0 DESeq2_1.12.2 survival_2.39-3 ## [58] colorspace_1.2-6 cluster_2.0.4 caTools_1.17.1 ## [61] knitr_1.12.3
This worked was funded by the National Health and Medical Research Council (NHMRC) (Fellowship GNT1058892 and Program GNT1054618 to GKS, Project GNT1050661 to MER and GKS and Fellowship GNT1104924 to MER), Victorian State Government Operational Infrastructure Support and Australian Government NHMRC IRIISS.
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
We thank Dr Julie Sheridan for generating this dataset and for advice on its analysis.
Views | Downloads | |
---|---|---|
F1000Research | - | - |
PubMed Central
Data from PMC are received and updated monthly.
|
- | - |
Competing Interests: Matthew E. Ritchie and I are both organisers of the Bioconductor Asia-Pacific Developer meeting, which is to be held in Brisbane in November 2016.
Competing Interests: No competing interests were disclosed.
Competing Interests: No competing interests were disclosed.
Alongside their report, reviewers assign a status to the article:
Invited Reviewers | |||
---|---|---|---|
1 | 2 | 3 | |
Version 3 (revision) 28 Dec 18 |
|||
Version 2 (revision) 30 Nov 16 |
|||
Version 1 17 Jun 16 |
read | read | read |
Provide sufficient details of any financial or non-financial competing interests to enable users to assess whether your comments might lead a reasonable person to question your impartiality. Consider the following examples, but note that this is not an exhaustive list:
Sign up for content alerts and receive a weekly or monthly email with all newly published articles
Already registered? Sign in
The email address should be the one you originally registered with F1000.
You registered with F1000 via Google, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Google account password, please click here.
You registered with F1000 via Facebook, so we cannot reset your password.
To sign in, please click here.
If you still need help with your Facebook account password, please click here.
If your email address is registered with us, we will email you instructions to reset your password.
If you think you should have received this email but it has not arrived, please check your spam filters and/or contact for further assistance.
Comments on this article Comments (0)