ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Research Article

Gene length and detection bias in single cell RNA sequencing protocols

[version 1; peer review: 4 approved]
PUBLISHED 28 Apr 2017
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

Abstract

Background: Single cell RNA sequencing (scRNA-seq) has rapidly gained popularity for profiling transcriptomes of hundreds to thousands of single cells. This technology has led to the discovery of novel cell types and revealed insights into the development of complex tissues. However, many technical challenges need to be overcome during data generation. Due to minute amounts of starting material, samples undergo extensive amplification, increasing technical variability. A solution for mitigating amplification biases is to include unique molecular identifiers (UMIs), which tag individual molecules. Transcript abundances are then estimated from the number of unique UMIs aligning to a specific gene, with PCR duplicates resulting in copies of the UMI not included in expression estimates.
Methods: Here we investigate the effect of gene length bias in scRNA-Seq across a variety of datasets that differ in terms of capture technology, library preparation, cell types and species.
Results: We find that scRNA-seq datasets that have been sequenced using a full-length transcript protocol exhibit gene length bias akin to bulk RNA-seq data. Specifically, shorter genes tend to have lower counts and a higher rate of dropout. In contrast, protocols that include UMIs do not exhibit gene length bias, with a mostly uniform rate of dropout across genes of varying length. Across four different scRNA-Seq datasets profiling mouse embryonic stem cells (mESCs), we found the subset of genes that are only detected in the UMI datasets tended to be shorter, while the subset of genes detected only in the full-length datasets tended to be longer.
Conclusions: We find that the choice of scRNA-seq protocol influences the detection rate of genes, and that full-length datasets exhibit gene-length bias. In addition, despite clear differences between UMI and full-length transcript data, we illustrate that full-length and UMI data can be combined to reveal the underlying biology influencing expression of mESCs.

Keywords

single cell RNA sequencing, unique molecular identifiers, gene length bias, gene detection rate, differential expression

Introduction

Single cell RNA-Seq (scRNA-Seq) has rapidly gained popularity as the primary tool to profile gene expression of hundreds to thousands of single cells. This new technology enables researchers to examine transcription at the resolution of a single cell in a high-throughput manner, and has led to the discovery of novel cell types and revealed insights into the development of complex tissues as well as differentiation lineages. With the promise of novel discoveries, this new technology has been embraced by the scientific community.

Many technical challenges need to be overcome during data generation, and technology for performing scRNA-Seq is advancing at a rapid rate. The original Fluidigm C1 system has a 96-well plate, which limits how many single cells researchers can practically handle in an experiment. However, depth of sequencing is only limited by cost, with a sequencing depth of around 2 million reads per cell recommended (Tung et al., 2016). Droplet based technology, such as InDrop (Klein et al., 2015), Drop-Seq (Macosko et al., 2015) and the more recent Chromium system from 10X Genomics (Zheng et al., 2016), are cost effective methods to obtain relatively shallow sequencing of thousands to tens of thousands of single cells in one run. Lower sequencing depth limits the complexity of the expression profile attained per cell, as only the most highly expressed genes will be observed, however, it may be the case that researchers combine deeper sequencing of fewer single cells with shallow sequencing of tens of thousands of cells to answer their scientific questions of interest.

Not only are there different technologies for capturing single cells, there are also differences in library preparation protocols, which aim to amplify and process the minute amounts of RNA from each cell. Most RNA-Seq library preparation protocols include enrichment of mRNA by either polyA pulldown or ribosomal depletion, followed by fragmentation and PCR amplification before sequencing. The extensive PCR amplification that is required for scRNA-Seq increases technical variability in the data by introducing amplification biases (Stegle et al., 2015). A solution for mitigating amplification biases is to include Unique Molecular Identifiers (UMIs), which are short (5–10bp) sequences ligated onto the 5’ end of the molecule prior to PCR amplification (Islam et al., 2014). Transcript abundances are then estimated from the number of reads with unique UMIs aligning to a specific gene. PCR duplicates resulting in copies of the UMI are therefore not included in expression estimates. While some protocols, such as those used with Fluidigm C1 (e.g SMARTer), need to be modified to include UMIs (Tung et al., 2016), some droplet based methods, for example the Chromium system (Zheng et al., 2016), always include UMIs in the chemistry. It is worth noting that, while mechanisms such as alternative splicing can be studied using full-length transcript protocols, this type of analysis is not possible with data generated with protocols that include UMIs.

Gene length bias is well understood in bulk RNA-seq data. When cDNAs are fragmented, long genes result in more fragments for the same number of transcripts, resulting in higher counts and more power to detect differential expression (Oshlack & Wakefield, 2009). As a result gene set testing is biased towards gene ontology categories containing longer genes (Young et al., 2010). While there is much in common between scRNA-Seq and bulk RNA-Seq data, modifications to the protocols such as amplification and the inclusion of UMIs, may highlight different biases in the data.

Here we investigate the effect of gene length bias in scRNA-Seq across a variety of datasets that differ in terms of capture technology, library preparation, cell types and species. As hypothesised, we find that scRNA-seq datasets that have been sequenced using a full-length transcript protocol exhibit gene length bias akin to bulk RNA-seq data. Specifically, shorter genes tend to have lower counts and a higher rate of dropout. In contrast, protocols that include UMIs do not exhibit gene length bias. UMI protocols reveal that shorter genes are as highly expressed as longer genes, and dropout is mostly uniform across genes of varying length. These effects mean that different protocols have the ability to detect a different subset of genes, with shorter genes detected more readily using UMI protocols and longer genes detected by full-length protocols.

Methods

Processing of full-length datasets

We processed three datasets through our pipeline developed for full-length data:

The quality of the raw sequencing reads was examined using FastQC (v0.11.4). They were checked for contamination by aligning a sample of reads to multiple reference genomes using FastQ Screen (v0.6.4.). Reads were aligned to the appropriate reference using STAR (v2.5.2a) (Dobin et al., 2013). For the mouse dataset, we used the mm10 version of the genome, using the chromFa.tar.gz file on http://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips, and for the human datasets we used the hg38 version of the genome, using the hg38.chromFa.tar.gz file on http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips. Reads were summarised across genes using featureCounts (v1.5.0-p3) (Liao et al., 2014), with GENCODE M9 annotation for mouse and GENCODE V22 annotation for human datasets. This pipeline was constructed in Bpipe (v0.9.9.3) (Sadedin et al., 2012), and a report summarising the steps produced using MultiQC (v0.8) (Ewels et al., 2016).

Gene filtering

Our gene filtering strategy was identical between datasets. Genes that had more than 90% zeroes across all cells, as well as ribosomal and mitochondrial genes, were filtered out. Genes that could not be annotated with an Entrez Gene ID were also removed in order to retain a set of well curated genes. Gene length information was taken as the sum of the exon lengths as outputted by the featureCounts software for the mm10 GENCODE VM4 annotation for all mouse datasets, and for all human datasets, we used the sum of exon lengths as outputted by featureCounts for the hg38 GENCODE V22 annotation. Genes that could not be annotated with gene length information were filtered out. We found that using these criteria helped reduce some of the variability in the datasets.

Processing of all datasets

Details of all datasets analysed in this study are listed in Supplementary Table 1.

Mouse embryonic stem cells, Kolodziejczyk et al., 2015, full-length. We downloaded the raw data from the ArrayExpress database under accession number E-MTAB-2600 and ran our full-length processing pipeline using the mm10 mouse genome to produce a counts matrix. We performed quality control on the cells and removed cells that had a dropout rate of greater than 80% and a library size of fewer than half a million. We calculated the proportion of sequencing reads taken up by the ERCC spike-ins and discarded three plates that had proportions of ERCC spike-ins that appeared excessive compared to the remaining plates. We performed gene filtering as described above. After cell and gene filtering, we were left with 530 cells and 12395 genes for further analysis.

Human primordial germ cells, Guo et al., 2015, full-length. We downloaded the processed data from Conquer. The data had been pseudo-aligned to the latest human reference genome, hg38, using the Salmon software tool, v0.6.0 (Patro et al., 2017). The data is also available under the GEO accession number GSE63818. There did not appear to be any spike-in controls for this dataset, hence filtering was performed on the dropout rate and total sequencing depth for each cell. Cells with more than 85% dropout and fewer than half a million sequencing reads were filtered out. After cell and gene filtering, there were 226 cells and 15837 genes for further analysis.

Human cerebral organoid cells, Camp et al., 2015, full-length. We downloaded the data from SRA under accession SRP066834, and ran our full-length processing pipeline to produce a counts matrix, using the hg38 human genome. We removed cells that had greater than 90% dropout, library size smaller than half a million as well as cells that had more than 20% of the sequencing taken up by ERCC controls. After cell and gene filtering, we had 494 cells and 11325 genes for further analysis.

Mouse embryonic stem cells, Grün et al., 2014, UMI. We downloaded the processed data from GEO under accession number GSE54695. The data was aligned to the mm10 mouse genome using BWA and transcript number estimated from UMI counts by the authors. We removed cells that had > 80% dropout, library size smaller than 10000, as well as cells that had more than 5% of the sequencing taken up by ERCC controls. After cell and gene filtering, there were 127 cells and 9962 genes for further analysis.

Human induced pluripotent stem cells, Tung et al., 2016, UMI. We downloaded the processed molecule counts and sample information from the authors’ Github repository (https://github.com/jdblischak/singleCellSeq). The data was aligned by the authors to the human genome hg19 using the Subjunc aligner (Liao et al., 2013). The data is also available under GEO accession GSE77288. We removed cells that had > 70% dropout, fewer than 30000 sequencing reads per cell, as well as cells that had more than 3% of the sequencing taken up by ERCC spike-ins. After cell and gene filtering, we had 671 cells and 11971 genes for further analysis.

Human K562 cells (lymphoblastoma culture), Klein et al., 2015, UMI. The processed molecule count data was downloaded from GEO under accession GSM1599500. The data was aligned to the hg19 human genome using Bowtie v0.12.0 (Langmead et al., 2009). Cells that had > 85% dropout, fewer than 10000 total sequencing reads, or an ERCC library size to total library size ratio > 0.01 were filtered out. After cell and gene filtering, we had 219 cells and 13418 genes for further analysis.

Mouse embryonic stem cells, Ziegenhain et al., 2016, UMI. We downloaded the molecule counts from GEO under accession GSE75790. The SCRB-Seq protocol, a 3’ digital gene expression RNA-Seq protocol, (Soumillon et al., 2014), was used to generate the libraries. The data was processed by the authors through a dropseq pipeline, which included alignment to the mm10 mouse genome using STAR v2.4.0 (Dobin et al., 2013). The cells all appeared good quality hence cell filtering wasn’t necessary. After gene filtering, we had 84 cells and 10519 genes for further analysis.

Mouse embryonic stem cells, Buettner et al., 2015, full-length. We downloaded the data from the European Nucleotide Archive, under accession PRJEB6989, and ran the data through our full-length pipeline, mapping to the mm10 mouse genome to produce a counts matrix. We filtered out cells with > 85% dropout and sequencing depth less than a million. After cell and gene filtering, we had 271 cells and 11700 genes for further analysis.

Combining mouse embryonic stem cell datasets

We combined the four different mouse embryonic stem cell datasets using the following approach. We performed gene and cell filtering on each dataset independently, and combined the datasets by taking the genes commonly detected across all four datasets (8678 genes, 1012 cells, each gene is detected in at least 10% of the cells for each dataset). This strategy ensured that the genes were detected in all four datasets, and hence larger datasets did not dominate gene filtering. It also ensured that the larger datasets did not dominate the principal components analysis.

Statistical analysis

All statistical analysis was performed in R-3.3.1, using the limma (Ritchie et al., 2015), edgeR (Robinson et al., 2010), scran (Lun et al., 2016) and scater (McCarthy et al., 2016) Bioconductor packages (Gentleman et al., 2004). The UMI dataset was normalised using scran prior to differential expression analysis, as it clearly showed composition bias. Differential expression analysis in the mESCs was performed using edgeR, specifying a log-fold-change cut-off of 1 for the full-length dataset, and 0.5 for the UMI dataset. GO analysis was performed with hypergeometric tests using the goana function in the Bioconductor R package limma (Ritchie et al., 2015). All scripts for analysing the datasets are available on the Oshlack lab Github page (https://github.com/Oshlack/GeneLengthBias-scRNASeq).

Results

Gene length bias is apparent in scRNA-Seq in non-UMI based protocols

Initially, we analysed three different datasets generated using full-length transcript protocols: mouse embryonic stem cells (Kolodziejczyk et al., 2015), human primordial germ cells (Guo et al., 2015) and human brain whole organoids (Camp et al., 2015). For a full list of the datasets analysed see Supplementary Table 1. Quality control of the single cells was performed and problematic cells filtered out (see methods), leaving 530 mouse embryonic stem cells, 226 human primordial germ cells and 494 human brain organoid cells. For each gene, the average log-counts, normalised for sequencing depth, and the proportion of zeroes across the cells (i.e. the dropout rate per gene) were calculated. Gene-wise abundances were estimated for all datasets by dividing the gene-level counts by gene length to obtain reads per kilobase per million (RPKM). In order to assess gene length bias, genes were assigned to 10 bins based on gene length, such that each bin had roughly 1000 genes. The results are summarised in the boxplots in Figure 1.

598ffe64-c67c-45c5-9605-aceb7841cf1a_figure1.gif

Figure 1. Gene length bias is present in non-UMI protocols.

Three different datasets were analysed: (ac) mouse embryonic stem cells, n=530 (Kolodziejczyk et al., 2015), (df) human primordial germ cells, n=226 (Guo et al., 2015), (gi) human brain whole organoids, n=494 (Camp et al., 2015). For all plots (ai), the x-axis shows 10 gene length bins all containing roughly equal numbers of genes. The left panel shows gene-wise average log counts, the middle panel shows proportion of zeroes in each gene (dropout rate per gene), and the right panel shows average log counts corrected for gene length (RPKM).

For all three full-length protocol datasets, shorter genes have lower count level expression proportions compared to longer genes, with a clear trend of increasing log-counts as gene length increases (Figure 1a, d, g). This was accompanied by a decreasing trend in the dropout rate per gene as gene length increased, highlighting the fact that shorter genes are more difficult to detect using full length protocols (Figure 1b, e, h). These trends are stronger for the human PGCs and human brain organoid datasets, while not as severe for the mouse ESCs. Calculating transcript abundance by dividing gene-level counts by gene length mostly removed the gene length bias for the human PGCs and brain organoid datasets (Figure 1f, i), however for the mouse ESCs calculating RPKMs appeared to induce a trend with gene length such that shorter genes appeared more highly expressed relative to the longer genes (Figure 1c).

UMI-based protocols do not suffer from gene length bias

We hypothesised that because UMI protocols tag each transcript molecule separately we would not see a similar gene length bias in these protocols. In order to assess gene length bias in scRNA-Seq datasets with included UMIs, we analysed three different datasets: mouse embryonic stem cells generated using a CEL-Seq protocol (Grün et al., 2014; Hashimshony et al., 2012), human induced pluripotent stem cells generated using a modified SMARTer protocol with the Fluidigm C1 system (Tung et al., 2016) and human leukemia cell line K562 cells using the CEL-Seq protocol with InDrop (Klein et al., 2015). After quality control and filtering of problematic cells, 127 single cells remained for the mouse embryonic stem cells, 671 for human induced pluripotent stem cells and 219 human K562 cells.

We found that for the human iPSCs and human K562 datasets, the average log-counts were fairly uniform across the 10 gene length bins, and for the mouse ESCs, the shorter genes appear to be more highly expressed than the longer genes (Figure 2a, d, g). Comparing medians, the dropout rate per gene is slightly lower for shorter genes in the mouse ESCs, while for the human iPSCs and K562 cells, the dropout is fairly uniform across the gene length bins, although slightly more variable for the shortest genes (Figure 2b, e, h). However, calculating RPKMs by dividing by gene length induces a clear trend with gene length where shorter genes appear to be more highly expressed relative to longer genes, with the median log RPKM decreasing with increasing gene length (Figure 2c, f, i).

598ffe64-c67c-45c5-9605-aceb7841cf1a_figure2.gif

Figure 2. Gene length bias is absent in UMI-based protocols.

Three different datasets were analysed: (ac) mouse embryonic stem cells n=127 (Grün et al., 2014), (df) human induced pluripotent stem cells n=671 (Tung et al., 2016), and (gi) human leukemia cell line K562 cells, n=219 (Klein et al., 2015). For all plots (ai), the x-axis shows 10 gene length bins all containing roughly equal numbers of genes. The left panel shows gene-wise average log counts, the middle panel shows proportion of zeroes in each gene (dropout rate per gene), and the right panel shows average log expression corrected for gene length (RPKM).

Comparing gene length bias between different mouse embryonic stem cell datasets

To ensure the gene length bias is not due to the specific biology of the different cell types, we analysed four different mouse embryonic stem cell datasets generated using both UMI and full-length transcript protocols (Buettner et al., 2015; Grün et al., 2014; Kolodziejczyk et al., 2015; Ziegenhain et al., 2016). When we combined all four datasets together (see methods) and performed principal components analysis, we noted that the cells clustered by dataset, with the UMI datasets on the left and full-length datasets on the right of the plot (Figure 3a). Interestingly, in principal components two and three, we saw some biological structure in the datasets emerging, with cells grown in different media clustering together (Figure 3b). In particular, three different datasets (two full-length, one UMI), grown in standard media with 2i inhibitors all cluster together on the left of the plot. This shows great promise for obtaining biologically interesting results from combining multiple datasets generated in separate labs using different technology.

598ffe64-c67c-45c5-9605-aceb7841cf1a_figure3.gif

Figure 3. Combining four mouse embryonic stem cell datasets.

Four different mouse embryonic stem cell datasets were combined, two full-length transcript (Buettner et al., 2015; Kolodziejczyk et al., 2015) and two UMI datasets (Grün et al., 2014; Ziegenhain et al., 2016). (a) Principal component analysis plot (coloured by dataset) shows the major source of variation between the cells is the dataset, with the UMI datasets on the left and the full-length datasets on the right. (b) Examining principal components two and three reveals that the next major source of variation in the data is the media in which cells are grown. In particular three datasets (two full-length and one UMI) which have cells grown in standard media with 2i inhibitors all cluster together on the left. J1, Rex1 and G4 refer to the mESC cell line. The Ziegenhain dataset has single cells profiled in two batches. (cd) Gene length bias is present in full-length mESC datasets; dotted grey line is the median log-count in the first gene length bin. (ef) Gene length bias is absent in UMI mESC datasets; dotted grey line is the median log-count in the first gene length bin.

In terms of the gene length bias across the multiple datasets, it is clear that data generated from full length protocols exhibit gene length bias, with shorter genes having lower average log-counts compared to longer genes (Figure 3c, d). This is not as pronounced compared to other full-length datasets (Figure 1d, g), however compared to the UMI mESC datasets it is quite noticeable. For the UMI datasets, the gene length bias is mostly uniform across the gene length bins, however the shortest genes in the first bin appear to have slightly higher average log-counts and are more variable compared to the longer genes (Figure 3e, f).

Detection differences in UMI and full-length mESC datasets

In order to investigate whether choice of protocol impacts which genes are detected, we compared genes detected in both UMI mESC datasets to genes detected in both full-length mESC datasets. Across all datasets, 13434 genes were detected in at least one of the four datasets. Across both UMI datasets, 8866 genes were detected with counts in at least 10% of the cells for each dataset. For the full-length datasets, 11328 genes were detected using the same criteria. The full-length datasets had much greater sequencing depth (median ~ 3million reads, Supplementary Table 1) and more cells compared to the UMI datasets (median ~33,000 reads, Supplementary Table 1), hence it is unsurprising that more genes are detected across both full-length datasets. However, there were 188 genes detected in the UMI datasets that were not detected in the full-length datasets (Figure 4a). The genes unique to the UMI datasets tended to be shorter compared to the gene lengths of the 2644 genes uniquely detected in the full-length datasets (Figure 4b, p-value=0.000297, Wilcoxon Rank Sum Test). The genes uniquely detected in either the full-length or UMI datasets tended to be lowly expressed, hence more difficult to detect in general (Supplementary Figure 1).

598ffe64-c67c-45c5-9605-aceb7841cf1a_figure4.gif

Figure 4. Detection differences in UMI and full-length mESC datasets.

(a) A Venn diagram comparing the number of genes detected in two UMI mESC datasets, with the number detected in the two full-length datasets. We find that while the majority of genes are detected in all datasets (n=8689), there are genes that are uniquely detected when using either a full-length or UMI protocol. (b) Density plots of gene length for the subsets of genes corresponding to the Venn diagram in (a). The uniquely detected genes for the UMI datasets (blue line) tend to be shorter than the uniquely detected genes in the full-length datasets (red line), p=0.000297. (c) A Venn diagram showing the number of enriched GO categories in the 188 genes unique to UMIs and the 2649 genes unique to the full-length protocols. This reveals that these genes interrogate different biology, with only 3 GO categories in common. (d) Density plots of average gene length for each GO category corresponding to the significantly enriched GO categories in (c). We assigned each GO category an average length by calculating the median of the lengths of all genes annotated to each GO category. While there is not a significant shift in location in the density plots we noted a much greater spread of median length in the enriched GO categories for the uniquely detected UMI genes, largely driven by the presence of GO categories that tend to have very short genes.

Comparing differential expression between two media (2i inhibitors versus serum) in one UMI dataset (Grün et al., 2014), revealed that 31% (59/188) of the uniquely detected genes were defined as significantly differentially expressed (total differentially expressed = 1641/9962, 16%). For a similar comparison in a full length dataset (2i inhibitors versus serum, Kolodziejczyk et al., 2015), 20% (531/2644) of the uniquely detected genes in full length datasets were significantly differentially expressed (total differentially expressed = 1653/12395, 13%). This highlights that protocol choice may impact ability to detect differential expression of some genes.

Examining which GO terms are over-represented for the 188 genes unique to the UMI dataset revealed that categories such as neural crest cell migration, negative regulation of megakaryocyte differentiation and stem cell development were among the 26 statistically significantly enriched categories (Supplementary Table 2). There were 4/26 GO categories with extremely short average gene length (<1000, median gene length across all GO categories = 4039), with the top two GO categories, “nucleosome” and “DNA packaging complex”, having median gene length in GO categories = 614, 706. However there were also statistically significant categories comprised of longer than average genes (13/26 categories with median length > 4039), indicating that pathways enriched for the unique UMI genes were not heavily biased towards categories only containing short genes.

For the full-length datasets, the GO categories that were significantly enriched (n=111) were different to those pathways enriched for the unique UMI genes, with only 3 GO categories overlapping (Figure 4c, Supplementary Table 3). GO categories such as those involved in plasma membrane, cell signalling, and ion and cation channel activity, were over-represented for the 2649 unique genes. While there were no significantly enriched GO categories that had extremely small average gene length (<1000), 14% (16/111) had median gene length < 2632 (the 5th percentile of median gene length across the GO categories). There was one statistically significant GO category with extremely large average gene length (> 10,000). Although there was no significant shift in median gene length of GO categories between the UMI and full-length GO categories, we noted that the variation in median GO length for the uniquely detected UMI genes was 3.5 times greater than for the uniquely detected full-length genes, largely driven by prevalence of very small sets (Figure 4d, p-value = 5.6x10-6, F-test).

Discussion

While single cell RNA-sequencing technology is advancing at a rapid rate and novel discoveries are being made, the datasets being generated have many technical biases. Here, we have investigated the role that gene length plays in protocols that include UMIs as well as full-length transcript protocols. Unsurprisingly, we find that for full-length protocols, genes that tend to be shorter have lower counts and a higher rate of dropout, while UMI based protocols have a more even distribution of dropout across genes of varying length. In addition, a UMI protocol is more likely to detect lowly expressed genes that are shorter compared to a full-length protocol, where lowly expressed genes that are longer are easier to detect (Supplementary figure 1). Of course, UMI protocols are unable to provide information on transcript structure such as which isoforms are expressed in a sample, and only provide overall gene level expression measures. Since UMI counts are already molecule counts, expression levels should be expressed as normalised counts (e.g. counts per million) rather than dividing by gene length to obtain RPKMs, as this latter measure will artificially inflate the expression estimates of shorter genes relative to longer genes.

While datasets generated using a UMI based protocol tend to have much lower sequencing depths, and hence lower counts, we found that in mESCs we were still able to detect uniquely expressed genes in the UMI datasets that were not detected in full-length datasets. However, a larger set of genes were detected in the mESC full-length datasets. Performing GO analysis on genes uniquely detected by each protocol revealed that they interrogate different biology, and hence the choice of protocol may affect which pathways can be studied. In particular, the genes unique to either the UMI or the full-length datasets appeared to be biologically relevant, as a subset were found to be significantly differentially expressed when comparing cells grown in two different media.

We combined four different datasets generated from mESCs that had strikingly different sequencing depths and protocols. Despite these differences, we found that we were able to recover biologically relevant structure. In particular, three different datasets (two full-length, one UMI), grown in standard media with 2i inhibitors, all cluster together when examining higher principal components. Although promising, the greatest source of variation between the cells was the dataset they belonged to, highlighting the known issues with large batch effects in scRNA-seq (Tung et al., 2016; Hicks et al., 2015). Hence, analysis methods including data cleaning and normalisation are crucial when combining datasets in order to extract biologically meaningful relationships.

Data and software availability

Latest source code for scripts used to analyse the datasets:

https://github.com/Oshlack/GeneLengthBias-scRNASeq

Information on the repositories and accession numbers of all datasets used in this study:

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 28 Apr 2017
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Phipson B, Zappia L and Oshlack A. Gene length and detection bias in single cell RNA sequencing protocols [version 1; peer review: 4 approved] F1000Research 2017, 6:595 (https://doi.org/10.12688/f1000research.11290.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 28 Apr 2017
Views
37
Cite
Reviewer Report 19 May 2017
Sam Buckberry, Harry Perkins Institute of Medical Research, University of Western Australia, Perth, WA, Australia;  Australian Research Council Centre of Excellence in Plant Energy Biology, University of Western Australia, Perth, WA, Australia 
Timothy Stuart, Australian Research Council Centre of Excellence in Plant Energy Biology, University of Western Australia, Perth, WA, Australia 
Approved
VIEWS 37
Phipson, Zappia and Oshlack present evidence against the existence of systematic gene length bias in single cell RNA sequencing experiments that use unique molecular identifiers (UMI). In contrast, methods that measure read counts across full length transcripts appear similar to ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Buckberry S and Stuart T. Reviewer Report For: Gene length and detection bias in single cell RNA sequencing protocols [version 1; peer review: 4 approved]. F1000Research 2017, 6:595 (https://doi.org/10.5256/f1000research.12181.r22437)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
51
Cite
Reviewer Report 17 May 2017
Wolfgang Huber, EMBL Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany 
Approved
VIEWS 51
The paper presents a very useful investigation of the dependence of detection efficiency in single cell RNA sequencing on (a) gene length and (b) certain choices in the experimental protocol, namely, shotgun sequencing of full transcripts versus transcript end sequencing ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Huber W. Reviewer Report For: Gene length and detection bias in single cell RNA sequencing protocols [version 1; peer review: 4 approved]. F1000Research 2017, 6:595 (https://doi.org/10.5256/f1000research.12181.r22371)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
34
Cite
Reviewer Report 10 May 2017
 Samuel W. Lukowski, Institute for Molecular Bioscience (IMB), The University of Queensland, St Lucia, Qld, Australia 
Approved
VIEWS 34
General comments
This is a well-written and concise study that reveals some very interesting results. Firstly, pertaining to the enrichment of biological processes in data processed using different protocols, the fact that the genes detected in full-length transcript and ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Lukowski  W. Reviewer Report For: Gene length and detection bias in single cell RNA sequencing protocols [version 1; peer review: 4 approved]. F1000Research 2017, 6:595 (https://doi.org/10.5256/f1000research.12181.r22438)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
57
Cite
Reviewer Report 10 May 2017
Charlotte Soneson, Institute of Molecular Life Sciences, University of Zurich (UZH), Zürich, Switzerland 
Approved
VIEWS 57
This is a nice evaluation of the extent of gene length and detection bias in single-cell RNA-seq data sets generated with different types of protocols. Overall, it is clearly written and the results are well presented and agree with expectations. ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Soneson C. Reviewer Report For: Gene length and detection bias in single cell RNA sequencing protocols [version 1; peer review: 4 approved]. F1000Research 2017, 6:595 (https://doi.org/10.5256/f1000research.12181.r22376)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 28 Apr 2017
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.