Next Article in Journal
Deciphering the Biological Activities of Dunaliella sp. Aqueous Extract from Stressed Conditions on Breast Cancer: from in Vitro to in Vivo Investigations
Previous Article in Journal
Regulation of Small GTPase Rab20 by Ikaros in B-Cell Acute Lymphoblastic Leukemia
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Evaluation of Seven Different RNA-Seq Alignment Tools Based on Experimental Data from the Model Plant Arabidopsis thaliana

Max Planck Institute of Molecular Plant Physiology, Am Mühlenberg 1, 14476 Potsdam, Germany
*
Author to whom correspondence should be addressed.
Int. J. Mol. Sci. 2020, 21(5), 1720; https://0-doi-org.brum.beds.ac.uk/10.3390/ijms21051720
Submission received: 30 January 2020 / Revised: 28 February 2020 / Accepted: 29 February 2020 / Published: 3 March 2020
(This article belongs to the Section Molecular Informatics)

Abstract

:
Quantification of gene expression is crucial to connect genome sequences with phenotypic and physiological data. RNA-Sequencing (RNA-Seq) has taken a prominent role in the study of transcriptomic reactions of plants to various environmental and genetic perturbations. However, comparative tests of different tools for RNA-Seq read mapping and quantification have been mainly performed on data from animals or humans, which necessarily neglect, for example, the large genetic variability among natural accessions within plant species. Here, we compared seven computational tools for their ability to map and quantify Illumina single-end reads from the Arabidopsis thaliana accessions Columbia-0 (Col-0) and N14. Between 92.4% and 99.5% of all reads were mapped to the reference genome or transcriptome and the raw count distributions obtained from the different mappers were highly correlated. Using the software DESeq2 to determine differential gene expression (DGE) between plants exposed to 20 °C or 4 °C from these read counts showed a large pairwise overlap between the mappers. Interestingly, when the commercial CLC software was used with its own DGE module instead of DESeq2, strongly diverging results were obtained. All tested mappers provided highly similar results for mapping Illumina reads of two polymorphic Arabidopsis accessions to the reference genome or transcriptome and for the determination of DGE when the same software was used for processing.

1. Introduction

Since the completion of the human genome project in 2003 [1], sequencing technologies have developed extraordinarily fast. The resulting data have revealed the astonishing complexity of genome architecture and transcriptome composition. In this context, transcript identification and the quantification of gene expression play crucial roles in connecting genomic information with phenotypic and biochemical measurements. These two key aspects of transcriptomics can be combined in a single high-throughput sequencing assay called RNA-Sequencing (RNA-Seq). This approach allows detailed transcript profiling including the identification of splicing-induced isoforms, nucleotide variation and post-transcriptional base modification [2].
While comparative studies of diverse read aligners have been performed using data with a corresponding reference genome or transcriptome [3,4,5,6,7] or de novo assembly [8,9,10], only little evaluation is available of the performance of read mappers for data generated from genotypes within a species showing sequence polymorphisms. In this study, the algorithmically different mappers bwa, CLC Genomics Workbench, HISAT2, kallisto, RSEM, salmon and STAR were used to map experimentally generated RNA-Seq data from the two natural accessions Columbia-0 (Col-0) and N14 of the higher plant Arabidopsis thaliana and to quantify the transcripts.
Bwa (Burrows–Wheeler-Alignment) was developed for mapping short DNA sequences against a reference genome and was extended for RNA-Seq data analysis. For indexing, the algorithm constructs a suffix array and Burrows–Wheeler-Transformation (BWT), and subsequently matches the sequences using a backward search [11]. STAR (Spliced Transcripts Alignment to a Reference) is a specialized tool for RNA-Seq reads that uses a seed-extension search based on compressed suffix arrays [12] and can detect splice-junctions. HISAT2 (Hierarchical Indexing for Spliced Alignment of Transcripts 2) is also a splice-aware aligner using a graph-based alignment approach (graph Ferragina Manzini index) that can align DNA and RNA sequences [13]. RSEM (RNA-Seq by Expectation Maximization) is a software package that quantifies transcript abundances. It can employ different pre-defined mappers such as bowtie2 and based on the generated alignments utilizes a maximum likelihood abundance estimation, the expectation-maximization algorithm, as the statistical model to quantify transcripts [14]. By contrast, salmon and kallisto are tools which do not perform a classical alignment of individual bases, but instead implement new strategies for RNA-Seq quantification. Salmon is based on the concept of quasi-mapping. It uses a suffix array that is BWT-indexed and searched by an FMD algorithm, allowing the discovery of shared substrings of any length between a read and the complete set of transcripts. Mismatches are handled with chains of maximally exact matches [15]. The concept of kallisto is based on pseudo-alignments. Pseudo-alignments define a relationship between a read and a set of compatible transcripts. This relationship is computed based on “mapping” the k-mers to paths in a transcript De Bruijn graph. As the pseudo-alignments are generated, equivalence classes are computed and used for the relative isoform quantification [16]. CLC read mapping utilizes an approach described by Mortazavi et al. [3] and is the only commercial tool with a graphical user interface included in our study.
Here, we compare the performance of these seven RNA-Seq mappers in the analysis of experimentally generated transcriptome data covering more than 30,000 Arabidopsis thaliana genes. The analysis compares alignment accuracy and quantification to enable comprehensive biological interpretation. For the RNA-Seq experiment, RNA was isolated from the higher plant Arabidopsis thaliana and the performance of each software was tested on 150 bp single-end reads from the two natural accessions Col-0 and N14 [17]. Mappability, raw count expression, overall similarity of the count distribution and differential gene expression (DGE) were analyzed to compare the mappers. The two splice-aware aligners HISAT2 and STAR were compared for accuracy by mapping the reads against the reference genome without an annotation. Additionally, an in silico approach to characterize the correctness of the mappers was performed (see Figure 1 for a schematic description of the analysis workflow).

2. Results

2.1. Mapping Statistics

After pre-processing, the resulting dataset contained 36 samples [17] with a sequencing data size ranging from about 21 to almost 33 × 106 reads (Table A1). In general, a high fraction of the total reads was mapped for both accessions. The mapping for Col-0 was slightly better than for N14 (Figure 2) with mapped reads between 95.9% (bwa) and 99.5% (STAR). For N14 between 92.4% (bwa) and 98.1% (STAR) of the reads were mapped against the respective reference sequence of Col-0 (Table A2).

2.2. Raw Count Distribution for Individual Samples

Raw count distributions between the mappers were investigated for both accessions. The unfiltered expression values for each mapper were plotted against each other and correlations computed. The results for one control sample of Col-0 (sample A) and N14 (sample B) are shown as an example (Figure 3). For Col-0 (Figure 3a), high correlation coefficients between 0.977 (STAR vs. CLC) and 0.997 (kallisto vs. salmon) were determined. For N14 (Figure 3b) the correlation coefficients ranged from 0.978 (CLC vs. HISAT2) to 0.996 (kallisto vs. salmon). Regarding the STAR and HISAT2 comparisons with all other mappers, a higher variance was observed in the direction of STAR and HISAT2 for lowly expressed genes.

2.3. Overall Comparison of the Mappers

For a more quantitative comparison, the raw counts generated by each mapper from all samples were compared against each other employing the Rv coefficient to quantify similarity. The raw count tables generated by the seven mappers have a high similarity indicated by Rv values close to 1 (Figure 4). Salmon and kallisto showed the highest similarity (Rv = 0.9999). CLC mapped slightly differently compared to bwa, HISAT2, kallisto, RSEM and salmon. However, it should be stressed that the raw count tables of all mappers were very similar; with 0.9804 as the lowest Rv value (CLC vs. HISAT2).
To investigate the effect of mapper choice on further statistical analysis, differentially expressed genes between control and cold acclimated conditions were determined [17]. In the read mapping steps, the aligners bwa, salmon and kallisto, using the transcriptomic reference, identified 32,243 expressed genes and thus 1,359 genes less than the other mappers with 33,602 genes each. This difference is due to the presence of non-coding RNAs such as transfer RNAs (tRNA) and micro RNAs (miRNA) in the genomic reference, which are absent from the transcriptomic reference that is based on poly-adenylated mRNAs. Prior to DGE analysis, transcript raw count tables were filtered to remove lowly expressed genes with less than five counts over all 36 samples, resulting in 23,903 (CLC) to 25,144 (RSEM) genes (Table 1). While this cut-off is admittedly arbitrary, most genes are removed with a cut-off of 1 read count (around 20%), while additional increases from 2 to 10 counts only reduce the number of genes by 2–0.3% per additional count, making the exact cut-off rather uncritical.
The percentage of overlapping DGE (control vs. cold acclimated) identified by each pair of mappers was analyzed in both directions using DESeq2 [18] in all cases and was plotted in an asymmetric matrix. For Col-0 (Figure 5a) kallisto and salmon yielded a large overlap of DGE of 98% (kallisto vs. salmon) and 97.7% (salmon vs. kallisto). For N14 (Figure 5b) slightly smaller overlaps were detected, but also here salmon and kallisto (97.6% and 96.4%) yielded the largest overlap. On the other hand, for both Col-0 and N14 the lowest overlaps were detected for bwa and STAR (93.4% and 92.1%, respectively). In general, a smaller overlap of DGE between 92% and 94% was identified for the comparisons of STAR and HISAT2 with the remaining five mappers.
DGE analysis [19,20] was additionally performed directly in the CLC software instead of using DESeq2. Using the standard significance levels for these two software packages (FDR < 0.1 and FDR < 0.05 for DESeq2 and CLC, respectively) this resulted in a much higher number of significantly differentially expressed genes for the two exemplary comparisons, detailed under Methods, compared to the DESeq2 analysis (Table 2). Also, there was only a limited overlap between the results of the two methods.
All mappers have different options to perform RNA-Seq quantification (Table 3). While most mappers can only use either a genome or a transcriptome reference, CLC, HISAT2 and STAR are able to use both types of reference sequences to align transcripts. Depending on the downstream analysis, it is essential which output each mapper provides. The classical alignment-based mappers bwa, CLC, HISAT2, RSEM and STAR provide an alignment output of the reads against the references, whereas salmon and kallisto only provide read quantifications. Nevertheless, kallisto offers a “pseudo-alignment” which can generate alignment files and salmon provides an option to re-quantify RNA-Seq reads using previously generated alignments against the transcriptome as obtained, for example, from STAR. Five out of the seven mappers generate transcript count tables. Only for HISAT2 and bwa additional tools have to be employed for this purpose.
For a more detailed investigation of the comparability of the outputs of different mappers, three of the seven mappers were analyzed in detail regarding read position on the reference sequence. The overlap of reads from one sample, which were mapped by HISAT2, bowtie2/RSEM and STAR, was determined and the positions of the mapped reads on the reference genome were compared. For Col-0 around 11.2 × 106 (Figure 6a) of around 24.9 × 106 mapped reads and for N14 around 10.5 × 106 reads (Figure 7a) of around 22.0 × 106 mapped reads were located on the same genomic position by all three mappers. For both accessions, bowtie2/RSEM showed a high number of reads mapping to a different position compared to HISAT2 and STAR. The number of reads with a unique position was between 20.4-fold and 10.9-fold higher for bowtie2/RSEM than for the other two mappers. Hence, the differences in read positions were determined, showing that most of these reads had a position that differed by one base pair. This is the result of soft clipping of the first or last base pair that is performed by HISAT2 and STAR. After adding the base pair back to the reads in HISAT2 and STAR, the overlap with RSEM increased to 20.8 × 106 reads for Col-0 (Figure 6b) and to 17.9 × 106 reads for N14 (Figure 7b). However, RSEM still produced between 18.4-fold and 3.8-fold more uniquely positioned reads than HISAT2 and STAR that cannot be explained by soft clipping.
Additionally, the two splice-aware aligners HISAT2 and STAR were tested for accuracy. Reads of all 36 biological samples were mapped against the reference genome sequence without annotation and reads on exons were determined with featureCounts (Table 4). For Col-0, 93% (STAR) and 94% (HISAT2), and for N14 around 91% (both mappers) of the primary alignments were mapped to known exons. A small fraction of reads were not assigned to the annotated exons due to no mapping, multimapping (i.e., mapping to more than one location) or mapping to intergenic regions.

2.4. Mapping of in Silico Generated Reads

To investigate whether mappers placed the mapped reads in the correct positions on the reference genome, the alignment results for in silico generated Col-0 RNA-Seq reads were analyzed using HISAT2, bowtie2/RSEM and STAR. All three mappers correctly positioned a high percentage (almost 99%) of the reads on the respective reference sequence (Table 5) for the primary alignments. Almost all remaining reads were mapped to the correct gene, but to a different transcript. Furthermore, only 0.001 to 0.03% of the reads were not mapped against the reference sequence for all mappers. A small number of reads mapped to intergenic regions for STAR and HISAT2 while for bowtie2/RSEM all reads were mapped on known genes. This derives from the fact that the used mapper bowtie2 is a splice unaware aligner that only maps against the transcriptome which was extracted from the genome reference. For the secondary alignments of HISAT2 and STAR, which only constituted 3.2% (STAR) and 3.8% (HISAT2) of the total alignments, 41.5% (HISAT2) and 36.9% (STAR) of the reads were correctly aligned. The majority of the secondary alignments, 55% for HISAT2 and 59% for STAR, mapped the reads to wrong positions, mostly to wrong (unrelated) or paralogous genes. For bowtie2/RSEM, almost 43% of these reads were mapped multiple times. Nearly 96% of these reads were mapped to the wrong gene.
For a better overview, the alignments were split into primary and secondary alignments. If a read maps multiple times against the reference, one mapping is defined as primary (underlying criteria depend on the mapper), while the other mappings are classified as secondary alignments.

3. Discussion

RNA-Seq data from the Arabidopsis thaliana accessions Col-0 and N14 were mapped with five alignment-based and two pseudo-alignment tools. For Col-0, high mappability of the 150 bp single-end Illumina reads to the Col-0 reference genome or transcriptome was found for all seven alignment tools, ranging from 95.9% (bwa) to 99.5% (STAR). A slightly smaller fraction of the reads obtained from N14 was mapped to the same references, ranging from 92.4% to 98.1%. The high quality of the reference sequences may contribute to the high fraction of mapped reads. For both accessions, bwa had the lowest performance and STAR the highest, although it should be stressed that differences in mappability for any sample between the mapping tools ranged only from 1% to 4%. Comparable performance of different mapping tools has been found in previous studies using either simulated reads or RNA-Seq reads obtained from various non-plant organisms [21,22,23,24,25]. On the other hand, another report showed that seed-extended approaches used by STAR performed better than e.g., exon-first approaches, when mapping reads from genetically polymorphic species [26].
Considering the two accessions separately, the high number of mapped reads for Col-0 is in agreement with the fact that the Col-0 reference sequences were used for mapping. However, a small number of reads was not mapped, potentially due to sequencing errors or to polymorphisms between the publicly available genome sequence and the genome of the Col-0 population used in our experiments. In this context it has to be kept in mind that the Col-0 populations used in various laboratories around the world have been separated for many generations and have very likely accumulated different mutations over time [27]. The generally lower percentage of mapped reads for N14 can be explained by natural variation between the accessions [28,29].
In addition to the percentage of mapped reads, the correctness of the mapping of reads to the reference genome or transcriptome is also of crucial importance to obtain reliable biological information from an RNA-Seq experiment. We found that HISAT2 and STAR had a high overlap of reads mapping to the same position in the reference sequence. The differences in read positions between bowtie2/RSEM and HISAT2/STAR originated to a large part from the soft-clipping, mostly of the first base of the reads, by both aligners. Soft-clipping can be turned off in both tools and that largely eliminates the observed differences. However, STAR has a higher tolerance for more soft-clipped and mismatched bases compared to HISAT2, which leads to a higher mapping rate for STAR and more unmapped reads for HISAT2 [24]. Also, in our analysis, STAR showed the highest fraction of mapped reads for both accessions among all compared mapping tools.
Our analysis of an in silico generated RNA-Seq data set also indicated that differences in the mapping quality between the three mappers are most likely due to their different ability to deal with mismatches. About 99% of the primary aligned reads were correctly positioned and the mappers showed the same performance when synthetic reads without any mismatches between read and reference sequences were used. This indicates that mapper performance may also depend on other factors, such as the complexity of the genome, read length and read quality [22]. The high fraction of correctly mapped reads may in part be due to the comparatively small genome of Arabidopsis with roughly 130 megabases and a low content of repetitive DNA sequences [30,31]. Regarding the secondary alignments, RSEM showed a high number of multimapped reads. The mapping for RSEM was performed with the mapper bowtie2 which searches for distinct, valid alignments for each read. As long as no upper limit is defined, bowtie2 will continue to look for all alignments that are as good or better for one read [32]. If the same read maps multiple times with the same quality string, the primary alignment is chosen randomly. The quantification algorithm of RSEM also depends on a high number of multi-mapped reads.
From a biological point of view, the quantification of gene expression is the most important part of an RNA-Seq experiment as researchers are mostly interested in the identification of differentially expressed genes, either between conditions or between genotypes. Correct mapping, as discussed above, is important to identify the correct genes as being differentially expressed. However, determining the correct read count numbers is of at least equal importance [33]. We have addressed this issue on two levels by comparing raw counts for the different genes or transcripts among the mapping tools and by comparing differentially expressed genes between plants grown under ambient and cold conditions identified by the different tools.
To investigate the results obtained by the different tools on the basis of raw counts, raw count numbers for each gene/transcript of a single sample from Col-0 and N14 each, generated by the different mappers, were plotted against each other. In general, high similarities among the mappers were observed, indicated by correlation coefficients close to 1. Similarly, when the raw counts were compared between mappers for all 36 biological samples generated in this study, Rv values close to 1 indicated a good correspondence in the expression levels computed by all seven software tools.
To analyse the effects of the mapping tools on the DGE analysis, we compared expression levels of control plants grown at ambient temperature with expression levels of plants that were exposed to 4 °C for three days (cold acclimation; compare [17] for a detailed description). Significantly differentially expressed genes were in all cases identified using the DESeq2 tool. The results showed that the raw counts generated by the different mappers resulted in clear differences in the number of significantly differentially expressed genes, with an overlap between mappers from 98.0% between kallisto and salmon in Col-0, and 92.1% between bwa and STAR in N14. The small sample size (three samples per condition and accession) may of course contribute to the uncertainty in identifying differentially expressed genes unambiguously [34]. However, this sample size is currently the standard in biological experiments and therefore our results give a realistic impression of what the user can expect from the performance of these tools.
Finally, the results from DESeq2 and from the DGE-pipeline of CLC were compared. Interestingly, CLC identified about 50% more differentially expressed genes than DESeq2. Since the same alignments for downstream analysis were used in both cases, this difference cannot originate from differences in the mapping and raw count generation. Therefore, the normalization (to one million counts) as well as the statistical tests used by CLC must have led to these differences. In a transcriptome analysis of mouse tissues, different DGE tools such as DESeq2 and CLC were compared, resulting in a better performance for DESeq2 compared to both CLC approaches [35]. The results were experimentally validated by qRT-PCR for 18 differentially expressed genes. For the CLC Baggerly approach large differences to qRT-PCR results were shown. The CLC EDGE approach yielded results that were more similar to the expression changes found by qRT-PCR and those detected by DESeq2. However, in our analysis, the CLC approaches yielded results that were largely different from those obtained by DESeq2.

4. Materials and Methods

4.1. Experimental Dataset

RNA samples of the Arabidopsis thaliana accessions Col-0 and N14 were used for RNA-Seq as described in detail recently [17]. Plant material was collected from three independent biological experiments resulting in a total of 36 samples. Samples were taken after 28 days of growth at 20 °C, after an additional three days of cold acclimation at 4 °C, after a subsequent seven day period at 20 °C and after a final three days at 4 °C. Additionally, samples from developmental control plants were taken after 35 days at 20 °C and a subsequent three days of cold acclimation at 4 °C (Details of all samples are given in Table A3). Library preparation and sequencing were performed by the Max-Planck Genome Centre Cologne, Germany (https://mpgc.mpipz.mpg.de/home/). Libraries were constructed with NEBNext Ultra Directional RNA Library Prep Kit for Illumina (New England Biolabs) including polyA enrichment. Illumina HiSeq 3000 technology was used for sequencing and yielded 150 base pair (bp) long single end reads. RNA-Seq raw counts are available at GEO [36] under the accession number GSE112225. A detailed biological analysis of the RNA-Seq data has been presented recently [17].

4.2. Mapping

Quality control of the raw reads and adapter trimming have been described previously [17]. The genomic FASTA sequence, cDNA and GTF annotation files of Arabidopsis thaliana Col-0 were downloaded from EnsemblPlants [37], version TAIR10, release 31 [38]. For read mapping bwa, CLC Genomics Workbench, HISAT2, kallisto, RSEM, salmon and STAR were used, employing pre-defined default parameters as far as possible (Table 6). Bwa aln was used for higher sensitivity and resulting sai files were converted into alignment files with bwa sampe. For kallisto and salmon it was necessary to set parameters for single-end data, define the estimated average read length as well as its estimated standard deviation. As index mode for salmon, --type quasi and a stranded library type were chosen. For expression quantification kallisto and salmon were run in quant mode. For STAR, 1-pass mode was used and additional parameters were defined to sort the alignments, to limit multi-mapping and to keep unmapped reads in the alignments as well as generating the gene count output. HISAT2 was run with default parameters, for index generation annotation was included (Table 6). All tools are freely available except the CLC Genomics Workbench which is a commercial tool that requires purchase of a license. For the mappings without annotation, HISAT2 was run with default parameters and without inserting the annotation into index generation. STAR was run in the 2-pass mode. To determine the reads mapping on exons, featureCounts v2.0.0 [39] (--primary -T 10 -f -O -F GTF -t exon -g gene_id) was used. Expression values were natively generated by five of the seven mappers. For bwa, samtools idxstat and for HISAT2, featureCounts v. 2.0.0 [39] were used to determine raw counts. For mapping statistics and further analysis of the alignment files, samtools v1.3 [40] was employed.

4.3. Comparison Based on Expression Values

For the comparison of the expression values (raw counts), samples A for Col-0 and B for N14 (grown under 20 °C control conditions; see Table A3) were chosen as an example. Raw counts were log2(counts + 1) transformed and results visualized with the R-package ggplot2 [42]. For an overall comparison the Rv coefficient [43] based on correlation matrices of the unfiltered raw count tables of samples A and B over all mappers was calculated using the R-package FactoMineR [44]. Spearman correlation was used for correlation analysis and the significance of the results was tested as described [45]. The results were visualized employing the R-package corrplot [46].

4.4. Differential Gene Expression

Prior to the differential gene expression (DGE) analysis, estimated read counts provided by RSEM, kallisto and salmon were rounded to obtain integer values. The resulting count tables for all mappers were filtered to discard lowly expressed genes by keeping only those with a sum greater than five counts per gene for all 36 samples. The DGE analysis was performed using the R-Package DESeq2 [18] including the normalization step. For CLC, alignment files were extracted and processed in the same way as for the other six mappers. Data was loaded with the function DESeqDataSetFromMatrix. Additional parameters for DGE were used as follows: test = “Wald”, fitType=”local” and including a batch effect correction in the design formula. For determining differentially expressed genes, a threshold p-value < 0.1 after false-discovery rate correction [47] and an absolute log2 fold change > 1 were used. Results of the comparison control vs. cold acclimation (Table A3) for Col-0 (samples A, M, Y vs. C, O, AA) and N14 (samples B, N, Z vs. D, P, AB) were investigated in detail.
Additionally, the built-in CLC workbench plugin for DGE was tested based on the mappings generated by CLC. Data was normalized “By totals” to a value of 1,000,000. Normalized data was used for determination of differentially expressed genes using the “Empirical analysis of DGE” [19] and “Baggerly’s test on proportions” [20] with multiple testing correction of the generated p-values [47]. Next to the control vs. cold acclimation comparisons described above, the cold acclimated developmental controls (samples I, U, AG for Col-0 and J, V, AH for N14) were compared to the second cold stress treatment (samples K, W, AI for Col-0 and L, X, AJ for N14; Table 1). The numbers of significantly differentially expressed genes (FDR p < 0.05, abs(log2 fold change) > 1) were compared with the results obtained by DESeq2 based on the STAR alignments.

4.5. Mapping of in Silico Generated Reads

To investigate the mapping quality of the tools, reads were generated in silico using the A. thaliana transcriptome (TAIR10) and applying a sliding window approach (window size: 150 bp, shift: 1 bp) resulting in approximately 58 × 106 in silico reads. Reads were mapped with HISAT2 (using the same parameters as above), RSEM and STAR (without --outFilterMultimapNmax and --alignSJDBoverhangMin). For identification, the in silico reads contained the transcript name and the position of the read on the transcript as identifiers. Additionally, the GTF annotation file was reduced to the exon entries and the overlap with the resulting alignment files of HISAT2, RSEM and STAR was determined with bedtools [48]. Furthermore, transcript IDs were compared between alignment entry and GTF entry to identify correctly mapped reads.

5. Conclusions

All tested mappers provided highly comparable results for mapping Illumina reads from the genetically distinct Arabidopsis accessions Col-0 and N14 to the Col-0 reference genome or transcriptome. The same was true for the determination of DGE when DESeq2 was used for processing. We conclude that all seven mappers can be equally used for RNA-Seq data analysis in Arabidopsis, even with different accessions. The only caveat is that using the CLC software for the identification of DGE yielded strongly varying results. Further research will be needed to establish whether read mapping to more complex genomes with larger non-coding regions or higher ploidy levels would pose additional challenges that may reveal larger differences between the mappers.

Author Contributions

Formal analysis, S.S.; Funding acquisition, D.K.H.; Methodology, A.F.; Project administration, D.K.H.; Software, A.F.; Supervision, A.F. and E.Z.; Visualization, S.S.; Writing—original draft, S.S. and D.K.H.; Writing—review & editing, A.F. and E.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was in part supported by a grant from the German Research Foundation (DFG) through Collaborative Research Center 973, Project A3 to DKH. The funders had no role in the design of the study and collection, analysis, and interpretation of the data and in writing the manuscript.

Acknowledgments

We thank the Max-Planck Genome Centre Cologne (http://mpgc.mpipz.mpg.de/home/) for RNA-Seq sequencing, Jessica Alpers for RNA extraction and Dirk Walther for critical reading of the manuscript and helpful suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

RNA-SeqRNA-sequencing
DGEDifferential Gene Expression
BWTBurrows–Wheeler-Transformation

Appendix A

Table A1. Number of reads for raw and pre-processed data.
Table A1. Number of reads for raw and pre-processed data.
SampleNumber of Reads Raw Data Number of Reads Pre-Processed Data
A26,551,07825,965,205
B24,160,25323,723,408
C24,987,21124,631,398
D24,679,89124,314,564
E32,902,96632,265,838
F25,343,87024,962,434
G25,633,39125,255,295
H24,767,05624,276,316
I22,434,13822,074,152
J27,102,01326,738,311
K29,909,22029,473,355
L30,039,89529,625,213
M25,373,17325,045,811
N27,401,91127,059,316
O32,172,33931,758,225
P26,713,32526,326,809
Q28,367,00127,941,198
R21,784,60621,476,277
S23,466,19123,142,088
T25,002,98924,642,826
U25,470,73725,137,081
V32,322,58231,890,842
W31,880,03431,451,153
X28,614,86328,223,380
Y25,396,75324,312,026
Z25,402,96224,351,761
AA21,934,47721,095,112
AB29,068,27127,924,700
AC28,363,13327,205,327
AD27,538,80726,446,048
AE21,048,97920,198,121
AF22,915,89321,786,356
AG26,195,08925,161,103
AH23,710,16022,348,705
AI25,915,84024,936,936
AJ27,904,77626,835,785
Pre-processed raw data was filtered for a minimum read length of 80 base pairs and Illumina adapters were removed.
Table A2. Number of mapped reads for each mapper and sample.
Table A2. Number of mapped reads for each mapper and sample.
SamplebwaCLCHISAT2kallistoRSEMsalmonSTAR
A24,990,28825,070,33225,727,06425,202,78825,068,40025,488,50025,877,150
B22,235,86022,831,18522,831,42722,489,98422,450,83422,625,10023,535,895
C23,568,63123,650,96924,398,52723,911,33123,729,82224,096,40024,545,823
D22,665,14523,292,37423,392,01123,182,01123,001,55223,294,40024,114,936
E31,067,88931,136,18331,948,36031,315,58631,186,63531,651,60032,144,692
F22,079,18623,828,24922,529,27423,226,05522,975,46923,289,40024,362,368
G24,360,05324,368,63925,003,45124,630,74324,435,93124,818,00025,152,392
H22,607,76823,256,06023,230,51022,983,23422,847,49723,135,80023,972,434
I20,887,12820,897,57521,647,74421,094,90521,052,74121,301,50021,759,724
J25,002,88925,748,82125,729,98025,530,36125,258,25825,626,80026,525,228
K28,251,89228,394,90229,083,56128,728,01828,398,03128,924,40029,340,134
L27,691,13328,565,61128,456,64028,333,83327,965,33028,411,70029,380,081
M24,027,75424,158,40424,771,96724,370,15024,159,38824,539,20024,947,419
N25,448,51826,128,34726,116,04625,859,91225,708,96825,908,50026,872,750
O30,483,32230,538,74131,426,08230,970,54930,650,48831,145,90031,631,436
P23,748,27524,471,94024,562,93224,318,42224,070,55124,406,80025,332,412
Q26,863,08926,968,68127,679,89127,157,40126,977,07627,405,00027,843,106
R19,700,00020,245,10120,218,35919,970,83619,918,38320,052,80020,826,196
S22,171,45822,274,28022,902,42322,444,86822,280,93622,624,30023,035,647
T23,165,18223,815,75123,748,28423,543,78923,398,52323,638,10024,456,937
U24,145,57524,192,31924,905,59524,499,41124,279,09924,667,80025,057,610
V29,305,19830,181,89930,105,42329,951,05029,635,35530,012,70031,037,834
W30,171,99130,240,22931,135,27230,619,32030,314,72430,820,90031,321,391
X26,417,78127,215,57927,146,63926,999,06826,701,97127,089,60028,004,846
Y23,467,49323,523,45724,062,63723,713,95723,548,79123,915,80024,211,437
Z22,939,89023,531,63723,488,30723,241,25823,186,26223,333,70024,169,828
AA20,347,06220,333,84120,891,79820,594,46020,425,03120,742,50021,011,777
AB26,183,32426,842,81026,903,03326,663,99726,539,90226,769,80027,709,817
AC26,065,88526,102,79526,890,84726,358,64426,235,20926,562,40027,054,414
AD24,904,56025,532,00625,483,65725,267,36625,201,02225,348,10026,234,545
AE19,055,41419,320,69219,842,59719,566,39219,295,59719,670,30019,967,391
AF17,469,94918,053,59018,090,81517,854,05917,755,99717,898,70018,672,118
AG24,163,36524,161,81224,876,95224,468,97124,283,10824,682,50025,047,179
AH20,953,17421,498,74621,436,52021,310,76021,215,86621,379,40022,109,670
AI23,823,05823,916,42924,617,94424,223,76623,973,24524,383,70024,792,766
AJ25,023,00525,804,95825,767,49525,481,09825,325,86625,563,80026,594,060
Col-0 %95.996.298.997.296.497.999.5
N14 %92.495.294.994.293.694.698.1
Total %94.195.796.995.795.096.398.8
Tools are sorted alphabetically by name. Total describes the fraction of mapped reads for both accessions Col-0 and N14.
Table A3. Sample list with sample name, condition (Cond.) and accession (Acc.).
Table A3. Sample list with sample name, condition (Cond.) and accession (Acc.).
Experiment 1Experiment 2Experiment 3
Cond.Acc.Sample Cond.Acc.SampleCond.Acc.
C28Col-0MC28Col-0YC28Col-0
C28N14NC28N14ZC28N14
C28P3Col-0OC28P3Col-0AAC28P3Col-0
C28P3N14PC28P3N14ABC28P3N14
C35Col-0QC35Col-0ACC35Col-0
C35N14RC35N14ADC35N14
C28P3L7Col-0SC28P3L7Col-0AEC28P3L7Col-0
C28P3L7N14TC28P3L7N14AFC28P3L7N14
C35P3Col-0UC35P3Col-0AGC35P3Col-0
C35P3N14VC35P3N14AHC35P3N14
C28P3L7T3Col-0WC28P3L7T3Col-0AIC28P3L7T3Col-0
C28P3L7T3N14XC28P3L7T3N14AJC28P3L7T3N14
Samples were taken from three independent biological experiments. C28/C35: Control plants after 28 days or 35 days of growth at 20 °C; C28P3/C35P3: plants after an additional 3 days of cold treatment at 4 °C; C28P3L7: cold treated plants after a further 7 days at 20 °C; C28P3L7T3: plants after an additional 3 days at 4 °C.

References

  1. Collins, F.S.; Morgan, M.; Patrinos, A. The Human Genome Project: Lessons from large-scale biology. Science 2003, 300, 286–290. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Wang, Z.; Gerstein, M.; Snyder, M. RNA-Seq: A revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009, 10, 57–63. [Google Scholar] [CrossRef] [PubMed]
  3. Mortazavi, A.; Williams, B.A.; McCue, K.; Schaeffer, L.; Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Meth. 2008, 5, 621–628. [Google Scholar] [CrossRef] [PubMed]
  4. Dillies, M.-A.; Rau, A.; Aubert, J.; Hennequet-Antier, C.; Jeanmougin, M.; Servant, N.; Keime, C.; Marot, G.; Castel, D.; Estelle, J.; et al. A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis. Brief. Bioinform. 2012, 14, 671–683. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  5. Rapaport, F.; Khanin, R.; Liang, Y.; Pirun, M.; Krek, A.; Zumbo, P.; Mason, C.E.; Socci, N.D.; Betel, D. Comprehensive evaluation of differential gene expression analysis methods for RNA-seq data. Genome Biol. 2013, 14, R95. [Google Scholar] [CrossRef] [Green Version]
  6. Benjamin, A.M.; Nichols, M.; Burke, T.W.; Ginsburg, G.S.; Lucas, J.E. Comparing reference-based RNA-Seq mapping methods for non-human primate data. BMC Genom. 2014, 15, 570. [Google Scholar] [CrossRef] [Green Version]
  7. Lin, Y.; Golovnina, K.; Chen, Z.X.; Lee, H.N.; Negron, Y.L.; Sultana, H.; Oliver, B.; Harbison, S.T. Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster. BMC Genom. 2016, 17, 28. [Google Scholar] [CrossRef] [Green Version]
  8. Amin, S.; Prentis, P.J.; Gilding, E.K.; Pavasovic, A. Assembly and annotation of a non-model gastropod (Nerita melanotragus) transcriptome: A comparison of De novo assemblers. BMC Res. Notes 2014, 7, 488. [Google Scholar] [CrossRef] [Green Version]
  9. Conesa, A.; Madrigal, P.; Tarazona, S.; Gomez-Cabrero, D.; Cervera, A.; McPherson, A.; Szczesniak, M.W.; Gaffney, D.J.; Elo, L.L.; Zhang, X.; et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016, 17, 13. [Google Scholar] [CrossRef] [Green Version]
  10. Rana, S.B.; Zadlock, F.J.I.V.; Zhang, Z.; Murphy, W.R.; Bentivegna, C.S. Comparison of de novo transcriptome assemblers and k-mer strategies using the killifish, Fundulus heteroclitus. PLoS ONE 2016, 11, e0153104. [Google Scholar] [CrossRef]
  11. Li, H.; Durbin, R. Fast and accurate short read alignment with Burrows—Wheeler transform. Bioinformatics 2009, 25, 1754–1760. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  12. Dobin, A.; Davis, C.A.; Schlesinger, F.; Drenkow, J.; Zaleski, C.; Jha, S.; Batut, P.; Chaisson, M.; Gingeras, T.R. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics 2012, 29, 15–21. [Google Scholar] [CrossRef] [PubMed]
  13. Kim, D.; Paggi, J.M.; Park, C.; Bennett, C.; Salzberg, S.L. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 2019, 37, 907–915. [Google Scholar] [CrossRef] [PubMed]
  14. Li, B.; Dewey, C.N. RSEM: Accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinform. 2011, 12, 323. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  15. Patro, R.; Duggal, G.; Love, M.I.; Irizarry, R.A.; Kingsford, C. Salmon provides fast and bias-aware quantification of transcript expression. Nat. Meth. 2017, 14, 417. [Google Scholar] [CrossRef] [Green Version]
  16. Bray, N.L.; Pimentel, H.; Melsted, P.; Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat. Biotechnol. 2016, 34, 525–527. [Google Scholar] [CrossRef]
  17. Zuther, E.; Schaarschmidt, S.; Fischer, A.; Erban, A.; Pagter, M.; Mubeen, U.; Giavalisco, P.; Kopka, J.; Sprenger, H.; Hincha, D.K. Molecular signatures associated with increased freezing tolerance due to low temperature memory in Arabidopsis. Plant Cell Environ. 2019, 42, 854–873. [Google Scholar]
  18. Love, M.I.; Huber, W.; Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014, 15, 550. [Google Scholar] [CrossRef] [Green Version]
  19. Robinson, M.D.; McCarthy, D.J.; Smyth, G.K. edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 2010, 26, 139–140. [Google Scholar] [CrossRef] [Green Version]
  20. Baggerly, K.A.; Deng, L.; Morris, J.S.; Aldaz, C.M. Differential expression in SAGE: Accounting for normal between-library variation. Bioinformatics 2003, 19, 1477–1483. [Google Scholar] [CrossRef]
  21. Baruzzo, G.; Hayer, K.E.; Kim, E.J.; Di Camillo, B.; FitzGerald, G.A.; Grant, G.R. Simulation-based comprehensive benchmarking of RNA-seq aligners. Nat. Meth. 2016, 14, 135. [Google Scholar] [CrossRef] [PubMed]
  22. Everaert, C.; Luypaert, M.; Maag, J.L.V.; Cheng, Q.X.; Dinger, M.E.; Hellemans, J.; Mestdagh, P. Benchmarking of RNA-sequencing analysis workflows using whole-transcriptome RT-qPCR expression data. Sci. Rep. 2017, 7, 1559. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  23. Jin, H.; Wan, Y.-W.; Liu, Z. Comprehensive evaluation of RNA-seq quantification methods for linearity. BMC Bioinform. 2017, 18 (Suppl. 4), 117. [Google Scholar] [CrossRef] [Green Version]
  24. Sahraeian, S.M.E.; Mohiyuddin, M.; Sebra, R.; Tilgner, H.; Afshar, P.T.; Au, K.F.; Bani Asadi, N.; Gerstein, M.B.; Wong, W.H.; Snyder, M.P.; et al. Gaining comprehensive biological insight into the transcriptome by performing a broad-spectrum RNA-seq analysis. Nat. Commun. 2017, 8, 59. [Google Scholar] [CrossRef] [PubMed]
  25. Teng, M.; Love, M.I.; Davis, C.A.; Djebali, S.; Dobin, A.; Graveley, B.R.; Li, S.; Mason, C.E.; Olson, S.; Pervouchine, D.; et al. Erratum to: A benchmark for RNA-seq quantification pipelines. Genome Biol. 2016, 17, 203. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  26. Garber, M.; Grabherr, M.G.; Guttman, M.; Trapnell, C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat. Meth. 2011, 8, 469–477. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  27. Ossowski, S.; Schneeberger, K.; Lucas-Lledo, J.I.; Warthmann, N.; Clark, R.M.; Shaw, R.G.; Weigel, D.; Lynch, M. The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science 2010, 327, 92–94. [Google Scholar] [CrossRef] [Green Version]
  28. Atwell, S.; Huang, Y.S.; Vilhjálmsson, B.J.; Willems, G.; Horton, M.; Li, Y.; Meng, D.; Platt, A.; Tarone, A.M.; Hu, T.T.; et al. Genome-wide association study of 107 phenotypes in Arabidopsis thaliana inbred lines. Nature 2010, 465, 627. [Google Scholar] [CrossRef]
  29. Hancock, A.M.; Brachi, B.; Faure, N.; Horton, M.W.; Jarymowycz, L.B.; Sperone, F.G.; Toomajian, C.; Roux, F.; Bergelson, J. Adaptation to climate across the Arabidopsis thaliana genome. Science 2011, 334, 83–86. [Google Scholar] [CrossRef] [Green Version]
  30. Meinke, D.W.; Cherry, J.M.; Dean, C.; Rounsley, S.D.; Koornneef, M. Arabidopsis thaliana: A model plant for genome analysis. Science 1998, 282, 662–682. [Google Scholar] [CrossRef] [Green Version]
  31. Mayer, K.; Schüller, C.; Wambutt, R.; Murphy, G.; Volckaert, G.; Pohl, T.; Düsterhöft, A.; Stiekema, W.; Entian, K.D.; Terryn, N.; et al. Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 1999, 402, 769. [Google Scholar] [CrossRef] [PubMed]
  32. Kim, D.; Langmead, B.; Salzberg, S.L. HISAT: A fast spliced aligner with low memory requirements. Nat. Meth. 2015, 12, 357–360. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  33. Fonseca, N.A.; Marioni, J.; Brazma, A. RNA-Seq gene profiling—A systematic empirical comparison. PLoS ONE 2014, 9, e107026. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  34. Soneson, C.; Delorenzi, M. A comparison of methods for differential expression analysis of RNA-seq data. BMC Bioinform. 2013, 14, 91. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  35. Kumar, P.K.; Hoang, T.V.; Robinson, M.L.; Tsonis, P.A.; Liang, C. CADBURE: A generic tool to evaluate the performance of spliced aligners on RNA-Seq data. Sci. Rep. 2015, 5, 13443. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  36. Edgar, R.; Domrachev, M.; AE, L. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002, 30, 2074. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  37. EnsemblPlants Arabidopsis Thaliana Assembly and Gene Annotation. Available online: http://plants.ensembl.org/info/website/ftp/index.html (accessed on 5 June 2016).
  38. Berardini, T.Z.; Reiser, L.; Li, D.; Mezheritsky, Y.; Muller, R.; Strait, E.; Huala, E. The Arabidopsis information resource: Making and mining the “gold standard” annotated reference plant genome. Genesis 2015, 53, 474–485. [Google Scholar] [CrossRef] [Green Version]
  39. Liao, Y.; Smyth, G.K.; Shi, W. featureCounts: An efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 2014, 30, 923–930. [Google Scholar] [CrossRef] [Green Version]
  40. Li, H.; Handsaker, B.; Wysoker, A.; Fennell, T.; Ruan, J.; Homer, N.; Marth, G.; Abecasis, G.; Durbin, R. The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009, 25, 2078–2079. [Google Scholar] [CrossRef] [Green Version]
  41. Qiagen CLC Genomics Workbench. Available online: https://www.qiagenbioinformatics.com/ (accessed on 25 February 2019).
  42. Wickham, H. ggplot2: Elegant Graphics for Data Analysis; Springer: New York, NY, USA, 2016. [Google Scholar]
  43. Kazi-Aoual, F.; Hitier, S.; Sabatier, R.; Lebreton, J.-D. Refined approximations to permutations tests for multivariate inference. Comput. Stat. Data Anal. 1995, 20, 643–656. [Google Scholar] [CrossRef]
  44. Lê, S.; Josse, J.; Husson, F. FactoMineR: An R package for multivariate analysis. J. Stat. Softw. 2008, 25, 1–18. [Google Scholar] [CrossRef] [Green Version]
  45. Josse, J.; Husson, F.; Pagés, J. Testing the significance of the RV coefficient. Comput. Stat. Data Anal. 2007, 53, 82–91. [Google Scholar] [CrossRef]
  46. Wei, T.; Simko, V. R Package “Corrplot”: Visualization of a Correlation Matrix. Available online: https://github.com/taiyun/corrplot (accessed on 3 July 2019).
  47. Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B (Methodol.) 1995, 57, 289–300. [Google Scholar] [CrossRef]
  48. Quinlan, A.R.; Hall, I.M. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 2010, 26, 841–842. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Figure 1. Analysis workflow. Light gray represents all steps performed for experimental data, light orange for analysis of in silico generated data analyzed with HISAT2, RSEM and STAR.
Figure 1. Analysis workflow. Light gray represents all steps performed for experimental data, light orange for analysis of in silico generated data analyzed with HISAT2, RSEM and STAR.
Ijms 21 01720 g001
Figure 2. Mapper comparison based on mappability. Number of mapped reads against the Col-0 reference sequence for all seven mappers and each accession separately. The analysis included RNA-Seq data from 36 biological samples. Outliers for N14 were in each case sample V for minimum, sample AF for maximum (see Table A3 for sample information).
Figure 2. Mapper comparison based on mappability. Number of mapped reads against the Col-0 reference sequence for all seven mappers and each accession separately. The analysis included RNA-Seq data from 36 biological samples. Outliers for N14 were in each case sample V for minimum, sample AF for maximum (see Table A3 for sample information).
Ijms 21 01720 g002
Figure 3. Raw counts of mapped reads determined by each mapper plotted against each other. Results are shown for sample A of Col-0 (a) and sample B of N14 (b) which both were obtained from plants grown under control conditions at 20 °C (see Table A3 for sample information). Lower triangle represents scatterplots of log2(counts + 1) transformed, unfiltered raw counts for each mapper plotted against each other. The diagonal histograms show the density of the raw count distribution for each mapper. The upper triangle displays the correlation coefficients.
Figure 3. Raw counts of mapped reads determined by each mapper plotted against each other. Results are shown for sample A of Col-0 (a) and sample B of N14 (b) which both were obtained from plants grown under control conditions at 20 °C (see Table A3 for sample information). Lower triangle represents scatterplots of log2(counts + 1) transformed, unfiltered raw counts for each mapper plotted against each other. The diagonal histograms show the density of the raw count distribution for each mapper. The upper triangle displays the correlation coefficients.
Ijms 21 01720 g003
Figure 4. Mapper comparison based on raw count distributions. Graphical representation of the computed Rv values based on the correlation matrices of the unfiltered raw count tables generated by all mappers for all samples from both accessions. Values close to 1 indicate high similarity. The color and shape scales were adjusted to visualize the small differences between the Rv coefficients.
Figure 4. Mapper comparison based on raw count distributions. Graphical representation of the computed Rv values based on the correlation matrices of the unfiltered raw count tables generated by all mappers for all samples from both accessions. Values close to 1 indicate high similarity. The color and shape scales were adjusted to visualize the small differences between the Rv coefficients.
Ijms 21 01720 g004
Figure 5. Overlap of significantly differentially expressed genes among the different mappers for cold acclimated vs control plants. Overlap in % for Col-0 (a) and N14 (b). DGE was determined at FDR p < 0.1 and an absolute log2FC > 1 using the R-package DESeq2. Overlap of differentially expressed genes among each pair of mappers is represented in an asymmetric matrix.
Figure 5. Overlap of significantly differentially expressed genes among the different mappers for cold acclimated vs control plants. Overlap in % for Col-0 (a) and N14 (b). DGE was determined at FDR p < 0.1 and an absolute log2FC > 1 using the R-package DESeq2. Overlap of differentially expressed genes among each pair of mappers is represented in an asymmetric matrix.
Ijms 21 01720 g005
Figure 6. Number of reads mapping on the same genomic position comparing HISAT2, RSEM and STAR for Col-0. Venn diagrams are based on 24,989,667 reads mapped by all three mappers and represent the overlap of mapped reads on the same genomic position for sample A (see Table A3 for sample information). A high number of the uniquely mapped reads in RSEM was based on soft-clipping by one bp performed by HISAT2 and STAR (a). The reads in HISAT2 and STAR were corrected by adding the soft-clipped bp back and the overlap with RSEM increased strongly (b).
Figure 6. Number of reads mapping on the same genomic position comparing HISAT2, RSEM and STAR for Col-0. Venn diagrams are based on 24,989,667 reads mapped by all three mappers and represent the overlap of mapped reads on the same genomic position for sample A (see Table A3 for sample information). A high number of the uniquely mapped reads in RSEM was based on soft-clipping by one bp performed by HISAT2 and STAR (a). The reads in HISAT2 and STAR were corrected by adding the soft-clipped bp back and the overlap with RSEM increased strongly (b).
Ijms 21 01720 g006
Figure 7. Number of reads mapping on the same genomic position comparing HISAT2, RSEM and STAR for N14. Venn diagrams are based on 22,040,847 reads mapped by all three mappers and represent the overlap of mapped reads on the same genomic position for sample B (see Table A3 for sample information). A high number of the uniquely mapped reads in RSEM was based on soft-clipping by one bp performed by HISAT2 and STAR (a). The reads in HISAT2 and STAR were corrected by adding the soft-clipped bp back and the overlap with RSEM increased strongly (b).
Figure 7. Number of reads mapping on the same genomic position comparing HISAT2, RSEM and STAR for N14. Venn diagrams are based on 22,040,847 reads mapped by all three mappers and represent the overlap of mapped reads on the same genomic position for sample B (see Table A3 for sample information). A high number of the uniquely mapped reads in RSEM was based on soft-clipping by one bp performed by HISAT2 and STAR (a). The reads in HISAT2 and STAR were corrected by adding the soft-clipped bp back and the overlap with RSEM increased strongly (b).
Ijms 21 01720 g007
Table 1. Number of expressed genes identified in all samples before and after filtering out lowly expressed genes.
Table 1. Number of expressed genes identified in all samples before and after filtering out lowly expressed genes.
BwaCLCHISAT2KallistoRSEMSalmonSTAR
Before filtering32,24333,60233,60232,24333,60232,24333,602
After filtering24,19723,90324,84024,81025,14424,57424,515
Table 2. DGE analysis using the CLC software.
Table 2. DGE analysis using the CLC software.
DESeq2CLC
ComparisonAccession BaggerlyOverlap
DESeq2
EDGEOverlap
DESeq2
C28P3/C28Col-020143034101329211006
N1421013414106133111052
C28P3L7T3/C35P3Col-01980860
N14116802590
Differential gene expression was calculated with DESeq2 (FDR < 0.1, abs (log2FC > 1), based on STAR alignments and two CLC approaches after Baggerly and EDGE (FDR < 0.05, abs (log2FC > 1)).
Table 3. Comparison of selected key features of the used mappers. Features indicated by X are included in the specified mapper.
Table 3. Comparison of selected key features of the used mappers. Features indicated by X are included in the specified mapper.
Bwa CLCHISAT2KallistoRSEMSalmonSTAR
Reference
Genome XX X
TranscriptomeXXXXXXX
Needs annotationXX XXX
Specifications
Alignment-basedXXX X X
Pseudo-alignment X X
Expression values X XXXX
Splice aware XX X
Commercial software X
Table 4. Fraction of reads mapped to known exons for HISAT2 and STAR.
Table 4. Fraction of reads mapped to known exons for HISAT2 and STAR.
HISAT2STAR
Col-0N14Col-0N14
Assigned to exon94.3490.7093.0590.72
Unmapped1.105.160.501.99
Multimapped4.013.615.936.77
No Feature (intergenic)0.550.530.510.53
To test accuracy of HISAT2 and STAR, reads of the 36 biological samples were mapped against the reference genome without including an annotation. More than 90% of reads were mapped for both accessions and mappers to known exons while a small fraction was either unmapped, multimapped or mapped to intergenic positions.
Table 5. Mapping of the in silico-generated Col-0 transcriptome using HISAT2, RSEM and STAR.
Table 5. Mapping of the in silico-generated Col-0 transcriptome using HISAT2, RSEM and STAR.
HISAT2in %RSEMin %STARin %
Primary
Mapped on right transcript57,981,57098.758,072,53698.958,000,37998.8
Mapped on wrong transcript689,5411.2658,6991.1668,9091.1
Unmapped18,0220.0317730.00119,5260.033
Mapped not on known exon42,8750.07300.043,1940.1
total reads58,732,00810058,732,00810058,732,008100
Secondary
Mapped on right transcript962,75641.51,788,2344.1727,03936.9
Mapped on wrong transcript1,280,62255.142,112,75995.91,164,06559.1
mapped on different gene825,76664.538,112,26590.5842,86472.4
mapped on paralog gene454,17835.53,957,1699.4320,81227.6
mapped on different isoform6780.143,3250.13890.0
Mapped not on exon79,1183.400.077,6473.9
total reads2,322,49610043,900,9931001,968,751100
Table 6. Overview of the seven mappers used in this study.
Table 6. Overview of the seven mappers used in this study.
MapperVersionParametersReference
bwa aln0.7.13DefaultLi and Durbin (2009) [11]
CLC9DefaultQiagen, Hilden, Germany [41]
kallisto quant0.42.5--single, -l 150 and -s 25Bray et al. (2016) [16]
HISAT22.1.0DefaultKim et al. (2019) [19]
RSEM1.2.30--bowtie2, --fragment-length-mean 150 & --fragment-length-sd 25 Li and Dewey (2011) [14]
salmon quant0.6.0--type quasi, -k 31
--fldMean 150, --fldSD 25 and -l SF
Patro et al. (2017) [15]
STAR2.5.2a--outSAMtype BAM
SortedByCoordinate
--outFilterMultimapNmax 20
--alignSJDBoverhangMin 8
--outSAMunmapped Within
--quantMode TranscriptomeSAM GeneCounts
Dobin et al. (2012) [12]

Share and Cite

MDPI and ACS Style

Schaarschmidt, S.; Fischer, A.; Zuther, E.; Hincha, D.K. Evaluation of Seven Different RNA-Seq Alignment Tools Based on Experimental Data from the Model Plant Arabidopsis thaliana. Int. J. Mol. Sci. 2020, 21, 1720. https://0-doi-org.brum.beds.ac.uk/10.3390/ijms21051720

AMA Style

Schaarschmidt S, Fischer A, Zuther E, Hincha DK. Evaluation of Seven Different RNA-Seq Alignment Tools Based on Experimental Data from the Model Plant Arabidopsis thaliana. International Journal of Molecular Sciences. 2020; 21(5):1720. https://0-doi-org.brum.beds.ac.uk/10.3390/ijms21051720

Chicago/Turabian Style

Schaarschmidt, Stephanie, Axel Fischer, Ellen Zuther, and Dirk K. Hincha. 2020. "Evaluation of Seven Different RNA-Seq Alignment Tools Based on Experimental Data from the Model Plant Arabidopsis thaliana" International Journal of Molecular Sciences 21, no. 5: 1720. https://0-doi-org.brum.beds.ac.uk/10.3390/ijms21051720

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop