SWATH2stats: An R/Bioconductor Package to Process and Convert Quantitative SWATH-MS Proteomics Data for Downstream Analysis Tools

Peter Blattmann; Moritz Heusel; Ruedi Aebersold

doi:10.1371/journal.pone.0153160

Abstract

SWATH-MS is an acquisition and analysis technique of targeted proteomics that enables measuring several thousand proteins with high reproducibility and accuracy across many samples. OpenSWATH is popular open-source software for peptide identification and quantification from SWATH-MS data. For downstream statistical and quantitative analysis there exist different tools such as MSstats, mapDIA and aLFQ. However, the transfer of data from OpenSWATH to the downstream statistical tools is currently technically challenging. Here we introduce the R/Bioconductor package SWATH2stats, which allows convenient processing of the data into a format directly readable by the downstream analysis tools. In addition, SWATH2stats allows annotation, analyzing the variation and the reproducibility of the measurements, FDR estimation, and advanced filtering before submitting the processed data to downstream tools. These functionalities are important to quickly analyze the quality of the SWATH-MS data. Hence, SWATH2stats is a new open-source tool that summarizes several practical functionalities for analyzing, processing, and converting SWATH-MS data and thus facilitates the efficient analysis of large-scale SWATH/DIA datasets.

Citation: Blattmann P, Heusel M, Aebersold R (2016) SWATH2stats: An R/Bioconductor Package to Process and Convert Quantitative SWATH-MS Proteomics Data for Downstream Analysis Tools. PLoS ONE 11(4): e0153160. https://doi.org/10.1371/journal.pone.0153160

Editor: Lennart Martens, UGent / VIB, BELGIUM

Received: February 7, 2016; Accepted: March 24, 2016; Published: April 7, 2016

Copyright: © 2016 Blattmann et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The SWATH2stats package is deposited on Bioconductor (http://bioconductor.org/packages/SWATH2stats/). The example script was run with SWATH2stats v1.1.14, currently found on http://bioconductor.org/packages/devel/bioc/html/SWATH2stats.html. The SWATH-MS data needed to run the example script is deposited on PeptideAtlas (http://www.peptideatlas.org/PASS/PASS00289).

Funding: This work was supported by the Swiss SystemsX.ch initiative, evaluated by the Swiss National Science Foundation, to PB. MH was supported by grants from the European Research Council [233226-PROTEOMICS v3.0] and the Institut Mérieux to RA. The RA group is supported by the Swiss National Science Foundation [3100A0-130530], Advanced ERC grant [#670821], ETH Zurich and SystemsX.ch. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Targeted mass spectrometry-based proteomics allows the consistent and reproducible quantification of peptide analytes in complex samples [1]. SWATH-MS is a recently developed implementation of data-independent acquisition (DIA) and targeted analysis that increases the number of quantified peptides per sample compared to S/MRM by 2–3 orders of magnitude [2]. The SWATH-MS/DIA approach has become increasingly popular in proteomics. Different software tools have been developed for the identification and quantification of peptides from the highly convoluted fragment ion maps generated by DIA. These include OpenSWATH [3], a recent implementation of mProphet scoring in Skyline [4], DIA-Umpire [5], PeakView (ABSciex, Canada) and Spectronaut (Biognosys, Switzerland). Among these, the open-source OpenSWATH pipeline is a popular tool that produces a large tab-delimited results file containing the quantitative SWATH-MS data. The OpenSWATH pipeline consists of the OpenSWATH software [3] identifying and extracting quantitative data from targeted peptides within the fragment ion maps and a statistical assessment of the correct identification of these targeted peptides using the mProphet algorithm [6, 7]. For subsequent quantitative or statistical analyses of proteomic data, several tools have been developed by us and others: MSstats and mapDIA are tools that can be used to identify statistically significant differential expression of peptides and proteins in SWATH-MS data [8, 9]. The R package aLFQ allows absolute label free quantification of proteins in SWATH-MS data [10]. To interface the OpenSWATH output with these tools, the data needs to be processed into the respective input format, a task that can be challenging and time-consuming, due to the size of the data and programming skills required. Before subjecting it to further downstream statistical or quantitative analysis, the data typically needs to be annotated and an initial quality assessment performed. 
This step can also be used to filter for a subset of the data that will then be tested for differential expression. At the moment, no tool exists to facilitate these tasks for SWATH-MS data. Here we present a convenient R/Bioconductor package called SWATH2stats that allows users to i) annotate the data, ii) analyze reproducibility across replicates, iii) estimate the FDR, iv) filter for assays meeting certain confidence or other criteria, and v) convert the large proteomic datasets to the respective input formats of the downstream analysis tools MSstats, mapDIA, and aLFQ [8–10] (Fig 1A and Table 1).

Fig 1. Overview of the R/Bioconductor package SWATH2stats.

(A) SWATH2stats uses the OpenSWATH results or results from similar software. The information on the experimental design for annotation of the conditions and replicates can be provided separately or extracted from the OpenSWATH data. The data is processed along 5 different steps (annotation, analyzing the data, false discovery rate (FDR) estimation, filtering, format conversion) using different functions (Table 1) until the data can be directly exported in a suitable format for the downstream analysis tools aLFQ, MSstats, and mapDIA. (B) Shown are example plots from the package that show the correlation of signals between injections or the coefficient of variation (cv) across conditions. (C) Shown are example plots on how the estimated global FDR or FDR by run changes depending on different score criteria.

https://doi.org/10.1371/journal.pone.0153160.g001

Table 1. Functions included in the SWATH2stats package.

https://doi.org/10.1371/journal.pone.0153160.t001

Material and Methods

Implementation

SWATH2stats was programmed as an R package and is available on Bioconductor [11] (http://bioconductor.org/packages/SWATH2stats/). A vignette contained within the package explains the analysis procedure in detail. Dedicated explanation of each function is provided in the manual pages within the package. The functions can be grouped into five areas: i) Data loading and annotation, ii) analyzing the variation and correlation of the data, iii) FDR estimation, iv) data filtering and v) format conversion (see Fig 1A and Table 1). In addition to the base R functions, the package builds directly upon functions from the packages ggplot2 [12], reshape2 [13], data.table and grid.

S. pyogenes dataset

In order to show the usage of SWATH2stats, an example script is presented (S1 and S2 Files). This script can be used to process a publicly available SWATH-MS dataset obtained from S. pyogenes [3]. This dataset contains four injections: S. pyogenes exposed to 0% or 10% human plasma, with two biological replicates per condition. The SWATH-MS data was originally searched with the OpenSWATH pipeline using an assay library for S. pyogenes [3]. The results table used in the example script was obtained from PeptideAtlas (www.peptideatlas.org, PASS00289).

Results

Loading of SWATH data and annotation

The SWATH2stats package can process SWATH data from the integrated OpenSWATH pipeline [3, 7] (S1 File). Alternatively, data from other proteomic software can be used when exported to a tab-delimited table in which each row represents the quantitative result of one targeted precursor peptide in one sample injection. The minimal information required per row is i) the assignment of the targeted peptide to a protein, ii) the MS injection in which the peptide was quantified, and iii) a measure of the signal that was quantified. To estimate an FDR with SWATH2stats, a score representing the confidence of identification needs to be present for both the target and the decoy peptides (for the OpenSWATH results the m-score is used). In addition to the quantitative data, a table containing the meta-data for the experimental design can be provided in order to annotate the SWATH-MS results in SWATH2stats. This table needs to specify, for each MS injection, the treatment condition to which it belongs and which injections are replicates of the same treatment condition. An example experimental design file is provided within the package, and the example script shows how this information can be retrieved from the filename within the data if the filename contains all of it (S1 File).
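The annotation step can be pictured as a join between the results table and the experimental design table on the injection identifier. The following is a minimal sketch in Python rather than the package's R code; the function name `annotate` and the column names `filename`, `condition`, and `bio_replicate` are assumptions chosen for illustration and are not taken from SWATH2stats.

```python
def annotate(rows, design):
    """Attach condition and replicate labels to each result row.

    rows   -- list of dicts, one per quantified precursor and injection
    design -- dict mapping injection file name to (condition, replicate)
    """
    annotated = []
    for row in rows:
        if row["filename"] not in design:
            # An injection without design information cannot be annotated.
            raise KeyError(f"no design entry for injection {row['filename']!r}")
        condition, replicate = design[row["filename"]]
        annotated.append({**row, "condition": condition, "bio_replicate": replicate})
    return annotated


# Hypothetical toy data: two injections, one peptide quantified in each.
design = {
    "inj_plasma0_rep1.mzXML": ("plasma_0pct", 1),
    "inj_plasma10_rep1.mzXML": ("plasma_10pct", 1),
}
rows = [
    {"filename": "inj_plasma0_rep1.mzXML", "protein": "P1",
     "peptide": "PEPTIDEK", "intensity": 1200.0, "m_score": 0.0004},
    {"filename": "inj_plasma10_rep1.mzXML", "protein": "P1",
     "peptide": "PEPTIDEK", "intensity": 2100.0, "m_score": 0.0009},
]
annotated = annotate(rows, design)
```

Each row keeps its original fields and gains the two design columns, so later steps can group by condition and replicate.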

Visualization of data and variation between biological replicates

The SWATH2stats package provides different functions to directly analyze the results (Table 1). These functions provide functionality to count the number of analytes detected, or analyze the correlation and difference in their signal across the measured samples. A table with the summed signal per peptide or protein can be generated. Furthermore, the correlation and coefficient of variation for the signal between replicates and across all samples can be plotted (Fig 1B). These functions are useful to obtain a first impression of the data, but can also be used to assess the effect of filtering towards the signal or correlation across replicates.
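As an illustration of the variation analysis, the coefficient of variation (CV) of a peptide within one condition is the standard deviation of its replicate signals divided by their mean. The sketch below is written in Python for illustration only and is not the package's implementation; the function name `cv_per_peptide` and the column names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def cv_per_peptide(rows):
    """Return {(condition, peptide): CV} computed from replicate intensities.

    CV is the sample standard deviation divided by the mean. Groups with
    fewer than two values are skipped because the sample standard deviation
    is undefined for them.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["condition"], row["peptide"])].append(row["intensity"])
    return {
        key: stdev(values) / mean(values)
        for key, values in groups.items()
        if len(values) >= 2 and mean(values) > 0
    }


# Hypothetical toy data: two replicates in condition A, one in condition B.
rows = [
    {"condition": "A", "peptide": "PEPTIDEK", "intensity": 90.0},
    {"condition": "A", "peptide": "PEPTIDEK", "intensity": 110.0},
    {"condition": "B", "peptide": "PEPTIDEK", "intensity": 200.0},
]
cvs = cv_per_peptide(rows)
```

Comparing the distribution of these values before and after a filtering step is one way to see whether the filter removed noisy measurements.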

Estimation of the FDR on peptide and protein level

When analyzing many runs in parallel, false positive identifications can accumulate in the combined results table, resulting in a higher overall FDR than in any individual run. In addition, the FDR on peptide or protein level is typically higher than on the assay level [10, 14]. Therefore, it is important to control the peptide and protein FDR in large proteomics studies [15]. Here, we implemented an estimation of the FDR based on the target-decoy approach. In contrast to the naïve target-decoy approach, the FDR estimate is corrected by a factor for the ratio of decoys to false targets (called fraction of false targets (FFT) or π₀) [16–18]. The functions in this package support the estimation of the global FDR over multiple runs or of the FDR within single runs (Fig 1C). These functions estimate the FDR on assay, peptide or protein level by counting the decoy assays, peptides or proteins passing a certain m-score criterion. All of these functions provide plots for visual inspection and can also be used to determine a more stringent m-score criterion that reaches a target FDR.
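The decoy-counting idea described above can be sketched as follows. This is an illustrative Python sketch, not the package's R code, and it assumes one common form of the corrected estimate: FDR ≈ FFT × (decoys passing the cutoff) / (targets passing the cutoff), with the FFT supplied by the user. The function name `estimate_fdr` and the fields `decoy`, `m_score`, and `protein` are hypothetical.

```python
def estimate_fdr(rows, mscore_cutoff, fft=1.0, level=None):
    """Estimate the FDR among targets passing an m-score cutoff.

    rows          -- list of dicts with 'm_score' and boolean 'decoy' fields
    mscore_cutoff -- rows with m_score <= cutoff are counted as passing
    fft           -- assumed fraction of false targets; 1.0 gives the
                     uncorrected (naive) target-decoy estimate
    level         -- None counts rows (assay level); a field name such as
                     'peptide' or 'protein' counts unique values instead
    """
    passing = [r for r in rows if r["m_score"] <= mscore_cutoff]

    def count(is_decoy):
        selected = [r for r in passing if r["decoy"] == is_decoy]
        if level is None:
            return len(selected)
        return len({r[level] for r in selected})

    n_targets = count(False)
    n_decoys = count(True)
    if n_targets == 0:
        return None  # the estimate is undefined without passing targets
    return fft * n_decoys / n_targets


# Hypothetical toy data: 8 confident targets, 2 confident decoys,
# and 5 decoys with a poor score that fail the cutoff.
rows = [{"protein": f"P{i}", "m_score": 0.001, "decoy": False} for i in range(8)]
rows += [{"protein": f"DECOY_P{i}", "m_score": 0.001, "decoy": True} for i in range(2)]
rows += [{"protein": f"DECOY_P{i}", "m_score": 0.5, "decoy": True} for i in range(2, 7)]

naive = estimate_fdr(rows, mscore_cutoff=0.01)
corrected = estimate_fdr(rows, mscore_cutoff=0.01, fft=0.5)
protein_level = estimate_fdr(rows, mscore_cutoff=0.01, level="protein")
```

Scanning a range of cutoffs with such a function and picking the loosest one whose estimate stays below the desired value is one way to translate a target FDR into a score criterion.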

Filtering the data

Depending on the aim of the downstream analysis, the data might need to be filtered further. For example, a more stringent score criterion can be set to only include data identified at a higher confidence. This can reduce the overall peptide or protein FDR of the data. For some projects, peptides that have not been identified reproducibly across a certain number of conditions should be excluded from further analysis. Therefore, another option is to filter for peptides that were identified across a certain number of injections or replicates. Another possibility is to select only proteins for which two independent peptides were quantified. Typically, these approaches lead to a preferential selection of true versus false targets or decoys and hence reduce the FDR in the data. Furthermore, filters are available to select the data for peptides present in only one protein (proteotypic peptides), or select n peptides per protein showing the highest signal (top n approach). In summary, SWATH2stats provides different functions that allow the user to filter the data based on i) meta-data from experimental design, ii) frequency of observation across samples, iii) number of sibling peptides (peptides mapping to the same protein entry) or on iv) m-score/FDR criteria (Table 1). Such filters can also be applied in combination, e.g. selecting proteotypic peptides that have been quantified in more than 50% of the samples with an estimated FDR on assay level lower than 0.001. The filters are equally applied to the decoy assays and thus the effect of these filters on false targets can be estimated by re-assessing the decoy-estimated FDR. The application of the FDR estimation functions in interplay with the filtering functions can help the researcher in selecting an efficient strategy to establish the highest possible data quality.
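The combined filter mentioned above (proteotypic peptides, quantified in more than 50% of the samples, passing a confidence criterion of 0.001) can be sketched as follows. The Python code is an illustration under stated assumptions, not the package's implementation: the m-score cutoff stands in for the assay-level FDR criterion, the observation frequency is counted on the rows that pass that cutoff, and the function name `combined_filter` and all column names are hypothetical.

```python
from collections import defaultdict


def combined_filter(rows, n_samples, mscore_cutoff=0.001, min_fraction=0.5):
    """Keep confident rows of proteotypic, frequently observed peptides."""
    # A peptide is treated as proteotypic if it maps to exactly one protein.
    proteins_of = defaultdict(set)
    for r in rows:
        proteins_of[r["peptide"]].add(r["protein"])

    # Confidence filter: keep rows below the score cutoff.
    confident = [r for r in rows if r["m_score"] < mscore_cutoff]

    # Frequency filter: count the samples in which each peptide is confident.
    samples_of = defaultdict(set)
    for r in confident:
        samples_of[r["peptide"]].add(r["filename"])

    return [
        r for r in confident
        if len(proteins_of[r["peptide"]]) == 1
        and len(samples_of[r["peptide"]]) / n_samples > min_fraction
    ]


def row(peptide, protein, filename, m_score):
    return {"peptide": peptide, "protein": protein,
            "filename": filename, "m_score": m_score}


# Hypothetical toy data with four samples s1..s4.
rows = (
    [row("AAA", "P1", f"s{i}", 0.0001) for i in (1, 2, 3)]   # kept: 3 of 4 samples
    + [row("AAA", "P1", "s4", 0.05)]                         # fails the score cutoff
    + [row("BBB", "P1", f"s{i}", 0.0001) for i in (1, 2)]    # only 50% of samples
    + [row("CCC", "P1", f"s{i}", 0.0001) for i in (1, 2)]    # maps to two proteins
    + [row("CCC", "P2", f"s{i}", 0.0001) for i in (3, 4)]
)
kept = combined_filter(rows, n_samples=4)
```

Because decoy rows would pass through the same function unchanged, re-running the FDR estimate on the filtered rows shows how the filter affects the estimated error rate.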

Conversion of the data

In the last step, SWATH2stats offers functions to convert the OpenSWATH data to the format required by popular statistical tools such as the R/Bioconductor package MSstats [8], the C++ tool mapDIA [9], and the quantitative proteomics R package aLFQ [10] (Fig 1A and Table 1). During this conversion, the data table changes from a peakgroup-level format (one row per peakgroup) to a transition-level format (one row per transition), or from a long format, in which the signal for all samples is stored in a single column, to a wide format, in which the signal for each sample is stored in its own column (Table 1). The converted data can then be read directly by the different downstream statistical or quantitative tools.
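The long-to-wide reshaping can be illustrated with a small pivot. The sketch below is in Python for illustration and does not reproduce the package's conversion functions or the exact column layout expected by MSstats, mapDIA, or aLFQ; the function name `long_to_wide` and the column names are hypothetical.

```python
def long_to_wide(rows):
    """Pivot long-format rows into one row per (protein, peptide).

    Returns a header list and a list of body rows. A peptide that was not
    quantified in a sample gets None in that sample's column.
    """
    samples = sorted({r["filename"] for r in rows})
    table = {}
    for r in rows:
        key = (r["protein"], r["peptide"])
        table.setdefault(key, {})[r["filename"]] = r["intensity"]

    header = ["protein", "peptide"] + samples
    body = [
        [protein, peptide] + [signals.get(s) for s in samples]
        for (protein, peptide), signals in sorted(table.items())
    ]
    return header, body


# Hypothetical toy data: PEPB was not quantified in sample s2.
rows = [
    {"protein": "P1", "peptide": "PEPA", "filename": "s1", "intensity": 10.0},
    {"protein": "P1", "peptide": "PEPA", "filename": "s2", "intensity": 20.0},
    {"protein": "P1", "peptide": "PEPB", "filename": "s1", "intensity": 5.0},
]
header, body = long_to_wide(rows)
```

How missing values should be written out (empty cell, NA, or zero) depends on the downstream tool and is left open here.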

Discussion

The R/Bioconductor package SWATH2stats establishes for the first time a convenient link between the OpenSWATH pipeline [3, 7] and different downstream analysis tools such as MSstats [8] and mapDIA [9]. In addition, it enables annotation, analysis, FDR estimation, and filtering of the data with different functions (Table 1). The SWATH2stats package thus enables efficient and convenient data quality control and visualization that helps to improve the quality of the subsequent statistical and quantitative results. The SWATH2stats package has been documented with a detailed vignette and deposited on the popular R/Bioconductor platform. The implementation within R allows the direct usage of other plotting and statistical functions and the open-source implementation allows full transparency on how the data is processed. SWATH2stats is specifically targeted for SWATH projects with samples from many different treatments and containing biological replicates. The implementation of SWATH2stats in the popular framework of R/Bioconductor [11] and its ease-of-use is expected to significantly facilitate end-to-end analysis of large-scale SWATH/DIA datasets for users.

Supporting Information

S1 File. SWATH2stats example script.

Example R code showing the usage of the SWATH2stats package. The data processed is the publicly available S. pyogenes dataset [3] (www.peptideatlas.org; PASS00289), which was processed with SWATH2stats v1.1.14.

https://doi.org/10.1371/journal.pone.0153160.s001

(PDF)

S2 File. R markdown source file for SWATH2stats example script.

R markdown file that was used to generate the S1 File.

https://doi.org/10.1371/journal.pone.0153160.s002

(RMD)

Acknowledgments

We want to acknowledge Koh Ching Chiek, Olga Schubert and George Rosenberger for testing, critical reading and discussion. We want to acknowledge Hannes Röst for providing the data on S. pyogenes.

Author Contributions

Conceived and designed the experiments: PB MH RA. Performed the experiments: PB MH. Analyzed the data: PB MH. Wrote the paper: PB MH RA. Conceived and implemented the FDR estimation and FDR-filtering functions: MH.

References

1. Picotti P, Aebersold R. Selected reaction monitoring-based proteomics: workflows, potential, pitfalls and future directions. Nature Methods. 2012;9(6):555–66. pmid:22669653
2. Gillet L, Navarro P, Tate S, Röst H, Selevsek N, Reiter L, et al. Targeted data extraction of the MS/MS spectra generated by data-independent acquisition: a new concept for consistent and accurate proteome analysis. Mol Cell Proteomics. 2012;11(6).
3. Röst HL, Rosenberger G, Navarro P, Gillet L, Miladinovic SM, Schubert OT, et al. OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data. Nature biotechnology. 2014;32(3):219–23. pmid:24727770.
4. MacLean B, Tomazela DM, Shulman N, Chambers M, Finney GL, Frewen B, et al. Skyline: an open source document editor for creating and analyzing targeted proteomics experiments. Bioinformatics. 2010;26(7):966–8. pmid:20147306; PubMed Central PMCID: PMC2844992.
5. Tsou CC, Avtonomov D, Larsen B, Tucholska M, Choi H, Gingras AC, et al. DIA-Umpire: comprehensive computational framework for data-independent acquisition proteomics. Nat Methods. 2015;12(3):258–64, 7 p following 64. pmid:25599550; PubMed Central PMCID: PMC4399776.
6. Reiter L, Rinner O, Picotti P, Huttenhain R, Beck M, Brusniak MY, et al. mProphet: automated data processing and statistical validation for large-scale SRM experiments. Nat Methods. 2011;8(5):430–5. pmid:21423193.
7. Teleman J, Rost HL, Rosenberger G, Schmitt U, Malmstrom L, Malmstrom J, et al. DIANA—algorithmic improvements for analysis of data-independent acquisition MS data. Bioinformatics. 2015;31(4):555–62. pmid:25348213.
8. Choi M, Chang CY, Clough T, Broudy D, Killeen T, MacLean B, et al. MSstats: an R package for statistical analysis of quantitative mass spectrometry-based proteomic experiments. Bioinformatics. 2014;30(17):2524–6. pmid:24794931.
9. Teo G, Kim S, Tsou CC, Collins B, Gingras AC, Nesvizhskii AI, et al. mapDIA: Preprocessing and statistical analysis of quantitative proteomics data from data independent acquisition mass spectrometry. Journal of proteomics. 2015;129:108–20. pmid:26381204; PubMed Central PMCID: PMC4630088.
10. Rosenberger G, Ludwig C, Rost HL, Aebersold R, Malmstrom L. aLFQ: an R-package for estimating absolute protein quantities from label-free LC-MS/MS proteomics data. Bioinformatics. 2014;30(17):2511–3. pmid:24753486; PubMed Central PMCID: PMC4147881.
11. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004;5(10):R80. pmid:15461798; PubMed Central PMCID: PMC545600.
12. Wickham H. ggplot2: Elegant Graphics for Data Analysis: Springer-Verlag New York; 2009.
13. Wickham H. Reshaping Data with the {reshape} Package. Journal of Statistical Software. 2007;21(12):1–20.
14. Reiter L, Claassen M, Schrimpf SP, Jovanovic M, Schmidt A, Buhmann JM, et al. Protein identification false discovery rates for very large proteomics data sets generated by tandem mass spectrometry. Mol Cell Proteomics. 2009;8(11):2405–17. pmid:19608599; PubMed Central PMCID: PMC2773710.
15. Serang O, Kall L. Solution to Statistical Challenges in Proteomics Is More Statistics, Not Less. Journal of proteome research. 2015;14(10):4099–103. pmid:26257019.
16. Choi H, Nesvizhskii AI. False discovery rates and related statistical concepts in mass spectrometry-based proteomics. Journal of proteome research. 2008;7(1):47–50. pmid:18067251.
17. Elias JE, Gygi SP. Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry. Nat Methods. 2007;4(3):207–14. pmid:17327847.
18. Lam H, Deutsch EW, Aebersold R. Artificial decoy spectral libraries for false discovery rate estimation in spectral library searching in proteomics. Journal of proteome research. 2010;9(1):605–10. pmid:19916561.
