Keywords
machine learning, multi-omics, data integration, data harmonisation, multivariate analysis
Major pipeline updates, including to the API, the way optimal hyperparameters are stored and handled, and parallelisation. We also removed redundant plots and now correctly skip cases where the number of components is too few for a 2D plot. The existing case study was replaced with a separate, simplified case study compatible with our latest pipeline API. Installation was reworked to fix dependency issues, and new automated tests were included. The introduction was rewritten. The repository now contains information about R version compatibility and operating system requirements.
See the authors' detailed response to the review by Javad Zahiri
See the authors' detailed response to the review by Arjun Krishnan
A biological phenotype is an emergent property of a complex network of biological interactions. Since relying on a single layer of omics data to test a biological hypothesis results in an incomplete perspective of a biological system, interest in multi-omics data integration is steadily increasing as a means to decipher complex biological phenotypes.1
As a result, methods have been developed to leverage the multitude of data modalities in characterising biological systems. While many tools are available, most of these methods are heavily customised to fit a specific experimental design, and are not flexible enough to handle generic use cases.1 Furthermore, many tools that claim to perform data integration actually perform high-level data aggregation, where datasets are processed individually and only summarised.1 Of these algorithms, few perform data integration of multiple layers of omics data simultaneously, which we refer to specifically as “data harmonisation” to distinguish it from the more general term of “data integration”.1
Contributing to the lack of a generic “data harmonisation” tool is the nature of conventional biological pipelines, where multiple layers of data preprocessing and summarisation cause an irreversible loss of information during analysis. Therefore, in the context of this article, unrefined and information-rich data refers to data in a primary form, before heavy information loss occurs, such as matrices of molecular abundance data. By exploiting these low-level correlations instead of high-level summarised information, it is even possible to identify relationships between individual biological molecules.
We illustrate these points with a hypothetical case of measuring protein and transcript levels in the same set of matched samples. A correlation across transcript and protein abundance functions as an interpretable association metric, highlighting interesting features (strong correlations) for further investigation [Figure 1].2 Furthermore, increasing the number of omics layers theoretically increases resolution, and therefore the information obtained. Published multi-omics studies discovering novel biological insights which are not possible with single-omics data further support our points.3–9 With the increasing volume of multi-omics data present in publicly accessible biological data repositories,10–12 multi-omics data integration is expected to be the core strategy of modern and future biological data analyses.
The rectangles represent different layers of omics data (e.g. proteome, transcriptome and lipidome) while the circles represent features within their respective omics data layer. Black single-line arrows show correlation between features within the omics data (e.g. a regulatory factor) while blue double-lines show correlation between features across different omics data layers. A powerful abstraction of the system under study can be obtained by reviewing multiple layers of omics data holistically.
It is important to note that at this time, no end-to-end pipeline or framework exists which allows the user to quickly and easily input unrefined data, run a pipeline and export output data which can be used for downstream analyses. Therefore, to facilitate this, we developed multiomics, a flexible, easy-to-install and easy-to-use pipeline targeted at bioinformaticians.13 We implemented functions from the mixOmics14 R package, as it is one of the only methods in the field which is generic in scope, makes no restrictive assumptions and integrates data at the level of individual molecules. It can be installed as a conventional R15 package or used by cloning the associated git repository.16 A series of quality control plots are generated automatically and compiled into a pdf file. There is seamless integration with mixOmics, where data generated by the pipeline is exported automatically as an R data object of mixOmics classes, allowing expert users to intervene where needed, while allowing new users to perform a comprehensive screen of their data. As a form of checkpointing, the R data object is updated at every major stage of the pipeline, and can be loaded directly into the mixOmics suite of tools for further investigation or plot customisation. To increase reproducibility, command line arguments and parameters are also exported as files which can be rerun directly to reproduce the output. For convenience, the option to provide command line arguments as a json file is also available.
Detailed documentation is provided both within the source git repository and as vignettes in the R package. Multiple installation methods are shown in the git repository to maximise accessibility of our pipeline for users.17–19 Additionally, walkthroughs of three case studies are included. Complete and detailed examples of input data format are also provided, including a sample dataset which can be loaded directly from the R package. In this manuscript, we summarise this information and show a minimum working example to highlight some of the key features of our pipeline.
Quick install
You can install this directly as an R package from GitLab:
install.packages("devtools")
library("devtools")
install_gitlab("tyagilab/sars-cov-2", subdir="multiomics", INSTALL_opts="--no-multiarch")
Manual install
If the above automated install steps do not work, detailed manual installation instructions are available in the source git repository at https://github.com/tyronechen/SARS-CoV-2 and https://gitlab.com/tyagilab/sars-cov-2/-/tree/master for conda and R.
You may need to install mixOmics from source. If needed, please follow the installation instructions on https://github.com/mixOmicsTeam/mixOmics:
install_github("mixOmicsTeam/mixOmics")
The pipeline is not run by calling a package function directly; it is provided as a standalone script. Running the following command will show you the path to the script. A copy is also available in the source git repository.
# inside R: show the path to the script
system.file("scripts", "run_pipeline.R", package="multiomics")

# outside of R
Rscript run_pipeline.R -h
Example input
Three elements are the minimum required input for the pipeline [Figure 2]. First, a file containing biological class information is required. Next, at least two files corresponding to omics data blocks are required. Finally, a list of unique names labelling each data block is required. Examples of these input files and their internal data structure as they appear in the pipeline are shown.
# data is included within the package
# for demonstration purposes we extract the data into files,
# since the pipeline takes files as input
library(multiomics)
data(BPH2819)
names(BPH2819)
# [1] "classes"       "metabolome"    "proteome"      "transcriptome"
export <- function(name, data) {
  write.table(
    data.frame(data), paste(name, ".tsv", sep=""),
    quote=FALSE, sep="\t", row.names=TRUE, col.names=NA
  )
}
mapply(export, names(BPH2819), BPH2819, SIMPLIFY=FALSE)
# if the above does not work, they are available online
# note: use the raw file URLs, not the github "blob" page URLs
url_class <- "https://raw.githubusercontent.com/tyronechen/SARS-CoV-2/master/multiomics/data/classes.tsv"
url_meta <- "https://raw.githubusercontent.com/tyronechen/SARS-CoV-2/master/multiomics/data/metabolome.tsv"
url_prot <- "https://raw.githubusercontent.com/tyronechen/SARS-CoV-2/master/multiomics/data/proteome.tsv"
url_tran <- "https://raw.githubusercontent.com/tyronechen/SARS-CoV-2/master/multiomics/data/transcriptome.tsv"

urls <- c(url_class, url_meta, url_prot, url_tran)
file_names <- sapply(strsplit(urls, "/"), tail, 1)
mapply(function(x, y) download.file(x, y), urls, file_names, SIMPLIFY=FALSE)
if (!all(file.exists(file_names))) {stop("Files incorrectly downloaded!")}
Note that column names and row names should be truncated to avoid bugs in the pipeline associated with name length. Furthermore, usage of non-alphanumeric characters in their names should be avoided as R quietly replaces these with “.” (periods).
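For instance, R's make.names(), which read.table() applies to column names by default via check.names=TRUE, shows exactly how such characters are rewritten:

```r
# read.table() silently rewrites non-syntactic column names
# via make.names() (check.names=TRUE by default): non-alphanumeric
# characters become "." and names starting with a digit gain "X"
make.names(c("3-Aminoglutaric acid", "HMDB0000005", "gene-id#1"))
# [1] "X3.Aminoglutaric.acid" "HMDB0000005"           "gene.id.1"
```

The first example matches the mangled metabolite name visible in the case study data later in this article.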
The pipeline is run with the command Rscript run_pipeline.R, passing a list of command line arguments either as strings of text or in a json file (recommended). Running the full pipeline can take some time. The main bottleneck is parameter tuning, which scales exponentially with the number of omics data blocks; it is possible to disable this if the user wants to perform a test run or already knows the parameters. We note that R data objects are periodically exported, allowing seamless integration with functions in the underlying mixOmics package when needed. A secondary bottleneck is data imputation, which scales with the number of components used and the dimensions of the input data. If needed, it is possible to impute and export the imputed data either with the pipeline or with the underlying mixOmics function, and then substitute that as input. The user can adjust the number of CPUs to speed up the process. Data imputation can be skipped if it is not required.
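As an illustrative sketch only (the exact argument names and accepted values should be confirmed against Rscript run_pipeline.R -h or the example json files in the repository), a minimal json parameter file could look like this:

```json
{
  "classes": "classes.tsv",
  "data": ["metabolome.tsv", "proteome.tsv", "transcriptome.tsv"],
  "data_names": ["metabolome", "proteome", "transcriptome"],
  "ncpus": 4,
  "plsdacomp": 2,
  "splsdacomp": 2,
  "diablocomp": 0,
  "dist_plsda": "centroids.dist",
  "dist_splsda": "centroids.dist",
  "dist_diablo": "centroids.dist"
}
```

Setting diablocomp to 0, as described later in this article, triggers internal parameter tuning.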
Code for the pipeline can be examined in detail from the git repository or individual functions can be inspected directly after loading the R multiomics package.
Output files include a pdf file compiling all graphical output.20–24 Note that this can be quite large, especially for large datasets. A graphml file is also exported for input into Cytoscape.25 Due to the size and volume of plots, we provide a link to some example plots here. A manuscript using figures generated from this pipeline is also available for reference.26
Each analysis generates a series of text files containing feature weights. In some ways, these are functionally analogous to differential expression analyses, where these coefficients summarise the features with the most phenotypically relevant information. At the same time, a table of feature correlations across multi-omics data is generated. Some examples of these are shown below:
# download single-omic variable weights
url <- paste(
  "https://raw.githubusercontent.com/tyronechen/",
  "SARS-CoV-2/master/results/case_study_3/",
  "Metabolomics_GC_MS_1_PLSDA_max.txt", sep=""
)
download.file(url, "Metabolomics_GC_MS_1_PLSDA_max.txt")
url <- paste(
  "https://raw.githubusercontent.com/tyronechen/",
  "SARS-CoV-2/master/results/case_study_3/",
  "Metabolomics_GC_MS_1_sPLSDA_max.txt", sep=""
)
download.file(url, "Metabolomics_GC_MS_1_sPLSDA_max.txt")
# download multi-omic variable weights
# this is for a single block of omics data
url <- paste(
  "https://raw.githubusercontent.com/tyronechen/",
  "SARS-CoV-2/master/results/case_study_3/",
  "Metabolomics_GC_MS_1_DIABLO_var_keepx_max.txt", sep=""
)
download.file(url, "Metabolomics_GC_MS_1_DIABLO_var_keepx_max.txt")
# download multi-omic correlations
url <- paste(
  "https://raw.githubusercontent.com/tyronechen/",
  "SARS-CoV-2/master/results/case_study_3/",
  "DIABLO_var_keepx_correlations.txt", sep=""
)
download.file(url, "DIABLO_var_keepx_correlations.txt")
metabolomics_plsda <- read.table(
  "Metabolomics_GC_MS_1_PLSDA_max.txt",
  header=TRUE, sep="\t", row.names=1
)
colnames(metabolomics_plsda)
# [1] "Sera"         "Contrib.RPMI" "Contrib.Sera" "Contrib"      "GroupContrib"
# [6] "color"        "importance"
metabolomics_splsda <- read.table(
  "Metabolomics_GC_MS_1_sPLSDA_max.txt",
  header=TRUE, sep="\t", row.names=1
)
colnames(metabolomics_splsda)
# [1] "Sera"         "Contrib.RPMI" "Contrib.Sera" "Contrib"      "GroupContrib"
# [6] "color"        "importance"
head(metabolomics_splsda[,1:2])
#                   Sera Contrib.RPMI
# HMDB0000673 -0.9031770    0.9405335
# HMDB0000067 -0.9197936    0.9595830
# HMDB0000273 -0.9501236    0.9371693
# HMDB0000207 -0.9487778    1.0114701
# HMDB0003229 -1.0032847    1.0099811
# HMDB0001043 -0.7016579    0.9593041
metabolomics_diablo <- read.table(
  "Metabolomics_GC_MS_1_DIABLO_var_keepx_max.txt",
  header=TRUE, sep="\t", row.names=1
)
colnames(metabolomics_diablo)
# [1] "More.severe"         "Contrib.Less.severe" "Contrib.More.severe" "Contrib"
# [5] "GroupContrib"        "color"               "importance"
head(metabolomics_diablo[,1:2])
#                                      Sera Contrib.RPMI
# Metabolomics_GC_MS_HMDB0000067 -0.9197936    0.9595830
# Metabolomics_GC_MS_HMDB0000673 -0.9031770    0.9405335
# Metabolomics_GC_MS_HMDB0000273 -0.9501236    0.9371693
# Metabolomics_GC_MS_HMDB0000207 -0.9487778    1.0114701
# Metabolomics_GC_MS_HMDB0003229 -1.0032847    1.0099811
correlations <- read.table(
  "DIABLO_var_keepx_correlations.txt",
  header=TRUE, sep="\t", row.names=1
)
dim(correlations)
# [1] 15 15
head(correlations[,1:2])
#                                   Metabolomics_GC_MS_HMDB0000207
# Metabolomics_GC_MS_HMDB0000067                         0.9746416
# Metabolomics_GC_MS_HMDB0000207                         0.9727075
# Metabolomics_GC_MS_HMDB0000273                         0.9785230
# Metabolomics_GC_MS_HMDB0000673                         0.9803517
# Metabolomics_GC_MS_HMDB0003229                         0.9648338
# Proteomics_MS1_DDA_WP_000514408.1                     -0.9706518
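As a quick illustration, strongly correlated feature pairs can be pulled directly from this correlation table. This snippet is a generic sketch, not a pipeline function, and the 0.9 cutoff is arbitrary:

```r
# sketch: list feature pairs with strong correlations across omics
# the 0.9 cutoff is arbitrary and should be adjusted per dataset
cm <- as.matrix(correlations)
idx <- which(abs(cm) > 0.9 & upper.tri(cm), arr.ind=TRUE)
pairs <- data.frame(
  feature_a = rownames(cm)[idx[, 1]],
  feature_b = colnames(cm)[idx[, 2]],
  r = cm[idx]
)
pairs[order(-abs(pairs$r)), ]
```

Pairs spanning two different data blocks (e.g. a metabolite against a protein) are the cross-omics associations of interest.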
An R data file containing all of the information above and a script containing command line arguments which can be used to reproduce the analysis are also exported to enable full reproducibility.
Examples of these output files for three case studies are included in the source git repository.
We demonstrate a sample use case of our pipeline with reference to an earlier re-analysis of a published dataset.13,26 For simplicity, we highlight one case study only in this manuscript, but note that detailed walkthroughs for all three are available in the source git repository.27–29 Our tool takes as input at least two data files present as tables of quantitative information, with samples as rows and features as columns. A list of names corresponding to these data blocks is required. A file containing class information is also required as a list of newline separated values. Examples of these data and class files for three case studies are included in the source git repository. Other command line arguments control the distance metric used for prediction, the number of features to select, and more. A full description can be obtained by running Rscript run_pipeline.R -h, which lists every flag in detail. Because of the number of command line arguments, an option is provided to pass these parameters as a json file to the pipeline. Examples of these json files for three case studies are included in the source git repository.
Regarding input data, some example data29 is provided as part of our R package.
library(multiomics)
data(BPH2819)
names(BPH2819)
# [1] "classes"       "metabolome"    "proteome"      "transcriptome"
sapply(BPH2819, dim)
# $classes
# NULL
#
# $metabolome
# [1]  12 153
#
# $proteome
# [1]   12 1451
#
# $transcriptome
# [1]   12 2771
Alternatively, you may download this from our git repository directly. This is a subset of sepsis data generated in a separate publication.29
We provide a fully processed dataset as a guide for the user. The steps below can be reproduced by downloading the R data object with the following command:
url <- paste(
  "https://github.com/tyronechen/SARS-CoV-2/",
  "raw/master/results/case_study_3/data.RData", sep=""
)
download.file(url, "RData.RData")
load("RData.RData")
ls()
#  [1] "argpath"               "argv"                  "classes"
#  [4] "contrib"               "corr_cutoff"           "correlations"
#  [7] "data"                  "data_imp"              "data_names"
# [10] "data_pca_multilevel"   "data_plsda"            "data_splsda"
# [13] "design"                "diablo"                "diablo_input"
# [16] "diablo_keepx"          "diablo_ncomp"          "dimensions"
# [19] "dist_diablo"           "dist_plsda"            "dist_splsda"
# [22] "export"                "heatmaps"              "i"
# [25] "input_data"            "linkage"               "low_var"
# [28] "mappings"              "metabolomics_diablo"   "metabolomics_plsda"
# [31] "metabolomics_splsda"   "missing"               "optimal_params"
# [34] "optimal_params_values" "outdir"                "paths"
# [37] "pca_impute"            "pca_withna"            "pch"
# [40] "perf_diablo"           "plot"                  "plsda_ncomp"
# [43] "rdata"                 "splsda_keepx"          "splsda_ncomp"
# [46] "tuned_diablo"          "tuned_splsda"          "url"
# [49] "x"                     "y"
Inspecting the minimum required input (classes and data) reveals the following:
# number of samples
length(classes)
# [1] 12

# data dimensions
sapply(data, dim)
#      Metabolomics_GC_MS Proteomics_MS1_DDA RNA_Seq
# [1,]                 12                 12      12
# [2,]                153               1451    2771

table(classes)
# classes
# RPMI Sera
#    6    6

head(data$Metabolomics_GC_MS[,1:3])
#        X3.Aminoglutaric.acid HMDB0000005 HMDB0000008
# RPMI_0            -1.7814083   -9.103010   -3.471373
# RPMI_1            -1.9108074   -5.401229   -3.488496
# RPMI_2            -1.5458964  -10.898804   -2.845025
# RPMI_3            -2.1842312   -9.563557   -1.232155
# RPMI_4            -1.3106881   -4.755440   -1.723564
# RPMI_5            -0.9600247   -4.771127   -1.403044
First, data is filtered if associated options are specified by the user. Features with missing values across sample groups are discarded by default. The user can also choose to filter out features (columns) exceeding a certain threshold of missing values.
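The idea behind this filter can be sketched in a few lines of base R. Note that drop_sparse_features and its 0.5 default are illustrative, not the pipeline's actual internals:

```r
# drop features (columns) whose proportion of missing values
# exceeds a threshold across all samples
drop_sparse_features <- function(x, max_na_prop=0.5) {
  keep <- colMeans(is.na(x)) <= max_na_prop
  x[, keep, drop=FALSE]
}

# example: a 4 x 3 matrix where the third feature is mostly missing
m <- matrix(c(1, 2, 3, 4, 5, NA, 7, 8, NA, NA, NA, 12), nrow=4)
dim(drop_sparse_features(m))
# [1] 4 2
```

In the pipeline, the threshold corresponds to the user-facing missing-value proportion option.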
Imputing missing values is optional as PLS-derived methods can function without this step. However, we include this information in case the user would like to perform this step manually. Remaining missing values can be imputed, with the number of components controlled by the user-specified --icomp flag. Imputation is effective when the quantity of missing values is below 20% of the data. To investigate whether the data has been significantly changed, the user can plot a correlation plot of the principal components before and after imputation. Since imputation can take a long time, especially for large datasets, the imputed data is saved by default and the user can load it in directly as input if desired.
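The before/after check described above can be sketched with mixOmics. The object names here mirror those in the pipeline's exported RData (input_data, data_imp, pca_withna, pca_impute), but treat this as an illustrative sketch rather than the pipeline's exact code, since the stored objects' structures may differ:

```r
library(mixOmics)

# pca() in mixOmics tolerates missing values (NIPALS), so the
# un-imputed data can be decomposed directly for comparison
pca_withna <- pca(input_data, ncomp=10)
pca_impute <- pca(data_imp, ncomp=10)

# per-component correlations near 1 suggest imputation has not
# distorted the major axes of variation
diag(cor(pca_withna$variates$X, pca_impute$variates$X))
```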
If the study design is longitudinal (e.g. has repeated measurements on the same sample), then the --pch flag should be enabled by the user. The user should pass in a file with the same format as the classes file, but containing information regarding the repeated measurements.23,30 Providing this information allows the pipeline to adjust for this internally.
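As a hypothetical illustration (not a file from the repository), a repeated-measures file for six samples drawn from three subjects could contain one value per line, in the same order as the samples:

```text
subject_1
subject_1
subject_2
subject_2
subject_3
subject_3
```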
Most of the parameters for the machine learning algorithms are specified by the user. These cover the three methods PLSDA (partial least squares discriminant analysis), sPLSDA (sparse PLSDA) and multi-block sPLSDA (also known as DIABLO). The underlying methods are implemented within the mixOmics software package and more information is available on their website http://mixomics.org/. For each method, a distance metric is specified, either “max.dist”, “centroids.dist” or “mahalanobis.dist”. Unlike PLSDA, sPLSDA and multi-block sPLSDA focus on selecting a subset of the most relevant features and therefore require a user-specified list describing the quantity of features to be selected from the data. The number of components to derive for each method is also provided. Several exploratory runs across a wide range of feature counts can be carried out to find the optimal configuration, e.g. starting at 5, 10, 30, 50, 100, inspecting the output and narrowing the range further. The user can specify a few additional special parameters to the multi-block sPLSDA (block.splsda) function. The linkage parameter is a continuous value from 0 to 1 and describes the type of analysis, with a value closer to 0 prioritising class discrimination and a value closer to 1 prioritising correlation between data sets. Meanwhile, setting the number of multi-block sPLSDA components to 0 causes the pipeline to perform parameter tuning internally. Note that this can take a long time, and scales exponentially per added block of omics data. The user can also specify the number of CPUs to be used for parallel processing, which mainly affects parameter tuning. Using our example, these arguments are provided here:
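To make these parameters concrete, the core DIABLO call in mixOmics looks roughly like the following. The block matrices (metab, prot, rna) and keepX values are placeholders for illustration, not the pipeline's defaults:

```r
library(mixOmics)

# blocks: named list of matrices (samples x features), rows matched
blocks <- list(metabolome=metab, proteome=prot, transcriptome=rna)

# design matrix: the off-diagonal "linkage" value (here 0.1)
# prioritises class discrimination; values near 1 would instead
# prioritise correlation between data sets
design <- matrix(0.1, nrow=length(blocks), ncol=length(blocks),
                 dimnames=list(names(blocks), names(blocks)))
diag(design) <- 0

# keepX: number of features to select per block, per component
keepx <- list(metabolome=c(5, 10), proteome=c(5, 10), transcriptome=c(5, 10))

diablo <- block.splsda(X=blocks, Y=classes, ncomp=2, keepX=keepx, design=design)
```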
> argv
# …
[[1]]
[1] FALSE

$help
[1] FALSE

$low_var
[1] FALSE

$mini_run
[1] FALSE

$progress_bar
[1] TRUE

$opts
[1] NA

$json
[1] NA

$classes
[1] "BPH2819_info_all.tsv"

$classes_secondary
[1] NA

$dropna_classes
[1] FALSE

$dropna_prop
[1] 0.6

$data
[1] "Staphylococcus_aureus_BPH2819_Metabolomics_GC_MS/BPH2819.tsv"
[2] "Staphylococcus_aureus_BPH2819_Proteomics_MS1_DDA/BPH2819.tsv"
[3] "Staphylococcus_aureus_BPH2819_RNA_Seq/BPH2819.tsv"

$data_names
[1] "Metabolomics_GC_MS" "Proteomics_MS1_DDA" "RNA_Seq"

$force_unique
[1] TRUE

$mappings
[1] NA

$ncpus
[1] 24

$diablocomp
[1] 2

$linkage
[1] 0.1

$diablo_keepx
[1] NA

$icomp
[1] 12

$zero_as_na
[1] TRUE

$replace_missing
[1] TRUE

$pcomp
[1] 10

$plsdacomp
[1] 2

$splsdacomp
[1] 2

$splsda_keepx
[1] NA

$dist_plsda
[1] "centroids.dist"

$dist_splsda
[1] "centroids.dist"

$dist_diablo
[1] "centroids.dist"

$cross_val
[1] "Mfold"

$cross_val_nrepeat
[1] 50

$cross_val_folds
[1] 5

$contrib
[1] "max"

$corr_cutoff
[1] 0.1

$optimal_params
[1] NA
# …
Results as well as quality control metrics (including cross-validation error rates) are exported in a series of plots and compiled into a pdf [Figure 3]. They can also be accessed internally from our provided R data object. Some sample output is shown below.
Pipeline output can be controlled by specifying a number of flags. By default, the pipeline deposits data in the current working directory; this behaviour can be easily modified. Setting outfile_dir specifies the master output directory. An R data object containing the objects shown in the loaded RData file can be renamed with the rdata option, generating a file similar to the one used in this example. The plot flag names the multi-page pdf file compiling all graphical output generated in the pipeline. A reproducible script is generated and named by the user with the args flag (this defaults to Rscript.sh).
> argv
# continued from previous …
$outfile_dir
[1] "/path/to/outdir"

$rdata
[1] "./data.RData"

$plot
[1] "./Rplots.pdf"

$args
[1] "Rscript.sh"
# …
Finally, the pipeline has limited checkpointing built in. At each milestone, the relevant output is saved and written out as an RData file, similar to the one presented above. This allows the user to manually inspect the data and adjust it where needed. In the case of completed output, the user can further customise plots and data exports for publication or downstream analysis. Importantly, the data objects are compatible with core mixOmics functions, allowing seamless integration with the mixOmics suite of tools if the user intends to extend the analysis or build custom workflows.
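For example, a checkpointed RData file can be loaded and the fitted DIABLO object handed straight to standard mixOmics plotting functions. This is a sketch; object names follow the ls() output shown earlier:

```r
load("RData.RData")  # restores pipeline objects, e.g. `diablo`
library(mixOmics)

plotIndiv(diablo, legend=TRUE)  # samples in component space
plotVar(diablo)                 # correlation circle of selected features
plotLoadings(diablo, comp=1)    # per-block feature loadings
```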
Conceptualization, S.T., T.C.; Data Curation, S.T., T.C.; Formal Analysis, K-A.L-C., T.C.; Funding Acquisition, K-A.L-C., S.T.; Methodology, A.J.A., K-A.L-C.; Project Administration, S.T.; Resources, S.T.; Supervision, K-A.L-C., S.T.; Software, A.J.A., K-A.L-C., T.C.; Validation, A.J.A., K-A.L-C., S.T., T.C.; Visualization, A.J.A., K-A.L-C.; Writing - Original Draft Preparation, S.T., T.C.; Writing - Review & Editing, A.J.A., K-A.L-C., S.T., T.C.
Primary data was generated by third parties and is publicly available.27–29 For case study 1, translatome data is available from the source publication as Supplementary Table 1 and proteome data as Supplementary Table 2. For case study 2, the authors provided their data in an SQL database. For case study 3, data is provided in publicly available accessions.
Zenodo: Multi-omics data harmonisation for the discovery of COVID-19 drug targets. https://doi.org/10.5281/zenodo.4602867.13
This project contains the following data.
• Documentation in markdown format describing pipeline usage on two case studies.
• Input data files in plain text (see Source Data for more information).
• Graphical output as pdf files and feature weights as text files.
• Source code, including code to reproduce figures in this article and source code for the R package.
• Docker file specifications for use with Docker and singularity images.
Github: SARS-CoV-2. https://github.com/tyronechen/SARS-CoV-2
Gitlab: SARS-CoV-2. https://gitlab.com/tyagilab/sars-cov-2.13
• Documentation in markdown format describing pipeline usage on three case studies.
• Input data files in plain text (see Source Data for more information).
• Graphical output as pdf files and feature weights as text files.
• Source code, including code to reproduce figures in this article and source code for the R package.
• Docker file specifications for use with Docker and singularity images.
The following underlying data is used in this article:
• metabolome.tsv (Text file as raw input data (metabolomics) for case study 3)
• proteome.tsv (Text file as raw input data (proteomics) for case study 3.)
• transcriptome.tsv (Text file as raw input data (transcriptome) for case study 3.)
• classes.tsv (Text file as raw input data (biological classes) for case study 3.)
• data.RData (R data object containing all input, intermediate and output data for case study 3.)
• manuscript_figures (Example output plots that can be generated by the pipeline.)27,29
Code and data is available under the MIT license. Documentation is available under the CC-BY-3.0 AU license.
The following extended data is available in the same repository:
• data/case_study_1 (All raw input data for case study 1.)
• data/case_study_2 (All raw input data for case study 2.)
• data/case_study_3 (All raw input data for case study 3.)
• results/case_study_1 (Example output data for case study 1.)
• results/case_study_2 (Example output data for case study 2.)
• results/case_study_3 (Example output data for case study 3.)
Similar to underlying data, extended code and data is available under the MIT license. Documentation is available under the CC-BY-3.0 AU license.
• Software available through R directly:

install.packages("devtools")
library("devtools")
install_github("tyronechen/SARS-CoV-2", subdir="multiomics", INSTALL_opts="--no-multiarch")
The pipeline is not run by calling a package function directly; it is provided as a standalone script.
# this will show you the path to the script
system.file("scripts", "run_pipeline.R", package="multiomics")
• Source code available from: https://github.com/tyronechen/SARS-CoV-2
• Archived source code at time of publication: https://doi.org/10.5281/zenodo.4562009
• License: MIT License. Documentation provided under a CC-BY-3.0 AU license
The specific version numbers of the packages used are shown below, along with the version of the R installation.
> library(multiomics)
> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /fs04/lz25/tyronec/envs/multiomics/lib/libopenblasp-r0.3.17.so

locale:
 [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
 [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] multiomics_1.2.1

loaded via a namespace (and not attached):
[1] compiler_4.2.3 tools_4.2.3
Data was generated as part of the Antibiotic Resistant Pathogens Framework Data Initiative. The authors thank the HPC team at Monash eResearch Centre for their continuous personnel support. This work was supported by the MASSIVE HPC facility. We acknowledge and pay respects to the Elders and Traditional Owners of the land on which our 4 Australian campuses stand.
References
1. Chen T, Philip M, Lê Cao KA, Tyagi S: A multi-modal data harmonisation approach for discovery of COVID-19 drug targets. Brief Bioinform. 2021.
Version 2 (revision) published 02 Aug 23; Version 1 published 06 Jul 21.