Librarian: A quality control tool to analyse sequencing library compositions

Kartavya Vashishtha; Caroline Gaud; Simon Andrews; Christel Krueger

doi:10.12688/f1000research.125325.1

Software Tool Article

[version 1; peer review: 1 approved, 2 approved with reservations]

Kartavya Vashishtha¹, Caroline Gaud², Simon Andrews², Christel Krueger²,³

PUBLISHED 29 Sep 2022

¹ Independent Researcher, New Delhi, India
² Bioinformatics, Babraham Institute, Cambridge, CB22 3AT, UK
³ Bioinformatics, Altos Labs Cambridge Institute of Science, Cambridge, CB21 6GP, UK

Kartavya Vashishtha
Roles: Software, Writing – Review & Editing

Caroline Gaud
Roles: Software, Writing – Review & Editing

Simon Andrews
Roles: Conceptualization, Funding Acquisition, Software, Writing – Review & Editing

Christel Krueger
Roles: Conceptualization, Formal Analysis, Software, Visualization, Writing – Original Draft Preparation

This article is included in the Bioinformatics gateway.

Abstract

Background: Robust analysis of DNA sequencing data needs to include a set of quality control steps to ensure that technical bias is kept to a minimum. A metric easily obtained is the frequency of each of the nucleobases for each position across all sequencing reads. Here, we explore the differences in nucleobase compositions of various library types produced by standard experimental methodologies.
Methods: We obtained the compositions of nearly 3000 publicly available datasets and subjected them to Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction for a two-dimensional representation of their composition characteristics.
Results: We find that most library types result in a specific composition profile. We use this to give an estimate of how strongly the composition of a test library resembles the profiles of previously published libraries, and how likely the test sample is to be of a particular type. We introduce Librarian, a user-friendly web application and command line tool which enables checking base compositions of test libraries against known library types.
Conclusions: Library preparation methods strongly influence the per position nucleobase content. By comparing test libraries to a database of previously published library types we can make predictions regarding the library preparation method. Librarian is a user-friendly tool to access this information for quality assurance purposes as discrepancies can flag potential irregularities very early on.

Keywords

high throughput sequencing, quality control, sequencing libraries, FastQ, base composition

Corresponding author: Christel Krueger

Competing interests: No competing interests were disclosed.

Grant information: This research was supported by the Babraham Institute’s United Kingdom Research and Innovation - Biotechnology and Biological Sciences Research Council (UKRI-BBSRC) core capability grant (reference number BBS/E/B000X0000) and the Epigenetics Institute Strategic Programme (reference number BBS/E/B/000C0425).
The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2022 Vashishtha K et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

How to cite: Vashishtha K, Gaud C, Andrews S and Krueger C. Librarian: A quality control tool to analyse sequencing library compositions [version 1; peer review: 1 approved, 2 approved with reservations]. F1000Research 2022, 11:1122 (https://doi.org/10.12688/f1000research.125325.1) First published: 29 Sep 2022, 11:1122 (https://doi.org/10.12688/f1000research.125325.1) Latest published: 24 Jan 2024, 11:1122 (https://doi.org/10.12688/f1000research.125325.2)

Introduction

High-throughput sequencing is now a routine technology for the analysis of biological phenomena. A multitude of methods have been developed that obtain genome-wide information on the transcriptome, protein-DNA binding, chromatin compaction, chromosomal conformation and DNA modifications, to name but a few. While these approaches address different biological questions and employ various sample preparation techniques, the workflow mostly converges at a stage where adapter-flanked short DNA sequences, so-called libraries, are subjected to Illumina sequencing.¹ The resulting raw data should pass a number of quality control (QC) steps before analysis is performed.²⁻⁴ These can be roughly split into two categories: pre-mapping QC, for example monitoring of base call quality scores, and post-mapping QC, for example overall enrichment scores in ChIP-seq data. For instance, raw sequencing data can be queried for adapter contamination and GC bias³⁻⁵ to gauge the quality of the library preparation, or aligned against multiple genomes to confirm the expected species.⁵⁻⁷ Early detection of technical biases or problems during sample preparation is important for rigorous data analysis and conservation of resources.

FastQ is a file format commonly used for storing unmapped sequencing data. One of the metrics that can be obtained from such files is the summarised base composition across the sequencing reads. For each position in the read, the respective content of the bases adenine (A), thymine (T), guanine (G), and cytosine (C) can be determined. For a theoretical random genomic library, the expectation would be four horizontal lines reflecting the overall base composition of the genome. Since the GC content of DNA varies according to species,⁸ sequencing libraries will show different composition profiles depending on which organism was sequenced. Less intuitively, libraries produced by different experimental protocols may show vastly different sequence compositions (Figure 1). A prominent example is Bisulfite-seq used for DNA methylation analysis⁹: during sample preparation, unmethylated genomic Cs are converted to Ts, resulting in libraries with a strikingly low C content. Another instructive example is ATAC-seq.¹⁰ Here, the fragments to be sequenced are produced by a transposase which shows target sequence preference; ATAC-seq libraries are therefore compositionally biased at the start of the read. Expanding on these observations, we asked if base compositions could be used to distinguish different library types more generally.

Figure 1. Per position base content for different library types.

Base content across the first 50 positions of the sequencing reads was averaged for 54 ChIA-PET, 436 Bisulfite-seq, 416 ATAC-seq and 449 ChIP-seq libraries from mouse and human. Percentages are plotted for each of the four bases.
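
As an illustration of the metric plotted in Figure 1, the following minimal Python sketch computes the percentage of each base at each read position. It is not taken from the Librarian source code; the function name `per_position_base_content` and the toy reads are assumptions made for this example, and ambiguous base calls such as N are simply ignored here.

```python
from collections import Counter

BASES = "ATGC"

def per_position_base_content(reads, n_positions=50):
    """Return one dict per read position with the percentage of A, T, G and C."""
    counts = [Counter() for _ in range(n_positions)]
    for read in reads:
        for pos, base in enumerate(read[:n_positions].upper()):
            if base in BASES:  # assumption: N and other ambiguity codes are skipped
                counts[pos][base] += 1
    profile = []
    for counter in counts:
        total = sum(counter.values())
        profile.append(
            {b: (100.0 * counter[b] / total if total else 0.0) for b in BASES}
        )
    return profile

# Toy input: four hypothetical reads of length 4
reads = ["ACGT", "ACGA", "TCGA", "ACTA"]
profile = per_position_base_content(reads, n_positions=4)
print(profile[0])
```

For a random genomic library the four values would be roughly constant across positions; a Bisulfite-seq library would show a low C percentage throughout.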

The ‘Per base sequence content’ module of the widely used QC tool FastQC⁴ provides composition information for individual samples, but makes no comparison. Any judgement of whether a particular composition profile is expected for the analysed sample type would require highly specialised niche knowledge which cannot generally be expected of individual researchers. Using the tool MultiQC,¹¹ researchers can collate composition information from multiple individual FastQC reports and visualise them together. This is useful to compare the base compositions of different samples in an experiment and can flag up outliers, but it does not allow for placing samples in the general base composition landscape.

Here, we describe how sample preparation protocols for sequencing libraries result in characteristic composition signatures, and introduce a new quality control tool to check any sequence library against the expected composition of its preparation method.

Methods

To get an overview of expected library compositions we queried the open Gene Expression Omnibus (GEO) database¹² for high throughput sequencing datasets from mouse and human samples for the years 2018, 2019 and 2020.¹³ Mice and humans are among the most studied species and are similar in overall GC content (42% and 41%, respectively), making them a good choice to look for compositional differences of different library types. Search results were filtered to exclude library type ‘OTHER’ as well as under-represented types (fewer than 25 samples), and over-represented library types (e.g. ribonucleic acid (RNA)-seq) were capped at 500 samples. Figure 2A shows the number of samples per library type for which per position base compositions could be retrieved.
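
The filtering rules described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the text does not say how the 500 samples of an over-represented type were chosen, so random subsampling with a fixed seed is an assumption, as are the function name `select_samples` and the toy sample identifiers (including the made-up type `Rare-seq`).

```python
import random
from collections import defaultdict

def select_samples(samples, min_per_type=25, max_per_type=500, seed=0):
    """samples: iterable of (sample_id, library_type) pairs.

    Drops type 'OTHER', drops types with fewer than min_per_type samples,
    and caps every remaining type at max_per_type samples.
    """
    by_type = defaultdict(list)
    for sample_id, library_type in samples:
        if library_type == "OTHER":
            continue
        by_type[library_type].append(sample_id)

    rng = random.Random(seed)  # assumption: capping is done by random subsampling
    selected = {}
    for library_type, ids in by_type.items():
        if len(ids) < min_per_type:
            continue  # under-represented type
        if len(ids) > max_per_type:
            ids = rng.sample(ids, max_per_type)
        selected[library_type] = ids
    return selected

# Toy input with hypothetical identifiers
samples = (
    [(f"rna{i}", "RNA-Seq") for i in range(600)]
    + [(f"atac{i}", "ATAC-seq") for i in range(30)]
    + [(f"other{i}", "OTHER") for i in range(100)]
    + [(f"rare{i}", "Rare-seq") for i in range(10)]
)
selected = select_samples(samples)
print({library_type: len(ids) for library_type, ids in selected.items()})
```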

Figure 2. Library types can be distinguished by their base compositions.

A) Number of samples per library type included in the analysis. B) UMAP representation of library compositions (reference map). C) Tile based probability map for each library type. Colour represents the percentage of a particular library type per tile. D) Heatmap illustrating the specificity of each library type for tiles of the reference map. All samples were assigned to a reference map tile and colour represents the average percentage of each library type for these tiles. E) Librarian tile probability output: Percent of each library found in the reference map tile associated with the test library.

We then determined how frequently the bases A, T, G and C were found at the first 50 positions in the read (read1 for paired-end data). To visualise sample groupings, the resulting composition data was subjected to Uniform Manifold Approximation and Projection (UMAP) dimensionality reduction¹⁴ (using the umap R package with parameters n_neighbors = 15, min_dist = 8) and a two-dimensional representation is shown in Figure 2B (‘reference map’). Interestingly, visually distinct clusters are formed largely along library types, with some library types having very specific base compositions (e.g. Bisulfite-seq, ChIA-PET, ATAC-seq) while others are largely overlapping (e.g. RNA-seq and ssRNA-seq).

To explore how well represented each library type was in each region of the reference map, we split the map into tiles and calculated the percentage of each library type per tile normalised to the total number of samples. Figure 2C shows that, indeed, some tiles are exclusively occupied by a certain library type while others are less specific. To get an overall measure for how well library types could be distinguished, we first annotated each of the samples included in the analysis with its reference map tile. We then averaged the percentages of the library types represented by the tile across all samples of a particular library type to produce the confusion matrix visualised in Figure 2D. While most tiles are very indicative of a certain library type, we also find tiles which are co-occupied by more than one type, for example ncRNA-seq and miRNA-seq. Base composition similarity of certain library types comes as no surprise as the probed material and involved preparation methods can be largely overlapping.
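
The tile-based calculation can be illustrated with a short Python sketch. It rests on stated assumptions rather than on the Librarian implementation: square tiles of a hypothetical size, and one possible reading of "normalised to the total number of samples" in which each sample is weighted by the inverse of the number of samples of its library type before percentages are formed within a tile. All function names are invented for this example.

```python
from collections import Counter, defaultdict

def assign_tile(x, y, tile_size=1.0):
    """Map 2D map coordinates to an integer (column, row) tile index."""
    return (int(x // tile_size), int(y // tile_size))

def tile_percentages(samples, tile_size=1.0):
    """samples: list of (x, y, library_type) -> {tile: {library_type: percent}}."""
    type_totals = Counter(lib for _, _, lib in samples)
    weights = defaultdict(lambda: defaultdict(float))
    for x, y, lib in samples:
        # assumption: each sample counts as 1 / (number of samples of its type)
        weights[assign_tile(x, y, tile_size)][lib] += 1.0 / type_totals[lib]
    percentages = {}
    for tile, per_type in weights.items():
        total = sum(per_type.values())
        percentages[tile] = {lib: 100.0 * w / total for lib, w in per_type.items()}
    return percentages

def specificity_matrix(samples, percentages, tile_size=1.0):
    """Average the tile percentages over all samples of each library type."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = Counter()
    for x, y, lib in samples:
        counts[lib] += 1
        for other, pct in percentages[assign_tile(x, y, tile_size)].items():
            sums[lib][other] += pct
    return {
        lib: {other: s / counts[lib] for other, s in row.items()}
        for lib, row in sums.items()
    }

# Toy reference map with two hypothetical library types
samples = [
    (0.1, 0.1, "A"), (0.2, 0.3, "A"), (0.5, 0.5, "A"),
    (0.4, 0.4, "B"), (1.5, 1.5, "B"), (1.6, 1.2, "B"),
]
percentages = tile_percentages(samples)
matrix = specificity_matrix(samples, percentages)
print(matrix)
```

In this toy map one tile is shared by both types and one is exclusive to type B, so the row for type B in the resulting matrix is dominated by B while still showing some overlap with A.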

Having concluded that different library types result in largely distinct base compositions along the sequencing reads, we propose to include checking library compositions as a pre-mapping quality control step in the analysis of high throughput sequencing data. This will help flag technical irregularities during sample preparation or potential sample swaps early on and avoid bias during downstream analyses. To make this generally accessible, we developed Librarian¹⁵ which allows the user to relate the base composition of any newly sequenced library to other samples in the database.

Implementation

Librarian first extracts the base composition of the first 50 positions of 100,000 randomly selected reads from a supplied FastQ file. It then projects the composition of the test library onto the manifold created by all libraries in the database as described above, thereby assigning it to a tile on the compositions map.
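
The extraction step can be sketched in Python as below. This is a sketch under assumptions, not the Librarian implementation: the source does not state how reads are selected at random, so reservoir sampling is used here as one way to draw a fixed-size sample in a single pass, and the flattened feature vector layout (four values per position, in the order A, T, G, C) is likewise an assumption. Projection onto the reference map is not shown.

```python
import io
import random

def iter_fastq_sequences(handle):
    """Yield the sequence line of each 4-line FastQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()

def reservoir_sample(items, k, seed=0):
    """Draw up to k items uniformly at random from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(items):
        if n < k:
            sample.append(item)
        else:
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

def composition_vector(reads, n_positions=50):
    """Flatten per-position A/T/G/C fractions into one feature vector."""
    vector = []
    for pos in range(n_positions):
        called = [r[pos].upper() for r in reads if len(r) > pos]
        called = [b for b in called if b in "ATGC"]
        for base in "ATGC":
            vector.append(called.count(base) / len(called) if called else 0.0)
    return vector

# Toy FastQ content with three hypothetical reads
fastq_text = "@r1\nACGT\n+\nIIII\n@r2\nAAGT\n+\nIIII\n@r3\nACGA\n+\nIIII\n"
reads = reservoir_sample(iter_fastq_sequences(io.StringIO(fastq_text)), k=100_000)
vector = composition_vector(reads, n_positions=4)
print(len(reads), len(vector))
```

With 50 positions the vector has 200 entries, which would then be the input to the dimensionality reduction.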

Results are presented graphically: The location of the test sample is indicated both on the compositions map and on the plots displaying the probability of each library type per tile. Moreover, the percentages for each library type for the tile assigned to the test library are plotted as a heatmap. This lets the user easily gauge how similar the test library is to a collection of published library types.

Operation

Librarian is available as a web app and a command line tool. In the web app, one or more FastQ files are selected and processed locally to produce the library compositions. Client-side processing avoids upload of large FastQ files and potentially sensitive data. The resulting library composition is compared to the database on the server, and the graphical output can be viewed and downloaded in SVG format from the web page.

Librarian can also be run as a command line client application on Linux. Download and install instructions are provided via GitHub (see Software availability). Multiple FastQ files can be processed in the same query and summarised output plots are produced. Just as for the web app, library compositions are compared to the online database to ensure integration of future database expansions with additional library types.

Use case

As a use case we assume that a researcher has submitted three samples for sequencing and has now received FastQ files from the provider (use case input¹³). They want to check if the data conforms to the expectation of the respective library preparation (i.e. RNA-seq, BS-seq and ATAC-seq). Using the Librarian web app, they choose the FastQ files from a directory on their computer and are presented with a graphical representation of how similar their libraries are to published ones regarding their base composition, and a prediction of how likely these samples are to be of a particular library type (use case output¹³). Any discrepancy from the expected library type should be considered a red flag and investigated further.

Another use case would be for a sequencing facility to run Librarian together with other QC packages and provide results to users together with FastQ files as standard.
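
For a facility running such checks routinely, the comparison between expected and observed library type could be automated along the following lines. This is a hypothetical post-processing sketch and not a feature of Librarian: the function `check_library`, the `min_percent` threshold of 50 and the example percentages are all assumptions chosen for illustration.

```python
def check_library(tile_profile, expected_type, min_percent=50.0):
    """tile_profile: {library_type: percent} for the tile of the test library.

    Flags the sample when the expected type makes up less than min_percent
    of the reference samples in that tile (threshold is an assumption).
    """
    expected_percent = tile_profile.get(expected_type, 0.0)
    most_common = max(tile_profile, key=tile_profile.get, default=None)
    return {
        "expected": expected_type,
        "expected_percent": expected_percent,
        "most_common": most_common,
        "flag": expected_percent < min_percent,
    }

# Hypothetical tile percentages for a test library
tile_profile = {"ATAC-seq": 92.0, "RNA-Seq": 8.0}
print(check_library(tile_profile, "RNA-Seq"))
print(check_library(tile_profile, "ATAC-seq"))
```

A flagged sample is not necessarily faulty; it only indicates that the composition should be inspected before further analysis.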

Discussion

Our analyses demonstrate that the base composition of sequencing libraries is heavily influenced by the method through which the library was prepared. This finding can be used as an early quality assurance step for newly sequenced or publicly available data. A sample not matching its expected composition should raise a red flag and the underlying cause should be investigated before moving on with the analysis. While this could point to a sample swap or problem during library preparation, it is also possible that it is caused by a non-standard preparation method.

Of note, within our database of published sequencing libraries we find a small subset of samples which cluster with a different library type. This is nicely illustrated by a group of RNA-seq samples which fall into a region of the map which is otherwise very specific for ATAC-seq. Closer inspection of these examples reveals that their libraries were produced by tagmentation,¹⁶ a process that generates short DNA fragments using the same transposase as ATAC-seq. This clearly demonstrates that the sequence bias this introduces at the start of the read has more of an impact on base composition than the difference between RNA producing genomic regions and generally open chromatin. The limited number of available tags for library types on public sequencing data repositories means that there is inherent heterogeneity within the groups. The example also illustrates that there is a need to update the library database as new methods are developed and certain commercial library preparation kits change in popularity over time. We have therefore built Librarian in a way that can easily incorporate future developments.

Data availability

Underlying data

Zenodo: Librarian manuscript data v1, https://doi.org/10.5281/zenodo.7060217.¹³

This project contains the following underlying data:

- Composition data (output from the original GEO database queries, and datasets included in the Librarian database (filtered list))
- Use case input (example FastQ files (subsampled for smaller file size))
- Use case output (Librarian plots generated from the use case input files)

GEO database query parameters: Organism: Mus musculus OR Organism: Homo sapiens AND Platform Technology Type: “high throughput sequencing” AND Publication Date: 2018/01/01 to 2020/12/31.

Data are available under the terms of the GNU General Public License v3.0.

Software availability

Software available from: https://www.bioinformatics.babraham.ac.uk/librarian/ [Librarian web app]

Source code available from: https://github.com/DesmondWillowbrook/Librarian [Librarian command line download and install instructions]

Archived source code at time of publication: https://doi.org/10.5281/zenodo.7003739.¹⁵

Licence: GNU General Public License 3.0

Author contributions

Kartavya Vashishtha: Software, Writing – Review & Editing

Caroline Gaud: Software, Writing – Review & Editing

Simon R. Andrews: Conceptualization, Funding Acquisition, Software, Writing – Review & Editing

Christel Krueger: Conceptualization, Formal Analysis, Software, Visualization, Writing – Original Draft Preparation

Acknowledgements

Librarian was originally started as an open project at the Cambridge Bioinformatics Hackathon (www.cambiohack.uk, 21st-23rd Sep 2020) with initial ideas from many people including Stephen Kanyerezi and Lordstrong Akano. We would like to thank Felix Krueger for useful discussions and critical reading of the manuscript.

References

1. Sequencing|Key methods and uses. http
2. Wang L, Wang S, Li W: RSeQC: quality control of RNA-seq experiments. Bioinformatics. 2012; 28: 2184–2185.
3. Okonechnikov K, Conesa A, García-Alcalde F: Qualimap 2: advanced multi-sample quality control for high-throughput sequencing data. Bioinformatics. 2016; 32: 292–294.
4. Babraham Bioinformatics - FastQC A Quality Control tool for High Throughput Sequence Data. http
5. Hadfield J, Eldridge MD: Multi-genome alignment for quality control and contamination screening of next-generation sequencing data. Front. Genet. 2014; 5: 31.
6. Wingett SW, Andrews S: FastQ Screen: A tool for multi-genome mapping and quality control. 2018.
7. Wood DE, Salzberg SL: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014; 15: R46.
8. Li X-Q, Du D: Variation, Evolution, and Correlation Analysis of C+G Content and Genome or Chromosome Size in Different Kingdoms and Phyla. PLoS One. 2014; 9: e88339.
9. Bernstein AI, Jin P: Chapter 3 - High-Throughput Sequencing-Based Mapping of Cytosine Modifications. Epigenetic Technological Applications. Zheng YG, editor. Academic Press; 2015; 39–53.
10. Buenrostro JD, Giresi PG, Zaba LC, et al.: Transposition of native chromatin for multimodal regulatory analysis and personal epigenomics. Nat. Methods. 2013; 10: 1213–1218.
11. Ewels P, Magnusson M, Lundin S, et al.: MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016; 32: 3047–3048.
12. Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 2002; 30: 207–210.
13. Vashishtha K, Gaud C, Andrews S, et al.: Librarian manuscript data v1. 2022.
14. McInnes L, Healy J, Melville J: UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426 [cs, stat]. 2020.
15. Vashishtha K, Gaud C, Andrews S, et al.: Kartavya Vashishtha/Librarian-1.0.4. Zenodo. 2022.
16. Adey A, et al.: Rapid, low-input, low-bias construction of shotgun fragment libraries by high-density in vitro transposition. Genome Biol. 2010; 11: R119.

Open Peer Review

Approved: The paper is scientifically sound in its current form and only minor, if any, improvements are suggested.

Approved with reservations: A number of small changes, sometimes more significant revisions, are required to address specific details and improve the paper's academic merit.

Not approved: Fundamental flaws in the paper seriously undermine the findings and conclusions.

Reviewer Report 18 Oct 2022

Karim Gharbi, The Earlham Institute, Norwich, UK

Approved with Reservations

https://doi.org/10.5256/f1000research.137618.r151968

In this manuscript, Vashishtha et al. describe the implementation of a novel quality control (QC) tool for next-generation sequencing (NGS) datasets, which uses nucleotide composition profiles along sequence reads to infer the likely library preparation method used to generate the data. The authors first demonstrate that nucleotide composition is strongly influenced by library method as recorded in the GEO database for a selection of human and mouse NGS datasets. Having established this result, they implemented a program tool (Librarian) to compare library nucleotide composition profiles to a collection of reference datasets and identify libraries with unexpected profiles, which may be indicative of potential failure during library preparation and/or sample/data mix-ups. The tool, which is available as a web application and command line tool, extracts nucleotide composition from user supplied FASTQ files and returns similarity scores against existing profiles stored in the Librarian database.

The manuscript is well-written, and the authors provide strong evidence of library method influencing nucleotide composition in the read output files. This will be a familiar observation for those experienced with generating and/or analysing diverse NGS datasets, but the manuscript is a welcome documentation and quantification of these patterns. As a tool, Librarian has the potential to become an important step in the QC of NGS data, alongside other, more generic QC tools, such as FastQC, and help detect quality issues early in data processing. However, I have some concerns about the limitations of the software as currently implemented, which I feel are not sufficiently discussed in the manuscript and could cause significant confusion in the hands of less experienced users. The comments below are intended to help the authors improve the current manuscript and indicate areas for future improvement to increase the usability of the tool.

Major comments
- Please comment on the applicability of Librarian to data generated with other NGS technologies than Illumina. If not tested or not applicable, this should be highlighted in the discussion.

- Please provide a rationale for trimming reads to 50 bases and only considering read 1 to build the database of nucleotide composition profiles, i.e., why is this sufficient to accurately capture the nucleotide composition of each library type. Some methods result in asymmetric library fragments (e.g., 10X Genomics), with different nucleotide compositions in read 1 and read 2, which in itself can be diagnostic of the library type.

- The selection of GEO datasets to build reference profiles seems restrictive and potentially biased. Please can you provide evidence that Librarian is applicable to other species than human/mouse, especially species with divergent GC content.

- The date range filter (01/01/2018 - 31/12/2020) is also likely to have resulted in more recent library types being excluded from the analysis and therefore from the reference dataset. 10X Genomics library types are highly popular, but surprisingly absent. Other library types may have been missed too.

- Transposon-based library preparation is increasingly popular and applied across a wide range of library types, including single-cell RNA and DNA sequencing, whole-genome sequencing, ATAC-seq, enrichment capture etc. The authors briefly acknowledge this in the discussion, but it appears to be a major limitation of the tool, i.e., transposon-insertion signature at the start of the reads will likely obscure the underlying library type, causing most transposon-based libraries to cluster together. This should be explicitly documented and investigated further, if possible.

- More generally speaking, I would strongly encourage the authors to explicitly identify library types and species "supported" by Librarian, indicating that submission of other library types and/or species may result in inconclusive or potentially misleading results (I acknowledge that the software will accept any FASTQ file).

Minor comments
- Please briefly comment on the observed pattern for ChIA-PET and ChIP-seq libraries, i.e., why are these expected and consistent with the library method. ChIA-PET is not a widely used library method. A short description should be included in the text for context.

- Please add legend to Figure 1 with key matching coloured lines to individual bases.

- I would suggest meta-analysis of public datasets as another important use case for Librarian, e.g., as a clean-up tool prior to meta-analysis or identifying patterns/biases in library type, or subtypes.

- Please clarify whether Librarian can be set up with a local, custom server in addition to querying the online database via the web app or command line tool.

- The tabular data in figure 2A shows library types with fewer than 25 samples despite these being classified as under-represented libraries and excluded from the analysis in the text.

Overall, I believe that the premise of Librarian is a very good idea and thank the authors for their efforts in releasing the program as a publicly available tool. I look forward to reading their responses, and future iterations of the software addressing current limitations.

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Partly

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: genomics, next-generation sequencing, bioinformatics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.

Author Response 23 May 2024

Christel Krueger, Bioinformatics, Babraham Institute, Cambridge, CB22 3AT, UK

Reviewer Comment: In this manuscript, Vashishtha et al. describe the implementation of a novel quality control (QC) tool for next-generation sequencing (NGS) datasets, which uses nucleotide composition profiles along sequence reads to infer the likely library preparation method used to generate the data. The authors first demonstrate that nucleotide composition is strongly influenced by library method as recorded in the GEO database for a selection of human and mouse NGS datasets. Having established this result, they implemented a program tool (Librarian) to compare library nucleotide composition profiles to a collection of reference datasets and identify libraries with unexpected profiles, which may be indicative of potential failure during library preparation and/or sample/data mix-ups. The tool, which is available as a web application and command line tool, extracts nucleotide composition from user supplied FASTQ files and returns similarity scores against existing profiles stored in the Librarian database.

The manuscript is well-written, and the authors provide strong evidence of library method influencing nucleotide composition in the read output files. This will be a familiar observation for those experienced with generating and/or analysing diverse NGS datasets, but the manuscript is a welcome documentation and quantification of these patterns. As a tool, Librarian has the potential to become an important step in the QC of NGS data, alongside other, more generic QC tools, such as FastQC, and help detect quality issues early in data processing. However, I have some concerns about the limitations of the software as currently implemented, which I feel are not sufficiently discussed in the manuscript and could cause significant confusion in the hands of less experienced users. The comments below are intended to help the authors improve the current manuscript and indicate areas for future improvement to increase the usability of the tool.

Author Response: We would like to thank Karim Gharbi for reviewing our manuscript and taking the time to test the tool Librarian, and we are very pleased by his positive assessment regarding its usefulness. We much appreciate Karim’s thoughtful comments and useful suggestions to improve the usability. Please find below responses to each of his questions/comments.

Reviewer Comment:
Major comments

Please comment on the applicability of Librarian to data generated with other NGS technologies than Illumina. If not tested or not applicable, this should be highlighted in the discussion.

Author Response: Our analyses focused on Illumina sequencing as this is by far the most widely used technology and also offers the most diverse library types, with many applications not available on other platforms such as Nanopore or PacBio. We have now made this clear in the discussion and included this as a limitation in the section on Operation. We have also added this to the new FAQ section of the documentation.

Reviewer Comment:

Please provide a rationale for trimming reads to 50 bases and only considering read 1 to build the database of nucleotide composition profiles, i.e., why is this sufficient to accurately capture the nucleotide composition of each library type. Some methods result in asymmetric library fragments (e.g., 10X Genomics), with different nucleotide compositions in read 1 and read 2, which in itself can be diagnostic of the library type.

Author Response: Librarian was designed to check individual FastQ files, which is why differences between, e.g., read 1 and read 2 are not taken into account despite being informative in some cases. Our decision to use the first 50 bp of a read was influenced by two considerations. Firstly, many library types show characteristic biases at the beginning of the read and become more similar to the general GC content further along the read (compare Figure 1). Thus, using information from longer reads actually reduces the differences between library types. Secondly, while read lengths currently tend to be longer, there are still many experiments that use a 50 bp sequencing read length. These could not be analysed if the reference map had been created using information from 100 bp. We therefore chose 50 bp to balance accuracy with broad usability.
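To make the 50 bp choice concrete, the following is a minimal Python sketch of how a per-position base composition profile could be computed from the first 50 bases of each read. It is an illustration only and not Librarian's actual implementation; the function name `base_composition`, the `n_positions` parameter and the assumption of plain four-line FASTQ records are all hypothetical.

```python
from collections import Counter
import io


def base_composition(fastq_handle, n_positions=50):
    """Return, for each of the first n_positions read positions,
    the fraction of A, C, G and T observed at that position.

    Assumes four-line FASTQ records (sequence on the 2nd line of each record).
    Bases other than A/C/G/T (e.g. N) are excluded from the denominator.
    """
    counts = [Counter() for _ in range(n_positions)]
    for i, line in enumerate(fastq_handle):
        if i % 4 != 1:  # skip header, separator and quality lines
            continue
        # Reads shorter than n_positions simply contribute fewer positions.
        seq = line.strip().upper()[:n_positions]
        for pos, base in enumerate(seq):
            counts[pos][base] += 1

    profile = []
    for c in counts:
        total = sum(c[b] for b in "ACGT")
        profile.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return profile


# Tiny in-memory example with two 4-base reads.
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAGGT\n+\nIIII\n")
profile = base_composition(example, n_positions=4)
print(profile[1])  # composition at the second read position
```

Because only a fixed-length prefix is used, libraries sequenced at 50 bp and at longer read lengths yield profiles of the same shape, which is the usability argument made above.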

Reviewer Comment:

The selection of GEO datasets to build reference profiles seems restrictive and potentially biased. Please can you provide evidence that Librarian is applicable to other species than human/mouse, especially species with divergent GC content.

Author Response: We deliberately restricted the selection of samples to human and mouse, as species with different GC content will indeed produce libraries of different compositions. Thus, as a minimum, test samples should have a similar GC content to the samples in the reference map. We decided to focus on human and mouse data for the reference map because of the abundance of available data and their prevalence in current studies. Here is an example of RNA-seq libraries from species with substantially different GC contents (Apis mellifera ~35% GC, Staphylococcus aureus ~30% GC and Streptomyces coelicolor ~70% GC).

Please see Image 1 here: https://f1000research.s3.amazonaws.com/linked/649555.Image_1.1.pdf

Please see Image 2 here: https://f1000research.s3.amazonaws.com/linked/649556.Image_1.2.pdf

Please see Image 3 here: https://f1000research.s3.amazonaws.com/linked/649557.Image_1.3.pdf

We therefore do not recommend using Librarian for species other than mouse or human, although it is very likely that it would perform well on other mammalian species with a GC content similar to that of human and mouse. We have now indicated this in the section on Operation and the new FAQ section of the documentation. Please also see our reply to Reviewer 2.

Species specific and broader mammalian reference maps are planned for future Librarian releases.
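
As a rough illustration of the GC-content requirement discussed in this response, the sketch below computes the overall GC fraction of a set of reads and compares it with a reference value. It is not part of Librarian; the function names, the `tolerance` of 0.05 and the reference value of 0.41 are hypothetical placeholders chosen for the example, and only the approximate GC fractions of 0.70 and 0.30 are taken from the species examples above.

```python
def gc_fraction(sequences):
    """Overall fraction of G and C among all A/C/G/T bases in the reads."""
    gc = 0
    total = 0
    for seq in sequences:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        total += sum(s.count(b) for b in "ACGT")
    return gc / total if total else 0.0


def gc_compatible(test_gc, reference_gc, tolerance=0.05):
    """True if the test GC fraction lies within `tolerance` of the reference.

    The tolerance is an arbitrary illustrative threshold, not a value
    given in the manuscript or used by Librarian.
    """
    return abs(test_gc - reference_gc) <= tolerance


reference_gc = 0.41  # hypothetical reference value, for illustration only
print(gc_fraction(["GGCC", "AATT"]))
print(gc_compatible(0.70, reference_gc))  # high-GC sample, e.g. ~70% GC
print(gc_compatible(0.30, reference_gc))  # low-GC sample, e.g. ~30% GC
```

A check of this kind could only flag samples whose GC content is clearly unlike the reference; it says nothing about library type on its own.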

Reviewer Comment:

The date range filter (01/01/2018 - 31/12/2020) is also likely to have resulted in more recent library types to be excluded from the analysis and therefore the reference dataset. 10X Genomics library types are highly popular, but surprisingly absent. Other library types may have been missed too.

Author Response: It is indeed the case that new library types become available all the time. While Librarian is unlikely ever to cater for niche technologies, it will need to be updated over time as new library preparation methods become popular. We have therefore made the design decision to have both the web app and the CLI query the reference data on the Babraham server. This ensures that users have immediate access to updated reference maps as they become available. A current limitation is that GEO only offers a set number of metadata tags for library types upon sample submission, which results in distinct preparation methods being grouped under one name (please also see responses to Reviewer 1 Question 2 and Reviewer 2). Currently, there is no tag for 10X Genomics libraries on GEO, and submitters may choose to label them either as ‘RNA-seq’ or potentially ‘OTHER’. We are hoping to be able to query a wider range of metadata in the future and provide a more fine-grained classification of library types.

Reviewer Comment:

Transposon-based library preparation is increasingly popular and applied across a wide range of library types, including single-cell RNA and DNA sequencing, whole-genome sequencing, ATAC-seq, enrichment capture etc. The authors briefly acknowledge this in the discussion, but it appears to be a major limitation of the tool, i.e., transposon-insertion signature at the start of the reads will likely obscure the underlying library type, causing most transposon-based libraries to cluster together. This should be explicitly documented and investigated further, if possible.

Author Response: This is indeed a limitation and is laid out in the second paragraph of the discussion. We have now also added this scenario explicitly to the FAQ section of the documentation. With the current set of reference data, we find that the proportion of transposon-based library preparation for types other than ATAC-seq is very low. As this library preparation method becomes more popular, it will feature more strongly and in a more diverse way in future reference maps.

Reviewer Comment:

More generally speaking, I would strongly encourage the authors to explicitly identify library types and species "supported" by Librarian, indicating that submission of other library types and/or species may result in inconclusive or potentially misleading results (I acknowledge that the software will accept any FASTQ file).

Author Response: We have now been much more explicit about which samples are suitable for analysis with Librarian: We have added the following statement to the Operations section: “Irrespective of platform, Librarian is only suitable to assess datasets which match the types that the reference map is built on. More specifically, test samples need to have been sequenced with Illumina technology, match any of the included library types and be of mammalian origin (ideally mouse or human).”
We also discuss extensively which samples are supported in the new FAQ section of the documentation.

Reviewer Comment:

Minor comments


Please briefly comment on the observed pattern for ChIA-PET and ChIP-seq libraries, i.e., why are these expected and consistent with the library method. ChIA-PET is not a widely used library method. A short description should be included in the text for context.

Author Response: A description of the techniques BS-Seq, ATAC-Seq, ChIA-PET and ChIP-Seq and an explanation of the method-specific composition biases along the read have been added to the legend of Figure 2.

Reviewer Comment:

Please add legend to Figure 1 with key matching coloured lines to individual bases.

Author Response: This omission has been corrected.

Reviewer Comment:

I would suggest meta-analysis of public datasets as another important use case for Librarian, e.g., as a clean-up tool prior to meta-analysis or identifying patterns/biases in library type, or subtypes.

Author Response: This is a good suggestion and this use case has now been included in the manuscript.

Reviewer Comment:

Please clarify whether Librarian can be set up with a local, custom server in addition to querying an online database via the web app or command line tool.

Author Response: We provide Librarian as a web tool and with a Linux command line interface. The CLI can be run either in offline mode in which visualisations and predictions are computed locally (this is a new option) or remote mode in which Librarian queries the same web server as the web tool. The latter has the advantage of avoiding R dependencies and ensures that libraries are compared to the latest reference map. The different modes are explained in the section Operation of the manuscript, and in the section Usage in the online tool documentation.

Reviewer Comment:

The tabular data in figure 2A shows library types with fewer than 25 samples despite these being classified as under-represented libraries and excluded from the analysis in the text.

Author Response: This is correct. The search results were filtered to exclude library types with fewer than 25 samples. The table, however, shows the number of samples for which we could retrieve base compositions, and we were not able to retrieve sequence compositions for all search hits. Also note that RNA-seq search results were capped at 500 but only 452 FastQ files made it through the analysis. Most samples dropped out at the download stage, which is a common occurrence. Since this was done in high throughput, we did not attempt to troubleshoot the process or download files manually.

Reviewer Comment: Overall, I believe that the premise of Librarian is a very good idea and thank the authors for their efforts in releasing the program as a publicly available tool. I look forward to reading their responses, and future iterations of the software addressing current limitations.
Competing Interests: No competing interests were disclosed.

COMMENTS ON THIS REPORT

Author Response 23 May 2024

Christel Krueger, Bioinformatics, Babraham Institute, Cambridge, CB22 3AT, UK

23 May 2024


Reviewer Comment: Overall, I believe that the premise of Librarian is a very good idea and thank the authors for their efforts in releasing the program as a publicly available tool. I look forward to reading their responses, and future iterations of the software addressing current limitations.
Reviewer Comment: In this manuscript, Vashishtha et al. describe the implementation of a novel quality control (QC) tool for next-generation sequencing (NGS) datasets, which uses nucleotide composition profiles along sequence reads to infer the likely library preparation method used to generate the data. The authors first demonstrate that nucleotide composition is strongly influenced by library method as recorded in the GEO database for a selection of human and mouse NGS datasets. Having established this result, they implemented a program tool (Librarian) to compare library nucleotide composition profiles to a collection of reference datasets and identify libraries with unexpected profiles, which may be indicative of potential failure during library preparation and/or sample/data mix-ups. The tool, which is available as a web application and command line tool, extracts nucleotide composition from user supplied FASTQ files and returns similarity scores against existing profiles stored in the Librarian database.

The manuscript is well-written, and the authors provide strong evidence of library method influencing nucleotide composition in the read output files. This will be a familiar observation for those experienced with generating and/or analysing diverse NGS datasets, but the manuscript is a welcome documentation and quantification of these patterns. As a tool, Librarian has the potential to become an important step in the QC of NGS data, alongside other, more generic QC tools, such as FastQC, and help detect quality issues early in data processing. However, I have some concerns about the limitations of the software as currently implemented, which I feel are not sufficiently discussed in the manuscript and could cause significant confusion in the hands of less experienced users. The comments below are intended to help the authors improve the current manuscript and indicate areas for future improvement to increase the usability of the tool.

Author Response: We would like to thank Karim Gharbi for reviewing our manuscript and taking the time to test the tool Librarian, and we are very pleased by his positive assessment regarding its usefulness. We much appreciate Karim’s thoughtful comments and useful suggestions to improve the usability. Please find below responses to each of his questions/comments.

Reviewer Comment:
Major comments

Please comment on the applicability of Librarian to data generated with other NGS technologies than Illumina. If not tested or not applicable, this should be highlighted in the discussion.

Author Response: Our analyses focused on Illumina sequencing as this is by far the most widely used technology and also offers the most diverse library types - with many applications not available on other platforms such as Nanopore or PacBio. We have now made this clear in the discussion and included this as a limitation in the section on Operation. We have also added this to the new FAQ section of the documentation.

Reviewer Comment:

Please provide a rationale for trimming reads to 50 bases and only considering read 1 to build the database of nucleotide composition profiles, i.e., why is this sufficient to accurately capture the nucleotide composition of each library type. Some methods result in asymmetric library fragments (e.g., 10X Genomics), with different nucleotide compositions in read 1 and read 2, which in itself can be diagnostic of the library type.

Author Response: Librarian was designed to check individual FastQ files, which is why differences between e.g. read 1 and read 2 are not taken into account despite being informative in some cases. Our decision to use the first 50 bp of a read was influenced by two considerations. Firstly, many types show characteristic biases at the beginning of the read while becoming more similar to the general GC content further along the read (compare Figure 1). Thus, using information from longer reads actually reduces the differences between library types. Secondly, while it is true that read lengths currently tend to be longer, there are still many experiments that use a 50 bp sequencing read length. These could not be analysed if the reference map had been created using the information from 100 bp. We therefore chose 50 bp to balance accuracy with broad usability.
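For readers who want to see what a per-position composition profile over the first 50 bases looks like in practice, the following is a minimal Python sketch. It is an illustration under stated assumptions, not Librarian's own code: the function names `base_composition` and `read_fastq_sequences` are hypothetical, the input is assumed to be an uncompressed FASTQ with exactly four lines per record, and bases other than A, C, G and T are simply ignored.

```python
from collections import Counter

BASES = "ACGT"


def read_fastq_sequences(lines):
    """Yield the sequence line of each record.

    Assumes a plain FASTQ layout with exactly four lines per record
    (header, sequence, '+', qualities).
    """
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def base_composition(reads, length=50):
    """Return, for each of the first `length` positions, the fraction of
    A, C, G and T observed across all reads (hypothetical helper)."""
    counts = [Counter() for _ in range(length)]
    for read in reads:
        # Only the first `length` bases contribute; shorter reads
        # contribute to the positions they cover.
        for pos, base in enumerate(read[:length].upper()):
            if base in BASES:
                counts[pos][base] += 1
    profile = []
    for c in counts:
        total = sum(c[b] for b in BASES)
        profile.append({b: (c[b] / total if total else 0.0) for b in BASES})
    return profile
```

A profile built this way has 50 positions with four fractions each, which is the kind of input a dimensionality reduction such as UMAP could then place on a two-dimensional map.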

Reviewer Comment:

The selection of GEO datasets to build reference profiles seems restrictive and potentially biased. Please can you provide evidence that Librarian is applicable to other species than human/mouse, especially species with divergent GC content.

Author Response: We deliberately restricted the selection of samples to human and mouse species as it is indeed the case that species with different GC content will produce libraries of different compositions. Thus, as a minimum, test samples should have a similar GC content to the samples from the reference map. We decided to focus on human and mouse data for the reference map because of the abundance of available data and prevalence in current studies. Here is an example of RNA-seq libraries from species with substantially different GC contents (Apis mellifera ~35 % GC, Staphylococcus aureus ~30 % GC and Streptomyces coelicolor ~70 % GC).

Please see Image 1 here: https://f1000research.s3.amazonaws.com/linked/649555.Image_1.1.pdf

Please see Image 2 here: https://f1000research.s3.amazonaws.com/linked/649556.Image_1.2.pdf

Please see Image 3 here: https://f1000research.s3.amazonaws.com/linked/649557.Image_1.3.pdf

We therefore do not recommend using Librarian for species other than mouse or human although it is very likely that it would perform well on other mammalian species with similar GC content to human and mouse. We have now indicated this in the section on Operation and the new FAQ section of the documentation. Please also see reply to Reviewer 2.

Species specific and broader mammalian reference maps are planned for future Librarian releases.
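Because the response ties applicability to GC content, a quick pre-check of a test library's overall GC fraction can be sketched as follows. This is a hedged illustration only: `gc_fraction` and `gc_is_comparable` are hypothetical helpers, the reference value must be supplied by the user, and the default tolerance of 0.05 is an arbitrary assumption rather than a threshold used by Librarian.

```python
def gc_fraction(reads):
    """Return the fraction of G and C among all A/C/G/T bases in `reads`."""
    gc = total = 0
    for read in reads:
        seq = read.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(b) for b in "ACGT")
    return gc / total if total else 0.0


def gc_is_comparable(test_gc, reference_gc, tolerance=0.05):
    """Return True if the test GC fraction lies within `tolerance` of the
    reference GC fraction (the tolerance is an illustrative assumption)."""
    return abs(test_gc - reference_gc) <= tolerance
```

A library at roughly 70 % GC, as quoted above for Streptomyces coelicolor, would fail such a check against any reference value near 40 %.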

Reviewer Comment:

The date range filter (01/01/2018 - 31/12/2020) is also likely to have resulted in more recent library types to be excluded from the analysis and therefore the reference dataset. 10X Genomics library types are highly popular, but surprisingly absent. Other library types may have been missed too.

Author Response: It is indeed the case that new library types become available all the time. While Librarian is unlikely to ever be able to cater for niche technologies it will need to be updated over time as new library preparation methods become popular. We have therefore made the design decision to have both the web app and the CLI query the reference data on the Babraham server. This is to ensure that users will have immediate access to updated reference maps as they become available. A current limitation is that GEO only offers a set number of metadata tags for library types upon sample submission which results in grouping of distinct preparation methods under one name. (Please also see responses to Reviewer 1 Question 2 and Reviewer 2). Currently, there is no tag for 10X Genomics libraries on GEO and submitters may choose to either label it as ‘RNA-seq’ or potentially ‘OTHER’. We are hoping to be able to query a wider range of metadata in the future and provide a more fine grained classification of library types.

Reviewer Comment:

Transposon-based library preparation is increasingly popular and applied across a wide range of library types, including single-cell RNA and DNA sequencing, whole-genome sequencing, ATAC-seq, enrichment capture etc. The authors briefly acknowledge this in the discussion, but it appears to be a major limitation of the tool, i.e., transposon-insertion signature at the start of the reads will likely obscure the underlying library type, causing most transposon-based libraries to cluster together. This should be explicitly documented and investigated further, if possible.

Author Response: This is indeed a limitation and laid out in the second paragraph of the discussion. We have now also added this scenario explicitly to the FAQ section of the documentation. With the current set of reference data we find that the proportion of transposon-based library preparation for other types than ATAC-seq is very low. As this library preparation method becomes more popular it will feature more strongly and in a more diverse way in future reference maps.

Reviewer Comment:

More generally speaking, I would strongly encourage the authors to explicitly identify library types and species "supported" by Librarian, indicating that submission of other library types and/or species may result in inconclusive or potentially misleading results (I acknowledge that the software will accept any FASTQ file).

Author Response: We have now been much more explicit about which samples are suitable for analysis with Librarian: We have added the following statement to the Operations section: “Irrespective of platform, Librarian is only suitable to assess datasets which match the types that the reference map is built on. More specifically, test samples need to have been sequenced with Illumina technology, match any of the included library types and be of mammalian origin (ideally mouse or human).”
We also discuss extensively which samples are supported in the new FAQ section of the documentation.

Reviewer Comment:

Minor comments


Please briefly comment on the observed pattern for ChIA-PET and ChIP-seq libraries, i.e., why are these expected and consistent with the library method. ChIA-PET is not a widely used library method. A short description should be included in the text for context.

Author Response:  A description of the techniques BS-Seq, ATAC-Seq, ChIA-PET and ChIP-Seq and an explanation of the method specific composition biases along the read has been added to the legend to Figure 2.

Reviewer Comment:

Please add legend to Figure 1 with key matching coloured lines to individual bases.

Author Response:  This omission has been corrected.

Reviewer Comment:

I would suggest meta-analysis of public datasets as another important use case for Librarian, e.g., as a clean-up tool prior to meta-analysis or identifying patterns/biases in library type, or subtypes.

Author Response: This is a good suggestion and this use case has now been included in the manuscript.

Reviewer Comment:

Please clarify whether Librarian can be set up with a local, custom server in addition to querying an online database via the web app or command line tool.

Author Response: We provide Librarian as a web tool and with a Linux command line interface. The CLI can be run either in offline mode in which visualisations and predictions are computed locally (this is a new option) or remote mode in which Librarian queries the same web server as the web tool. The latter has the advantage of avoiding R dependencies and ensures that libraries are compared to the latest reference map. The different modes are explained in the section Operation of the manuscript, and in the section Usage in the online tool documentation.

Reviewer Comment:

The tabular data in figure 2A shows library types with fewer than 25 samples despite these being classified as under-represented libraries and excluded from the analysis in the text.

Author Response: This is correct. The search results were filtered to exclude library types with fewer than 25 samples. The table shows the number of samples for which we could retrieve base compositions. We were not able to retrieve sequence compositions for all search hits. Also note that RNA-seq search results were capped at 500 but only 452 FastQ files made it through the analysis. Most samples dropped out at the download stage, which is a common occurrence. Since this was done in high throughput we did not attempt to troubleshoot the process or download manually.

Reviewer Comment: Overall, I believe that the premise of Librarian is a very good idea and thank the authors for their efforts in releasing the program as a publicly available tool. I look forward to reading their responses, and future iterations of the software addressing current limitations.
Competing Interests: No competing interests were disclosed.

Reviewer Report 14 Oct 2022

Konstantin Okonechnikov, German Cancer Research Center, Heidelberg, Germany

Approved with Reservations

https://doi.org/10.5256/f1000research.137618.r151966

The manuscript describes the quality control (QC) tool Librarian that predicts the similarity of sequencing library correctness in comparison to the control cohort based on the analysis of reads. Initially for this purpose, the composition of nucleotide variance in the first 50 bp of read was used as input to create a large reference control cohort from publicly available data. Visualisation of these merged nucleotide profiles via UMAP allows to observe clear groups formation based on the data type of a dataset. Novel sample check is a projection into this reference UMAP. Testing the online tool confirmed its usefulness: from inspection of own data, the majority of cases were distinguished correctly. Such projection of a novel dataset into reference would be a useful QC step for any sequencing experiment. However, the manuscript and the software description could be improved in order to provide more details about the tool as well as explain certain missing blocks.

In general, the manuscript clearly describes the technique, however, only one possible limitation of the method is stated (effect of a cut in RNA-seq fragments leading to similarity to ATAC-seq in Discussion). More variance factors could be inspected to avoid misleading conclusions about the analysis results. For example, the tissue materials can be obtained from frozen tissue (FFPE), is there any impact of this preparation procedure? In my inspection, the standard RNA-seq datasets were distinguished quite well, but FFPE RNA-seq demonstrated the closest similarity to MBPD and MeDIP-seq. The scRNA-seq protocols are also included, however, they vary since they could be either full gene body covering or only 5’/3’ segment of a gene. Could this have an impact on read nucleotide variance?
The reads selection is performed with 100K subsampling - how was this amount selected? What is the effect of the total number of reads? Is it sufficient to provide only a subset of them? In this case, what is the suggested limit?
Also, 50 bp read segment is used as the reference, but how was this selection made? Currently, the main standard for sequencing is 100-150 bp. Would it be more beneficial to use a larger segment of the read for reference generation? Or do quality issues in longer reads have a negative impact?
How strong is the species effect? Are there variances observed between mice and human materials in full UMAP, e.g. clusters formation? Does it make sense to create own reference for such a procedure, especially when working on other species, e.g. Drosophila?

Further additional comments could help to improve the manuscript for easier reading:

In Figure 1, the nucleotide type color legend is missing, also segments are not cited in the text directly by suffix (a,b,c,d). Figure 1a demonstrates ChIA-PET, but not clear why it is included since it's not stated in the manuscript text.
Figure 2a: Are the amounts of mice and humans mixed? What is the variance?
Figure 2c: Several enrichment UMAP locations for certain data types are far from each other, e.g. RNA-seq. How to interpret this? Could it be certain subclusters, further splitting the dataset types?
When downloading example datasets (sample FASTQ files), the archive cannot be opened. Also, there is no documentation available regarding input format preparation, e.g. fastq are not allowed to be gzipped, it’s not clear without testing.
Github documentation on the establishment/launch lacks some details. Would be useful to extend it especially to state what are the system environment requirements before installation.

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

No
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Bioinformatics data analysis in pediatric neurooncology

Author Response 23 May 2024

Christel Krueger, Bioinformatics, Babraham Institute, Cambridge, CB22 3AT, UK

The manuscript describes the quality control (QC) tool Librarian that predicts the similarity of sequencing library correctness in comparison to the control cohort based on the analysis of reads. Initially for this purpose, the composition of nucleotide variance in the first 50 bp of read was used as input to create a large reference control cohort from publicly available data. Visualisation of these merged nucleotide profiles via UMAP allows to observe clear groups formation based on the data type of a dataset. Novel sample check is a projection into this reference UMAP. Testing the online tool confirmed its usefulness: from inspection of own data, the majority of cases were distinguished correctly. Such projection of a novel dataset into reference would be a useful QC step for any sequencing experiment. However, the manuscript and the software description could be improved in order to provide more details about the tool as well as explain certain missing blocks.

We would like to thank Konstantin Okonechnikov for reviewing our manuscript and for taking the time to test the tool Librarian. We much appreciate his thoughtful comments and suggestions for improvement. Please find below detailed answers to each of the questions/comments.

In general, the manuscript clearly describes the technique, however, only one possible limitation of the method is stated (effect of a cut in RNA-seq fragments leading to similarity to ATAC-seq in Discussion). More variance factors could be inspected to avoid misleading conclusions about the analysis results. For example, the tissue materials can be obtained from frozen tissue (FFPE), is there any impact of this preparation procedure? In my inspection, the standard RNA-seq datasets were distinguished quite well, but FFPE RNA-seq demonstrated the closest similarity to MBPD and MeDIP-seq. The scRNA-seq protocols are also included, however, they vary since they could be either full gene body covering or only 5’/3’ segment of a gene. Could this have an impact on read nucleotide variance?

Systematic analysis of all factors influencing the base compositions is hampered by the limited amount of metadata that is provided by GEO, e. g. sample preparation methods or library kits are not included. One idea that we are pursuing for future developments in Librarian is to mine the entire GEO entry (rather than just the metadata) for more detailed sample information.

Prompted by your observation we manually found and downloaded five FFPE RNA-seq samples from four different studies (SRR11996379, SRR11996380, SRR21404015, SRR21773166, SRR23245053) and ran these through Librarian with the following results:

Please see image 1 here: https://f1000research.s3.amazonaws.com/linked/649544.Image_1.pdf

Please see image 2 here: https://f1000research.s3.amazonaws.com/linked/649545.Image_2.pdf

We cannot confirm a general issue with classifying FFPE samples, but we note that for some regions of the reference map RNA-seq shows substantial overlap with other library types, which may result in misprediction of the library type.

The situation seems to be similar for 3’ RNAseq libraries. We downloaded four 10X libraries from two different studies (SRR18916640, SRR18916641, SRR23254936, SRR23254938). The Librarian results can be found below. Two samples are classified correctly, the other two are from an area in the reference map which is occupied by several library types and therefore have some wobble in the classification.

Please see image 3 here: https://f1000research.s3.amazonaws.com/linked/649546.Image_3.pdf

Please see image 4 here: https://f1000research.s3.amazonaws.com/linked/649547.Image_4.pdf

We have included a section on “My sample doesn't come up as the library type that I expect. How worried do I need to be?” in the FAQs.

The reads selection is performed with 100K subsampling - how was this amount selected? What is the effect of the total number of reads? Is it sufficient to provide only a subset of them? In this case, what is the suggested limit?

The idea behind subsampling is to find a good balance between faithful representation of the characteristics of a FastQ file and file size / processing speed. Subsampling is employed by several Babraham Bioinformatics QC tools including FastQScreen and some modules of FastQC. For Librarian, the subsample needs to be big enough for the base compositions to have stabilised and not be subject to random variation, which is indeed the case for 100 K. Selecting the reads randomly rather than from the top of the file ensures that files in which reads are supplied in a sorted manner can also be dealt with. 100 K reads is the lower limit that Librarian accepts in a file. See below for an example in which 100 K, 200 K, 500 K and 13.5 million reads of an RNA-seq library were analysed. The samples are located in exactly the same location on the reference map.

Please see image 5 here: https://f1000research.s3.amazonaws.com/linked/649548.Image_5.pdf

Please see image 6 here: https://f1000research.s3.amazonaws.com/linked/649549.Image_6.pdf
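Random selection of a fixed number of reads from a file of unknown length can be done in a single pass with reservoir sampling. The sketch below shows that technique as one possible way to subsample. It is an assumption made for illustration, not a description of how Librarian selects its reads, and `reservoir_sample` is a hypothetical name.

```python
import random


def reservoir_sample(records, k=100_000, seed=None):
    """Return up to `k` records chosen uniformly at random from an iterable,
    using reservoir sampling (Algorithm R) in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(rec)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = rec
    return sample
```

Because every record has the same chance of ending up in the sample regardless of its position, a file whose reads are sorted is handled in the same way as an unsorted one. Unlike Librarian, which per the response above requires at least 100 K reads, this sketch simply returns all records when fewer than `k` are available.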

Also, 50 bp read segment is used as the reference, but how was this selection made? Currently, the main standard for sequencing is 100-150 bp. Would it be more beneficial to use a larger segment of the read for reference generation? Or do quality issues in longer reads have a negative impact?

Our decision to use the first 50 bp of a read was influenced by two considerations: Firstly, many types show characteristic biases at the beginning of the read while becoming more similar to the general GC content further along the read (compare Figure 1). Thus, using information from longer reads actually reduces the differences between library types. Secondly, while it is true that currently read lengths tend to be longer, there are still many experiments that use 50 bp sequencing read length. These could not be analysed if the reference map had been created using the information from 100 bp. We therefore chose 50 bp to balance accuracy with broad usability.

How strong is the species effect? Are there variances observed between mice and human materials in full UMAP, e.g. clusters formation? Does it make sense to create own reference for such a procedure, especially when working on other species, e.g. Drosophila?

The magnitude of the species effect will depend on the difference of overall GC content and genome biology. We have included the reference map coloured by species as Supplementary Figure 2. As expected from the similar GC content and similar biology, libraries from mouse and human samples largely show the same base composition. We decided to focus on those two species because of the abundance of available data and prevalence in current studies.
Here is an example of RNA-seq libraries from species with substantially different GC contents (Apis mellifera ~35 % GC, Staphylococcus aureus ~ 30 % GC and Streptomyces coelicolor ~70 %).

Please see image 7 here: https://f1000research.s3.amazonaws.com/linked/649550.Image_7.pdf

Please see image 8 here: https://f1000research.s3.amazonaws.com/linked/649551.Image_8.pdf

Please see image 9 here: https://f1000research.s3.amazonaws.com/linked/649552.Image_9.pdf

We therefore do not recommend using Librarian for species other than mouse or human although it is very likely that it would perform well on other mammalian species with similar GC content to human and mouse. We have now indicated this in the section on Operation and in the new FAQ section of the documentation. Please also see response to Reviewer 3. Species specific and broader mammalian reference maps are planned for future Librarian releases.

Further additional comments could help to improve the manuscript for easier reading:

In Figure 1, the nucleotide type color legend is missing, also segments are not cited in the text directly by suffix (a,b,c,d). Figure 1a demonstrates ChIA-PET, but not clear why it is included since it's not stated in the manuscript text.

The text has been changed to briefly introduce the characteristic profiles observed with all four library types shown in the figures, and individual subpanels are now referenced. Details regarding the preparation protocols and how these produce the observed profiles have been added to the figure legend. The figure has been changed to reflect the order of library types in the text and the colour legend has been added.

Figure 2a: Are the amounts of mice and humans mixed? What is the variance?

We now include a file called ‘Supplementary Information’ with the publication which is deposited on Zenodo (along with other data and code). This contains a table showing how many data sets were from mouse and human per library type. The inputs were broadly balanced with the exception of ChIA-PET.

Please see image 10 here: https://f1000research.s3.amazonaws.com/linked/649553.Image_10.pdf

Figure 2c: Several enrichment UMAP locations for certain data types are far from each other, e.g. RNA-seq. How to interpret this? Could it be certain subclusters, further splitting the dataset types?

Apart from technical variation an underlying reason can be grouping of library preparation methods that produce distinct composition profiles. For RNA-seq there are a number of popular commercial kits available which will produce their own biases (e. g. stranded vs non-stranded, random vs poly-A primed). The most striking example of grouping by library preparation method is BS-seq where several commonly used library preparation methods produce different compositions. Two easily identifiable groups on the reference map are whole genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS).

Please see image 11 here: https://f1000research.s3.amazonaws.com/linked/649554.Image_11.pdf

Predating the development of Librarian we explored the option of predicting various bisulfite library types from their base compositions. A proof of concept can be found here: https://github.com/ChristelKrueger/Charades
The subgrouping of bisulfite libraries highlights a more general issue with the metadata available on GEO. All the different variations of these libraries carry the flag ‘Bisulfite-Seq’ and further distinction is only found in the method description - if at all. This can easily create problems when re-processing published data as processing pipelines are often specific for a particular sample type. Grouping of disparate sample types is widespread amongst all categories especially as by now a lot of different library preparation methods are on the market.
We have now included an explanation of subgroupings and underlying reasons in the discussion.

When downloading example datasets (sample FASTQ files), the archive cannot be opened. Also, there is no documentation available regarding input format preparation, e.g. fastq are not allowed to be gzipped, it’s not clear without testing.

The example file archive has been fixed and is now working.

Librarian accepts both fastq and gzipped files. However, we found that upload of gzipped files is an issue outside of Librarian with some Mac/browser combinations. The workaround is to drag & drop files onto the upload button. We have added the sentence “Librarian accepts fastq and gzipped files. If gzipped files are greyed out in the pop up, try dragging them onto the upload button.” to the web app and have also included this workaround in the FAQs.

Github documentation on the establishment/launch lacks some details. Would be useful to extend it especially to state what are the system environment requirements before installation.

We have now included more comprehensive documentation including detailed installation instructions and FAQs at https://desmondwillowbrook.github.io/Librarian/. We point to these from the manuscript (section Operation) and the github repository.

The situation seems to be similar for 3’ RNAseq libraries. We downloaded four 10X libraries from two different studies (SRR18916640, SRR18916641, SRR23254936, SRR23254938). The Librarian results can be found below. Two samples are classified correctly, the other two are from an area in the reference map which is occupied by several library types and therefore have some wobble in the classification.

Please see image 3 here: https://f1000research.s3.amazonaws.com/linked/649546.Image_3.pdf

Please see image 4 here: https://f1000research.s3.amazonaws.com/linked/649547.Image_4.pdf

We have included a section on “My sample doesn't come up as the library type that I expect. How worried do I need to be?” in the FAQs.

The reads selection is performed with 100K subsampling - how was this amount selected? What is the effect of the total number of reads? Is it sufficient to provide only a subset of them? In this case, what is the suggested limit?

The idea behind subsampling is to find a good balance between faithful representation of the characteristics of a FastQ file and file size / processing speed. Subsampling is employed by several Babraham Bioinformatics QC tools including FastQScreen and some modules of FastQC. For Librarian, the subsample needs to be big enough for the base compositions to have stabilised and not be subject to random variation which is indeed the case for 100 K. Selecting the reads randomly rather than from the top of the file ensures that files in which reads are supplied in a sorted manner can also be dealt with. 100 K reads is the lower limit that Librarian accepts in a file. See below for an example in which 100 K, 200 K, 500 K and 13.5 million reads of an RNA-seq library. The samples are located in exactly the same location on the reference map.

Please see image 5 here: https://f1000research.s3.amazonaws.com/linked/649548.Image_5.pdf

Please see image 6 here: https://f1000research.s3.amazonaws.com/linked/649549.Image_6.pdf

Also, 50 bp read segment is used as the reference, but how was this selection made? Currently, the main standard for sequencing is 100-150 bp. Would it be more beneficial to use a larger segment of the read for reference generation? Or do quality issues in longer reads have a negative impact?

Our decision to use the first 50 bp of a read was influenced by two considerations: Firstly, many types show characteristic biases at the beginning of the read while becoming more similar to the general GC content further along the read (compare Figure 1). Thus, using information from longer reads actually reduces the differences between library types. Secondly, while it is true that currently read lengths tend to be longer, there are still many experiments that use 50 bp sequencing read length. These could not be analysed if the reference map had been created using the information from 100 bp. We therefore chose 50 bp to balance accuracy with broad usability.

How strong is the species effect? Are there variances observed between mice and human materials in full UMAP, e.g. clusters formation? Does it make sense to create own reference for such a procedure, especially when working on other species, e.g. Drosophila?

The magnitude of the species effect will depend on the difference of overall GC content and genome biology. We have included the reference map coloured by species as Supplementary Figure 2. As expected from the similar GC content and similar biology, libraries from mouse and human samples largely show the same base composition. We decided to focus on those two species because of the abundance of available data and prevalence in current studies.
Here is an example of RNA-seq libraries from species with substantially different GC contents (Apis mellifera ~35 % GC, Staphylococcus aureus ~ 30 % GC and Streptomyces coelicolor ~70 %).

Please see image 7 here: https://f1000research.s3.amazonaws.com/linked/649550.Image_7.pdf

Please see image 8 here: https://f1000research.s3.amazonaws.com/linked/649551.Image_8.pdf

Please see image 9 here: https://f1000research.s3.amazonaws.com/linked/649552.Image_9.pdf

We therefore do not recommend using Librarian for species other than mouse or human although it is very likely that it would perform well on other mammalian species with similar GC content to human and mouse. We have now indicated this in the section on Operation and in the new FAQ section of the documentation. Please also see response to Reviewer 3. Species specific and broader mammalian reference maps are planned for future Librarian releases.

Further additional comments could help to improve the manuscript for easier reading:

In Figure 1, the nucleotide type color legend is missing, also segments are not cited in the text directly by suffix (a,b,c,d). Figure 1a demonstrates ChIA-PET, but not clear why it is included since it's not stated in the manuscript text.

The text has been changed to briefly introduce the characteristic profiles observed with all four library types shown in the figures, and individual subpanels are now referenced. Details regarding the preparation protocols and how these produce the observed profiles have been added to the figure legend. The figure has been changed to reflect the order of library types in the text and the colour legend has been added.

Figure 2a: Are the amounts of mice and humans mixed? What is the variance?

We now include a file called ‘Supplementary Information’ with the publication which is deposited on Zenodo (along with other data and code). This contains a table showing how many data sets were from mouse and human per library type. The inputs were broadly balanced with the exception of ChIA-PET.

Please see image 10 here: https://f1000research.s3.amazonaws.com/linked/649553.Image_10.pdf

Figure 2c: Several enrichment UMAP locations for certain data types are far from each other, e.g. RNA-seq. How to interpret this? Could it be certain subclusters, further splitting the dataset types?

Apart from technical variation an underlying reason can be grouping of library preparation methods that produce distinct composition profiles. For RNA-seq there are a number of popular commercial kits available which will produce their own biases (e. g. stranded vs non-stranded, random vs poly-A primed). The most striking example of grouping by library preparation method is BS-seq where several commonly used library preparation methods produce different compositions. Two easily identifiable groups on the reference map are whole genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS).

Please see image 11 here: https://f1000research.s3.amazonaws.com/linked/649554.Image_11.pdf

Predating the development of Librarian we explored the option of predicting various bisulfite library types from their base compositions. A proof of concept can be found here: https://github.com/ChristelKrueger/Charades
The subgrouping of bisulfite libraries highlights a more general issue with the metadata available on GEO. All the different variations of these libraries carry the flag ‘Bisulfite-Seq’ and further distinction is only found in the method description - if at all. This can easily create problems when re-processing published data as processing pipelines are often specific for a particular sample type. Grouping of disparate sample types is widespread amongst all categories especially as by now a lot of different library preparation methods are on the market.
We have now included an explanation of subgroupings and underlying reasons in the discussion.

When downloading example datasets (sample FASTQ files), the archive cannot be opened. Also, there is no documentation available regarding input format preparation, e.g fastq are not allowed to be gzipped, it’s not clear without testing.

The example file archive has been fixed and is now working.

Librarian accepts both fastq and gzipped files. However, we found that upload of gzipped files is an issue outside of Librarian with some Mac/browser combinations. The workaround is to drag & drop files onto the upload button. We have added the sentence “Librarian accepts fastq and gzipped files. If gzipped files are greyed out in the pop up, try dragging them onto the upload button.” to the web app and have also included this workaround in the FAQs.

Github documentation on the establishment/launch lacks some details. Would be useful to extend it especially to state what are the system environment requirements before installation.

We have now included more comprehensive documentation including detailed installation instructions and FAQs at https://desmondwillowbrook.github.io/Librarian/. We point to these from the manuscript (section Operation) and the github repository.
Competing Interests: No competing interests were disclosed.

COMMENTS ON THIS REPORT

Author Response 23 May 2024

Christel Krueger, Bioinformatics, Babraham Institute, Cambridge, CB22 3AT, UK

The manuscript describes the quality control (QC) tool Librarian that predicts the similarity of sequencing library correctness in comparison to the control cohort based on the analysis of reads. Initially for this purpose, the composition of nucleotide variance in the first 50 bp of read was used as input to create a large reference control cohort from publicly available data. Visualisation of these merged nucleotide profiles via UMAP allows to observe clear groups formation based on the data type of a dataset. Novel sample check is a projection into this reference UMAP. Testing the online tool confirmed its usefulness: from inspection of own data, the majority of cases were distinguished correctly. Such projection of a novel dataset into reference would be a useful QC step for any sequencing experiment. However, the manuscript and the software description could be improved in order to provide more details about the tool as well as explain certain missing blocks.

We would like to thank Konstantin Okonechnikov for reviewing our manuscript and for taking the time to test the tool Librarian. We much appreciate his thoughtful comments and suggestions for improvement. Please find below detailed answers to each of the questions/comments.

In general, the manuscript clearly describes the technique, however, only one possible limitation of the method is stated (effect of a cut in RNA-seq fragments leading to similarity to ATAC-seq in Discussion). More variance factors could be inspected to avoid misleading conclusions about the analysis results. For example, the tissue materials can be obtained from frozen tissue (FFPE), is there any impact of this preparation procedure? In my inspection, the standard RNA-seq datasets were distinguished quite well, but FFPE RNA-seq demonstrated the closest similarity to MBPD and MeDIP-seq. The scRNA-seq protocols are also included, however, they vary since they could be either full gene body covering or only 5’/3’ segment of a gene. Could this have an impact on read nucleotide variance?

Systematic analysis of all factors influencing the base compositions is hampered by the limited amount of metadata provided by GEO; for example, sample preparation methods and library kits are not included. One idea that we are pursuing for future developments in Librarian is to mine the entire GEO entry (rather than just the metadata) for more detailed sample information.

Prompted by your observation we manually found and downloaded five FFPE RNA-seq samples from four different studies (SRR11996379, SRR11996380, SRR21404015, SRR21773166, SRR23245053) and ran these through Librarian with the following results:

Please see image 1 here: https://f1000research.s3.amazonaws.com/linked/649544.Image_1.pdf

Please see image 2 here: https://f1000research.s3.amazonaws.com/linked/649545.Image_2.pdf

We cannot confirm a general issue with classifying FFPE samples, but we note that in some regions of the reference map RNA-seq overlaps substantially with other library types, which may result in misprediction of the library type.

The situation seems to be similar for 3’ RNA-seq libraries. We downloaded four 10X libraries from two different studies (SRR18916640, SRR18916641, SRR23254936, SRR23254938). The Librarian results can be found below. Two samples are classified correctly; the other two fall in an area of the reference map that is occupied by several library types, so their classification is less certain.

Please see image 3 here: https://f1000research.s3.amazonaws.com/linked/649546.Image_3.pdf

Please see image 4 here: https://f1000research.s3.amazonaws.com/linked/649547.Image_4.pdf

We have included a section on “My sample doesn't come up as the library type that I expect. How worried do I need to be?” in the FAQs.

The reads selection is performed with 100K subsampling - how was this amount selected? What is the effect of the total number of reads? Is it sufficient to provide only a subset of them? In this case, what is the suggested limit?

The idea behind subsampling is to find a good balance between faithful representation of the characteristics of a FastQ file and file size / processing speed. Subsampling is employed by several Babraham Bioinformatics QC tools including FastQScreen and some modules of FastQC. For Librarian, the subsample needs to be big enough for the base compositions to have stabilised and not be subject to random variation, which is indeed the case for 100 K reads. Selecting the reads randomly rather than from the top of the file ensures that files in which reads are supplied in a sorted manner can also be dealt with. 100 K reads is the lower limit that Librarian accepts in a file. See below for an example in which 100 K, 200 K, 500 K and 13.5 million reads of an RNA-seq library were analysed. The samples are located in exactly the same location on the reference map.

Please see image 5 here: https://f1000research.s3.amazonaws.com/linked/649548.Image_5.pdf

Please see image 6 here: https://f1000research.s3.amazonaws.com/linked/649549.Image_6.pdf
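As an illustration only, random selection of a fixed number of reads from a file of unknown length can be sketched with reservoir sampling. This is not Librarian's actual implementation; the function names `parse_fastq` and `reservoir_sample` and the `seed` parameter are hypothetical, and the sketch assumes plain four-line FASTQ records.

```python
import random
from typing import Iterable, Iterator, List, Optional, Tuple

Read = Tuple[str, str, str]  # (header, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip blank lines between or after records
        seq = next(it, "")
        next(it, "")  # '+' separator line, not needed here
        qual = next(it, "")
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def reservoir_sample(reads: Iterable[Read], k: int,
                     seed: Optional[int] = None) -> List[Read]:
    """Return k reads chosen uniformly at random in a single pass.

    If fewer than k reads are available, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir: List[Read] = []
    for i, read in enumerate(reads):
        if i < k:
            reservoir.append(read)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = read
    return reservoir
```

Because every read has the same chance of ending up in the sample, the result does not depend on whether the file is sorted, which is the property the random selection described above relies on.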

Also, 50 bp read segment is used as the reference, but how was this selection made? Currently, the main standard for sequencing is 100-150 bp. Would it be more beneficial to use a larger segment of the read for reference generation? Or do quality issues in longer reads have a negative impact?

Our decision to use the first 50 bp of a read was influenced by two considerations. Firstly, many library types show characteristic biases at the beginning of the read while becoming more similar to the general GC content further along the read (compare Figure 1). Thus, using information from longer reads actually reduces the differences between library types. Secondly, while it is true that read lengths currently tend to be longer, there are still many experiments that use a 50 bp sequencing read length. These could not be analysed if the reference map had been created using the information from 100 bp. We therefore chose 50 bp to balance accuracy with broad usability.
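To make the quantity under discussion concrete, the per-position base frequencies over the first 50 bp can be computed roughly as follows. This is a minimal sketch rather than the code used by Librarian; the function name `base_composition` is hypothetical, and ignoring non-ACGT calls such as N is an assumption made here for simplicity.

```python
from collections import Counter
from typing import Dict, Iterable, List

BASES = "ACGT"


def base_composition(seqs: Iterable[str],
                     n_positions: int = 50) -> List[Dict[str, float]]:
    """Return, for each of the first n_positions, the fraction of A, C, G and T.

    Bases other than A, C, G, T (for example N) are ignored (an assumption).
    Positions that no read covers get a fraction of 0.0 for every base.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in seqs:
        for pos, base in enumerate(seq[:n_positions].upper()):
            if base in BASES:
                counts[pos][base] += 1

    profile = []
    for c in counts:
        total = sum(c.values())
        profile.append({b: (c[b] / total if total else 0.0) for b in BASES})
    return profile
```

For 50 positions and four bases this gives 200 values per library, which is the kind of composition profile that a dimensionality reduction such as UMAP can then project into two dimensions.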

How strong is the species effect? Are there variances observed between mice and human materials in full UMAP, e.g. clusters formation? Does it make sense to create own reference for such a procedure, especially when working on other species, e.g. Drosophila?

The magnitude of the species effect will depend on the difference of overall GC content and genome biology. We have included the reference map coloured by species as Supplementary Figure 2. As expected from the similar GC content and similar biology, libraries from mouse and human samples largely show the same base composition. We decided to focus on those two species because of the abundance of available data and prevalence in current studies.
Here is an example of RNA-seq libraries from species with substantially different GC contents (Apis mellifera ~35 % GC, Staphylococcus aureus ~30 % GC and Streptomyces coelicolor ~70 % GC).

Please see image 7 here: https://f1000research.s3.amazonaws.com/linked/649550.Image_7.pdf

Please see image 8 here: https://f1000research.s3.amazonaws.com/linked/649551.Image_8.pdf

Please see image 9 here: https://f1000research.s3.amazonaws.com/linked/649552.Image_9.pdf

We therefore do not recommend using Librarian for species other than mouse or human although it is very likely that it would perform well on other mammalian species with similar GC content to human and mouse. We have now indicated this in the section on Operation and in the new FAQ section of the documentation. Please also see response to Reviewer 3. Species specific and broader mammalian reference maps are planned for future Librarian releases.
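Since the species effect is tied to overall GC content, a rough check of a library's GC fraction can help judge whether it is likely to be comparable to the mouse and human reference. The following is a small sketch under our own assumptions; the function name `gc_fraction` is hypothetical and is not part of Librarian.

```python
from typing import Iterable


def gc_fraction(seqs: Iterable[str]) -> float:
    """Return the fraction of G and C among all A, C, G, T calls.

    Ambiguous bases such as N are excluded from the denominator.
    Returns 0.0 if no A, C, G or T is present.
    """
    gc = 0
    total = 0
    for seq in seqs:
        s = seq.upper()
        gc += s.count("G") + s.count("C")
        total += sum(s.count(b) for b in "ACGT")
    return gc / total if total else 0.0
```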

Further additional comments could help to improve the manuscript for easier reading:

In Figure 1, the nucleotide type color legend is missing, also segments are not cited in the text directly by suffix (a,b,c,d). Figure 1a demonstrates ChIA-PET, but not clear why it is included since it's not stated in the manuscript text.

The text has been changed to briefly introduce the characteristic profiles observed with all four library types shown in the figures, and individual subpanels are now referenced. Details regarding the preparation protocols and how these produce the observed profiles have been added to the figure legend. The figure has been changed to reflect the order of library types in the text and the colour legend has been added.

Figure 2a: Are the amounts of mice and humans mixed? What is the variance?

We now include a file called ‘Supplementary Information’ with the publication which is deposited on Zenodo (along with other data and code). This contains a table showing how many data sets were from mouse and human per library type. The inputs were broadly balanced with the exception of ChIA-PET.

Please see image 10 here: https://f1000research.s3.amazonaws.com/linked/649553.Image_10.pdf

Figure 2c: Several enrichment UMAP locations for certain data types are far from each other, e.g. RNA-seq. How to interpret this? Could it be certain subclusters, further splitting the dataset types?

Apart from technical variation, one underlying reason is that a single library type can group together several library preparation methods that produce distinct composition profiles. For RNA-seq there are a number of popular commercial kits available which will produce their own biases (e.g. stranded vs non-stranded, random vs poly-A primed). The most striking example of grouping by library preparation method is BS-seq, where several commonly used library preparation methods produce different compositions. Two easily identifiable groups on the reference map are whole genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS).

Please see image 11 here: https://f1000research.s3.amazonaws.com/linked/649554.Image_11.pdf

Before developing Librarian, we explored the option of predicting various bisulfite library types from their base compositions. A proof of concept can be found here: https://github.com/ChristelKrueger/Charades
The subgrouping of bisulfite libraries highlights a more general issue with the metadata available on GEO. All the different variations of these libraries carry the flag ‘Bisulfite-Seq’ and further distinction is only found in the method description, if at all. This can easily create problems when re-processing published data, as processing pipelines are often specific for a particular sample type. Grouping of disparate sample types is widespread amongst all categories, especially as many different library preparation methods are now on the market.
We have now included an explanation of subgroupings and underlying reasons in the discussion.

When downloading example datasets (sample FASTQ files), the archive cannot be opened. Also, there is no documentation available regarding input format preparation, e.g fastq are not allowed to be gzipped, it’s not clear without testing.

The example file archive has been fixed and is now working.

Librarian accepts both fastq and gzipped files. However, we found that uploading gzipped files can fail with some Mac/browser combinations, an issue that lies outside of Librarian. The workaround is to drag & drop files onto the upload button. We have added the sentence “Librarian accepts fastq and gzipped files. If gzipped files are greyed out in the pop up, try dragging them onto the upload button.” to the web app and have also included this workaround in the FAQs.
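For users who script their own pre-checks, accepting both plain and gzipped FASTQ input can be handled by inspecting the first two bytes of the file instead of relying on the file extension. The sketch below is illustrative and not taken from Librarian; the function name `open_fastq` is hypothetical.

```python
import gzip
from typing import TextIO

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_fastq(path: str) -> TextIO:
    """Open a FASTQ file as text, whether or not it is gzip-compressed."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt")
    return open(path, "rt")
```

Checking the magic bytes means a gzipped file is still read correctly if it was saved without a `.gz` suffix.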

Github documentation on the establishment/launch lacks some details. Would be useful to extend it especially to state what are the system environment requirements before installation.

We have now included more comprehensive documentation including detailed installation instructions and FAQs at https://desmondwillowbrook.github.io/Librarian/. We point to these from the manuscript (section Operation) and the GitHub repository.
The manuscript describes the quality control (QC) tool Librarian that predicts the similarity of sequencing library correctness in comparison to the control cohort based on the analysis of reads. Initially for this purpose, the composition of nucleotide variance in the first 50 bp of read was used as input to create a large reference control cohort from publicly available data. Visualisation of these merged nucleotide profiles via UMAP allows to observe clear groups formation based on the data type of a dataset. Novel sample check is a projection into this reference UMAP. Testing the online tool confirmed its usefulness: from inspection of own data, the majority of cases were distinguished correctly. Such projection of a novel dataset into reference would be a useful QC step for any sequencing experiment. However, the manuscript and the software description could be improved in order to provide more details about the tool as well as explain certain missing blocks.

We would like to thank Konstantin Okonechnikov for reviewing our manuscript and for taking the time to test the tool Librarian. We much appreciate his thoughtful comments and suggestions for improvement. Please find below detailed answers to each of the questions/comments.

In general, the manuscript clearly describes the technique, however, only one possible limitation of the method is stated (effect of a cut in RNA-seq fragments leading to similarity to ATAC-seq in Discussion). More variance factors could be inspected to avoid misleading conclusions about the analysis results. For example, the tissue materials can be obtained from frozen tissue (FFPE), is there any impact of this preparation procedure? In my inspection, the standard RNA-seq datasets were distinguished quite well, but FFPE RNA-seq demonstrated the closest similarity to MBPD and MeDIP-seq. The scRNA-seq protocols are also included, however, they vary since they could be either full gene body covering or only 5’/3’ segment of a gene. Could this have an impact on read nucleotide variance?

Systematic analysis of all factors influencing the base compositions is hampered by the limited amount of metadata that is provided by GEO, e. g. sample preparation methods or library kits are not included. One idea that we are pursuing for future developments in Librarian is to mine the entire GEO entry (rather than just the metadata) for more detailed sample information.

Prompted by your observation we manually found and downloaded five FFPE RNA-seq samples from four different studies (SRR11996379, SRR11996380, SRR21404015, SRR21773166, SRR23245053) and ran these through Librarian with the following results:

Please see image 1 here: https://f1000research.s3.amazonaws.com/linked/649544.Image_1.pdf

Please see image 2 here: https://f1000research.s3.amazonaws.com/linked/649545.Image_2.pdf

We cannot confirm that there seems to be a general issue classifying FFPE samples but we note that for some regions of the reference map RNA-seq shows substantial overlap with other library types which may result in misprediction of the library type.

The situation seems to be similar for 3’ RNAseq libraries. We downloaded four 10X libraries from two different studies (SRR18916640, SRR18916641, SRR23254936, SRR23254938). The Librarian results can be found below. Two samples are classified correctly, the other two are from an area in the reference map which is occupied by several library types and therefore have some wobble in the classification.

Please see image 3 here: https://f1000research.s3.amazonaws.com/linked/649546.Image_3.pdf

Please see image 4 here: https://f1000research.s3.amazonaws.com/linked/649547.Image_4.pdf

We have included a section on “My sample doesn't come up as the library type that I expect. How worried do I need to be?” in the FAQs.

The reads selection is performed with 100K subsampling - how was this amount selected? What is the effect of the total number of reads? Is it sufficient to provide only a subset of them? In this case, what is the suggested limit?

The idea behind subsampling is to find a good balance between faithful representation of the characteristics of a FastQ file and file size / processing speed. Subsampling is employed by several Babraham Bioinformatics QC tools including FastQScreen and some modules of FastQC. For Librarian, the subsample needs to be big enough for the base compositions to have stabilised and not be subject to random variation which is indeed the case for 100 K. Selecting the reads randomly rather than from the top of the file ensures that files in which reads are supplied in a sorted manner can also be dealt with. 100 K reads is the lower limit that Librarian accepts in a file. See below for an example in which 100 K, 200 K, 500 K and 13.5 million reads of an RNA-seq library. The samples are located in exactly the same location on the reference map.

Please see image 5 here: https://f1000research.s3.amazonaws.com/linked/649548.Image_5.pdf

Please see image 6 here: https://f1000research.s3.amazonaws.com/linked/649549.Image_6.pdf

Also, 50 bp read segment is used as the reference, but how was this selection made? Currently, the main standard for sequencing is 100-150 bp. Would it be more beneficial to use a larger segment of the read for reference generation? Or do quality issues in longer reads have a negative impact?

Our decision to use the first 50 bp of a read was influenced by two considerations: Firstly, many types show characteristic biases at the beginning of the read while becoming more similar to the general GC content further along the read (compare Figure 1). Thus, using information from longer reads actually reduces the differences between library types. Secondly, while it is true that currently read lengths tend to be longer, there are still many experiments that use 50 bp sequencing read length. These could not be analysed if the reference map had been created using the information from 100 bp. We therefore chose 50 bp to balance accuracy with broad usability.

How strong is the species effect? Are there variances observed between mice and human materials in full UMAP, e.g. clusters formation? Does it make sense to create own reference for such a procedure, especially when working on other species, e.g. Drosophila?

The magnitude of the species effect will depend on the difference of overall GC content and genome biology. We have included the reference map coloured by species as Supplementary Figure 2. As expected from the similar GC content and similar biology, libraries from mouse and human samples largely show the same base composition. We decided to focus on those two species because of the abundance of available data and prevalence in current studies.
Here is an example of RNA-seq libraries from species with substantially different GC contents (Apis mellifera ~35 % GC, Staphylococcus aureus ~ 30 % GC and Streptomyces coelicolor ~70 %).

Please see image 7 here: https://f1000research.s3.amazonaws.com/linked/649550.Image_7.pdf

Please see image 8 here: https://f1000research.s3.amazonaws.com/linked/649551.Image_8.pdf

Please see image 9 here: https://f1000research.s3.amazonaws.com/linked/649552.Image_9.pdf

We therefore do not recommend using Librarian for species other than mouse or human although it is very likely that it would perform well on other mammalian species with similar GC content to human and mouse. We have now indicated this in the section on Operation and in the new FAQ section of the documentation. Please also see response to Reviewer 3. Species specific and broader mammalian reference maps are planned for future Librarian releases.

Further additional comments could help to improve the manuscript for easier reading:

In Figure 1, the nucleotide type color legend is missing, also segments are not cited in the text directly by suffix (a,b,c,d). Figure 1a demonstrates ChIA-PET, but not clear why it is included since it's not stated in the manuscript text.

The text has been changed to briefly introduce the characteristic profiles observed with all four library types shown in the figures, and individual subpanels are now referenced. Details regarding the preparation protocols and how these produce the observed profiles have been added to the figure legend. The figure has been changed to reflect the order of library types in the text and the colour legend has been added.

Figure 2a: Are the amounts of mice and humans mixed? What is the variance?

We now include a file called ‘Supplementary Information’ with the publication which is deposited on Zenodo (along with other data and code). This contains a table showing how many data sets were from mouse and human per library type. The inputs were broadly balanced with the exception of ChIA-PET.

Please see image 10 here: https://f1000research.s3.amazonaws.com/linked/649553.Image_10.pdf

Figure 2c: Several enrichment UMAP locations for certain data types are far from each other, e.g. RNA-seq. How to interpret this? Could it be certain subclusters, further splitting the dataset types?

Apart from technical variation, an underlying reason can be that a single library type groups together several preparation methods which produce distinct composition profiles. For RNA-seq, a number of popular commercial kits are available, each of which introduces its own biases (e.g. stranded vs non-stranded, random vs poly-A primed). The most striking example of grouping by library preparation method is BS-seq, where several commonly used library preparation methods produce different compositions. Two easily identifiable groups on the reference map are whole genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS).

Please see image 11 here: https://f1000research.s3.amazonaws.com/linked/649554.Image_11.pdf

Predating the development of Librarian, we explored the option of predicting various bisulfite library types from their base compositions. A proof of concept can be found here: https://github.com/ChristelKrueger/Charades
The subgrouping of bisulfite libraries highlights a more general issue with the metadata available on GEO. All the different variations of these libraries carry the flag ‘Bisulfite-Seq’, and any further distinction is found only in the method description, if at all. This can easily create problems when re-processing published data, as processing pipelines are often specific to a particular sample type. Grouping of disparate sample types is widespread across all categories, especially as many different library preparation methods are now on the market.
We have now included an explanation of subgroupings and underlying reasons in the discussion.

When downloading example datasets (sample FASTQ files), the archive cannot be opened. Also, there is no documentation available regarding input format preparation, e.g fastq are not allowed to be gzipped, it’s not clear without testing.

The example file archive has been fixed and is now working.

Librarian accepts both fastq and gzipped files. However, we found that upload of gzipped files is an issue outside of Librarian with some Mac/browser combinations. The workaround is to drag & drop files onto the upload button. We have added the sentence “Librarian accepts fastq and gzipped files. If gzipped files are greyed out in the pop up, try dragging them onto the upload button.” to the web app and have also included this workaround in the FAQs.
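For readers who want to inspect the per-position base composition of their own files, here is a minimal sketch that reads either a plain or a gzipped FASTQ file. It is not Librarian's implementation: the function names, the `max_len` cut-off of 50 bases and the assumption of strict four-line FASTQ records are all choices made here for illustration.

```python
import gzip
from collections import Counter


def open_fastq(path):
    """Open a FASTQ file as text, whether or not it is gzip-compressed."""
    with open(path, "rb") as handle:
        is_gzip = handle.read(2) == b"\x1f\x8b"  # gzip magic number
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def per_position_composition(path, max_len=50):
    """Return one {base: fraction} dict per read position (up to max_len)."""
    counts = [Counter() for _ in range(max_len)]
    with open_fastq(path) as handle:
        for line_no, line in enumerate(handle):
            # Assumes four-line records: the sequence is the 2nd line of each.
            if line_no % 4 != 1:
                continue
            for pos, base in enumerate(line.strip().upper()[:max_len]):
                counts[pos][base] += 1
    profile = []
    for position_counts in counts:
        total = sum(position_counts.values())
        if total == 0:  # no read reaches this position
            break
        profile.append({b: position_counts[b] / total for b in "ACGTN"})
    return profile
```

Calling `per_position_composition("sample.fastq.gz")` (a hypothetical file name) returns a list whose first element holds the A/C/G/T/N fractions at read position 1, and so on. Detecting compression from the file's first two bytes rather than from its extension keeps the sketch working for files that were renamed.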

Github documentation on the establishment/launch lacks some details. Would be useful to extend it especially to state what are the system environment requirements before installation.

We have now included more comprehensive documentation, including detailed installation instructions and FAQs, at https://desmondwillowbrook.github.io/Librarian/. We point to these from the manuscript (section Operation) and the GitHub repository.
Competing Interests: No competing interests were disclosed.

Reviewer Report 06 Oct 2022

Andrew Keniry, Molecular Medicine Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia

Approved

https://doi.org/10.5256/f1000research.137618.r151965

Vashishtha and colleagues perform an analysis of the base composition from ~3000 publicly available sequencing data sets and show that these segregate by library type in a UMAP analysis of fastq files. The authors suggest that this analysis could be a useful pre-mapping QC step to identify incorrect libraries early in analysis pipelines and provide a tool, Librarian, for performing this test. Such an analysis could certainly be useful and has the potential to be widely adopted. I have a few suggestions that could improve the manuscript:

1. The authors show that Librarian identifies libraries prepared by different techniques, but is it able to identify a failed sample within a specific technique? Can the method be tested on failed samples such as ChIP-seq without enrichment, BS-seq with a low conversion efficiency, or samples with high duplication?
2. BS-seq seems to segregate into multiple clusters. Is there an easily identifiable reason for this – perhaps enrichment techniques or developmental stage?
3. I’m not sure of the logistics of this, but Librarian may be more widely used if it was available as an option within the already widely used fastqc.
4. An example of the Librarian output would be beneficial.
5. The terms ‘reference map’ and ‘compositions map’ seem to be used interchangeably. For simplicity, one term should be used throughout.
6. Fig 1A shows the base composition of ChIA-pet data. As this is a less well known technique, it would be beneficial to have an explanation of the base composition results.
7. Fig 1 is missing a legend explaining which base each colour represents.
8. I’m not certain what Fig 2E is showing. Could the authors provide more explanation in the legend? There is also no reference to this figure in the text.

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Epigenetics, development, cell biology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.


Author Response 23 May 2024

Christel Krueger, Bioinformatics, Babraham Institute, Cambridge, CB22 3AT, UK

Vashishtha and colleagues perform an analysis of the base composition from ~3000 publicly available sequencing data sets and show that these segregate by library type in a UMAP analysis of fastq files. The authors suggest that this analysis could be a useful pre-mapping QC step to identify incorrect libraries early in analysis pipelines and provide a tool, Librarian, for performing this test. Such an analysis could certainly be useful and has the potential to be widely adopted. I have a few suggestions that could improve the manuscript:

We would like to thank Andrew Keniry for taking the time to review our manuscript and for his thoughtful comments and suggestions. Please find below a point-by-point response to all questions.

The authors show that Librarian identifies libraries prepared by different techniques, but is it able to identify a failed sample within a specific technique? Can the method be tested on failed samples such as ChIP-seq without enrichment, BS-seq with a low conversion efficiency, or samples with high duplication?

Compositional differences between library types are large in comparison to variation between samples of one specific technique. For example, in ChIP-seq experiments typically only a minority of reads fall within enriched regions. To illustrate this point, the ENCODE consortium requires a minimum of only 1% of reads in peaks for their datasets (Genome Res. 2012 Sep; 22(9): 1813–1831). Therefore, for both well enriched and poorly enriched datasets the majority of reads are background, and the overall base compositions are not different enough to separate in base composition analysis. Likewise, technical duplication does not change the base composition, and we would expect an impact only if the library consisted of very few original molecules. Since BS-seq libraries are striking in their low C content, we explored whether Librarian could spot low conversion rates. The tested library has very high non-CG methylation (around 20 %), yet the test sample is still predicted as BS-seq because most Cs are still converted. In summary, technical noise adds markedly less variation than library type. This makes Librarian robust against variation in sample preparation quality.

Please see Image here: https://f1000research.s3.amazonaws.com/linked/649537.download_1.pdf
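The ChIP-seq argument above can be followed with a small calculation. The sketch below treats a library as a mixture of background reads and reads from enriched regions and compares a poorly enriched with a better enriched sample. All compositions and the 10% figure are hypothetical values chosen only to illustrate the arithmetic; only the 1% threshold comes from the text.

```python
def mix_composition(background, enriched, enriched_fraction):
    """Base composition of a library in which `enriched_fraction` of the
    reads come from enriched regions and the rest from background."""
    return {
        base: (1 - enriched_fraction) * background[base]
        + enriched_fraction * enriched[base]
        for base in background
    }


# Hypothetical compositions, chosen only to illustrate the arithmetic.
background = {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30}
enriched = {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20}

poor = mix_composition(background, enriched, 0.01)  # 1% of reads in peaks
good = mix_composition(background, enriched, 0.10)  # 10% of reads in peaks (assumed)

# Largest per-base difference between the two libraries.
max_shift = max(abs(good[b] - poor[b]) for b in "ACGT")
print(f"largest per-base shift: {max_shift:.3f}")
```

Even with strongly GC-shifted enriched regions, a tenfold difference in enrichment moves each base fraction by less than one percentage point in this toy setting, which is small next to the differences between library types.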

BS-seq seems to segregate into multiple clusters. Is there an easily identifiable reason for this – perhaps enrichment techniques or developmental stage?

The reason is that several commonly used library preparation methods produce data with different base compositions, for example directional vs non-directional libraries or reduced genomic representation. Two easily identifiable groups on the reference map are whole genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS).

Please see Image here: https://f1000research.s3.amazonaws.com/linked/649538.Download_2.pdf

Predating the development of Librarian, we explored the option of predicting various bisulfite library types from their base compositions. A proof of concept can be found here: https://github.com/ChristelKrueger/Charades
The subgrouping of bisulfite libraries highlights a more general issue with the metadata available on GEO. All the different variations of these libraries carry the flag ‘Bisulfite-Seq’, and any further distinction is found only in the method description, if at all. This can easily create problems when re-processing published data, as processing pipelines are often specific to a particular sample type. Grouping of disparate sample types is widespread across all categories, especially as many different library preparation methods are now on the market.
We have now included an explanation of subgroupings and underlying reasons in the discussion.

I’m not sure of the logistics of this, but Librarian may be more widely used if it was available as an option within the already widely used fastqc.

We discussed this as an option but decided against it. FastQC is deliberately kept lean and without dependencies which Librarian would bring. For more convenient incorporation into standard QC we are now writing out the prediction percentages as a text file. This can be picked up by the widely used tool MultiQC and is then displayed as a heatmap in the MultiQC report. We would like to thank Phil Ewels for incorporating this feature.

An example of the Librarian output would be beneficial.

Librarian will produce the plots shown in Figure 2B, C, E with circles indicating the map location of test samples. We deliberately did not include output examples in the main body of the manuscript as they are almost identical to the ones shown in Figure 2. Instead, they can be found in the accompanying use case data at https://doi.org/10.5281/zenodo.7060217.

The terms ‘reference map’ and ‘compositions map’ seem to be used interchangeably. For simplicity, one term should be used throughout.

This has been changed to ‘reference map’ throughout.

Fig 1A shows the base composition of ChIA-pet data. As this is a less well known technique, it would be beneficial to have an explanation of the base composition results.

Please see our response to point 7 (the Figure 1 legend) below.

Fig 1 is missing a legend explaining which base each colour represents.

The text has been changed to briefly introduce the characteristic profiles observed with all four library types shown in the figures, and individual subpanels are now referenced. Details regarding the preparation protocols and how these produce the observed profiles have been added to the figure legend. The figure has been changed to reflect the order of library types in the text and the colour legend has been added.

I’m not certain what Fig 2E is showing. Could the authors provide more explanation in the legend? There is also no reference to this figure in the text.

An in-text reference has been added in the section about Implementation.
Competing Interests: No competing interests were disclosed.


Version 2

VERSION 2 PUBLISHED 29 Sep 2022

Open Peer Review

Reviewer Status

Reviewer Reports

Invited Reviewers

Version                         Reviewer 1   Reviewer 2   Reviewer 3
Version 2 (revision) 24 Jan 24  read                      read
Version 1 29 Sep 22             read         read         read

1. Andrew Keniry, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
2. Konstantin Okonechnikov, German Cancer Research Center, Heidelberg, Germany
3. Karim Gharbi, The Earlham Institute, Norwich, UK

Reviewer Report

24 Feb 2024 | for Version 2

Karim Gharbi, The Earlham Institute, Norwich, UK

Approved

The revised manuscript is significantly improved and addresses the majority of my original comments. The (current) limitations of Librarian have been clarified and accessibility has also been improved.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

genomics, next-generation sequencing, bioinformatics

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

15 Feb 2024 | for Version 2

Andrew Keniry, Molecular Medicine Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia

Approved

No further comments.

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Epigenetics, development, cell biology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Reviewer Report

18 Oct 2022 | for Version 1

Karim Gharbi, The Earlham Institute, Norwich, UK

Approved With Reservations

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Partly
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Partly

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

genomics, next-generation sequencing, bioinformatics

Responses (1)

Author Response

23 May 2024

Christel Krueger, Bioinformatics, Babraham Institute, Cambridge, CB22 3AT, UK

Reviewer Comment: In this manuscript, Vashishtha et al. describe the implementation of a novel quality control (QC) tool for next-generation sequencing (NGS) datasets, which uses nucleotide composition profiles along sequence reads to infer the likely library preparation method used to generate the data. The authors first demonstrate that nucleotide composition is strongly influenced by library method as recorded in the GEO database for a selection of human and mouse NGS datasets. Having established this result, they implemented a program tool (Librarian) to compare library nucleotide composition profiles to a collection of reference datasets and identify libraries with unexpected profiles, which may be indicative of potential failure during library preparation and/or sample/data mix-ups. The tool, which is available as a web application and command line tool, extracts nucleotide composition from user-supplied FASTQ files and returns similarity scores against existing profiles stored in the Librarian database.

The manuscript is well-written, and the authors provide strong evidence of library method influencing nucleotide composition in the read output files. This will be a familiar observation for those experienced with generating and/or analysing diverse NGS datasets, but the manuscript is a welcome documentation and quantification of these patterns. As a tool, Librarian has the potential to become an important step in the QC of NGS data, alongside other, more generic QC tools, such as FastQC, and help detect quality issues early in data processing. However, I have some concerns about the limitations of the software as currently implemented, which I feel are not sufficiently discussed in the manuscript and could cause significant confusion in the hands of less experienced users. The comments below are intended to help the authors improve the current manuscript and indicate areas for future improvement to increase the usability of the tool.

Author Response: We would like to thank Karim Gharbi for reviewing our manuscript and taking the time to test the tool Librarian, and we are very pleased by his positive assessment regarding its usefulness. We much appreciate Karim’s thoughtful comments and useful suggestions to improve the usability. Please find below responses to each of his questions/comments.

Reviewer Comment:
Major comments

Please comment on the applicability of Librarian to data generated with other NGS technologies than Illumina. If not tested or not applicable, this should be highlighted in the discussion.

Author Response: Our analyses focused on Illumina sequencing as this is by far the most widely used technology and also offers the most diverse library types - with many applications not available on other platforms such as Nanopore or PacBio. We have now made this clear in the discussion and included this as a limitation in the section on Operation. We have also added this to the new FAQ section of the documentation.

Reviewer Comment:

Please provide a rationale for trimming reads to 50 bases and only considering read 1 to build the database of nucleotide composition profiles, i.e., why is this sufficient to accurately capture the nucleotide composition of each library type. Some methods result in asymmetric library fragments (e.g., 10X Genomics), with different nucleotide compositions in read 1 and read 2, which in itself can be diagnostic of the library type.

Author Response: Librarian was designed to check individual FastQ files which is why differences between e. g. read 1 and read 2 are not taken into account despite being informative in some cases. Our decision to use the first 50 bp of a read was influenced by two considerations: Firstly, many types show characteristic biases at the beginning of the read while becoming more similar to the general GC content further along the read (compare Figure 1). Thus, using information from longer reads actually reduces the differences between library types. Secondly, while it is true that currently read lengths tend to be longer, there are still many experiments that use 50 bp sequencing read length. These could not be analysed if the reference map had been created using the information from 100 bp. We therefore chose 50 bp to balance accuracy with broad usability.
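For readers unfamiliar with the metric, the per-position base composition over the first 50 bases of a read can be sketched as follows. This is a minimal illustration in Python and not Librarian's own code; the function names `read_sequences` and `base_composition` are hypothetical, and the input is assumed to be plain four-line FASTQ records.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

BASES = "ACGT"


def read_sequences(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield the sequence line (second of every four lines) of FASTQ text."""
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            yield line.strip().upper()


def base_composition(seqs: Iterable[str], length: int = 50) -> List[Dict[str, float]]:
    """Return, for each of the first `length` positions, the fraction of A, C, G and T.

    Bases other than A/C/G/T (e.g. N) are ignored; reads shorter than
    `length` simply contribute to fewer positions.
    """
    counts = [Counter() for _ in range(length)]
    for seq in seqs:
        for pos, base in enumerate(seq[:length]):
            if base in BASES:
                counts[pos][base] += 1
    profile = []
    for c in counts:
        total = sum(c.values())
        profile.append({b: (c[b] / total if total else 0.0) for b in BASES})
    return profile


# Tiny illustrative input: two reads, profile restricted to 4 positions.
example = ["@r1", "ACGT", "+", "IIII", "@r2", "AGGT", "+", "IIII"]
example_profile = base_composition(read_sequences(example), length=4)
print(example_profile[1])  # {'A': 0.0, 'C': 0.5, 'G': 0.5, 'T': 0.0}
```

Restricting the profile to a fixed prefix, as in the `length` parameter here, keeps libraries sequenced at different read lengths comparable, which is the trade-off described in the response above.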

Reviewer Comment:

The selection of GEO datasets to build reference profiles seems restrictive and potentially biased. Please can you provide evidence that Librarian is applicable to other species than human/mouse, especially species with divergent GC content.

Author Response: We deliberately restricted the selection of samples to human and mouse species as it is indeed the case that species with different GC content will produce libraries of different compositions. Thus, as a minimum, test samples should have a similar GC content to the samples from the reference map. We decided to focus on human and mouse data for the reference map because of the abundance of available data and prevalence in current studies. Here is an example of RNA-seq libraries from species with substantially different GC contents (Apis mellifera ~35 % GC, Staphylococcus aureus ~30 % GC and Streptomyces coelicolor ~70 % GC).

Please see Image 1 here: https://f1000research.s3.amazonaws.com/linked/649555.Image_1.1.pdf

Please see Image 2 here: https://f1000research.s3.amazonaws.com/linked/649556.Image_1.2.pdf

Please see Image 3 here: https://f1000research.s3.amazonaws.com/linked/649557.Image_1.3.pdf

We therefore do not recommend using Librarian for species other than mouse or human although it is very likely that it would perform well on other mammalian species with similar GC content to human and mouse. We have now indicated this in the section on Operation and the new FAQ section of the documentation. Please also see reply to Reviewer 2.

Species specific and broader mammalian reference maps are planned for future Librarian releases.

Reviewer Comment:

The date range filter (01/01/2018 - 31/12/2020) is also likely to have resulted in more recent library types to be excluded from the analysis and therefore the reference dataset. 10X Genomics library types are highly popular, but surprisingly absent. Other library types may have been missed too.

Author Response: It is indeed the case that new library types become available all the time. While Librarian is unlikely to ever be able to cater for niche technologies it will need to be updated over time as new library preparation methods become popular. We have therefore made the design decision to have both the web app and the CLI query the reference data on the Babraham server. This is to ensure that users will have immediate access to updated reference maps as they become available. A current limitation is that GEO only offers a set number of metadata tags for library types upon sample submission which results in grouping of distinct preparation methods under one name. (Please also see responses to Reviewer 1 Question 2 and Reviewer 2). Currently, there is no tag for 10X Genomics libraries on GEO and submitters may choose to either label it as ‘RNA-seq’ or potentially ‘OTHER’. We are hoping to be able to query a wider range of metadata in the future and provide a more fine grained classification of library types.

Reviewer Comment:

Transposon-based library preparation is increasingly popular and applied across a wide range of library types, including single-cell RNA and DNA sequencing, whole-genome sequencing, ATAC-seq, enrichment capture etc. The authors briefly acknowledge this in the discussion, but it appears to be a major limitation of the tool, i.e., transposon-insertion signature at the start of the reads will likely obscure the underlying library type, causing most transposon-based libraries to cluster together. This should be explicitly documented and investigated further, if possible.

Author Response: This is indeed a limitation and laid out in the second paragraph of the discussion. We have now also added this scenario explicitly to the FAQ section of the documentation. With the current set of reference data we find that the proportion of transposon-based library preparation for other types than ATAC-seq is very low. As this library preparation method becomes more popular it will feature more strongly and in a more diverse way in future reference maps.

Reviewer Comment:

More generally speaking, I would strongly encourage the authors to explicitly identify library types and species "supported" by Librarian, indicating that submission of other library types and/or species may result in inconclusive or potentially misleading results (I acknowledge that the software will accept any FASTQ file).

Author Response: We have now been much more explicit about which samples are suitable for analysis with Librarian: We have added the following statement to the Operations section: “Irrespective of platform, Librarian is only suitable to assess datasets which match the types that the reference map is built on. More specifically, test samples need to have been sequenced with Illumina technology, match any of the included library types and be of mammalian origin (ideally mouse or human).”
We also discuss extensively which samples are supported in the new FAQ section of the documentation.

Reviewer Comment:

Minor comments

Please briefly comment on the observed pattern for ChIA-PET and ChIP-seq libraries, i.e., why are these expected and consistent with the library method. ChIA-PET is not a widely used library method. A short description should be included in the text for context.

Author Response: A description of the techniques BS-Seq, ATAC-Seq, ChIA-PET and ChIP-Seq and an explanation of the method specific composition biases along the read has been added to the legend to Figure 2.

Reviewer Comment:

Please add legend to Figure 1 with key matching coloured lines to individual bases.

Author Response: This omission has been corrected.

Reviewer Comment:

I would suggest meta-analysis of public datasets as another important use case for Librarian, e.g., as a clean-up tool prior to meta-analysis or identifying patterns/biases in library type, or subtypes.

Author Response: This is a good suggestion and this use case has now been included in the manuscript.

Reviewer Comment:

Please clarify whether Librarian can be set up with a local, custom server in addition to querying an online database via the web app or command line tool.

Author Response: We provide Librarian as a web tool and with a Linux command line interface. The CLI can be run either in offline mode in which visualisations and predictions are computed locally (this is a new option) or remote mode in which Librarian queries the same web server as the web tool. The latter has the advantage of avoiding R dependencies and ensures that libraries are compared to the latest reference map. The different modes are explained in the section Operation of the manuscript, and in the section Usage in the online tool documentation.

Reviewer Comment:

The tabular data in figure 2A shows library types with fewer than 25 samples despite these being classified as under-represented libraries and excluded from the analysis in the text.

Author Response: This is correct. The search results were filtered to exclude library types with fewer than 25 samples. The table shows the number of samples for which we could retrieve base compositions. We were not able to retrieve sequence compositions for all search hits. Also note that RNA-seq search results were capped at 500 but only 452 FastQ files made it through the analysis. Most samples dropped out at the download stage which is a common thing to happen. Since this was done in high throughput we did not attempt to troubleshoot the process or download manually.

Reviewer Comment: Overall, I believe that the premise of Librarian is a very good idea and thank the authors for their efforts in releasing the program as a publicly available tool. I look forward to reading their responses, and future iterations of the software addressing current limitations.

Competing Interests

No competing interests were disclosed.

Reviewer Report

14 Oct 2022 | for Version 1

Konstantin Okonechnikov, German Cancer Research Center, Heidelberg, Germany

Approved With Reservations

In general, the manuscript clearly describes the technique, however, only one possible limitation of the method is stated (effect of a cut in RNA-seq fragments leading to similarity to ATAC-seq in Discussion). More variance factors could be inspected to avoid misleading conclusions about the analysis results. For example, the tissue materials can be obtained from frozen tissue (FFPE), is there any impact of this preparation procedure? In my inspection, the standard RNA-seq datasets were distinguished quite well, but FFPE RNA-seq demonstrated the closest similarity to MBPD and MeDIP-seq. The scRNA-seq protocols are also included, however, they vary since they could be either full gene body covering or only 5’/3’ segment of a gene. Could this have an impact on read nucleotide variance?
The reads selection is performed with 100K subsampling - how was this amount selected? What is the effect of the total number of reads? Is it sufficient to provide only a subset of them? In this case, what is the suggested limit?
Also, 50 bp read segment is used as the reference, but how was this selection made? Currently, the main standard for sequencing is 100-150 bp. Would it be more beneficial to use a larger segment of the read for reference generation? Or do quality issues in longer reads have a negative impact?
How strong is the species effect? Are there variances observed between mice and human materials in full UMAP, e.g. clusters formation? Does it make sense to create own reference for such a procedure, especially when working on other species, e.g. Drosophila?

Further additional comments could help to improve the manuscript for easier reading:

In Figure 1, the nucleotide type color legend is missing, also segments are not cited in the text directly by suffix (a,b,c,d). Figure 1a demonstrates ChIA-PET, but not clear why it is included since it's not stated in the manuscript text.
Figure 2a: Are the amounts of mice and humans mixed? What is the variance?
Figure 2c: Several enrichment UMAP locations for certain data types are far from each other, e.g. RNA-seq. How to interpret this? Could it be certain subclusters, further splitting the dataset types?
When downloading example datasets (sample FASTQ files), the archive cannot be opened. Also, there is no documentation available regarding input format preparation, e.g fastq are not allowed to be gzipped, it’s not clear without testing.
Github documentation on the establishment/launch lacks some details. Would be useful to extend it especially to state what are the system environment requirements before installation.

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Partly
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

No
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Bioinformatics data analysis in pediatric neurooncology

Responses (1)

Author Response

23 May 2024

Christel Krueger, Bioinformatics, Babraham Institute, Cambridge, CB22 3AT, UK

In general, the manuscript clearly describes the technique, however, only one possible limitation of the method is stated (effect of a cut in RNA-seq fragments leading to similarity to ATAC-seq in Discussion). More variance factors could be inspected to avoid misleading conclusions about the analysis results. For example, the tissue materials can be obtained from frozen tissue (FFPE), is there any impact of this preparation procedure? In my inspection, the standard RNA-seq datasets were distinguished quite well, but FFPE RNA-seq demonstrated the closest similarity to MBPD and MeDIP-seq. The scRNA-seq protocols are also included, however, they vary since they could be either full gene body covering or only 5’/3’ segment of a gene. Could this have an impact on read nucleotide variance?

Systematic analysis of all factors influencing the base compositions is hampered by the limited amount of metadata that is provided by GEO, e. g. sample preparation methods or library kits are not included. One idea that we are pursuing for future developments in Librarian is to mine the entire GEO entry (rather than just the metadata) for more detailed sample information.

Prompted by your observation we manually found and downloaded five FFPE RNA-seq samples from four different studies (SRR11996379, SRR11996380, SRR21404015, SRR21773166, SRR23245053) and ran these through Librarian with the following results:

Please see image 1 here: https://f1000research.s3.amazonaws.com/linked/649544.Image_1.pdf

Please see image 2 here: https://f1000research.s3.amazonaws.com/linked/649545.Image_2.pdf

We cannot confirm a general issue with classifying FFPE samples, but we note that for some regions of the reference map RNA-seq shows substantial overlap with other library types, which may result in misprediction of the library type.

The situation seems to be similar for 3’ RNAseq libraries. We downloaded four 10X libraries from two different studies (SRR18916640, SRR18916641, SRR23254936, SRR23254938). The Librarian results can be found below. Two samples are classified correctly, the other two are from an area in the reference map which is occupied by several library types and therefore have some wobble in the classification.

Please see image 3 here: https://f1000research.s3.amazonaws.com/linked/649546.Image_3.pdf

Please see image 4 here: https://f1000research.s3.amazonaws.com/linked/649547.Image_4.pdf

We have included a section on “My sample doesn't come up as the library type that I expect. How worried do I need to be?” in the FAQs.

The reads selection is performed with 100K subsampling - how was this amount selected? What is the effect of the total number of reads? Is it sufficient to provide only a subset of them? In this case, what is the suggested limit?

The idea behind subsampling is to find a good balance between faithful representation of the characteristics of a FastQ file and file size / processing speed. Subsampling is employed by several Babraham Bioinformatics QC tools including FastQScreen and some modules of FastQC. For Librarian, the subsample needs to be big enough for the base compositions to have stabilised and not be subject to random variation, which is indeed the case for 100 K reads. Selecting the reads randomly rather than from the top of the file ensures that files in which reads are supplied in a sorted manner can also be dealt with. 100 K reads is the lower limit that Librarian accepts in a file. See below for an example in which 100 K, 200 K, 500 K and 13.5 million reads of an RNA-seq library were analysed: the samples are placed in exactly the same location on the reference map.

Please see image 5 here: https://f1000research.s3.amazonaws.com/linked/649548.Image_5.pdf

Please see image 6 here: https://f1000research.s3.amazonaws.com/linked/649549.Image_6.pdf
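One standard way to draw a uniform random subsample from a file of unknown length in a single pass is reservoir sampling. The sketch below is illustrative only: it is not taken from the Librarian source, which may subsample differently, and the names `fastq_records` and `reservoir_sample` are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple[str, ...]  # (header, sequence, separator, qualities)


def fastq_records(lines: Iterable[str]) -> Iterator[Record]:
    """Group plain FASTQ text into four-line records."""
    block: List[str] = []
    for line in lines:
        block.append(line.rstrip("\n"))
        if len(block) == 4:
            yield tuple(block)
            block = []


def reservoir_sample(records: Iterable[Record], k: int = 100_000,
                     seed: Optional[int] = None) -> List[Record]:
    """Return k records chosen uniformly at random (all records if fewer than k).

    Classic reservoir sampling: the first k records fill the reservoir,
    and record n (0-based) then replaces a random slot with probability k / (n + 1).
    """
    rng = random.Random(seed)
    reservoir: List[Record] = []
    for n, rec in enumerate(records):
        if n < k:
            reservoir.append(rec)
        else:
            j = rng.randint(0, n)  # inclusive on both ends
            if j < k:
                reservoir[j] = rec
    return reservoir
```

Because every record has the same chance of ending up in the sample regardless of its position, this kind of sampling is insensitive to files in which reads are stored in sorted order, which is the concern raised in the response above.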

Also, 50 bp read segment is used as the reference, but how was this selection made? Currently, the main standard for sequencing is 100-150 bp. Would it be more beneficial to use a larger segment of the read for reference generation? Or do quality issues in longer reads have a negative impact?

Our decision to use the first 50 bp of a read was influenced by two considerations: Firstly, many types show characteristic biases at the beginning of the read while becoming more similar to the general GC content further along the read (compare Figure 1). Thus, using information from longer reads actually reduces the differences between library types. Secondly, while it is true that currently read lengths tend to be longer, there are still many experiments that use 50 bp sequencing read length. These could not be analysed if the reference map had been created using the information from 100 bp. We therefore chose 50 bp to balance accuracy with broad usability.

How strong is the species effect? Are there variances observed between mice and human materials in full UMAP, e.g. clusters formation? Does it make sense to create own reference for such a procedure, especially when working on other species, e.g. Drosophila?

The magnitude of the species effect will depend on the difference of overall GC content and genome biology. We have included the reference map coloured by species as Supplementary Figure 2. As expected from the similar GC content and similar biology, libraries from mouse and human samples largely show the same base composition. We decided to focus on those two species because of the abundance of available data and prevalence in current studies.
Here is an example of RNA-seq libraries from species with substantially different GC contents (Apis mellifera ~35 % GC, Staphylococcus aureus ~30 % GC and Streptomyces coelicolor ~70 % GC).

Please see image 7 here: https://f1000research.s3.amazonaws.com/linked/649550.Image_7.pdf

Please see image 8 here: https://f1000research.s3.amazonaws.com/linked/649551.Image_8.pdf

Please see image 9 here: https://f1000research.s3.amazonaws.com/linked/649552.Image_9.pdf

We therefore do not recommend using Librarian for species other than mouse or human although it is very likely that it would perform well on other mammalian species with similar GC content to human and mouse. We have now indicated this in the section on Operation and in the new FAQ section of the documentation. Please also see response to Reviewer 3. Species specific and broader mammalian reference maps are planned for future Librarian releases.

Further additional comments could help to improve the manuscript for easier reading:

In Figure 1, the nucleotide type color legend is missing, also segments are not cited in the text directly by suffix (a,b,c,d). Figure 1a demonstrates ChIA-PET, but not clear why it is included since it's not stated in the manuscript text.

The text has been changed to briefly introduce the characteristic profiles observed with all four library types shown in the figures, and individual subpanels are now referenced. Details regarding the preparation protocols and how these produce the observed profiles have been added to the figure legend. The figure has been changed to reflect the order of library types in the text and the colour legend has been added.

Figure 2a: Are the amounts of mice and humans mixed? What is the variance?

We now include a file called ‘Supplementary Information’ with the publication which is deposited on Zenodo (along with other data and code). This contains a table showing how many data sets were from mouse and human per library type. The inputs were broadly balanced with the exception of ChIA-PET.

Please see image 10 here: https://f1000research.s3.amazonaws.com/linked/649553.Image_10.pdf

Figure 2c: Several enrichment UMAP locations for certain data types are far from each other, e.g. RNA-seq. How to interpret this? Could it be certain subclusters, further splitting the dataset types?

Apart from technical variation an underlying reason can be grouping of library preparation methods that produce distinct composition profiles. For RNA-seq there are a number of popular commercial kits available which will produce their own biases (e. g. stranded vs non-stranded, random vs poly-A primed). The most striking example of grouping by library preparation method is BS-seq where several commonly used library preparation methods produce different compositions. Two easily identifiable groups on the reference map are whole genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS).

Please see image 11 here: https://f1000research.s3.amazonaws.com/linked/649554.Image_11.pdf

Predating the development of Librarian we explored the option of predicting various bisulfite library types from their base compositions. A proof of concept can be found here: https://github.com/ChristelKrueger/Charades
The subgrouping of bisulfite libraries highlights a more general issue with the metadata available on GEO. All the different variations of these libraries carry the flag ‘Bisulfite-Seq’ and further distinction is only found in the method description - if at all. This can easily create problems when re-processing published data as processing pipelines are often specific for a particular sample type. Grouping of disparate sample types is widespread amongst all categories especially as by now a lot of different library preparation methods are on the market.
We have now included an explanation of subgroupings and underlying reasons in the discussion.

When downloading example datasets (sample FASTQ files), the archive cannot be opened. Also, there is no documentation available regarding input format preparation, e.g fastq are not allowed to be gzipped, it’s not clear without testing.

The example file archive has been fixed and is now working.

Librarian accepts both fastq and gzipped files. However, we found that upload of gzipped files is an issue outside of Librarian with some Mac/browser combinations. The workaround is to drag & drop files onto the upload button. We have added the sentence “Librarian accepts fastq and gzipped files. If gzipped files are greyed out in the pop up, try dragging them onto the upload button.” to the web app and have also included this workaround in the FAQs.

Github documentation on the establishment/launch lacks some details. Would be useful to extend it especially to state what are the system environment requirements before installation.

We have now included more comprehensive documentation including detailed installation instructions and FAQs at https://desmondwillowbrook.github.io/Librarian/. We point to these from the manuscript (section Operation) and the github repository.

Competing Interests

No competing interests were disclosed.

Reviewer Report

06 Oct 2022 | for Version 1

Andrew Keniry, Molecular Medicine Division, Walter and Eliza Hall Institute of Medical Research, Parkville, VIC, Australia

Approved

The authors show that Librarian identifies libraries prepared by different techniques, but is it able to identify a failed sample within a specific technique? Can the method be tested on failed samples such as ChIP-seq without enrichment, BS-seq with a low conversion efficiency, or samples with high duplication?
BS-seq seems to segregate into multiple clusters. Is there an easily identifiable reason for this – perhaps enrichment techniques or developmental stage?
I’m not sure of the logistics of this, but Librarian may be more widely used if it was available as an option within the already widely used fastqc.
An example of the Librarian output would be beneficial.
The terms ‘reference map’ and ‘compositions map’ seem to be used interchangeably. For simplicity, one term should be used throughout.
Fig 1A shows the base composition of ChIA-pet data. As this is a less well known technique, it would be beneficial to have an explanation of the base composition results.
Fig 1 is missing a legend explaining which base each colour represents.
I’m not certain what Fig 2E is showing. Could the authors provide more explanation in the legend? There is also no reference to this figure in the text.

Is the rationale for developing the new software tool clearly explained?

Yes
Is the description of the software tool technically sound?

Yes
Are sufficient details of the code, methods and analysis (if applicable) provided to allow replication of the software development and its use by others?

Yes
Is sufficient information provided to allow interpretation of the expected output datasets and any results generated using the tool?

Yes
Are the conclusions about the tool and its performance adequately supported by the findings presented in the article?

Yes

Competing Interests

No competing interests were disclosed.

Reviewer Expertise

Epigenetics, development, cell biology

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard.

Responses (1)

Author Response

23 May 2024

Christel Krueger, Bioinformatics, Babraham Institute, Cambridge, CB22 3AT, UK

The authors show that Librarian identifies libraries prepared by different techniques, but is it able to identify a failed sample within a specific technique? Can the method be tested on failed samples such as ChIP-seq without enrichment, BS-seq with a low conversion efficiency, or samples with high duplication?

Compositional differences between different library types are relatively large in comparison to variations between samples of one specific technique. For example, in ChIP-seq experiments typically only a minority of reads fall within enriched regions. To illustrate this point, the ENCODE consortium only requires a minimum of 1% reads in peaks for their datasets (Genome Res. 2012 Sep; 22(9): 1813–1831). Therefore, for both well enriched and poorly enriched datasets the majority of reads are background, and the overall base composition is not different enough to separate in base composition analysis. Likewise, technical duplication does not change the base composition, and only if the library consisted of very few original molecules would we expect an impact. Since BS-seq libraries are striking in their low C content, we explored if Librarian could spot low conversion rates: The tested library has very high non-CG methylation (around 20 %), but still the test sample is predicted as BS-seq - most Cs are still converted. In summary, technical noise adds markedly lower variation than library type. This makes Librarian robust against variation in sample preparation quality.

Please see Image here: https://f1000research.s3.amazonaws.com/linked/649537.download_1.pdf
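The argument above can be illustrated with a small back-of-the-envelope calculation. The following is only a sketch under assumed numbers: the background and peak base compositions, the 10% enrichment level, and the helper names (mix_composition, max_shift, residual_c_fraction) are hypothetical and are not taken from Librarian or from the datasets in the article. It treats a ChIP-seq library as a mixture of background and peak reads, and a poorly converted bisulfite library as one in which a fixed fraction of Cs escapes conversion, ignoring strand and library directionality.

```python
# Illustrative sketch only: every number below is an assumed, made-up value.


def mix_composition(background, enriched, fraction_enriched):
    """Read-weighted average base composition of a two-component library."""
    return {
        base: (1 - fraction_enriched) * background[base]
        + fraction_enriched * enriched[base]
        for base in background
    }


def max_shift(comp_a, comp_b):
    """Largest absolute difference in base fraction between two compositions."""
    return max(abs(comp_a[base] - comp_b[base]) for base in comp_a)


def residual_c_fraction(genomic_c, retained_fraction):
    """C fraction left in reads if only `retained_fraction` of Cs stay unconverted."""
    return genomic_c * retained_fraction


# Assumed compositions: AT-rich genomic background, GC-rich peak regions.
background = {"A": 0.295, "C": 0.205, "G": 0.205, "T": 0.295}
peaks = {"A": 0.22, "C": 0.28, "G": 0.28, "T": 0.22}

# 1% of reads in peaks (the ENCODE minimum) versus an assumed 10%.
poorly_enriched = mix_composition(background, peaks, 0.01)
well_enriched = mix_composition(background, peaks, 0.10)
chip_shift = max_shift(poorly_enriched, well_enriched)

# Bisulfite library in which an assumed 20% of Cs remain unconverted.
bs_c = residual_c_fraction(background["C"], 0.20)

print(f"Largest ChIP-seq composition shift: {chip_shift:.4f}")
print(f"C fraction, poorly converted BS-seq: {bs_c:.3f} (genomic: {background['C']:.3f})")
```

With these assumed inputs, the difference between the poorly and well enriched ChIP-seq mixtures stays below one percentage point for every base, while the C fraction of the poorly converted bisulfite library still falls to about a fifth of its genomic value.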

BS-seq seems to segregate into multiple clusters. Is there an easily identifiable reason for this – perhaps enrichment techniques or developmental stage?

The reason behind this is that there are several commonly used library preparation methods which produce data with different base compositions, for example directional vs non-directional libraries or reduced genomic representation. Two easily identifiable groups on the Compositions map are whole genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS).

Please see Image here: https://f1000research.s3.amazonaws.com/linked/649538.Download_2.pdf

Before developing Librarian, we explored the option of predicting various bisulfite library types from their base compositions. A proof of concept can be found here: https://github.com/ChristelKrueger/Charades
The subgrouping of bisulfite libraries highlights a more general issue with the metadata available on GEO. All the different variations of these libraries carry the flag ‘Bisulfite-Seq’, and any further distinction is found only in the method description, if at all. This can easily create problems when re-processing published data, as processing pipelines are often specific to a particular sample type. Grouping of disparate sample types is widespread across all categories, especially as many different library preparation methods are now on the market.
We have now included an explanation of subgroupings and underlying reasons in the discussion.

I’m not sure of the logistics of this, but Librarian may be more widely used if it was available as an option within the already widely used fastqc.

We discussed this as an option but decided against it. FastQC is deliberately kept lean and free of the kind of dependencies that Librarian would bring. For more convenient incorporation into standard QC, Librarian now writes out the prediction percentages as a text file. This file can be picked up by the widely used tool MultiQC and is then displayed as a heatmap in the MultiQC report. We would like to thank Phil Ewels for incorporating this feature.

An example of the Librarian output would be beneficial.

Librarian will produce the plots shown in Figure 2B, C, E with circles indicating the map location of test samples. We deliberately did not include output examples in the main body of the manuscript as they are almost identical to the ones shown in Figure 2. Instead, they can be found in the accompanying use case data at https://doi.org/10.5281/zenodo.7060217.

The terms ‘reference map’ and ‘compositions map’ seem to be used interchangeably. For simplicity, one term should be used throughout.

This has been changed to ‘reference map’ throughout.

Fig 1A shows the base composition of ChIA-pet data. As this is a less well known technique, it would be beneficial to have an explanation of the base composition results.

Please see 7.

Fig 1 is missing a legend explaining which base each colour represents.

I’m not certain what Fig 2E is showing. Could the authors provide more explanation in the legend? There is also no reference to this figure in the text.

An in-text reference has been added in the section about Implementation.


Competing Interests

No competing interests were disclosed.

Alongside their report, reviewers assign a status to the article:

Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested

Approved with reservations - a number of small changes, and sometimes more significant revisions, are required to address specific details and improve the paper's academic merit

Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions

