Software Tool Article

HGNChelper: identification and correction of invalid gene symbols for human and mouse

[version 1; peer review: 2 approved, 1 approved with reservations]
PUBLISHED 21 Dec 2020

This article is included in the RPackage gateway.

Abstract

Gene symbols are recognizable identifiers for gene names but are unstable and error-prone due to aliasing, manual entry, and unintentional conversion by spreadsheets to date format. Official gene symbol resources such as the HUGO Gene Nomenclature Committee (HGNC) for human genes and the Mouse Genome Informatics project (MGI) for mouse genes provide authoritative sources of valid, aliased, and outdated symbols, but lack a programmatic interface and correction of symbols converted by spreadsheets. We present HGNChelper, an R package that identifies known aliases and outdated gene symbols based on the HGNC human and MGI mouse gene symbol databases, in addition to common mislabeling introduced by spreadsheets, and provides corrections where possible. HGNChelper identified invalid gene symbols in the most recent Molecular Signatures Database (MSigDB 7.0) and in platform annotation files of the Gene Expression Omnibus, with prevalence ranging from ~3% in recent platforms to 30-40% in the earliest platforms from 2002-03. HGNChelper is installable from CRAN.

Keywords

gene symbols, molecular biology, HGNC, MGI

Introduction

Gene symbols are widely used in biomedical research because they provide descriptive and memorable nomenclature for communication. However, gene symbols are constantly updated through the discovery and re-identification of genes, resulting in new names or aliases. For example, GCN5L2 (General Control of amino acid synthesis protein 5-Like 2) is a gene symbol for a gene that was later discovered to function as a histone acetyltransferase and was therefore renamed KAT2A (K(lysine) Acetyl Transferase 2A) [1]. In addition to the rapid and constant updates to valid gene symbols, commonly used spreadsheet software, such as Microsoft Excel, modifies some gene symbols, converting them into dates or floating-point numbers [2,3]. For example, ‘DEC1’, a symbol for the ‘Deletion in Esophageal Cancer 1’ gene, can be exported in date format as ‘1-DEC’. There have been attempts to rectify gene symbol issues, but they have largely been limited to Excel-modified gene symbols. Also, the suggested solutions often reference static files with the corrections curated at the time of publication [3] or comprise scripts that detect Excel-modified gene symbols without correcting them [2]. In recognition of the importance of the spreadsheet modification issue, HGNC recently announced that all symbols that auto-convert to dates in Excel have been changed [4]. However, much literature and public data still contain outdated and incorrect gene symbols, motivating a convenient method of systematic detection and correction. To systematically identify historical aliases, correct for capitalization differences, and simultaneously correct spreadsheet-modified gene symbols, we built the HGNChelper R package. HGNChelper maps different aliases and spreadsheet-modified gene symbols to approved gene symbols maintained in the HUGO Gene Nomenclature Committee (HGNC) database [5]. HGNChelper also supports mouse gene symbol correction based on the Mouse Genome Informatics (MGI) database [6].

Methods

Implementation

Source data. Human gene symbols are accessed from the HGNC database FTP site (ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/hgnc_complete_set.txt) [7], and mouse gene symbols are acquired from the MGI database (http://www.informatics.jax.org/downloads/reports/MGI_EntrezGene.rpt) [6]. These URLs, and their access and processing, are handled by HGNChelper, so the user does not interact directly with them.

Algorithm. Human gene symbol correction is processed in three steps. First, capitalization is fixed: all letters are converted to upper-case, except the open reading frame (orf) nomenclature, which is written in lower-case. Second, dates or floating-point numbers generated by Excel modification are corrected using a custom index, which was generated by importing all human gene symbols into Excel, exporting them in all available date formats, and collecting any gene symbols that differ from the originals. In the last and most commonly applied step, aliases are updated to approved gene symbols in the HGNC database. Mouse gene symbol correction follows the same three steps as human gene symbol correction, except for the capitalization step, since mouse gene symbols begin with an uppercase character followed by all lowercase characters.
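The three steps can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the package's R implementation. The lookup tables are small toy examples rather than the real HGNC-derived index, and the function names (fix_human_case, fix_mouse_case, correct_human_symbol), the orf pattern, and the mouse capitalization rule are assumptions made for this example.

```python
import re

# Toy lookup tables for illustration only. The real package builds its
# index from the HGNC and MGI downloads.
APPROVED = {"KAT2A", "DELEC1", "TP53", "C1orf43"}
EXCEL_MANGLED = {"1-DEC": "DEC1"}                    # spreadsheet date -> original symbol
ALIASES = {"GCN5L2": "KAT2A", "DEC1": "DELEC1"}      # alias or old symbol -> approved symbol


def fix_human_case(symbol):
    """Step 1: upper-case everything, but keep 'orf' lower-case (e.g. C1orf43)."""
    upper = symbol.strip().upper()
    # Assumed pattern for open reading frame symbols: C<chromosome>orf<number>.
    return re.sub(r"^(C(?:\d+|X|Y))ORF(\d+)$", r"\1orf\2", upper)


def fix_mouse_case(symbol):
    """Assumed mouse rule: first character upper-case, the rest lower-case."""
    s = symbol.strip()
    return s[:1].upper() + s[1:].lower()


def correct_human_symbol(symbol):
    """Apply the three steps in order; return None when no correction is found."""
    s = fix_human_case(symbol)
    if s in APPROVED:
        return s
    s = EXCEL_MANGLED.get(s, s)   # Step 2: undo spreadsheet date conversion
    s = ALIASES.get(s, s)         # Step 3: map alias to approved symbol
    return s if s in APPROVED else None


if __name__ == "__main__":
    for raw in ["gcn5l2", "1-dec", "c1ORF43", "NOTAGENE"]:
        print(raw, "->", correct_human_symbol(raw))
```

Because the steps run in sequence, a spreadsheet-mangled value can first be restored to its original symbol and then updated through the alias table in the same pass.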

User interface. The user interface of HGNChelper does not include any local input or output files; instead, it uses R data structures as function arguments and output. Base R data export functions such as write.table can be used to write results to file in whichever format is required. The input arguments to the main function, checkGeneSymbols, are:

  1. x: A character vector of gene symbols to check for modified or outdated values.

  2. chromosome: An optional integer vector the same length as x, providing chromosome numbers for each gene.

  3. unmapped.as.na: A logical value. If TRUE (default), unmapped symbols will appear as NA in the Suggested.Symbol output column. If FALSE, the original unmapped symbol will be kept.

  4. map: An optional user-updated or non-standard gene map. The default maps can be updated by running the interactive example provided in the help page for checkGeneSymbols.

  5. species: A required character vector of length 1, either "human" (default) or "mouse".

checkGeneSymbols returns an R data.frame with one row per input gene and three columns:

  1. The first column of the data frame shows the input gene symbols.

  2. The second column indicates whether the input symbols are valid.

  3. The third column provides a corrected gene symbol where possible.

A message is printed indicating when the package’s built-in map was last updated. Because the gene symbol databases are updated as frequently as every day, we provide the getCurrentHumanMap and getCurrentMouseMap functions for updating the reference map without requiring an HGNChelper software update. These functions fetch the most up-to-date version of the map from HGNC and MGI, respectively, and users can provide the output of these functions through the map argument of the checkGeneSymbols function. However, fetching a new map requires internet access and takes longer than using the package’s built-in index.
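The shape of the returned table and the effect of unmapped.as.na can be sketched as follows. This is a Python illustration, not the package's R code: check_symbols is a hypothetical function, the map is a toy dictionary, and the key names "x" and "Approved" are assumptions chosen for this example (only Suggested.Symbol is named in the text above).

```python
def check_symbols(x, symbol_map, unmapped_as_na=True):
    """Return one row per input symbol with three fields.

    symbol_map is a toy dictionary from alias to approved symbol. Symbols
    that appear as values in the map are treated as already valid.
    """
    approved = set(symbol_map.values())
    rows = []
    for sym in x:
        is_valid = sym in approved
        # Valid symbols are suggested unchanged; others are looked up in the map.
        suggested = sym if is_valid else symbol_map.get(sym)
        if suggested is None and not unmapped_as_na:
            suggested = sym  # keep the original unmapped symbol
        rows.append({"x": sym, "Approved": is_valid, "Suggested.Symbol": suggested})
    return rows


if __name__ == "__main__":
    toy_map = {"GCN5L2": "KAT2A"}
    for row in check_symbols(["KAT2A", "GCN5L2", "XYZ"], toy_map):
        print(row)
```

With the default setting, an unmapped symbol yields a missing value (None here, NA in R), which makes failed corrections easy to filter; passing unmapped_as_na=False keeps the original symbol instead.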

Operation

HGNChelper is an R package installable from CRAN on Linux, Windows, and OSX. It requires a base installation of R (> 3.5.0) and no other dependencies, and has minimal hardware requirements that should be met by any computer capable of installing the R dependency.

Results

To evaluate the performance of HGNChelper, we quantified the extent of invalid gene symbols present in platform annotation files in the Gene Expression Omnibus (GEO) database from 2002 to 2020. We downloaded 20,716 GEO platform annotation (GPL) files using GEOquery::getGEO [8], of which 2,044 platforms were suspected to contain gene symbol information based on matching to valid symbols. The proportion of invalid gene symbols clearly increases with the age of the platform submission (Figure 1), from an average of ~3% for recent platforms to ~20% in 2010 and 30–40% in the earliest platforms from 2002–03. The overall proportion of valid gene symbols was 79%, increasing to 92% after HGNChelper correction. The remaining 8% of gene symbols, which stayed invalid, were mostly long non-coding RNA (lncRNA), pseudogenes, commercial product IDs such as probe IDs, missing data, and gene symbols from non-human species erroneously included together with human gene symbols. We also checked the validity of gene symbols in the Molecular Signatures Database (MSigDB 7.0) [9]. Out of 38,040 gene symbols used in MSigDB version 7.0, 850 were invalid, and this number fell to 453 after HGNChelper correction, of which the majority were lncRNA and a few were withdrawn symbols.
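For orientation, the MSigDB counts reported above can be converted to proportions with a short calculation. The Python snippet below uses only the three counts stated in the text; the variable names are chosen for this example.

```python
# Counts reported for MSigDB version 7.0.
total_symbols = 38_040
invalid_before = 850
invalid_after = 453

# Symbols that became valid after correction.
corrected = invalid_before - invalid_after

pct_invalid_before = 100 * invalid_before / total_symbols
pct_invalid_after = 100 * invalid_after / total_symbols
pct_of_invalid_corrected = 100 * corrected / invalid_before

print(f"Invalid before correction: {pct_invalid_before:.2f}%")
print(f"Invalid after correction:  {pct_invalid_after:.2f}%")
print(f"Corrected: {corrected} symbols ({pct_of_invalid_corrected:.1f}% of invalid)")
```

In other words, about 2.2% of MSigDB symbols were invalid before correction and about 1.2% afterwards, with slightly under half of the invalid symbols being recoverable.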

Figure 1. The fraction of valid gene symbols in GPL files grouped by year of data submission.

Each dot represents a unique GPL. Older entries show a smaller fraction of valid gene symbols than more recent entries (Before, white box), but many of these invalid symbols are successfully corrected by HGNChelper (After, grey box).

Discussion

Gene symbols are error-prone and unstable, but remain in common use for their memorability and interpretability. Our analysis of public databases containing gene symbols emphasizes the need for gene symbol correction, particularly when using symbols from older datasets and reported results. Such correction should be done routinely when gene symbols are part of high-throughput analysis, such as re-analysis of targeted gene panels for precision medicine, which tend to be annotated with gene symbols (e.g. [10]), in Gene Set Enrichment Analysis using the gene symbol versions of popular databases such as MSigDB [9] or GeneSigDB [11], or when performing systematic review or meta-analysis of published multi-gene signatures (e.g. [12]). HGNChelper implements a programmatic and straightforward approach to the routine identification and correction of invalid gene symbols.

Software availability

Package available from CRAN: https://cran.r-project.org/package=HGNChelper

Source code available from: https://github.com/waldronlab/HGNChelper/

Archived source code as at time of publication: https://doi.org/10.5281/zenodo.4309985 [13]

License: GPL (≥ 2.0)

how to cite this article
Oh S, Abdelnabi J, Al-Dulaimi R et al. HGNChelper: identification and correction of invalid gene symbols for human and mouse [version 1; peer review: 2 approved, 1 approved with reservations] F1000Research 2020, 9:1493 (https://doi.org/10.12688/f1000research.28033.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Open Peer Review

Reviewer Report 02 Feb 2021
Marcin Cieślik, Michigan Center for Translational Pathology, University of Michigan Medical School, Ann Arbor, USA 
Approved
HGNChelper is a particularly valuable tool in the toolbox of a bioinformatics practitioner. It addresses a real problem which, while superficially trivial, actually affects the quality of analyses. 

I use HGNChelper ALL THE TIME especially if a ... Continue reading
HOW TO CITE THIS REPORT
Cieślik M. Reviewer Report For: HGNChelper: identification and correction of invalid gene symbols for human and mouse [version 1; peer review: 2 approved, 1 approved with reservations]. F1000Research 2020, 9:1493 (https://doi.org/10.5256/f1000research.31006.r76417)
  • Author Response 09 Jun 2022
    Levi Waldron, Institute for Implementation Science and Population Health, New York, 10027, USA
    09 Jun 2022
    Author Response
    Thank you for reviewing our manuscript and for your encouraging comments. 

    Comment 1: The main issue raised is when HGNChelper fails to map symbols, it is important for users to ... Continue reading
Reviewer Report 20 Jan 2021
Susan Tweedie, HUGO Gene Nomenclature Committee (HGNC), European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK 
Approved with Reservations
The paper describes an R package that checks whether human and mouse symbols match an HGNC or MGI approved symbol and, if not, suggests a replacement by correction of capitalization, correction of Excel date and floating-point transformations and matching ... Continue reading
HOW TO CITE THIS REPORT
Tweedie S. Reviewer Report For: HGNChelper: identification and correction of invalid gene symbols for human and mouse [version 1; peer review: 2 approved, 1 approved with reservations]. F1000Research 2020, 9:1493 (https://doi.org/10.5256/f1000research.31006.r76418)
  • Author Response 09 Jun 2022
    Levi Waldron, Institute for Implementation Science and Population Health, New York, 10027, USA
    09 Jun 2022
    Author Response
    Thank you for reviewing our manuscript and for your constructive comments. Below are our responses to the individual comments.

    Comment 1: It may be worth adding that conversion of ... Continue reading
Reviewer Report 18 Jan 2021
Mikhail G. Dozmorov, Biostatistics Department, Massey Cancer Center, Richmond, VA, USA 
Approved
The manuscript "HGNChelper: identification and correction of invalid gene symbols for human and mouse" by Oh S. et al. describes the HGNChelper R package that corrects the common problem of misformatted gene symbols and aliases. The package works with both ... Continue reading
HOW TO CITE THIS REPORT
Dozmorov MG. Reviewer Report For: HGNChelper: identification and correction of invalid gene symbols for human and mouse [version 1; peer review: 2 approved, 1 approved with reservations]. F1000Research 2020, 9:1493 (https://doi.org/10.5256/f1000research.31006.r76531)
  • Author Response 09 Jun 2022
    Levi Waldron, Institute for Implementation Science and Population Health, New York, 10027, USA
    09 Jun 2022
    Author Response
    Thank you for reviewing our manuscript and for your constructive comments. Below are our responses to the individual comments.

    Comment 1: The limma R package has the alias2Symbol function ... Continue reading