Keywords
computer software
This article is included in the RPackage gateway.
This article is included in the Bioinformatics gateway.
This article is included in the Bioconductor gateway.
The DNA microarray field is dominated by three manufacturers: Affymetrix, Illumina and Agilent. While the basic premise behind their competing products is the same (i.e. the measurement of hybridisation between sample and immobilised probes on arrays via fluorescence), the formats in which these data are presented to end users are quite different, with each manufacturer electing to use its own proprietary format. The most ubiquitous of these is the CEL file, which has been accepted as a standard format for publishing the raw data generated on the Affymetrix platform. A search of the Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) database finds that over 90% of submissions of Affymetrix data include one or more CEL files as supplementary material. The format itself is well documented by the manufacturer, who also provides an open-source software development kit (SDK). As a result, in addition to Affymetrix’s own software suite, a large number of CEL parsing tools exist, including affyio1, a parser implemented from the file format documentation, and affxparser2, a parser based on the SDK.
The same is not true of the primary IDAT format from Illumina, with only 1.5% (49 out of 3208) of the submissions in GEO that use Illumina BeadArrays including IDAT files as supplementary material. Given that IDATs are the standard file type generated during BeadArray processing, it seems reasonable to assume that the relative dearth of IDAT files in the public domain is due to the lack of widespread support for the format. The development of alternative parsing tools has proven more challenging for IDATs for a number of reasons. The foremost amongst these is a lack of public documentation, leaving tool developers to determine the file structure themselves. A further hurdle has been the encryption of IDAT files generated from expression chips. These barriers initially left researchers reliant on the output from Illumina’s GenomeStudio software to convert the data into a more convenient format. Existing open source tools, particularly those that focus on gene expression analysis such as beadarray3, lumi4 and limma5, all require that the IDAT files have been processed using GenomeStudio to generate a plain-text ASCII file before any analysis can take place (Figure 1). The GenePattern6 software suite includes support for reading expression IDAT files, although it is limited to extracting only a subset of the array information. GenomeStudio output also omits various information that is available from the IDAT, such as control probe intensities (for SNP and methylation platforms), so-called out-of-band probes (methylation 450k)7 and meta information including software versions and scan date (all platforms).
Here we introduce the Bioconductor8 package illuminaio that can handle IDAT files from any Illumina BeadArray platform, providing a simple unified interface to various low-level data extraction routines.
The IDAT file format varies depending upon the array platform (Table 1). IDATs generated during the scanning of genotyping and methylation BeadArrays are binary files (one for each of the red and green channels). The bulk of each file consists of four fields: the ID of each bead-type on the array, the mean and standard deviation of their intensities, and the number of beads of each type. Metadata, including the date the array was scanned, the specific software versions used and the type of BeadChip, are also included. Once the structure of the file is understood, these binary values can be read directly.
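To make the idea of reading the four fields directly more concrete, the following Python sketch parses a deliberately simplified binary layout. The layout (a count followed by four contiguous arrays), the integer widths and the function names `pack_toy_idat` and `read_toy_idat` are assumptions made for illustration only; they are not the real IDAT specification, which is described in the document shipped with illuminaio.

```python
import struct

# Hypothetical, simplified layout (NOT the real IDAT specification):
#   int32   n           number of bead types
#   int32   ids[n]      bead-type IDs
#   uint16  means[n]    mean intensities
#   uint16  sds[n]      standard deviations of the intensities
#   uint8   nbeads[n]   number of beads of each type
# All values are assumed to be little-endian with no padding.
FIELDS = (("id", "i"), ("mean", "H"), ("sd", "H"), ("nbeads", "B"))


def pack_toy_idat(ids, means, sds, nbeads):
    """Build a toy binary buffer in the hypothetical layout above."""
    n = len(ids)
    buf = struct.pack("<i", n)
    for values, (_, code) in zip((ids, means, sds, nbeads), FIELDS):
        buf += struct.pack(f"<{n}{code}", *values)
    return buf


def read_toy_idat(buf):
    """Read the four fields back out of a toy buffer."""
    (n,) = struct.unpack_from("<i", buf, 0)
    offset = struct.calcsize("<i")
    out = {}
    for name, code in FIELDS:
        fmt = f"<{n}{code}"
        out[name] = list(struct.unpack_from(fmt, buf, offset))
        offset += struct.calcsize(fmt)  # advance past the array just read
    return out


if __name__ == "__main__":
    toy = pack_toy_idat([10, 20, 30], [1000, 2000, 3000], [15, 25, 35], [12, 14, 9])
    print(read_toy_idat(toy))
```

The point of the sketch is only that, once field order and data types are known, each block of values can be read with a single fixed-format call; no text parsing is involved.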
Array type | File format | No. data fields |
---|---|---|
SNP genotyping | Binary | 4 |
Methylation | Binary | 4 |
Gene expression | Encrypted XML | 10 |
Gene expression IDAT files, on the other hand, are produced as encrypted XML files. Once decrypted, the majority of the data are found as ten Base64-encoded strings. These ten fields include the ID, mean and standard deviation values as found in genotyping IDATs, as well as median and trimmed-mean intensity values, the mean and standard deviation of local background intensities, and the number of beads both before and after outliers have been excluded.
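As a rough illustration of how a Base64-encoded field can be turned back into numbers, the Python sketch below round-trips a short list of values. It assumes that a field holds little-endian 32-bit floats; that encoding, and the helper names `encode_base64_floats` and `decode_base64_floats`, are assumptions for this example rather than a statement about the actual contents of an expression IDAT. Decryption is not shown.

```python
import base64
import struct


def encode_base64_floats(values):
    """Pack values as little-endian float32 and Base64-encode them (assumed encoding)."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return base64.b64encode(raw).decode("ascii")


def decode_base64_floats(text):
    """Decode a Base64 string into a list of float32 values (assumed encoding)."""
    raw = base64.b64decode(text)
    n = len(raw) // 4  # four bytes per float32; trailing bytes are ignored
    return list(struct.unpack(f"<{n}f", raw[: 4 * n]))


if __name__ == "__main__":
    encoded = encode_base64_floats([1.5, 250.0, 1024.25])
    print(encoded)
    print(decode_base64_floats(encoded))
```

Each of the ten fields would be decoded separately in this way, with the data type chosen per field (for example integers for IDs and bead counts).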
Each array type is also associated with a manifest file (with file extension BPM or BGX) that provides details of the probe sequences, their intended genomic targets and whether each probe is a control probe, information that is necessary to interpret the data correctly.
illuminaio is an R package9. IDAT files are read using the readIDAT function. This routine determines the type of IDAT file that has been passed to it and calls the appropriate code to read the file and return the data as an R list object (Figure 1). The list contains not only the intensity data but also meta information, such as the scan date, that is not routinely extracted and can be useful for detecting batch effects10.
Decryption of expression IDATs is performed using the open-source DES decryption routine available in Gnulib11. There is no official documentation of this file format, but illuminaio includes a document describing our findings in detail. Source code for the appropriate routines has been adapted and included in illuminaio, removing any requirement for specific external libraries to be installed on a user’s computer. Thus the package can be built and run on all three major operating systems (Linux, Windows and Mac).
The illuminaio package also supports the parsing of non-encrypted IDAT files compressed by gzip and the reading of manifest files describing the array design (readBGX and readBPM).
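One common way to support gzip-compressed input transparently is to inspect the first two bytes of the file, since gzip streams begin with the magic bytes `0x1f 0x8b`. The Python sketch below shows that general technique; it is not taken from illuminaio, and the function name `read_maybe_gzipped` is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_maybe_gzipped(path):
    """Return the uncompressed bytes of a file, whether or not it is gzipped."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    # Choose the opener from the file content rather than the file extension.
    opener = gzip.open if magic == GZIP_MAGIC else open
    with opener(path, "rb") as fh:
        return fh.read()
```

Checking the content rather than the extension means a renamed file is still handled correctly. A real reader would also need to consider the unlikely case of an uncompressed file that happens to start with the same two bytes.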
The summarised intensity values obtained by illuminaio are nearly identical to those reported by GenomeStudio; the small discrepancies observed are related to rounding performed by GenomeStudio. The package vignette contains a detailed comparison. The time taken to read an IDAT depends on the platform, with encrypted expression arrays taking around 1 second per file (for 50,000 probes), and methylation and SNP platforms taking between 1 and 6 seconds depending on the chip density (which can range from a few hundred thousand to several million probes).
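A comparison of this kind can be expressed as a tolerance check: two sets of values agree "up to rounding" if every difference is no larger than half a unit in the last reported decimal place. The Python sketch below illustrates the idea with made-up numbers; the function name `agree_up_to_rounding` and the default of one decimal place are assumptions for illustration, not a description of how GenomeStudio rounds or of how the vignette performs its comparison.

```python
def agree_up_to_rounding(raw_values, rounded_values, decimals=1):
    """True if each raw value lies within half a unit of the last kept decimal.

    `decimals` is the number of decimal places assumed to be kept in
    `rounded_values`; a tiny epsilon absorbs floating-point noise.
    """
    if len(raw_values) != len(rounded_values):
        return False
    tol = 0.5 * 10 ** (-decimals) + 1e-9
    return all(abs(r - q) <= tol for r, q in zip(raw_values, rounded_values))


if __name__ == "__main__":
    # Illustrative values only, not taken from any real array.
    print(agree_up_to_rounding([101.234, 57.06], [101.2, 57.1]))
    print(agree_up_to_rounding([101.234], [101.4]))
```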
The availability of an open-source IDAT reader through illuminaio that can read files from any of Illumina’s BeadArray technologies will promote greater use of the IDAT file as a primary data format in the analysis and sharing of results from BeadArray-based profiling studies. The illuminaio package is intended for use by developers to efficiently extract the content of both IDAT and bead-manifest files, thereby expanding the possibilities for conducting reproducible research with these data.
One exception to the dearth of IDAT files noted in the introduction is The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/). IDAT files from Illumina methylation and genotyping arrays are available in large numbers as Tier 1 data from the TCGA website (https://tcga-data.nci.nih.gov/tcga/). Of particular interest is the Illumina 450k methylation array, for which Triche et al.7 have shown improvements in background correction by using out-of-band probes, information that is only available through IDAT files and not the GenomeStudio output. For this work Triche et al. used illuminaio to access the out-of-band probes, which shows the advantage of having access to low-level data.
illuminaio is currently used in the minfi12, methylumi13 and crlmm14,15 packages for importing IDAT files from the Infinium methylation and genotyping platforms respectively, demonstrating its utility.
illuminaio is an R package available from the Bioconductor project (http://www.bioconductor.org) and from Zenodo (DOI: 10.5281/zenodo.7588).
KAB developed the first version of the IDAT reader for unencrypted IDAT files. This work was later improved by HB, MER and KDH. MLS developed the IDAT reader for encrypted files. All authors wrote and approved the manuscript.
This work was supported by the European Community’s Seventh Framework Programme under grant agreement No. 305626 (Project RADIANT) (MLS), NHMRC Project grant 1050661, Victorian State Government Operational Infrastructure Support and Australian Government NHMRC IRIISS (MER) and Grant Number R01 GM103552 from the National Institute of General Medical Sciences, National Institutes of Health (KDH).
Competing Interests: No competing interests were disclosed.
Version 1: 04 Dec 13