Skip to main content
Advertisement
Browse Subject Areas
?

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Minimal Absent Words in Prokaryotic and Eukaryotic Genomes

  • Sara P. Garcia ,

    spgarcia@ua.pt

    Affiliation Signal Processing Laboratory, IEETA, University of Aveiro, Aveiro, Portugal

  • Armando J. Pinho,

    Affiliations Signal Processing Laboratory, IEETA, University of Aveiro, Aveiro, Portugal, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal

  • João M. O. S. Rodrigues,

    Affiliations Signal Processing Laboratory, IEETA, University of Aveiro, Aveiro, Portugal, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal

  • Carlos A. C. Bastos,

    Affiliations Signal Processing Laboratory, IEETA, University of Aveiro, Aveiro, Portugal, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal

  • Paulo J. S. G. Ferreira

    Affiliations Signal Processing Laboratory, IEETA, University of Aveiro, Aveiro, Portugal, Department of Electronics, Telecommunications and Informatics, University of Aveiro, Aveiro, Portugal

Abstract

Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we explore different sets of minimal absent words in the genomes of 22 organisms (one archaeota, thirteen bacteria and eight eukaryotes). We investigate if the mutational biases that may explain the deficit of the shortest absent words in vertebrates are also pervasive in other absent words, namely in minimal absent words, as well as to other organisms. We find that the compositional biases observed for the shortest absent words in vertebrates are not uniform throughout different sets of minimal absent words. We further investigate the hypothesis of the inheritance of minimal absent words through common ancestry from the similarity in dinucleotide relative abundances of different sets of minimal absent words, and find that this inheritance may be exclusive to vertebrates.

Introduction

The set of absent words of a sequence is the set of all words that cannot be found in the sequence. This set is too large and of limited interest for practical purposes. Hence, we have introduced the concept of minimal absent words that have the following property: the new word formed by removing the left- or rightmost character from a minimal absent word is no longer an absent word [1].

Minimal absent words are defined to have at least 3 characters and have been computed in genomes of organisms from all domains of life. The core of a minimal absent word, i.e. the word that remains if its left- and rightmost characters are removed, is a maximal repeat. A maximal repeat is a perfect repeat (without gaps or misspellings) that occurs at least twice and which cannot be further extended to either its left- or right-end side without loss of similarity.

Minimal absent words are a generalization of the short absent words introduced by Hampikian and Andersen under the term nullomers [2], and by Herold et al. as unwords [3]. For sequences with all letters and pairs of letters of the alphabet, the set of nullomers/unwords will correspond to the shortest minimal absent words.

For illustration, consider the sequence GCTAACCGATG and its reverse complement. The set of minimal absent words of this sequence concatenated with its reverse complement (with artificial words across the boundary ignored) is {AAA, AAG, AAT, ACA, ACG, ACT, AGA, AGG, AGT, ATA, ATT, CAA, CAC, CAG, CCA, CCC, CCT, CGC, CGT, CTC, CTG, CTT, GAA, GAC, GAG, GCA, GCC, GCG, GGA, GGC, GGG, GTA, GTC, GTG, TAC, TAT, TCA, TCC, TCT, TGA, TGC, TGG, TGT, TTC, TTG, TTT, AGCT, CATG, CCGG, CTAG, GATC, TCGA, TTAA}, whereas its set of nullomers/unwords solely includes the trinucleotides of the above set. Moreover, the set of maximal repeats in this sequence concatenated with its reverse complement is {A, C, G, T, AT, CG, GC, TA}.

The minimality constraint imposed on minimal absent words guarantees that amongst all absent words, minimal absent ones are the closest to the boundary of the set of all occurring words.

An important question concerning absent words is their biological relevance. Speculation of negative selection acting upon nullomers led Hampikian and Andersen to envisage a range of potential applications [2]. Herold et al. suggested that unwords may not have a functional meaning but might be useful for large scale mutagenesis experiments [3]. We previously hypothesized that minimal absent words might be used as biomarkers at the individual level, or for the comparison of genetic traits at the species or population level [1]. However, the most comprehensive analysis so far, to the best of our knowledge, on the biological implications of absent words is authored by Acquisti et al., who carefully analyzed the set of nullomers/unwords of length 11 base pairs (bp) in the human genome, and questioned the evidence for assuming those words to be under negative selective pressure [4]. Instead, they proposed that the mutational characteristics of the genome, namely the hypermutability (hence deficit) of CpGs in vertebrates, provides a reasonable explanation for the multiple CpGs observed in all of the shortest absent words in the human and other mammalian genomes [4]. Moreover, Acquisti et al. hypothesized that regular point mutations, in addition to hypermutable CpGs, are an important justification for the presence of nullomers/unwords [4]. They compared the list of nullomers/unwords in the human and other mammalian genomes and found that the human genome shares more nullomers/unwords with its closest evolutionary relative, the chimpanzee, than with more distantly related mammals, hence suggesting that the set of human nullomers/unwords contains nullomers/unwords inherited from the common ancestor of human and chimpanzee, in addition to those that have arisen within the human lineage [4].

Here, we complement their analysis by investigating if the compositional biases that may explain the deficit of the shortest absent words in vertebrates are also pervasive in other absent words, namely, in minimal absent words. Moreover, we compare sets of minimal absent words, and respective compositional biases, in organisms other than vertebrates. We further investigate the hypothesis of the inheritance of minimal absent words through common ancestry, in addition to lineage specific inheritance, from the similarity in dinucleotide compositional biases of different sets of minimal absent words. For estimating the compositional biases, we use the methodology of dinucleotide relative abundances pioneered by Brendel et al. [5], Pietrokovski et al. [6], and Karlin and collaborators (e.g. [7][9]).

Methods

Genomic data

We considered the genomes of one archaeota, thirteen bacteria and eight case-study eukaryotes (Table 1) as available from the NCBI database [10], the Saccharomyces Genome Database [11], the database in The Arabidopsis Information Resource [12], the WormBase database [13], and the FlyBase database [14]. For convenience, the scientific names in figures and tables are abbreviated to the first letter of the genus followed by the first letter of the epithet. Two exceptions include two additional letters as prefixes, namely for the methicillin-resistant Staphylococcus aureus (MRSa) and the methicillin-susceptible Staphylococcus aureus (MSSa). The reference assemblies of the reported NCBI builds are used for the chicken, mouse, chimpanzee and human genomes.

Finding minimal absent words

Consider a finite alphabet with cardinality . Let denote the length of a string over and its th character, with . A substring of starting at position and ending at position is denoted by , with . If , then . Moreover, () denotes the concatenation of character () to the left (right) end side of , with .

Let denote a substring of and denote the set of positions of in , i.e. and . A maximal repeated pair in is a pair of identical substrings such that the character to the immediate left (right) of one of the substrings is different from the character to the immediate left (right) of the other substring, i.e. a triple , such that , and , with [15]. A substring is a maximal repeat in if it occurs in a maximal pair, i.e. if there is at least a maximal repeated pair in of the form (), with [15].

A string is a minimal absent word of if and only if is not a substring of , but and are substrings of . For convenience, we consider .

Theorem 1.

(proof in [1]) If is a minimal absent word of , then is a maximal repeat in .

Theorem 2.

(proof in [1]) A string is a minimal absent word of if and only if but , where , and .

If is a minimal absent word of , then occurs at least twice in and these occurrences may partially overlap. It is easily verifiable that, as in DNA sequences, the maximum number of minimal absent words associated to a particular is twelve, and it occurs when , with and . This property implies that frequent repeats have a high probability of not generating minimal absent words, because for those frequent repeats is often equal to .

Minimal absent words are found by reading the information in a suffix array. A suffix array is an array of integers , with and , each pointing to the beginning of a suffix of , such that lexicographically precedes . Two auxiliary arrays are used, namely, the longest common prefix (lcp) array, and the left character (bwt) array, the latter corresponding to the Burrows and Wheeler transform [16]. The lcp-array contains the lengths of the longest common prefix between consecutive ordered suffixes, i.e. indicates the length of the longest common prefix between and , with . By convention, . The bwt-array is a permutation of such that if , and, by convention, if , where is a character that does not belong to the alphabet . Conceptually, the bwt-array does not provide any additional information, as the left character of any character of can be determined by direct access to . However, the bwt-array allows for sequential memory access, hence improving the performance due to enhanced use of cache [17].

The first part of the algorithm generates all lcp-intervals using the lcp-array and a stack, and is adapted from [18] and [17]. An lcp-interval of lcp-depth is the interval , with , if and only if ; ; , for at least one in ; and . Each lcp-interval delimits a subset of suffixes that start with a common -letter prefix , . The second part of the algorithm determines if an lcp-interval is left-diverse, i.e. if at least two characters of differ, for . In that case, is a maximal repeat, as all substrings are identical, . From these maximal repeats, all minimal absent words associated to each lcp-interval are computed and then output. See [1] for details on the algorithm.

Sets of minimal absent words are found by concatenating the genome with its reverse complement using a delimiting character that does not belong to the alphabet, to avoid the formation of artificial words across boundaries. The order by which the chromosomes are inserted is irrelevant. We solely consider unambiguous nucleotides (A, C, G or T) and have ignored all sequence ambiguities by replacing every subsequence of ambiguously sequenced nucleotides (e.g. K, M, N, R, S, W and Y) with a delimiting character that does not belong to the alphabet.

Compositional biases from dinucleotide relative abundances

Let denote the relative frequency of nucleotide in a given genomic sequence, and the relative frequency of dinucleotide . A standard assessment of nucleotide bias is through the odds-ratio(1)with values sufficiently larger (smaller) than one implying that the dinucleotide is considered of high (low) relative abundance compared to a random association of its component mononucleotides [19].

For double-stranded DNA molecules, (1) must be modified in order to account for the inherent complementary anti-parallel structure. Let define the string resulting from combining the DNA sequence with its reverse complement . In , the analogous strand symmetric functionals for the base frequencies are nowwith , where is the number of adenine (A) nucleotides in a sequence of length , with equivalent formulas for cytosine (C), guanine (G), and thymine (T). The analogous strand symmetric functionals for the dinucleotide odds-ratio are now(2)an example beingwithand , where is the number of AC dinucleotides in a sequence of length , with equivalent formulas for all other dinucleotides. The total number of dinucleotides in a set with cardinality of minimal absent words of word length is . The vector of values has remarkably low variance throughout the genome of a given organism, and can discriminate sequences from distinct organisms [20]. Dinucleotide relative abundances are estimated considering overlapping, i.e. word ACTAC may be segmented into four dinucleotides, namely two dinucleotides AC, one dinucleotide CT, and one dinucleotide TA.

Results and Discussion

The total number of minimal absent words increases with genome size

Figure 1 displays the distributions of the number of minimal absent words in the genomes of selected organisms for increasing word length. We sampled the distributions at word length 11 (the resulting set of minimal absent words being designated ), which roughly coincides with the beginning of the curves and allows for the comparison with previous studies [4]; at word length 14 (the resulting set of minimal absent words being designated ), as it is close to the peak of the distribution for most prokaryotic genomes surveyed; at word length 17 (the resulting set of minimal absent words being designated ), as it is close to the peak of the distribution for most genomes of higher eukaryotes surveyed; and at word length 24 (the resulting set of minimal absent words being designated ) for sampling the distributions at the beginning of the right-end tails. These right-end tails are the main differences to profiles obtained for artificially generated DNA strings with a random distribution of the four unambiguous nucleotides (A, C, G and T) [1].

thumbnail
Figure 1. Number of minimal absent words in genomic sequences:.

Distributions of the number of minimal absent words (MAWs) in the genomes of selected organisms for word length up to 50.

https://doi.org/10.1371/journal.pone.0016065.g001

Compositional biases are not uniform throughout different sets of minimal absent words

Table 2 reports the GC content (denoted by G+C) of the genome and respective sets of minimal absent words in each organism, with and . As a consequence of ignoring sequence ambiguities, the final (haploid) genome size in units of base pairs (bp) may differ slightly from values commonly reported in the literature. We also report the cardinality (size) of each set of minimal absent words, i.e. the total number of minimal absent words in the set. Table 3 displays the dinucleotide relative abundances of the sets of minimal absent words in the genome of each organism. The reported values are the strand symmetric functionals, with denoted by AA and TT, and so on. The counts were estimated for each word separately, and cumulative values were estimated over the entire set.

The compositional biases displayed in Tables 2 and 3 provide additional information for investigating the hypothesis of the hypermutability of CpGs explaining the absence of nullomers/unwords in vertebrate genomes, as proposed by Acquisti et al. [4]. We find that this hypothesis needs revision for longer absent words, as neither the base nor dinucleotide compositional biases are uniform throughout sets of minimal absent words of increasing word length. For example, the dinucleotide CG is over-represented in sets and for the vertebrate genomes considered, but under-represented in sets and . For quantifying the under- or over-representation of a dinucleotide in a given genome, we use the boundaries proposed by Karlin and collaborators, who proved that a conservative estimate of or , respectively, occurs for sufficiently long ( 5kb) random sequences, with probability approximately , and independent of genome base composition. The rationale follows that, for a random sequence, the values for all approach one, with deviations of about for sequences of length [21].

The inheritance of minimal absent words through common ancestry may be exclusive to vertebrates

Figures 2 and 3 display dendograms of the similarity in dinucleotide compositional biases amongst organisms and throughout different sets of minimal absent words. Dendograms are obtained from matrices with the pairwise Euclidean distances between distinct vectors (of length 16) of dinucleotide relative abundances ( values in Table 3), using the unweighted pair group method with arithmetic averages (UPGMA, also known as average linkage method [22]). UPGMA is a simple hierarchical clustering method that, by assuming a constant rate of evolution, hence no implicit evolutionary model, outputs a rooted tree where the sum of times down a path to the leaves from any node is the same, regardless of the chosen path. Dendograms were drawn using the PHYLIP package [23]. These dendograms based on dinucleotide relative abundances provide a very useful normalization of often very differently sized sets of minimal absent words, and they are preferred to dendograms resulting from multiple sequence alignments due to current algorithmic limitations that render practically infeasible to consider such large data sets as those in sets .

thumbnail
Figure 2. Similarity in dinucleotide compositional biases in prokaryotes:.

Dendograms from dinucleotide relative abundances in sets of minimal absent words of word length 11 (), 14 (), 17 () and 24 () for selected prokaryotic genomes.

https://doi.org/10.1371/journal.pone.0016065.g002

thumbnail
Figure 3. Similarity in dinucleotide compositional biases in eukaryotes.

Dendograms from dinucleotide relative abundances in sets of minimal absent words of word length 11 (), 14 (), 17 () and 24 () for selected eukaryotic genomes.

https://doi.org/10.1371/journal.pone.0016065.g003

The dendograms of similarity in dinucleotide relative abundances displayed in Figures 2 and 3 often do not recover the correct phylogenetic relationships, as dendograms based on whole genome data would, because sets of minimal absent words can have compositional biases very different from those of the genome (Table 2). Nevertheless, they are useful for exploring the hypothesis of the inheritance of minimal absent words through a common ancestor, in addition to lineage specific inheritance, as proposed by Acquisti et al. [4] in different sets of minimal absent words. We find that this hypothesis is not supported by our data for organisms other than vertebrates, as these represent the only clade whose phylogenetic relationships are often recovered in these dendograms.

As minimal absent words are intrinsically related to perfect repeats, they are closely dependent upon the overall repeats content in the genome, and distinct repeat classes will be associated to sets of minimal absent words of increasing word length. The small set of -proteobacteria considered here (E. coli, H. influenzae and X. campestris) have, on average, higher GC content than the -proteobacterium (H. pylori), the firmicutes (B. anthracis, B. subtilis, L. casei, L. lactis, M. genitalium, S. aureus, methicillin-resistant S. aureus, methicillin-susceptible S. aureus and S. pneumoniae), and even the euryarchaeota (M. jannaschii). Moreover, though the genomes of the -proteobacteria considered here are, on average, significantly larger than those of the other bacteria, the average percentage of generic repeats is smaller in this phylum than in the others (see [24] for statistics). The bacterium E. coli has one of the smallest repeat percentages of this set and its base compositional biases vary in opposition to the general trend (Table 2). This last feature is also observed in X. campestris, though its GC content is the highest in this set (Table 2), and its overall percentage of repeats is one of the highest.

The similarity in dinucleotide relative abundances in higher eukaryotes often recovers the phylogenetic relationships, except in set , where the human is more similar to the more distantly related mouse than to the evolutionary close chimpanzee (Figure 3). Apart from the fact that these are extremely small sets in very large genomes (Table 2), we believe part of the explanation to be related to DNA transposons, which have a significant presence in both the mouse and human sets (tough larger in the latter), and which are the class of repeats that exists in more similar percentage in both genomes [25]. The separation of the worm and fruit fly from the metazoan clade may be related to the more recent origin of repeats in the worm and fruit fly than those in the remaining group (the chicken, mouse, chimpanzee and human), specially in the human genome [26].

Conclusions

Minimal absent words, which are at a minimal distance of a single nucleotide (the left- or rightmost) from being an observed word, have been computed in the genomes of organisms from all domains of life. Here, we complement the work of Acquisti et al. by comparing the compositional biases of different sets of minimal absent words in the genomes of 22 organisms (one archaeota, thirteen bacteria and eight eukaryotes). We find that the mutational biases (namely, the hypermutability of CpGs) that were proposed to explain the absence of the shortest absent words in vertebrates do not explain the absence of minimal absent words, as these compositional biases are not uniform throughout different sets of minimal absent words of increasing word length. Moreover, the analysis of the similarity in dinucleotide relative abundances of different sets of minimal absent words supports the hypothesis of the inheritance of minimal absent words through a common ancestor, in addition to lineage specific inheritance, only in vertebrates.

Minimal absent words define a class of words that is closely related to perfect repeats in the genome, and not bound to protein-coding regions of the genome. Hence, we believe minimal absent words may be useful for inferring de novo genomic homology and potentially to uncover a plethora of new information on the evolution of genomes. Such strategy would overcome some of the major pitfalls of current genomic homology inference methods, which often fail to detect homology when there is considerable sequence divergence and mostly ignore the non-protein-coding regions of the genome [27][29]. This might prove to be a particularly useful methodology in genomes with high repeat content, such as the human genome, where more than half of the sequence remains ‘dark matter’, with only exons and repetitive sequences presently annotated.

Author Contributions

Conceived and designed the experiments: SPG. Performed the experiments: SPG. Analyzed the data: SPG. Wrote the paper: SPG AJP JMOSR CAB PJSGF. Designed the software used in analysis: AJP JMOSR CAB PJSGF.

References

  1. 1. Pinho AJ, Ferreira PJSG, Garcia SP, Rodrigues JMOS (2009) On finding minimal absent words. BMC Bioinformatics 10: 137.
  2. 2. Hampikian G, Andersen T (2007) Absent sequences: Nullomers and primes. In: Pacific Symposium on Biocomputing. 12: 355–366.
  3. 3. Herold J, Kurtz S, Giegerich R (2008) Effcient computation of absent words in genomic sequences. BMC Bioinformatics 9: 167.
  4. 4. Acquisti C, Poste G, Curtiss D, Kumar S (2007) Nullomers: really a matter of natural selection? PLoS ONE 2: e1022.
  5. 5. Brendel V, Beckmann J, Trifonov E (1986) Linguistics of nucleotide sequences: morphology and comparison of vocabularies. Journal of Biomolecular Structure and Dynamics 4: 11–21.
  6. 6. Pietrokovski S, Hirshon J, Trifonov E (1990) Linguistic measure of taxonomic and functional relatedness of nucleotide sequences. Journal of Biomolecular Structure and Dynamics 7: 1251–1268.
  7. 7. Karlin S (1998) Global dinucleotide signatures and analysis of genomic heterogeneity. Current Opinion in Microbiology 1: 598–610.
  8. 8. Karlin S, Mrázek J, Campbell A (1997) Compositional biases of bacterial genomes and evolutionary implications. The Journal of Bacteriology 179: 3899–3913.
  9. 9. Karlin S, Burge C (1995) Dinucleotide relative abundance extremes: a genomic signature. Trends in Genetics 11: 283–290.
  10. 10. NCBI website. Available: http://www.ncbi.nlm.nih.gov/. Accessed 2010 December 15.
  11. 11. SGD website. Available: http://www.yeastgenome.org/. Accessed 2010 December 15.
  12. 12. TAIR website. Available: http://www.arabidopsis.org/. Accessed 2010 December 15.
  13. 13. WormBase website. Available: http://www.wormbase.org/. Accessed 2010 December 15.
  14. 14. FlyBase website. Available: http://flybase.org/. Accessed 2010 December 15.
  15. 15. Gusfield D (1997) Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge: Cambridge University Press.
  16. 16. Burrows M, Wheeler DJ (1994) A block-sorting lossless data compression algorithm. Digital Systems Research Center.
  17. 17. Abouelhoda MI, Kurtz S, Ohlebusch E (2002) The enhanced suffix array and its applications to genome analysis. pp. 449–463.
  18. 18. Kasai T, Lee G, Arimura H, Arikawa S, Park K (2001) Linear-time longest-common-prefix computation in suffix arrays and its applications. pp. 182–192.
  19. 19. Karlin S, Cardon L (1994) Computational DNA sequence analysis. Annual Review of Microbiology 48: 619–654.
  20. 20. Gentles A, Karlin S (2001) Genome-scale compositional comparisons in eukaryotes. Genome Research 11: 540–546.
  21. 21. Karlin S, Campbell A, Mrázek J (1998) Comparative DNA analysis across diverse genomes. Annual Review of Genetics 32: 185–225.
  22. 22. Sokal R, Michener C (1958) A statistical method for evaluating systematic relationships. University of Kansas Scientific Bulletin 28: 1409–1438.
  23. 23. PHYLIP website. Available: http://evolution.genetics.washington.edu/phylip.html. Accessed 2010 December 15.
  24. 24. Genome Atlas website. 15: Available: http://www.cbs.dtu.dk/services/GenomeAtlas-3.0/. Accessed 2010 December.
  25. 25. Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520–562.
  26. 26. The International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921.
  27. 27. Simillion C, Vandepoele K, Van de Peer Y (2004) Recent developments in computational approaches for uncovering genomic homology. BioEssays 26: 1225–1235.
  28. 28. Ulitsky I, Burstein D, Tuller T, Chor B (2006) The average common substring approach to phylogenomic reconstruction. Journal of Computational Biology 13: 336–350.
  29. 29. Margulies E (2008) Confidence in comparative genomics. Genome Research 18: 199–200.