Abstract
Many complex systems are modular. Such systems can be represented as “component systems,” i.e., sets of elementary components, such as LEGO bricks in LEGO sets. The bricks found in a LEGO set reflect a target architecture, which can be built following a set-specific list of instructions. In other component systems, instead, the underlying functional design and constraints are not obvious a priori, and their detection is often a challenge of both scientific and practical importance, requiring a clear understanding of component statistics. Importantly, some quantitative invariants appear to be common to many component systems, most notably a common broad distribution of component abundances, which often resembles the well-known Zipf’s law. Such “laws” affect in a general and nontrivial way the component statistics, potentially hindering the identification of system-specific functional constraints or generative processes. Here, we specifically focus on the statistics of shared components, i.e., the distribution of the number of components shared by different system realizations, such as the common bricks found in different LEGO sets. To account for the effects of component heterogeneity, we consider a simple null model, which builds system realizations by random draws from a universe of possible components. Under general assumptions on abundance heterogeneity, we provide analytical estimates of component occurrence, which quantify exhaustively the statistics of shared components. Surprisingly, this simple null model can positively explain important features of empirical component-occurrence distributions obtained from large-scale data on bacterial genomes, LEGO sets, and book chapters. Specific architectural features and functional constraints can be detected from occurrence patterns as deviations from these null predictions, as we show for the illustrative case of the “core” genome in bacteria.
- Received 27 July 2017
- Revised 29 January 2018
DOI: https://doi.org/10.1103/PhysRevX.8.021023
Published by the American Physical Society under the terms of the Creative Commons Attribution 4.0 International license. Further distribution of this work must maintain attribution to the author(s) and the published article’s title, journal citation, and DOI.
Popular Summary
Many complex systems in very different contexts, from biology to linguistics, can be broken down into clearly defined basic building blocks or components. Books, for example, can be seen as sets of words, and genomes can be seen as sets of genes. Analysis of the component usage in these systems can reveal interesting quantitative laws, some of which are specific to that system, while others are shared across diverse systems. A common theoretical framework for component systems is needed to understand such similarities and differences between systems.
This work focuses on the statistics of shared components and asks the following basic questions: How many components (e.g., words or genes) are common to all realizations (e.g., books or genomes)? How many are, instead, very specific? What are the system-level features that set the probability of sharing components? Such questions are central in evolutionary genomics, and we extend them here to general component systems, using the examples of texts and LEGO toys.
Solving a model based on random sampling of components, we show that several universal aspects of the statistics of shared components are a direct consequence of the heterogeneity of component usage and of system parameters such as the size of realizations and the size of the component vocabulary. While this simple model can capture general properties of empirical systems, deviations from its predictions can be used to highlight system-specific architectural or functional constraints. We show the validity of this approach for detecting the core genome in prokaryotic genomes.
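The random-sampling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Zipf exponent of 1, the sampling with replacement, and all parameter values are assumptions chosen for illustration. Each realization is built by drawing a fixed number of components from a universe with heterogeneous frequencies, and the occurrence of a component is the number of realizations that contain it. Under these assumptions, a component with frequency f is present in a realization of size M with probability 1 - (1 - f)^M.

```python
import random
from collections import Counter


def zipf_frequencies(n_components, exponent=1.0):
    """Normalized power-law frequencies f_i proportional to rank^(-exponent) (assumed form)."""
    weights = [rank ** -exponent for rank in range(1, n_components + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_realizations(freqs, n_realizations, size, rng):
    """Build each realization from `size` independent draws with replacement.

    Only presence/absence is kept, so each realization is a set of component ids.
    """
    components = range(len(freqs))
    return [
        set(rng.choices(components, weights=freqs, k=size))
        for _ in range(n_realizations)
    ]


def occurrence_histogram(realizations):
    """Map occurrence k -> number of components found in exactly k realizations."""
    occurrence = Counter()
    for realization in realizations:
        occurrence.update(realization)
    return Counter(occurrence.values())


def expected_presence(freqs, size):
    """Probability that each component appears at least once in `size` draws."""
    return [1.0 - (1.0 - f) ** size for f in freqs]


# Hypothetical parameters: 500 possible components, 200 realizations of 300 draws each.
rng = random.Random(0)
freqs = zipf_frequencies(500)
realizations = sample_realizations(freqs, n_realizations=200, size=300, rng=rng)
histogram = occurrence_histogram(realizations)
presence = expected_presence(freqs, size=300)
print("components present in every realization:", histogram[200])
print("expected distinct components per realization:", round(sum(presence), 1))
```

In this sketch, components found in every realization play the role of a "core," while those found in only a few are realization-specific. Comparing an empirical occurrence histogram against such a null expectation is one way to look for the deviations mentioned above.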
Bridging different areas of research in complex systems can open the way for developing and applying statistical null models in different contexts. We show, for example, how a modeling approach rooted in quantitative linguistics can shed light on the dynamics of genome evolution.