Non-parametric estimation of survival in age-dependent genetic disease and application to the transthyretin-related hereditary amyloidosis

Flora Alarcon; Violaine Planté-Bordeneuve; Malin Olsson; Grégory Nuel

doi:10.1371/journal.pone.0203860

Abstract

In genetic diseases with variable age of onset, survival function estimation for the mutation carriers as well as estimation of the modifying factors effects are essential to provide individual risk assessment, both for mutation carriers management and prevention strategies. In practice, this survival function is classically estimated from pedigrees data where most genotypes are unobserved. In this article, we present a unifying Expectation-Maximization (EM) framework combining probabilistic computations in Bayesian networks with standard statistical survival procedures in order to provide mutation carrier survival estimates. The proposed approach allows to obtain previously published parametric estimates (e.g. Weibull survival) as particular cases as well as more general Kaplan-Meier non-parametric estimates, which is the main contribution. Note that covariates can also be taken into account using a proportional hazard model. The whole methodology is both validated on simulated data and applied to family samples with transthyretin-related hereditary amyloidosis (a rare autosomal dominant disease with highly variable age of onset), showing very promising results.

Citation: Alarcon F, Planté-Bordeneuve V, Olsson M, Nuel G (2018) Non-parametric estimation of survival in age-dependent genetic disease and application to the transthyretin-related hereditary amyloidosis. PLoS ONE 13(9): e0203860. https://doi.org/10.1371/journal.pone.0203860

Editor: Wei Wang, Edith Cowan University, AUSTRALIA

Received: July 31, 2017; Accepted: August 29, 2018; Published: September 25, 2018

Copyright: © 2018 Alarcon et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data are not publicly available due to the sensitive status of the patients. These data have been described in two previously published studies. Both studies were approved by Regional Ethical Committee in Umeå in Sweden, and the National Commission on Informatics and Liberty (CNIL) in France, respectively. As required by these ethics committeees, geno-phenotypic information on families have to remain anonymous for ethical and medical reasons and cannot be disclosed. Concerning Swedish data, data requests may be sent to Regional Ethical Committee in Umeå, epn@adm.umu.se. Concerning French data, all computer data (including databases, in particular patient data) are protected by the CNIL, the national data protection authority for France. CNIL is an independent French administrative regulatory body whose mission is to ensure that data privacy law is applied to the collection, storage, and use of personal data. As the database of this study was authorized by the CNIL, we cannot make available data without prior agreement of the CNIL. Data requests from interested researchers may nevertheless be considered, subject to prior determination of the terms and conditions of such consultation and in respect for compliance with the applicable regulations. Interested researchers may contact Florence Favrel Feuillade: Florence.favrel-feuillade@aphp.fr.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared no competing interests exist.

Introduction

In monogenic diseases with variable age of onset, an accurate estimation of the survival function for the mutation carriers is essential. Since potential factors (e.g. genetic or environmental factors) can modify this age of onset, it is important to identify these factors and estimate their effects. These estimations are then usually combined into a proportional hazard model that is typically used to provide individual risk assessment as well as to establish prevention strategies.

In the context of genetic diseases with variable age of onset, geneticists usually focus on the penetrance function, that is the cumulative risk of being affected by a given age for mutation carriers defined by as the age-specific cumulative distribution function of the waiting time to disease diagnosis [1–3]. Since in this paper one aims at exploiting standard statistical survival analysis, we will rather consider the survival function defined by: However it is straigthforward to obtain the penetrance function from the survival one (and conversely) since F(t) = 1 − S(t). In order to avoid any confusion, please note that the survival function considered here corresponds to the cause-specific survival (disease diagnosis) and not to the overall survival. We do not consider any competing risk in the present work, and censoring (e.g. death) is always assumed to be independent from the waiting time of interest. Note that for severe disease (e.g. cancer), death is often affected by the disease status, but since this event usually occurs after diagnosis, which does not affect our model.

When estimating mutation carrier survival, the main challenge comes from the fact that most genotypes are not observed. Taking into account this uncertainty is then slightly different depending on whether the disease has sporadic cases or not. In complex diseases with monogenic sub-entities, in which only a minority of cases is due to rare mutations (e.g. breast cancer with BRCA mutations [4–6]) both non-carriers and mutation carriers might be affected. It is therefore necessary to provide a survival function for non-carrier which is typically obtained from the general population. In monogenic diseases such as the hereditary Tranthyretin Amyloidosis (hATTR) [2], all affected individuals are necessary carriers and thus, the disease incidence among non-carrier is equal to zero. Nevertheless the problem remains challenging since a non-affected individual at age t might either be a non-carrier or a carrier who “survived” until age t. For the sake of simplicity, we only consider in this article the monogenic diseases case; however the suggested method is straightforward to extend to complex diseases with monogenic sub-entities as long as the incidence or survival among non-carriers is available.

In the last decades, several methods have been proposed for estimating the penetrance or survival functions from pedigrees (see e.g., [1–3, 6]). All these methods rely on a parametric model, namely the Weibull function, to describe the penetrance function. In these papers, unknown genotypes are handled through the Elston-Stewart algorithm [7] and likelihood function is maximized with ad hoc implementations [8]. Probably due to their complexity, the resulting methods were never made publicly available and were therefore scarcely used. The main objective of this paper is to provide a unified and flexible publicly available methodology that can both provide a stable implementation of the previously published parametric estimators and more general non-parametric estimates. Such estimates were previously considered in [9] but only in the non-realistic case where all genotypes were observed.

In order to achieve this objective, we reformulate the problem in the Expectation-Maximization (EM) framework [10] which provides a general iterative algorithm for optimizing the likelihood of any statistical model with partially missing data (here the unobserved genotypes). In the EM algorithm we alternate two main steps: the E-step where we compute individual weights as posterior mutation carrier distributions using the current estimates; and the M-step where we update the estimates using the observations and the weights computed at the E-step. Unlike previous works [1–3, 6] we do not want to provide an ad hoc implementation of these two steps but rather taking advantage of well established and robust procedures. We use probabilistic computations in Bayesian networks for the E-step [11], and classical survival analysis methods for the M-step [12].

Our method can be either used with parametric estimation like previously done in the literature (e.g. Weibull or exponential waiting time distribution, etc.) or with non-parametric approaches (e.g. Kaplan-Meier or Nelson-Aalen). To the best of our knowledge, this is the first time that a non-parametric method estimate penetrance function with unknown genotype is proposed.

The paper is organized as follows: Section “Methods” contains the main contribution of this paper which includes the model formulation, the EM-framework and the detailed E- and M-steps. Then, Section “Validation on Simulated Datasets” presents several simulation analyses that validate the method while Section “Application to the hATTR” applies the proposed method to hATTR families from different origins (French, Portuguese, and Swedish). Finally, some conclusions are drawn in Section “Discussion”. A minimal R [13] source code demo is provided as supplementary material.

Methods

This section is devoted to the description of the proposed methodology. The objective is to estimate the cause-specific survival function for individuals carrying the disease mutation. We first introduce the model decomposed into a genetic-specific part and a survival-specific part. Then we present the EM framework and detail both the E-step using belief propagation in Bayesian networks and the M-step using existing tools from the survival analysis community.

The Model

Let us consider n individuals in set . We denote by the subset of founders (i.e. individuals without ancestors in the pedigree) and we denote by the set of non-founders (i.e. individuals with ancestors in the pedigree). Let us denote by X = (X₁, …, X_n) ∈ {00, 01, 10, 11}ⁿ the genotypic random vector defined such as X_i is the genotype of the individual i. The first entry (respectively the second entry) represents the number of paternal (resp. maternal) disease alleles. For instance X_i = 01 means that the individual i carries the mutation, is heterozygous and that his mutation has been transmitted by his mother. Also, we denote by (resp. ) the paternal (resp. maternal) genotype of any non-founder individual . Let us remind that the vector X is partially observed; first because individuals are rarely genotyped, secondly because the parental transmission pattern is only indirectly observed through the family relationship. Therefore, unobserved genotypes will be estimated according to genotypic information on the whole pedigree (see Section “E-step”). We denote by the random vector defined such as T_i is the time at diagnosis if the individual i is affected by the disease (i.e. δ_i = 1) while T_i is the time at last follow-up (censoring) if the individual i is not affected (i.e. δ_i = 0); where δ ∈ {0, 1}ⁿ is the censoring indicator. Finally, the model can be written as follows: where denotes the joint probability distribution of T and X and denotes the conditional distribution of T given X.

As an example, let us consider a simple nuclear family defined by two ancestors and three children. In Table 1, the first column corresponds to the index i of the individual, the second one to the paternal index (with the convention that we use 0 for founders), the third one to the maternal index (0 for founders), the fourth one to the censoring indicator (δ_i = 1 if the individual i is affected and δ_i = 0 if not), the fifth one to the time T_i and the last one to the genotype X_i.

Download:

Table 1. Example: A simple nuclear family.

https://doi.org/10.1371/journal.pone.0203860.t001

Genetic part.

We assume the Mendelian transmission of the alleles and the Hardy-Weinberg distribution of the founder’s alleles with allele frequency f. This means that for any founder we have , , and . In the case where the survival function for carriers is fully non-parametric (see Section “Survival Part”), the frequency f is non identifiable since the survival for carrier can easily account for any arbritrary mixture of carrier and non-carriers. This is a classical problem arising with mixture with non parametric components. A classical solution to this problem is to consider instead parametric components whose more constrained nature prevent identification issues (see [14] in the FDR context, and [15] in the survival context with cure models).

We will hence either assume that f is known (which is quite such genetic disease—e.g. BRCA mutations in breast cancer), or, in the extreme situation where this information is unknown, we will use a parametric model (e.g. Weibull, Gaussian, logistic, etc. [16]) to fit this parameter as a prior step before refining survival estimates using our non-parametric approach. Thus, the genetic part can be written as follows: Since the n individuals might belong to completely independent families, it is clear that the genetic likelihood function can be computed separately on these independent families. However, the notations are still valid but simpler by combining all families into a single pedigree file.

As an example, let us compute this probability for the family of Table 1 where the observed genotypic vector is x = (01, 00, 00, 10, 00):

However, in practice, the true genotype X_i is almost always either partially observed or not observed at all. Indeed, when a genotyped individual carries the disease mutation, we know that X_i = 11 in the (rare) homozygous case, but we only know that X_i ∈ {10, 01} in the heterozygous case. Similarly, a non genotyped but affected individual only implies that X_i ≠ 00 (since all affected individual are mutation carriers). Moreover, a non genotyped and non affected individual i implies that X_i ∈ {00, 01, 10, 11}. Finally, a non carrier genotyped individual implies that X_i = 00 (assuming a 100% sensitivity of the mutation search procedure). Moreover, note that Genotyping errors can easily be added to the model. This uncertainty will be later rigorously taken into account through probabilistic computations using belief propagation in Bayesian networks (see Section “E-step”).

Survival part.

We recall that δ ∈ {0, 1}ⁿ is the censoring indicator. The survival part is defined for any carrier i with X_i ≠ 00 as where λ(t) is the hazard function, S(t) the survival function defined by S(t) = exp(−Λ(t)) and the cumulative hazard. Note that for the sake of simplicity, we abusively use the probability symbol to actually denote a (conditional) density in the case where δ_i = 1. Since non-carrier cannot be affected, they do not appear in the log-likelihood. For simplification purpose it is nevertheless useful to make them appear in the expression with a null contribution by abusively writing:

Accounting for covariates.

Note that covariates can easily be added to the model through a proportional hazard model defining hereafter. Let be the covariate matrix, the model accounting for Z can be written as follows: where λ₀(t) is the baseline hazard, Λ₀(t) is the baseline cumulative hazard, the i^th row of Z and is the proportional effect coefficient.

The Expectation Maximization algorithm

As stated above, most of the genotypes X_i are not observed at all, and even for the genotyped individuals, we often only have partial information (e.g., we cannot distinguish between 01 and 10). We therefore consider the variable X as a latent variable and denote by the set of acceptable genotypes (e.g. if we have no information on X_i, if we know that X_i is heterozygous, for a non-carrier, etc.). We denote by “ev” the evidence corresponding to all the available information, i.e. the available genotype informations () as well as the partially censored T. Note that this notion of ‘evidence’ in Bayesian network context is similar but not exactly the same as the notion of ‘evidence’ in Bayesian statistics. In order to maximize the log-likelihood function of the model in the presence of incomplete data, we use the EM algorithm [10]. To that end, let us introduce the following auxiliary Q function: where θ (resp. θ_old) contains the current (resp. previous) version of the parametric (proportional effect coefficients) and non-parametric (survival functions) components of the model. Formally, the classical Q function of the EM algorithm is equal to the present function plus a constant term in θ. Therefore, maximizing our function instead of the original one does not affect our algorithm.

Since the genetic component of the model has no parameter (the allele frequency f is supposed to be known and a Mendelian transmission of the alleles is assumed—see Section “Genetic part”), by using the model properties it is straightforward to rewrite the Q function as follows (1)

Starting from an arbitrary value of θ = θ₀, the following two steps are iterated until the estimates converge:

E-step: for computing the weights using θ_old = θ (that are conditional probabilities);
M-step: for maximizing the Q function with respect to θ and obtaining a new estimate.

E-step.

In order to compute the conditional probabilities it is first necessary to compute their common denominator: Since X has 4ⁿ possible configurations in the worst case, it is clearly impossible to simply enumerate these configurations even for moderate size pedigrees. Therefore, one needs a computationally more efficient approach. When the pedigree has no loop (i.e. the pedigree is a tree), the Elston-Stewart algorithm [17] suggests to eliminate the variables X_i from the above sum-product by peeling individuals from the last generations up to the oldest common ancestor. The resulting algorithm has a complexity which allows to efficiently handle even large pedigrees as long as they have no loop. However, in practice, it is not rare to encounter loops in pedigree (e.g., consanguinity loops). Fortunately, Elston-Stewart can be adapted to the presence of loops by introducing the notion of cut-sets [18] which results in a complexity, where k ≥ 3 correspond to the size of the largest cut-set in the peeling sequence. Typically k = 4 to 6 for most pedigrees, but k can also grow very large resulting in intractable exact computations for highly complex pedigrees (e.g. inuit pedigree [19]). This cut-set version of Elston-Stewart (as well as variants of Lander-Green [20] for multi-point analysis) is implemented in the well-known Mendel software [21] which can efficiently perform likelihood computations in complex pedigrees.

As pointed out in [22], the distribution of genotypes in pedigree can also be described as a Bayesian network, a model that belongs to a wide class of probabilistic graphical models with strong mathematical background and well-known theory for efficiently performing sum-product computations [11]. The approach consists in sequentially eliminating variables from the graphical model taking into account the clique structures of the corresponding graph. This approach results in the construction of a junction tree whose tree-width (size of the largest clique) is precisely equivalent to k for cut-sets approaches. These algorithms are called sum-product, message passing, or belief propagation algorithm and they have been used by many authors in the context of genetics [22–26]. One interesting feature of belief propagation in pedigree is that, for the computational cost of two likelihood computation, this approach provides the full posterior distribution of the system, including the marginal posterior distribution of all genotypes (see [11, 22]). But as pointed out by [27], the Elston-Stewart peeling algorithm can be extended to obtain a similar feature. The resulting algorithm is in fact exactly the forward/backward equivalent of belief propagation for a peeling sequence (sequence of variable elimination).

In this paper, we use a ad hoc low performance R implementation of belief propagation in pedigree called bped (available as supplementary material). At each E-step of the EM algorithm, we provide to this command-line program two files:

a pedigree structure file as a classical .ped file;
an evidence file containing the evidence for all i ∈ {1, …, n} and for all X_i ∈ {00, 01, 10, 11}.

For a non-affected individual (δ_i = 0), one has: and for an affected individual (δ_i = 1) one has: Since the proportion factor S(T_i)λ(T_i) does not depend on X_i, its values will not affect in any way the posterior distribution P(X_i|ev; θ_old). Indeed, since we compute the posterior distribution of all X_i, any multiplicative factor that appears in the prior distributions will cancel out in the posterior. This is exactly the case for the S(T_i)λ(T_i) factor which can then be removed. Hence we can replace this proportion factor by 1 and simply use: in the evidence file. It is therefore clear that the knowledge of λ(t) is not required for this procedure which is of particular interest since non-parametric survival estimate like Kaplan-Meier usually provides only the expression of S(t) and not the one of λ(t).

Then, bped performs the BP and computes the posterior marginal distribution for all individual i, from which the weights are derived.

M-step.

Once the weights w_i have been computed (at the E-step), the model components can be updated by maximizing Eq (1) which is simply a weighted survival log-likelihood function where each individual observation receives the weight w_i. Since most statistical softwares allow for weighted observations, we can therefore rely on well-established existing survival tools for performing our M-step. Using the programming software R [13], we can for example take advantage of the robust survival package [12, 28] which provides non-parametric Kaplan-Meier estimation of the survival through the survfit() function. Note that the coxph() can also be combined with survfit() to provide non-parametric Nelson-Aalen survival estimates taking into account proportional hazard effects. In addition, using full parametric survival estimation procedures, such as the survreg() function, allows the method to provide alternative classical survival estimation (namely Weibull, exponential, Gaussian, logistic, log-normal, log-logistic) with no additional development costs. Even if the primary purpose and novelty of our method is to provide non-parametric survival estimate, the possibility to fit classical parametric survival estimates is also an interesting feature especially considering that few or none of the previously published methods provide any practical implementation.

Practical implementation.

EM initialization is performed by affecting random weights w_i to all individuals in each pedigree (e.g., drawn from a uniform distribution on [0, 1] and normalized to ensure the sum-to-one constraint). Then, a first M-step is performed using these weights in order to provide an initial value of θ. The EM iterations are run until numerical convergence is achieved. The usual convergence criterion is such that the absolute error between survival estimates (e.g., baseline survival at age 20, 40, 60, 80) decreases below a threshold (e.g., 10⁻¹⁰) between two consecutive iterations of the algorithm. The 95% pointwise confidence intervals are simply provided by the standard (weighted) Kaplan-Meier (or Nelson-Aalen if we consider covariates) estimation of the survival.

1 Validation on simulated datasets

For validation purposes we first consider the application of our method on simulated datasets. In order to simulate realistic pedigree structure (parental relationships and individual genders), we use 64 French and Portugese hATTR families from [2] totalizing 1,095 individuals. These 64 families were replicated three times resulting in a dataset of n = 3,285 individuals in 192 families. Genotypes were assigned using the Hardy-Weinberg distribution for the founders and respecting the Mendelian transmission for the non-founders. We have used an allele frequency of f = 0.20 in order to obtain enough informative families (without simulating any ascertainment process). The gender of the transmitting parent was not taken into account in this work (no distinction between X = 01 and X = 10). Thus, the genotype of individual i was binary and individual i was a mutation carrier if X_i ∈ {01, 10, 11} and non carrier if X_i = 00. The age at diagnosis was simulated according to a piecewise constant hazard rate function, λ(t), given as follows: A uniform censoring data between 15 and 80 years resulting in a censoring rate of roughly 30% (similar to real data censoring rates) was simulated. A total of 10% of the individuals (uniformly selected) was supposed to be genotyped (without error) while the 90% remaining individuals were not.

One can see on Fig 1 (left) the non-parametric Kaplan-Meier estimation obtained at the end of the EM algorithm. Despite the fact that only 10% of the individuals where genotyped, the method clearly manages to provide accurate estimates. Unsurprisingly, the size of the confidence intervals decrease when the sample or the number of affected individuals increases (data not shown).

Download:

Fig 1. Simulated dataset.

Reference and estimation of the survival function S(t) for carriers with 95% point-wise confidence intervals (dashed lines). A total of n = 3,285 (1,641 males and 1,644 females) individuals including 441 affected and 319 genotyped. Left: simulation and estimation without gender effect. Right: simulation and estimation with a proportional protective effect for females (gender = 2).

https://doi.org/10.1371/journal.pone.0203860.g001

In order to demonstrate the ability of our method to deal with semi-parametric estimation (non-parametric baseline survival and proportional hazard) we now consider the previous incidence λ(t) as a baseline incidence which is also the male incidence. Moreover, we assume that the females benefit from a protective effect and we use the relative hazard (RH) 0.55 = exp(−0.6) (which means that males have an instantaneous risk 1.8 higher than females). We denote by β = −0.6 the regression parameter. In our simulation, we hence generate the time to diagnosis with the survival S₁(t) = exp(−Λ(t)) for males and with the survival S₂(t) = exp(−Λ(t)e^β) for females. Censoring and genotyping remain unchanged.

Covariates can be taken into account by stratifying on these covariates. However, since proportional hazard models are commonly considered in this context, we also perform a simulation where we assume a PH effect of the gender.

At each M-step of the EM algorithm we fit both a Cox PH model using gender as factor (gender = 1 as default) and then perform a non-parametric (Nelson-Aalen) estimation of the baseline survival. At the end of the algorithm, estimation of the proportional effect can be combined with the baseline survival estimation to provide survival estimations for the two classes. Alternatively, a purely stratified approach is also possible and give very similar results (data not shown) but since our purpose was here to illustrate the semi-parametric approach, we only give its results. The final Cox fitting gives that the β parameter was estimated by (p-value < 0.01) which is very close to the true value β = −0.6, and one can see on Fig 1 (right) the survival estimates for the two classes. Like for the simpler case with no covariates, the estimations are quite consistent with the ground truth. Again, increasing the sample size or the number of affected individuals leads to sharper confidence intervals (data not shown).

Now that the method appears to be validated on simulated datasets, we can consider real datasets.

2 Application to the hATTR

In this section the proposed method is applied to the transthyretin hereditary amyloidosis (hATTR), a severe autosomal dominant disorder caused by a mutation of the transthyretin (TTR) gene. The disorder initially described in Portugal is now recognized across the world with areas of highest prevalence like in Sweden or in Japan [29]. The ATTR-Val30Met (denoted MET30 from now on) is the most frequent pathogenic variant in Europe and virtually the only one detected in Portugal and Sweden. For this particular variant, a wide range of age at onset is observed with an average 30 (resp. 56) in Portuguese (resp. Swedish) families.

In France, the population of hATTR is heterogeneous including families from Portuguese descent presenting alike those from Portugal and families from French descent. The latter are characterized by a heterogeneity of pathogenic TTR variants, including the MET30 in 40% and a later onset of symptoms averaging 58 years of age. Fortunately, significant therapeutic advances occurred in the recent years with the aim to stabilize the disease progression. In this setting, a better knowledge of the risk of being symptomatic for carrier is highly needed to guide their follow up and to manage patients at the very onset of symptoms. It may also give clues on our understanding of the pheno-genotypic variability observed.

Because of the low allelic frequency, random sampling is not a tractable approach to obtain informative samples. As a consequence, data are usually obtained from families ascertained through affected individuals. Indeed, as all affected individuals necessarily carry the mutation, families ascertained in this way are very informative for estimating survival function. The drawback of this procedure is that the survival function can be significantly overestimated if the ascertainment process is not taken into account [30]. Therefore, an adjustment for the ascertainment bias is required. Different adjustments for ascertainment bias have already been proposed in order to provide valid risk estimates of a genetic disease (see for instance [1, 3, 6]). In these applications, the ascertainment bias was corrected by a classical method that consists in simply removing the phenotypic information of the individual (called proband) who allowed his family to be selected. This ascertainment correction is a well-known (and validated) preprocessing technique whose relevance is not discussed here.

Here we considered three datasets (see Table 2): the French dataset totalized 46 families from French descent with as many as 12 different pathogenic TTR variants including the MET30 in 22; the Portuguese dataset included 33 MET30 families from Portugal; the 3rd dataset enrolled 77 MET30 kindreds from Northern Sweden. These data have been described in two previously published studies [2, 31]. Both studies were approved by local ethic commitees (EC) in France and in Sweden, respectively. In this setting, as required by the EC and stated in the two publications, geno-phenotypic information on families have to remain anonymous for ethical and medical reasons and cannot be disclosed.

Download:

Table 2. The three hATTR datasets.

https://doi.org/10.1371/journal.pone.0203860.t002

The frequency of mutated allele was set to f = 0.001 [6, 32]. This parameter is generally unknown in practice. In addition, it has been shown in [6] that the survival estimations are not highly sensitive to this parameter.

For each dataset, we provide a semi-parametric survival estimation with a gender proportional hazard effect. We provide p-values for the gender effect through Cox’s (partial-) likelihood ratio tests. For each dataset, the results are compared to previously published analyses.

Fig 2 shows the survival estimates by gender for the three datasets. For the French dataset (top-left Fig 2), one observes a later disease onset (median around 70) than in the Portuguese sample (Fig 2, top-right) showing a median around age 45 years. A significantly higher instantaneous risk is observed for men compared to women in both the French (RH 1.7, Cox’s p-value 0.03) and the Portuguese (RH 1.57, Cox’s p-value 0.033) datasets. In contrast, we found no gender effect in the Swedish dataset (Fig 2, bottom-left, Cox’s p-value 0.42) and hence present the estimate without gender effect in Fig 2 (bottom-right). The disease onset appears to be much later in the Swedish population in comparison with the French and Portugese populations.

Download:

Fig 2. Survival estimates.

Top-Left: French dataset with a gender PH effect (RH 1.7, Cox’s p-value 0.030); Top-Right: Portugese dataset with a gender PH effect (RH 1.57, Cox’s p-value 0.033); Bottom-Left: Swedish dataset with a non-significant gender PH effect (Cox’s p-value 0.42). Bottom-Right: Swedish dataset without any gender PH effect. 95% point-wise confidence intervals are given by the colored regions.

https://doi.org/10.1371/journal.pone.0203860.g002

These observations are highly consistent with the previously published analyses [2, 31]. In the previous stratified analysis, the gender difference was found lower and not significant in the whole French dataset. This difference can be explained by the additional power provided by the proportional hazard model used here. For comparison purposes we fitted on the French data a stratified non-parametric survival and tested for difference between genders using the log-rank test resulting in a non significant p-value of 0.122, which is consistent with the previous study. The previously reported heterogeneity in age of onset across the three datasets is confirmed in the present study.

3 Discussion

In the present article we introduced a flexible and robust framework to estimate survival function from familial data in cases of age-dependent genetic diseases. Our new method provides a unifying way to simply implement both previously published methods (parametric Weibull-based) as well as new interesting extension such as the non-parametric or semi-parametric extensions.

In order to tackle the challenging problem of the unknown genotypes in the family data, our method relies on the EM algorithm and decomposes the problem into two steps: the E-step which uses belief propagation in Bayesian networks to compute marginal individual posterior carrier distribution, and the M-step which estimates survival using weighted observations.

The key feature of our approach is that these two steps are handled by robust and validated implementations: the bped command-line program for the belief propagation, and the survival package (statistical software R [13]) for the survival estimates. We can therefore consider any baseline survival estimators, either parametric (e.g., Weibull, exponential, log normal, etc.) or non parametric (Kaplan-Meier). Moreover, these estimators can be easily combined with Cox’s proportional hazard models and with stratification.

Note that in the present paper we focused on the particular case where non-carriers cannot be affected (survival of 1.0) and where the genetic model is dominant. However, the method can be easily extended to more general models (sporadic cases, recessive model, etc.) as long as the incidence among non-carriers is known (i.e. estimated from the general population). Moreover, more complex models allowing for genotyping errors or even pedigree errors (for instance wrong filiation) can be incorporated, as done in [33], even if, in the present work, we have focused on the most basic (but reasonable) model.

In the application part, as pedigrees are ascertained through an affected individual, the proband’s phenotype exclusion method is used to avoid ascertainment bias. However, other ascertainment corrections can be used if the ascertainment process is more complex (e.g., ascertainment on family criteria in a complex disease with monogenic sub-entities, such as breast and ovarian cancers with the BRCA mutations). Again, this is in favor of the flexibility of the proposed method.

Concerning the perspectives, an interesting extension of this work would be to account for a possible correlation between members of the same family by including a frailty in the survival function. The familial frailty would typically represent an unknown shared exposure to some environmental factors or to some kinds of polygenic effect. However, the estimation of such models is known to be challenging, especially in the context of non-parametric survival estimation (see e.g., [34, 35]). Further investigations will be conducted on this important topic in a forthcoming work. However, in this work and particularly for applications to monogenic diseases (such as hATTR), this frailty aspect should not modify the estimation results. Moreover, the proposed method allows to take into account the parent of origin effect. Thus, it would be very interesting to study the robustness of the survival function estimation when the parent-of-origin effect is analyzed.

Supporting information

S1 File. R source code demo.

https://doi.org/10.1371/journal.pone.0203860.s001

(DOCX)

S2 File. CPP-recueil-data.

https://doi.org/10.1371/journal.pone.0203860.s002

(PDF)

Acknowledgments

We would like to thank Suhr Ole and Urban Hellman for making the data available for this analysis.

References

1. Le Bihan C, Moutou C, Brugieres L, Feunteun J, Bonaïti-Pellié C. ARCAD: a method for estimating age-dependent disease risk associated with mutation carrier status from family data. Genet Epidemiol. 1995;12(1):13–25. pmid:7713397
- View Article
- PubMed/NCBI
- Google Scholar
2. Plante-Bordeneuve V, Carayol J, Ferreira A, Adams D, Clerget-Darpoux F, Misrahi M, et al. Genetic study of transthyretin amyloid neuropathies: carrier risks among French and Portuguese families. J Med Genet. 2003;40(11):e120. pmid:14627687
- View Article
- PubMed/NCBI
- Google Scholar
3. Carayol J, Bonaïti-Pellié C. Estimating penetrance from family data using a retrospective likelihood when ascertainment depends on genotype and age of onset. Genetic Epidemiology. 2004;27(2):109–117. pmid:15305327
- View Article
- PubMed/NCBI
- Google Scholar
4. Easton D, Bishop D, Ford D, Crockford G. Genetic linkage analysis in familial breast and ovarian cancer: results from 214 families. The Breast Cancer Linkage Consortium. American journal of human genetics. 1993;52(4):678. pmid:8460634
- View Article
- PubMed/NCBI
- Google Scholar
5. Stoppa-Lyonnet D, Laurent-Puig P, Essioux L, Pages S, Ithier G, Ligot L, et al. BRCA1 sequence variations in 160 individuals referred to a breast/ovarian family cancer clinic. Institut Curie Breast Cancer Group. American journal of human genetics. 1997;60(5):1021. pmid:9150149
- View Article
- PubMed/NCBI
- Google Scholar
6. Alarcon F, Bourgain C, Gauthier-Villars M, Planté-Bordeneuve V, Stoppa-Lyonnet D, Bonaïti-Pellié C. PEL: an unbiased method for estimating age-dependent genetic disease risk from pedigree data unselected for family history. Genetic epidemiology. 2009;33(5):379–385. pmid:19089844
- View Article
- PubMed/NCBI
- Google Scholar
7. Elston RC, Stewart J. A general model for the genetic analysis of pedigree data. Human Heredity. 1971;21(6):523–542. pmid:5149961
- View Article
- PubMed/NCBI
- Google Scholar
8. Lalouel JM. GEMINI: a computer program for optimization of general nonlinear functions. University of Hawaii, Population Genetics Laboratory; 1979.
9. Alarcon F, Bonaıti-Pellié C, Harari-Kermadec H. A nonparametric method for penetrance function estimation. Genetic epidemiology. 2009;33:38–44. pmid:18618769
- View Article
- PubMed/NCBI
- Google Scholar
10. Dempster AP, Laird NM, Rubin DB, et al. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal statistical Society. 1977;39(1):1–38.
- View Article
- Google Scholar
11. Koller D, Friedman N. Probabilistic graphical models: principles and techniques. MIT press; 2009.
12. Therneau T. A package for survival analysis in S. R package version 2.37-4. URL http://CRAN.R-project.org/package=survival.Box. 2013;980032:23298–0032.
13. R Core Team. R: A Language and Environment for Statistical Computing; 2015. Available from: https://www.R-project.org/.
14. Guedj M, Robin S, Celisse A, Nuel G. Kerfdr: a semi-parametric kernel-based approach to local false discovery rate estimation. BMC bioinformatics. 2009;10(1):84. pmid:19291295
- View Article
- PubMed/NCBI
- Google Scholar
15. Li CS, Taylor JM, Sy JP. Identifiability of cure models. Statistics & Probability Letters. 2001;54(4):389–395.
- View Article
- Google Scholar
16. Preston DL, Clarkson DB. SURVREG: A program for the interactive analysis of survival regression models. The American Statistician. 1983;37(2):174–174.
- View Article
- Google Scholar
17. Elston RC, George VT, Severtson F. The Elston-Stewart algorithm for continuous genotypes and environmental factors. Human heredity. 1992;42(1):16–27. pmid:1555844
- View Article
- PubMed/NCBI
- Google Scholar
18. Lange K, Boehnke M. Extensions to pedigree analysis. Human Heredity. 1983;33(5):291–301. pmid:6654362
- View Article
- PubMed/NCBI
- Google Scholar
19. Edwards A. The structure of the Polar Eskimo genealogy. Human heredity. 1992;42(4):242–252. pmid:1512006
- View Article
- PubMed/NCBI
- Google Scholar
20. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES. Parametric and nonparametric linkage analysis: a unified multipoint approach. American journal of human genetics. 1996;58(6):1347. pmid:8651312
- View Article
- PubMed/NCBI
- Google Scholar
21. Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM. Mendel: the Swiss army knife of genetic analysis programs. Bioinformatics. 2013;29(12):1568–1570. pmid:23610370
- View Article
- PubMed/NCBI
- Google Scholar
22. Lauritzen SL, Sheehan NA. Graphical models for genetic analyses. Statistical Science. 2003; p. 489–514.
- View Article
- Google Scholar
23. Lauritzen SL. Graphical models. vol. 17. Clarendon Press; 1996.
24. O’Connell JR, Weeks DE. PedCheck: a program for identification of genotype incompatibilities in linkage analysis. The American Journal of Human Genetics. 1998;63(1):259–266. pmid:9634505
- View Article
- PubMed/NCBI
- Google Scholar
25. Fishelson M, Geiger D. Exact genetic linkage computations for general pedigrees. Bioinformatics. 2002;18(suppl 1):S189–S198. pmid:12169547
- View Article
- PubMed/NCBI
- Google Scholar
26. Palin K, Campbell H, Wright AF, Wilson JF, Durbin R. Identity-by-descent-based phasing and imputation in founder populations using graphical models. Genetic epidemiology. 2011;35(8):853–860. pmid:22006673
- View Article
- PubMed/NCBI
- Google Scholar
27. Totir LR, Fernando RL, Abraham J. An efficient algorithm to compute marginal posterior genotype probabilities for every member of a pedigree with loops. Genetics Selection Evolution. 2009;41(1):52.
- View Article
- Google Scholar
28. Therneau TM, Grambsch PM. Modeling survival data: extending the Cox model. Springer Science & Business Media; 2000.
29. Planté-Bordeneuve V, Said G. Familial amyloid polyneuropathy. The Lancet Neurology. 2011;10(12):1086–1097. pmid:22094129
- View Article
- PubMed/NCBI
- Google Scholar
30. Carayol J, Khlat M, Maccario J, Bonaïti-Pellié C. Hereditary non-polyposis colorectal cancer: current risks of colorectal cancer largely overestimated. Journal of Medical Genetics. 2002;39(5):335–339. pmid:12011152
- View Article
- PubMed/NCBI
- Google Scholar
31. Hellman U, Alarcon F, Lundgren HE, Suhr OB, Bonaïti-Pellié C, Planté-Bordeneuve V. Heterogeneity of penetrance in familial amyloid polyneuropathy, ATTR Val30Met, in the Swedish population. Amyloid. 2008;15(3):181–186. pmid:18925456
- View Article
- PubMed/NCBI
- Google Scholar
32. Bonaïti B, Alarcon F, Bonaïti-Pellié C, Planté-Bordeneuve V. Parent-of-origin effect in transthyretin related amyloid polyneuropathy. Amyloid. 2009;16(3):149–150. pmid:19626480
- View Article
- PubMed/NCBI
- Google Scholar
33. Thomas A. GMCheck: Bayesian error checking for pedigreegenotypes and phenotypes. Bioinformatics. 2005;21(14):3187–3188. pmid:15879454
- View Article
- PubMed/NCBI
- Google Scholar
34. Therneau T. Mixed Effects Cox Models; 2015.
35. Rondeau V, Mazroui Y, Gonzalez JR. frailtypack: an R package for the analysis of correlated survival data with frailty models using penalized likelihood estimation or parametrical estimation. Journal of Statistical Software. 2012;47(4):1–28.
- View Article
- Google Scholar

[ref1] 1. Le Bihan C, Moutou C, Brugieres L, Feunteun J, Bonaïti-Pellié C. ARCAD: a method for estimating age-dependent disease risk associated with mutation carrier status from family data. Genet Epidemiol. 1995;12(1):13–25. pmid:7713397
View Article
PubMed/NCBI
Google Scholar

[2] View Article

[3] PubMed/NCBI

[4] Google Scholar

[ref2] 2. Plante-Bordeneuve V, Carayol J, Ferreira A, Adams D, Clerget-Darpoux F, Misrahi M, et al. Genetic study of transthyretin amyloid neuropathies: carrier risks among French and Portuguese families. J Med Genet. 2003;40(11):e120. pmid:14627687
View Article
PubMed/NCBI
Google Scholar

[6] View Article

[7] PubMed/NCBI

[8] Google Scholar

[ref3] 3. Carayol J, Bonaïti-Pellié C. Estimating penetrance from family data using a retrospective likelihood when ascertainment depends on genotype and age of onset. Genetic Epidemiology. 2004;27(2):109–117. pmid:15305327
View Article
PubMed/NCBI
Google Scholar

[10] View Article

[11] PubMed/NCBI

[12] Google Scholar

[ref4] 4. Easton D, Bishop D, Ford D, Crockford G. Genetic linkage analysis in familial breast and ovarian cancer: results from 214 families. The Breast Cancer Linkage Consortium. American journal of human genetics. 1993;52(4):678. pmid:8460634
View Article
PubMed/NCBI
Google Scholar

[14] View Article

[15] PubMed/NCBI

[16] Google Scholar

[ref5] 5. Stoppa-Lyonnet D, Laurent-Puig P, Essioux L, Pages S, Ithier G, Ligot L, et al. BRCA1 sequence variations in 160 individuals referred to a breast/ovarian family cancer clinic. Institut Curie Breast Cancer Group. American journal of human genetics. 1997;60(5):1021. pmid:9150149
View Article
PubMed/NCBI
Google Scholar

[18] View Article

[19] PubMed/NCBI

[20] Google Scholar

[ref6] 6. Alarcon F, Bourgain C, Gauthier-Villars M, Planté-Bordeneuve V, Stoppa-Lyonnet D, Bonaïti-Pellié C. PEL: an unbiased method for estimating age-dependent genetic disease risk from pedigree data unselected for family history. Genetic epidemiology. 2009;33(5):379–385. pmid:19089844
View Article
PubMed/NCBI
Google Scholar

[22] View Article

[23] PubMed/NCBI

[24] Google Scholar

[ref7] 7. Elston RC, Stewart J. A general model for the genetic analysis of pedigree data. Human Heredity. 1971;21(6):523–542. pmid:5149961
View Article
PubMed/NCBI
Google Scholar

[26] View Article

[27] PubMed/NCBI

[28] Google Scholar

[ref8] 8. Lalouel JM. GEMINI: a computer program for optimization of general nonlinear functions. University of Hawaii, Population Genetics Laboratory; 1979.

[ref9] 9. Alarcon F, Bonaıti-Pellié C, Harari-Kermadec H. A nonparametric method for penetrance function estimation. Genetic epidemiology. 2009;33:38–44. pmid:18618769
View Article
PubMed/NCBI
Google Scholar

[31] View Article

[32] PubMed/NCBI

[33] Google Scholar

[ref10] 10. Dempster AP, Laird NM, Rubin DB, et al. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal statistical Society. 1977;39(1):1–38.
View Article
Google Scholar

[35] View Article

[36] Google Scholar

[ref11] 11. Koller D, Friedman N. Probabilistic graphical models: principles and techniques. MIT press; 2009.

[ref12] 12. Therneau T. A package for survival analysis in S. R package version 2.37-4. URL http://CRAN.R-project.org/package=survival.Box. 2013;980032:23298–0032.

[ref13] 13. R Core Team. R: A Language and Environment for Statistical Computing; 2015. Available from: https://www.R-project.org/.

[ref14] 14. Guedj M, Robin S, Celisse A, Nuel G. Kerfdr: a semi-parametric kernel-based approach to local false discovery rate estimation. BMC bioinformatics. 2009;10(1):84. pmid:19291295
View Article
PubMed/NCBI
Google Scholar

[41] View Article

[42] PubMed/NCBI

[43] Google Scholar

[ref15] 15. Li CS, Taylor JM, Sy JP. Identifiability of cure models. Statistics & Probability Letters. 2001;54(4):389–395.
View Article
Google Scholar

[45] View Article

[46] Google Scholar

[ref16] 16. Preston DL, Clarkson DB. SURVREG: A program for the interactive analysis of survival regression models. The American Statistician. 1983;37(2):174–174.
View Article
Google Scholar

[48] View Article

[49] Google Scholar

[ref17] 17. Elston RC, George VT, Severtson F. The Elston-Stewart algorithm for continuous genotypes and environmental factors. Human heredity. 1992;42(1):16–27. pmid:1555844
View Article
PubMed/NCBI
Google Scholar

[51] View Article

[52] PubMed/NCBI

[53] Google Scholar

[ref18] 18. Lange K, Boehnke M. Extensions to pedigree analysis. Human Heredity. 1983;33(5):291–301. pmid:6654362
View Article
PubMed/NCBI
Google Scholar

[55] View Article

[56] PubMed/NCBI

[57] Google Scholar

[ref19] 19. Edwards A. The structure of the Polar Eskimo genealogy. Human heredity. 1992;42(4):242–252. pmid:1512006
View Article
PubMed/NCBI
Google Scholar

[59] View Article

[60] PubMed/NCBI

[61] Google Scholar

[ref20] 20. Kruglyak L, Daly MJ, Reeve-Daly MP, Lander ES. Parametric and nonparametric linkage analysis: a unified multipoint approach. American journal of human genetics. 1996;58(6):1347. pmid:8651312
View Article
PubMed/NCBI
Google Scholar

[63] View Article

[64] PubMed/NCBI

[65] Google Scholar

[ref21] 21. Lange K, Papp JC, Sinsheimer JS, Sripracha R, Zhou H, Sobel EM. Mendel: the Swiss army knife of genetic analysis programs. Bioinformatics. 2013;29(12):1568–1570. pmid:23610370
View Article
PubMed/NCBI
Google Scholar

[67] View Article

[68] PubMed/NCBI

[69] Google Scholar

[ref22] 22. Lauritzen SL, Sheehan NA. Graphical models for genetic analyses. Statistical Science. 2003; p. 489–514.
View Article
Google Scholar

[71] View Article

[72] Google Scholar

[ref23] 23. Lauritzen SL. Graphical models. vol. 17. Clarendon Press; 1996.

[ref24] 24. O’Connell JR, Weeks DE. PedCheck: a program for identification of genotype incompatibilities in linkage analysis. The American Journal of Human Genetics. 1998;63(1):259–266. pmid:9634505
View Article
PubMed/NCBI
Google Scholar

[75] View Article

[76] PubMed/NCBI

[77] Google Scholar

[ref25] 25. Fishelson M, Geiger D. Exact genetic linkage computations for general pedigrees. Bioinformatics. 2002;18(suppl 1):S189–S198. pmid:12169547
View Article
PubMed/NCBI
Google Scholar

[79] View Article

[80] PubMed/NCBI

[81] Google Scholar

[ref26] 26. Palin K, Campbell H, Wright AF, Wilson JF, Durbin R. Identity-by-descent-based phasing and imputation in founder populations using graphical models. Genetic epidemiology. 2011;35(8):853–860. pmid:22006673
View Article
PubMed/NCBI
Google Scholar

[83] View Article

[84] PubMed/NCBI

[85] Google Scholar

[ref27] 27. Totir LR, Fernando RL, Abraham J. An efficient algorithm to compute marginal posterior genotype probabilities for every member of a pedigree with loops. Genetics Selection Evolution. 2009;41(1):52.
View Article
Google Scholar

[87] View Article

[88] Google Scholar

[ref28] 28. Therneau TM, Grambsch PM. Modeling survival data: extending the Cox model. Springer Science & Business Media; 2000.

[ref29] 29. Planté-Bordeneuve V, Said G. Familial amyloid polyneuropathy. The Lancet Neurology. 2011;10(12):1086–1097. pmid:22094129
View Article
PubMed/NCBI
Google Scholar

[91] View Article

[92] PubMed/NCBI

[93] Google Scholar

[ref30] 30. Carayol J, Khlat M, Maccario J, Bonaïti-Pellié C. Hereditary non-polyposis colorectal cancer: current risks of colorectal cancer largely overestimated. Journal of Medical Genetics. 2002;39(5):335–339. pmid:12011152
View Article
PubMed/NCBI
Google Scholar

[95] View Article

[96] PubMed/NCBI

[97] Google Scholar

[ref31] 31. Hellman U, Alarcon F, Lundgren HE, Suhr OB, Bonaïti-Pellié C, Planté-Bordeneuve V. Heterogeneity of penetrance in familial amyloid polyneuropathy, ATTR Val30Met, in the Swedish population. Amyloid. 2008;15(3):181–186. pmid:18925456
View Article
PubMed/NCBI
Google Scholar

[99] View Article

[100] PubMed/NCBI

[101] Google Scholar

[ref32] 32. Bonaïti B, Alarcon F, Bonaïti-Pellié C, Planté-Bordeneuve V. Parent-of-origin effect in transthyretin related amyloid polyneuropathy. Amyloid. 2009;16(3):149–150. pmid:19626480
View Article
PubMed/NCBI
Google Scholar

[103] View Article

[104] PubMed/NCBI

[105] Google Scholar

[ref33] 33. Thomas A. GMCheck: Bayesian error checking for pedigreegenotypes and phenotypes. Bioinformatics. 2005;21(14):3187–3188. pmid:15879454
View Article
PubMed/NCBI
Google Scholar

[107] View Article

[108] PubMed/NCBI

[109] Google Scholar

[ref34] 34. Therneau T. Mixed Effects Cox Models; 2015.

[ref35] 35. Rondeau V, Mazroui Y, Gonzalez JR. frailtypack: an R package for the analysis of correlated survival data with frailty models using penalized likelihood estimation or parametrical estimation. Journal of Statistical Software. 2012;47(4):1–28.
View Article
Google Scholar

[112] View Article

[113] Google Scholar

Figures

Abstract

Introduction

Methods

The Model

Genetic part.

Survival part.

Accounting for covariates.

The Expectation Maximization algorithm

E-step.

M-step.

Practical implementation.

1 Validation on simulated datasets

2 Application to the hATTR

3 Discussion

Supporting information

S1 File. R source code demo.

S2 File. CPP-recueil-data.

Acknowledgments

References