
Why Cohen’s Kappa should be avoided as performance measure in classification

Abstract

We show that Cohen’s Kappa and the Matthews Correlation Coefficient (MCC), two widely used measures of performance in multi-class classification, are correlated in most situations, although they can differ in others. Indeed, although the two coincide in the symmetric case, we consider different unbalanced situations in which Kappa exhibits an undesired behaviour, i.e. a worse classifier gets a higher Kappa score, differing qualitatively from that of MCC. The debate about the incoherent behaviour of Kappa revolves around the convenience, or not, of using a relative metric, which makes the interpretation of its values difficult. We extend these concerns by showing that its pitfalls can go even further. Through experimentation, we present a novel approach to this topic. We carry out a comprehensive study that identifies a scenario in which the contradictory behaviour between MCC and Kappa emerges. Specifically, we find that when the entropy of the elements outside the diagonal of the confusion matrix associated with a classifier decreases to zero, the discrepancy between Kappa and MCC rises, pointing to an anomalous performance of the former. We believe that this finding disqualifies Kappa from being used, in general, as a performance measure to compare classifiers.

Introduction

Classification is one of the cornerstones of supervised machine learning. In parallel to the development of different methodologies for constructing classifiers, the process of evaluating classifiers in order to compare them and choose the best among those available has caught the attention of researchers.

The introduction of an adequate performance measure for classifiers is a subject not yet settled (see [1]-[3]), and different metrics have been introduced. Some measures arise naturally in the binary case, such as Accuracy, Sensitivity, Specificity and the Area Under the ROC Curve (AUC), among others, but not all of them extend well to the multi-class setting.

One measure that does extend is Accuracy (i.e. the fraction of well-predicted cases over the total), which seems the most natural measure and has been used for decades. Notwithstanding, Accuracy is not an effective measure since, among other things, it takes into account neither the distribution of the misclassifications among classes nor the marginal distributions. Other more subtle measures have been introduced in the multi-class setting to address this issue, improving efficiency and class discrimination power.

We will focus our attention on the Matthews Correlation Coefficient (MCC) and Cohen’s Kappa. The former was introduced in the binary setting by Matthews ([4]) and generalized to the multi-class case in [5], and is commonly used as a reference performance measure, especially for unbalanced data sets, in different fields such as, for example, bioinformatics (see [5]-[7]). On the other hand, Kappa is a traditional measure originally designed to quantify the agreement between two judges, based on Accuracy but corrected for chance agreement. At present, its use is not limited to medicine or psychology (see for instance [8] and [9]); it is also widely used in other fields such as ecology ([10] and [11]), neuroscience ([12]) and machine learning, where it is used to evaluate the agreement between the actual classes and those assigned by a classifier. In the classification literature, the discussion on Kappa is mostly focused on its suitability compared to other performance metrics; for example, in [1] Kappa has been considered jointly with 17 other performance metrics in several scenarios.

It is not an overstatement to say that Kappa is one of the most widespread measures, in use across several fields and disciplines. Nevertheless, some authors, including the introducer of the Kappa statistic himself, Jacob Cohen, warned that Kappa could be inadequate in different circumstances, specifically when an imbalanced distribution of classes is involved, i.e. when the marginal probability of one class is much greater (or smaller) than that of the others (leaving aside the literature discussed below, which we deal with more closely, see also [13]-[17]). According to them, problems arise in such situations because it is not clear how the hypothetical probability of chance agreement should be defined. In [18] and [19], the so-called Kappa paradox is described. Roughly speaking, the Kappa paradox arises because, for a fixed agreement between judges, the Kappa statistic penalizes judges with similar marginals compared with judges with different ones. The authors show several examples where this happens.

This same obstacle is extensively studied in [20]-[22]. In the latter, two separate causes of the paradox are considered: (1) the prevalence paradox, which arises from the fact that when the hypothetical probability of chance agreement among raters is high, even high values of the relative observed agreement (which is identical to Accuracy) produce low values of Kappa, and (2) the bias paradox, which is the consequence of the fact that imbalanced marginal distributions produce higher scores of Kappa. The authors claim that reporting a single agreement coefficient makes interpretation and comparison difficult. Hence, they suggest a version of Kappa corrected for bias and prevalence (PABAK), which should be used together with Kappa.

Similar conclusions emerge from [23], where the authors claim that Kappa is a relative measure of agreement, which is an inadequate characteristic for assessment in a clinical setting, specifically if a high agreement among experts leads to lower values of Kappa. Instead, they suggest using the proportion of specific agreement ([24]), which divides the agreement into a positive and a negative rate, allowing professionals to have an absolute measure and, at the same time, information about the marginal distributions. Regarding the effect of estimating the chance agreement, Albatineh et al. ([25]) analysed 28 different similarity measures for clustering purposes; they suggest adding a correction for chance, in a specific family of coefficients, which makes some of them equivalent, regardless of how expectations are calculated. This work is extended by Warrens in [26], where a more in-depth analysis is presented and several indices are generalized: Cohen’s kappa ([27]), Scott’s pi ([28]), Mak’s rho ([29]), Goodman and Kruskal’s lambda ([30]), and Hamann’s eta ([31]).

On the other hand, several authors defend Kappa as a useful measure of agreement when its limitations are taken into account. For example, in [32] the authors defend the use of Kappa in a previous study, and note that it is a useful measure if the marginal distributions are considered. A similar conclusion was reached in [33], where it is said that although Kappa is not suitable in certain circumstances, it is better than the raw proportion. In [34] the work of [22] is expanded and the pitfalls of Kappa in measuring agreement between judgments are explained, concluding that if it is used and interpreted properly, the Kappa coefficient provides valuable information. As in previous works, they also propose using corrected versions of the coefficient. In [16] the author argues that in the case of dichotomous variables Kappa is satisfactory (although it is not for other cases); as we show in the present work, even in the binary case Kappa can exhibit unexpected behaviour. Finally, some authors ([34]) do not agree with the use of weighted versions of the statistic such as PABAK, and suggest instead selecting marginal distributions that are similar.

In general, the use of Kappa is not only widespread but accepted, and its pitfalls are overcome by considering the marginal distributions and using weighted alternatives such as, for example, the one suggested by Cohen ([15]), PABAK, or other alternatives ([35] and [36]).

Despite the vast amount of existing literature in the fields of medicine and psychology pointing out the threats of Kappa, when classification methods in machine learning experienced their boom, Cohen’s Kappa was adopted as a reliable performance metric. Indeed, it is incorporated in the most widely used software packages, such as SciKit Learn [37] for Python and Caret [38] for R. What is more, in recent studies such as [39]-[42] and [12], Kappa is still used as if it were a reliable performance metric. In fact, the literature reviewed recognizes the difficulty clinical professionals have in interpreting Kappa because it is a relative measure, that is, Kappa by itself is not enough to know whether two professionals agree or disagree. This does not seem to be a problem in machine learning classification, because different methods are always compared against the same ground truth, under the same marginal distributions. Therefore, it can be argued that we are not interested in the value of Kappa itself (as clinicians are), but in the differences among the classifier/ground-truth pairs, so Kappa would be a reliable metric for this task. However, the reality is that this is not always the case. As we show, there are scenarios in which, given the same ground truth, a better classifier can obtain a lower value of Kappa. It is important to mention that some authors also highlight the problems associated with Kappa when it is used as a performance metric in classification (see for instance [43]-[45]), although they do not perform an exhaustive analysis like the one presented here.

Clearly, marginal distributions seem to play a key role in the problems surrounding Kappa. However, there is a lack of a consistent and satisfactory description of the cases in which the unwanted behaviour of Kappa appears, and how this affects its use as a performance metric for classification.

In our paper, we deepen the study of the pitfalls discussed above by analysing in detail the unwanted behaviour of Kappa from a novel perspective. Our point of view is the identification of situations in which discrepancies with respect to the behaviour of MCC become evident, with the two measures moving in opposite directions. Indeed, we study varied scenarios of misclassification in settings with different marginal probabilities of the categories, and how these scenarios affect the statistics Kappa and MCC, by analysing both the asymmetry and the entropy of the confusion matrix. Considering Kappa as a relative measure of agreement, we provide a mathematical framework to understand the problems associated with it when dealing with extremely unbalanced marginal distributions, which are frequent in machine learning problems.

Our goal is to present a systematic study, both analytical and by means of empirical experimentation, to compare the two performance measures. For that, we investigate the similarities and differences in the behaviour of MCC and Kappa in different scenarios. In some of them they are strongly correlated, and we show some mathematical relations and study some limit cases. But in others they exhibit very different behaviour, with that of Kappa running contrary to common sense, to the point that we join the detractors of its use for the assessment of classifiers. This paper is an attempt to shed some light on the identification of the latter.

The paper is organized as follows: first, we introduce some definitions and fix some notation. Next, we prove that if the confusion matrix, which allows visualization of the performance of a classifier, is symmetric, then Kappa and MCC coincide. Each column of the confusion matrix represents the cases in a predicted class, while each row represents the cases in an actual class. In the sequel, we study in some detail the binary case, in which the classes are named “positive” and “negative” and the confusion matrix has the general form $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$, where a = true positive, b = false negative, c = false positive and d = true negative, splitting the study according to whether c = 0, the scenario in which Kappa has a behaviour consistent with that of MCC, or c > 0, in which the opposite happens. For each of these cases, we consider particular sub-cases and study them in depth. We also consider a pathological multi-class unbalanced situation, in which one of the classes is much more common than the others and is mainly misclassified (the family of confusion matrices ZA introduced in [2]). We also perform empirical experimentation in dimension 3, considering some families of confusion matrices, and finish with a few concluding words.

Definitions and notations

Given a generic matrix M, let $M^T$ denote its transpose, that is, the matrix obtained from M by interchanging columns and rows. The same notation applies to vectors, which by default are column vectors. We say that matrix Q is equivalent to M, and denote it by $Q \sim M$, if Q can be obtained from M by multiplying it by a positive constant.

Classification

Classification consists of assigning a case to a class (category or label) on the basis of a known set of features or characteristics. This is usually done by a classifier learned from a training dataset. From the validation process of the classifier with a testing dataset, we obtain a confusion matrix C, which takes into account the actual and predicted classes of the cases in the testing dataset. To fix ideas, assume that there are N different classes labeled {1, …, N}. Then, $C = (C_{ij})_{i,j=1,\dots,N}$ is an N × N matrix defined by: $C_{ij}$ is the number of cases in the testing dataset that belong to class i and have been assigned to class j by the classifier. Note that $C_{ij} \ge 0$. Let S denote the sum of all the elements of C (the number of cases in the testing dataset), that is, $S = \sum_{i,j=1}^{N} C_{ij}$. In the binary case N = 2, to abbreviate notation we preferably write $C = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$, as previously mentioned in the Introduction.

In the context of classification, Accuracy (Acc for brief) is the fraction of correctly classified cases in the testing dataset, that is, $\mathrm{Acc} = \frac{1}{S}\sum_{i=1}^{N} C_{ii}$. This performance measure is one of the most intuitive, and it extends naturally from binary to multi-class classification. Acc mainly considers the diagonal of the confusion matrix, and does not take into account how the off-diagonal elements, corresponding to misclassifications, are distributed.
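As a minimal illustration (not code from the paper), the following Python sketch computes S and Acc from a hypothetical confusion matrix laid out as described above, with rows as actual classes and columns as predicted classes.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = actual classes, columns = predicted classes.
C = np.array([[50,  3,  2],
              [ 4, 40,  6],
              [ 1,  5, 39]])

S = C.sum()              # total number of cases in the testing dataset
acc = np.trace(C) / S    # fraction of correctly classified cases (sum of the diagonal over S)
print(f"S = {S}, Acc = {acc:.3f}")   # S = 150, Acc = 0.860
```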

Other more subtle performance measures based on the confusion matrix have been introduced to compare classifiers. We here compare two of the most commonly used. Note that these measures are invariant for equivalent confusion matrices.

Matthews correlation coefficient

The binary case.

The Matthews Correlation Coefficient (MCC) was first introduced in the binary case by B.W. Matthews [4] to assess the performance of protein secondary structure prediction, as the ϕ-coefficient, which is the measure of association obtained by discretization of Pearson’s correlation coefficient for two binary vectors. That is, in the binary case, MCC = ϕ = ρ(x, y), where $x = (x_1, \dots, x_S)^T$ and $y = (y_1, \dots, y_S)^T$ are the S-dimensional binary vectors defined in this way: $x_i = 1$ if case i actually belongs to the positive class and $x_i = 0$ otherwise, while $y_i = 1$ if case i is predicted as positive and $y_i = 0$ otherwise, and ρ is Pearson’s correlation coefficient defined by (1) $\rho(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}$, where, as usual, $\bar{x} = \frac{1}{S}\sum_{i=1}^{S} x_i$ and $\bar{y} = \frac{1}{S}\sum_{i=1}^{S} y_i$, and Cov(x, y) denotes the statistical covariance of x and y, that is, $\mathrm{Cov}(x, y) = \frac{1}{S}\sum_{i=1}^{S}(x_i - \bar{x})(y_i - \bar{y})$, and when x = y, Cov(x, x) = Var(x) is the statistical (uncorrected) variance of x. Note that the square of the ϕ-coefficient is related to the chi-squared statistic for the 2 × 2 contingency table, χ², by means of $\phi^2 = \chi^2/S$. Then, using some algebra and taking into account that, by definition of the vectors x and y, the elements of the confusion matrix are $a = \#\{i : x_i = 1, y_i = 1\}$, $b = \#\{i : x_i = 1, y_i = 0\}$, $c = \#\{i : x_i = 0, y_i = 1\}$ and $d = \#\{i : x_i = 0, y_i = 0\}$, we obtain that $\bar{x} = (a + b)/S$ and $\bar{y} = (a + c)/S$, and then using $x_i^2 = x_i$ and $y_i^2 = y_i$ for any i = 1, …, S, we can rewrite (1) as (2) $\mathrm{MCC}(C) = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$.
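As a quick numerical check of the equivalence between (1) and (2) (a sketch under the indicator encoding stated above, not the authors’ code, with hypothetical counts a, b, c, d), one can build the two binary vectors from a 2 × 2 confusion matrix and verify that their Pearson correlation equals the closed form (2).

```python
import numpy as np

a, b, c, d = 40, 10, 5, 45   # hypothetical true positives, false negatives, false positives, true negatives

# Indicator vectors over the S = a + b + c + d cases:
# x_i = 1 if case i is actually positive, y_i = 1 if case i is predicted positive.
x = np.array([1] * a + [1] * b + [0] * c + [0] * d)
y = np.array([1] * a + [0] * b + [1] * c + [0] * d)

phi = np.corrcoef(x, y)[0, 1]                                           # Pearson correlation, Eq. (1)
mcc = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))  # closed form, Eq. (2)
print(phi, mcc)   # both ≈ 0.70
```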

The multi-class case.

In [5], the problem of evaluating the prediction of RNA secondary structure in cases where some predicted pairs go into the category of “unknown”, due to lack of reliability, is considered. By introducing an extended correlation coefficient that applies to any number of categories, the author facilitates addressing the problem of predicting base pairs of RNA secondary structure as a three-category problem instead of artificially forcing it into the binary case by fixing one of the categories and then considering which cases belong and which do not belong to that category, which leads to a loss of information and a suboptimal procedure. Indeed, MCC is generalized in [5] to classification with N > 2 classes by considering the expected covariance of all categories and constructing the following extension of Pearson’s correlation coefficient ρ from a pair of binary vectors to a pair of binary matrices: (3) $R_K = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Cov}(X, X)\,\mathrm{Cov}(Y, Y)}}$, where, if X and Y are two S × N matrices, $\mathrm{Cov}(X, Y)$ is defined as the average of the N covariances between the different pairs of S-dimensional binary vectors given by the same column in matrices X and Y, that is, $\mathrm{Cov}(X, Y) = \frac{1}{N}\sum_{k=1}^{N}\mathrm{Cov}(x_k, y_k)$, where $x_k = (X_{1k}, \dots, X_{Sk})^T$ and $y_k = (Y_{1k}, \dots, Y_{Sk})^T$ are the columns k of matrices X and Y, respectively. Therefore, by defining the S × N matrices $X = (X_{ij})_{i,j}$ and $Y = (Y_{ij})_{i,j}$ in the following way: $X_{ij} = 1$ if case i actually belongs to class j and $X_{ij} = 0$ otherwise, and $Y_{ij} = 1$ if case i is assigned to class j by the classifier and $Y_{ij} = 0$ otherwise, for i = 1, …, S and j = 1, …, N, we finally introduce the multi-class extension by $\mathrm{MCC}(C) = R_K$, and by using some algebra and that, by definition of the matrices X and Y, $\sum_{i=1}^{S} X_{ik} Y_{ik} = C_{kk}$, $\sum_{i=1}^{S} X_{ik} = C_{k\cdot}$ and $\sum_{i=1}^{S} Y_{ik} = C_{\cdot k}$ (where $C_{k\cdot} = \sum_{j} C_{kj}$ and $C_{\cdot k} = \sum_{j} C_{jk}$ denote the row and column sums of C), we obtain the known expression (4) $\mathrm{MCC}(C) = \frac{S\sum_{k=1}^{N} C_{kk} - \sum_{k=1}^{N} C_{k\cdot}\,C_{\cdot k}}{\sqrt{S^2 - \sum_{k=1}^{N} C_{k\cdot}^2}\,\sqrt{S^2 - \sum_{k=1}^{N} C_{\cdot k}^2}}$.

We give below a sketch of the proof of the equivalence between (3) and (4). Indeed, the numerator of (3) can be developed as follows: $\mathrm{Cov}(X, Y) = \frac{1}{N}\sum_{k=1}^{N}\Big(\frac{1}{S}\sum_{i=1}^{S} X_{ik} Y_{ik} - \bar{x}_k\,\bar{y}_k\Big) = \frac{1}{N S^2}\Big(S\sum_{k=1}^{N} C_{kk} - \sum_{k=1}^{N} C_{k\cdot}\,C_{\cdot k}\Big)$, using that $\sum_{i=1}^{S} X_{ik} Y_{ik} = C_{kk}$, which is a consequence of the fact that $X_{ik} Y_{ik} = 1$ if and only if case i actually belongs to class k and is also assigned to class k (note that by definition of Y, $Y_{ik} \in \{0, 1\}$), and analogously with $\sum_{i=1}^{S} X_{ik} = C_{k\cdot}$ and $\sum_{i=1}^{S} Y_{ik} = C_{\cdot k}$, so that $\bar{x}_k = C_{k\cdot}/S$ and $\bar{y}_k = C_{\cdot k}/S$.

We also used that $X_{ik}^2 = X_{ik}$, and that $Y_{ik}^2 = Y_{ik}$. Now we develop the term in the denominator of (3) corresponding to X (an analogous development would be obtained for Y): $\mathrm{Cov}(X, X) = \frac{1}{N}\sum_{k=1}^{N}\big(\bar{x}_k - \bar{x}_k^2\big) = \frac{1}{N S^2}\Big(S\sum_{k=1}^{N} C_{k\cdot} - \sum_{k=1}^{N} C_{k\cdot}^2\Big) = \frac{1}{N S^2}\Big(S^2 - \sum_{k=1}^{N} C_{k\cdot}^2\Big)$, since $\sum_{k=1}^{N} C_{k\cdot} = S$. Taking the quotient in (3), the common factor $1/(N S^2)$ cancels out and (4) follows.

Note that in the binary case, expression (4) matches (2). Indeed, when N = 2, the numerator of (4) can be written as $2(C_{11} C_{22} - C_{21} C_{12}) = 2(ad - bc)$, while the first term in the denominator is $\sqrt{S^2 - (a+b)^2 - (c+d)^2} = \sqrt{2(a+b)(c+d)}$, and the second one coincides with $\sqrt{2(a+c)(b+d)}$.

Software provided by the author of [5], which allows these calculations to be performed easily, is available at http://rk.kvl.dk/.
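As a hedged sketch (our own illustration, not the software referenced above), expression (4) can be computed directly from a confusion matrix with a few lines of Python; the hypothetical example matrix is also unrolled into label vectors to cross-check the result against scikit-learn’s matthews_corrcoef.

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_from_confusion(C):
    """Multi-class MCC of a confusion matrix C (rows = actual, columns = predicted), as in Eq. (4)."""
    C = np.asarray(C, dtype=float)
    S, trace = C.sum(), np.trace(C)
    row, col = C.sum(axis=1), C.sum(axis=0)      # C_{k.} and C_{.k}
    num = S * trace - row @ col
    den = np.sqrt(S**2 - row @ row) * np.sqrt(S**2 - col @ col)
    return num / den

C = np.array([[50,  3,  2],
              [ 4, 40,  6],
              [ 1,  5, 39]])

# Cross-check: unroll the confusion matrix into (y_true, y_pred) label vectors.
i, j = np.nonzero(C)
y_true, y_pred = np.repeat(i, C[i, j]), np.repeat(j, C[i, j])
print(mcc_from_confusion(C), matthews_corrcoef(y_true, y_pred))   # both ≈ 0.79
```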

Cohen’s Kappa

Cohen’s Kappa statistic, or simply Kappa (henceforth also denoted by κ), was originally introduced by J. A. Cohen [27] in the field of psychology as a measure of agreement between two judges, and it has later been used in the literature as a performance measure in classification, as for example in [46]. More concretely, Kappa is used in classification as a measure of agreement between the observed and the predicted or inferred classes for the cases in a testing dataset. Its definition is: (5) $\kappa(C) = \frac{\mathrm{Acc} - P_e}{1 - P_e}$, where $P_e$ is the hypothetical probability of chance agreement, using the values of the confusion matrix to estimate the probabilities of randomly choosing each class, that is, $P_e = \frac{1}{S^2}\sum_{i=1}^{N} C_{i\cdot}\,C_{\cdot i}$, where, as usual, we use the notations $C_{i\cdot} = \sum_{j=1}^{N} C_{ij}$ (the sum of row i) and $C_{\cdot j} = \sum_{i=1}^{N} C_{ij}$ (the sum of column j).
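A minimal sketch of (5) in Python, computing Kappa directly from a confusion matrix (the example matrix is the same hypothetical one used above), could look as follows.

```python
import numpy as np

def kappa_from_confusion(C):
    """Cohen's Kappa of a confusion matrix C (rows = actual, columns = predicted), as in Eq. (5)."""
    C = np.asarray(C, dtype=float)
    S = C.sum()
    acc = np.trace(C) / S                          # observed agreement (Accuracy)
    p_e = (C.sum(axis=1) @ C.sum(axis=0)) / S**2   # chance agreement from the marginals
    return (acc - p_e) / (1.0 - p_e)

C = np.array([[50,  3,  2],
              [ 4, 40,  6],
              [ 1,  5, 39]])
print(kappa_from_confusion(C))   # ≈ 0.79 for this nearly symmetric hypothetical matrix
```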

Both MCC and Kappa attain their theoretical maximum value of +1 when classification is perfect; the larger the metric value, the better the classifier performance. MCC ranges between −1 and +1, while Kappa does not in general, although it does in the cases considered in this work. Moreover, it is straightforward to see that both are symmetric, that is, $\kappa(C^T) = \kappa(C)$ and $\mathrm{MCC}(C^T) = \mathrm{MCC}(C)$.

The symmetric case

In the case of a symmetric confusion matrix, it is known that the Kappa statistic is equivalent to Scott’s pi ([28], [47]), which is a special case of Krippendorff’s alpha ([48]). Scott’s pi is a statistic with the same structure as Kappa but which differs from it in the definition of $P_e$. Hereunder, we will show that if C is a symmetric matrix, Kappa and MCC are not only consistent with each other but coincide exactly. Although this result seems to be known, we could not find a reference for it and therefore we provide its proof here.

Proposition 1 Let $C = (C_{ij})_{i,j=1,\dots,N}$ be a symmetric confusion matrix in the general multi-class setting. That is, $C = C^T$. Then, $\kappa(C) = \mathrm{MCC}(C)$.

Proof. By (4) and taking into account that $C_{ij} = C_{ji}$ by symmetry, so that $C_{k\cdot} = C_{\cdot k}$, we can write (6) $\mathrm{MCC}(C) = \frac{S\sum_{k} C_{kk} - \sum_{k} C_{k\cdot}^2}{S^2 - \sum_{k} C_{k\cdot}^2}$. On the other hand, by symmetry we can write $P_e = \frac{1}{S^2}\sum_{k} C_{k\cdot}\,C_{\cdot k} = \frac{1}{S^2}\sum_{k} C_{k\cdot}^2$, and therefore $\kappa(C) = \frac{\mathrm{Acc} - P_e}{1 - P_e} = \frac{S\sum_{k} C_{kk} - \sum_{k} C_{k\cdot}^2}{S^2 - \sum_{k} C_{k\cdot}^2}$, which coincides with MCC(C) by (6).
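A quick numerical confirmation of Proposition 1 on a hypothetical symmetric confusion matrix (a sketch re-implementing (4) and (5) inline so it is self-contained) is given below.

```python
import numpy as np

def mcc(C):
    S, t = C.sum(), np.trace(C)
    r, c = C.sum(axis=1), C.sum(axis=0)
    return (S * t - r @ c) / (np.sqrt(S**2 - r @ r) * np.sqrt(S**2 - c @ c))

def kappa(C):
    S, t = C.sum(), np.trace(C)
    p_e = (C.sum(axis=1) @ C.sum(axis=0)) / S**2
    return (t / S - p_e) / (1 - p_e)

# Hypothetical symmetric confusion matrix (C = C^T)
C = np.array([[30.,  4.,  2.],
              [ 4., 25.,  6.],
              [ 2.,  6., 20.]])
print(kappa(C), mcc(C), np.isclose(kappa(C), mcc(C)))   # identical values, True
```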

The binary case

Let C be a generic confusion matrix in dimension 2, $C = \begin{pmatrix} a & b \\ c & d \end{pmatrix}$. By (2) and (5), we have that $\mathrm{MCC}(C) = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}$ and $\kappa(C) = \frac{2(ad - bc)}{(a+b)(b+d) + (a+c)(c+d)}$, and it turns out that $\kappa(C)$ is the harmonic mean of α and β, while MCC(C) is (up to the sign of $ad - bc$) their geometric mean, being $\alpha = \frac{ad - bc}{(a+b)(b+d)}$ and $\beta = \frac{ad - bc}{(a+c)(c+d)}$. That is, $\kappa(C) = \frac{2\alpha\beta}{\alpha + \beta}$ and $|\mathrm{MCC}(C)| = \sqrt{\alpha\beta}$. As a direct consequence of the known relationship between these two means, we have that in the binary case: (7) $|\kappa(C)| \le |\mathrm{MCC}(C)|$.
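The mean relations above can be checked numerically with a short sketch (hypothetical entries a, b, c, d; α and β as written in the paragraph above).

```python
import numpy as np

a, b, c, d = 40., 10., 5., 45.           # hypothetical binary confusion matrix entries

alpha = (a * d - b * c) / ((a + b) * (b + d))
beta  = (a * d - b * c) / ((a + c) * (c + d))

mcc   = (a * d - b * c) / np.sqrt((a + b) * (c + d) * (a + c) * (b + d))
kappa = 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))

print(np.isclose(kappa, 2 * alpha * beta / (alpha + beta)))   # harmonic mean of alpha, beta
print(np.isclose(abs(mcc), np.sqrt(alpha * beta)))            # geometric mean (in absolute value)
print(abs(kappa) <= abs(mcc))                                 # inequality (7)
```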

Now we delve a little deeper into the relationship between the two performance measures. By the invariance property for equivalent confusion matrices, we can split the study of the binary case into two different scenarios: c = 0 and c = 1 (the latter representing c > 0). These two cases cover all the possibilities, determining a partition of the set of binary confusion matrices into two subsets with clearly differentiated behaviour. As we will see next, when c = 0 there is agreement between MCC and Kappa. What is more, MCC and Kappa are linked by means of a functional relationship (see Proposition 2 below) that easily shows the monotone relationship between them, which implies that when one of them grows or decreases, the other does the same, that is, they behave consistently. On the contrary, when c = 1 an important disagreement between the two measures becomes apparent in different particular scenarios (see Corollaries 4, 5 and 6). Indeed, in all of them it is shown that while MCC monotonically decreases as the task done by the classifier gets worse, Kappa does not.

Moreover, as the row sums are the actual numbers of cases in the testing dataset belonging to each class, we assume that they are both strictly positive, that is, a + b > 0 and c + d > 0. We also must ensure that MCC can be calculated, i.e., that we do not divide by zero. For that, the column sums must also be strictly positive, that is, we additionally assume that a + c > 0 and b + d > 0.

The c = 0 case: Agreement between MCC and Kappa

This case corresponds to perfect classification of the negative class, since there are no cases of the negative class in the testing dataset that have been classified as belonging to the positive class. Then, we assume a > 0 and d > 0. Moreover, we assume b > 0, since b = 0 corresponds to the symmetric case already studied in the previous section, in which $\kappa(C) = \mathrm{MCC}(C)$. We use the notation $C_0 = \begin{pmatrix} a & b \\ 0 & d \end{pmatrix}$. We have, then, $\mathrm{MCC}(C_0) = \sqrt{\frac{ad}{(a+b)(b+d)}}$ and $\kappa(C_0) = \frac{2ad}{(a+b)(b+d) + ad}$.

We will show that in this case there is agreement between the behaviour of the two measures. Indeed, they are linked by means of a functional relationship, as can be seen in the next proposition.

Proposition 2 $\kappa(C_0) = \frac{2\,\mathrm{MCC}(C_0)^2}{1 + \mathrm{MCC}(C_0)^2}$ and the following properties hold:

  1. Since MCC(C0) > 0, is a monotonically increasing function of MCC(C0), so they are consistent performance measures.
  2. $\kappa(C_0) \le \mathrm{MCC}(C_0)$.
  3. The maximum distance between them is achieved when MCC(C0) ≈ 0.3, and is ≈ 0.13.

Moreover,

  • Fixed a, d, $\lim_{b\to+\infty} \mathrm{MCC}(C_0) = \lim_{b\to+\infty} \kappa(C_0) = 0$, which corresponds to a scenario in which the negative class is underrepresented and cases actually in the positive class are mainly misclassified. On the other hand, $\lim_{b\to 0} \mathrm{MCC}(C_0) = \lim_{b\to 0} \kappa(C_0) = 1$, corresponding to perfect classification (see Fig 1(a)).
  • Fixed b, d, $\lim_{a\to+\infty} \mathrm{MCC}(C_0) = \sqrt{\frac{d}{b+d}}$ and $\lim_{a\to+\infty} \kappa(C_0) = \frac{2d}{b+2d}$, which corresponds to a scenario in which the negative class is underrepresented but cases actually in the positive class are mainly well classified. Note that as b → 0, both $\lim_{a\to+\infty} \kappa(C_0)$ and $\lim_{a\to+\infty} \mathrm{MCC}(C_0)$ tend to 1.
    On the other hand, $\lim_{a\to 0} \mathrm{MCC}(C_0) = \lim_{a\to 0} \kappa(C_0) = 0$, corresponding to complete misclassification of the positive class (see Fig 1(b)).
  • The case with a, b fixed, considering MCC(C0) and $\kappa(C_0)$ as functions of d, is symmetric to the previous one, and is therefore omitted.
Fig 1. Agreement between MCC and Kappa for C0.

Unbalanced case with underrepresentation of the negative class, which is perfectly classified. (a) With a = d = 1, as function of b: positive class mainly misclassified. (b) With b = d = 1 as function of a: positive class mainly well classified.

https://doi.org/10.1371/journal.pone.0222916.g001
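As a quick numerical check of Proposition 2 (a sketch using the closed forms for $C_0$ written above, with randomly drawn hypothetical entries), the functional relation and the inequality $\kappa(C_0) \le \mathrm{MCC}(C_0)$ can be verified as follows.

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    a, b, d = rng.integers(1, 100, size=3)        # random hypothetical C0 = [[a, b], [0, d]]
    mcc   = np.sqrt(a * d / ((a + b) * (b + d)))
    kappa = 2 * a * d / ((a + b) * (b + d) + a * d)
    # Functional relationship of Proposition 2 and the inequality kappa <= MCC:
    print(np.isclose(kappa, 2 * mcc**2 / (1 + mcc**2)), kappa <= mcc)
```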

The c = 1 case: Disagreement between MCC and Kappa

This case corresponds to imperfect classification of the negative class, since there is at least one case in the testing dataset belonging to this class that has been classified as being in the positive class. We assume b > 0, since if b = 0 we are back in the previous situation, by the symmetry of MCC and Kappa. Although b = 1 corresponds to a symmetric confusion matrix, already studied, we include it in this section for the sake of completeness. We use the notation $C_1 = \begin{pmatrix} a & b \\ 1 & d \end{pmatrix}$. Then, $\mathrm{MCC}(C_1) = \frac{ad - b}{\sqrt{(a+b)(1+d)(a+1)(b+d)}}$ and $\kappa(C_1) = \frac{2(ad - b)}{(a+b)(b+d) + (a+1)(1+d)}$.

Proposition 3 If .

If .

Otherwise,

Next we consider some particular scenarios of this case that should be explored.

  1. a = d > 0.
    We consider the confusion matrix $\begin{pmatrix} a & b \\ 1 & a \end{pmatrix}$. Fixed a > 0, if b > 1, the negative class is underrepresented and the positive class is mainly misclassified, while if b < 1, say b = 1/h with h > 1, the matrix is equivalent to $\begin{pmatrix} ah & 1 \\ h & ah \end{pmatrix}$, which is a confusion matrix that corresponds to underrepresentation of the positive class while it is mainly well classified (if b → 0, which is equivalent to h → +∞). Then, $\mathrm{MCC} = \frac{a^2 - b}{(a+b)(1+a)}$ and $\kappa = \frac{2(a^2 - b)}{(a+b)^2 + (1+a)^2}$.
    From these expressions and Proposition 3, we obtain:
    Corollary 4 If .
    If .
    Otherwise, where
    Fixed a > 0, $\lim_{b\to+\infty} \kappa = 0$, while $\lim_{b\to+\infty} \mathrm{MCC} = -\frac{1}{1+a}$ and MCC, as a function of b, is monotonically decreasing when b increases, which agrees with the intuition, since when b monotonically increases the task done by the classifier is clearly getting worse, while $\kappa$ is not. Indeed, fixed a > 0, $\kappa$ has a global minimum at b = b0 with $b_0 = a^2 + (1+a)\sqrt{a^2+1}$.
    See Fig 2 to observe the behaviour of MCC and Kappa fixed a = 0.2, as function of b.
    Remark 1 Corollary 4 explains the behaviour of MCC and Kappa for a confusion matrix equivalent to $\begin{pmatrix} a & b \\ 1 & a \end{pmatrix}$, according to the values of a = “true positive” = “true negative”, and b = “false negative”/“false positive”. In particular, fixed “true positive” = “true negative” and “false positive”, we observe a contradictory behaviour between these two performance measures as b increases. Indeed, as “false negative”/“false positive” increases (implying that the negative class is underrepresented, and the positive class is mainly misclassified), MCC monotonically decreases, which is reasonable, but Kappa does not. In fact, Kappa decreases for low values of b (b < b0) but increases otherwise. This unreasonable behaviour of Kappa goes in the direction of the thesis defended in this work. Fig 2 graphically shows this fact for the particular case a = 0.2, corresponding to a confusion matrix equivalent to $\begin{pmatrix} 0.2 & b \\ 1 & 0.2 \end{pmatrix}$; a short numerical sketch of this scenario is also given right after this list.
    Case b > 1, with a = 1, corresponds to matrix ZA with A = b and dimension N = 2, which is a pathological situation that will be studied in the next section.
  2. a > 0, d = 0.
    We consider the confusion matrix $\begin{pmatrix} a & b \\ 1 & 0 \end{pmatrix}$. In this case, $\mathrm{MCC} = -\sqrt{\frac{b}{(a+b)(a+1)}}$ and $\kappa = \frac{-2b}{b(a+b) + (a+1)}$, and application of Proposition 3 allows obtaining the following result:
    Corollary 5
    Although, fixed a > 0, $\mathrm{MCC}$ is a monotonically decreasing function of b, coinciding with intuition, $\kappa$ is not, achieving its global minimum when $b = \sqrt{a+1}$. Moreover, fixed a > 0, $\lim_{b\to+\infty} \mathrm{MCC} = -\frac{1}{\sqrt{a+1}}$ while $\lim_{b\to+\infty} \kappa = 0$.
    See Fig 3 to observe the behaviour of MCC and Kappa, fixed a = 1, as function of b.
    Remark 2 In Corollary 5 we can observe the behaviour of MCC and Kappa for a confusion matrix equivalent to $\begin{pmatrix} a & b \\ 1 & 0 \end{pmatrix}$, corresponding to a scenario in which the negative class is underrepresented and the classifier systematically misclassifies this class, and generally also misclassifies the positive class if b = “false negative”/“false positive” is big. In particular, fixed “true positive” and “false positive”, we observe a contradictory behaviour between MCC and Kappa as b increases: while MCC monotonically decreases, which is expected, Kappa decreases for $b < \sqrt{a+1}$ but increases otherwise. Again, we observe here an unreasonable behaviour of Kappa, which is graphically shown in Fig 3 for the particular case a = 1, corresponding to a confusion matrix equivalent to $\begin{pmatrix} 1 & b \\ 1 & 0 \end{pmatrix}$.
  3. d = 1, a ≥ 0.
    We consider the confusion matrix $\begin{pmatrix} a & b \\ 1 & 1 \end{pmatrix}$. Classification of the negative class is done entirely at random, that is, with the same probability a case actually in the negative class is classified as belonging to either of the two classes. If a, b > 1, the negative class is underrepresented. We have that $\mathrm{MCC} = \frac{a - b}{\sqrt{2(a+b)(a+1)(b+1)}}$ and $\kappa = \frac{2(a - b)}{(a+b)(b+1) + 2(a+1)}$, and application of Proposition 3 gives:
    Corollary 6
    As in the previous cases with c = 1, although if we fix a > 0, then $\mathrm{MCC}$ is a monotonically decreasing function of b, coinciding with intuition, we can see that $\kappa$ is not, achieving its global minimum when $b = a + \sqrt{2}\,(a+1)$. Moreover, fixed a > 0, $\lim_{b\to+\infty} \mathrm{MCC} = -\frac{1}{\sqrt{2(a+1)}}$ while $\lim_{b\to+\infty} \kappa = 0$.
    In Fig 4 we can observe the behaviour of MCC and Kappa, fixed a = 0.2, as function of b.
    Remark 3 Finally, Corollary 6 is dedicated to confusion matrices equivalent to $\begin{pmatrix} a & b \\ 1 & 1 \end{pmatrix}$, which correspond to an unbalanced dataset if a, b > 1, with the negative class as minority class, which is randomly classified, that is, each class is imputed with the same probability to a case actually in the negative class. In addition, fixed a = “true positive”/“true negative”, when b = “false negative”/“false positive” increases the positive class is mainly misclassified. While MCC in this situation behaves as expected and monotonically decreases, Kappa does not, increasing for $b > a + \sqrt{2}\,(a+1)$. As in the previous corollaries, an unreasonable behaviour of Kappa is observed, which is shown in Fig 4 for the particular case a = 0.2, that is, for a confusion matrix equivalent to $\begin{pmatrix} 0.2 & b \\ 1 & 1 \end{pmatrix}$.
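The following sketch (our own illustration, using the closed forms derived above for the a = d case with c = 1) reproduces the Fig 2 scenario numerically: with a = 0.2 fixed, MCC falls steadily as b grows, while Kappa reaches a minimum and then climbs back toward 0.

```python
import numpy as np

a = 0.2                                     # "true positive" = "true negative", as in Fig 2
for b in [0.5, 1, 2, 5, 10, 20, 30]:        # b = "false negative"/"false positive", c = 1
    num   = a * a - b                       # ad - bc with d = a and c = 1
    mcc   = num / ((a + b) * (1 + a))
    kappa = 2 * num / ((a + b)**2 + (1 + a)**2)
    print(f"b = {b:5.1f}   MCC = {mcc:7.3f}   Kappa = {kappa:7.3f}")
# MCC decreases monotonically toward -1/(1+a), whereas Kappa dips near b0 and then rises toward 0.
```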
Fig 2. Disagreement between MCC and Kappa for the confusion matrix $\begin{pmatrix} a & b \\ 1 & a \end{pmatrix}$ with a = 0.2, as a function of b ≥ 0.

If b > 1, the negative class is underrepresented and quite misclassified, and the positive class is mainly misclassified. (a) A zoom of the detail for b ≤ 2. (b) For b ≤ 30.

https://doi.org/10.1371/journal.pone.0222916.g002

Fig 3. Disagreement between MCC and Kappa for the confusion matrix $\begin{pmatrix} a & b \\ 1 & 0 \end{pmatrix}$ with a = 1, as a function of b ≥ 0.

If b > 1, the negative class is underrepresented and systematically misclassified, and the positive class is also mainly misclassified. (a) A zoom of the detail for b ≤ 2. (b) For b ≤ 30.

https://doi.org/10.1371/journal.pone.0222916.g003

Fig 4. Disagreement between MCC and Kappa for the confusion matrix $\begin{pmatrix} a & b \\ 1 & 1 \end{pmatrix}$ with a = 0.2, as a function of b ≥ 0.

The negative class is classified at random. If b > 1 the positive class is mainly misclassified, and the negative class is underrepresented. (a) A zoom of the detail for b ≤ 2. (b) For b ≤ 30.

https://doi.org/10.1371/journal.pone.0222916.g004

The ZA family

Finally, we consider another situation that highlights the incoherent behaviour of Kappa. {ZA, A ≥ 0} was introduced in [2] as a family of confusion matrices useful for analysing performance measures in unbalanced situations: one of the classes is much more common than the others and is mainly misclassified, more and more heavily as A grows (we refer to [2] for the precise definition). We denote by MCC(A) and $\kappa(A)$, respectively, the MCC and Kappa values of the matrix ZA. Note that when N = 2, this family is a particular case of case 3 above with a = 1 and b = A, that is, $Z_A$ is equivalent to $\begin{pmatrix} 1 & A \\ 1 & 1 \end{pmatrix}$. Then, we obtain from Corollary 6 the following result:

Corollary 7 If N = 2, we have that $\mathrm{MCC}(A) = \frac{1 - A}{2(1 + A)}$ and $\kappa(A) = \frac{2(1 - A)}{(1 + A)^2 + 4}$. Although MCC(A) is a monotonically decreasing function of A, coinciding with intuition, $\kappa(A)$ is not, achieving its global minimum when $A = 1 + 2\sqrt{2}$. Moreover, $\lim_{A\to+\infty} \mathrm{MCC}(A) = -\frac{1}{2}$ while $\lim_{A\to+\infty} \kappa(A) = 0$.

We generalize the previous result to any N ≥ 2 in the following proposition:

Proposition 8 and the following properties hold:

  1. ,
  2. and then,
  3. ,
  4. ,
  5. MCC(A) is monotonically decreasing, while is not. Indeed, is a convex function of A, achieving the global minimum, which is a negative value, when .
  6. The divergence between MCC(A) and increases monotonically as A → ∞.

Fig 5 shows the behaviour of MCC and Kappa as functions of A, in the case N = 2 (both for A ≤ 5 and for A ≤ 100), and for N = 5 and N = 10. A desirable property of any performance measure is its internal coherence, which implies that if the classifier moves gradually towards a worsening of the classification process, as is the case when A increases for the family ZA, the measure must reflect this fact with a consequent monotonic decrease (or increase, depending on the interpretation of the measure). Fig 5 highlights the incoherent behaviour of Kappa: as we monotonically increase A, it does not exhibit a monotonic decrease (as MCC does), and this anomaly not only happens in the binary case (N = 2), but continues to occur when we increase N above 2, although at a different scale. Therefore, we have seen that MCC shows internal coherence, unlike Kappa, which, after decreasing in accordance with the worsening of the classification as A increases, shows a monotonic growth that goes just in the opposite direction as A continues to increase, which is clearly inconsistent.
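A sketch of the binary member of this family (the form $\begin{pmatrix} 1 & A \\ 1 & 1 \end{pmatrix}$ implied by the text for N = 2; the general multi-class definition is in [2] and not reproduced here) makes the incoherence easy to reproduce.

```python
import numpy as np

def mcc(C):
    S, t = C.sum(), np.trace(C)
    r, c = C.sum(axis=1), C.sum(axis=0)
    return (S * t - r @ c) / (np.sqrt(S**2 - r @ r) * np.sqrt(S**2 - c @ c))

def kappa(C):
    S = C.sum()
    p_e = (C.sum(axis=1) @ C.sum(axis=0)) / S**2
    return (np.trace(C) / S - p_e) / (1 - p_e)

for A in [1, 2, 5, 10, 50, 100]:
    Z = np.array([[1.0, A], [1.0, 1.0]])   # binary Z_A: the positive class grows and is mainly misclassified
    print(f"A = {A:3d}   MCC = {mcc(Z):7.3f}   Kappa = {kappa(Z):7.3f}")
# MCC decreases monotonically toward -1/2, while Kappa dips (minimum near A = 1 + 2*sqrt(2)) and rises toward 0.
```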

Fig 5. Disagreement between MCC and Kappa for ZA, for different values of N.

(a) N = 2, a zoom of the detail for A ≤ 5. (b) N = 2, A ≤ 100. (c) N = 5, A ≤ 500. (d) N = 10, A ≤ 1000.

https://doi.org/10.1371/journal.pone.0222916.g005

Experimental results

To recapitulate, we have seen that both in the binary case with c = 1 and for the multidimensional ZA family, as the asymmetry of the confusion matrix increases (b → +∞ and A → +∞, respectively) while its diagonal stays constant, the behaviour of Kappa and MCC differs more and more. This is in line with the proven fact that, under perfect symmetry, the two measures match (Proposition 1). It seems natural to ask whether it is only the asymmetry that plays a determining role in the observed discrepancy between their behaviours (it seems that it should not be so, since the asymmetry of matrix C0 also increases as b → +∞, and yet the behaviours of Kappa and MCC agree), or whether, on the contrary, there is some other characteristic of the matrix that drives this phenomenon. To try to shed some light on this issue, we have carried out some empirical experimentation in dimension N = 3.

We start by introducing a measure of the asymmetry of a matrix M, say Asy(M), by means of the Frobenius norm of the difference between the matrix and its transpose. That is to say, we define $\mathrm{Asy}(M) = \lVert M - M^T \rVert_F = \sqrt{\sum_{i,j}\big(M_{ij} - M_{ji}\big)^2}$.
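A minimal sketch of this definition in Python (our own helper, not code from the paper):

```python
import numpy as np

def asy(M):
    """Asymmetry of M: Frobenius norm of M - M^T."""
    M = np.asarray(M, dtype=float)
    return np.linalg.norm(M - M.T, ord='fro')

print(asy([[1, 2], [2, 1]]))   # 0.0: a symmetric matrix has zero asymmetry
print(asy([[1, 5], [1, 1]]))   # sqrt(2 * (5 - 1)^2) ≈ 5.657
```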

Example (a) Let us consider a matrix M1(A) depending on a parameter A ≥ 1. M1(A) is not symmetric, with Asy(M1(A)) = 2A, which increases with A, achieving its minimum value of 2 when A = 1. We can plot the evolution of Kappa and MCC as A increases, as shown in Fig 6, where it can be observed that the behaviour of Kappa is very similar to that of MCC. Asymmetry, then, has not been enough to generate a different behaviour between them. What is, then?

Fig 6. Experimental agreement between MCC and Kappa for M1(A).

Increasing asymmetry but constant entropy.

https://doi.org/10.1371/journal.pone.0222916.g006

Consider now the entropy generated by the values of the matrix that lie outside the main diagonal. In general, given a set of non-negative numbers, say {n1, …, nr}, the Shannon entropy generated by the set can be defined by $\mathrm{Ent}(\{n_1, \dots, n_r\}) = -\sum_{i=1}^{r} p_i \log p_i$, with $p_i = n_i / \sum_{j=1}^{r} n_j$ if $\sum_{j=1}^{r} n_j > 0$ (and the convention $0 \log 0 = 0$), where log denotes, as usual, the logarithm in base 2. With this definition, Ent(M1(A)) = Ent({2A, A, A, 2A, A, A}) = 2.5, which is independent of A, so for the family of matrices M1(A) entropy cannot play any role, since it remains constant when A varies. The same happens with matrix C0, for which asymmetry increases as b → +∞ but entropy remains constant. In other words: increasing asymmetry with constant entropy does not produce the phenomenon of inappropriate behaviour of Kappa in which we are interested.
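A sketch of this entropy (again our own helper; since the explicit layout of M1(A) is not reproduced here, the check below works directly on the off-diagonal multiset {2A, A, A, 2A, A, A} given in the text):

```python
import numpy as np

def ent(values):
    """Shannon entropy (base 2) of a set of non-negative numbers, normalized to sum to 1."""
    v = np.asarray(values, dtype=float)
    p = v[v > 0] / v.sum()          # 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

for A in [1, 10, 1000]:
    print(ent([2*A, A, A, 2*A, A, A]))   # 2.5 bits for every A > 0, i.e. constant entropy
```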

Example (b) Consider now a matrix M2(A) depending on a parameter A > 1, for which Asy(M2(A)) increases with A, while Ent(M2(A)) decreases, converging to 0 as A → +∞. The corresponding plots of Kappa, MCC and their difference, with respect to A, are shown in Fig 7.

Fig 7. Experimental disagreement between MCC and Kappa for M2(A).

Entropy decreasing to zero, which implies increasing asymmetry.

https://doi.org/10.1371/journal.pone.0222916.g007

MCC(M2(A)) is a decreasing function of A, but $\kappa(M_2(A))$ is increasing for A ≥ 4. We can therefore observe a contradictory behaviour of the two measures. Let us see this with numerical examples in Table 1: as A increases (and hence asymmetry increases while entropy decreases to zero), MCC decreases but Kappa increases.

Table 1. Comparing MCC, Kappa, Asy and Ent for M2(A).

A = 10, 25, 50, 75, 100.

https://doi.org/10.1371/journal.pone.0222916.t001

Remark 4 Note that for matrix M2(A), MCC and Kappa diverge as A increases, as happens with the family of matrices ZA and with the confusion matrix considered in Proposition 3 (the binary case with c = 1, in which the behaviour of Kappa appears contrary to common sense when b increases). In the three scenarios, the entropy decreases to zero and the asymmetry of the confusion matrix grows to +∞. Indeed, for the binary matrices ZA (as A → +∞) and C1 (as b → +∞), the off-diagonal entropies Ent({A, 1}) and Ent({b, 1}) tend to 0, while the asymmetries, $\sqrt{2}\,(A - 1)$ and $\sqrt{2}\,|b - 1|$ respectively, grow to +∞.

In general, entropy of the elements outside the main diagonal and asymmetry are related in the sense given by the following lemma.

Lemma 9 Let $C(A) = (C_{ij}(A))_{i,j=1,\dots,N}$ be a matrix of non-negative integers depending on a parameter A ∈ ℕ, and such that Ent(C(A)) > 0 for any A. Then, if the entropy of C(A) decreases to zero, the asymmetry must grow to infinity, that is, $\lim_{A\to+\infty} \mathrm{Ent}(C(A)) = 0$ implies $\lim_{A\to+\infty} \mathrm{Asy}(C(A)) = +\infty$. Proof: By the definition of Shannon entropy, if Ent(C(A)) converges to zero, then in the limit there is no uncertainty outside the main diagonal, that is, there must exist a pair (i, j), with i ≠ j, such that $\lim_{A\to+\infty} \frac{C_{ij}(A)}{\sum_{r \neq s} C_{rs}(A)} = 1$. Then, with (r, s) = (j, i), we can write $\lim_{A\to+\infty} \big(C_{ij}(A) - C_{ji}(A)\big) = +\infty$, since $\lim_{A\to+\infty} \frac{C_{ji}(A)}{C_{ij}(A)} = 0$ and $\lim_{A\to+\infty} C_{ij}(A) = +\infty$ (the entries are non-negative integers and, the entropy being strictly positive, at least one other off-diagonal entry is non-zero, so its relative weight can only vanish if $C_{ij}(A)$ grows without bound).

Finally, from the fact that Asy(C(A)) ≥ |Cij(A) − Cji(A)| → +∞ we finish the proof.

Lemma 9 confirms that what we have observed in different examples (confusion matrices C1 as function of b, ZA and M2(A)), in which entropy tended to zero and asymmetry grew towards infinity, is not a coincidence but the rule.

It remains to ask whether the role of asymmetry in the phenomenon of discrepancy between the behaviours of Kappa and MCC is cancelled out by entropy; that is, whether the phenomenon can still be observed if the asymmetry remains constant while the entropy does not decrease to zero. The negative answer is given by the following example, in which asymmetry is constant and entropy decreases to a positive limit, but the phenomenon of discrepancy between MCC and Kappa is no longer observed.

Example (c) Consider a matrix M3(A) depending on two parameters A and B, with B = 1000 − A and A = 0, …, 999. The corresponding plots of MCC, Kappa and their difference in absolute value are shown in Fig 8. In this setting, as in Example (a), there is agreement in the behaviour of MCC and Kappa. However, in this case there is no decrease of entropy to zero as in Example (b). Indeed, with B = 1000 − A, Ent(M3(A)) is a monotonically decreasing function of A that converges to log(300) − log(100) > 0 as A → 1000, while Asy(M3(A)) remains constant.

Fig 8. Experimental agreement between MCC and Kappa for M3(A).

Decreasing entropy to a positive limit and constant asymmetry.

https://doi.org/10.1371/journal.pone.0222916.g008

The previous examples, in which the diagonal stays constant, show that it is not enough that the asymmetry grows to infinity, or that the entropy is constant or simply decreasing, for the phenomenon of discrepancy between Kappa and MCC to occur; heuristically, it seems that the entropy must decrease to zero, which by Lemma 9 implies at the same time that the asymmetry grows to infinity. At least, this is what experimentation has shown in the cases already commented on. To finish, we give two more examples in the same vein, the first corresponding to discrepancy, and the second to similarity, in the behaviours of MCC and Kappa.

Example (d) Let M4(A) be a confusion matrix depending on two parameters A and B, with B = 100 − A and A = 50, …, 100. In this case, Asy(M4(A)), as a function of A ∈ [50, 100], monotonically increases with A, and Ent(M4(A)), which can be written in terms of g(A) = A(A + 1) + (100 − A)(101 − A) + 2, monotonically decreases (to zero if we increase the parameter 100). We can observe in Fig 9 that in this case the appearance of the described counterintuitive behaviour of Kappa is confirmed: for A > 50, MCC decreases and Kappa increases as A increases. By symmetry, for A < 50 we observe just the same as A decreases.

Fig 9. Experimental disagreement between MCC and Kappa for M4(A).

Entropy decreases to zero, which implies that asymmetry increases, for A increasing from 50 to 100 and, by symmetry, for A decreasing from 50 to 0.

https://doi.org/10.1371/journal.pone.0222916.g009

Table 2 illustrates this example numerically by comparing different values of A. We observe that when entropy decreases and asymmetry increases (A > 50), MCC decreases and Kappa increases, while a completely symmetrical behaviour is observed for A < 50, in accordance with Fig 9.

Table 2. Comparing MCC, Kappa, Asy and Ent for M4(A).

A = 50, 60, 70, 80, 90, 100.

https://doi.org/10.1371/journal.pone.0222916.t002

Example (e) Finally, consider a confusion matrix depending on a parameter A ≥ 1 whose asymmetry is increasing in A, while the entropy of its off-diagonal elements decreases to log(7) − 2/7 > 0 as A → +∞. In this case, MCC and Kappa agree in behaviour as A increases.

Conclusion

Accuracy is one of the most intuitive and widely used performance metrics for classification, although it is not appropriate in unbalanced cases. MCC and Kappa seem to correct this bias: the former was initially designed to deal with very unbalanced data, while the latter, which was not created to be a classification performance metric but is nevertheless widely used as one, takes into account the probability of obtaining the classification by pure chance. These two measures have a similar behaviour in some situations. In fact, we show that they coincide precisely when the confusion matrix is perfectly symmetric. In other situations, however, their behaviour can diverge to the point that Kappa should be avoided as a measure for comparing classifiers, in favour of more robust measures such as MCC.

In the present work, similarities and differences between MCC and Kappa have been discussed and illustrated with synthetic confusion matrices, both in the binary and in the multi-class setting. Our mathematical analysis and heuristic study show that in situations in which the diagonal of the confusion matrix stays constant and, at the same time, the entropy of the elements outside the diagonal decreases to zero, which implies an increase in the asymmetry of the confusion matrix, the phenomenon of qualitative differentiation in the behaviour of Kappa and MCC appears clearly. Notwithstanding, neither increasing nor constant asymmetry seems to be enough to produce this phenomenon when the entropy is not decreasing to zero. As far as we know, conclusions of this kind have not been reached before, so they represent a novelty in the study of Kappa.

From a clinical perspective, the fact that Kappa is a relative measure of agreement is problematic, since it is hard to set a threshold for good agreement. This does not seem to be a problem when it is used as a performance metric, because Kappa values are compared for each classifier given a unique ground truth, it being the relative difference, and not the value itself, that determines the best classifier. Notwithstanding, we have shown that if the marginal probabilities are really small, the distribution of the misclassifications also affects the value of Kappa, to the extent that worse classification results can nevertheless obtain higher values of the statistic. This is especially dramatic when the entropy of the elements outside the main diagonal of the confusion matrix decreases to zero.

A summary of the examples considered in this work, according to the agreement/disagreement between the behaviour of MCC and Kappa, can be found in Table 3.

Table 3. Summary of the obtained results: Examples and agreement/disagreement between the behaviour of MCC and Kappa in terms of the asymmetry of the confusion matrix and of the entropy associated with the elements outside the main diagonal.

Disagreement scenario corresponds to entropy decreasing to zero, which implies by Lemma 9 that asymmetry must grow to infinity.

https://doi.org/10.1371/journal.pone.0222916.t003

The standard problems associated with Kappa are mainly related to unbalanced datasets (see for instance [36] and [17]). We show that an unbalanced situation can make Kappa not comparable between different situations, but that, to obtain counter-intuitive results, it is also necessary for the entropy of the elements outside the main diagonal to decrease to zero.

Nowadays, in the field of machine learning, such situations, in which the number of observations of one of the classes far exceeds that of the others, or in which some marginal probabilities are very small, are very common. Machine learning algorithms automatically scrutinize huge amounts of data, classifying them into hundreds of categories or looking for an unlikely but relevant event. In that framework, finding a performance measure that is robust and reliable becomes of the utmost importance. Hence, we believe that it has been sufficiently justified that, unfortunately, Cohen’s Kappa can no longer play this role, especially considering the existence of solid alternatives.

Acknowledgments

The authors wish to thank the anonymous referees for careful reading and helpful comments that resulted in an overall improvement of the paper.

References

  1. 1. Ferri C., Hernández-Orallo J., Modroiu R.: An experimental comparison of performance measures for classification. Pattern Recognition Letters 30(1), 27–38 (2009)
  2. 2. Jurman G., Riccadonna S., Furlanello C.: A comparison of mcc and cen error measures in multi-class prediction. PloS one 7(8), e41882 (2012)
  3. 3. Sokolova M., Lapalme G.: A systematic analysis of performance measures for classification tasks. Information Processing & Management 45(4), 427–437 (2009)
  4. 4. Matthews B.W.: Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure 405(2), 442–451 (1975)
  5. 5. Gorodkin J.: Comparing two k-category assignments by a k-category correlation coefficient. Computational biology and chemistry 28(5-6), 367–374 (2004) pmid:15556477
  6. 6. Stokić D., Hanel R., Thurner S.: A fast and efficient gene-network reconstruction method from multiple over-expression experiments. BMC bioinformatics 10(1), 253 (2009) pmid:19686586
  7. 7. Supper, J., Spieth, C., Zell, A.: Reconstructing linear gene regulatory networks. In: European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics, pp. 270–279. Springer (2007)
  8. 8. Blair E., Stanley F.: Interobserver agreement in the classification of cerebral palsy. Developmental Medicine & Child Neurology 27(5), 615–622 (1985)
  9. 9. Cameron M.L., Briggs K.K., Steadman J.R.: Reproducibility and reliability of the outerbridge classification for grading chondral lesions of the knee arthroscopically. The American journal of sports medicine 31(1), 83–86 (2003) pmid:12531763
  10. 10. Monserud R.A., Leemans R.: Comparing global vegetation maps with the Kappa statistic. Ecological modelling 62(4), 275–293 (1992)
  11. 11. Allouche O., Tsoar A., & Kadmon R.: Assessing the accuracy of species distribution models: prevalence, kappa and the true skill statistic (TSS). Journal of applied ecology 43(6), 1223–1232 (2006)
  12. 12. Tian Y., Zhang H., Pang Y., Lin J.: Classification for single-trial N170 during responding to facial picture with emotion. Front. Comput. Neurosci. 12:68. pmid:30271337
  13. 13. Donker D., Hasman A., Van Geijn H.: Interpretation of low Kappa values. International journal of bio-medical computing 33(1), 55–64 (1993) pmid:8349359
  14. 14. Forbes A.D.: Classification-algorithm evaluation: Five performance measures based on confusion matrices. Journal of Clinical Monitoring 11(3), 189–206 (1995) pmid:7623060
  15. 15. Brennan R.L., Prediger D.J.: Coefficient Kappa: Some uses, misuses, and alternatives. Educational and psychological measurement 41(3), 687–699 (1981)
  16. 16. Maclure M., Willett W.C.: Misinterpretation and misuse of the Kappa statistic. American journal of epidemiology 126(2), 161–169 (1987) pmid:3300279
  17. 17. Uebersax J.S.: Diversity of decision-making models and the measurement of interrater agreement. Psychological bulletin 101(1), 140–146 (1987)
  18. 18. Feinstein A.R., Cicchetti D.V.: High agreement but low Kappa: I. the problems of two paradoxes. Journal of clinical epidemiology 43(6), 543–549 (1990) pmid:2348207
  19. 19. Cicchetti D.V., Feinstein A.R.: High agreement but low Kappa: Ii. resolving the paradoxes. Journal of clinical epidemiology 43(6), 551–558 (1990) pmid:2189948
  20. 20. Krippendorff K.: Reliability in content analysis: Some common misconceptions and recommendations. Human communication research 30(3), 411–433 (2004)
  21. 21. Warrens M.J.: A formal proof of a paradox associated with Cohen’s Kappa. Journal of Classification 27(3), 322–332 (2010)
  22. 22. Byrt T., Bishop J., & Carlin J. B.: Bias, prevalence and kappa. Journal of clinical epidemiology 46(5), 423–429 (1993) pmid:8501467
  23. 23. de Vet H.C., Mokkink L.B., Terwee C.B., Hoekstra O.S., Knol D.L.: Clinicians are right not to like Cohen’s Kappa. BMJ 346, f2125 (2013) pmid:23585065
  24. 24. Dice L. R.: Measures of the amount of ecologic association between species. Ecology 26(3), 297–302 (1945)
  25. 25. Albatineh A. N., Niewiadomska-Bugaj M., & Mihalko D.: On similarity indices and correction for chance agreement. Journal of Classification 23(2), 301–313 (2006)
  26. 26. Warrens M. J.: On similarity coefficients for 2 × 2 tables and correction for chance. Psychometrika 73(3), 487 (2008) pmid:20037641
  27. 27. Cohen J.: A coefficient of agreement for nominal scales. Educational and psychological measurement 20(1), 37–46 (1960)
  28. 28. Scott W.A.: Reliability of content analysis: The case of nominal scale coding. Public opinion quarterly pp. 321–325 (1955)
  29. 29. Mak T. K.: Analysing intraclass correlation for dichotomous variables. Journal of the Royal Statistical Society: Series C (Applied Statistics) 37(3), 344–352 (1988)
  30. 30. Goodman L. A., & Kruskal W. H.: Measures of association for cross classifications III: Approximate sampling theory. Journal of the American Statistical Association, 58(302), 310–364 (1963)
  31. 31. Brennan R. L., & Light R. J.: Measuring agreement when two observers classify people into categories not defined in advance. British Journal of Mathematical and Statistical Psychology 27(2), 154–163 (1974)
  32. 32. Bexkens R., Claessen F. M., Kodde I. F., Oh L. S., Eygendaal D., & van den Bekerom M. P.: The kappa paradox. Shoulder & Elbow, 10(4), 308–308 (2018)
  33. 33. Viera A. J., & Garrett J. M.: Understanding interobserver agreement: the kappa statistic. Fam med 37(5), 360–363 (2005) pmid:15883903
  34. 34. Sim J., & Wright C. C.: The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Physical therapy 85(3), 257–268 (2005) pmid:15733050
  35. 35. Warrens M.J.: On association coefficients, correction for chance, and correction for maximum value. Journal of Modern Mathematics Frontier 2(4), 111–119 (2013)
  36. 36. Andrés A.M., Marzo P.F.: Delta: A new measure of agreement between two raters. British journal of mathematical and statistical psychology 57(1), 1–19 (2004) pmid:15171798
  37. 37. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., et al.: Scikit-learn: Machine learning in python. Journal of machine learning research 12(Oct), 2825–2830 (2011)
  38. 38. Kuhn M., et al.: Caret package. Journal of statistical software 28(5), 1–26 (2008)
  39. 39. Huang C., Davis L., Townshend J.: An assessment of support vector machines for land cover classification. International Journal of remote sensing 23(4), 725–749 (2002)
  40. 40. Duro D.C., Franklin S.E., Dubé M.G.: A comparison of pixel-based and object-based image analysis with selected machine learning algorithms for the classification of agricultural landscapes using spot-5 HRG imagery. Remote Sensing of Environment 118, 259–272 (2012)
  41. 41. Passos A.N., Kohara V.S., Freitas R.S.d., Vicentini A.P.: Immunological assays employed for the elucidation of an histoplasmosis outbreak in São Paulo, SP. Brazilian Journal of Microbiology 45(4), 1357–1361 (2014) pmid:25763041
  42. 42. Claessen F. M., van den Ende K. I., Doornberg J. N., Guitton T. G., Eygendaal D., van den Bekerom M. P., … & Wagener M.: Osteochondritis dissecans of the humeral capitellum: reliability of four classification systems using radiographs and computed tomography. Journal of shoulder and elbow surgery 24(10), 1613–1618 (2015) pmid:25953486
  43. 43. Powers, D.M.W.: The problem with Kappa. In: Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pp. 345–355. Association for Computational Linguistics (2012)
  44. 44. Jeni, L.A., Cohn, J.F., De La Torre, F.: Facing imbalanced data–recommendations for the use of performance metrics. In: Affective Computing and Intelligent Interaction (ACII), 2013 Humaine Association Conference on, pp. 245–251. IEEE (2013)
  45. 45. Zhao X., Liu J.S., Deng K.: Assumptions behind intercoder reliability indices. In Salmon Charles T. (ed.) Communication Yearbook 36, 419–480. New York: Routledge (2013)
  46. 46. Witten I.H., Frank E., Hall M.A., Pal C.J.: Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann (2016)
  47. 47. Krippendorff K.: Association, agreement, and equity. Quality and Quantity 21(2), 109–123 (1987)
  48. 48. Krippendorff K.: Content analysis: An introduction to its methodology (1980)