Abstract

Predicting software defects at the initial stages of the software development life cycle remains a critical and important task. Accurate defect prediction helps to assure the quality of software systems and has remained an integral subject of study in recent years. Early identification of defective modules during software development allows the development team to use the available resources competently and effectively to deliver high-quality software products within a short timeline. Hitherto, several researchers have developed defect prediction models using statistical and machine learning techniques, which are effective approaches to pinpoint defective modules. Tree family machine learning techniques are considered among the best and most commonly used supervised learning methods. In this study, different tree family machine learning techniques are employed for software defect prediction using ten benchmark datasets. These techniques include Credal Decision Tree (CDT), Cost-Sensitive Decision Forest (CS-Forest), Decision Stump (DS), Forest by Penalizing Attributes (Forest-PA), Hoeffding Tree (HT), Decision Tree (J48), Logistic Model Tree (LMT), Random Forest (RF), Random Tree (RT), and REP-Tree (REP-T). The performance of each technique is evaluated using different measures, i.e., mean absolute error (MAE), relative absolute error (RAE), root mean squared error (RMSE), root relative squared error (RRSE), specificity, precision, recall, F-measure (FM), G-measure (GM), Matthew’s correlation coefficient (MCC), and accuracy. The overall outcomes of this paper suggest the RF technique, which produces the best results in terms of both reducing error rates and increasing accuracy on five datasets, i.e., AR3, PC1, PC2, PC3, and PC4. The average accuracy achieved by RF is 90.2238%. The comprehensive outcomes of this study can be used as a reference point for other researchers: any assertion concerning an enhancement in prediction through a new model, technique, or framework can be benchmarked against them and verified.

1. Introduction

Software engineering (SE) is a discipline concerned with all aspects of software development, from the initial specification of the software through to its maintenance after it has gone into use [1]. In the field of SE, software defect prediction (SDP) is one of the most significant and active research areas and plays a significant role in software quality assurance (SQA) [2, 3]. A software defect (SD) is a flaw or deficiency in a software system that causes it to produce an unintended result. The rising complexity and interdependencies of software systems have increased the difficulty of delivering software with minimal cost, high quality, and good maintainability, which increases the chances of introducing SDs [4]. Generally, SDs are found in the testing stage of the Software Development Life Cycle (SDLC) [5]. An SD can also arise when the finalized software product does not meet the client’s expectations or requirements [6], which diminishes the quality of the software product and increases the development cost.

SDP is a significant activity for assuring the quality of a software system: it helps to keep development costs reasonable and improves quality by identifying defect-prone instances before testing [4]. It also involves categorizing the components of new versions of a software system, which supports the testing process by concentrating testing and evaluation on the components classified as defective [7]. Defects adversely affect software reliability and quality [8].

SDP in the early phases of the SDLC is considered one of the most challenging aspects of SQA [9]. In SE, bug fixing and testing are very costly and require a massive amount of resources. Forecasting software defects during software development has been investigated in numerous studies over the last decades. Among these studies, machine learning (ML) techniques are considered the best approach to SDP [7, 10, 11].

Considering the above issues related to SDP, various researchers have built and evaluated SDP models utilizing diverse classification techniques. Still, it is quite challenging to derive any general guideline establishing the superiority of these techniques. Broadly, it has been found that, despite some unique contributions among the studies, no specific SDP technique consistently delivers higher accuracy than the others across different datasets. Most researchers have used different evaluation measures to achieve higher accuracy, but to the best of our knowledge, no one has worked on reducing the error rate, which is also an important factor for any prediction model [12, 13].

This study therefore focuses on exploring Tree Family (TF) ML techniques for SDP. These TF-ML techniques include Credal Decision Tree (CDT), Cost-Sensitive Decision Forest (CS-Forest), Decision Stump (DS), Forest by Penalizing Attributes (Forest-PA), Hoeffding Tree (HT), Decision Tree (J48), Logistic Model Tree (LMT), Random Forest (RF), Random Tree (RT), and REP-Tree (REP-T). These techniques are employed on ten different datasets: AR1, AR3, CM1, KC2, KC3, MW1, PC1, PC2, PC3, and PC4. All the experiments are validated using mean absolute error (MAE), relative absolute error (RAE), root mean squared error (RMSE), root relative squared error (RRSE), specificity, precision, recall, F-measure (FM), G-measure (GM), Matthew’s correlation coefficient (MCC), and accuracy.

A question may be raised about the reason for selecting TF-ML techniques. They were selected because tree-based methods are considered among the best and most commonly used supervised learning methods. Tree-based techniques provide predictive models with ease of interpretation, stability, and high accuracy [14]. Unlike linear models, TF-ML techniques map nonlinear relationships quite well. They are flexible at resolving several kinds of problems at hand (regression or classification) and work for both categorical and continuous input and output variables [15, 16]. Tree-based methods are among the fastest ways to identify the most significant variables and the relations between two or more variables. They can produce new features with improved power to forecast the target variable, require less data cleaning than several other modeling techniques, and, to a reasonable extent, are not biased by outliers and missing values [17–19].

The main contributions of this research are as follows:
(i) We exploited ten benchmarked TF-ML techniques (CDT, CS-Forest, DS, Forest-PA, HT, J48, LMT, RF, RT, and REP-T) for SDP.
(ii) We conducted a series of experiments on different datasets, i.e., AR1, AR3, CM1, KC2, KC3, MW1, PC1, PC2, PC3, and PC4.
(iii) To gain insights into the experimental outcomes, evaluation is accomplished using MAE, RAE, RMSE, RRSE, specificity, precision, recall, FM, GM, MCC, and accuracy.
(iv) To show the significance of the results, Friedman’s two-way analysis of variance by ranks is performed.

Hereinafter, Section 2 presents the literature survey, Section 3 comprises the methodology and techniques, the experimental outcomes are discussed in Section 4, Section 5 discusses the overall performance, and Section 6 covers the conclusion.

2. Literature Survey

This section delivers a brief survey of existing techniques in the field of SDP. Several researchers have employed ML techniques for SDP at the initial phase of software development; several representative studies are discussed here. Czibula et al. [11] presented a model grounded on Relational Association Discovery (RAD) for SDP. They applied all the investigations on the NASA datasets MC2, KC1, KC3, JM1, MW1, PC3, PC4, CM1, PC1, and PC2. The model was assessed by comparing it to other models using accuracy, probability of detection (PD), specificity, precision, and the area under the ROC curve; the acquired outcomes showed that RAD performed better than the other employed techniques.

Li et al. [20] recommended a framework for SDP named Defect Prediction through Convolutional Neural Network (DP-CNN). They evaluated DP-CNN on seven open source projects, namely camel, jEdit, Lucene, xalan, Xerces, synapse, and poi, in terms of FM in defect prediction. Overall outcomes illustrate that, on average, DP-CNN improved on the state-of-the-art method by 12%.

Jacob and Raju [21] introduced a hybrid feature selection (HFS) method for SDP. They also performed their analysis on NASA datasets, including PC1, PC2, PC3, PC4, CM1, JM1, KC3, and MW1. The outcomes of HFS are benchmarked against Naïve Bayes (NB), neural networks (NNs), RF, Random Tree (RT), and J48. Benchmarking is carried out using sensitivity, specificity, accuracy, and Matthew’s correlation coefficient (MCC). The analyzed outcomes show that HFS outperforms the benchmarks, improving classification accuracy from 82% to 98%.

Bashir et al. [22] presented a combined framework to improve the SDP model using data sampling (DS), ranker feature selection (FS) techniques, and the iterative partition filter (IPF) to overcome class imbalance, high dimensionality, and noise, respectively. Seven ML techniques, including NB, RF, KNN, MLP, SVM, J48, and decision stump, are employed on the CM1, JM1, KC2, MC1, PC1, and PC5 datasets for evaluation. The outcomes are evaluated using the receiver operating characteristic (ROC) measure; overall, the proposed model outperformed the other models.

In another study [7], the author proposed a new approach for SDP utilizing hybridized gradual relational association and artificial neural network (HyGRAR) to classify defective and nondefective objects. Experiments were performed on ten different open-source datasets, namely jEdit 4.0, jEdit 4.2, jEdit 4.3, Ant 1.7, Tomcat 6.0, AR1, AR3, AR4, AR5, and AR6. For module evaluation, accuracy, sensitivity, specificity, and precision measures were utilized. The author concluded that HyGRAR achieved better outcomes compared to most of the previously proposed approaches.

Alsaeedi and Khan [8] compared supervised learning techniques, including Bagging, SVM, Decision Tree (DT), and RF, and ensemble classifiers on different NASA datasets, namely KC2, KC3, PC1, PC3, PC4, PC5, JM1, CM1, MC1, and MC2. The base learners and ensemble classifiers are evaluated using GM, specificity, F-score, recall, precision, and accuracy. The experimental results showed that RF, AdaBoost with RF, and DS with bagging outperformed the other employed techniques.

A comparative exploration of several ML techniques for SDP is performed in [9] on twelve NASA datasets, namely PC1, PC2, PC3, PC4, PC5, MC1, MC2, JM1, CM1, MW1, KC1, and KC3, while the classification techniques include One Rule (OneR), NB, MLP, DT, RBF, kStar (K∗), SVM, KNN, PART, and RF. The performance of each technique is evaluated via MCC, ROC area, recall, precision, FM, and accuracy. It is inferred from the outcomes that neither accuracy nor the ROC area can be used as an effective performance measure, as both fail to account for the class imbalance problem.

Malhotra and Kamal [6] evaluated the efficiency of ML classifiers for SDP on twelve imbalanced NASA datasets by employing sampling methods and cost-sensitive classifiers. They examined five prevailing methods, including J48, RF, NB, AdaBoost, and Bagging, as well as the SPIDER3 method, for SDP. They compared the performance based on accuracy, sensitivity, specificity, and precision. The outcomes show improvement in the prediction capability of ML classifiers with the use of oversampling methods. Moreover, the proposed SPIDER3 method shows promising results.

Manjula and Florence [23] developed a hybrid model based on the genetic algorithm (GA) and a deep neural network (DNN), where GA is used for feature optimization and DNN for classification. The performance of the proposed technique is compared with NB, RF, DT, Immunos, ANN-artificial bee colony (ABC), SVM, majority vote, AntMiner+, and KNN. All experiments are carried out on the KC1, KC2, CM1, PC1, and JM1 datasets and assessed via recall, F-score, sensitivity, precision, specificity, and accuracy. The experimental outcomes showed that the recommended technique outperformed the other techniques in terms of achieving better accuracy.

Researchers have used various techniques to overcome the limitations of SDP on a variety of datasets, and each study uses different evaluation measures to evaluate the proposed techniques. The overall summary of the literature discussed above is given in Table 1. As shown in Table 1, each study has used different evaluation measures to achieve higher accuracy, but none has made an effort to decrease the error rate, which is a significant factor.

3. Research Methodology

This research aims to present a performance analysis of TF-ML techniques for SDP on various datasets, including AR1, AR3, CM1, KC2, KC3, MW1, PC1, PC2, PC3, and PC4. The complete research follows the procedure shown in Figure 1. All experimentation is performed using the open source ML and DM tool Weka version 3.9 (https://machinelearningmastery.com/use-ensemble-machine-learningalgorithms-weka/). After the selection of datasets, a preprocessing step is applied to each dataset for two main purposes: replacing the missing values and changing the class attribute from numerical to categorical, as some of the techniques do not work on numerical class attributes. Then, the ML techniques are applied to each dataset, and the outcomes are assessed using different assessment measures to compare the performance of the individual techniques. Eleven assessment measures, namely MAE [13, 24, 25], RMSE [8, 26, 27], RAE [22, 28, 29], RRSE [28], specificity [30–32], precision [9, 15, 33], recall [9, 10, 31], FM [9, 15, 20], GM [8, 34], MCC [9, 35, 36], and accuracy [37–39], are utilized to evaluate the performance of the ML techniques on the SDP datasets.
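The study performs these steps inside Weka 3.9. As an illustrative sketch only (the paper does not provide code), an analogous preprocessing pipeline could be expressed in Python with pandas and scikit-learn; the CSV path and the class column name "defects" below are placeholders, not the study's actual artifacts.

# Illustrative preprocessing sketch (the study itself uses Weka 3.9).
# The CSV path and "defects" column name are placeholders.
import pandas as pd
from sklearn.impute import SimpleImputer

def preprocess(csv_path, class_column="defects"):
    data = pd.read_csv(csv_path)

    # 1) Replace missing values in the numeric software metrics
    #    (mean imputation is used here as one reasonable choice).
    features = data.drop(columns=[class_column])
    imputer = SimpleImputer(strategy="mean")
    X = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)

    # 2) Convert the class attribute from numerical to categorical,
    #    since some tree learners require a nominal class.
    y = (data[class_column] > 0).map({True: "defective", False: "nondefective"})
    return X, y.astype("category")

# Example call with a placeholder file:
# X, y = preprocess("PC1.csv")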

3.1. Datasets Description

Each dataset consists of a number of attributes along with a known output class. The datasets contain numerical data, while the numbers of attributes and instances differ, as presented in Table 2, where the first column lists the datasets, the second and third columns give the number of attributes and the number of instances, respectively, the fourth and fifth columns give the number of defective and nondefective modules, and the last column shows the type of data in each dataset. Table 3 lists all attributes (software metrics) of each dataset utilized in this research, where “-” means that the attribute is not part of the dataset and “Y” represents the presence of the attribute in the dataset.

3.2. Performance Measurement Parameters

This section describes the calculation of each performance measurement parameter with a short description, where n is the number of records, T_j is the target value for record j, P_{ij} is the value forecast by model i for record j (out of n records), TP is the number of true-positive classifications, FN is the number of false-negative classifications, TN is the number of true-negative classifications, and FP is the number of false-positive classifications:

(A) MAE is the average of all absolute errors. It can be calculated as $\mathrm{MAE} = \frac{1}{n}\sum_{j=1}^{n}\left|P_{ij} - T_j\right|$.

(B) RMSE is the quadratic scoring rule that measures the average magnitude of the error. It can be calculated as $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(P_{ij} - T_j\right)^{2}}$.

(C) RAE normalizes the total absolute error by that of a simple predictor, which always predicts the average $\bar{T}$ of the actual values. It can be found as $\mathrm{RAE} = \frac{\sum_{j=1}^{n}\left|P_{ij} - T_j\right|}{\sum_{j=1}^{n}\left|\bar{T} - T_j\right|}$.

(D) RRSE is the square root of the relative squared error (RSE), which reduces the error to the same dimensions as the quantity being predicted. It can be found as $\mathrm{RRSE} = \sqrt{\frac{\sum_{j=1}^{n}\left(P_{ij} - T_j\right)^{2}}{\sum_{j=1}^{n}\left(\bar{T} - T_j\right)^{2}}}$.

(E) Specificity (also called the true-negative rate) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of nondefective modules correctly identified as nondefective). It can be calculated as $\mathrm{Specificity} = \frac{TN}{TN + FP}$.

(F) Precision is the number of true positives divided by the total number of instances predicted as positive; it is also called the positive predictive value (PPV). It can be calculated as $\mathrm{Precision} = \frac{TP}{TP + FP}$.

(G) Recall is defined as the ratio of true-positive modules to the total number of actual positive modules. It can be found as $\mathrm{Recall} = \frac{TP}{TP + FN}$.

(H) F-measure (also called the F-score or F1-score) conveys the balance between precision and recall. It can be measured as $\mathrm{FM} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$.

(I) G-measure conveys the balance between specificity and recall. It can be calculated as $\mathrm{GM} = \frac{2 \times \mathrm{Recall} \times \mathrm{Specificity}}{\mathrm{Recall} + \mathrm{Specificity}}$.

(J) MCC is a correlation coefficient calculated using all four values of the confusion matrix. It can be found as $\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$.

(K) Accuracy indicates how accurate the forecast is and can be calculated as $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.
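For concreteness, a minimal sketch of how the confusion-matrix-based measures above could be computed is given below; the study itself reads these values from Weka, so this is only an illustration of the standard definitions.

# Sketch of the confusion-matrix-based measures defined above
# (illustrative only; the study obtains these values from Weka).
from math import sqrt

def classification_measures(tp, fp, tn, fn):
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                     # true-positive rate
    f_measure = 2 * precision * recall / (precision + recall)
    g_measure = 2 * specificity * recall / (specificity + recall)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"specificity": specificity, "precision": precision,
            "recall": recall, "F-measure": f_measure,
            "G-measure": g_measure, "MCC": mcc, "accuracy": accuracy}

# A zero denominator (e.g., TP + FP = 0) leaves a measure undefined;
# this is the "#DIV/0!" case reported as "?" in the result tables.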

3.3. Summarization of Employed Techniques

ML techniques are currently extensively used to extract significant knowledge from massive volumes of data in diverse areas. ML applications cover numerous real-world situations such as cybersecurity, bioinformatics, detecting communities in social networks, and software process enhancement to produce high-quality software systems [7]. ML-based, and in particular TF-ML-based, solutions for SDP have also been investigated [6, 10, 34]. The following subsections briefly discuss the TF-ML techniques employed in this research.

3.3.1. Credal Decision Tree

Credal Decision Tree (CDT) is a technique for designing classifiers based on imprecise probabilities and uncertainty measures [18]. During the construction of a CDT, to avoid producing an overly complex decision tree, a new criterion is applied: stop splitting once the total uncertainty increases as a result of the split. The function used to measure total uncertainty is briefly described in [14, 19].

3.3.2. Cost-Sensitive Decision Forest

CS-Forest applies cost-sensitive pruning as a substitute for the pruning used by C4.5. C4.5 prunes a tree if the expected number of misclassifications for future records does not increase significantly due to the pruning, whereas CS-Forest prunes a tree if the expected classification cost for future records does not increase significantly due to the pruning. Moreover, unlike the Cost-Sensitive Decision Tree (CS-Tree), CS-Forest allows a tree to first grow fully and then be pruned [40].
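As a hedged illustration of the general idea only (not the exact formulation of CS-Forest given in [40]), the expected classification cost of $N$ future records under a cost matrix with false-positive cost $C_{FP}$ and false-negative cost $C_{FN}$ can be written as

$\mathrm{ExpCost} = \frac{FP \times C_{FP} + FN \times C_{FN}}{N},$

so that a subtree is pruned when replacing it with a leaf does not significantly increase this expected cost, whereas C4.5 applies the analogous test to the expected number of misclassifications alone.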

3.3.3. Decision Stump

DS is often utilized as a base learner to construct ensemble models. DS is an ML model comprising a one-level decision tree; such ensemble learning typically performs on the order of 1000 iterations to reach optimal performance [41]. A DS is essentially a decision tree with a single split: in contrast to a tree with multiple levels, a stump stops after the first split. DS is commonly utilized on large datasets; it can also serve to create simple yes/no decision models for smaller datasets [39].

3.3.4. Forest by Penalizing Attributes

The Forest-PA technique uses bootstrap samples and penalized attributes. It aims to construct a set of highly accurate decision trees by exploiting the strength of all nonclass attributes present in a dataset, unlike certain existing techniques that use only a subset of the nonclass attributes. At the same time, to promote diversity, Forest-PA imposes penalties (disadvantageous weights) on the attributes that contributed to the latest tree when generating the subsequent trees. Forest-PA, moreover, has a mechanism to gradually increase the weights of attributes that have not been tested in the subsequent tree(s) [42].

3.3.5. Hoeffding Tree

HT is a decision tree induction technique for streaming data. The name derives from the Hoeffding bound that is utilized during tree generation. The basic idea is that the Hoeffding bound provides a specified level of confidence in the best attribute on which to split the tree, which serves as the basis for building an accurate model [39].
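For reference, the Hoeffding bound states that, with probability $1 - \delta$, the true mean of a random variable with range $R$ differs from its empirical mean over $n$ independent observations by at most

$\epsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2n}},$

so a node can be split as soon as the observed advantage of the best splitting attribute over the second best exceeds $\epsilon$.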

3.3.6. Decision Tree (J48)

This is the basic C4.5 Decision Tree (DT) used for classification problems [37]. It uses the gain ratio, a variation of information gain (IG), which is usually employed to overcome IG's bias. The attribute with the maximum gain ratio is selected as the splitting attribute when growing the tree. A gain ratio (GR) based DT performs better than an IG-based one in terms of accuracy [43].
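For clarity, the gain ratio used by C4.5/J48 normalizes the information gain of an attribute $A$ by its split information:

$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)}, \qquad \mathrm{SplitInfo}(A) = -\sum_{i=1}^{v}\frac{|D_i|}{|D|}\log_2\frac{|D_i|}{|D|},$

where $D_1, \ldots, D_v$ are the partitions of the training data $D$ induced by the $v$ values of $A$; this normalization counteracts IG's bias toward attributes with many distinct values.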

3.3.7. Logistic Model Tree

LMTs are classification trees with logistic regression functions at the leaves. This technique can deal with binary and multiclass target variables, nominal and numeric attributes, and missing values. LMT is a classification model with an associated supervised training technique that combines decision tree learning and logistic regression. A logistic model tree uses a decision tree that has regression models at its leaves to deliver a piecewise model [39].

3.3.8. Random Forest

RF is an ensemble technique that constructs a so-called forest of decision trees using a randomized variant of tree induction [44]. RF works by building a multitude of decision trees at training time and outputting the class that is the mode of the classes output by the individual trees. It is considered one of the foremost techniques and is highly proficient for both classification and regression problems [45].
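As a minimal sketch of this bagging-plus-voting idea (illustrative only; the study uses Weka's RandomForest implementation, and scikit-learn's RandomForestClassifier provides the full algorithm), bootstrap-trained trees can be combined by majority vote as follows, assuming X and y are NumPy arrays with y encoded as 0/1.

# Minimal bagging-plus-majority-voting sketch of the random forest idea
# (illustrative only; not the study's Weka configuration).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, random_state=0):
    rng = np.random.default_rng(random_state)
    trees = []
    n = len(X)
    for _ in range(n_trees):
        idx = rng.integers(0, n, n)                         # bootstrap sample
        tree = DecisionTreeClassifier(max_features="sqrt")  # random feature subset per split
        tree.fit(X[idx], y[idx])
        trees.append(tree)
    return trees

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees])         # shape: (n_trees, n_samples)
    # Majority vote: the class output most often across the trees.
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])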

3.3.9. Random Tree

RT is a supervised, ensemble-style learning technique that creates numerous individual learners. It uses a bagging-like idea to construct a random sample of the data for building a tree. In a standard tree, each node is split using the best split among all variables, whereas in a random forest, each node is split using the best among a subset of predictors randomly selected at that node [46].

3.3.10. REP-Tree

REP-T builds a decision/regression tree that produces multiple trees in different iterations and then chooses the best one among them. REP-T constructs the tree using entropy as an impurity measure and prunes it using reduced-error pruning. It sorts the values of numeric attributes only once [46].

4. Experimental Study

This section provides an experimental study for SDP employing the ten ML techniques with the 10-fold cross-validation method, which is a standard approach for assessment [47]. 10-fold cross-validation splits the complete data into ten subsets of equal size; one subset is used for testing, while the others are used for training, and this process is repeated until each subset has been used for testing. The overall experiments are divided into three phases, referred to as experimental scenarios 1, 2, and 3. Experimental scenario 1 presents the analysis of correctly classified instances (CCI), incorrectly classified instances (ICI), the true-positive rate (TPR), and the false-positive rate (FPR), while experimental scenario 2 presents the performance analysis of the absolute and squared error rates, i.e., MAE, RAE, RMSE, and RRSE. Experimental scenario 3 describes the performance achieved using accuracy-based measures, i.e., specificity, precision, recall, F-measure, G-measure, MCC, and accuracy.
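As a hedged sketch of this evaluation procedure (the study runs it inside Weka; the classifier, data arrays, and metrics below are illustrative placeholders), stratified 10-fold cross-validation can be expressed as follows.

# Sketch of 10-fold cross-validation as used for all experiments
# (illustrative only; the study performs this inside Weka 3.9).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, mean_absolute_error
from sklearn.model_selection import StratifiedKFold

def cross_validate(X, y, model=None, folds=10, seed=1):
    model = model or RandomForestClassifier(random_state=seed)
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    accs, maes = [], []
    for train_idx, test_idx in skf.split(X, y):
        model.fit(X[train_idx], y[train_idx])       # train on nine folds
        pred = model.predict(X[test_idx])           # test on the held-out fold
        accs.append(accuracy_score(y[test_idx], pred))
        # MAE here treats the 0/1 class labels as numeric values.
        maes.append(mean_absolute_error(y[test_idx], pred))
    return np.mean(accs), np.mean(maes)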

4.1. Experimental Scenario 1: Instances Classification and True-Positive and False-Positive Rates

This section presents the experiments carried out to find the correctly and incorrectly classified instances, as well as the true-positive and false-positive rates of each classifier on each individual dataset. Tables 4 and 5 show the CCI and ICI analyses, while Tables 6 and 7, respectively, present the TPR and FPR values of each technique on each dataset. In each of these tables, the first row lists the datasets utilized, while the first column lists the techniques employed; the remaining cells give the outcomes for CCI, ICI, TPR, and FPR, respectively.

The observation from CCI shows that RF correctly classifies the most instances on five datasets, namely AR3, PC1, PC2, PC3, and PC4; CDT and HT do the same for three datasets each, DS and REP-T for two datasets each, while CS-Forest and RT do so for only one dataset each. In the case of ICI, each technique shows the same relative performance, since CCI and ICI are complements of each other. Figure 2 shows the overall analysis of CCI and ICI. However, the situation changes for TPR and FPR. For TPR, RF shows the best performance on five datasets; CDT, DS, HT, and LMT perform best on three datasets each; REP-T on two datasets; and Forest-PA and J48 on only one dataset each. For FPR, Forest-PA performs well on four datasets; DS, HT, and REP-T perform best on three datasets each; CDT and RF show the best performance on two datasets each; and J48 and LMT on only one dataset each. The overall analysis shows that RF performs well on CCI, ICI, and TPR, while on FPR, Forest-PA outperforms the other techniques. Figure 3 shows the overall analysis of TPR and FPR.

4.2. Experimental Scenario 2 (Error Rate Analysis and Results Discussion)

This section describes the error rates achieved by the TF-ML techniques on the different datasets. Tables 8 and 9 show the absolute errors MAE and RAE, respectively. Considering MAE first, J48 outperformed the other techniques and achieved the best results on five datasets; RT achieved the best results on four datasets and RF on two datasets, while CDT, DS, Forest-PA, and HT performed best on only one dataset each. Considering RAE, J48 likewise outperformed the other techniques, achieving the best results on four datasets, RT on three datasets, RF on two, and HT on one dataset. Figures 4 and 5 show the error bars with standard deviations for MAE and RAE, respectively.

An error bar is a graphical representation of the variability of data and is used on graphs to indicate the uncertainty or error in a reported measurement. It provides an overall indication of how accurate a measurement is or, conversely, how far from the reported value the true (error-free) value might be. Here, these analyses show that J48 and RT perform best at reducing the absolute error rates.

Tables 10 and 11 present the analysis of the squared errors, RMSE and RRSE. In each table, the first row lists the datasets, the first column lists the employed techniques, and the remaining cells show the outcome of each technique on each dataset. On both squared error measures, RMSE and RRSE, RF achieved better results on five datasets, namely AR3, KC3, PC1, PC3, and PC4. LMT outperforms the other techniques on two datasets, while CDT, DS, Forest-PA, and REP-T do so on only one dataset each. Figures 6 and 7 show the error bars with standard deviations for RMSE and RRSE. These outcomes demonstrate that RF performs best at reducing the squared error rates.

4.3. Experimental Scenario 3 (Accuracy Analysis and Results Discussion)

To measure the performance of any technique, accuracy is considered one of the most important evaluation measures. In this section, we present the different accuracy-related measurements, namely specificity, precision, recall, F-measure, G-measure, MCC, and accuracy. All these measurements depend on the values of the confusion matrix shown in Table 12. There are two classes into which a prediction can fall, i.e., class 1 (positive) and class 2 (negative); class 1 indicates that the software module contains a defect, while class 2 indicates that it does not. Here, TP counts the modules predicted positive that actually have the defect; FP counts the modules predicted positive that do not actually have the defect (also called a type 1 error); FN counts the modules predicted negative that actually do have the defect (also known as a type 2 error); and TN counts the modules predicted negative that indeed do not have the defect.

Table 13 presents the specificity assessment of all employed techniques on the various datasets. In this table, column one lists the techniques, while the remaining columns give the specificity achieved on each dataset. In place of some values there is the message “#DIV/0!”, which is due to a “0” value in the confusion matrix: whenever an equation requires division and the denominator is “0”, the result is undefined, producing this message. In the tables, we use “?” instead of “#DIV/0!”.

The analysis of Table 13 shows that, in terms of specificity, DS performs best on AR3 and KC3, J48 on AR1 and PC4, LMT on KC2 and MW1, and RF on CM1 and PC3, while CS-Forest and Forest-PA outperform the other techniques on PC2 and PC1, respectively.

Table 14 presents the overall analysis of the precision achieved by the TF-ML techniques on each dataset. The outcomes show that, on a given dataset, several techniques may perform equally well. On the AR1 dataset, three techniques, namely CDT, HT, and REP-T, perform best, while on the CM1, PC1, and PC3 datasets, DS and HT produce the same results, and on the PC2 dataset, seven techniques show the same results, outperforming the remaining three. On the AR3, KC2, KC3, MW1, and PC4 datasets, CDT, LMT, DS, HT, and J48, respectively, outperform the other techniques.

Table 15 shows the recall assessment of each technique. The analysis shows that CDT produces better results only on the KC3 dataset, while RF does so only on the AR3 dataset. However, CS-Forest outperforms the other techniques, showing the best performance on eight datasets, namely AR1, CM1, KC2, MW1, PC1, PC2, PC3, and PC4. This analysis recommends the CS-Forest technique with respect to recall. For a better understanding of the analysis of specificity, precision, and recall, Figures 8–10 show the error bars and standard deviation lines for these measures, respectively.

The F-measure analysis is presented in Table 16, where RF generates better results on four datasets, DS and HT on three datasets each, and CDT, J48, LMT, and REP-T on two datasets each, while Forest-PA outperforms the other techniques on only one dataset. On the PC2 dataset, seven techniques produce the same results, as was also the case for precision and MAE. A question may arise as to why so many techniques produce the same good results on the PC2 dataset. The answer is that PC2 contains a very small proportion of defective modules, only 0.5%, whereas none of the remaining datasets has fewer than 6.9% defective modules, which is why the performance of the techniques differs on those datasets.

Table 17 shows the G-measure results. The outcome assessment shows that CS-Forest, LMT, and RF each perform best on two datasets, namely PC2 and PC4, KC2 and MW1, and CM1 and PC3, respectively, while CDT, DS, Forest-PA, and J48 beat the other techniques on the KC3, AR3, PC, and AR1 datasets, respectively. Furthermore, Table 18 presents the analysis of the MCC measurement achieved by the utilized TF-ML techniques. The outcomes show that, on the CM1, KC2, PC2, and PC3 datasets, CS-Forest outperforms the other techniques. On the KC3 dataset, the performance of DS is better than that of the rest of the utilized techniques, while on the AR1 and PC4 datasets, J48 beats the other employed techniques. Moreover, the performance of LMT is better than that of the other techniques on the MW1 dataset, while on the AR3 and PC1 datasets, RF outperforms the other techniques.

Accuracy assessments of all employed TF-ML techniques are shown in Table 19. The outcomes achieved on the AR1, KC3, and PC2 datasets show that CDT outperforms the other techniques in terms of achieving higher accuracy. Moreover, on the AR1 and PC2 datasets, REP-T also outperforms the other techniques, achieving the same results as CDT. On the PC2 dataset, CDT and REP-T, as well as DS, Forest-PA, J48, and RF, produce the same results, owing to the very small number of defective modules in this dataset. On the CM1 and KC3 datasets, DS beats the other employed techniques in terms of accuracy, and on the CM1 dataset, HT also produces the same results as DS. Furthermore, on AR3, PC1, PC3, PC4, and also PC2, RF achieves higher accuracy than the other utilized techniques. The overall accuracy analysis suggests RF as the best technique in terms of accuracy. Moreover, the error bars and standard deviation lines for the F-measure, G-measure, MCC, and accuracy are presented in Figures 11–14, respectively. The average accuracy of each technique across the utilized datasets is shown in Figure 15.

5. Discussion on Overall Performance

A popular way to compare the overall performance of classifiers is to count the number of datasets on which an algorithm is the overall winner, also known as the Count of Wins test. We used 10 datasets, and no technique gave the best results on at least 10 datasets at α = 0.05, according to the critical values in Table 3 of [48]. Since the Count of Wins test is also considered a weak testing procedure, we provide a detailed matrix in Table 20 with respect to Scenario 2. Table 20 shows the detailed analysis of all evaluated absolute and squared error rates. For the absolute errors, MAE and RAE, on the employed datasets and TF-ML techniques, the results conclude that J48 and RT outperform the other techniques. However, for the squared errors, RMSE and RRSE, the RF technique beats the rest of the techniques in terms of reducing the squared error. Moreover, Table 21 shows the overall outcomes of experimental scenario 3. For specificity, the outcomes show the best performance of J48, DS, RF, and LMT on two datasets each. For precision, HT beats the other techniques on five datasets and DS on four datasets, while CDT beats the others on only two datasets. Considering recall, CS-Forest shows the best performance on eight datasets; for F-measure, RF outperforms the rest of the techniques; and for G-measure, RF, LMT, and CS-Forest perform best on two datasets each. Considering MCC, CS-Forest outperforms on four datasets and J48 and RF on two datasets each. Finally, regarding accuracy, RF produces the best results on five datasets, CDT on three datasets, and DS on two datasets. Overall, the performance of RF, in terms of both error rate and accuracy, is better than that of the rest of the utilized TF-ML techniques.

Generally, the more trees in the forest, the more robust the forest; likewise, in the RF classifier, a higher number of trees in the forest tends to give higher accuracy. In other words, RF, as an ensemble of numerous trees, is well suited to handling categorical data, and the ultimate decision is reached by majority voting over the outcomes of the respective trees [33, 45, 49]. RF not only delivers a binary classification of data points but also provides the probability of each instance belonging to the defective or nondefective category [50]. It is considered one of the most powerful techniques, as it is extremely proficient at both regression and classification [44].

5.1. Friedman Two-Way Analysis of Variance by Ranks

To compare all applied ML techniques on multiple datasets, we applied the statistical procedures described by Sheskin [51] and García and Herrera (2008) [52]. The Friedman two-way analysis of variance by ranks (Friedman (1937) [53]) is adopted for rank-order data in a hypothesis testing situation. A significant test result indicates that there is a significant difference between at least two of the techniques in the set of k techniques. The Friedman test checks whether the measured average ranks are significantly different from the mean rank (in our case, Rj = 3.96). The chi-square (χ²) distribution is used to approximate the Friedman test statistic [51]. Friedman’s statistic is χ²_r = 139.7985.

To reject the null hypothesis, the computed value must be equal to or greater than the tabled critical chi-square value at the prespecified level of significance [51]. The number of degrees of freedom is df = k − 1; thus, df = 10 − 1 = 9. For df = 9, the tabled critical chi-square value at α = 0.05 is χ²_0.05 = 16.92. Since the computed value χ²_r = 139.7985 is greater than χ²_0.05 = 16.92, the alternative hypothesis is supported at α = 0.05. It can be concluded that there is a significant difference among at least two of the ten ML techniques. This result can be summarized as follows: χ²_r(9) = 139.7985, p < 0.05.

Since the critical value is lower than the computed χ²_r, we can continue with the post hoc tests to identify the significant pairwise differences among all the techniques. The results are shown in Table 22, where z is the corresponding statistic and p is the corresponding value for each hypothesis. z is computed using the following equation:

$z = \frac{R_i - R_j}{SE},$

where R_i is the average rank of the ith technique and the standard error is SE = 0.175. The fifth and sixth columns give the adjusted significance levels of Nemenyi’s and Holm’s procedures. The second-to-last column lists the differences between the average ranks of the ith and jth techniques. The last column shows the critical difference (CD): the performance of two techniques is significantly different if the corresponding average ranks differ by at least the CD, which can be calculated using the following equation:

$CD = q_{\alpha}\sqrt{\frac{k(k+1)}{6N}},$

where the critical value q_α is given in Table 5(b) of Demšar (2006) [48]. The notations “significant” and “insignificant” indicate whether the difference in average ranks (R_i − R_j) is greater or less than the value of CD, respectively; greater means a significant difference between the two techniques. Here, CD = 0.485.

In Table 22, the family of hypotheses is ordered by their p values. As can be seen, Nemenyi’s procedure rejects the first 28 hypotheses, whereas Holm’s procedure also rejects the next 3 hypotheses, since the corresponding p values are smaller than the adjusted α values of the Nemenyi and Holm procedures. Therefore, we conclude that the performance of HT and LMT is comparable and that RT resulted in a lower performance. Besides, the obtained value CD = 0.485 indicates that any difference between the average ranks of two techniques that is equal to or greater than 0.485 is significant. Concerning the pairwise comparisons in Table 22, the differences between the average ranks that are greater than CD = 0.485 are those of the first 33 pairs. Thus, it can be concluded that there is a significant difference between the average ranks of the first 33 pairs of techniques.
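A hedged sketch of how these statistics can be reproduced is given below; the scores array is a placeholder, and the actual ranks, p values, and CD come from the paper's own computation reported in Table 22.

# Sketch of the Friedman test and the critical difference computation
# (illustrative only; the paper's own ranks and p values are in Table 22).
import numpy as np
from scipy.stats import friedmanchisquare

# scores[d, t]: performance of technique t on dataset d (placeholder values).
scores = np.random.rand(10, 10)            # 10 datasets x 10 techniques

# Friedman two-way analysis of variance by ranks.
stat, p_value = friedmanchisquare(*scores.T)
print(f"Friedman chi-square = {stat:.4f}, p = {p_value:.4g}")

# Critical difference: CD = q_alpha * sqrt(k * (k + 1) / (6 * N)).
k, N = 10, 10                              # number of techniques and datasets
q_alpha = 3.164                            # critical value for k = 10, alpha = 0.05 (Demsar 2006)
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * N))
print(f"Critical difference CD = {cd:.3f}")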

5.2. Threats to Validity

In this section, we discuss the factors that may threaten the validity of this research work.

5.2.1. Internal Validity

The analysis in this paper is grounded on diverse, well-known evaluation measures that have been used in various past studies. Among these measures, several assess the error rate, while others assess accuracy. The threat is that replacing the utilized measures with new evaluation measures could change, and possibly decrease, the reported accuracy. Furthermore, the techniques used in this research could be supplanted by newer techniques, possibly hybridized with each other, that might produce better outcomes than the employed techniques.

5.2.2. External Validity

We conducted investigations on various datasets. A threat to validity may arise if the evaluated techniques are applied to other real data collected from different software development organizations (e.g., through surveys) or if these datasets are replaced with other datasets, which may affect the outcomes and increase the error rates. Likewise, the evaluated techniques may not be capable of producing better predictions on some other SDP datasets. Hence, this study concentrated on the AR1, AR3, CM1, KC2, KC3, MW1, PC1, PC2, PC3, and PC4 datasets to measure the performance of the utilized techniques.

5.2.3. Construct Validity

In this study, diverse TF-ML techniques are benchmarked against each other on various datasets based on several assessment measures. The techniques utilized in this study were selected on the basis of their advanced features relative to other techniques that researchers have exploited over the last decades. However, the threat is that if other new techniques were applied, they might outperform the evaluated techniques. Furthermore, any change in the dataset splitting (increasing or decreasing the number of K-folds) may change the current outcomes. It is also possible that using newer evaluation measures would create improved outcomes that beat the currently achieved ones.

6. Conclusion

Nowadays, SDP using ML techniques is regarded as one of the emerging research areas. Identifying software defects at the early stages of the SDLC is a challenging task, yet it can contribute to the delivery of high-quality software systems. This paper considered ten widely used, publicly available datasets to compare ten well-known TF-ML techniques, CDT, CS-Forest, DS, Forest-PA, HT, J48, LMT, RF, RT, and REP-T, which are broadly used for SDP. The performance is evaluated using different measures, namely MAE, RAE, RMSE, RRSE, specificity, precision, recall, FM, GM, MCC, and accuracy. The overall results of this paper recommend the RF technique, which provides the best results in terms of both reducing error rates and increasing accuracy on five datasets, namely AR3, PC1, PC2, PC3, and PC4, where the accuracy rates for these datasets are 92.0635%, 93.688%, 99.5885%, 90.1472%, and 90.6722%, respectively. However, CDT and DS are best in terms of accuracy on three datasets each: CDT achieves accuracies of 92.562%, 81.9588%, and 99.5885% on AR1, KC3, and PC2, respectively, while DS achieves accuracies of 90.1606%, 81.9588%, and 99.5885% on CM1, KC3, and PC2, respectively.

The outcomes presented in this research can be used as a baseline for other studies and researchers, so that the outcomes of any proposed technique, model, or framework can be benchmarked and easily verified. For future work, the class imbalance issue in these datasets should be addressed. Furthermore, to increase performance, feature selection and ensemble learning techniques should also be explored.

Data Availability

The datasets used in this research are taken from the UCI Machine Learning Repository, available at https://archive.ics.uci.edu/.

Conflicts of Interest

The authors declare that they have no conflicts of interest related to this study.

Acknowledgments

The authors would like to thank and acknowledge the support provided by King Saud University, Riyadh, Saudi Arabia, through the Researchers Supporting Project number RSP-2020/184.