Abstract

The Adaptive Boosting (AdaBoost) classifier is a widely used ensemble learning framework that achieves good classification results on general datasets. However, it is difficult to apply the AdaBoost classifier directly to pulmonary nodule detection when only part of the lung CT images are labeled, because the ensemble learning method still has some drawbacks. Therefore, to solve the classification problem for labeled and unlabeled data, a semi-supervised AdaBoost classifier using an improved sparrow search algorithm (AdaBoost-ISSA-S4VM) is established. First, the AdaBoost classifier is used to construct a strong semi-supervised classifier from several weak S4VM classifiers (AdaBoost-S4VM). Next, to improve the accuracy of AdaBoost-S4VM, the sparrow search algorithm (SSA) is introduced to optimize the AdaBoost classifier and S4VM. Then, the sine cosine algorithm and a new labor cooperation structure are adopted to improve the global search ability and the convergence performance of the sparrow search algorithm, respectively. Furthermore, the AdaBoost-S4VM classifier is improved on the basis of the improved sparrow search algorithm. Finally, the improved AdaBoost-ISSA-S4VM classification model is applied to pulmonary nodule detection on the publicly available LIDC-IDRI database. The experimental results show that the AdaBoost-ISSA-S4VM classification model performs well on labeled and unlabeled lung CT images.

1. Introduction

Pulmonary nodule detection is a classification problem. Detecting pulmonary nodules in lung CT images is key to the diagnosis of lung cancer. In practical lung CT image recognition, classification accuracy and the acquisition of image labels are crucial issues. To address the accuracy problem, ensemble learning is introduced. After 30 years of development, ensemble learning has been applied in many fields of machine learning [1, 2] and is considered one of the effective ways to improve classification accuracy. In 1996, Breiman proposed the Bagging algorithm [3], which is similar to the Boosting algorithm [4]. Boosting is one of the many algorithm families in the field of ensemble learning. However, the Boosting algorithm is difficult to apply to practical problems because it requires the generalization lower bound of the "weak" learning algorithm to be known in advance. To solve this problem, Freund and Schapire proposed the well-known adaptive boosting (AdaBoost) algorithm in 1997 [5]. Compared with the Boosting algorithm, AdaBoost is more robust and more widely applicable, and it further promoted the development of ensemble learning. In many studies, the AdaBoost classifier is combined with different "weak" classifiers, such as the support vector machine (SVM) [6] and the long short-term memory (LSTM) network [7].

Although the AdaBoost classifier has been applied successfully in many fields with competitive accuracy, it shows a major weakness when part of the lung CT image labels are missing: it cannot classify a mixture of labeled and unlabeled lung CT images well on its own. To solve this problem, a hybrid ensemble learning approach that combines the AdaBoost algorithm with the safe semi-supervised support vector machine (S4VM) [8] is proposed for pulmonary nodule detection. In addition, the performance of the approach is mainly affected by the key parameters of the model. The weight of each weak classifier needs to be optimized in AdaBoost-S4VM, and the S4VM optimization involves two main hyper-parameters: the regularization parameter $C_1$, which trades off the model complexity against the empirical error on labeled data, and the parameter $C_2$, which does the same for unlabeled data. The parameters of S4VM are usually tuned by 10-fold cross-validation, which cannot adapt to the data automatically, makes it difficult to set the parameter range, and easily falls into a local optimum. To overcome these shortcomings, many studies have proposed using intelligent optimization algorithms to optimize the key parameters of the S4VM model. These algorithms include quasi-Newton algorithms [9], the cuckoo search algorithm (CS) [10], and beetle antennae search (BAS) [11]. Because the sparrow search algorithm (SSA) [12] has a simple principle, strong mining capacity, and few adjustable parameters, an improved SSA is used in this research to optimize the key parameters of AdaBoost-S4VM and S4VM and thereby improve the classification performance of AdaBoost-S4VM. In addition, hybrid strategies, one of the main research directions for improving the performance of swarm intelligence algorithms, have become a research hotspot in machine learning. Rasoulizadeh et al. [13] modified the local RBF-generated finite difference (RBF-FD) method based on local stencil nodes, which yields a sparse system and overcomes the dense and ill-conditioned system. Rashidinia et al. [14] proposed two meshless collocation methods based on the radial basis function-generated finite difference (RBF-FD) and global RBF (GRBF) methods, and the simulation results showed that the proposed approach was viable and effective. Can et al. [15] modified the idea of interpolation by radial basis functions, and the obtained results show that the proposed method provides valid and accurate results and outperforms its counterparts.

In this study, in view of the shortcomings of the AdaBoost algorithm, S4VM is introduced as the weak classifier. Then, SSA is used to optimize the parameters of AdaBoost-S4VM. However, SSA easily falls into local optima and performs poorly on complex optimization problems. Because the sine cosine algorithm (SCA) [16] has strong search ability and can avoid local optima, we first introduce SCA to improve the global search capability of SSA. Additionally, to enhance the convergence ability of SSA, the labor cooperation structure of the sparrows is redefined. Based on the new labor cooperation structure and SCA, the improved cooperative sparrow search algorithm based on the sine cosine algorithm (SCA-CSSA) is proposed. SCA-CSSA is used to optimize the weights of AdaBoost-S4VM and the key parameters of S4VM in order to improve the accuracy of the AdaBoost-S4VM model for semi-supervised lung CT classification. In this way, the improved semi-supervised AdaBoost classification model using an improved sparrow search algorithm (AdaBoost-ISSA-S4VM) is established. To verify the effectiveness of the proposed approach, SCA-CSSA is first compared with several hybrid algorithms and popular algorithms on the CEC2017 functions and 12 benchmark functions, including unimodal and multimodal functions. In addition, the AdaBoost-ISSA-S4VM model is compared with supervised and semi-supervised classifiers. Experimental results show that the proposed improved machine learning model has better stability and higher prediction accuracy.

2. The Basic Method

2.1. Adaptive Boosting (AdaBoost)

The AdaBoost classifier algorithm is implemented by changing the data distribution. It determines the weight of each sample according to whether the sample was classified correctly in the current round of training and according to the accuracy of the previous overall classification. It then sends the dataset with the modified weights to the next classifier for training, and so on. Finally, according to the calculated weights, the desired "strong classifier," named AdaBoost-S4VM, is obtained by combining a series of "weak classifiers." The strong learning algorithm is defined as follows:

$$H(x) = \operatorname{sign}\left(\sum_{k=1}^{K} \alpha_k h_k(x)\right), \tag{1}$$

where $h_k(x)$ is the basic weak classifier, $\alpha_k$ is the corresponding weight of each weak classifier, and $K$ is the total number of basic weak classifiers.

The weight coefficient $\alpha_k$ of $h_k(x)$ is calculated as

$$\alpha_k = \frac{1}{2} \ln \frac{1 - e_k}{e_k}, \tag{2}$$

where $e_k$ is the error rate of $h_k(x)$ on the training set, that is, $e_k = P\left(h_k(x_i) \neq y_i\right) = \sum_{i=1}^{N} w_{k,i}\, I\left(h_k(x_i) \neq y_i\right)$, with $w_{k,i}$ the weight of the $i$th training sample in round $k$.
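The weight calculation can be illustrated with a short Python sketch. It assumes the standard AdaBoost rule, in which the weight of a weak classifier is half the log-odds of its weighted error rate; the function names and the clipping constant `eps` are illustrative choices and are not taken from the original implementation.

```python
import math


def weighted_error(weights, predictions, labels):
    """Weighted error rate: total weight of misclassified samples / total weight."""
    total = sum(weights)
    wrong = sum(w for w, p, y in zip(weights, predictions, labels) if p != y)
    return wrong / total


def weak_classifier_weight(error_rate, eps=1e-12):
    """Standard AdaBoost weight alpha = 0.5 * ln((1 - e) / e).

    The error rate is clipped to [eps, 1 - eps] (an assumption made here)
    so that a perfect or a totally wrong classifier does not divide by zero.
    """
    e = min(max(error_rate, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - e) / e)


# Example: four equally weighted samples, one of them misclassified.
e = weighted_error([0.25] * 4, [1, 1, -1, -1], [1, -1, -1, -1])
alpha = weak_classifier_weight(e)
```

A weak classifier with an error rate of 0.5 (random guessing) receives weight 0, and the weight grows as the error rate approaches 0.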

Since we explore the discrimination of features by training weak classifiers and organize the AdaBoost classifiers in a cascade, we require simple weak classifiers with which the target of the cascade-AdaBoost classifier can be easily controlled. Thus, simple threshold classifiers are chosen as weak classifiers. Each weak classifier is defined by two thresholds, which can be obtained by a semi-exhaustive search technique.

2.2. Safe Semi-Supervised Support Vector Machines (S4VM)

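The core idea of S4VM is that, when several different large-margin separators are consistent with the data, it optimizes the label assignment of the unlabeled samples so that the performance improvement over a support vector machine trained only on the labeled samples is maximized in the worst case. For a single separator $f$ and a label assignment $\hat{y}$ of the unlabeled samples, the objective function on which S4VM builds is as follows:

$$h(f, \hat{y}) = \frac{\|f\|_{\mathcal{H}}^{2}}{2} + C_1 \sum_{i=1}^{l} \ell\left(y_i, f(x_i)\right) + C_2 \sum_{j=1}^{u} \ell\left(\hat{y}_j, f(\hat{x}_j)\right), \tag{4}$$

where $l$ and $u$ are the numbers of labeled and unlabeled samples, $\ell$ is the hinge loss, and $C_1$ and $C_2$ weight the losses on labeled and unlabeled samples, respectively.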

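Its goal is to find multiple large-margin low-density separators $\{f_t\}_{t=1}^{T}$ and the corresponding label assignments $\{\hat{y}_t\}_{t=1}^{T}$ such that the following functional is minimized:

$$\min_{\{f_t, \hat{y}_t\}_{t=1}^{T}} \; \sum_{t=1}^{T} h(f_t, \hat{y}_t) + M\, \Omega\left(\{\hat{y}_t\}_{t=1}^{T}\right), \tag{5}$$

where $h(f_t, \hat{y}_t)$ is the semi-supervised SVM objective of the separator $f_t$ under the label assignment $\hat{y}_t$, $T$ is the number of separators, $\Omega$ is a penalty on the diversity of the separators, and $M$ is a large constant enforcing large diversity.

$\Omega$ is a sum of pairwise terms. Here it is defined as $\Omega\left(\{\hat{y}_t\}_{t=1}^{T}\right) = \sum_{1 \le t \neq \tilde{t} \le T} I\left(\hat{y}_t' \hat{y}_{\tilde{t}} / u \ge 1 - \varepsilon\right)$, where $I$ is the indicator function and $\varepsilon \in [0, 1]$ is a constant; other penalty quantities are also applicable. Without loss of generality, suppose that $f$ is a linear model defined as $f(x) = w'\phi(x) + b$, where $\phi(x)$ is a feature mapping induced by the kernel $k$. Then, the optimization problem to be solved can be expressed as follows:

$$\begin{aligned} \min_{\{w_t, b_t, \hat{y}_t\}_{t=1}^{T}} \quad & \sum_{t=1}^{T} \left( \frac{1}{2}\|w_t\|^{2} + C_1 \sum_{i=1}^{l} \xi_i + C_2 \sum_{j=1}^{u} \hat{\xi}_j \right) + M \sum_{1 \le t \neq \tilde{t} \le T} I\left(\frac{\hat{y}_t' \hat{y}_{\tilde{t}}}{u} \ge 1 - \varepsilon\right) \\ \text{s.t.} \quad & y_i\left(w_t'\phi(x_i) + b_t\right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \\ & \hat{y}_{t,j}\left(w_t'\phi(\hat{x}_j) + b_t\right) \ge 1 - \hat{\xi}_j, \quad \hat{\xi}_j \ge 0, \\ & \forall i = 1, \ldots, l, \quad \forall j = 1, \ldots, u, \quad \forall t = 1, \ldots, T, \end{aligned} \tag{6}$$

where $\hat{y}_{t,j}$ refers to the $j$th entry of $\hat{y}_t$. Formula (6) is nonconvex, and in the following, we will present two solutions. It is evident that this can also be implemented by other solutions, especially those based on efficient S3VMs.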
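To make the diversity penalty concrete, the following Python sketch counts the ordered pairs of label assignments that are too similar. It assumes the pairwise form used in the S4VM literature, in which two assignments are penalized when their normalized inner product is at least 1 minus a tolerance; the function name, the default tolerance `eps=0.1`, and the toy assignments are illustrative assumptions.

```python
def diversity_penalty(assignments, eps=0.1):
    """Count ordered pairs (t, t~), t != t~, of label assignments that are too similar.

    Each assignment is a list of +1/-1 labels for the same u unlabeled samples.
    A pair is penalized when (y_t . y_t~) / u >= 1 - eps.
    """
    u = len(assignments[0])
    penalty = 0
    for a, y_a in enumerate(assignments):
        for b, y_b in enumerate(assignments):
            if a == b:
                continue
            agreement = sum(p * q for p, q in zip(y_a, y_b)) / u
            if agreement >= 1.0 - eps:
                penalty += 1
    return penalty


# Two identical assignments and one different assignment:
# only the identical pair is penalized, once in each order.
toy = [[1, 1, -1, -1], [1, 1, -1, -1], [1, -1, 1, -1]]
omega = diversity_penalty(toy)
```

Multiplying this count by a large constant, as in the functional above, pushes the optimizer toward label assignments that differ from each other.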

2.3. Sparrow Search Algorithm (SSA)

The sparrow search algorithm (SSA) was originally proposed by Xue et al. [12]. The algorithm imitates the foraging and anti-predation behavior of sparrows in nature to solve optimization problems. In SSA, the position of each sparrow in the population is a candidate solution of the given optimization problem. According to the mathematical model of SSA, the sparrows are divided into three roles: producers, scroungers, and sparrows at the edge of the group.

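According to the rules for producers, the producers can search for food in a broader range of places than the scroungers, and they change their behavior once a sparrow detects a predator. The location of the producer is updated as follows:

$$X_{i,j}^{t+1} = \begin{cases} X_{i,j}^{t} \cdot \exp\left(\dfrac{-i}{\alpha \cdot iter_{\max}}\right), & R_2 < ST, \\ X_{i,j}^{t} + Q \cdot L, & R_2 \ge ST, \end{cases} \tag{7}$$

where $t$ indicates the current iteration and $j = 1, 2, \ldots, d$. $X_{i,j}^{t}$ represents the value of the $j$th dimension of the $i$th sparrow at iteration $t$. $iter_{\max}$ is a constant equal to the largest number of iterations, $\alpha \in (0, 1]$ is a random number, and $R_2 \in [0, 1]$ and $ST \in [0.5, 1]$ represent the alarm value and the safety threshold, respectively. $Q$ is a random number that obeys the normal distribution, and $L$ is a $1 \times d$ matrix in which each element is 1.

According to the rules for producers and scroungers, the position update formula for the scrounger is described as follows:

$$X_{i,j}^{t+1} = \begin{cases} Q \cdot \exp\left(\dfrac{X_{worst}^{t} - X_{i,j}^{t}}{i^{2}}\right), & i > n/2, \\ X_{P}^{t+1} + \left|X_{i,j}^{t} - X_{P}^{t+1}\right| \cdot A^{+} \cdot L, & \text{otherwise}, \end{cases} \tag{8}$$

where $X_{P}$ is the optimal position occupied by the producer, $X_{worst}$ denotes the current global worst location, $A$ represents a $1 \times d$ matrix in which each element is randomly assigned $1$ or $-1$, and $A^{+} = A^{T}\left(AA^{T}\right)^{-1}$. When $i > n/2$, the $i$th scrounger, which has a worse fitness value, is most likely to be starving.

According to the rule for the sparrows at the edge of the group, their mathematical model can be expressed as follows:

$$X_{i,j}^{t+1} = \begin{cases} X_{best}^{t} + \beta \cdot \left|X_{i,j}^{t} - X_{best}^{t}\right|, & f_i > f_g, \\ X_{i,j}^{t} + K \cdot \left(\dfrac{\left|X_{i,j}^{t} - X_{worst}^{t}\right|}{(f_i - f_w) + \varepsilon}\right), & f_i = f_g, \end{cases} \tag{9}$$

where $X_{best}$ is the current global optimal location. $\beta$, the step size control parameter, is a normally distributed random number with a mean of $0$ and a variance of $1$, and $K \in [-1, 1]$ is a random number. Here, $f_i$ is the fitness value of the present sparrow, $f_g$ and $f_w$ are the current global best and worst fitness values, respectively, and $\varepsilon$ is a very small constant that avoids division by zero.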
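The producer rule and the rule for the sparrows at the edge of the group can be sketched in Python as below. The sketch assumes the update rules of the original SSA formulation (exponential shrinking or a Gaussian jump for producers, and a move relative to the best or worst position for edge sparrows); the function names, the per-dimension sampling of the step size, and the default constant `eps` are illustrative assumptions.

```python
import math
import random


def producer_update(x, rank, iter_max, r2, st, rng):
    """Producer step: x is a position (list of floats), rank is the 1-based index i."""
    if r2 < st:
        # No predator detected: shrink the position by exp(-i / (alpha * iter_max)).
        alpha = rng.uniform(1e-12, 1.0)  # random number in (0, 1]
        factor = math.exp(-rank / (alpha * iter_max))
        return [xj * factor for xj in x]
    # Alarm raised: shift every dimension by the same normally distributed number Q.
    q = rng.gauss(0.0, 1.0)
    return [xj + q for xj in x]


def edge_update(x, fit, best, f_best, worst, f_worst, rng, eps=1e-50):
    """Step for a sparrow at the edge of the group (minimization)."""
    if fit > f_best:
        # Worse than the global best: move to a point around the best position.
        return [bj + rng.gauss(0.0, 1.0) * abs(xj - bj) for xj, bj in zip(x, best)]
    # Already at the global best: move relative to the worst position.
    k = rng.uniform(-1.0, 1.0)
    return [xj + k * abs(xj - wj) / ((fit - f_worst) + eps) for xj, wj in zip(x, worst)]


rng = random.Random(0)
calm = producer_update([1.0, -2.0], rank=1, iter_max=100, r2=0.3, st=0.8, rng=rng)
alarmed = producer_update([1.0, -2.0], rank=1, iter_max=100, r2=0.9, st=0.8, rng=rng)
```

In the calm case every coordinate keeps its sign and shrinks toward zero, which is what gives the producers their local refinement behavior.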

3. The Proposed Method

The sparrow search algorithm, the principle of the AdaBoost classifier, and the basic S4VM have been described above. However, classifying pulmonary nodules in lung CT images is a complex and challenging task.

Although the AdaBoost classifier is effective, it still has some shortcomings when applied to pulmonary nodule classification based on lung CT images. Thus, two improvement strategies are introduced: the "S4VM algorithm" is introduced into the original AdaBoost classifier, and "parameter optimization" (or "weight optimization") is introduced into AdaBoost-S4VM, to help them escape local optima.

SSA also has some shortcomings when used for this optimization, so the sine cosine algorithm is combined with it as a hybrid algorithm to help it escape local optima. In addition, a new division of labor structure is introduced into the original SSA to help it converge to the global optimal solution faster and more stably. In this section, the proposed SCA-CSSA algorithm, AdaBoost-S4VM algorithm, and AdaBoost-ISSA-S4VM algorithm are discussed in detail.

3.1. The Proposed AdaBoost-S4VM Model (AdaBoost-S4VM)

As described in Section 2.1, the AdaBoost classifier reweights the training samples after each round according to whether each sample was classified correctly, trains the next weak classifier on the reweighted data, and finally combines the weak classifiers according to their calculated weights. The resulting "strong classifier" is named AdaBoost-S4VM. The strong learning algorithm is defined as follows:

$$H(x) = \operatorname{sign}\left(\sum_{k=1}^{K} \alpha_k h_k(x)\right), \tag{10}$$

where $h_k(x)$ is the basic weak classifier, $\alpha_k$ is the corresponding weight of each weak classifier, defined as $\alpha_k = \frac{1}{2} \ln \frac{1 - e_k}{e_k}$, and $K$ is the total number of basic weak classifiers.

According to the characteristics of formulas (6) and (10), the improved AdaBoost-S4VM formula, with S4VM as the "weak" classifier, is obtained as follows:

$$H(x) = \operatorname{sign}\left(\sum_{k=1}^{K} \alpha_k \left(w_k'\phi(x) + b_k\right)\right), \tag{11}$$

where $w_k'\phi(x) + b_k$ is the $k$th S4VM classifier, $\alpha_k$ is the corresponding weight of each S4VM classifier, defined as $\alpha_k = \frac{1}{2} \ln \frac{1 - e_k}{e_k}$, and $K$ is the total number of basic S4VM classifiers.

3.2. The Proposed Sparrow Search Algorithm (SCA-CSSA)

It can be seen from Section 3.1 that, to obtain the best classification performance and efficiency, the parameters $C_1$ and $C_2$ and the weight $\alpha_k$ need to be optimized. However, because the original SSA easily falls into local optima and converges slowly, it is difficult to guarantee the quality of the obtained solution. To address these problems, the SCA algorithm and a new labor cooperation structure are used to improve the global and local search capability of SSA.

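The position update of the sine cosine algorithm (SCA) can be written mathematically as follows:

$$X_{i}^{t+1} = \begin{cases} X_{i}^{t} + r_1 \cdot \sin(r_2) \cdot \left|r_3 P_{i}^{t} - X_{i}^{t}\right|, & r_4 < 0.5, \\ X_{i}^{t} + r_1 \cdot \cos(r_2) \cdot \left|r_3 P_{i}^{t} - X_{i}^{t}\right|, & r_4 \ge 0.5, \end{cases} \tag{12}$$

where $X_{i}^{t}$ is the position of the current solution in the $i$th dimension at the $t$th iteration, $r_1$, $r_2$, and $r_3$ are random numbers, $P_{i}^{t}$ is the position of the destination point in the $i$th dimension, $|\cdot|$ indicates the absolute value, and $r_4$ is a random number in $[0, 1]$.

To balance exploration and exploitation, the range of the sine and cosine terms in formula (12) is changed adaptively using the following equation:

$$r_1 = a - t \frac{a}{iter_{\max}}, \tag{13}$$

where $t$ is the current iteration, $iter_{\max}$ is the maximum number of iterations, and $a$ is a constant.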
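The SCA position update and its adaptive range can be sketched in Python as follows. The sampling ranges for the random numbers (an angle in [0, 2π], a destination weight in [0, 2]) and the default constant `a = 2` follow the commonly used SCA settings and are assumptions here, as are the function names.

```python
import math
import random


def sca_range(t, iter_max, a=2.0):
    """Linearly decreasing amplitude r1 = a - t * a / iter_max."""
    return a - t * a / iter_max


def sca_update(x, destination, t, iter_max, rng, a=2.0):
    """One SCA step for a position x toward (or around) the destination point."""
    r1 = sca_range(t, iter_max, a)
    new_x = []
    for xj, pj in zip(x, destination):
        r2 = rng.uniform(0.0, 2.0 * math.pi)  # direction of the move
        r3 = rng.uniform(0.0, 2.0)            # random weight on the destination
        r4 = rng.random()                     # switch between sine and cosine
        step = abs(r3 * pj - xj)
        trig = math.sin(r2) if r4 < 0.5 else math.cos(r2)
        new_x.append(xj + r1 * trig * step)
    return new_x


rng = random.Random(42)
early = sca_update([1.0, -1.0], [0.0, 0.0], t=0, iter_max=100, rng=rng)
late = sca_update([1.0, -1.0], [0.0, 0.0], t=100, iter_max=100, rng=rng)
```

Early in the run the amplitude is large and the solution can jump well past the destination (exploration); at the final iteration the amplitude is zero and the position no longer moves (exploitation).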

To increase the search speed and escape local optima, the SCA algorithm is introduced into SSA. Because SCA has these characteristics, it can effectively prevent the sparrow population from converging prematurely. The SSA update formula combined with the SCA algorithm is as follows:

Then, in the improved SSA algorithm, a new labor cooperation structure is used so that the algorithm converges to the global optimal solution faster and more stably. In the new labor cooperation structure, the same three roles of sparrows are defined: producers, scroungers, and sparrows at the edge of the group. Since the producers and scroungers determine the location range and convergence performance of the group, they share their locations to achieve cooperation. This cooperation improves both the producers and the scroungers and thereby improves convergence. The location of the producer is updated as follows:

The position update formula for the scrounger is as follows:

This process can be expressed by the following formula:where refers to the global optimal minimum solution.

The pseudocode of the whole SCA-CSSA is given below in Algorithm 1.

Algorithm 1: SCA-CSSA.
Input: the maximum iterations: $iter_{\max}$; the number of producers: $PD$; the number of sparrows who perceive the danger: $SD$; the number of sparrows: $n$; the dynamic parameter;
Output: global optimal position: $X_{best}$; fitness of global optimal position: $f_g$;
Begin:
(1) Initialize a population of $n$ sparrows and define its relevant parameters.
(2) while $t < iter_{\max}$
(3)   Rank the fitness values and find the current best individual and the current worst individual.
(4)   $R_2 = rand(1)$;
(5)   for $i = 1 : PD$
(6)     Using formula (16), update the sparrow's location;
(7)   end for
(8)   for $i = (PD + 1) : n$
(9)     Using formula (17), update the sparrow's location;
(10)  end for
/* the new division of labor structure scheme */
(11)  Using formulas (18) and (19), update the producer and scrounger's cooperative location;
(12)  If the new location is better than before, update it using formula (20);
(13)  for $l = 1 : SD$
(14)    Using formula (9), update the sparrow's location;
(15)  end for
(16)  Get the current new location;
(17)  If the new location is better than before, update it;
/* sine cosine algorithm scheme */
(18)  Using formula (14), update the SCA sparrow's location;
(19)  If the new location is better than before, update it using formula (15);
(20)  $t = t + 1$;
(21) end while
(22) return $X_{best}$, $f_g$.

3.3. The Proposed AdaBoost-S4VM Model Improved by the Improved Sparrow Search Algorithm (AdaBoost-ISSA-S4VM)

With the improved SCA-CSSA available, and as noted in Section 3.1, the parameters C1 and C2 and the weights of the weak classifiers need to be optimized to obtain the best classification performance and efficiency.

The AdaBoost-S4VM parameters are therefore optimized using SCA-CSSA. The pseudocode of AdaBoost-ISSA-S4VM is shown in Algorithm 2.

Algorithm 2: AdaBoost-ISSA-S4VM.
Input: weak classifier type: S4VM; train data set, train label set, test data set, test label set; the maximum iterations: $K$; kernel: $k$; parameters of S4VM: weight for the hinge loss of labeled instances $C_1$, weight for the hinge loss of unlabeled instances $C_2$, and the sampling times for each trial.
Output: prediction label of test data set
(1) Set the initial weight distribution of the training data set, $D_1 = (w_{1,1}, \ldots, w_{1,N})$ with $w_{1,i} = 1/N$;
(2) for $k = 1 : K$
(3)   if there are misclassification points
/* Parameter selection based on SCA-CSSA */
(4)     According to SCA-CSSA, find the optimal hyper-parameters of weak classifier S4VM;
/* Weight of AdaBoost selection based on SCA-CSSA */
(5)     According to SCA-CSSA, find the optimal weight of weak classifier S4VM;
(6)     Using the weight distribution $D_k$, calculate the weak classifier $h_k$;
(7)     Update the weight distribution of the training set, $D_{k+1}$;
(8)   ;
(9)   else
(10)    jump out of the loop;
(11)  end
(12) end for
(13) According to formula (11), the groups of weak classifiers are linearly combined, and the final classifier is output;
(14) Use the final classifier to predict the classification of the test set.
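The boosting loop of Algorithm 2 can be sketched in Python as shown below. This is a minimal illustration under strong simplifications: a weighted decision stump on one-dimensional data stands in for the S4VM weak classifier, the classifier weight uses the closed-form AdaBoost rule instead of the SCA-CSSA search, and all function names are hypothetical.

```python
import math


def train_stump(X, y, w):
    """Weighted decision stump on 1-D inputs (a stand-in weak learner, not S4VM)."""
    best = None
    for thr in sorted(set(X)):
        for sign in (1, -1):
            pred = [sign if x >= thr else -sign for x in X]
            err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: sign if x >= thr else -sign


def adaboost_fit(X, y, rounds, train_weak=train_stump):
    """Return a list of (alpha, weak_classifier) pairs."""
    n = len(X)
    w = [1.0 / n] * n                      # step (1): uniform initial weights
    ensemble = []
    for _ in range(rounds):                # step (2): boosting rounds
        h = train_weak(X, y, w)            # step (6): fit on the current weights
        pred = [h(x) for x in X]
        err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
        if err >= 0.5:                     # no better than chance: stop
            break
        clipped = max(err, 1e-12)
        alpha = 0.5 * math.log((1.0 - clipped) / clipped)
        ensemble.append((alpha, h))
        if err == 0.0:                     # no misclassified points: leave the loop
            break
        # step (7): raise the weights of misclassified samples, then normalize
        w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, pred)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def adaboost_predict(ensemble, x):
    """Sign of the weighted vote of all weak classifiers."""
    score = sum(alpha * h(x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1


X = [1, 2, 3, 4, 5, 6]
y = [1, 1, -1, -1, 1, 1]                   # not separable by a single stump
model = adaboost_fit(X, y, rounds=3)
predictions = [adaboost_predict(model, x) for x in X]
```

In the proposed model, the call to `train_weak` would be replaced by training an S4VM whose hyper-parameters are selected by SCA-CSSA, and `alpha` would be the weight found by the same optimizer.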

4. Experimental Studies

In this section, to evaluate the performance of the proposed SCA-CSSA and the AdaBoost-ISSA-S4VM model, a series of experiments on test functions and CT images is conducted. All experiments in this paper are implemented in the following environment: MATLAB R2014b; Win 10 (64 bit); Intel(R) Core(TM) i5-10210M CPU @1.60 GHz 2.11 GHz.

4.1. Function Optimization Experiment

This section presents the evaluation of SCA-CSSA using a series of experiments on benchmark functions [17] and CEC2017 test functions [18]. To obtain fair results, all the experiments were conducted under the same conditions. The population size is set to 30 for all algorithms, and each algorithm is run 30 times independently on each function.

4.1.1. Benchmark Functions and CEC 2017 Test Functions

To investigate the effectiveness and generality of SCA-CSSA in comparison with several hybrid algorithms and popular algorithms, 12 benchmark functions and the CEC2017 test functions are used. All 12 benchmark functions have an optimal value of 0. The benchmark functions and their search ranges are shown in Table 1. In this test suite, seven functions are unimodal. Unimodal functions are usually used to test whether an algorithm has good convergence performance. The remaining five functions are multimodal. Multimodal functions are used to test the global search capability of an algorithm. The smaller the fitness value, the better the algorithm performs. Furthermore, to verify the performance of SCA-CSSA more comprehensively, another 30 complex CEC2017 test functions are used. The CEC2017 test functions are briefly described in Table 2.
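As an illustration of the two function classes, the Python sketch below defines one unimodal and one multimodal function whose optimal value is 0. The Sphere and Rastrigin functions are widely used examples and are chosen here as an assumption; they are not necessarily the functions listed in Table 1.

```python
import math


def sphere(x):
    """Unimodal: a single minimum of 0 at the origin."""
    return sum(v * v for v in x)


def rastrigin(x):
    """Multimodal: many local minima, global minimum of 0 at the origin."""
    return 10.0 * len(x) + sum(v * v - 10.0 * math.cos(2.0 * math.pi * v) for v in x)


origin = [0.0] * 10          # Dim = 10, as in the experiments
f_sphere = sphere(origin)
f_rastrigin = rastrigin(origin)
```

On the unimodal function every descent direction leads to the optimum, so it mainly measures convergence speed. On the multimodal function each integer grid point is a local minimum, so an optimizer must be able to leave local optima to reach the value 0.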

4.1.2. Parameter Settings

To verify the effectiveness and generalization of the proposed SCA-CSSA, it is compared with several hybrid algorithms and popular algorithms. The SSA-based and SCA-based algorithms are SSA, SCA, SCA-SSA, and SCA-CSSA. In addition, five popular intelligent algorithms, namely particle swarm optimization (PSO) [19], the bird swarm algorithm (BSA) [20], the crow search algorithm (CSA) [21], the whale optimization algorithm (WOA) [22], and the grasshopper optimization algorithm (GOA) [23], are compared with SCA-CSSA. These algorithms represent the state of the art and allow the performance of SCA-CSSA to be verified more comprehensively. For a fair comparison, the population size of every algorithm is set to 30, and the other parameters of each algorithm are set according to its original paper. The initial control parameters of all algorithms are shown in Table 3.

4.1.3. Comparison on Benchmark Functions with Hybrid Algorithms and Popular Algorithms

As described in Section 3.2, the basic SSA method has been improved by two strategies. To investigate the effectiveness of SCA-CSSA, it is compared with several popular algorithms and hybrid algorithms, namely PSO, BSA, CSA, WOA, GOA, SSA, SCA, and SCA-SSA, on the 12 benchmark functions. Unlike SCA-CSSA, SCA-SSA does not use the new labor cooperation structure. In this experiment, the dimension of the functions is 10, which is a typical dimension for the benchmark functions. Two budgets of function evaluations (FEs) are used: FEs = 1000 and FEs = 10,000.

The fitness curves of a single run of the algorithms on several functions when FEs = 10,000 are shown in Figures 1 and 2, where the horizontal axis represents the number of iterations and the vertical axis represents the fitness value. The figures clearly show the convergence speeds of the different algorithms. The maximum value (Max), the minimum value (Min), the mean value (Mean), and the variance (Var) obtained by the algorithms are shown in Tables 4 and 5, where the best results are marked in bold. Table 4 shows the performance of the algorithms on the unimodal functions when FEs = 1000, and Table 5 shows their performance on the multimodal functions when FEs = 1000.
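The statistics reported in the tables can be gathered with a small Python helper such as the one below. It is a sketch only: plain random search stands in for the compared optimizers, the population variance is used (whether the tables use the population or the sample estimate is not stated), and the search bounds and function names are illustrative assumptions.

```python
import random
import statistics


def random_search(objective, dim, bounds, evaluations, rng):
    """Stand-in optimizer: best objective value among uniformly sampled points."""
    low, high = bounds
    best = float("inf")
    for _ in range(evaluations):
        x = [rng.uniform(low, high) for _ in range(dim)]
        best = min(best, objective(x))
    return best


def summarize_runs(values):
    """Max, Min, Mean, Var, and Std over the best values of independent runs."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.fmean(values),
        "var": statistics.pvariance(values),
        "std": statistics.pstdev(values),
    }


def repeat_runs(objective, runs=30, dim=10, bounds=(-100.0, 100.0),
                evaluations=1000, seed=0):
    """Run the optimizer independently `runs` times, one seed per run."""
    results = [
        random_search(objective, dim, bounds, evaluations, random.Random(seed + r))
        for r in range(runs)
    ]
    return summarize_runs(results)


stats = repeat_runs(lambda x: sum(v * v for v in x), runs=5, evaluations=50)
```

Using a separate seed for every run keeps the runs independent while making the whole table reproducible.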

(1) Unimodal Functions. The evolution curves of these algorithms on three unimodal functions are given in Figure 1. The figure shows that the curve of SCA-CSSA descends fastest and converges in far fewer than 10,000 iterations. On two of these functions, SCA-CSSA has the fastest convergence speed of all the algorithms, whereas the original CSA and GOA obtain the worst solutions on two functions because they are trapped in a local optimum prematurely. On one function, none of the algorithms finds the value 0. However, SCA-CSSA continues to decline, converges significantly faster than the other algorithms in the early stage, and eventually finds the best solution. Overall, owing to the enhanced diversity of the population, SCA-CSSA has a relatively excellent convergence speed when FEs = 10,000.

From the numerical results on the 7 unimodal functions in Table 4, we can see that SCA-CSSA finds the minimum value on six of them. Moreover, SCA-CSSA can find the optimal solution for all unimodal functions and reaches the minimum value of 0 on four of them. This illustrates that SCA-CSSA has the best performance on unimodal functions among the compared algorithms when FEs = 1000. In addition, SCA-CSSA has the best maximum value (Max), minimum value (Min), mean value (Mean), and standard deviation (Std) on four functions, which indicates a relatively good convergence speed. In summary, compared with these popular algorithms and hybrid algorithms, SCA-CSSA is a competitive algorithm and has the best performance on most of the benchmark functions.

(2) Multimodal Functions. The evolution curves of these algorithms on three multimodal functions when FEs = 10,000 are depicted in Figure 2. We can see that SCA-CSSA finds the optimal solution within the same number of iterations. On two of these functions, SCA-CSSA continues to decline and reaches the best value 0, whereas the curves of the original PSO and GOA are parallel straight lines on these three functions because of their poor global convergence ability. On one function, although SCA-CSSA is also trapped in a local optimum, it finds a smaller value than the other algorithms. The convergence speed of SCA-CSSA is significantly faster than that of the other algorithms in the early stage, and the solution it eventually finds is the best. In general, owing to the enhanced diversity of the population, SCA-CSSA has a relatively balanced global search capability when FEs = 10,000.

From the numerical results on the 5 multimodal functions in Table 5, we can see that SCA-CSSA can find the optimal solution for all multimodal functions and reaches the minimum value of 0 on two of them. SCA-CSSA performs relatively well on multimodal functions compared with the other algorithms. Moreover, SCA-CSSA has the best maximum value (Max), minimum value (Min), mean value (Mean), and standard deviation (Std) on four functions. The main reason is that SCA-CSSA has a stronger global exploration capability owing to the SCA method. In summary, SCA-CSSA has a superior global search capability on most multimodal functions when FEs = 1000.

4.1.4. Comparison on CEC2017 Test Functions with Hybrid Algorithms and Popular Algorithms

To further verify the generality of the proposed SCA-CSSA algorithm, it is compared with PSO, BSA, WOA, GOA, SSA, SCA, and SCA-SSA on the CEC2017 test functions. In this experiment, the dimension (Dim) is set to 10, and the number of function evaluations (FEs) is 10,000. The maximum value (Max), the minimum value (Min), the mean value (Mean), and the standard deviation (Std) are given in Tables 6 and 7, where the best results are marked in bold.

SCA-CSSA obtains the minimum value on F3, F4, F6, F10, F11, F12, F13, and F14 in Table 6 and on F16, F17, F18, F19, F20, F22, F23, F24, F25, F27, F28, F29, and F30 in Table 7. According to these results, SCA-CSSA does well on 21 CEC2017 test functions. Further, SCA-CSSA has the best maximum value (Max), minimum value (Min), mean value (Mean), and standard deviation (Std) on F3, F4, F6, F10, F11, F12, F13, F14, and F15 in Table 6 and on F17, F18, F20, F22, F23, F24, and F25 in Table 7. Therefore, SCA-CSSA not only finds the best solution but is also stable on 16 CEC2017 test functions. It can be concluded that SCA-CSSA has better global search ability and better robustness on these test suites.

4.2. Application to Practical Pulmonary Nodule Detection Classification Problem Based on Lung CT Images

In this section, to evaluate the performance of SCA-CSSA on a real-world optimization problem, the proposed AdaBoost-ISSA-S4VM model is applied to pulmonary nodule detection and classification. CT images from the LIDC/IDRI database are used for the AdaBoost-ISSA-S4VM classification. To obtain fair results, all the compared models, namely SVM [24], S4VM, AdaBoost-SVM, and AdaBoost-S4VM, are run under the same conditions. The experimental environment is the same as in Section 4.1. Each classification model is run 30 times independently. The population size and the maximum number of generations are set to 30 and 100, respectively.

4.2.1. Design of Pulmonary Nodule Detection Classification System

In order to identify and classify the lung nodules and non-nodules, the processing module includes the preprocessing of DICOM image, the extraction of the lung parenchyma and lung nodule, the interception of the ROI (region of interest) image, the acquisition of ROI feature vectors, and the dimensionality reduction and classification of the vectors. Block diagram of the pulmonary nodule detection classification system based on the improved AdaBoost-ISSA-S4VM is shown in Figure 3.

The pulmonary nodule detection classification can be summarized in the following steps (Figure 4):
Step 1 (Image Selection). Lung CT images containing solitary lung nodules are selected. The dataset should be closely related to lung cancer sample analysis and should be relatively independent. The dataset is randomly divided into training and testing samples.
Step 2 (Picture Preprocessing). The lung CT images are read first, as shown in Figure 4(a), and then denoised with the RPCA method improved by weighted nonconvex regularization [25]. After the contrast of the images is enhanced through binarization, the optimal threshold segmentation (OTSU) method is used to sharpen the image, as shown in Figure 4(b).
Step 3 (Lung Parenchyma Extraction). To narrow the range of the lung parenchyma and reduce the difficulty of detection, thus improving the detection accuracy, the lung parenchyma is filled, as shown in Figure 4(c). Then, an XOR operation is applied to the image, and the area of the lung parenchyma is obtained, as shown in Figure 4(d). After the external and small areas in the lung parenchyma are deleted, a morphological method is used to repair the edge of the image, as shown in Figure 4(e). Finally, the lung parenchyma template and the preprocessed image are multiplied to obtain the required lung parenchyma, as shown in Figure 4(f).
Step 4 (ROI Region Extraction). The optimal threshold segmentation method is used again to extract the pathological part. After the linear structures are eliminated, the small areas are removed by removing the smaller connected components, as shown in Figure 4(g). Finally, false positives are removed by the dot filter method, which removes linear structures effectively, and the ROI regions are obtained, as shown in Figure 4(h). The ROI regions include lung nodules and non-nodules, as shown in Figures 4(i) and 4(j).
Step 5 (Feature Vector Extraction). To avoid the influence of the particularity, heterogeneity, texture, and complexity of lung nodules on the selection of feature vectors, the Curvelet transform, which has a rigorous mathematical theory, is introduced in addition to the conventional feature extraction methods [24] to supplement the feature vectors.
Step 6 (Classifier Training and Feature Classification). The feature vectors extracted from the lung nodule dataset are input to the AdaBoost-ISSA-S4VM classifier, which is trained with the AdaBoost-ISSA-S4VM classification algorithm to obtain the AdaBoost-ISSA-S4VM classification model. The test dataset is then identified by the trained AdaBoost-ISSA-S4VM classification model, and the classification results are obtained as output.
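The optimal threshold segmentation used in Steps 2 and 4 can be sketched in Python as follows. The sketch assumes the classical Otsu criterion (the threshold that maximizes the between-class variance of the gray-level histogram) and operates on a flat list of 8-bit intensities; the function names and the toy pixel values are illustrative and do not come from the original implementation.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with intensity <= t form the background class, the rest the foreground.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    w_bg = 0          # number of background pixels so far
    sum_bg = 0        # sum of background intensities so far
    best_t, best_var = 0, -1.0
    for t in range(levels):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(pixels, threshold):
    """Map each pixel to 1 (above the threshold) or 0 (at or below it)."""
    return [1 if p > threshold else 0 for p in pixels]


toy_image = [0, 0, 1, 1, 254, 255, 255]   # dark background, bright structure
t = otsu_threshold(toy_image)
mask = binarize(toy_image, t)
```

For a real CT slice, the same routine would be applied to the flattened pixel array after the slice has been rescaled to 8-bit gray levels.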

4.2.2. Practical Application

In Section 2.3, the performance of the proposed ISSA is simulated and analyzed on benchmark functions. In order to test the application effect of the improved AdaBoost-ISSA-S4VM classification model, lung CT images from the LIDC-IDRI database are selected for the experiments. According to the description in the XML annotation file of the case nodule information in the database, the solid solitary lung nodules were analyzed. ROI region extraction on the DICOM images is performed before feature vector extraction, as shown in Figure 4. Figure 4 shows some of the steps of lung nodule extraction, where (a) is the original CT image, (f) is the lung parenchyma after processing, and (i) and (j) are the suspected lesion areas. Finally, 200 ROI regions were extracted in this experiment, including 80 lung nodules and 195 non-nodules. After they are grouped randomly, 125 of them are used as the training set and 200 as the test set for the supervised learning models. Likewise, 125 of them are used as the labeled set and 200 as the unlabeled set for the semi-supervised learning models. Figure 5 shows some lung nodules and non-nodules obtained from the experiment.

Feature vector extraction on the ROI region datasets is performed before the AdaBoost-ISSA-S4VM classification model is trained on the training dataset, as shown in Table 8. In total, 715 feature parameters are extracted, including 12 morphological feature parameters, 10 gray-scale feature parameters, 7 texture feature parameters, and 686 Curvelet transform coefficients. Then, these feature vectors are normalized to prevent features with a large dynamic range from dominating features with a small one, as shown in Table 9.
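
The normalization step can be sketched as follows. The exact normalization formula is not stated here, so the code assumes min-max scaling of each feature column to [0, 1], which is one common choice; the function name `min_max_normalize` and the dictionary `FEATURE_GROUPS` are hypothetical and only restate the feature counts given above.

```python
# Feature groups as described in the text (counts only, names are illustrative).
FEATURE_GROUPS = {
    "morphological": 12,
    "gray_scale": 10,
    "texture": 7,
    "curvelet": 686,
}
TOTAL_FEATURES = sum(FEATURE_GROUPS.values())  # 12 + 10 + 7 + 686


def min_max_normalize(rows):
    """Scale each feature column of `rows` to the range [0, 1].

    `rows` is a list of equally long feature vectors (one per ROI region).
    A constant column is mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    normalized = []
    for row in rows:
        normalized.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return normalized
```

With this scaling, a feature measured in the thousands and a feature measured in single digits both end up in the same range, so neither dominates the distance computations inside the SVM-based weak classifiers.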

After preprocessing, lung parenchyma extraction, lung nodule extraction, and feature extraction, the CT images are used to train the pulmonary nodule detection classification model. In order to measure the performance of the AdaBoost-ISSA-S4VM classification model, we compare the improved classification model with several popular pulmonary nodule detection classification models. The compared classification models are the SVM model [26], the standard S4VM model, the AdaBoost-SVM model, the AdaBoost-S4VM model, and the AdaBoost-ISSA-S4VM model. The performance indicators selected in this paper to evaluate the recognition models, together with their formulas, are shown in Table 10.

ACC is used to evaluate the accuracy of each classification model. SEN and SPE measure the ability to detect true positives and true negatives, respectively. FPR and FNR are the misdiagnosis rate and the missed diagnosis rate, respectively. Table 11 records the performance indicator data of each classification model, and the best results are marked in bold. The larger the SEN and SPE are, the better the classification model performs. Conversely, the smaller the FPR and FNR are, the better the classification model performs.
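
As an illustration of how these indicators relate to one another, the sketch below computes them from the four confusion-matrix counts. It assumes the standard definitions (Table 10 is not reproduced here), and the function name `classification_indexes` as well as the counts in the usage note are hypothetical, not results from this study.

```python
def classification_indexes(tp, tn, fp, fn):
    """Compute ACC, SEN, SPE, FPR, and FNR from confusion-matrix counts.

    tp / fn: nodules classified as nodule / non-nodule.
    tn / fp: non-nodules classified as non-nodule / nodule.
    Standard definitions are assumed.
    """
    positives = tp + fn  # all true nodules
    negatives = tn + fp  # all true non-nodules
    return {
        "ACC": (tp + tn) / (positives + negatives),
        "SEN": tp / positives,   # true positive rate
        "SPE": tn / negatives,   # true negative rate
        "FPR": fp / negatives,   # misdiagnosis rate, equals 1 - SPE
        "FNR": fn / positives,   # missed diagnosis rate, equals 1 - SEN
    }
```

For example, `classification_indexes(tp=40, tn=50, fp=5, fn=5)` gives an ACC of 0.9. Under these definitions FPR and SPE always sum to 1, as do FNR and SEN, which is why larger SEN and SPE correspond to smaller FNR and FPR.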

From the performance indicators in Table 11, we can see that the classification accuracy of the AdaBoost-ISSA-S4VM classification model is comparable to or even better than that of supervised classifiers such as SVM. First, the classification accuracy of the S4VM classifier is quite poor and far lower than that of the SVM. The reason is that SVM is a supervised classifier whose input dataset is labeled, while S4VM is a semi-supervised classifier whose input dataset contains unlabeled samples, which reduces the accuracy of the classifier. Second, the S4VM classifier optimized by ensemble learning performs better than the SVM combined with ensemble learning. Third, the established AdaBoost-ISSA-S4VM, which is the S4VM classifier optimized by ensemble learning and SCA-CSSA, reaches a classification accuracy of 94.22% on labeled and unlabeled lung CT images, which is much higher than that of the original supervised classifier on labeled samples. At the same time, its false positive rate and false negative rate on these lung nodule images are 0.1234 and 0.0146, respectively, so it also performs well on both error rates. Based on these results, the proposed classification model is better than traditional supervised classifiers such as the SVM model on lung nodule classification.

5. Conclusion

In summary, an improved semi-supervised ensemble classifier (AdaBoost-ISSA-S4VM) is proposed for the semi-supervised problem by combining the AdaBoost classifier, the semi-supervised SVM, and the improved sparrow search algorithm. The proposed algorithm is applied to lung CT images for pulmonary nodule detection, and a detailed performance comparison and analysis are presented based on the publicly available LIDC-IDRI database. The improved algorithm obtains better experimental results than the SVM, S4VM, AdaBoost-SVM, and AdaBoost-S4VM algorithms. In particular, the proposed AdaBoost-ISSA-S4VM achieves 21% higher accuracy than the standard SVM and 26% higher accuracy than S4VM. This study demonstrates that the established AdaBoost-ISSA-S4VM classifier can solve the problem of pulmonary nodule detection on labeled and unlabeled lung CT images. In other words, the proposed AdaBoost-ISSA-S4VM classifier has the potential to improve the performance of lung CT image classification using both labeled and unlabeled lung CT images, with a high probability of detecting cancers at an early stage.

Although the proposed AdaBoost-ISSA-S4VM has been shown to be effective in solving general optimization problems, it has some shortcomings that warrant further investigation. Because of the improvement strategies, AdaBoost-ISSA-S4VM needs more time than the classical S4VM and most supervised classifiers. Therefore, further developing the proposed algorithm to increase recognition efficiency is a worthwhile direction. In future research, the method presented in this paper can also be extended to discrete optimization problems and multiobjective optimization problems. Furthermore, applying the proposed AdaBoost-ISSA-S4VM model to other fields, such as financial prediction and biomedical diagnosis, is also an interesting direction for future work.

Data Availability

All data included in this study are available upon request by contacting the corresponding author. All the lung CT images for pulmonary nodule detection in this study can be found in the free, publicly available LIDC-IDRI database created by the National Cancer Institute and the Foundation for the National Institutes of Health.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Key Supporting Project of the Joint Fund of the National Natural Science Foundation of China (no. U1813222), the Tianjin Natural Science Foundation (no. 18JCYBJC16500), and the Key Research and Development Project of Hebei Province (no. 19210404D).