Abstract

The use of machine learning techniques to predict material strength is becoming popular, yet little attention has been paid to instance-based learning (IBL) algorithms. Because the direct method of determining material strength by conducting tests is time-consuming and expensive, and experimental errors are inevitable, an indirect method based on an elementary instance-based learning algorithm is proposed. The standard k-nearest neighbors (k-NN) algorithm with cross-validation was used to develop compressive strength prediction models for several concretes and rocks from indirect parameters such as physical and mechanical properties. Results obtained by applying this method to datasets from the literature show that the RMSE values of the k-NN models are modest, indicating that the models predict compressive strength adequately over a comprehensive range of predictor values. Additionally, the R2 values of the k-NN models were high; in other words, the models were able to explain the variance in compressive strength for data with a wide range of input values.

1. Introduction

The relationship between the strength of a material and its mixture proportions and processing can be complex. As a result, this relationship, which is usually determined empirically through experiments, is problematic to establish. Moreover, physical, mineralogical-petrographic, index, and mechanical tests are time-consuming and expensive, and experimental errors are inevitable. Machine learning (ML) techniques are increasingly used to model the strength of materials such as concrete and rock, and this has become an important research area [1–10]. Additionally, the results of these studies can help engineers and practitioners identify the key components governing material strength performance.

Ensemble models can achieve higher performance in predicting material strength than individual models. Nevertheless, no model has been proven to be superior in all situations [11]. Moreover, if the link between inputs and output is important for interpretation, ensemble models can pose difficult interpretative problems [11, 12]. An ensemble model is designed around a specific and limited sample, reflecting the nature and volume of its dataset; hence, the direct use of an ensemble model to predict the strength of other material types should be avoided until further complementary studies have been carried out [13]. The main disadvantage of an ensemble is the resources it requires: computation, software availability, and the analyst's skill and time.

In general, it is better to use a simpler model rather than a more complex one, and selecting tuning parameters based on the numerically optimal value may result in an overly complex model. Other options for choosing less complex models should therefore be investigated, as they might yield simpler models with acceptable performance. The use of ML techniques to predict material strength is becoming popular, but little attention has been paid to instance-based learning (IBL) algorithms [14, 15]. Therefore, the objective of the present study is to investigate the potential of an IBL algorithm for predicting material strength, using data obtained from the literature. The k-nearest neighbors (k-NN) algorithm is among the simplest of all ML algorithms. The standard k-NN was used to develop compressive strength prediction models for several concretes and rocks from indirect parameters such as physical and mechanical properties. To verify and validate the standard k-NN models, their predictions were compared with those of models reported in the literature on six datasets, using cross-validation to minimize bias.

The remainder of this paper is organized as follows. Section 2 describes the standard k-NN approach and the resampling methods. Section 3 presents the data description and summarization. Section 4 presents the results and discussion for the prediction of compressive strength, and conclusions are provided in Section 5.

2. Methods

The model development procedure is presented in this section. First, the standard k-NN models and their performance statistics are discussed. Second, the validation procedures for the standard k-NN models are presented. The models used in this study were implemented in R, an open-source statistical software environment and high-level programming language [16].

2.1. k-Nearest Neighbors

The standard k-NN approach, which is an IBL algorithm, predicts a new sample using the k closest samples from the training set [17]. The model is constructed solely from the individual samples in the training data. To predict a new response value (e.g., CS for compressive strength or E for Young's modulus) in regression, k-NN identifies the k closest neighbors in the space of the data attributes (e.g., mixture factors and process factors) and applies a predefined function (here, the average) to the response values of those k nearest neighbors [18].

The Euclidean distance is the most commonly used measure of closeness between two observations \(\mathbf{x}_a\) and \(\mathbf{x}_b\) and is defined as

\[ d(\mathbf{x}_a, \mathbf{x}_b) = \sqrt{\sum_{j=1}^{D} (x_{aj} - x_{bj})^2}, \]

where D is the number of attributes, \(x_{aj}\) and \(x_{bj}\) are the jth components of the vectors \(\mathbf{x}_a\) and \(\mathbf{x}_b\), and N is the number of observations in the training set. The Minkowski distance is a generalization of the Euclidean distance and is defined as

\[ d(\mathbf{x}_a, \mathbf{x}_b) = \left( \sum_{j=1}^{D} \lvert x_{aj} - x_{bj} \rvert^{q} \right)^{1/q}, \]

where q > 0 [19]. It is easy to see that when q = 2 the Minkowski distance reduces to the Euclidean distance, and when q = 1 it equals the Manhattan distance, a metric commonly used for samples with binary predictors. Many other distance measures exist, such as Tanimoto, Hamming, and cosine distances, which are more suitable for specific types of predictors; for example, the Tanimoto distance is often used in computational chemistry problems where molecules are described by binary fingerprints [20]. To show that the elementary version of k-NN is intuitive and straightforward and can produce decent predictions, the Euclidean distance was used in this paper. The average of the response values of the k nearest neighbors is used to estimate the unknown response value \(\hat{y}\):

\[ \hat{y} = \frac{1}{k} \sum_{i=1}^{k} y_i, \]

where \(y_1, \ldots, y_k\) are the response values of the k nearest neighbors.
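
For illustration, a minimal base-R sketch of this elementary procedure is given below; the function and variable names (knn_predict, x_train, y_train, x_new) are our own illustrative choices and do not come from the original study.

# Minimal sketch: Euclidean distances from a (already preprocessed) query
# sample to all training samples, keep the k closest, and average their
# responses.
knn_predict <- function(x_train, y_train, x_new, k = 5) {
  # Euclidean distance from the query sample to every training sample
  d <- sqrt(rowSums(sweep(x_train, 2, x_new)^2))
  # indices of the k closest training samples
  nn <- order(d)[seq_len(k)]
  # predefined combination function: the plain average of their responses
  mean(y_train[nn])
}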

Since distances between observations are used as the measure of closeness, the data must be preprocessed so that every predictor has the same mean and variance. All predictors are therefore centered and scaled before k-NN is applied. To center a predictor, the average predictor value is subtracted from every value, so that the centered predictor has a mean of zero. Similarly, to scale the data, each value of the predictor variable is divided by the predictor's standard deviation, forcing the scaled values to have a common standard deviation of one [17].
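
A short base-R sketch of this preprocessing step follows; x_train and x_new are the illustrative objects from the previous sketch, and the training-set statistics are reused for new samples so that distances are computed on a common scale.

# scale() centers and scales each column and stores the training means and
# standard deviations as attributes, which are then applied to new samples.
x_train_std <- scale(x_train)
ctr <- attr(x_train_std, "scaled:center")
scl <- attr(x_train_std, "scaled:scale")
x_new_std <- scale(matrix(x_new, nrow = 1), center = ctr, scale = scl)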

In order to evaluate the prediction performance of the models, the coefficient of determination (R2) and the root mean squared error (RMSE) were used:

\[ R^2 = 1 - \frac{\sum_{i=1}^{M} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{M} (y_i - \bar{y})^2}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} (y_i - \hat{y}_i)^2}, \]

where \(y_i\) and \(\hat{y}_i\) are the observed and predicted values, respectively, \(\bar{y}\) is the mean of the observed values, and M is the number of data samples. R2 provides information about the strength of the correlation between observed and predicted values, and RMSE evaluates the residuals between them; when predicting numeric values, RMSE is the statistic most often used to evaluate a model. Quantitative evaluation of such statistics (i.e., RMSE) under resampling helps users understand how each technique performs on new data. The RMSE corresponding to each value of k within a chosen range was computed, and the optimal number of neighbors was selected as the value of k with the smallest RMSE. Furthermore, the observed and predicted values were plotted to discover regions of the data where the k-NN model performed particularly well or poorly.
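
Both statistics can be computed directly from vectors of observed and predicted values; the two helper functions below (illustrative names only) correspond to the definitions above.

# RMSE and coefficient of determination for observed (y_obs) and predicted
# (y_pred) compressive strengths.
rmse <- function(y_obs, y_pred) sqrt(mean((y_obs - y_pred)^2))
r2   <- function(y_obs, y_pred) 1 - sum((y_obs - y_pred)^2) / sum((y_obs - mean(y_obs))^2)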

2.2. Resampling Methods

When a large amount of data is available, the data can be split into training and test sets: the former is used to build a model and the latter to evaluate its performance. However, some researchers [21, 22] have shown that validation using a single test set can be a poor choice. When the number of samples is small, a separate test set should be avoided, because every sample may be needed to build the model; in addition, a small test set may lack the power or precision needed to make reasonable judgments. Resampling methods can produce reasonable estimates of the model's performance on future samples. In this study, resampling methods, namely 10-fold cross-validation (CV), leave-one-out cross-validation (LOOCV), and repeated 10-fold cross-validation (RepeatedCV), were used to minimize the bias (overfitting and underfitting) associated with a single random split into training and held-out samples and to determine the optimal number of neighbors, i.e., the k that minimizes RMSE.

By evaluating how the model fits points that were not used to fit it, we can understand how the model will perform on future observations; this assesses the general behavior of the model rather than its fit to the observed data alone [23]. In 10-fold CV, the samples were randomly divided into 10 groups of roughly equal size. A model was fit to all samples except the first subset (the first fold). The held-out samples were predicted by the model and used to estimate RMSE. The first subset was then returned to the training set, the process was repeated with the second subset held out, and so on. The average of the 10 RMSE estimates is the cross-validation estimate of model performance. Kohavi [24] confirmed that ten-fold cross-validation offers a good balance of computational time and reliable variance. In LOOCV, the number of folds equals the number of samples (M); because only one sample is held out at a time, the final RMSE is calculated from the M individual held-out predictions. Several studies [22, 25] have indicated that RepeatedCV can effectively improve the accuracy of the estimate while still maintaining a small bias.
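
As an illustration of how the three resampling schemes can be set up in R, the sketch below uses the caret package; the data frame name (dat), the response column (CS), the tuning range for k, and the number of repeats are assumptions, as the corresponding settings are not reported here.

library(caret)

ctrl_cv  <- trainControl(method = "cv", number = 10)           # 10-fold CV
ctrl_loo <- trainControl(method = "LOOCV")                      # leave-one-out CV
ctrl_rep <- trainControl(method = "repeatedcv",                 # repeated 10-fold CV
                         number = 10, repeats = 5)

# Tune k over a grid and keep the value with the smallest resampled RMSE;
# predictors are centered and scaled before distances are computed.
fit <- train(CS ~ ., data = dat, method = "knn",
             preProcess = c("center", "scale"),
             tuneGrid = data.frame(k = 1:20),
             trControl = ctrl_cv, metric = "RMSE")
fit$bestTune   # optimal number of neighbors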

3. Data Description and Summarization

The datasets used in this study have been used in previous studies of predictive models (Table 1). Using the standard k-NN model, this study analyzed six experimental datasets to investigate the prediction performance of this elementary model. Table 2 lists the six datasets with descriptive statistics, i.e., maximum (Max), minimum (Min), average (Ave), and standard deviation (Std). The response/target was CS, and the predictor variables were the remaining attributes. The minimum and maximum values of the different sample properties given in Table 2 act as the boundary conditions of some models, e.g., artificial neural networks (ANN).

Dataset 1. The test data of 144 different concrete mix designs were gathered from Lam et al. [31]. The high-performance concrete (HPC) mixes were prepared at different ratios of water to cementitious materials, with low and high volumes of fly ash, and with or without the addition of small amounts of silica fume. CS was 7.8–107.8 MPa. The samples comprised 24 different mix series, and in each series the percentage of cement replaced by fly ash varied from 0% to 55%. The cementitious materials were Portland cement equivalent to ASTM type I, low-calcium fly ash equivalent to ASTM Class F, and a condensed silica fume commercially available in Hong Kong.

Dataset 2. Siddique et al. [32] collected data on 80 concrete mixes with comparable physical and chemical composition properties from various studies. The self-compacting concrete mixes were made with water/powder ratios of 0.33–0.87 and contained 0 to 261 kg/m3 of fly ash. Coarse aggregate content varied from 621 to 923 kg/m3, and fine aggregate content from 478 to 1079 kg/m3. CS was 10.2–73.5 MPa. The superplasticizer content was 0–100%.

Dataset 3. A database of 104 concrete mixes produced in South Korea was compiled from the experiments carried out by Lim et al. [26]. The water to binder ratio of the HPC varied between 0.30 and 0.45, the amount of fly ash varied from 0% to 20% of the total binder, and the contents of superplasticizer and air-entraining agent were 0.5–1.5% and 0.010–0.013%, respectively. Portland cement conforming to ASTM type I was used. The coarse aggregate was crushed granite (specific gravity, 2.7; fineness modulus, 7.2; maximum particle size, 19 mm), and the fine aggregate was quartz sand (specific gravity, 2.61; fineness modulus, 2.94). CS was 38–74 MPa.

Dataset 4. The rocks were collected from factories, outcrops, and quarries at different locations in Turkey [27]. A series of laboratory tests including physical test, ultrasonic velocity test, point load strength test, Schmidt hammer test, Brazilian tensile strength test, Shore hardness test, and uniaxial CS test were conducted on blocks or pieces taken from fresh parts of 93 different rocks belonging to 32 rock types. CS was 6.64–303.67 MPa.

Dataset 5. The granite samples were taken from the face of the Pahang–Selangor raw water transfer tunnel in Malaysia [13]. A series of laboratory tests including physical test, ultrasonic velocity test, point load strength test, Schmidt hammer test, and uniaxial CS test were conducted on 71 samples of granite. CS was 28.0–211.9 MPa. E was 22.0–183.3 GPa.

Dataset 6. A variety of sedimentary rocks including grainstone, wackestone-mudstone, boundstone, gypsum, and silty marl were collected from quarries in Qom Province, central Iran [33]. A series of laboratory tests including physical test, ultrasonic velocity test, point load strength test, and uniaxial CS test were conducted on 106 data sets. CS was 6.21–160.32 MPa.

4. Results and Discussion

This section presents the predictive accuracy of the proposed models by comparing them with models from the literature. There are some strong between-predictor correlations, as shown in Tables 3 and 4. However, no principal component accounts for more than 37% of the variance, as shown in Table 5, indicating that there are no redundant predictors. To evaluate and compare the performance of the different models, the prediction performances of the standard k-NN models and of the models developed in the literature are presented in Table 6. A plot of the observed values against the predicted values helps one judge how well the model fits, and a plot of the residuals versus the predicted values can help uncover systematic patterns, such as trends, in the model predictions. These plots for the k-NN models are shown in Figures 1 and 2.
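
A sketch of how such diagnostic plots can be produced in base R is given below; y_obs and y_pred are vectors of observed and predicted CS values, following the naming of the earlier metric sketch.

# Observed versus predicted values (left) and residuals versus predicted
# values (right) for a fitted k-NN model.
op <- par(mfrow = c(1, 2))
plot(y_pred, y_obs, xlab = "Predicted CS (MPa)", ylab = "Observed CS (MPa)")
abline(0, 1, lty = 2)                         # line of perfect agreement
plot(y_pred, y_obs - y_pred, xlab = "Predicted CS (MPa)", ylab = "Residual (MPa)")
abline(h = 0, lty = 2)
par(op)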

Dataset 1. Pala et al. [35] studied the effect of fly ash and silica fume replacement content on the long-term strength of concrete using an ANN. Chou and Pham [15] used six data mining techniques, ANN, classification and regression trees (CART), chi-squared automatic interaction detector (CHAID), multiple linear regression (MLR), generalized linear model (GENLIN), and support vector machines (SVM), to construct individual and ensemble models. The number of parameter settings for these six single models varies from 5 to 10. Table 6 shows that the top three performing models are ANN, CART, and CHAID, while the standard k-NN shows modest results comparable to MLR and GENLIN. The plot of observed CS values against those predicted by the 4-NN (RepeatedCV) model in Figure 1 shows modest concordance between the observed and predicted values; there are about as many positive as negative residuals, and they show no strong patterns. Some observations lie fairly far from the horizontal axis, but many more are close to it. The majority of the residuals are within ±10 MPa for CS ranging from 7.8 to 107.8 MPa.

Dataset 2. Table 6 shows that the top two performing models are MLR and GENLIN, while the standard k-NN shows modest results comparable to ANN and CART. The plot of observed CS values against those predicted by the 7-NN (CV) model in Figure 1 shows modest correlation between the observed and predicted values; there are about as many positive as negative residuals, and they show no strong patterns. The majority of the residuals are within ±10 MPa for CS ranging from 10.2 to 73.5 MPa.

Dataset 3. Table 6 shows that the top two performing models are ANN and k-NN. The plot of observed CS values against those predicted by the 2-NN (LOOCV) model in Figure 1 shows good concordance between the observed and predicted values; there are about as many positive as negative residuals, and they show no strong patterns. The majority of the residuals are within ±2 MPa for CS ranging from 38 to 74 MPa.

In the study of Chou and Pham [11], the IBM SPSS Modeler [36] was used for the ANN analyses, with standard feedforward backpropagation, the gradient descent algorithm, and three hidden layers (20, 15, and 10 neurons). The best individual model for predicting HPC compressive strength with the three experimental datasets was the ANN, which achieved 45.1%, 4.5%, and 11.4% better error rates than the standard k-NN model for datasets 1 to 3, respectively. The standard k-NN model achieved 18.7%, 32.1%, and 13.2% better error rates than the lowest-performing model, SVM, for datasets 1 to 3, respectively. One significant limitation of the work of Chou and Pham is that the IBM SPSS Modeler was used with default settings for both the single and ensemble models; further studies are therefore needed to determine the optimum parameter values.

Siddique et al. [32] adopted the procedure proposed by Garson [37] for partitioning the neural-network connection weights to determine the relative importance of the various inputs. The enhanced ANN model yielded an RMSE of 5.557 MPa and achieved 21.8% better error rates than the standard ANN model for dataset 2.

Ahmadi-Nedushan [38] used a differential evolution algorithm to find the optimal k-NN model parameters, such as the number of neighbors, the distance function, and the attribute weights. The best enhanced model, with optimal attribute weighting, yielded an RMSE of 1.174 and achieved 32.8% better error rates than the standard k-NN model for dataset 3, because the Euclidean distance function in the standard k-NN model assumes that all attributes are equally important. Sometimes the right choice of neighbors depends on modifying the distance function to favor some predictors over others, which is easily accomplished by incorporating weights into the distance function, as sketched below.
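
A minimal sketch of such an attribute-weighted Euclidean distance is shown below; w is a hypothetical vector of non-negative attribute weights, and x_train and x_new follow the earlier illustrative sketch.

# Attributes with larger weights contribute more to the notion of closeness;
# setting all weights to 1 recovers the standard Euclidean distance.
weighted_dist <- function(x_train, x_new, w) {
  sq_diff <- sweep(x_train, 2, x_new)^2         # squared attribute differences
  sqrt(rowSums(sweep(sq_diff, 2, w, `*`)))      # weight each attribute, then sum
}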

Dataset 4. In the study of Teymen and Mengüç [27], the MRA and ANN models were built with the IBM SPSS Modeler and Matlab [39], respectively. A gradient descent with momentum and adaptive learning rate backpropagation algorithm, with one hidden layer of three neurons, was used. All binary combinations of the six independent variables were tried as input parameters. According to the performance index assessment, the weakest model was the one with Vp and SSH as inputs, yielding an R2 of 0.874 and 0.834 for ANN and MRA, respectively. The most successful model was the one with BTS and Is as inputs, yielding an R2 of 0.921 and 0.953 for ANN and MRA, respectively, as shown in Table 6. Nevertheless, k-NN with all six independent variables as inputs yielded an even larger R2 of 0.931 for 7-NN (CV), with 0.917 and 0.905 for 5-NN (RepeatedCV) and 5-NN (LOOCV), respectively. k-NN is a highly automated, data-driven method; its algorithm is intuitive, straightforward, and easy to implement and can produce decent predictions. The plot of observed CS values against those predicted by the 7-NN (CV) model in Figure 2 shows good concordance between the observed and predicted values; there are about as many positive as negative residuals, and they show no strong patterns. Some observations lie fairly far from the horizontal axis, but many more are close to it. The majority of the residuals are within ±20 MPa for CS ranging from 6.64 to 303.67 MPa.

Dataset 5. Table 6 shows that there is no clear winner or loser among the three models, k-NN, ANN, and multivariate regression analysis (MRA), for predicting E. Both k-NN and MRA predict CS with a high degree of accuracy. The plot of observed CS values against those predicted by the 3-NN (CV) model in Figure 2 shows good concordance between the observed and predicted values; there are about as many positive as negative residuals, and they show no strong patterns. The majority of the residuals are within ±20 MPa for CS ranging from 28 to 211.9 MPa.

To overcome shortcomings such as a slow rate of learning and entrapment in local minima, Jahed Armaghani et al. [13] built an ANN enhanced with the imperialist competitive algorithm (ICA) [28, 29] to predict CS and E. The ICA-ANN predicts CS with a high degree of accuracy and E with a suitable degree of accuracy. The enhanced ANN model yielded an RMSE of 12.454 MPa and achieved 52.5% better error rates than the conventional ANN for dataset 5. However, Jahed Armaghani et al. [13] noted that the proposed ICA-ANN predictive model is designed around the CS of granite samples; hence, the direct use of the ICA-ANN model for CS prediction of other rock types is not recommended.

Dataset 6. In the study of Jalali et al. [30], Matlab was used for the ANN analyses, with a standard feedforward network, the Levenberg–Marquardt algorithm, and one hidden layer with 5 neurons. Table 6 shows that the ANN is the top performing model, and that the predictive performances of MRA and k-NN are comparable. The plot of observed CS values against those predicted by the 2-NN (LOOCV) model in Figure 2 shows good concordance between the observed and predicted values; there are about as many positive as negative residuals, and they show no strong patterns. The majority of the residuals are within ±15 MPa for CS ranging from 6.21 to 160.32 MPa.

Generally, parametric methods tend to outperform nonparametric approaches when there are only a small number of observations per predictor; for example, MLR, GENLIN, and MRA outperform nonparametric approaches for datasets 2, 4, and 5. However, algorithms for predicting material strength based on conventional regression analysis and statistical models may be unsuitable, because material behavior is highly complex and such correlations give good results only for similar materials [40–42].

The residual plots show that the predictions of k-NN always remain within a reasonable range, because each final prediction is based on the actual response values of the neighbors. By contrast, regression and neural networks may produce impossible results, because their prediction range extends from negative infinity to positive infinity, whereas the range of reasonable values may not be so extreme. The k-NN technique produces reasonable values with many distinct levels. However, the range of predicted values is narrower than the range in the dataset, because the averaging combination function smooths out the maximum and minimum values.

No resampling method is uniformly better than another. However, putting computational issues aside, a less obvious but potentially more important advantage of 10-fold CV is that it often gives more accurate estimates of the test error than does LOOCV. This has to do with a bias-variance trade-off: the models fitted in LOOCV are trained on nearly identical sets of observations, so their outputs are highly correlated with one another, and since the mean of many highly correlated quantities has a higher variance than the mean of quantities that are less correlated, the test error estimate produced by LOOCV tends to have a higher variance than the one produced by 10-fold CV.

5. Conclusions

This study has examined the use of an elementary ML technique to predict the compressive strengths of some concretes and rocks. As seen in Table 6, the RMSE values of the k-NN models are modest, indicating that they predict compressive strength adequately over a comprehensive range of predictor values. Additionally, the R2 values of the k-NN models were high; in other words, the models were able to explain the variance in compressive strength for data with a wide range of input values. One benefit of this approach is its simplicity, which allows rigorous analysis to guide intuition and research goals. Furthermore, k-NN does not require that the data satisfy a predefined model; it needs few parameters beyond the number of neighbors and no background knowledge. Aha [34] showed that, when combined with noisy-example pruning and attribute weighting, IBL performs well compared with other methods. In short, k-NN with cross-validation is a simple, general, and effective technique that yields high-quality predictions by combining the predicted values of the k nearest neighbors and weighting them by distance. However, finding a computationally effective means of calculating these weights requires further research.

Removing irrelevant predictors is a key preprocessing step for k-NN. Expert knowledge should first be applied to obtain data relevant to the research objectives. Furthermore, many different feature selection methods have been proposed to remove noninformative or redundant predictors from the model and to retain only the most relevant predictors for a given problem [43].

Table 6 indicates that the results obtained by ANN are better than those of the k-NN technique. However, the tuning of parameters such as momentum, learning rate, and number of hidden layers makes it easy for an ANN to overfit the data at hand. Furthermore, applying some subject-matter expertise to the data preparation improves model performance. In dataset 4, the most successful ANN was the one with BTS and Is of the six independent variables as inputs, yielding an R2 of 0.921, whereas k-NN with all six independent variables as inputs yielded a larger R2 of 0.931. The k-NN algorithm is therefore the more intuitive and straightforward of the two.

Sometimes one of these simple models will be the best predictive model available; in many cases, however, they will serve as benchmarks rather than as the model of choice. That is, any new predictive model should be compared with these simple models to ensure that it outperforms such simple alternatives; if not, the new model is not worth considering.

In model selection, whenever possible, analysts should not rely on a single data mining method. No single model will always do better than every other model, and choosing among multiple models largely depends on the characteristics of the data and the type of questions being asked. Therefore, it is customary to apply several different methods and then choose the most useful one for the goal at hand. A k-NN model that reasonably approximates the performance of more complex methods could be used as a tool to support decision making, because the standard k-NN is easy to implement and has potential applications in materials science.

Data Availability

The data used in this study were collected from previously published research papers on predictive modelling.

Conflicts of Interest

The author declares that there are no conflicts of interest.