2.1.1 Haemaphysalis spinigera tick occurrence data:
Data on the reported occurrence of Haemaphysalis spinigera were collated through a systematic and comprehensive literature search of Google Scholar, the Cochrane Library, PubMed and the Web of Science, using the keywords Kyasanur Forest Disease, Haemaphysalis spinigera occurrence, monkey death by KFD virus, KFD in India, and human cases of KFD (Additional file 1). Literature reporting the occurrence of Haemaphysalis spinigera, KFD cases or monkey deaths from 1957 onwards was considered in order to ascertain the coordinates. Locations of confirmed cases (human cases and monkey deaths) were converted into point features (exact latitude and longitude, 1 km × 1 km) or polygon features (i.e., localities, villages, districts) and geo-referenced using Google Earth and ArcGIS to rectify the latitude and longitude [38]. When the name of the locality or village could not be identified at the administrative level, the coordinates were overlaid in a geographic information system (GIS) and assigned to the appropriate polygon feature [38]. All occurrence locations of the Haemaphysalis spinigera tick were transformed to the WGS 84 datum using ArcGIS software. Since the present study was conducted at a resolution of 30 arc-seconds (approximately 1 km × 1 km), localities falling within the same pixel were treated as a single occurrence point. Altogether, 34 locations with confirmed human cases, monkey deaths or occurrence of Haemaphysalis spinigera ticks were georeferenced from all the reported areas of Karnataka, Maharashtra, Kerala, Goa and Tamil Nadu (Fig 1) (additional Table 1).
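The rule of keeping one occurrence point per 30 arc-second pixel can be illustrated with a minimal Python sketch. It is not the ArcGIS workflow used in the study. The grid origin at (−90°, −180°), the function names and the sample coordinates are assumptions made for illustration only.

```python
import math

CELLS_PER_DEGREE = 120  # 30 arc-seconds = 1/120 of a degree

def cell_index(lat, lon, cells_per_degree=CELLS_PER_DEGREE):
    """Return the (row, col) of the grid cell containing a WGS 84 point.

    Assumption: the grid origin is at latitude -90, longitude -180.
    """
    row = math.floor((lat + 90.0) * cells_per_degree)
    col = math.floor((lon + 180.0) * cells_per_degree)
    return row, col

def thin_to_one_per_cell(points):
    """Keep only the first record that falls in each grid cell."""
    seen = {}
    for lat, lon in points:
        seen.setdefault(cell_index(lat, lon), (lat, lon))
    return list(seen.values())

# Hypothetical coordinates, not the study's occurrence records.
records = [
    (14.0010, 75.0010),
    (14.0020, 75.0020),  # falls in the same pixel as the first record
    (13.5000, 75.3000),
]
thinned = thin_to_one_per_cell(records)
print(len(records), "records reduced to", len(thinned))
```

Keeping the first record in each cell is an arbitrary choice; any single representative per pixel would serve the same purpose.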
2.1.2 Bio-climatic variables:
Bio-climatic variables are biologically more meaningful than simple temperature and rainfall values for identifying the physio-ecological tolerances of plants and animals [39, 40]. Therefore, these variables are commonly used in bio-climate envelope modelling [31, 41]. The study used 19 bio-climatic variables as potential predictors of Haemaphysalis spinigera distribution (as shown in additional Table 2). Raster-based bio-climatic variables were obtained from WorldClim version 2. These bio-climatic layers have a spatial resolution of ∼1 km (30 arc-seconds) and capture the extremes and seasonality of temperature as well as annual trends in precipitation and temperature.
Of the nineteen bio-climatic variables, five highly correlated variables with a negligible effect on the model were removed to reduce masking effects and produce a model with better predictive ability [42]. Correlation was tested with Pearson’s correlation coefficient (r) using ENMTools (version 1.3), and a pairwise r value of more than 0.80 was taken as the cut-off threshold [25, 42] (additional Table 3). The remaining 14 bio-climatic variables, which had higher permutation importance and percent contribution, were used for modelling. The relationship between the bio-climatic variables and habitat suitability for Haemaphysalis spinigera was evaluated from the response curves produced by MaxEnt.
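The correlation screening can be sketched in plain Python as follows. This is not the ENMTools procedure itself. The greedy rule (walk the variables in order of importance and keep one only if its |r| with every variable already kept is at most 0.80), the variable names and the sample values are all assumptions for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def drop_correlated(layers, ranking, cutoff=0.80):
    """Greedy filter (an assumed rule): visit variables from most to least
    important and keep a variable only if |r| <= cutoff with all kept ones."""
    kept = []
    for name in ranking:
        if all(abs(pearson_r(layers[name], layers[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept

# Hypothetical values sampled at five locations, not the study's data.
layers = {
    "bio1": [20.0, 22.0, 24.0, 26.0, 28.0],
    "bio5": [30.0, 32.1, 34.0, 36.2, 38.0],   # nearly collinear with bio1
    "bio12": [3000.0, 1200.0, 2500.0, 900.0, 2000.0],
}
ranking = ["bio1", "bio12", "bio5"]  # assumed order of importance
kept = drop_correlated(layers, ranking)
print(kept)
```

In this toy input, bio5 is dropped because it is almost perfectly correlated with bio1, while bio12 is retained.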
2.2 Predictive modelling:
ArcGIS 10.3 and ENVI 5.1 software were used to generate raster-based spatial layers of the bio-climatic variables. Maximum entropy (MaxEnt) modelling is a machine learning algorithm [24] that estimates the probability distribution of a vector or species location subject to a set of environmental constraints. The model performs well even with fewer sampling points than other machine learning methods require [43]. It uses presence-only vector/species location points to predict the potential distribution based on maximum entropy theory [24]. The basic principle of the algorithm is that the approximation must satisfy every known constraint on the unknown distribution and, subject to those constraints, should make as few additional assumptions as possible [44, 45]. In this study, we used presence data for Haemaphysalis spinigera from 34 locations and generated pseudo-absences. About 10,000 background points were randomly selected by the maximum entropy algorithm. The presence data were randomly divided, with 75% of the samples used to calibrate the model and the remaining 25% used to assess model performance. We used the subsampling method in an attempt to create a stable model because it has advantages over bootstrap and cross-validation [46, 47], and the model was run with 50 replicates.
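The repeated 75%/25% subsampling can be sketched as follows. This is a minimal illustration of the idea and not MaxEnt's internal implementation. How MaxEnt rounds 25% of 34 points is not stated here, so truncating to 8 test points is an assumption, as are the function name and the fixed seed.

```python
import random

def subsample_splits(n_points, test_fraction=0.25, replicates=50, seed=0):
    """Return `replicates` independent random (train, test) index splits."""
    rng = random.Random(seed)                 # fixed seed for reproducibility
    n_test = int(n_points * test_fraction)    # assumption: truncate, 34 -> 8
    splits = []
    for _ in range(replicates):
        idx = list(range(n_points))
        rng.shuffle(idx)
        test = sorted(idx[:n_test])
        train = sorted(idx[n_test:])
        splits.append((train, test))
    return splits

splits = subsample_splits(34)
train, test = splits[0]
print(len(splits), "replicates;", len(train), "training and", len(test), "test points each")
```

Unlike k-fold cross-validation, each replicate draws its test set independently, so a given presence point may be held out in several replicates or in none.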
MaxEnt also provides settings for controlling model complexity by altering the regularization multiplier and the feature classes. Sixteen different combinations of feature classes, each retaining the linear feature, were created to identify the most appropriate combination, which was then used to assess model performance. Regularization multipliers were used to balance model fit and avoid overfitting [48]. The default value selected for model calibration was 1.0. In total, 123 model combinations were created in order to select the best-fitting model under different settings. All other model parameters were left at their default values.
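One way to arrive at sixteen combinations that all retain the linear feature is to pair it with every subset of MaxEnt's four other feature classes (quadratic, product, threshold and hinge), since 2^4 = 16. The text does not list the combinations actually used, so the enumeration below is an assumed reading, and the single-letter codes are shorthand chosen for this sketch.

```python
from itertools import combinations

# Shorthand: L = linear, Q = quadratic, P = product, T = threshold, H = hinge.
OPTIONAL_CLASSES = ["Q", "P", "T", "H"]

def feature_class_combinations(optional=OPTIONAL_CLASSES):
    """Pair the linear feature with every subset of the optional classes."""
    combos = []
    for size in range(len(optional) + 1):
        for subset in combinations(optional, size):
            combos.append("L" + "".join(subset))
    return combos

combos = feature_class_combinations()
print(len(combos), combos)
```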
2.3 Threshold identification:
For model output indicating the probability of presence (suitability for a species), the logistic value ranging from 0 (unsuitable) to 1 (maximum probability of presence) was used [24]. A binary unsuitable/suitable map was prepared by applying the ‘max SSS’ (maximum test sensitivity plus specificity) logistic threshold. Sensitivity (Se) and specificity (Sp) are, respectively, the likelihood that a model correctly predicts a species’ presence and absence at a location, and they measure the omission and commission errors. Se and Sp are independent of each other and are not influenced by prevalence across models [49]. On the ROC curve, which plots sensitivity against 1 − specificity, ‘max SSS’ identifies the point at which the tangent slope is 1, and this point maximizes the TSS value. It can be used as an efficient threshold when only occurrence (presence) data for the target species are available, and it has been used extensively [50-52]. The resulting binary raster was used to show the potential distribution of Haemaphysalis spinigera ticks using SDMtoolbox 2.0. A bias file was used to correct the selection of background points for the latitudinal bias introduced by the geographical coordinate system [53].
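The ‘max SSS’ rule can be made concrete with a short sketch: every observed score is tried as a threshold, and the one that maximizes Se + Sp (equivalently TSS = Se + Sp − 1) is kept. Treating background points as pseudo-absences when computing specificity is an assumption of this sketch, and the scores are hypothetical. MaxEnt reports this threshold itself, so the code only illustrates the definition.

```python
def max_sss_threshold(presence, background):
    """Return (threshold, TSS) maximizing sensitivity + specificity.

    Assumption: background scores stand in for absences.
    """
    best_t, best_sum = None, -1.0
    for t in sorted(set(presence) | set(background)):
        se = sum(p >= t for p in presence) / len(presence)      # presences predicted suitable
        sp = sum(b < t for b in background) / len(background)   # background predicted unsuitable
        if se + sp > best_sum:
            best_t, best_sum = t, se + sp
    return best_t, best_sum - 1.0

def binarize(scores, threshold):
    """Convert logistic scores into a 0/1 (unsuitable/suitable) map."""
    return [1 if s >= threshold else 0 for s in scores]

# Hypothetical logistic outputs, not the study's results.
presence_scores = [0.9, 0.8, 0.7, 0.4]
background_scores = [0.1, 0.2, 0.3, 0.5, 0.6]
threshold, tss = max_sss_threshold(presence_scores, background_scores)
binary_map = binarize(presence_scores + background_scores, threshold)
print(threshold, tss, binary_map)
```

With these toy scores the selected threshold is 0.7: three of the four presences are retained (Se = 0.75) and all background points are excluded (Sp = 1.0), giving TSS = 0.75.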
2.4 Model assessment and validation:
To estimate the goodness of fit of the model, the area under the receiver operating characteristic curve (AUC) was used, and the model with the highest value was considered the best performer. The AUC is a threshold-independent measure for assessing how well a model discriminates between presence and absence [54]. AUC values range from 0 to 1. A value of 0.5 signifies that the model performs no better than random, while a value of 1.0 indicates perfect discrimination [54, 55]. The contribution of each bio-climatic factor was also measured using the jackknife test. The detailed methodological flow diagram of this work is shown in Fig 2.
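The AUC can be read as the probability that a randomly chosen presence point receives a higher score than a randomly chosen background point, with ties counted as one half. The sketch below computes it directly from that definition. Using background points in place of true absences is an assumption of this sketch, and the scores are hypothetical.

```python
def auc(presence, background):
    """Probability that a presence score exceeds a background score."""
    wins = 0.0
    for p in presence:
        for b in background:
            if p > b:
                wins += 1.0
            elif p == b:
                wins += 0.5   # ties count as half
    return wins / (len(presence) * len(background))

# Hypothetical logistic outputs, not the study's results.
presence_scores = [0.9, 0.8, 0.7, 0.4]
background_scores = [0.1, 0.2, 0.3, 0.5, 0.6]
print(auc(presence_scores, background_scores))
```

Here 18 of the 20 presence/background pairs are ranked correctly, so the AUC is 0.9. Identical score distributions would give 0.5, and perfect separation would give 1.0.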