Article

Insider Threat Detection Based on User Behavior Modeling and Anomaly Detection Algorithms

School of Industrial Management Engineering, Korea University, Seoul, Korea
* Author to whom correspondence should be addressed.
Submission received: 28 August 2019 / Revised: 20 September 2019 / Accepted: 23 September 2019 / Published: 25 September 2019

Abstract

Insider threats are malicious activities by authorized users, such as theft of intellectual property or security information, fraud, and sabotage. Although the number of insider threats is much lower than that of external network attacks, insider threats can cause extensive damage. As insiders are very familiar with an organization’s system, it is very difficult to detect their malicious behavior. Traditional insider-threat detection methods focus on rule-based approaches built by domain experts, but they are neither flexible nor robust. In this paper, we propose insider-threat detection methods based on user behavior modeling and anomaly detection algorithms. Based on user log data, we constructed three types of datasets: a user’s daily activity summary, e-mail content topic distributions, and a user’s weekly e-mail communication history. Then, we applied four anomaly detection algorithms and their combinations to detect malicious activities. Experimental results indicate that the proposed framework works well for imbalanced datasets in which there are only a few insider threats and no domain experts’ knowledge is provided.

1. Introduction

Insider threat is a security issue that arises from persons who have access to a corporate network, systems, and data, such as employees and trusted partners [1]. Although insider threats do not occur frequently, the magnitude of damage is greater than that from external intrusions [2,3]. Because insiders are very familiar with their organization’s computer systems and operational processes, and have the authorization to use these systems, it is difficult to determine when they behave maliciously [4]. Many system protection technologies have been developed against intrusions attempted by outsiders, e.g., by quantifying the patterns of connecting Internet protocol (IP) addresses and types of attacks [5]. However, past research on the security of a company’s internal information has mainly focused on detecting and preventing intrusion from the outside, and few studies have addressed methods to detect insider threats [6].
There are three mainstream research strategies for insider-threat detection. The first strategy is to develop a rule-based detection system [7,8]. In this strategy, a pool of experts generates a set of rules to identify insiders’ malicious activities. Then, each user’s behavior is recorded as a log and is tested to determine whether it meets any of the pre-designed rules. Cappelli et al. [9] discussed the types of insider threats and the domain knowledge needed to prevent/detect them. Rule-based detection methods have a critical limitation in that the rules must be constantly updated through the knowledge of domain experts, so the risk of someone circumventing the rules always exists [10]. Hence, rule-based methods based on expert knowledge are inflexible to changing insider-threat methods, which results in unsatisfactory detection performance [7,10,11].
The second strategy is to build a network graph to identify suspicious users or malicious behaviors by monitoring changes in the graph structure [12]. Graph-based insider-threat identification analyzes not only the value of the data itself but also the relationships among the data. These relationships are represented by edges connecting the nodes of the graph, and their properties can be analyzed to determine the relationships of specific nodes to insider threats. Eberle et al. [12] defined an activity as abnormal if modifications, insertions, or deletions occur in the underlying structure of a normal data graph. To determine the structure of a normal data graph, they employed a graph-based knowledge discovery system called “SUBDUE”. Parveen et al. [13] used graph-based anomaly detection (GBAD)-MDL, GBAD-P, and GBAD-MPS to determine the ideal structure of a graph, and added an ensemble-based approach to detect abnormal insider activities in the “1998 Lincoln Laboratory Intrusion Detection” dataset.
The third strategy is to build a statistical or machine learning model based on previous data to predict potential malicious behavior [14]. Machine learning is a methodology in which a computer learns an algorithm that optimizes appropriate performance criteria from training data to perform given tasks [15]. Insider-threat detection using machine learning aims at developing a method to automatically identify users who perform unusual activities among all users without prior knowledge or rules. Because the machine learning methodology can continuously learn and update the algorithms from the data, it can perform more stable and accurate detection than rule-based detection. Gavai et al. [16] employed random forest [17] and isolation forest [18] to classify retirees for the ‘Vegas’ dataset, in which behavior features are extracted from e-mail transmission patterns and contents, logon and logoff records, web browsing patterns, and file access patterns. Ted et al. [4] collected user activity data for 5500 users using a tool called “SureView” (Forcepoint, Austin, TX, USA). They extracted features from the data by considering potential malicious activity scenarios by insiders, implied abnormal activities, temporal order, and high-level statistical patterns. They created variables involving insiders’ various actions, such as e-mails, files, and logons, and applied 15 statistical indices and various machine-learning algorithms to determine the most suitable combination of algorithms. Eldardiry et al. [10] detected insider threats by measuring the similarity in behavior between the role group to which a user actually belongs and another role group to which he/she does not belong, assuming that users in the same role group have similar patterns of activities.
Although the learning model-based strategy is advantageous in that it does not depend on the knowledge of domain experts to define a set of rules or to construct a relational graph, it has two practical obstacles: (1) the way of quantifying a user’s behavioral data and (2) the lack of abnormal cases available for model building. As most statistical/machine learning models take a continuous value as an input to the detection model, each user’s behavior during a certain period (e.g., day) should be transformed into a numerical vector in which each element represents a specific behavioral characteristic. Because a user’s behavior can be extracted from different data sources, such as systems usage logs, e-mail sending and receiving networks, and e-mail contents, one of the key points of building successful insider-threat detection models is to define useful features for different types of data and to transform the unstructured raw data into a structured dataset. From a modeling perspective, it is virtually impossible to train a binary classification algorithm when only a few abnormal examples exist [19]. Under this class imbalance circumstance, most statistical/machine learning algorithms tend to classify all activities as normal, which results in a useless insider-threat detection model.
To resolve these shortcomings, we propose an insider-threat detection framework based on user activity modeling and one-class classification. During the user activity modeling stage, we consider three types of data. First, all activity logs of individual users recorded in the corporate system are collected. Then, candidate features are extracted by summarizing specific activities. For example, if the system logs contain information on when a user connects his/her personal universal serial bus (USB) drive to the system, the total number of USB connections per day can be extracted as a candidate variable. Second, we consider user-generated contents, such as the body of an e-mail, to create candidate features. Specifically, we used topic modeling to convert unstructured text data to a structured vector while preserving the meaning of text as much as possible. Lastly, we construct a communication network of users based on e-mail exchange records. Then, summary statistics for a node, including centrality indices, are computed and used as candidate features. During the insider-threat detection model-building stage, we employ one-class classification algorithms to learn the characteristics of normal activities based on three categories of candidate feature sets. We then employ four individual one-class classification algorithms and exploit the possible advantages of their combinations. By considering heterogeneous feature sets, we expect an improved detection performance compared to detection models based on a single dataset. In addition, by employing one-class classification algorithms, it becomes practically possible to build an insider-threat detection model without the need for past abnormal records.
The rest of this paper is organized as follows. In Section 2, we introduce the dataset used in this study and demonstrate user activity modeling, i.e., how to transform unstructured logs or contents to a structured dataset. In Section 3, we introduce the one-class classification algorithms employed to build the insider-threat detection model. In Section 4, experimental results are demonstrated with some interesting observations. Finally, in Section 5, we conclude our study with some future research directions.

2. Dataset and User Activity Modeling

In this section, we briefly introduce the dataset used in our study. Then, we demonstrate how we define candidate features for the insider-threat detection model and how we transform three different types of user activity data into numeric vectors.

2.1. CERT Dataset

Because it is very difficult to obtain actual corporate system logs, we used the “CERT Insider Threat Tools” dataset (Carnegie Mellon’s Software Engineering Institute, Pittsburgh, PA, USA) [20]. The CERT dataset is not real-world enterprise data, but it is an artificially generated dataset created for the purpose of validating insider-threat detection frameworks [1].
The CERT dataset includes employee computer usage logs (logon, device, http, file, and email) with some organizational information such as employee departments and roles. Each table consists of columns related to a user’s ID, timestamps, and activities. The CERT dataset has six major versions (R1 to R6) and the latest version has two variations: R6.1 and R6.2. The types of usage information, number of variables, number of employees, and number of malicious insider activities are different depending on the dataset version. We conducted this study using R6.2, which is the latest and largest dataset. In this version, the dataset includes 4000 users, among whom only five users behaved maliciously. The description of the logon activity table is provided in Table 1 and the other activities are provided in the Appendix A, Table A1.

2.2. User Activity Modeling Based on Daily Activity Summaries

In the CERT dataset, user behaviors are stored in five data tables: logon, USB, http, file, and email. To comprehensively utilize heterogeneous user behavior data, it is necessary to integrate the behavioral information into one standardized data table in chronological order. Because the proposed user-level insider-threat detection models developed in this study work on a daily or weekly basis, we first integrated a user’s fragmented activity records for each day and summarized them to quantify the intensity of activity, which becomes an input variable in the detection model. For example, based on the information stored in the logon table, it is possible to extract the number of times a user has logged on to the computer during a specific day.
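To make this step concrete, the sketch below shows one way to count daily logons per user with pandas; the file name and column names (user, date, activity) follow Table 1, and the working-hour window (08:00 to 18:00) is an assumption rather than a setting reported in the paper.

```python
import pandas as pd

# Load the logon table of the CERT dataset (file and column names assumed from Table 1).
logon = pd.read_csv("logon.csv", parse_dates=["date"])

# Flag off-hours events (outside 08:00-18:00 here; the actual working-hour
# definition used in the paper is an assumption).
logon["day"] = logon["date"].dt.date
logon["off_hours"] = ~logon["date"].dt.hour.between(8, 17)

# Count logon events per user per day, separately for working hours and off-hours.
daily = (
    logon[logon["activity"] == "Logon"]
    .groupby(["user", "day", "off_hours"])
    .size()
    .unstack(fill_value=0)
    .rename(columns={False: "numlogonDay", True: "numlogonNight"})
    .reset_index()
)
print(daily.head())
```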
To determine candidate input variables for insider-threat detection, we examined the input variables used in past studies, as shown in Table 2. From these references, we created all possible input variables that can be extracted from the CERT dataset. The total number of candidate input variables is 60 and the description of each variable is provided in the Appendix A, Table A2. Once this daily summarization process was completed, a total of 1,394,010 instances were obtained. Each instance represents a behavior summary of a specific day for a specific user.
Among the more than 1 million instances, only 73 are potentially actual insider threats. To identify the characteristics of malicious insiders, we investigated the roles associated with the 73 abnormal instances, as shown in Table 3. We found that most abnormal activities (nearly 90%) were committed by three roles: “Salesman”, “Information Technology (IT) Admin”, and “Electrical Engineer”. If a role has no abnormal instances, or fewer than three, it is impossible not only to build a good detection model but also to verify the performance of the developed model. For this reason, we constructed role-dependent insider-threat detection models and evaluated their performance for the above three roles. The frequencies of normal and abnormal instances in the three roles are shown in Table 4.
The performance of machine learning models, including anomaly detection, is strongly affected by the input variables used to train the model [24]. Theoretically, the performance of machine learning models improves as the number of variables increases, provided the input variables are independent. In practice, however, a large number of input variables sometimes degrades model performance because of high dependency between input variables (multi-collinearity) and the presence of noise. Hence, it is necessary to select a set of effective variables rather than using all variables in order to secure better performance. In this study, we used the univariate Gaussian distribution to select potentially beneficial variables for detecting malicious instances. For each variable, we first estimated the parameters of the Gaussian distribution (mean and standard deviation). Then, if at least one of the abnormal activities was located in the rejection region at the significance level α = 0.1 for a certain variable, we included that variable as an input variable for further anomaly detection modeling. Table 5 shows the variables selected by the univariate Gaussian distribution test.
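The following sketch illustrates this univariate screening, assuming the daily summaries are already arranged as numeric arrays of normal and abnormal instances; interpreting the rejection region at α = 0.1 as a two-sided z-test (|z| > 1.645) is our reading, not a detail stated in the paper.

```python
import numpy as np
from scipy.stats import norm

def select_variables(X_normal, X_abnormal, alpha=0.1):
    """Keep a variable if at least one abnormal instance falls in the
    rejection region of the univariate Gaussian fitted on normal data."""
    selected = []
    z_crit = norm.ppf(1 - alpha / 2)          # two-sided critical value (about 1.645)
    for j in range(X_normal.shape[1]):
        mu = X_normal[:, j].mean()
        sigma = X_normal[:, j].std(ddof=1)
        if sigma == 0:                        # constant variable, skip
            continue
        z = np.abs((X_abnormal[:, j] - mu) / sigma)
        if np.any(z > z_crit):
            selected.append(j)
    return selected

# Example with random data standing in for the 60 candidate variables.
rng = np.random.default_rng(0)
X_norm = rng.normal(size=(1000, 60))
X_abn = rng.normal(loc=2.0, size=(10, 60))
print(select_variables(X_norm, X_abn))
```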

2.3. User Activity Modeling Based on E-mail Contents

A user’s daily e-mail usage logs (number of sent and received e-mails) are stored as shown in Table 6. Although some summary statistics are included in the input variables in Table 5, it is sometimes more important to analyze the content of each e-mail than to rely on simple statistics. Because the e-mail data table in the CERT dataset also contains content information as well as log records as shown in Table 6, we can conduct an individual e-mail-level content analysis. To do so, we employed topic modeling to transform a sequence of words (e-mail body) to a fixed size of numerical vectors to be used for training the insider-threat detection models.
Topic modeling is a method of text analysis that uncovers the main topics that permeate a large collection of documents [25,26]. Topic models assume that each document is a mixture of topics (Figure 1(c-1)) and each topic has its own word selection probability distribution (Figure 1(c-2)). Hence, the purpose of topic modeling is to estimate the parameters of the probabilistic document generation process, such as the topic distribution per document and the word distribution per topic. Latent Dirichlet allocation (LDA) is the most widely used topic modeling algorithm [25]. The document generation process and the two outputs of LDA are shown in Figure 1. By observing the actual words $w_{d,i}$ in each document, LDA estimates the topic distribution per document $\theta_d$ and the word distribution per topic $\varphi_k$ given the hyper-parameter $\alpha$. In this study, we set the number of topics to 50 and the value of $\alpha$ to 1.
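A minimal sketch of this step with gensim's LdaModel is shown below; the 50 topics and the symmetric prior α = 1 follow the text, while the toy e-mail bodies and the whitespace tokenization are placeholders for the actual preprocessing.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy e-mail bodies standing in for the CERT e-mail contents.
emails = [
    "quarterly sales report attached please review",
    "server maintenance scheduled tonight backup the database",
    "team lunch friday at noon",
]

# Build a bag-of-words corpus from the e-mail bodies.
tokens = [body.lower().split() for body in emails]
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(doc) for doc in tokens]

# Train LDA with 50 topics and a symmetric document-topic prior alpha = 1.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=50,
               alpha=[1.0] * 50, random_state=0)

# Per-e-mail topic distribution (Topic 1 ... Topic 50), summing to 1.
topic_vector = lda.get_document_topics(corpus[0], minimum_probability=0.0)
print([round(p, 4) for _, p in topic_vector])
```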
Table 7 shows the data format for insider-threat detection based on e-mail content analysis using LDA. The “ID” is a unique string that distinguishes a specific e-mail from other observations. The columns “Topic 1” through “Topic 50” indicate the probabilities assigned to the 50 topics per individual e-mail and are used as an input variable of the anomaly detection model. Note that the sum of the probabilities of the 50 topics is always 1. The “Target” is a variable that indicates whether the e-mail is normal (0) or abnormal (1). Table 8 shows the number of normal and abnormal e-mails for each of the three roles. We assumed that the e-mail topic distributions in each role are similar. Thus, if a topic distribution of a certain e-mail is significantly different from that of the other e-mails, it should be suspected as abnormal/malicious behavior.

2.4. User Activity Modeling Based on E-mail Network

Because the sender/receiver information is also available from the e-mail log records, as shown in Table 6, we constructed the e-mail communication network on a weekly basis and extracted quantified features as the third source of user activity analysis for insider-threat detection. Based on the information available from Table 6, a directed e-mail communication network can be constructed, as shown in Figure 2. The fictitious company in the CERT data is named “dtaa” and uses the e-mail domain @dtaa.com; there are also 21 other domain names. In the CERT dataset, users used either the company account e-mail domain “@dtaa.com” or another domain such as “@gmail.com”. Users sent and received e-mails to/from users in the same department or different departments of the same company, and also to/from entities outside the company. In this study, a user’s e-mail account is set as a node, and the edges between two e-mail accounts are weighted based on the number of incoming and outgoing e-mails.
Once the weekly e-mail communication network was constructed, we computed a total of 28 network-specific quantified features for each user, as shown in the Appendix A, Table A3. These variables include the in- and out-degrees for personal or business e-mail account, the in- and out-degree similarity between two consecutive time-stamps for the same account in terms of the Jaccard similarity [27], as computed by Equation (1), and the centrality measure in terms of the betweenness, as computed by Equation (2).
$$ \mathrm{Jaccard}\left(Node_{t-1}, Node_t\right) = \frac{\left| \mathrm{Degree}\left(Node_{t-1}\right) \cap \mathrm{Degree}\left(Node_t\right) \right|}{\left| \mathrm{Degree}\left(Node_{t-1}\right) \cup \mathrm{Degree}\left(Node_t\right) \right|} \quad (1) $$
$$ \mathrm{Betweenness}(i) = \sum_{j<k} \frac{g_{jk}(N_i)}{g_{jk}} \quad (2) $$
where $g_{jk}$ is the number of shortest paths between two nodes $j$ and $k$, and $g_{jk}(N_i)$ is the number of those shortest paths that pass through node $i$. Betweenness centrality tends to be higher when one node in a network plays a bridging role for other nodes. Among the four well-known centrality measures, i.e., degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality [28], we used the betweenness centrality to determine whether a specific e-mail account behaves as an information gateway in the overall e-mail communication network.
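As an illustration, the sketch below computes the two feature types of Equations (1) and (2) with networkx on toy weekly e-mail graphs; the account names and edges are invented for the example.

```python
import networkx as nx

def jaccard_degree(prev_neighbors, curr_neighbors):
    """Jaccard similarity between the neighbor sets of the same account
    in two consecutive weeks (Equation (1))."""
    union = prev_neighbors | curr_neighbors
    if not union:
        return 1.0
    return len(prev_neighbors & curr_neighbors) / len(union)

# Weekly directed e-mail graphs: an edge u -> v means u sent at least one e-mail to v.
week1 = nx.DiGraph([("A@dtaa.com", "B@dtaa.com"), ("A@dtaa.com", "C@dtaa.com")])
week2 = nx.DiGraph([("A@dtaa.com", "B@dtaa.com"), ("A@dtaa.com", "D@gmail.com")])

# Out-degree Jaccard similarity for account A between the two weeks.
prev_out = set(week1.successors("A@dtaa.com"))
curr_out = set(week2.successors("A@dtaa.com"))
print(jaccard_degree(prev_out, curr_out))          # 1 common neighbor / 3 total ≈ 0.33

# Betweenness centrality of every account in the current week (Equation (2)).
print(nx.betweenness_centrality(week2))
```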
Among the 4000 users in the CERT dataset, only four users, i.e., CDE1846, CMP2946, DNS1758, and HIS1706, sent or received unusual emails. The numbers of normal and abnormal e-mails for these users are shown in Table 9.

3. Insider-Threat Detection

Figure 3 shows the overall framework of the insider-threat detection method developed in this study. In the user behavior-modeling phase, each user’s behaviors stored in the log system are converted to three types of datasets: daily activity summary, e-mail contents, and e-mail communication network. In the anomaly detection phase, one-class classification algorithms are trained based on the three datasets. Once a new record is available, it is input to one of these three models to predict potential malicious scores.
For the insider-threat detection domain, it is very common that a very large number of normal user activity cases is available, whereas there are only a handful or no abnormal cases available. In this case, conventional binary classification algorithms cannot be trained due to the lack of abnormal classes [19]. Alternatively, in practice, one-class classification algorithms can be used in such unbalanced data environments [29]. Unlike binary classification, one-class classification algorithms only use the normal class data to learn their common characteristics without relying on abnormal class data. Once the one-class classification model is trained, it predicts the likelihood of a newly given instance being a normal class instance. In this paper, we employed Gaussian density estimation (Gauss), Parzen window density estimation (Parzen), principal component analysis (PCA) and K-means clustering (KMC) as one-class classification algorithms for insider-threat detection, as shown in Figure 4.
Gauss [30] assumes that all normal user behavior cases are drawn from a single multivariate Gaussian distribution (Figure 4a), as defined in Equation (3):
$$ p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} \left| \Sigma \right|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{T} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right] \quad (3) $$
Hence, training Gauss is equivalent to estimating the mean vector and covariance matrix that are most likely to generate the given dataset, as in Equations (4) and (5):
$$ \boldsymbol{\mu} = \frac{1}{n} \sum_{\mathbf{x}_i \in X_{normal}} \mathbf{x}_i \quad (4) $$
$$ \Sigma = \frac{1}{n-1} \sum_{\mathbf{x}_i \in X_{normal}} (\mathbf{x}_i - \boldsymbol{\mu}) (\mathbf{x}_i - \boldsymbol{\mu})^{T} \quad (5) $$
where $\mathbf{x}_i$ is a normal training instance and $X_{normal}$ is the entire training dataset consisting of only normal instances. The anomaly score of a test observation can be determined by estimating the generation probability of the given observation under the estimated distribution parameters [31].
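A minimal sketch of this detector follows: the mean vector and covariance matrix are estimated from normal instances only, and the negative log-density of a test instance is used as its anomaly score; the synthetic data and the use of the negative log-likelihood as the score are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 5))           # training data: normal instances only
x_test = np.array([4.0, 4.0, 0.0, 0.0, 0.0])   # a suspicious test instance

# Fit the single multivariate Gaussian (Equations (3)-(5)).
mu = X_normal.mean(axis=0)
cov = np.cov(X_normal, rowvar=False)           # unbiased (1/(n-1)) covariance estimate

# Anomaly score: negative log-likelihood under the fitted distribution.
score = -multivariate_normal(mean=mu, cov=cov).logpdf(x_test)
print(score)
```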
Parzen is one of the well-known, kernel-based, non-parametric density function estimation methods [32]. Parzen does not assume any type of prior distribution and estimates the probability density function based solely on the given observations using a kernel function K, as in Equation (6):
$$ \hat{f}_h(\mathbf{x}) = \frac{1}{n} \sum_{i=1}^{n} K\left( \frac{\mathbf{x} - \mathbf{x}_i}{h} \right) \quad (6) $$
where $h$ is the bandwidth parameter of the kernel function that controls the smoothness of the estimated distribution. The kernel function (e.g., uniform, triangular, Gaussian, or Epanechnikov) is a non-negative function that is symmetric about the origin and has an integral value of 1 [33]. In this paper, we used the Gaussian kernel function. The density at a given location is estimated by adding the kernel function values of all training observations at that location and dividing the sum by the total number of observations. If the estimated density of a new observation is low, it is highly likely to be abnormal.
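The sketch below shows a Parzen window estimate with a Gaussian kernel using scikit-learn's KernelDensity; the bandwidth value is illustrative, since the paper does not report the h that was used.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 5))            # normal training instances
x_test = np.array([[4.0, 4.0, 0.0, 0.0, 0.0]])  # test instance

# Parzen window / kernel density estimate with a Gaussian kernel (Equation (6)).
kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_normal)

# Low estimated density -> high anomaly score, so negate the log-density.
anomaly_score = -kde.score_samples(x_test)[0]
print(anomaly_score)
```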
PCA is a statistical method that finds a new set of axes that preserve the variance of the original dataset as much as possible [34]. Once these axes are determined, the high-dimensional original dataset can be mapped to a lower-dimensional space without significant loss of information [15]. Solving PCA for the dataset $X \in \mathbb{R}^{n \times p}$ is equivalent to finding the eigenvector matrix $V \in \mathbb{R}^{p \times p}$ of its covariance matrix and the corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_p$ ($\lambda_1 > \lambda_2 > \cdots > \lambda_p$). Applying PCA, an instance $\mathbf{x} \in \mathbb{R}^{p}$ is mapped into a $k$-dimensional space ($k < p$) using the first $k$ eigenvectors:
$$ \mathbf{x}' = V^{T} \mathbf{x} \quad (7) $$
where $V \in \mathbb{R}^{p \times k}$ consists of the first $k$ eigenvectors. In PCA, the reconstruction error $e(\mathbf{x})$, which is the difference between the original vector and its image reconstructed from the lower-dimensional space back to the original space, can be used as an anomaly score:
$$ e(\mathbf{x}) = \left\| \mathbf{x} - V V^{T} \mathbf{x} \right\|^{2} = \left\| \mathbf{x} \right\|^{2} - \left\| V V^{T} \mathbf{x} \right\|^{2} \quad (8) $$
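A short sketch of this reconstruction-error score using scikit-learn's PCA follows; the number of retained components k and the synthetic data are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 10))            # normal training instances
x_test = rng.normal(loc=3.0, size=(1, 10))       # test instance far from the normal data

# Fit PCA on normal instances and keep the first k eigenvectors.
k = 3
pca = PCA(n_components=k).fit(X_normal)

# Reconstruction error e(x): squared distance between x and its projection
# mapped back to the original space (Equation (8)).
x_hat = pca.inverse_transform(pca.transform(x_test))
anomaly_score = np.sum((x_test - x_hat) ** 2)
print(anomaly_score)
```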
KMC is a clustering method that assigns each observation $\mathbf{x}_j$ to the closest centroid $\mathbf{c}_i$ so that observations assigned to the same centroid form a cluster [15]:
$$ \underset{\{\mathbf{c}_i\}}{\arg\min} \sum_{i=1}^{K} \sum_{\mathbf{x}_j \in C_i} \left\| \mathbf{x}_j - \mathbf{c}_i \right\|^{2} \quad (9) $$
where $K$ is the number of clusters and is an algorithm-specific hyper-parameter that must be determined prior to executing the algorithm. We examined three $K$ values (3, 5, and 10) in this study. Once KMC is completed based only on normal instances, the distance between a new instance and its closest centroid is used to compute the anomaly score, as shown in Figure 4d [35]. $D_i$ is the distance between the test instance and its closest centroid, while $R$ is the radius of the cluster (the distance between the centroid and the farthest instance from the centroid in the cluster). The relative distance $D_i/R$ is the commonly used anomaly score in KMC-based anomaly detection.
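The following sketch computes the relative-distance score D_i/R on top of scikit-learn's KMeans, fitted on normal instances only; K = 3 is used here for brevity, and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 5))             # normal training instances
x_test = np.array([[4.0, 4.0, 0.0, 0.0, 0.0]])   # test instance

# Cluster the normal instances (K = 3 here; the paper also examines K = 5 and 10).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_normal)

# Radius of each cluster: distance from the centroid to its farthest member.
dists = np.linalg.norm(X_normal - kmeans.cluster_centers_[kmeans.labels_], axis=1)
radius = np.array([dists[kmeans.labels_ == i].max() for i in range(3)])

# Anomaly score D_i / R with respect to the closest centroid of the test instance.
closest = kmeans.predict(x_test)[0]
d_i = np.linalg.norm(x_test[0] - kmeans.cluster_centers_[closest])
print(d_i / radius[closest])
```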
In addition to the individual anomaly detection algorithms, we also consider combinations of these algorithms. Even when trained on the same data, each algorithm builds its optimal model in a different way, so no single algorithm is superior in all situations in the machine learning field [36]. In this situation, combining different techniques can be advantageous, as ensembles generally improve the prediction performance compared to a single algorithm [36,37,38,39]. Hence, we examined all possible combinations of the four individual anomaly detectors to determine the best combination for the given task and dataset. Since each algorithm has a different range of anomaly scores, we used ranks instead of raw scores to produce the ensemble output. More specifically, for each test instance, its anomaly score ranking under each model in the ensemble is computed, and the inverse of the averaged ranks is used as the anomaly score of the ensemble.
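A minimal sketch of this rank-based combination is given below; the two score vectors stand in for trained detectors such as Parzen and PCA, and the exact tie-handling is an implementation detail not specified in the paper.

```python
import numpy as np
from scipy.stats import rankdata

def ensemble_anomaly_score(score_lists):
    """Combine anomaly scores from several detectors by averaging ranks.

    score_lists: list of 1-D arrays, one per detector, higher = more anomalous.
    Returns one ensemble score per test instance (higher = more anomalous).
    """
    # Rank 1 = most anomalous instance for that detector.
    ranks = [rankdata(-np.asarray(s)) for s in score_lists]
    avg_rank = np.mean(ranks, axis=0)
    return 1.0 / avg_rank          # inverse of the averaged rank

# Illustrative scores from two detectors on four test instances.
parzen_scores = np.array([0.1, 0.9, 0.3, 0.2])
pca_scores = np.array([0.2, 0.7, 0.6, 0.1])
print(ensemble_anomaly_score([parzen_scores, pca_scores]))
```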

4. Results

Generally, an anomaly detection algorithm is trained based only on normal data in a situation where most instances belong to the normal class and only a few instances are from the abnormal class. Under this condition, it is impossible to set a cut-off value for detection as is generally done for classification models. Hence, for the daily activity-based and e-mail contents-based models, the performances of the anomaly detectors are evaluated as follows. First, the dataset is split into a training dataset, which contains 90% of randomly selected normal instances, and a test dataset, which contains the remaining 10% of normal instances and all abnormal instances. Second, an anomaly detection algorithm is trained based on the training dataset only. Third, the anomaly scores for the instances in the test dataset are computed and sorted in descending order. Finally, we compute the true detection rate using seven different cut-off values (1%, 5%, 10%, 15%, 20%, 25%, and 30%) based on Equation (10):
$$ \text{True detection rate in top } X\% = \frac{\text{Number of malicious activities included in the top } X\% \text{ anomaly scores}}{\text{Total number of malicious activities}} \quad (10) $$
To obtain statistically reliable performance estimates, we repeated the above process 30 times for each anomaly detection algorithm and used the average true detection rate in the top X% as the performance measure for insider-threat detection. Since the number of samples in the e-mail network data is much smaller than in the daily activity and e-mail contents data, we used all normal instances for training, and anomaly scores were computed for all normal and abnormal instances for the e-mail network-based anomaly detection model.
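The evaluation procedure described above can be sketched as follows; the 90/10 split, the 30 repetitions, and the top-X% cut-offs follow the text, while the toy detector (distance from the training mean) is only a placeholder for the one-class models.

```python
import numpy as np

def true_detection_rate(scores, labels, cutoff):
    """Share of malicious instances among the top `cutoff` fraction of
    anomaly scores (Equation (10))."""
    n_top = max(1, int(np.ceil(cutoff * len(scores))))
    top_idx = np.argsort(scores)[::-1][:n_top]
    return labels[top_idx].sum() / labels.sum()

def evaluate(X_normal, X_abnormal, fit_and_score, cutoffs, n_repeats=30, seed=0):
    rng = np.random.default_rng(seed)
    rates = np.zeros((n_repeats, len(cutoffs)))
    for r in range(n_repeats):
        # 90% of normal instances for training; the rest plus all abnormal for test.
        idx = rng.permutation(len(X_normal))
        n_train = int(0.9 * len(X_normal))
        X_train = X_normal[idx[:n_train]]
        X_test = np.vstack([X_normal[idx[n_train:]], X_abnormal])
        y_test = np.concatenate([np.zeros(len(idx) - n_train), np.ones(len(X_abnormal))])
        scores = fit_and_score(X_train, X_test)   # higher score = more anomalous
        rates[r] = [true_detection_rate(scores, y_test, c) for c in cutoffs]
    return rates.mean(axis=0)

# Example usage with a toy detector: distance from the training mean.
rng = np.random.default_rng(1)
Xn = rng.normal(size=(1000, 5))
Xa = rng.normal(loc=3.0, size=(10, 5))
detector = lambda Xtr, Xte: np.linalg.norm(Xte - Xtr.mean(axis=0), axis=1)
print(evaluate(Xn, Xa, detector, cutoffs=[0.01, 0.05, 0.10, 0.30]))
```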

4.1. Insider-Threat Detection with Daily Activity Summaries

Table 10, Table 11 and Table 12 show the insider-threat detection performance of the six individual anomaly detectors together with the best combination we determined (i.e., Parzen + PCA), based on the daily activity summary dataset for the three roles: “Electrical Engineer”, “IT Admin”, and “Salesman”. As explained in the previous section, we tested all combinations of individual models, and the “Parzen + PCA” combination resulted in the best performance for 10 out of 21 cases (three roles with seven cut-off rankings), followed by “Gauss + Parzen + PCA” (5 cases). The anomaly detection performances of all possible ensemble models are provided in Table A4, Table A5 and Table A6 in Appendix A. Table A7 summarizes the number of best cases for each ensemble model. The proposed method exhibits effective detection performance. For example, among the top 1% of the anomaly scores predicted by Gauss for “Electrical Engineer”, half of the actual abnormal behaviors are successfully detected, which is more than 50 times higher than a random model that designates 1% of test instances as anomalous behaviors.
For the “Electrical Engineer” role, when the top 1% of suspicious daily behaviors are monitored, the system can detect at most 53.66% of the actual insider threats (KMC with K = 10). This means that among the 141 test instances belonging to the top 1% of anomaly score rankings, on average 5.37 out of 10 actual malicious behaviors are correctly detected, which can improve the monitoring efficiency of insider surveillance systems by prioritizing suspicious behaviors with high accuracy. This detection rate increases to 76.33%, 79.33%, and 90% when the coverage of monitoring activities is extended to the top 5%, 10%, and 15% of anomaly scores, respectively. For the “IT Admin” role, detection performance is not as strong as for “Electrical Engineer”, but it is still much better than the random guess model. The lift of the true detection rate against the random guess is 9.71 with the 1% cut-off and 4.35 with the 5% cut-off. For the “Salesman” role, although the detection performance is not as good as for “Electrical Engineer” at the stricter cut-off values (1–15%), actual malicious activities are gradually detected as the cut-off is relaxed (15–30%). Hence, when the cut-off value is set to the top 30% of anomaly scores, 94.79% of the actual malicious behaviors are identified by the Parzen + PCA combination, which is the highest detection rate among the three roles (90% for “Electrical Engineer” and 40.87% for “IT Admin”).
Among the single algorithms, Parzen yielded the best detection rate for eight cases out of 21 cases (seven cut-off values and three roles). Although both Gauss and Parzen are based on density estimation, the assumption of Gauss, i.e., single multivariate Gaussian distribution for the entire dataset, is too strict to be applied to real datasets, which results in the worst performances in many cases. On the other hand, Parzen estimates the probability distribution in a more flexible manner, so it can be well fitted to non-Gaussian shape distributions. Note also that the Parzen + PCA combination yields the best detection performance in most cases. Compared to the detection performance of single algorithms, Parzen + PCA outperformed the single best algorithms for 10 cases. The effect of model combination is especially noticeable for the “Salesman” role.

4.2. Insider-Threat Detection with E-mail Contents

Table 13, Table 14 and Table 15 show the insider-threat detection performance of the six individual anomaly detectors together with their best combination, i.e., Parzen + PCA, based on the e-mail contents dataset for the three roles. In contrast to the daily activity datasets, anomaly detection is more successful for “IT Admin” than for the other two roles. Parzen + PCA yields a 37.56% detection rate with the top 1% cut-off value, and the rate steadily increases to 98.67% for the top 30% cut-off value. Anomaly detection performances for “Electrical Engineer” and “Salesman” are similar: the lift of the true detection rate against the random guess is above 4.5 with the 1% cut-off value, and approximately two-thirds of abnormal activities are detected with the 30% cut-off value.
Among the anomaly detection algorithms, KMC is the most effective algorithm for “Electrical Engineer”, but no single algorithm yielded the best performance for “Salesman”. Another observation worth noting is that the performance of single anomaly detection algorithms is highly dependent on the characteristics of the dataset. Parzen + PCA yielded the highest detection rate for “IT Admin” but did not work well for “Electrical Engineer” and “Salesman”. On the other hand, KMC produced the highest detection rate for “Electrical Engineer” but failed to detect any of the actual malicious e-mails for “IT Admin”.

4.3. Detection with E-mail Network

For the email communication history dataset among the 4000 users, four users (CDE1846, CMP2946, DNS1758, and HIS1706) sent or received numerous unusual e-mails. Table 16, Table 17, Table 18 and Table 19 show the user-level insider-threat detection performance of the anomaly detection models based on the e-mail communication network dataset.
It is worth noting that all the malicious e-mail communications of three users (CDE1846, DNS1758, and HIS1706) were successfully detected by the anomaly detection algorithms using a cut-off value of at most 25%. Surprisingly, Gauss yielded a 100% detection rate by monitoring only the top 5% of suspicious instances for user CDE1846, whereas KMC succeeded in detecting all the malicious instances of user HIS1706 by monitoring the top 10% of suspicious instances. The only exceptional user is CMP2946, for whom the anomaly detection models failed to detect more than 30% of the actual malicious e-mail communications even when the cut-off value was relaxed to the top 30% of anomaly scores. Another interesting observation is that, unlike for the other two datasets, model combinations did not achieve a better detection performance than individual models. The best algorithms for each user are Gauss for CDE1846 and KMC for HIS1706. For the other two users, none of the single algorithms yielded the highest detection rate for all cut-off values.

5. Conclusions

In this paper, we proposed an insider-threat detection framework based on user behavior modeling and anomaly detection algorithms. During the user behavior modeling, individual users’ heterogeneous behaviors are transformed into a structured dataset where each row is associated with an instance (user-day, e-mail content, user-week) and each column is associated with an input variable for the anomaly detection models. Based on the CERT dataset, we constructed three datasets, i.e., a daily activity summary dataset based on user activity logs, an e-mail content dataset based on topic modeling, and an e-mail communication network dataset based on the user’s account and sending/receiving information. Based on these three datasets, we constructed insider-threat detection models by employing machine learning-based anomaly detection algorithms to simulate real-world organizations in which only a few insiders’ behaviors are actually malicious.
Experimental results show that the proposed framework can work reasonably well to detect insiders’ malicious behaviors. Based on the daily activity summary dataset, the anomaly detection model yielded a detection rate of up to 53.67% by monitoring only the top 1% of suspicious instances. When the monitoring coverage was extended to the top 30% of anomaly scores, more than 90% of actual abnormal behaviors were detected for two of the three evaluated roles. Based on the e-mail content datasets, up to 37.56% of malicious e-mails were detected with the 1% cut-off value, while the detection rate increased to 65.64% (98.67% at best) when the top 30% of suspicious e-mails were monitored. Based on the e-mail communication network dataset, all the malicious instances were correctly identified for three out of the four tested users.
Although the proposed framework was empirically verified, there are some limitations in the current research, which lead us to future research directions. First, we constructed three structured datasets to train the anomaly detection algorithms. Because the instances of these three datasets differ from each other (a user’s daily activity, an e-mail’s topic distribution, a user’s weekly e-mail communication), the anomaly detection models are trained independently on each dataset. If these different anomaly detection results were properly integrated, it might be possible to achieve a better insider-threat detection performance. Second, we built the insider-threat detection model based on a specific time unit, e.g., a day. In other words, this approach can detect malicious behaviors in a batch process, but cannot detect them in real time. Hence, it could be worth developing a sequence-based insider-threat detection model that can process online stream data. Third, the proposed model is purely data-driven. However, in the security domain, combining the knowledge of experts with a purely data-driven machine learning model can enhance insider-threat detection performance. Lastly, although the CERT dataset was carefully constructed to contain various threat scenarios designed by domain experts, it is still a simulated, artificially generated dataset. If the proposed framework can be verified on a real-world dataset, its practical applicability would be further validated.

Author Contributions

Conceptualization, P.K. and J.K.; methodology, J.K., M.P. and H.K.; software, J.K.; validation, S.C.; formal analysis, J.K. and H.K.; investigation, M.P.; resources, M.P. and H.K.; data curation, S.C.; writing—original draft preparation, J.K.; writing—review and editing, P.K.; visualization, J.K. and M.P.; supervision, P.K.; project administration, P.K.; funding acquisition, P.K.

Funding

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2019R1F1A1060338) and Korea Electric Power Corporation (Grant number: R18XA05).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. User behavior log records of computer activities in the CERT dataset.
Recorded Item | Description
File
ID | Primary key
Date | Day/Month/Year Hour:Min:Sec
User | Primary key of a user
PC | Primary key of a PC
Filename | File directory
Activity | File Open/Write/Copy/Delete
To_removable_media | Data PC to Removable media (TRUE, FALSE)
From_removable_media | Data Removable media to PC (TRUE, FALSE)
Content | Hexadecimal encoded file header and contents text
Device
ID | Primary key of an observation
Date | Day/Month/Year Hour:Min:Sec
User | User ID
PC | Key number of a PC logged on
File tree | File directory
Activity | Log on or log off (Binary type)
HTTP
ID | Primary key
Date | Day/Month/Year Hour:Min:Sec
User | Primary key of a user
PC | Primary key of a PC
URL | URL address
Activity | Activity (Visit/Upload/Download)
Content | Content of a URL
Email
ID | Primary key of an observation
Date | Day/Month/Year Hour:Min:Sec
User | Primary key of a user
PC | Primary key of a PC
To | Receiver
Cc | Carbon copy
Bcc | Blind carbon copy
From | Sender
Activity | Activity (Send/Receive)
Size | Size of an email
Attachments | Attachment file name
Content | Content of an email
Psychometric
Employee name | Employee name
User ID | ID of an employee
O | Openness to experience
C | Conscientiousness
E | Extraversion
A | Agreeableness
N | Neuroticism
LDAP
Employee name | Employee name
User ID | ID of a user
Email | Email address
Role | Role of a user
Project | Project participation
Business unit | Business unit
Functional unit | Functional unit
Department | Department
Team | Team
Supervisor | Supervisor
Decoy file
Decoy filename | Decoy file directory, name and extension
PC | PC name
Table A2. List of variables extracted from integrated daily activities.
Variable Name | Description
Base
user | User ID
date | Date
workingday | Working day or not (Binary type)
Logon
numlogonDay | Number of logons on working hours
numlogonNight | Number of logons on off-hours
numlogoffDay | Number of logoffs on working hours
numlogoffNight | Number of logoffs on off-hours
numPClogonDay | Number of PC logons on working hours
numPClogonNight | Number of PC logons on off-hours
numPClogoffDay | Number of PC logoffs on working hours
numPClogoffNight | Number of PC logoffs on off-hours
onoffNotsameDay | Count of cases where the numbers of logons and logoffs are not the same on working hours
onoffNotsameNight | Count of cases where the numbers of logons and logoffs are not the same on off-hours
Device
numPCwithUSBDay | Number of PCs that use a USB device on working hours
numPCwithUSBNight | Number of PCs that use a USB device on off-hours
numConnectionDay | Number of connections with devices on working hours
numConnectionNight | Number of connections with devices on off-hours
numCopy2DeviceDay | Number of copied files from PC to devices on working hours
numCopy2DeviceNight | Number of copied files from PC to devices on off-hours
numWrite2DeviceDay | Number of written files from PC to devices on working hours
numWrite2DeviceNight | Number of written files from PC to devices on off-hours
numCopyFromDeviceDay | Number of copied files from devices to PC on working hours
numCopyFromDeviceNight | Number of copied files from devices to PC on off-hours
numWriteFromDeviceDay | Number of files written from devices to PC on working hours
numWriteFromDeviceNight | Number of files written from devices to PC on off-hours
numDelFromDeviceDay | Number of deleted files from devices on working hours
numDelFromDeviceNight | Number of deleted files from devices on off-hours
numOpenOnPCDay | Number of opened files on working hours
numOpenOnPCNight | Number of opened files on off-hours
Web access
numWebAccDay | Number of web accesses on working hours
numWebAccNight | Number of web accesses on off-hours
numURLAccessedDay | Number of accessed URLs on working hours
numURLAccessedNight | Number of accessed URLs on off-hours
numUploadDay | Number of uploads on working hours
numUploadNight | Number of uploads on off-hours
numDownloadDay | Number of downloads on working hours
numDownloadNight | Number of downloads on off-hours
Email
numAttachmentDay | Number of attachments on working hours
numAttachmentNight | Number of attachments on off-hours
numSendDay | Number of sent emails on working hours
numSendNight | Number of sent emails on off-hours
numRecievedDay | Number of received emails on working hours
numRecievedNight | Number of received emails on off-hours
numEmailSentwithAttachDay | Number of sent emails containing attachments on working hours
numEmailSentwithAttachNight | Number of sent emails containing attachments on off-hours
numEmailRecievedwithAttachDay | Number of received emails containing attachments on working hours
numEmailRecievedwithAttachNight | Number of received emails containing attachments on off-hours
numdistinctRecipientsDay | Number of recipients on working hours
numdistinctRecipientsNight | Number of recipients on off-hours
numinternalRecipientsDay | Number of internal recipients on working hours
numinternalRecipientsNight | Number of internal recipients on off-hours
Role
role | Role
functional_unit | Functional unit
department | Department
team | Team
Psychometric
O | Openness to experience
C | Conscientiousness
E | Extraversion
A | Agreeableness
N | Neuroticism
Target | Anomaly or not (Binary type)
The list of variables calculated to apply the email sending/receiving network information to the one-class classification models is shown in Table A3.
Table A3. List of email network variables.
User_ID | User ID
Start_Data | Start date
End_Data | End date
Target | Anomaly or not (Binary type)
Jaccard_Out_Degree_Zero | Personal e-mail account: Jaccard similarity of the out-degree with the previous time step
Jaccard_In_Degree_Zero | Personal e-mail account: Jaccard similarity of the in-degree with the previous time step
Jaccard_Out_Degree_One | Company e-mail account: Jaccard similarity of the out-degree with the previous time step
Jaccard_In_Degree_One | Company e-mail account: Jaccard similarity of the in-degree with the previous time step
Not_Company_Out_Other_Role | Personal e-mail account, current time step: out-degree of e-mails sent to other roles in the company
Not_Company_Out_FROM_To_Me_Role | Personal e-mail account, current time step: out-degree of e-mails sent to the user's own account
Not_Company_Out_Other_Company | Personal e-mail account, current time step: out-degree of e-mails sent to non-company domains
Not_Company_Out_Same_Role | Personal e-mail account, current time step: out-degree of e-mails sent to the same role in the company
Not_Company_In_Other_Role | Personal e-mail account, current time step: in-degree of e-mails from other roles in the company
Not_Company_In_FROM_To_Me_Role | Personal e-mail account, current time step: in-degree of e-mails from the user's own account
Not_Company_In_Other_Company | Personal e-mail account, current time step: in-degree of e-mails from non-company domains
Not_Company_In_Same_Role | Personal e-mail account, current time step: in-degree of e-mails from the same role in the company
Company_Out_Other_Role | Company e-mail account, current time step: out-degree of e-mails sent to other roles in the company
Company_Out_FROM_To_Me_Role | Company e-mail account, current time step: out-degree of e-mails sent to the user's own account
Company_Out_Other_Company | Company e-mail account, current time step: out-degree of e-mails sent to non-company domains
Company_Out_Same_Role | Company e-mail account, current time step: out-degree of e-mails sent to the same role in the company
Company_In_Other_Role | Company e-mail account, current time step: in-degree of e-mails from other roles in the company
Company_In_FROM_To_Me_Role | Company e-mail account, current time step: in-degree of e-mails from the user's own account
Company_In_Other_Company | Company e-mail account, current time step: in-degree of e-mails from non-company domains
Company_In_Same_Role | Company e-mail account, current time step: in-degree of e-mails from the same role in the company
Company_Account_Bet | Company e-mail account, current time step: betweenness centrality
NotCompany_Account_Bet | Personal e-mail account, current time step: betweenness centrality
Same_Role_View | Number of company e-mails viewed from the same role in the current time step, excluding notification e-mails within the role
Same_Role_Send | Number of company e-mails sent from the same role in the current time step, excluding notification e-mails within the role
Other_Company_View | Number of non-company e-mails viewed from the same role in the current time step, excluding notification e-mails within the role
Other_Company_Send | Number of non-company e-mails sent from the same role in the current time step, excluding notification e-mails within the role
Diff_Role_View | Number of e-mails viewed from a different role in the current time step, excluding notification e-mails within the role
Diff_Role_Send | Number of e-mails sent from a different role in the current time step, excluding notification e-mails within the role
Table A4. True detection rate of each anomaly detection algorithm based on the daily activity summary dataset (Electrical Engineer). Best performance is shown in bold and underlined.
Algorithm | 0.10% | 0.50% | 1.00% | 3.00% | 5.00% | 7.00% | 10.00%
Parzen + PCA | 0.3367 | 0.4300 | 0.4833 | 0.6833 | 0.7633 | 0.7800 | 0.7933
Gauss + Parzen | 0.0300 | 0.3000 | 0.5767 | 0.7000 | 0.7000 | 0.7000 | 0.7000
Gauss + kMeans | 0.4933 | 0.5000 | 0.5133 | 0.6033 | 0.6100 | 0.6300 | 0.6900
Gauss + PCA | 0.3200 | 0.4667 | 0.5300 | 0.6400 | 0.6767 | 0.6867 | 0.6933
Parzen + kMeans | 0.3133 | 0.3633 | 0.6167 | 0.7533 | 0.7633 | 0.7867 | 0.8000
kMeans + PCA | 0.3400 | 0.5033 | 0.5600 | 0.6800 | 0.7267 | 0.7400 | 0.7567
Gauss + Parzen + kMeans | 0.2733 | 0.5033 | 0.6033 | 0.7167 | 0.7367 | 0.7467 | 0.7900
Gauss + Parzen + PCA | 0.2900 | 0.5133 | 0.6433 | 0.6900 | 0.7500 | 0.8000 | 0.8000
Gauss + kMeans + PCA | 0.3267 | 0.5167 | 0.5600 | 0.6233 | 0.6867 | 0.7167 | 0.7400
Parzen + kMeans + PCA | 0.3200 | 0.5667 | 0.6933 | 0.7533 | 0.7633 | 0.7700 | 0.7767
Gauss + Parzen + kMeans + PCA | 0.3067 | 0.5567 | 0.6400 | 0.7333 | 0.7767 | 0.7967 | 0.8000
Table A5. True detection rate of each anomaly detection algorithm based on the daily activity summary dataset (IT Admin). Best performance is shown in bold and underlined.
Algorithm | 0.10% | 0.50% | 1.00% | 3.00% | 5.00% | 7.00% | 10.00%
Parzen + PCA | 0.0029 | 0.0710 | 0.0971 | 0.1812 | 0.2174 | 0.2232 | 0.2580
Gauss + Parzen | 0.0130 | 0.0435 | 0.0435 | 0.0870 | 0.1275 | 0.1304 | 0.1333
Gauss + kMeans | 0.0000 | 0.0377 | 0.0449 | 0.0710 | 0.0739 | 0.0870 | 0.0899
Gauss + PCA | 0.0014 | 0.0420 | 0.0435 | 0.0478 | 0.1014 | 0.1261 | 0.1536
Parzen + kMeans | 0.0217 | 0.0638 | 0.0797 | 0.1101 | 0.1420 | 0.1739 | 0.2043
kMeans + PCA | 0.0014 | 0.0333 | 0.0739 | 0.1014 | 0.1116 | 0.1406 | 0.1971
Gauss + Parzen + kMeans | 0.0275 | 0.0464 | 0.0551 | 0.0783 | 0.0971 | 0.1130 | 0.1536
Gauss + Parzen + PCA | 0.0217 | 0.0435 | 0.0522 | 0.1101 | 0.1348 | 0.1696 | 0.1797
Gauss + kMeans + PCA | 0.0014 | 0.0420 | 0.0565 | 0.0768 | 0.0942 | 0.1116 | 0.1565
Parzen + kMeans + PCA | 0.0246 | 0.0246 | 0.0246 | 0.0246 | 0.0246 | 0.0246 | 0.0246
Gauss + Parzen + kMeans + PCA | 0.0275 | 0.0478 | 0.0652 | 0.0928 | 0.1232 | 0.1522 | 0.1957
Table A6. True detection rate of each anomaly detection algorithm based on the daily activity summary dataset (Salesman). Best performance is shown in bold and underlined.
Algorithm | 0.10% | 0.50% | 1.00% | 3.00% | 5.00% | 7.00% | 10.00%
Parzen + PCA | 0.0344 | 0.0667 | 0.1021 | 0.1938 | 0.3406 | 0.4771 | 0.6156
Gauss + Parzen | 0.0000 | 0.0625 | 0.0625 | 0.2188 | 0.3302 | 0.3313 | 0.5302
Gauss + kMeans | 0.0000 | 0.0229 | 0.0510 | 0.0896 | 0.1104 | 0.1271 | 0.1583
Gauss + PCA | 0.0000 | 0.0146 | 0.0635 | 0.2344 | 0.3292 | 0.3625 | 0.4125
Parzen + kMeans | 0.0010 | 0.0167 | 0.0333 | 0.1615 | 0.1927 | 0.2615 | 0.3063
kMeans + PCA | 0.0135 | 0.0354 | 0.0542 | 0.1313 | 0.1979 | 0.2323 | 0.2771
Gauss + Parzen + kMeans | 0.0042 | 0.0417 | 0.0677 | 0.1854 | 0.2646 | 0.3406 | 0.4052
Gauss + Parzen + PCA | 0.0000 | 0.0438 | 0.0688 | 0.2500 | 0.3813 | 0.4792 | 0.5750
Gauss + kMeans + PCA | 0.0167 | 0.0438 | 0.0635 | 0.1417 | 0.2156 | 0.2677 | 0.3292
Parzen + kMeans + PCA | 0.0000 | 0.0323 | 0.0927 | 0.1938 | 0.2646 | 0.3313 | 0.4302
Gauss + Parzen + kMeans + PCA | 0.0042 | 0.0542 | 0.1010 | 0.2344 | 0.3052 | 0.3740 | 0.4813
Table A7. The number of times each ensemble algorithm was selected as the best model.
Number | Algorithm | Number of Times Selected as Best Model
1 | Parzen + PCA | 10
2 | Gauss + Parzen | 0
3 | Gauss + kMeans | 1
4 | Gauss + PCA | 0
5 | Parzen + kMeans | 2
6 | kMeans + PCA | 0
7 | Gauss + Parzen + kMeans | 0
8 | Gauss + Parzen + PCA | 5
9 | Gauss + kMeans + PCA | 0
10 | Parzen + kMeans + PCA | 3
11 | Gauss + Parzen + kMeans + PCA | 3

References

  1. Lindauer, B.; Glasser, J.; Rosen, M.; Wallnau, K. Generating Test Data for Insider Threat Detectors. JoWUA 2014, 5, 80–94. [Google Scholar]
  2. Schultz, E.E. A framework for understanding and predicting insider attacks. Comput. Secur. 2012, 21, 526–531. [Google Scholar] [CrossRef]
  3. Legg, P.A. Visualizing the insider threat: Challenges and tools for identifying malicious user activity. In Proceedings of the 2015 IEEE Symposium on Visualization for Cyber Security (VizSec), Chicago, IL, USA, 25 October 2015; pp. 1–7. [Google Scholar]
  4. Ted, E.; Goldberg, H.G.; Memory, A.; Young, W.T.; Rees, B.; Pierce, R.; Huang, D.; Reardon, M.; Bader, D.A.; Chow, E. Detecting insider threats in a real corporate database of computer usage activity. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; pp. 1393–1401. [Google Scholar]
  5. Mather, T.; Kumaraswamy, S.; Latif, S. Cloud Security and Privacy: An Enterprise Perspective on Risks and Compliance; O’Reilly Media, Inc.: Newton, MA, USA, 2009. [Google Scholar]
  6. Salem, M.B.; Hershkop, S.; Stolfo, S.J. A survey of insider attack detection research. Insid. Attack Cyber Secur. 2008, 39, 69–90. [Google Scholar]
  7. Theoharidou, M.; Kokolakis, S.; Karyda, M.; Kiountouzis, E. The insider threat to information systems and the effectiveness of ISO17799. Comput. Secur. 2005, 24, 472–484. [Google Scholar] [CrossRef]
  8. Wong, W.-K.; Moore, A.; Cooper, G.; Wagner, M. Rule-based anomaly pattern detection for detecting disease. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-02), Edmonton, AB, Canada, 28 July–1 August 2002; pp. 217–223. [Google Scholar]
  9. Cappelli, D.M.; Moore, A.P.; Trzeciak, R.F. The CERT Guide to Insider Threats: How to Prevent, Detect, and Respond to Information Technology Crimes (Theft, Sabotage, Fraud); Addison-Wesley: Boston, MA, USA, 2012. [Google Scholar]
  10. Eldardiry, H.; Sricharan, K.; Liu, J.; Hanley, J.; Price, B.; Brdiczka, O.; Bart, E. Multi-source fusion for anomaly detection: Using across-domain and across-time peer-group consistency checks. JoWUA 2014, 5, 39–58. [Google Scholar]
  11. Lunt, T.F.; Jagannathan, R.; Lee, R.; Whitehurst, A.; Listgarten, S. Knowledge-based intrusion detection. In Proceedings of the Annual AI Systems in Government Conference, Washington, DC, USA, 27–31 March 1989; pp. 102–107. [Google Scholar]
  12. Eberle, W.; Graves, J.; Holder, L. Insider threat detection using a graph-based approach. J. Appl. Secur. Res. 2010, 6, 32–81. [Google Scholar] [CrossRef]
  13. Parveen, P.; Evans, J.; Thuraisingham, B.; Hamlen, K.W.; Khan, L. Insider threat detection using stream mining and graph mining. In Proceedings of the Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third Inernational Conference on Social Computing (SocialCom), Boston, MA, USA, 9–11 October 2011; pp. 1102–1110. [Google Scholar]
14. Mayhew, M.; Atighetchi, M.; Adler, A.; Greenstadt, R. Use of machine learning in big data analytics for insider threat detection. In Proceedings of the MILCOM 2015—2015 IEEE Military Communications Conference, Tampa, FL, USA, 26–28 October 2015; pp. 915–922.
15. Alpaydin, E. Introduction to Machine Learning; MIT Press: Cambridge, MA, USA, 2014.
16. Gavai, G.; Sricharan, K.; Gunning, D.; Hanley, J.; Singhal, M.; Rolleston, R. Supervised and unsupervised methods to detect insider threat from enterprise social and online activity data. J. Wirel. Mob. Netw. Ubiquitous Comput. Dependable Appl. 2015, 6, 47–63.
17. Ho, T.K. Random decision forests. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; pp. 278–282.
18. Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008; pp. 413–422.
19. Sun, Y.; Wong, A.K.; Kamel, M.S. Classification of imbalanced data: A review. Int. J. Pattern Recognit. Artif. Intell. 2009, 23, 687–719.
20. Glasser, J.; Lindauer, B. Bridging the gap: A pragmatic approach to generating insider threat data. In Proceedings of the 2013 IEEE Security and Privacy Workshops, San Francisco, CA, USA, 23–24 May 2013; pp. 98–104.
21. McGough, A.S.; Arief, B.; Gamble, C.; Wall, D.; Brennan, J.; Fitzgerald, J.; van Moorsel, A.; Alwis, S.; Theodoropoulos, G.; Ruck-Keene, E. Detecting insider threats using Ben-ware: Beneficial intelligent software for identifying anomalous human behavior. J. Wirel. Mob. Netw. Ubiquitous Comput. Dependable Appl. 2015, 6, 1–44.
22. Young, W.T.; Goldberg, H.G.; Memory, A.; Sartain, J.F.; Senator, T.E. Use of domain knowledge to detect insider threats in computer activities. In Proceedings of the 2013 IEEE Security and Privacy Workshops, San Francisco, CA, USA, 23–24 May 2013; pp. 60–67.
23. Nellikar, S. Insider Threat Simulation and Performance Analysis of Insider Detection Algorithms with Role Based Models. Master's Thesis, University of Illinois at Urbana-Champaign, Champaign County, IL, USA, 19 May 2010.
24. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
25. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet allocation. J. Mach. Learn. Res. 2003, 3, 993–1022.
26. Wallach, H.M. Topic modeling: Beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; ACM: New York, NY, USA; pp. 977–984.
27. Levandowsky, M.; Winter, D. Distance between sets. Nature 1971, 234, 34–35.
28. Wasserman, S.; Faust, K. Social Network Analysis: Methods and Applications; Cambridge University Press: Cambridge, UK, 1994.
29. Ali, A.; Shamsuddin, S.M.; Ralescu, A.L. Classification with class imbalance problem. Int. J. Adv. Soft Comput. Appl. 2013, 5, 1–38.
30. Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. 2009, 41, 1–58.
31. Barnett, V.; Lewis, T. Outliers in Statistical Data; Wiley: New York, NY, USA, 1994.
32. Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification; John Wiley & Sons: Hoboken, NJ, USA, 2012.
33. Müller, K.-R.; Mika, S.; Rätsch, G.; Tsuda, K.; Schölkopf, B. An introduction to kernel-based learning algorithms. IEEE Trans. Neural Netw. 2001, 12, 181–201.
34. Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52.
35. Markou, M.; Singh, S. Novelty detection: A review—Part 1: Statistical approaches. Signal Process. 2003, 83, 2481–2497.
36. Fernández-Delgado, M.; Cernadas, E.; Barro, S.; Amorim, D. Do we need hundreds of classifiers to solve real world classification problems? J. Mach. Learn. Res. 2014, 15, 3133–3181.
37. Woźniak, M.; Graña, M.; Corchado, E. A survey of multiple classifier systems as hybrid systems. Inf. Fusion 2014, 16, 3–17.
38. Britto, A.S.; Sabourin, R.; Oliveira, L.E. Dynamic selection of classifiers—A comprehensive review. Pattern Recognit. 2014, 47, 3665–3680.
39. Tang, E.K.; Suganthan, P.N.; Yao, X. An analysis of diversity measures. Mach. Learn. 2006, 65, 247–271.
Figure 1. Latent Dirichlet allocation (LDA) process and its two outputs. (a) LDA document generation process, (b) An illustrative example of the LDA document generation process, (c) Two outputs of LDA ((c-1) Per-document topic proportions (θ_d), (c-2) Per-topic word distributions (Φ_k)).
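As a minimal sketch of how per-document topic proportions such as those in Table 7 can be estimated from e-mail bodies, the code below uses a standard scikit-learn LDA implementation; the placeholder e-mails and the small number of topics are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal LDA sketch (placeholder e-mails; hyperparameters are illustrative assumptions).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

emails = [
    "please review the attached quarterly sales report",
    "server maintenance scheduled tonight expect downtime",
    "meeting moved to friday agenda attached",
]

# LDA operates on raw term counts (bag-of-words), not tf-idf.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(emails)

# Table 7 uses 50 topics; 5 is used here only to keep the toy example small.
lda = LatentDirichletAllocation(n_components=5, random_state=0)
theta = lda.fit_transform(X)   # per-document topic proportions (theta_d); each row sums to ~1
phi = lda.components_          # per-topic word weights; normalize rows to obtain Phi_k

print(theta.round(3))          # each row is a feature vector analogous to a row of Table 7
```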
Figure 2. Example of email communication network.
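To illustrate how such a communication network can be turned into per-user features for the weekly e-mail communication history dataset, the sketch below uses networkx; the chosen measures (in/out degree and betweenness centrality) and the placeholder edge list are assumptions for illustration rather than the paper's exact feature set.

```python
# Sketch: per-user features from a weekly e-mail communication graph (illustrative measures).
import networkx as nx

# (sender, receiver) pairs extracted from one week of e-mail logs (placeholder data).
edges = [("CDE1846", "CMP2946"), ("CDE1846", "DNS1758"), ("HIS1706", "CDE1846")]

G = nx.DiGraph()
G.add_edges_from(edges)

betweenness = nx.betweenness_centrality(G)
features = {
    user: {
        "out_degree": G.out_degree(user),   # distinct colleagues this user e-mailed
        "in_degree": G.in_degree(user),     # distinct colleagues who e-mailed this user
        "betweenness": betweenness[user],
    }
    for user in G.nodes
}
print(features["CDE1846"])
```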
Figure 3. Insider-threat detection framework.
Figure 4. Four anomaly detection algorithms used in this paper. (a) Gaussian density estimation, (b) Parzen window density estimation (reprinted from Alpaydin (2014)), (c) Principal component analysis (PCA), and (d) K-means clustering (KMC).
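The sketch below shows one standard way to obtain an anomaly score (higher = more anomalous) from each of the four detectors in Figure 4, using scikit-learn and NumPy. It is an illustrative implementation under common default choices (Mahalanobis distance, Gaussian kernel density, reconstruction error, distance to the nearest centroid), not the authors' code.

```python
# Sketch: anomaly scores from the four detectors in Figure 4 (higher score = more anomalous).
import numpy as np
from sklearn.covariance import EmpiricalCovariance
from sklearn.neighbors import KernelDensity
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 10))   # normal behavior records (placeholder data)
X_test = rng.normal(size=(50, 10))

# (a) Gaussian density estimation: Mahalanobis distance to the training distribution.
gauss = EmpiricalCovariance().fit(X_train)
score_gauss = gauss.mahalanobis(X_test)

# (b) Parzen window density estimation: negative log-density under a kernel density estimate.
parzen = KernelDensity(bandwidth=1.0).fit(X_train)
score_parzen = -parzen.score_samples(X_test)

# (c) PCA: reconstruction error after projecting onto the leading components.
pca = PCA(n_components=3).fit(X_train)
recon = pca.inverse_transform(pca.transform(X_test))
score_pca = np.linalg.norm(X_test - recon, axis=1)

# (d) K-means clustering (KMC): distance to the nearest cluster centroid.
kmc = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train)
score_kmc = kmc.transform(X_test).min(axis=1)
```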
Table 1. Log records of logon activities.
Recorded Item | Description
ID | Primary key of an observation
Date | Day/Month/Year
User | User ID
Personal computer (PC) | Key number of a PC logged on
Activity | Log on or log off (Binary type)
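To illustrate how logon logs of this form can be rolled up into the per-user daily activity summary (variables such as numlogonDay and numlogonNight in Table 5), a small pandas sketch is given below; the 06:00-18:00 day window, the user ID, and the column names are assumptions for illustration.

```python
# Sketch: per-user daily logon counts split into day/night (the day/night cut-off is assumed).
import pandas as pd

logs = pd.DataFrame({
    "user": ["EE0123", "EE0123", "EE0123"],
    "date": pd.to_datetime(["2010-05-03 08:12", "2010-05-03 22:40", "2010-05-04 09:01"]),
    "activity": ["Logon", "Logon", "Logon"],
})

# Assume "day" means 06:00-18:00; anything outside that window counts as night.
logs["is_day"] = logs["date"].dt.hour.between(6, 17)
logons = logs[logs["activity"] == "Logon"]

daily = (logons.groupby(["user", logons["date"].dt.date, "is_day"])
               .size()
               .unstack(fill_value=0)
               .rename(columns={True: "numlogonDay", False: "numlogonNight"}))
print(daily)
```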
Table 2. Referred articles for variable selection.
Research Papers | Authors | Year
Supervised and unsupervised methods to detect insider threat from enterprise and online activity data | Gavai et al. [16] | 2015
Detecting insider threats using ben-ware: beneficial intelligent software for identifying anomalous human behavior | McGough et al. [21] | 2015
Multi-source fusion for anomaly detection: using across-domain and across-time peer-group consistency checks | Eldardiry et al. [10] | 2014
Use of domain knowledge to detect insider threats in computer activities | Young et al. [22] | 2013
Insider threat simulation and performance analysis of insider detection algorithms with role-based models | Nellikar [23] | 2010
Table 3. The number of anomalous records according to role.
Role | Number of Anomalous Records
Salesman | 32
IT Admin | 23
Electrical Engineer | 10
Computer Programmer | 3
Manager | 2
Director | 1
Production line worker | 1
Software developer | 1
Total | 73
Table 4. Frequency of records of three roles.
Role | Normal | Abnormal
Electrical Engineer | 141,199 | 10
IT Admin | 34,244 | 23
Salesman | 125,524 | 32
Table 5. Selected variables for each role.
Number | Electrical Engineer | IT Admin | Salesman
1 | numlogoffNight | numlogoffNight | numlogonDay
2 | numPClogoffNight | onoffNotsameDay | numPClogonDay
3 | numOpenOnPCNight | numConnectionDay | onoffNotsameNight
4 | numAttachmentNight | numCopyFromDeviceNight | numConnectionNight
5 | numSendNight | numWriteFromDeviceNight | numWriteFromDeviceNight
6 | numEmailSentwithAttachNight | numWebAccNight | numUploadNight
7 | numPClogonDay | numDownloadNight | numSendDay
8 | numOpenOnPCDay | numPClogoffNight | numlogonNight
9 | numAttachmentDay | numPCwithUSBDay | onoffNotsameDay
10 | numSendDay | numCopy2DeviceNight | numPCwithUSBNight
11 | numEmailSentwithAttachDay | numCopyFromDeviceNight | numWrite2DeviceDay
12 | Electrical engineer | numWebAccDay | numOpenOnPCDay
13 | numlogoffNight | numURLAccessedNight | numAttachmentDay
14 | numPClogoffNight | | numRecievedDay
Table 6. Log records of email activities.
Email | Description
ID | Primary key of an observation
Date | Day/Month/Year Hour:Min:Sec
User | Primary key of a user
PC | Primary key of a PC
To | Receiver
Cc | Carbon copy
Bcc | Blind carbon copy
From | Sender
Activity | Activity (Send/Receive)
Size | Size of an email
Attachments | Attachment file name
Content | Content of an email
Table 7. Quantified e-mail content examples with actual label (normal = 0 and abnormal = 1).
ID | Topic1 | Topic2 | ... | Topic50 | Target
(I1O2-B4EB49RW-7379WSQW) | 0.008 | 0.012 | ... | 0.154 | 1
(L7E7-V4UX89RR-3036ZDHU) | 0.021 | 0.008 | ... | 0.125 | 1
(S8C2-Q8YX87DJ-0516SIWZ) | 0.014 | 0.006 | ... | 0.145 | 0
(A1V9-O5BL46SW-1708NAEC) | 0.352 | 0.014 | ... | 0.086 | 0
(N6R0-M2EI82DM-5583LSUM) | 0.412 | 0.058 | ... | 0.285 | 0
(O2N1-C4ZZ85NQ-8332GEGR) | 0.085 | 0.421 | ... | 0.001 | 0
Table 8. Normal (0) and abnormal (1) e-mail count for the three roles.
Role | Normal | Abnormal
Salesman | 644,252 | 40
IT Admin | 170,765 | 15
Electrical Engineer | 694,050 | 13
Table 9. Normal (0) and abnormal (1) e-mails for each user.
User | Normal | Abnormal
CDE1846 | 456 | 9
CMP2946 | 474 | 7
DNS1758 | 498 | 3
HIS1706 | 493 | 3
Table 10. True detection rate of each anomaly detection algorithm based on the daily activity summary dataset (Electrical Engineer). Best performance is shown in bold and underlined (number of normal behaviors in the test set: 14,120, number of malicious behaviors in the test set: 10).
Anomaly Rank | Gauss | Parzen | PCA | KMC (K = 3) | KMC (K = 5) | KMC (K = 10) | Parzen + PCA
1% | 0.5000 | 0.4000 | 0.4933 | 0.5233 | 0.5333 | 0.5367 | 0.4833
5% | 0.6000 | 0.5000 | 0.6667 | 0.6400 | 0.6300 | 0.6333 | 0.7633
10% | 0.6167 | 0.7933 | 0.7467 | 0.7033 | 0.6467 | 0.6933 | 0.7933
15% | 0.7000 | 0.9000 | 0.7800 | 0.7167 | 0.6767 | 0.7333 | 0.8000
20% | 0.7000 | 0.9000 | 0.7900 | 0.7500 | 0.6967 | 0.7600 | 0.8167
25% | 0.7000 | 0.9000 | 0.8000 | 0.7767 | 0.7433 | 0.7767 | 0.8233
30% | 0.7000 | 0.9000 | 0.8033 | 0.7767 | 0.7700 | 0.7933 | 0.8500
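Tables 10 through 19 report the true detection rate when only the records ranked in the top 1-30% by anomaly score are inspected. The sketch below shows this evaluation under the assumption that the rate is the share of truly malicious records that fall inside the inspected top-p% slice; the toy data and the function name are illustrative.

```python
# Sketch: detection rate when inspecting the top p% most anomalous records
# (assumed definition: recall of malicious records within the inspected slice).
import numpy as np

def detection_rate(scores, labels, p):
    """scores: anomaly scores (higher = more anomalous); labels: 1 = malicious, 0 = normal."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    n_inspect = max(1, int(np.ceil(len(scores) * p)))
    inspected = np.argsort(-scores)[:n_inspect]      # indices of the top-p% scores
    return labels[inspected].sum() / labels.sum()    # share of malicious records caught

# Toy example with 3 malicious records among 100.
rng = np.random.default_rng(0)
scores = rng.random(100)
labels = np.zeros(100, dtype=int)
labels[[3, 40, 77]] = 1     # three malicious records
scores[3] = 2.0             # make one of them clearly the most anomalous
print(detection_rate(scores, labels, 0.10))  # at least 1/3, since record 3 ranks first
```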
Table 11. True detection rate of each anomaly detection algorithm based on the daily activity summary dataset (IT Admin). Best performance is shown in bold and underlined (number of normal behaviors in the test set: 3424, number of malicious behaviors in the test set: 23).
Anomaly Rank | Gauss | Parzen | PCA | KMC (K = 3) | KMC (K = 5) | KMC (K = 10) | Parzen + PCA
1% | 0.0435 | 0.0478 | 0.0739 | 0.0580 | 0.0521 | 0.0522 | 0.0971
5% | 0.0435 | 0.1739 | 0.2130 | 0.0841 | 0.0739 | 0.0768 | 0.2174
10% | 0.0435 | 0.3015 | 0.2304 | 0.1246 | 0.1087 | 0.1174 | 0.2580
15% | 0.0971 | 0.3043 | 0.2884 | 0.1391 | 0.1275 | 0.1362 | 0.2913
20% | 0.1594 | 0.3043 | 0.3348 | 0.2333 | 0.2000 | 0.2043 | 0.3246
25% | 0.1739 | 0.3043 | 0.3681 | 0.3029 | 0.2681 | 0.2797 | 0.3551
30% | 0.2609 | 0.3043 | 0.4087 | 0.3493 | 0.3304 | 0.3406 | 0.3928
Table 12. True detection rate of each anomaly detection algorithm based on the daily activity summary dataset (Salesman). Best performance is shown in bold and underlined (number of normal behaviors in the test set: 14,120, number of malicious behaviors in the test set: 32).
Anomaly Rank | Gauss | Parzen | PCA | KMC (K = 3) | KMC (K = 5) | KMC (K = 10) | Parzen + PCA
1% | 0.0093 | 0.1177 | 0.0781 | 0.0375 | 0.0396 | 0.0281 | 0.1021
5% | 0.0313 | 0.3271 | 0.3375 | 0.1083 | 0.0843 | 0.0802 | 0.3406
10% | 0.0313 | 0.5677 | 0.5458 | 0.1396 | 0.1125 | 0.1135 | 0.6156
15% | 0.6563 | 0.5844 | 0.6625 | 0.2604 | 0.1969 | 0.2115 | 0.7958
20% | 0.6563 | 0.7781 | 0.7177 | 0.2938 | 0.2427 | 0.2416 | 0.8646
25% | 0.6563 | 0.8396 | 0.7677 | 0.3240 | 0.2854 | 0.2802 | 0.9041
30% | 0.6563 | 0.8719 | 0.8042 | 0.3927 | 0.3260 | 0.3219 | 0.9479
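The Parzen + PCA column reports a combination of the two individual detectors. The sketch below shows one plausible combination rule, averaging min-max normalized anomaly scores before ranking; this rule is an assumption for illustration and may differ from the combination actually used in the paper.

```python
# Sketch: a simple ensemble of Parzen and PCA anomaly scores
# (min-max normalization followed by averaging is an assumed combination rule).
import numpy as np

def minmax(s):
    s = np.asarray(s, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def combine(score_parzen, score_pca):
    """Average the normalized scores so both detectors contribute on the same scale."""
    return 0.5 * (minmax(score_parzen) + minmax(score_pca))

score_parzen = np.array([1.2, 8.5, 0.7, 3.3])
score_pca = np.array([0.4, 0.9, 5.1, 0.2])
combined = combine(score_parzen, score_pca)
print(np.argsort(-combined))  # records ranked from most to least anomalous
```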
Table 13. True detection rate of each anomaly detection algorithm based on the e-mail contents dataset (Electrical Engineer). Best performance is shown in bold and underlined (number of normal behaviors in the test set: 69,405, number of malicious behaviors in the test set: 13).
Anomaly Rank | Gauss | Parzen | PCA | KMC (K = 3) | KMC (K = 5) | KMC (K = 10) | Parzen + PCA
1% | 0.0000 | 0.0000 | 0.0000 | 0.0416 | 0.0458 | 0.0442 | 0.0000
5% | 0.0000 | 0.0500 | 0.0016 | 0.1383 | 0.1358 | 0.1433 | 0.0008
10% | 0.0750 | 0.0717 | 0.0392 | 0.3200 | 0.2717 | 0.2783 | 0.0200
15% | 0.0750 | 0.0742 | 0.0417 | 0.4575 | 0.4242 | 0.4025 | 0.0500
20% | 0.1016 | 0.0791 | 0.0792 | 0.5317 | 0.5133 | 0.4692 | 0.0608
25% | 0.1725 | 0.0800 | 0.1592 | 0.6008 | 0.5792 | 0.5525 | 0.0808
30% | 0.5608 | 0.0800 | 0.2908 | 0.6675 | 0.6542 | 0.6242 | 0.1050
Table 14. True detection rate of each anomaly detection algorithm based on the e-mail contents dataset (IT Admin). Best performance is shown in bold and underlined (number of normal behaviors in the test set: 17,077, number of malicious behaviors in the test set: 15).
Anomaly Rank | Gauss | Parzen | PCA | KMC (K = 3) | KMC (K = 5) | KMC (K = 10) | Parzen + PCA
1% | 0.0666 | 0.0000 | 0.0600 | 0.0000 | 0.0000 | 0.0000 | 0.3756
5% | 0.2022 | 0.3178 | 0.2867 | 0.0000 | 0.0000 | 0.0000 | 0.6867
10% | 0.3333 | 0.7533 | 0.4178 | 0.0000 | 0.0000 | 0.0000 | 0.8333
15% | 0.5156 | 0.8622 | 0.5867 | 0.0000 | 0.0000 | 0.0000 | 0.8667
20% | 0.6667 | 0.9333 | 0.6867 | 0.0000 | 0.0000 | 0.0000 | 0.9044
25% | 0.7378 | 0.9356 | 0.8222 | 0.0000 | 0.0000 | 0.0000 | 0.9600
30% | 0.8667 | 0.9356 | 0.9067 | 0.0000 | 0.0000 | 0.0000 | 0.9867
Table 15. True detection rate of each anomaly detection algorithm based on the e-mail contents dataset (Salesman). Best performance is shown in bold and underlined (number of normal behaviors in the test set: 64,425, number of malicious behaviors in the test set: 40).
Anomaly Rank | Gauss | Parzen | PCA | KMC (K = 3) | KMC (K = 5) | KMC (K = 10) | Parzen + PCA
1% | 0.0000 | 0.0000 | 0.0000 | 0.0376 | 0.0436 | 0.0487 | 0.0000
5% | 0.0000 | 0.1949 | 0.0000 | 0.1154 | 0.1128 | 0.1359 | 0.0359
10% | 0.0000 | 0.2359 | 0.0000 | 0.1692 | 0.1487 | 0.1385 | 0.1436
15% | 0.0000 | 0.2769 | 0.0000 | 0.2154 | 0.2282 | 0.1974 | 0.2410
20% | 0.0000 | 0.3026 | 0.0359 | 0.3205 | 0.3205 | 0.3128 | 0.2846
25% | 0.0000 | 0.3051 | 0.1692 | 0.5180 | 0.5410 | 0.5333 | 0.3103
30% | 0.2872 | 0.3103 | 0.4077 | 0.6128 | 0.6487 | 0.6564 | 0.3282
Table 16. True detection rate of each anomaly detection algorithm based on the e-mail communication network dataset (CDE1846). Best performance is shown in bold and underlined (number of normal behaviors in the test set: 456, number of malicious behaviors in the test set: 9).
Anomaly Rank | Gauss | Parzen | PCA | KMC (K = 3) | KMC (K = 5) | KMC (K = 10) | Parzen + PCA
1% | 0.4444 | 0.1111 | 0.1111 | 0.3333 | 0.0000 | 0.0000 | 0.1111
5% | 1.0000 | 0.1111 | 0.4444 | 0.4444 | 0.0000 | 0.0000 | 0.1111
10% | 1.0000 | 0.1111 | 0.5556 | 0.5556 | 0.0000 | 0.0000 | 0.4444
15% | 1.0000 | 0.1111 | 0.7778 | 0.5556 | 0.0000 | 0.0000 | 0.4444
20% | 1.0000 | 0.2222 | 0.7778 | 0.5556 | 0.1111 | 0.1111 | 0.5556
25% | 1.0000 | 0.2222 | 0.7778 | 0.5556 | 0.1111 | 0.1111 | 0.5556
30% | 1.0000 | 0.2222 | 0.7778 | 0.5556 | 0.2222 | 0.1111 | 0.6667
Table 17. True detection rate of each anomaly detection algorithm based on the e-mail communication network dataset (CMP2946). Best performance is shown in bold and underlined (number of normal behaviors in the test set: 474, number of malicious behaviors in the test set: 7).
Anomaly Rank | Gauss | Parzen | PCA | KMC (K = 3) | KMC (K = 5) | KMC (K = 10) | Parzen + PCA
1% | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
5% | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
10% | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
15% | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000
20% | 0.3333 | 0.0000 | 0.3333 | 0.0000 | 0.0000 | 0.0000 | 0.0000
25% | 0.6667 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.3333
30% | 0.6667 | 0.0000 | 1.0000 | 0.0000 | 0.0000 | 0.0000 | 0.3333
Table 18. True detection rate of each anomaly detection algorithm based on the e-mail communication network dataset (DNS1758). Best performance is shown in bold and underlined (number of normal behaviors in the test set: 498, number of malicious behaviors in the test set: 3).
Anomaly Rank | Gauss | Parzen | PCA | KMC (K = 3) | KMC (K = 5) | KMC (K = 10) | Parzen + PCA
1% | 0.0000 | 0.0000 | 0.0000 | 0.3333 | 0.3333 | 0.3333 | 0.0000
5% | 0.3333 | 0.0000 | 0.0000 | 0.6667 | 0.6667 | 0.3333 | 0.0000
10% | 0.3333 | 0.0000 | 0.3333 | 1.0000 | 1.0000 | 0.3333 | 0.0000
15% | 0.6667 | 0.3333 | 0.6667 | 1.0000 | 1.0000 | 0.3333 | 0.0000
20% | 0.6667 | 0.3333 | 0.6667 | 1.0000 | 1.0000 | 0.3333 | 0.3333
25% | 0.6667 | 0.3333 | 0.6667 | 1.0000 | 1.0000 | 0.3333 | 0.3333
30% | 0.6667 | 0.3333 | 0.6667 | 1.0000 | 1.0000 | 0.6667 | 0.3333
Table 19. True detection rate of each anomaly detection algorithm based on the e-mail communication network dataset (HIS1706). Best performance is shown in bold and underlined (number of normal behaviors in the test set: 493, number of malicious behaviors in the test set: 3).
Anomaly Rank | Gauss | Parzen | PCA | KMC (K = 3) | KMC (K = 5) | KMC (K = 10) | Parzen + PCA
1% | 0.0000 | 0.0000 | 0.0000 | 0.3333 | 0.3333 | 0.3333 | 0.0000
5% | 0.3333 | 0.0000 | 0.0000 | 0.6667 | 0.6667 | 0.3333 | 0.0000
10% | 0.3333 | 0.0000 | 0.3333 | 1.0000 | 1.0000 | 0.3333 | 0.0000
15% | 0.6667 | 0.3333 | 0.6667 | 1.0000 | 1.0000 | 0.3333 | 0.0000
20% | 0.6667 | 0.3333 | 0.6667 | 1.0000 | 1.0000 | 0.3333 | 0.3333
25% | 0.6667 | 0.3333 | 0.6667 | 1.0000 | 1.0000 | 0.3333 | 0.3333
30% | 0.6667 | 0.3333 | 0.6667 | 1.0000 | 1.0000 | 0.6667 | 0.3333
