Ⅰ. Introduction
In recent years, advanced technologies at the core of the Fourth Industrial Revolution, such as big data, artificial intelligence (AI), the Internet of Things (IoT), cloud computing, and 5G network communications, have been widely introduced and made available. These advanced technologies are closely connected to our society and have a profound impact on our daily lives and various industries. However, as these advanced technologies evolve, potential cybersecurity threats also increase.
The issue of cybersecurity threats is becoming more prominent not only domestically but also internationally. In the 2024 Verizon Data Breach Investigations Report (DBIR)[1], researchers found that 68% of data breaches are linked to non-malicious human factors. Additionally, Proofpoint's “2024 Voice of the CISO”[2] report, which surveyed 1,600 chief information security officers (CISOs) worldwide, found that 70% of CISOs believe their organizations are at risk of suffering a significant cybersecurity attack within the next year. Among the various cybersecurity threats, CISOs expect human error to be the most significant cybersecurity vulnerability, as insider threats and human-caused data loss continue to rise. In 2024, 74% of CISOs agreed that human error is a serious cybersecurity threat, up from 60% in 2023 and 56% in 2022, an increase of 18 percentage points over two years. Additionally, 80% of CISOs recognize cybersecurity threats caused by human error, such as data loss due to employee carelessness, as a significant issue requiring attention over the next two years. This is a 17-percentage-point increase from the 63% reported in 2023, suggesting that insider threats can occur regardless of malicious intent. This means that organizations must have countermeasures in place against insider threats alongside technical defenses against external cybersecurity threats. In other words, insider threats are just as important a security risk as external attacks.
Businesses and public institutions are working to strengthen their cyber defenses to respond to and prevent cybersecurity threat incidents. These efforts include protecting sensitive information through data encryption and access control, improving internal security systems by introducing solutions from external specialized companies, and establishing continuous monitoring systems for real-time threat detection. As the importance of cybersecurity is increasing due to the diversification of attack types such as ransomware, insider threats, and DDoS attacks, each organization needs to establish a systematic and effective advanced security system to respond to cybersecurity threats.
In general, traditional security software, such as firewalls and intrusion detection systems, can effectively detect and respond to well-known cyberattack patterns. However, these systems have limitations in identifying and preventing new types of advanced cybersecurity threats that are constantly evolving. In particular, the increasing number of data security incidents caused by human error, such as data leaks, mistakes, and negligence by insiders, shows that cybersecurity threats come not only from external factors but also from internal factors. Organizations are currently making great efforts to strengthen cybersecurity with a variety of security systems. However, existing security systems alone are not enough to respond effectively to newly modified attack types or potential internal threats in addition to existing, well-known cyber threats. Alongside existing security systems, a new cybersecurity threat detection system is needed that can respond to new types of attacks and potential internal threats.
In this paper, we aim to implement a user behavior analysis-based anomaly detection system for detecting abnormal user behavior in a specific domain. For this purpose, we performed exploratory data analysis (EDA) on a user behavior dataset to extract user behavior features and define user behavior feature elements. In the process, we proposed a new token combination method for anomaly detection and defined user behavior patterns as “sessions”. Next, we vectorized the session data into input sequences after simple preprocessing. We then adopted the BERT model architecture as the underlying model for our anomaly detection system. During pre-training, we used only the vectorized normal session data. Because the primary goal of this study is the implementation of the anomaly detection system pipeline, we added a simple binary classifier during the fine-tuning and system evaluation phases to evaluate performance.
We aim to explore a practical methodology for implementing an anomaly detection system, with explicit emphasis on analyzing the characteristics of user behavior data and investigating optimal model architectures for this domain. To achieve this goal, we propose a comprehensive approach addressing two main aspects of system implementation. First, we conduct an in-depth analysis of user behavior characteristics to identify the essential features to consider in anomaly detection. As part of our feature engineering approach, we propose a novel feature factor combination method to address the challenge of processing large-scale chunk data into BERT-compatible input. This method enables effective tokenization of user behavior information while preserving meaningful patterns within the data. Second, we conduct extensive ablation case studies on the BERT model to determine the most efficient configuration for anomaly detection. This includes exploring various hyperparameters for the pre-training and transfer learning phases to optimize the learning process for our specific input data characteristics. We systematically investigate different BERT model architectures to identify the optimal model size that balances detection performance with computational efficiency.
This paper is organized as follows. Chapter 2 reviews existing related work, and Chapter 3 details the structure of our proposed user behavior analysis-based anomaly detection system. Chapter 4 describes the experimental results of the implemented user anomaly detection system, and finally, Chapter 5 discusses conclusions and future work.
Ⅱ. Related Works
Anomaly detection techniques[3,4] define patterns considered “normal” and identify outliers or abnormal patterns in data that deviate from them. They distinguish normal samples from abnormal samples (outliers, anomalies) to find data within a dataset whose characteristics differ from other observations. Anomaly detection techniques must now go beyond applying predefined rules: they must automatically learn new types of anomalies, for example in real-time anomaly detection on data streams and pattern recognition in complex multivariate data. As technology evolves, the research and application of anomaly detection techniques are expanding from statistical methods to machine learning[5,6] and deep learning[7,8] approaches.
One-class support vector machine (OCSVM)[9] is an unsupervised variant of the supervised SVM. In anomaly detection, OCSVM[10-13] is trained by mapping normal data into a high-dimensional space and finding the minimum boundary that encloses the normal data. New data that falls outside the boundary is detected as an anomaly.
A convolutional neural network (CNN)[14] is a deep learning model that uses convolutional layers to extract and learn image features. In anomaly detection, a CNN[15-17] learns features from normal data, and if the features extracted from new input data differ from those learned during training, the input is detected as an anomaly.
Long Short-Term Memory (LSTM)[18] is a recurrent neural network (RNN) model that specializes in processing time series data. In anomaly detection, an LSTM[19-22] learns normal time series patterns. Based on the learned patterns, the LSTM predicts the value at the next point in time and calculates the error between the predicted value and the actual observed value. If the error exceeds a threshold, the data is detected as an anomaly.
An Auto-Encoder (AE)[23] is a neural network-based unsupervised learning model with an encoder-decoder structure. It is mainly used for data compression and feature extraction. The encoder compresses the input data into a low-dimensional representation, which resides in the AE's middle layer (latent space). The decoder reconstructs the low-dimensional representation back into the original dimensions. In this process, the AE learns the main features of the data. In anomaly detection, the AE[24-26] is trained on normal data. The trained AE then calculates the error that occurs when reconstructing the input data, and if the reconstruction error is large, the input is detected as an anomaly.
A Variational Auto-Encoder (VAE)[27] is an unsupervised learning model with an encoder-decoder structure that learns the probability distribution of data. The encoder compresses the input data into a probability distribution defined by a mean and variance. The decoder reconstructs samples drawn from the latent distribution back into the original dimensions. The VAE represents the latent variables as probability distributions and uses the reconstruction error and the Kullback–Leibler (KL) divergence as its loss function. In anomaly detection, a VAE[28-31] trained on normal data calculates the reconstruction error on the input data. If the reconstruction error is large, the input is detected as an anomaly.
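For reference, the two loss terms mentioned above are commonly written as follows. This is the standard VAE objective, given as a sketch rather than the exact formulation used in the cited works; the symbols $q_\phi$ (encoder), $p_\theta$ (decoder), and $p(z)$ (latent prior, typically a standard normal) are notation introduced here for illustration.

```latex
\mathcal{L}(x) =
  \underbrace{-\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]}_{\text{reconstruction error}}
  \;+\;
  \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right)}_{\text{KL divergence}}
```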
In recent years, transformer-based models[32] such as BERT have demonstrated outstanding performance in natural language processing, and attempts have been made to utilize them in anomaly detection. BERT's[33] bidirectional encoding structure and self-attention mechanism can effectively capture long-term dependencies and complex patterns in sequence data and have shown remarkable results in anomaly detection tasks.
Dang et al. (2021)[34] proposed a BERT-based model, TS-BERT, for anomaly detection in time series data. TS-BERT was developed to handle the long-term dependencies of time series data effectively and to mitigate the lack of labeled data. However, due to the BERT structure, TS-BERT suffered from increased computational complexity in mapping low-dimensional data to a high-dimensional space and from long training times. In addition, label generation using the spectral residual method made fully unsupervised learning impossible and limited the model's ability to fully reflect the unique features of time series data.
Guo et al. (2021)[35] proposed a BERT-based model, LogBERT, for anomaly detection in log data. LogBERT was designed to detect log data anomalies by performing pre-training with Masked Log Key Prediction (MLKP) and fine-tuning for anomaly detection. However, LogBERT suffered from high computational complexity and increased processing time due to the BERT structure, performance deviations depending on the features of the pre-training data, and limitations in interpreting the results of deep learning models.
Figure 1. Architecture of the anomaly detection system based on user behavior analysis
Tang and Guan (2024)[36] proposed a BERT-based model, SD-BERT, to detect anomalies in system log data. SD-BERT was designed to capture the global context and local features of log sequences effectively by introducing a Separated Score Attention (SSA) mechanism and a dual branching module. However, SD-BERT had several limitations: it was trained only on normal log sequences, which made it difficult to detect new types of anomalies; its model complexity limited real-time processing; and SSA and the dual branching module were optimized for specific datasets, which limited generalization.
Ⅲ. Method
The structure of the user anomaly detection system based on user behavior analysis is shown in Figure 1, which consists of (1) exploratory analysis of user behavior dataset (Dataset EDA), (2) behavior feature extraction (Feature Extraction/Engineering), (3) feature preprocessing and vectorization, (4) pre-training and fine-tuning, and (5) anomaly detection system evaluation.
First, Dataset EDA performs a preliminary analysis of user activity log files. Next, Feature Extraction/Engineering extracts features describing user behavior and generates feature vectors through feature factor combination. Then, Feature Preprocessing & Vectorization creates sessions from the preprocessed and vectorized user behavior patterns. The created sessions are used as input for the BERT model. Next, pre-training is performed on normal sessions only, and fine-tuning is performed on the anomaly detection task. Finally, the anomaly detection system with pre-trained and fine-tuned weights is evaluated using separate test data.
3.1 Dataset
The Computer Emergency Response Team (CERT)[37] dataset is an insider threat behavior dataset provided by the Software Engineering Institute (SEI) at Carnegie Mellon University in the United States. It contains user activity logs and profiles spanning 18 months and is widely used in the cybersecurity field for analyzing user behavior patterns and evaluating the performance of anomaly detection algorithms. The anomaly detection system implemented in this paper uses the CERT r6.2 dataset.
CERT r6.2 is a large dataset of about 93 GB. It includes 4,000 users, of which only 5 are malicious insiders, a tiny percentage. There are 5 malicious insider scenarios in total, one for each malicious insider. CERT r6.2 includes eight files, among them a Lightweight Directory Access Protocol (LDAP) file containing user information and five activity log files (logon.csv, file.csv, device.csv, http.csv, and email.csv) related to user behavior. The dataset is highly imbalanced, with 135,117,169 total activity logs for all users but only 470 malicious activity logs from insiders.
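To make the degree of imbalance concrete, the short calculation below derives the malicious share from the counts quoted above. It is only an arithmetic illustration; the variable names are chosen here and are not part of the dataset.

```python
# Counts quoted in the text for CERT r6.2.
TOTAL_LOGS = 135_117_169
MALICIOUS_LOGS = 470
TOTAL_USERS = 4_000
MALICIOUS_USERS = 5

malicious_log_share = MALICIOUS_LOGS / TOTAL_LOGS
logs_per_malicious_log = TOTAL_LOGS / MALICIOUS_LOGS
malicious_user_share = MALICIOUS_USERS / TOTAL_USERS

print(f"malicious log share : {malicious_log_share:.6%}")
print(f"one malicious log per {logs_per_malicious_log:,.0f} logs")
print(f"malicious user share: {malicious_user_share:.3%}")
```

At this ratio, a detector that labels every log as normal would score near-perfect accuracy, which is why accuracy alone is not a sufficient metric for this dataset.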
Table 1 describes the overall basic statistics of the CERT r6.2 dataset, and Table 2 describes the malicious insider scenarios in the dataset. Table 3 describes the files in the dataset, and Table 4 summarizes the key fields in the activity log files in the dataset.
Table 1. The statistics of the CERT r6.2 dataset
Table 2. Malicious insider scenarios in the CERT r6.2 dataset
Table 3. CERT r6.2 dataset file information
Table 4. CERT r6.2 dataset activity log files
3.2 User Behavior Dataset EDA
Exploratory data analysis (EDA)[38,39] is an analysis method performed in the early stages of data analysis. The goal of EDA is to understand the structure and features of the data. In the process, relationships between variables are explored, and patterns in the data are discovered. EDA provides insights for the direction of later analysis and for effective modeling, and analysis of user activity logs is required to detect user anomalies effectively. In this paper, we performed EDA on the raw activity log files ('Logon.csv', 'File.csv', 'Device.csv', 'Http.csv', and 'Email.csv') for all users in the CERT r6.2 dataset to identify user behavior types and behavior patterns.
After EDA, we summarized the key features to consider in this dataset as follows:
(a) Every user has records in 'Logon.csv', which corresponds to the user's attendance (commute) record, but records in the other activity log files may or may not exist.
(b) Users' behavior patterns are represented by time series data, with periodicity in the form of logon-logoff.
(c) Each user's normal working days and hours differ depending on their job.
(d) Each user is assigned a computer, but they occasionally use a public computer to access other users' computers.
(e) If a user's job is ITAdmin, they can freely access other users' computers as a system administrator.
(f) As a system administrator, an ITAdmin is likely to have many entries in their activity logs for accessing other users' computers.
As seen above, the absence of a specific behavior cannot be considered abnormal because of cases such as (a). For example, a user may never have opened a file; if the lack of file-related behavior were considered an anomaly, it would produce a false positive. From (c), we can see that normal behavior patterns differ for each occupation. From (e) and (f), we can see that the ITAdmin occupation has a unique behavior pattern, unlike other occupations. To avoid detecting the differing behavior patterns of various users as abnormal behavior, we need to define the behavior patterns of users, and for this, we need feature factors that determine the behavioral characteristics of users.
3.3 Behavior Feature Extraction
For a pre-trained model to properly understand and learn user behavior patterns, it is important to provide user behavior data as input to the model using meaningful feature factors. In this paper, we extract meaningful feature fields from the user behavior data and define a feature factor as a factor that determines a feature. By defining meaningful feature factors, we can reduce the probability of false positives, that is, mistakenly recognizing normal behavior as abnormal behavior. In addition, we can expect to improve the performance of the pre-trained model. Table 5 describes the fields extracted by EDA on the user behavior data. Table 6 shows the user behavior patterns with logon-logoff periodicity defined as a result of EDA.
Table 5. Fields extracted from user behavior data EDA
Table 6. Behavioral pattern from user behavior data EDA
3.3.1 Associated Field Extraction
The user behavior patterns extracted by the EDA above are shown in Tables 5 and 6. Five activity log files ('Logon.csv', 'File.csv', 'Device.csv', 'Email.csv', 'Http.csv') are associated with user behavior in the experimental data. However, some users only have activity records in 'Logon.csv'; that is, they do not perform any activities except logon and logoff. Considering this situation, we extracted only the 'Date', 'PC', and 'Activity' fields, which are common to the five activity log files, as feature factor fields associated with user behavior data. Based on user information, we defined feature factors that carry unique significance in anomalous behavior detection. Specifically, we selected three core feature factors from the CERT dataset: time, location, and behavior information, as described in Table 7. The 'Date' field represents the 'Time' feature factor indicating when an action occurred, the 'PC' field represents the 'Location' feature factor indicating where an action took place, and the 'Activity' field represents the 'Behavior' feature factor indicating what action occurred.
Table 7. Fields associated with user behavior feature
3.3.2 Feature Factor Combination
In this section, we propose a feature factor combination method to detect user anomalies effectively. A token with a single piece of information may not sufficiently represent a user's complex behavior patterns. For example, the 'Behavior' feature factor alone does not capture the time or location context of the behavior. Likewise, the 'Time' feature factor alone does not reveal which behavior occurred at a specific time of day. In this paper, we use a combination of feature factors to generate meaningful tokens optimized for anomaly detection, while exploring effective preprocessing methods to provide optimal input tokens to the model. Figure 2 shows how the input data of the model is created by the feature factor combination method.
In Step 1, we extract elements from the ‘Date’ field to be used as ‘Time’ feature factors. First, the user behavior patterns for the ‘Logon-Logoff’ period are sorted by occurrence time. We then extract only 'HH:mm' from the 'Date' field and subdivide it into the 'Hour', '1-minute', '10-minute', '15-minute', and '30-minute' fields, as shown in Table 8.
Table 8. Description of elements extracted for use as time feature factors
We defined the field values for the ‘Time’ feature factor elements as follows:
· The 'Hour' field uses a 24-hour format with integer values between 00 and 23.
· The '1-minute', '10-minute', '15-minute', and '30-minute' fields take integer values between 00 and K-1, where K is the quotient of 60 minutes divided by N minutes and N is one of the values (1, 10, 15, 30) that define the 'Time' feature factor elements.
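The field definitions above can be sketched in a few lines. This is a minimal illustration, assuming that each N-minute field is the index of the N-minute bin within the hour (minute // N) and that timestamps look like "MM/DD/YYYY HH:MM:SS"; the function name, the dictionary keys, and the default format string are hypothetical.

```python
from datetime import datetime

BIN_WIDTHS = (1, 10, 15, 30)  # N values named in the text

def time_feature_elements(date_str, fmt="%m/%d/%Y %H:%M:%S"):
    """Split a 'Date' value into the 'Hour' field and the four N-minute fields.

    The default format string is an assumption about how timestamps are stored.
    """
    ts = datetime.strptime(date_str, fmt)
    elements = {"Hour": f"{ts.hour:02d}"}
    for n in BIN_WIDTHS:
        # Index of the N-minute bin within the hour: 0 .. (60 // n) - 1.
        elements[f"{n}-minute"] = f"{ts.minute // n:02d}"
    return elements

example = time_feature_elements("01/04/2010 07:21:06")
print(example)
```

With this reading, K is 60, 6, 4, and 2 for N = 1, 10, 15, and 30, so coarser bins trade time resolution for a smaller token vocabulary.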
In Step 2, we describe the process of constructing a model input sequence using the four feature factor combination methods. A model input sequence consists of a ‘Location’ feature factor token followed by ('Behavior' feature factor + 'Time' feature factor) tokens. The 'Location' feature factor token corresponds to the value of the 'PC' field on the same row as the 'Logon' action in the 'Activity' field. The 'Behavior' feature factor token corresponds to the value of the 'Activity' field, and the 'Time' feature factor token corresponds to the combined value of the 'Time' feature factor elements, that is, the ('Hour', '1-minute', '10-minute', '15-minute', '30-minute') fields created in Step 1. We propose four feature factor combination methods to provide ('Behavior' feature factor + 'Time' feature factor) as the follow-up tokens.
The following example explains how to combine the 'Behavior' feature factor token + 'Time' feature factor token:
1. Extract the ‘Activity’ field value (‘Behavior’ feature factor) and the ‘Hour’, ‘1-minute’, ‘10-minute’, ‘15-minute’, and ‘30-minute’ field values (‘Time’ feature factor elements) from the user's behavior pattern during the ‘Logon-Logoff’ period.
2. Create one ‘Time’ feature factor by combining the ‘Hour’ field with one of the (‘1-minute’, ‘10-minute’, ‘15-minute’, ‘30-minute’) fields.
3. Combine the ‘Behavior’ feature factor with each ‘Time’ feature factor to create a new token.
4. Append the created tokens after the ‘Location’ feature factor token to form a model input sequence.
This process constructs a new model input sequence that contains the user's ‘Time’, ‘Location’, and ‘Behavior’ information. We define a model input sequence created by this feature factor combination method as a ‘Session’.
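The four steps above can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes the four combination methods correspond to pairing the hour with the 1-, 10-, 15-, or 30-minute field, and the record layout, the `build_session` name, the PC identifier, and the underscore-joined token format are all hypothetical.

```python
def build_session(events, n_minutes):
    """Build one 'Session' token sequence from a logon-logoff period.

    events: list of dicts with 'date' ("HH:mm"), 'pc', and 'activity' keys,
            covering one logon-logoff period (hypothetical record layout).
    n_minutes: bin width of the minute field combined with the hour (1, 10, 15, or 30).
    """
    # Sort by occurrence time; zero-padded "HH:mm" strings sort chronologically.
    ordered = sorted(events, key=lambda e: e["date"])

    # 'Location' token: the PC on the same row as the 'Logon' action.
    logon = next(e for e in ordered if e["activity"] == "Logon")
    tokens = [logon["pc"]]

    for e in ordered:
        hour, minute = (int(part) for part in e["date"].split(":"))
        time_factor = f"{hour:02d}{minute // n_minutes:02d}"
        # ('Behavior' + 'Time') token; the underscore separator is illustrative.
        tokens.append(f"{e['activity']}_{time_factor}")
    return tokens

period = [
    {"date": "07:21", "pc": "PC-1234", "activity": "Logon"},
    {"date": "09:47", "pc": "PC-1234", "activity": "Connect"},
    {"date": "17:05", "pc": "PC-1234", "activity": "Logoff"},
]
for n in (1, 10, 15, 30):
    print(n, build_session(period, n))
```

Each value of `n_minutes` yields a different session for the same period, which is how one logon-logoff period can produce the four session variants compared in the experiments.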
In Step 3, we provide an example of building session data using each feature factor combination method. Session data 1, 2, 3, and 4 in Step 3 are the results of applying methods 1, 2, 3, and 4 in Step 2. The session data created by the feature factor combination method then goes through the feature preprocessing and vectorization processes to be used as model input data.
3.4 Feature Preprocessing and Vectorization
Figure 3 illustrates the steps involved in feature preprocessing and vectorization to prepare session data for input into the model.
Figure 2. Feature factor combination method
Figure 3. Feature preprocessing and vectorization
In Step 1, we prepare the user behavior session data for input to the BERT model and initialize the BERT tokenizer. To tokenize the session data, we use the ‘WordPiece’ tokenizer[40], a subword tokenizer similar to byte-pair encoding (BPE), provided by the HuggingFace Transformers library API.
In Step 2, we apply ‘WordPiece’ tokenization to the input session data. ‘WordPiece’ tokenization is a method of splitting words into smaller sub-words. The BERT model uses special tokens ([CLS], [SEP], etc.) to identify sentence structures and prefixes. The '[CLS]' (classification) token summarizes or categorizes information in an entire sequence, and the '[SEP]' (separator) token identifies boundaries between sentences or segments.
In Step 3, we convert the tokenized input session data into 'Input IDs'. Next, we apply a padding technique to the input IDs and generate an 'Attention Mask' and 'Token Type IDs'. 'Input IDs' result from converting each token in the tokenized input session data into a unique integer. We use both dynamic and static padding techniques. Dynamic padding pads variable-length 'Input IDs' to the length of the longest 'Input IDs' in the batch. Static padding pads all 'Input IDs' to a pre-defined maximum length. Next, we create an 'Attention Mask' for the 'Input IDs'. The 'Attention Mask' separates real tokens (1) from padding tokens (0) in the 'Input IDs'. Finally, we create 'Token Type IDs' for the 'Input IDs'. The 'Token Type IDs' value is 0 for single sentences. For sentence pairs, the first sentence has a 'Token Type IDs' value of 0, and the second sentence has a value of 1. We utilize 'Token Type IDs' to prevent two logon-logoff periods from being mixed within one input session.
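The difference between dynamic and static padding, and the resulting 'Attention Mask' and 'Token Type IDs', can be illustrated without any library. The sketch below is plain Python rather than the tokenizer API used in the paper; the padding id 0, the example ids, and the helper names are assumptions made for illustration.

```python
PAD_ID = 0  # assumed [PAD] id; the real value comes from the tokenizer vocabulary

def pad_batch(batch_input_ids, max_length=None):
    """Pad a batch of 'Input IDs' and build the matching 'Attention Mask'.

    max_length=None -> dynamic padding (pad to the longest sequence in the batch)
    max_length=int  -> static padding (pad or truncate to a fixed length)
    """
    if max_length is None:
        target = max(len(ids) for ids in batch_input_ids)
    else:
        target = max_length
    input_ids, attention_mask = [], []
    for ids in batch_input_ids:
        ids = ids[:target]  # naive truncation; only has an effect with static padding
        n_pad = target - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)
        attention_mask.append([1] * len(ids) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

def token_type_ids(len_first, len_second=0):
    """0 for the first segment, 1 for the second (all zeros for a single segment)."""
    return [0] * len_first + [1] * len_second

# Illustrative ids only; real ids depend on the vocabulary.
batch = [[101, 7, 8, 102], [101, 7, 102]]
print(pad_batch(batch))                # dynamic: both rows padded to length 4
print(pad_batch(batch, max_length=6))  # static: both rows padded to length 6
print(token_type_ids(4, 3))
```

Dynamic padding avoids spending computation on padding tokens in batches of short sessions, while static padding gives every batch the same tensor shape.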
In Step 4, we create three types of embeddings ('Token', 'Position', and 'Segment' embeddings) and sum them to make a single input representation vector. Token embeddings and Position embeddings use ‘Input IDs’, and Segment embeddings use ‘Token Type IDs’. Token embeddings represent the meaning of each word or subword (token) in a vector space. Token embeddings capture the semantic characteristics of each token and allow semantic relationships to be learned by representing words with similar meanings with similar vectors. Position embeddings provide information about the position of each token in the sequence. Position embeddings learn order dependencies in language, allowing the same word to have different meanings or roles depending on its position in a sentence. Segment embeddings utilize 'Token Type IDs' to separate sentences. They recognize the boundaries of each sentence and learn the relationships between sentences.
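Written out, the summation described above takes the following form for the token at position $i$. The notation is introduced here for illustration: $x_i$ is the input ID, $s_i$ the token type ID, and $\mathbf{E}_{\text{tok}}$, $\mathbf{E}_{\text{pos}}$, $\mathbf{E}_{\text{seg}}$ are the three learned embedding tables.

```latex
\mathbf{e}_i = \mathbf{E}_{\text{tok}}[x_i] + \mathbf{E}_{\text{pos}}[i] + \mathbf{E}_{\text{seg}}[s_i],
\qquad \mathbf{e}_i \in \mathbb{R}^{H}
```

In the standard BERT implementation, layer normalization and dropout are applied to this sum before it enters the first encoder layer.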
In Step 5, the input representation vector (the sum of the three embeddings) and the 'Attention Mask' are combined to form the final representation vector that is the input to the BERT model.
In Step 6, we provide the final representation vector as input to the BERT model architecture to perform pre-training and transfer learning.
3.5 Pre-Training and Fine-Tuning
3.5.1 BERT Model Architectures
The Bidirectional Encoder Representations from Transformers (BERT) model is a natural language processing (NLP) model that can perform various tasks. It captures complex semantic relationships, understands context effectively, and shows remarkable ability in representation learning. In addition, because it can be pre-trained on large amounts of data, it enables effective transfer learning with only a small amount of task data. In this paper, we use the BERT model architecture as the pre-training model.
To determine the base model of the anomaly detection system, we apply the various feature factor combination methods and feature preprocessing methods to each BERT model structure (Base, Medium, Small, Mini, and Tiny). Table 9 describes the BERT model architectures by number of hidden layers (L) and hidden embedding size (H).
Table 9. BERT model names by architecture based on hidden layer (L) and hidden embeddings (H)
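For orientation, the five size names above match those of the publicly released compact BERT checkpoints. The sketch below lists the (L, H) values of those public checkpoints; it is an assumption that Table 9 uses the same values, and the table itself remains authoritative. The dictionary keys and the `attention_heads` helper are illustrative names.

```python
# (L, H) pairs of the publicly released compact BERT checkpoints.
# Assumption: Table 9 uses these same sizes; the paper's table is authoritative.
BERT_SIZES = {
    "Tiny":   {"num_hidden_layers": 2,  "hidden_size": 128},
    "Mini":   {"num_hidden_layers": 4,  "hidden_size": 256},
    "Small":  {"num_hidden_layers": 4,  "hidden_size": 512},
    "Medium": {"num_hidden_layers": 8,  "hidden_size": 512},
    "Base":   {"num_hidden_layers": 12, "hidden_size": 768},
}

def attention_heads(hidden_size, head_dim=64):
    """Number of attention heads under the usual H / 64 convention."""
    return hidden_size // head_dim

for name, cfg in BERT_SIZES.items():
    heads = attention_heads(cfg["hidden_size"])
    print(f"{name:<6} L={cfg['num_hidden_layers']:<2} H={cfg['hidden_size']:<3} heads={heads}")
```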
3.5.2 Pre-Training
Generally, the BERT model is pre-trained using a masked language model (MLM) task and a next sentence prediction (NSP) task. The MLM task trains the model by randomly masking some tokens in the input data and then having it infer the masked tokens. The NSP task focuses on identifying the relationship between sentences: two sentences are provided as input, and the model is trained to predict whether the second sentence follows the first.
Recently, BERT variants have employed a variety of tasks to learn context more effectively. A Lite BERT (ALBERT)[41] is a lightweight version of BERT developed by Google and the Toyota Technological Institute at Chicago. It aims to reduce model size and improve training speed while maintaining BERT's performance. Of particular note, ALBERT replaces the traditional NSP task, which is inefficient for modeling sentence coherence, with the sentence order prediction (SOP) task. Decoding-enhanced BERT with Disentangled Attention (DeBERTa)[42] is a model developed by Microsoft that aims to improve the attention mechanism of BERT. Specifically, DeBERTa abandons the NSP task and instead uses a disentangled attention mechanism and an enhanced mask decoder to handle content and position information better, improving the model's efficiency and overall performance. Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA)[43] is a model developed by Stanford University and Google that aims to improve on BERT's masked language modeling to increase pre-training efficiency. It introduces a replaced token detection (RTD) task instead of NSP. Robustly Optimized BERT Approach (RoBERTa)[44] is a model developed by Facebook AI Research that aims to optimize BERT's pre-training process. It finds that the NSP task is inefficient for downstream tasks and removes it entirely. Instead, it uses only the MLM task with dynamic masking, significantly improving BERT's performance. XLNet[45] is a model developed by Carnegie Mellon University and Google that aims to overcome BERT's limitations. XLNet eliminates NSP completely and instead introduces a new approach called permutation language modeling (PLM). This method captures bidirectional context while addressing the mismatch between pre-training and fine-tuning caused by BERT's [MASK] token.
In this paper, we use only the MLM task as the pre-training method for the user behavior analysis-based anomaly detection system, in line with recent research trends and our survey of pre-training strategies. We use dynamic masking to prevent the pre-trained model from overfitting to user behavior patterns and to improve its generalization ability. As recommended in the original BERT paper, we randomly select 15% of all tokens. Of the selected tokens, 80% are replaced with the [MASK] special token, 10% are replaced with random tokens, and the remaining 10% are left unchanged. Figure 4 visualizes the BERT model architecture with the MLM task in pre-training, where the [MASK] token is colored pink.
Figure 4. BERT model architecture with MLM tasks in pre-training
We exclusively use normal behavior session data for pre-training, and we validate the MLM task on a separate set of normal behavior session data that was not used during pre-training.
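The 15% selection and 80/10/10 replacement rule described earlier in this section can be sketched as follows. This is a simplified illustration rather than the training code: each token is selected independently with probability 0.15 (so the selected fraction is 15% only on average), and the function name, the example session, and the toy vocabulary are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}

def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Apply BERT-style MLM masking to one token sequence.

    Returns (masked_tokens, labels); labels[i] holds the original token at
    selected positions and None elsewhere (positions ignored by the loss).
    """
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never mask special tokens; skip tokens that are not selected.
        if tok in SPECIAL_TOKENS or rng.random() >= select_prob:
            continue
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

session = ["[CLS]", "PC-1234", "Logon_0702", "Connect_0904", "Logoff_1700", "[SEP]"]
vocab = [t for t in session if t not in SPECIAL_TOKENS]  # toy vocabulary
rng = random.Random(0)

# Dynamic masking: a fresh mask is drawn every time the session is served.
for epoch in range(3):
    print(epoch, mask_tokens(session, vocab, rng)[0])
```

Because the mask is redrawn on every pass, the model sees different masked positions for the same session across epochs, which is the property that distinguishes dynamic masking from a mask fixed once during preprocessing.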
Table 10 describes the common hyperparameters of the BERT model architectures used for pre-training. In this paper, we set the following hyperparameters to optimize the model's performance: a batch size of 512 and a learning rate of 1e-5 for pre-training; a dropout ratio of 0.3 to avoid overfitting; the AdamW optimizer (weight decay: 0.01) for the model's generalization performance; and L1 regularization (λ=1e-6) to reduce unnecessary noise by activating only some features. We also applied gradient clipping (max_norm=0.5) for stable training of the model.
Table 10. Pre-training common hyperparameters
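Two of the settings above, the L1 penalty and gradient clipping by norm, are simple enough to show in plain Python. The sketch below operates on flat lists of floats purely for illustration; in practice a deep learning framework's built-in utilities would be applied to the model parameters, and the function names here are hypothetical.

```python
import math

L1_LAMBDA = 1e-6  # L1 regularization strength from the text
MAX_NORM = 0.5    # gradient clipping threshold from the text

def l1_penalty(weights, lam=L1_LAMBDA):
    """lambda * sum(|w|), a term added to the training loss."""
    return lam * sum(abs(w) for w in weights)

def clip_by_global_norm(grads, max_norm=MAX_NORM):
    """Rescale gradients so that their L2 norm does not exceed max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

weights = [0.5, -1.5, 2.0]  # toy values
grads = [3.0, -4.0, 0.0]    # toy values, L2 norm = 5.0
print(l1_penalty(weights))
print(clip_by_global_norm(grads))
```

Clipping rescales the whole gradient vector, so the update direction is preserved and only its magnitude is bounded.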
3.5.3 Fine-Tuning
Fine-tuning is a type of transfer learning that achieves effective results with limited data by using knowledge from pre-trained models. In this study, we employed BERT's fine-tuning approach to learn efficiently from a small amount of anomalous data, using a 3:1 ratio of normal to abnormal session data as model input. We attached a binary classifier[46] to the model's final layer for anomaly detection, utilizing the [CLS] token representation for the downstream task. While maintaining most hyperparameters from pre-training (learning rate 1e-5, AdamW optimizer, weight decay 0.01, L1 regularization λ=1e-6, gradient clipping max_norm=0.5), we reduced the batch size to 64 and increased the dropout rate to 0.4 to prevent overfitting given the smaller dataset. CrossEntropyLoss was used as the loss function. Figure 5 illustrates the BERT model architecture for classification during fine-tuning, while Table 11 details the complete hyperparameter configuration used in transfer learning.
Figure 5. BERT model architectures in the fine-tuning
Table 11. Transfer learning common hyperparameters
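As a reminder of what the cross-entropy loss computes for a two-class classifier head, the sketch below evaluates it for a single example in plain Python. It assumes the classifier emits two logits from the [CLS] representation and that class 0 means normal and class 1 means abnormal; that label convention and the function names are illustrative choices, not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy loss for one example; label is the true class index."""
    return -math.log(softmax(logits)[label])

def predict(logits):
    """Predicted class index: 0 = normal, 1 = abnormal (assumed convention)."""
    return max(range(len(logits)), key=lambda k: logits[k])

logits = [2.0, -1.0]  # toy classifier output for one session
print(predict(logits), cross_entropy(logits, 0), cross_entropy(logits, 1))
```

The loss is small when the logit of the true class dominates and grows as the classifier becomes confidently wrong, which is what drives the classifier head and the encoder weights during fine-tuning.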
3.6 Performance Evaluation of the Anomaly Detection System
In this paper, we use confusion matrix-based evaluation metrics to evaluate the performance of the user behavior analysis-based anomaly detection system.
3.6.1 Confusion Matrix Based Performance Metrics
A confusion matrix[47] is primarily used to evaluate the predictive performance of binary classification models. It is often used to recognize specific types of errors or to evaluate performance on imbalanced datasets[48]. A confusion matrix has a 2x2 structure consisting of the four combinations of actual and predicted classes. To comprehensively evaluate the reliability and efficiency of the anomaly detection system, we use Accuracy, Precision, Recall, F1-Score, and AUC-ROC[49]. Table 12 describes the four combinations of the confusion matrix, and Figure 6 shows the performance metrics derived from these combinations.
Table 12. Confusion matrix elements
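The threshold-based metrics named above follow directly from the four confusion matrix counts. The sketch below uses the standard definitions, with the abnormal class treated as positive; the function name and the example counts are hypothetical and are not results from this paper. AUC-ROC is omitted because it requires the classifier's scores rather than the four counts.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, Precision, Recall, and F1-Score from confusion matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, not results from the paper.
metrics = confusion_metrics(tp=8, fp=2, tn=85, fn=5)
print(metrics)
```

On imbalanced data, accuracy is dominated by the true negatives, so precision, recall, and F1 on the abnormal class are the more informative figures.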