Detecting Cyber-attacks Using an Unsupervised Method
    
    
Posted by Taoran Ji
    In the last blog, I
    introduced the importance of early detection of cyber-attack events and the
    motivation for using social media as a source of information, and explored one
    solution focused on early detection using tech blogs. Intuitively,
    tech blogs are less ``fresh'' than Twitter, since they are usually
    carefully written by experts. On Twitter, users are free to report and tweet
    anything they find suspicious without overthinking the details.
 
    However, using tweets is never trivial. Apart from the textual
    complexity I introduced in the last
    blog, the most troublesome
    feature of Twitter-based research is that, in most cases, there are no
    positive or negative samples available. In other words, it's hard to
    collect proper training and test data sets. One straightforward way to solve
    this problem is to have researchers identify and label the positive and
    negative samples manually. But this doesn't work in a real-world
    application, since the volume of tweets is usually huge. Thus,
    researchers are more inclined to use unsupervised methods to detect and
    extract useful information from the sea of tweets.
 
    In this blog, I'll explore one work which applies an unsupervised method
    for the early detection of cyber-attacks.
    Definition of Cyber-attack Event 
    The paper I'll explore in this blog is ``Weakly Supervised Extraction of
    Computer Security Events from Twitter''. Though it proposes a weakly
    supervised method for detecting computer security events, I still
    consider it an unsupervised method, since only a very few positive samples
    are required, and the core contribution of the paper, mining critical
    information, works in an unsupervised way.
    Before exploring the details of the technology used in this paper, let's
    first try to answer the most fundamental question: what is the definition
    of a cyber-attack event? Intuitively, this question can be answered from
    a security angle. Let's see the description from
    Techopedia.
    
 
        ``A cyberattack is deliberate
        exploitation of computer systems, technology-dependent enterprises and
        networks. Cyberattacks use malicious code to alter computer code, logic
        or data, resulting in disruptive consequences that can compromise data
        and lead to cybercrimes, such as information and identity theft.''
    I don't even want to read the entire description, since it doesn't help us
    formulate the problem people want to solve. The issue here is that this
    definition fails to connect the problem with the textual data we have. We
    need a description that casts the cyber-attack detection problem into
    the field of textual data mining or classification. In the paper, the authors
    define a cyber-attack event as a tuple (ENTITY, DATE). To further ensure
    that the event is security related, each event is expected to fall into
    one of several security categories, each named after a security keyword
    (e.g., hack, DDoS).
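    To make this representation concrete, here is a minimal sketch of an event
    as a typed tuple in Python. This is my own illustration, not code from the
    paper, and the field names are assumptions.

```python
from typing import NamedTuple
import datetime

class CyberAttackEvent(NamedTuple):
    """An event in the paper's sense: a (ENTITY, DATE) tuple tagged
    with the security-keyword category it belongs to."""
    entity: str          # the victim, e.g., a company or website name
    date: datetime.date  # the normalized date of the attack
    category: str        # security keyword, e.g., "hack" or "ddos"

# hypothetical example instance
event = CyberAttackEvent("ExampleCorp", datetime.date(2014, 6, 1), "ddos")
print(event)
```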
    With this definition, the problem is transformed into a classification
    paradigm: filter tweets by event-type keywords, then
    identify the tweets that are related to unknown events. With this idea in mind,
    the workflow proposed in the paper is intuitive and easy to understand.
    Collection of Seed Events and Candidate Events 
    Collecting seed events is pretty straightforward: information analysts
    already provided a set of 10-20 seed events for each type of cyber attack.
    For instance, the seed instances for DDoS attacks are shown in the following
    table. As we can see, each entry is presented as a tuple (ENTITY, DATE);
    in particular, ENTITY here is the name of the victim.
 

 
    Despite the name, candidate events are in fact collections of suspicious
    tweets. Collecting them consists of three submodules. First, candidate
    tweets are pulled from the Twitter Streaming API using keywords such as
    hacked, ddos, and breach. Second, the extracted tweets are fed into an
    NLP pipeline in which each component of a tweet is analyzed further;
    for instance, dates contained in a tweet are normalized and collected.
    Finally, the contextual environment of each keyword is collected, since the
    authors believe that a keyword's neighbors also contribute to the
    classification. The paper lists some examples of such contextual features.
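    Since those examples don't reproduce here, here is a minimal sketch of what
    collecting such contextual features might look like. The keyword list,
    tokenizer, window size, and feature names are my assumptions; the paper's
    actual pipeline is richer.

```python
import re

# keywords used to pull candidate tweets from the Streaming API
KEYWORDS = {"hacked", "ddos", "breach"}

def contextual_features(tweet, window=2):
    """Collect the words around each security keyword; the intuition is
    that a keyword's neighbors help decide whether the tweet reports
    a real attack. The tokenizer here is deliberately naive."""
    tokens = re.findall(r"\w+", tweet.lower())
    feats = []
    for i, tok in enumerate(tokens):
        if tok in KEYWORDS:
            feats += [f"left:{w}" for w in tokens[max(0, i - window):i]]
            feats += [f"right:{w}" for w in tokens[i + 1:i + 1 + window]]
    return feats

# hypothetical tweet; prints ['left:corp', 'left:got', 'right:again', 'right:today']
print(contextual_features("Wow, looks like Acme Corp got hacked again today"))
```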
    Classification
    As we discussed before, there is no labeled training data available in
    this case. Though researchers can get some instances from experts,
    collecting enough training samples that way requires too much human effort.
    So in this situation, we have a lot of unlabeled events. If we treat all
    these samples as negative points, then we will train a biased classifier,
    since many of them are in fact positive points. Meanwhile, treating every
    sample as positive is not an option, either. The authors proposed a clever
    way to solve this problem using the concept of expectation. In particular,
    they maximize the likelihood over the positive seed instances with
    a regularization term that encourages the expectation over the unlabeled
    data to match a user-provided target expectation. First of all, it's easy to
    understand why we want to maximize the likelihood of the seed instances.
    Second, they do this maximization under one condition: keep the
    difference between the expectation over the unlabeled data and the
    user-provided expectation as small as possible. In this paper, the authors
    show that this method performs better than traditional classifiers like SVM.
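    To make the objective concrete, here is a minimal sketch, assuming
    a logistic-regression classifier and a simple squared-distance penalty
    standing in for the paper's exact regularizer; all data and parameter
    values are invented.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def objective(w, X_seed, X_unlab, target, lam):
    """Negative log-likelihood on the positive seeds, plus a penalty that
    pulls the mean predicted positive probability on unlabeled tweets
    toward the user-provided target expectation."""
    nll = -np.sum(np.log(sigmoid(X_seed @ w) + 1e-12))  # seeds are all positive
    mean_p = np.mean(sigmoid(X_unlab @ w))              # model expectation on unlabeled data
    return nll + lam * (mean_p - target) ** 2

# hypothetical toy data: rows are tweet feature vectors
rng = np.random.default_rng(0)
X_seed = rng.normal(1.0, 1.0, size=(15, 5))    # 10-20 expert-provided seeds
X_unlab = rng.normal(0.0, 1.0, size=(500, 5))  # the unlabeled sea of tweets

res = minimize(objective, np.zeros(5),
               args=(X_seed, X_unlab, 0.2, 100.0),  # target expectation 0.2 (invented)
               method="L-BFGS-B")
print("learned weights:", res.x)
```

    The target expectation is the fraction of unlabeled tweets the user
    believes are true positives; as the afterword notes, choosing this value
    well is the hard part in practice.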
    Afterword
    Cyber-attack detection and characterization using open-source data is still
    an open research field. Unlike the traditional security field, research
    here focuses on using machine learning and data mining
    methods to identify indicators in the sea of open-source information. In
    this blog, we explored and discussed one work in this field. Though it works
    better than general-purpose methods like SVM, it still needs further
    improvement, since its performance depends on a user-defined expectation
    value that is very hard to determine in real-world applications.