Detecting Cyber-attacks Using an Unsupervised Method
Posted by Taoran Ji
In the last blog, I introduced the importance of early detection of
cyber-attack events and the motivation for using social media as a source of
information, and explored one solution that focuses on early detection using
tech blogs. Intuitively, tech blogs are less "fresh" than Twitter, since they
are usually carefully written by experts. On Twitter, users are free to report
and tweet anything they find suspicious without overthinking the details.
However, using tweets is far from trivial. Apart from the textual
complexity I described in the last blog, the most annoying feature of
Twitter-based research is that, in most cases, there are no labeled positive
or negative samples available. In other words, it's hard to collect proper
training and test data sets. One straightforward way to solve this problem is
to ask researchers to identify and label the positive and negative samples
manually. But this doesn't work in a real-world application, since the volume
of tweets is usually huge. Thus, researchers are more inclined to use
unsupervised methods to detect and extract useful information from the sea of
tweets.
In this blog, I'll explore one paper that applies an unsupervised method
to the early detection of cyber-attacks.
Definition of Cyber-attack Event
The paper I'll explore in this blog is "Weakly Supervised Extraction of
Computer Security Events from Twitter". Though it proposes a weakly
supervised method for detecting computer security events, I still consider it
an unsupervised method, since only very few positive samples are required and
the core contribution of the paper is the mining of critical information,
which works in an unsupervised way.
Before exploring the details of the technology used in this paper, let's
first try to answer the most fundamental question: what is the definition
of a cyber-attack event? Intuitively, this question can be answered from
a security angle. Let's see the description from
Techopedia.
“Definition - What does Cyberattack mean? A cyberattack is deliberate
exploitation of computer systems, technology-dependent enterprises and
networks. Cyberattacks use malicious code to alter computer code, logic
or data, resulting in disruptive consequences that can compromise data
and lead to cybercrimes, such as information and identity theft.”
I don't even want to read the entire description, since it doesn't help us
formulate the problem we want to solve. The issue here is that this
definition fails to connect the problem with the textual data we have. We
need a description that casts the cyber-attack detection problem as a text
mining or classification problem. In the paper, the authors define a
cyber-attack event as a tuple (ENTITY, DATE). To further ensure that the
event is security-related, each event is expected to belong to one of the
security categories, each named after a security keyword (e.g., hack,
DDoS).
With this definition, the problem is transformed into a classification
task: filter tweets by keywords related to the event type, and then identify
the tweets that refer to previously unknown events. With this idea in mind,
the workflow proposed in the paper is very intuitive and easy to understand.
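To make the (ENTITY, DATE) definition and the keyword filter concrete, here
is a minimal sketch in Python. The names SecurityEvent, CATEGORY_KEYWORDS and
matching_categories are hypothetical and chosen only for illustration; the
keyword lists are assumptions, not the ones used in the paper.

```python
import re
from dataclasses import dataclass
from datetime import date

# Assumed keyword lists per category; the paper's actual lists may differ.
CATEGORY_KEYWORDS = {
    "hack": ("hack", "hacked"),
    "ddos": ("ddos",),
}


@dataclass(frozen=True)
class SecurityEvent:
    """An event as an (ENTITY, DATE) tuple plus its security category."""
    entity: str
    day: date
    category: str


def matching_categories(tweet_text):
    """Return the categories whose keywords appear as tokens in the tweet."""
    tokens = set(re.findall(r"[a-z0-9]+", tweet_text.lower()))
    return [
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in tokens for keyword in keywords)
    ]
```

A tweet such as "Example Corp hit by DDoS attack" would pass the "ddos"
filter, while a tweet with none of the keywords would be dropped before any
further processing.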
Collection of Seed Events and Candidate Events
Collecting seed events is pretty straightforward: information analysts
have already provided a set of 10 to 20 seed events for each type of cyber
attack. For instance, the seed instances for DDoS attacks are shown in the
following table. As we can see, each entry is presented as a tuple
(ENTITY, DATE). In particular, ENTITY here is the victim name in the table.
Contrary to what their name suggests, candidate events are in fact
collections of suspicious tweets. Collecting them consists of three
submodules. First of all, candidate tweets are extracted from the Twitter
Streaming API using keywords like hacked, ddos and breach. Secondly, the
extracted tweets are put into an NLP pipeline, in which each component of a
tweet is further analyzed. For instance, dates mentioned in the tweet are
normalized and collected. The context surrounding each keyword is also
collected, since the authors believe that the words neighboring a keyword
also contribute to the classification. The following are some examples of
contextual features.
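As a rough illustration of what collecting the neighbors of a keyword could
look like, the sketch below builds position-tagged features from a fixed
window around the keyword. The function name context_features, the window
size and the feature format are all assumptions made for this example; the
paper's actual feature templates may differ.

```python
import re


def context_features(tweet_text, keyword, window=2):
    """Collect tokens within `window` positions of each keyword occurrence.

    Each feature is tagged with its signed offset from the keyword,
    e.g. "-1:was" for the token immediately before it.
    """
    tokens = re.findall(r"[a-z0-9#@']+", tweet_text.lower())
    features = []
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        for offset in range(-window, window + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                features.append(f"{offset:+d}:{tokens[j]}")
    return features


print(context_features("Example Corp was hacked last night", "hacked"))
# prints ['-2:corp', '-1:was', '+1:last', '+2:night']
```

Tagging each neighbor with its offset keeps "was hacked" and "hacked was"
apart, which a plain bag of words would not.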
Classification
As we discussed before, there is no labeled training data available in
this case. Though researchers can get some instances from experts, it takes
too much human effort to collect enough training samples. So in this
situation, we have a lot of unlabeled events. If we treat all these samples
as negative, we will train a biased classifier, since many of them are in
fact positive. Meanwhile, treating every sample as positive is not an
option, either. The authors give a smart way to solve this problem, based on
the concept of expectation. In particular, they maximize the likelihood of
the positive seed instances with a regularization term that encourages the
expectation over the unlabeled data to match a user-provided target
expectation. First of all, it's easy to understand why we want to maximize
the likelihood of the seed instances. Second, this maximization is done
under one condition: the difference between the expectation over the
unlabeled data and the user-provided expectation should be as small as
possible. In the paper, the authors show that this method performs better
than traditional classifiers like SVM.
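The idea can be sketched numerically as follows. This is only an
illustration under stated assumptions, not the authors' implementation: it
assumes a logistic regression model, measures the mismatch between the
target expectation and the average predicted positive rate on unlabeled data
with a KL divergence between two Bernoulli distributions, and adds an L2
penalty. The names objective, lam and l2 are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(theta, x):
    """Probability that feature vector x is a positive event."""
    return sigmoid(sum(w * v for w, v in zip(theta, x)))


def objective(theta, seeds, unlabeled, target, lam=10.0, l2=0.1):
    """Seed log-likelihood minus an expectation penalty (to be maximized).

    target is the user-provided expected fraction of positives among the
    unlabeled examples and must lie strictly between 0 and 1.
    """
    # Likelihood term: every seed instance should be predicted positive.
    log_lik = sum(math.log(predict(theta, x)) for x in seeds)
    # Model expectation: average predicted positive rate on unlabeled data.
    p_hat = sum(predict(theta, x) for x in unlabeled) / len(unlabeled)
    # KL divergence between Bernoulli(target) and Bernoulli(p_hat).
    kl = (target * math.log(target / p_hat)
          + (1.0 - target) * math.log((1.0 - target) / (1.0 - p_hat)))
    return log_lik - lam * kl - l2 * sum(w * w for w in theta)
```

With all weights at zero the model predicts 0.5 everywhere, so the penalty
vanishes when target is 0.5 and grows as target moves away from it. In
practice one would maximize this objective over theta with a gradient-based
optimizer.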
Afterword
Cyber-attack detection and characterization using open-source data is still
an open research field. Unlike the traditional security field, research in
this area is more focused on using machine learning or data mining methods
to identify indicators in the sea of open-source information. In this blog,
we explored and discussed one work in this field. Though it works better
than general methods like SVM, it still needs further improvement, since its
performance depends on a user-defined expectation value, which is very hard
to determine in real-world applications.