Cyber-attack Detection and Characterization Using Social Media

Posted by Taoran Ji

In recent years, cyber-attacks have been gaining increasing attention from the public because they are occurring more frequently, at greater scale, and with more severe damage or influence. In fact, last year many news reports on data crime used words such as "biggest", "largest", or "in history". In 2016 alone, we saw the Democratic National Committee (DNC) email leak, which revealed a sophisticated understanding of politics; the Yahoo data breach, which affected over 1 billion accounts; and massive DDoS attacks targeting the major DNS provider Dyn, which disrupted Internet service in the eastern US. In the past several months of 2017, we saw WannaCry spread around the world, and the Equifax data breach, which leaked sensitive records of around 143 million people, including social security numbers and other personal information. Data leaked from government agencies and large enterprises like Equifax is usually sensitive personal information, and its exposure may cause further loss of personal property.

Unfortunately, although leaked data may lead to further cyber crimes, organizations under attack usually choose not to notify the public right away for business reasons, and news reports can take a very long time to appear because reporters have to collect information from outside these organizations. For instance, the Yahoo data breach, which occurred around August 2013, was only reported in December 2016, even though it affected over 1 billion user accounts.

In response, cyber-attack detection and characterization using social media is a research area that aims to detect and characterize ongoing or past cyber-attacks by leveraging conversations and discussions posted on public social media platforms like Facebook and Twitter. In this blog post, I'll explain the motivation for using social media and explore some of the work done by other researchers.

Motivation: Why use social media?

The motivation for using social media data is that people report or complain about abnormal behavior of their accounts (e.g., unusual login locations and times) as well as unhappy experiences with Internet services, both of which can be indicators of an ongoing cyber-attack. For instance, on Oct 21, 2016, Spotify, Netflix, and other websites suffered a massive DDoS attack that dramatically slowed access. The following screenshots show people reporting and complaining about slow Internet on Twitter.

Another case is the Yahoo data breach. As I mentioned, although the breach occurred around August 2013, it was officially confirmed only in December 2016, over three years later. On Twitter, there were already many signals that users' Yahoo accounts were showing abnormal login behavior, as shown in the figures below.

In fact, social media, which is believed to turn users into social sensors and empower them to participate in an online ecosystem, has been used in many event detection tasks, such as detecting disease outbreaks, civil unrest, and earthquakes. Thus, intuitively, we believe that analysis of such online media can provide insight into a broader range of cyber-attacks, such as data breaches and account hijacking.

Exploration of Solutions

It seems that the problem can be easily solved by mining tweets that contain keywords about cyber-attack events. What a straightforward, simple, and beautiful method! But it will never work. Come on. This is social media we are talking about. It never works the way you think. Working with social media data is never a trivial task, considering how noisy the online platform is. On Twitter, people use slang, informal language, hashtags, and made-up words, and misspellings are common.

What do you see in the above two tweets? Netlix, Twier, #twitterislife, and slow. These are the signals we want, but the machine doesn't know what Netlix, Twier, or #twitterislife is, since they are not in the dictionary. What's more, even if Twitter users wrote their tweets very carefully and double-checked the spelling before posting, it still wouldn't work. Every word can be ambiguous online. The following two tweets show how the keyword "hack" is used to express different meanings.
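To make the misspelling problem concrete, here is a minimal sketch. It assumes a hypothetical watch list of service names and a made-up example tweet (not the ones in the screenshots), and it contrasts exact keyword matching with fuzzy matching from Python's standard `difflib` module. The function names and the 0.8 cutoff are my own choices for illustration, not something taken from the papers discussed here.

```python
import difflib
import re

# Hypothetical watch list; not taken from any of the cited papers.
KEYWORDS = ["netflix", "twitter", "spotify"]

def tokenize(text):
    """Lowercase a tweet and split it into alphanumeric tokens ('#' is dropped)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def exact_hits(text):
    """Keywords that appear verbatim as a token in the tweet."""
    tokens = set(tokenize(text))
    return [k for k in KEYWORDS if k in tokens]

def fuzzy_hits(text, cutoff=0.8):
    """Keywords that some token resembles closely enough to be a likely misspelling."""
    tokens = tokenize(text)
    return [k for k in KEYWORDS
            if difflib.get_close_matches(k, tokens, n=1, cutoff=cutoff)]

tweet = "Netlix is so slow today"  # made-up tweet with a misspelled service name
print(exact_hits(tweet))  # []
print(fuzzy_hits(tweet))  # ['netflix']
```

Fuzzy matching recovers some misspellings, but it does nothing for run-together hashtags like #twitterislife, and it does not touch the ambiguity problem at all: a tweet that contains "hack" matches whether or not it is about an attack.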

Now let's look at other researchers' work. Most existing work focuses on extracting useful information from technology blogs and tweets written by security professionals. For instance, Liao et al. build text mining tools to extract key attack identifiers (IP addresses, MD5 hashes) from security tech blogs in their paper "Acing the IOC Game: Toward Automatic Discovery and Analysis of Open-Source Cyber Threat Intelligence".

IOC Game

This paper focuses on the automatic discovery and analysis of IOC (Indicators of Compromise) information presented by security professionals in public sources (e.g., blogs, forums, and tweets). In particular, the authors propose iACE, a framework that can locate an IOC token, get its context, and further analyze the relation between them through NLP techniques. The following figure shows iACE's architecture.

As we can see, the system keeps collecting technical blogs using the Blog Scraper (BS), which is essentially a web crawler designed to monitor rapidly evolving online content. Though technical blogs are less noisy than Twitter, they still contain a lot of material unrelated to IOCs, e.g., product promotions, news, or software updates. Thus the Blog Preprocessor (BP) is required to pre-process the collected data, extracting and normalizing only the technical content from crawled pages and filtering out non-IOC articles. The Relevant Content Picker (RCP) then uses context terms and regexes to locate the sentences likely to contain IOC information. In a nutshell, for each IOC blog, the job of BP and RCP is to identify and pick the sentences, tables, and lists that are likely to include IOCs. Though context terms and regexes can help find sentences likely to involve IOCs, they are insufficient for detecting IOCs with high accuracy. Therefore, the Relation Checker (RC) adopts a dependency tree technique to improve accuracy. In particular, the presence of an IOC relation between a context term and an IOC candidate within a sentence is examined through a customized kernel classifier, which calculates the similarity between two dependency subgraphs based on point-wise distance. The RC workflow is shown below.
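The context-term-plus-regex step can be illustrated with a short sketch. This is not iACE's implementation: the regexes, the context terms, and the function name `pick_candidate_sentences` are all assumptions made up for illustration, and the sentence splitter is deliberately naive.

```python
import re

# Illustrative patterns only; the paper's actual regexes are not reproduced here.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
}

# Hypothetical context terms that tend to appear near IOCs.
CONTEXT_TERMS = ("download", "connects to", "drops", "c2", "hash", "md5")

def pick_candidate_sentences(text):
    """Return (sentence, [(ioc_type, token), ...]) pairs for sentences that
    contain both a context term and at least one regex match."""
    results = []
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lowered = sentence.lower()
        if not any(term in lowered for term in CONTEXT_TERMS):
            continue  # no context term, so skip even if a regex would match
        found = [(kind, match.group(0))
                 for kind, pattern in IOC_PATTERNS.items()
                 for match in pattern.finditer(sentence)]
        if found:
            results.append((sentence, found))
    return results

# Made-up blog text: two IOC sentences, one promotion, one version number.
sample = ("The malware connects to 192.0.2.10 over port 443. "
          "Our new product ships next week. "
          "The dropped file has MD5 hash d41d8cd98f00b204e9800998ecf8427e. "
          "Version 10.0.2.1 of the tool adds a dark theme.")

for sentence, iocs in pick_candidate_sentences(sample):
    print(iocs, "<-", sentence)
```

In this toy input, the version number 10.0.2.1 looks like an IP address to the regex and is only rejected because its sentence has no context term. A sentence that happens to contain both a context term and an unrelated look-alike token would still slip through, which is the gap the Relation Checker is meant to close.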

In the evaluation, the authors run their framework on 71,000 articles collected from 45 technical blogs and achieve remarkable performance. iACE generates 900K OpenIOC items with a precision of 95% and a recall of over 90%, which outperforms other state-of-the-art approaches. In summary, one major technical contribution of this paper is the use of a dependency parser to identify IOC items instead of relying on context terms or regexes alone, and it does dramatically improve identification accuracy.
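For readers less familiar with these two metrics, precision is the share of extracted items that are correct, and recall is the share of true items that were found. Here is a small sketch of the arithmetic. The counts are made up to land near the reported percentages and are not from the paper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall

# Hypothetical counts: 95 correct IOCs extracted, 5 spurious ones, 10 missed.
p, r = precision_recall(95, 5, 10)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.95 recall=0.90
```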

Afterword

Cyber-attack detection and characterization using open-source data is still a new but interesting research problem. It's based on the assumption that security professionals and ordinary web service users exchange and share their knowledge, experience, and information through social media. Furthermore, this assumption has been shown to hold in other papers, e.g., "Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits" and "Weakly Supervised Extraction of Computer Security Events from Twitter". This problem can be further divided into several more detailed subproblems, e.g., what information to extract from open-source data (such as IOC items and named entities), how to get the target "particle", and how to use the extracted knowledge elements.