Machine Learning Technology Uses Tweets to Predict High-Risk Security Vulnerabilities
Machine learning technology can use the content of tweets to identify high-risk security vulnerabilities.
At the RSA Security Conference held in San Francisco last week, security-first vendors put all kinds of marketing-heavy “threat intelligence” and “vulnerability management” systems in front of users. Yet it turns out that an existing, free source of vulnerability information is often enough to tell system administrators which bugs and problems really need fixing. That source is updated 24 hours a day, seven days a week: Twitter.
A group of researchers experimentally evaluated the value of the vulnerability data stream on Twitter. They also built free software that tracks the relevant information, identifies software flaws that can be fixed, and assesses their severity.
Researchers from Ohio State University, security vendor FireEye, and research firm Leidos recently published a paper describing a new system that reads millions of tweets mentioning software security vulnerabilities, then uses a machine-learning algorithm trained on that data to evaluate the threat each one poses based on how it is described.
They found that Twitter data can not only predict most of the security vulnerabilities that will appear in the National Vulnerability Database in the following days (the official registry of security vulnerabilities maintained by the National Institute of Standards and Technology), but can also, using natural language processing, roughly predict which vulnerabilities will be assigned a “dangerous” or “high-risk” severity rating, with an accuracy of over 80%.
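As a rough illustration of the kind of text classification involved, a minimal Naive Bayes classifier over tweet text might look like the sketch below. This is not the authors' actual model; the training tweets, labels, and tokenizer here are invented purely for illustration:

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and split on whitespace; a real system would use a tweet-aware tokenizer.
    return text.lower().split()

def train_naive_bayes(examples):
    """Collect class priors and per-class word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for w in tokenize(text):
            word_counts[label][w] += 1
            vocab.add(w)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability, with add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in tokenize(text):
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (invented for illustration, not the paper's corpus).
tweets = [
    ("remote code execution exploit in the wild patch now", "high"),
    ("critical rce vulnerability actively exploited", "high"),
    ("attackers exploit critical bug patch immediately", "high"),
    ("minor ui glitch fixed in latest release", "low"),
    ("low severity info leak requires local access", "low"),
    ("cosmetic bug in settings page fixed", "low"),
]
model = train_naive_bayes(tweets)
print(predict(model, "critical exploit allows remote code execution"))  # → high
```

The real system additionally weighs how people describe a bug (urgency, exploit mentions) rather than just keyword presence, but the pipeline shape is the same: label, train, score.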
“We believe that security vulnerabilities behave much like trending topics on Twitter: both show clear trends that can be tracked,” said Ohio State University professor Alan Ritter. The research will be presented and officially published at the North American Chapter of the Association for Computational Linguistics conference in June.
For example, in their current prototype test on live data, the system picked up a surge of tweets about the latest macOS vulnerability (known as “BuggyCow”) as well as the SPOILER attack, which exploits a deep flaw in Intel chips. The Twitter scanner the researchers developed marked both as “possibly high-risk.” As of this writing, neither vulnerability has been included in the National Vulnerability Database.
Of course, the researchers admit that the current prototype is not perfect. The program updates only once a day, includes a good deal of duplicate content, and, by comparison, misses some vulnerabilities that do appear in the National Vulnerability Database. But Ritter believes the real advance of this research lies in automatically analyzing vulnerabilities from human language while accurately ranking them by severity.
This means it could one day become a powerful information aggregator that system administrators use to protect their systems from intrusion, or at least a necessary component of commercial vulnerability data feeds; it may even become an unprecedented free source of vulnerability information, weighted and ranked by importance. All of this would be a great boon for system administrators.
He explained, “We hope to build a computer program that can read network information, extract early reports of new software vulnerabilities, and analyze users’ overall view of their potential severity. In practical terms, people facing complex analysis results constantly confront one question: which of these represents a high-risk vulnerability that could actually cause serious losses?”
In fact, the thinking behind this is nothing new. For years, people have explored summarizing software vulnerability data from text on the Internet, including specifically from Twitter. But using natural language processing to rank the severity of vulnerabilities described in tweets represents an important turning point.
Anupam Joshi, a professor at the University of Maryland, Baltimore County who studies the same issue, agrees. “People are paying more and more attention to discussions of security vulnerabilities online,” he pointed out. “We have realized that we can get early warning signals from social platforms such as Twitter, as well as Reddit posts, the dark web, and blog comments.”
In their experiments, the researchers at Ohio State University, FireEye, and Leidos started with a subset of 6,000 tweets related to security vulnerabilities. They showed these tweets to Amazon Mechanical Turk workers, who ranked them by severity, and then filtered out outlier judgments that diverged sharply from those of most other readers.
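The outlier-filtering step described above can be sketched as follows. The rating scale, deviation threshold, and worker IDs are assumptions chosen for illustration, not details from the paper:

```python
from statistics import median

def filter_outlier_annotations(ratings, max_deviation=2):
    """ratings: dict of annotator id -> severity score for one tweet.
    Drops annotators whose rating deviates from the median by more than
    max_deviation; returns the surviving ratings."""
    m = median(ratings.values())
    return {a: r for a, r in ratings.items() if abs(r - m) <= max_deviation}

# Hypothetical ratings for one tweet from five crowd workers (1 = trivial, 5 = critical).
ratings = {"w1": 4, "w2": 5, "w3": 4, "w4": 1, "w5": 5}
kept = filter_outlier_annotations(ratings)
print(sorted(kept))  # → ['w1', 'w2', 'w3', 'w5']
```

Comparing each worker against the median rather than the mean keeps a single extreme rating from shifting the reference point it is judged against.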
Next, the researchers used these labeled tweets as training data for a machine learning engine and tested its predictions. Focusing on vulnerabilities likely to appear in the National Vulnerability Database within the next five days, the program predicted the 100 most serious vulnerabilities in that period, as ranked by the database’s own severity scores, with 78% accuracy. For the top 50, its severity predictions were more accurate still, reaching 86%.
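Evaluating a predicted ranking against the database’s own severity ranking amounts to a precision-at-k measurement, which can be sketched as below. The CVE identifiers are hypothetical placeholders, not real entries:

```python
def precision_at_k(predicted_ranking, true_high_risk, k):
    """Fraction of the top-k predicted items that are truly high-risk."""
    top_k = predicted_ranking[:k]
    hits = sum(1 for item in top_k if item in true_high_risk)
    return hits / k

# Hypothetical IDs: the model's ranking vs. the set the database later rated high-risk.
predicted = ["CVE-A", "CVE-B", "CVE-C", "CVE-D", "CVE-E"]
actually_high = {"CVE-A", "CVE-C", "CVE-D", "CVE-F"}
print(precision_at_k(predicted, actually_high, 5))  # → 0.6
```

Reported this way, the 78% top-100 and 86% top-50 figures simply reflect the metric computed at two different values of k.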
More importantly, for the 10 vulnerabilities the National Vulnerability Database rated most severe over the next five days, the program’s prediction accuracy was 100%.
Ohio State University’s Ritter cautioned that although the test results are very encouraging, the automated tool should not be used by any individual or organization as their sole source of vulnerability data; at minimum, people should click through to the underlying tweets and the information they link to in order to confirm the analysis. “It still needs human intervention,” he pointed out. In his view, the program is best incorporated as just one source among a broad range of human-curated vulnerability data feeds.
Still, given the accelerating pace of vulnerability discovery and the growing volume of vulnerability-related information on social media, Ritter believes the program can become an important tool for extracting valuable signal from noise. “Today’s security industry faces a problem of too much information,” he concluded. “The core of this program is an algorithm that helps everyone sort through all that content to find what is essential.”