Predicting zero-day software vulnerabilities through data mining -- Second Presentation Su Zhang
Outline • Quick Review • Data Source – NVD • Six Most Popular/Vulnerable Vendors For Our Experiments • Why The Six Vendors Are Chosen • Data Preprocessing • Functions Available For Our Approach • Statistical Results • Plan For Next Phase
Source Database – NVD • National Vulnerability Database: the U.S. government repository of standards-based vulnerability management data. • Data included in each NVD entry • Published Date Time • Vulnerable software’s CPE Specification • Derived data (sketched below) • Published Date Time → Month • Published Date Time → Day • Two adjacent vulnerabilities’ CPE diff(v1, v2) → Version diff • CPE Specification → Software Name • Adjacent distinct Published Date Times → ttpv (time to previous vulnerability) • Adjacent distinct Published Date Times → ttnv (time to next vulnerability)
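A minimal sketch of how the derived attributes could be computed, assuming each NVD entry has already been parsed into a (published datetime, CPE URI) pair for a single software; the function and field names are ours, not the actual implementation.

```python
from datetime import datetime

def derive_attributes(entries):
    """Illustrative sketch: derive month, day, software name, version,
    ttpv and ttnv from parsed NVD entries of one software, given as
    (published_datetime, cpe_uri) pairs."""
    entries = sorted(entries, key=lambda e: e[0])
    rows = []
    for i, (published, cpe) in enumerate(entries):
        parts = cpe.split(":")        # e.g. "cpe:/o:linux:linux_kernel:2.6.18"
        rows.append({
            "published": published,
            "month": published.month,
            "day": published.day,
            "software": parts[3] if len(parts) > 3 else None,
            "version": parts[4] if len(parts) > 4 else None,
            # days since the previous / until the next vulnerability
            "ttpv": (published - entries[i - 1][0]).days if i > 0 else None,
            "ttnv": (entries[i + 1][0] - published).days if i + 1 < len(entries) else None,
        })
    return rows

# Dates loosely follow the later example slide; the year is an assumption.
rows = derive_attributes([
    (datetime(2007, 5, 2), "cpe:/o:linux:linux_kernel:2.6.18"),
    (datetime(2007, 5, 7), "cpe:/o:linux:linux_kernel:2.6.19.2"),
])
```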
Six Most Vulnerable/Popular Vendors • Linux: 56925 instances • Sun: 24726 instances • Cisco: 20120 instances • Mozilla: 19965 instances • Microsoft: 16703 instances • Apple: 14809 instances.
Why We Only Choose Instances Of Popular Vendors -- Vulnerability Table
Why We Only Choose Instances Of Popular Vendors • The huge number of nominal values (vendors and software names) would cause a scalability issue. • The top six vendors account for 43.4% of all instances. • NVD contains too many vendors (10,411). • The seventh most popular/vulnerable vendor has far fewer instances than the sixth. • Vendors are independent of each other for our approach.
Data Preprocessing • NVD data as the training/testing dataset • Start from 2005, since earlier data looks unstable. • Correct some obvious errors in NVD (e.g. “cpe:/o:linux:linux_kernel:390”). • Attributes • Published time: only month and day are used. • Version diff: a normalized difference between two versions. • Vendor: removed.
Data Preprocessing (Cont) • Attributes • Group vulnerabilities published on the same day, so that ttnv/ttpv are guaranteed to be non-zero (see the sketch below). • ttnv is the predicted attribute. • For each software • Delete its first group of instances. • Delete its last group of instances.
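A minimal sketch of the grouping and trimming steps above, assuming rows shaped like the output of the earlier sketch; the helper name and row layout are assumptions, not the actual preprocessing code.

```python
from itertools import groupby

def preprocess(rows):
    """Keep one instance per (software, publication day) so that ttnv/ttpv
    are non-zero, then drop each software's first and last group."""
    cleaned = []
    rows = sorted(rows, key=lambda r: (r["software"], r["published"]))
    for _software, items in groupby(rows, key=lambda r: r["software"]):
        day_groups = [list(g) for _, g in
                      groupby(items, key=lambda r: r["published"].date())]
        for group in day_groups[1:-1]:    # drop the first and last group
            cleaned.append(group[0])      # one representative instance per day
    return cleaned
```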
Version diff Calculation • v1 = 3.6.4; v2 = 3.6; MaxVersionLength = 4 • v1 = expand(v1, 4) = 3.6.4.0 • v2 = expand(v2, 4) = 3.6.0.0 • diff(v1, v2) = (3 − 3) * 100^0 + (6 − 6) * 100^−1 + (4 − 0) * 100^−2 + (0 − 0) * 100^−3 = 4E−4
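The expansion and weighting above can be written as a small helper; a sketch assuming purely numeric version components (the function name is ours, not from the actual code).

```python
def version_diff(v1, v2, max_len=4):
    """Normalized version difference as in the example above: each version is
    padded to max_len components; component i (0-based) is weighted by 100**(-i)."""
    def expand(v):
        parts = [int(p) for p in v.split(".")]
        return parts + [0] * (max_len - len(parts))
    a, b = expand(v1), expand(v2)
    return sum((x - y) * 100 ** (-i) for i, (x, y) in enumerate(zip(a, b)))

assert abs(version_diff("3.6.4", "3.6") - 4e-4) < 1e-12   # matches the example
```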
An Example • Vendor, soft, version, month, day, vdiff, ttpv, ttnv • linux, kernel, 2.6.18, 05, 02, 0, 70, 5 • linux, kernel, 2.6.19.2, 05, 07, 1.02E−4, 5, 281
Functions Available For Our Approach In Weka • Least Mean Square • Linear Regression • Multilayer Perceptron • SMOreg • RBF Network • Gaussian Processes
Several Statistical Results • Function: Linear Regression • Training Dataset: 66% of the Linux instances (randomly picked, since 2005). • Test Dataset: the remaining 34% • Test Result: • Correlation coefficient 0.5127 • Mean absolute error 11.2358 • Root mean squared error 25.4037 • Relative absolute error 107.629 % • Root relative squared error 86.0388 % • Total Number of Instances 17967
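The experiments above were run in Weka; as a rough, hypothetical re-creation of the 66%/34% random split and the linear-regression fit (not the actual setup), a scikit-learn/numpy sketch could look like the following, assuming a numeric feature matrix X (month, day, vdiff, ttpv) and target y = ttnv.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def evaluate_linear_regression(X, y, seed=1):
    """Hypothetical 66%/34% random split plus linear regression;
    X and y are assumed to be prepared elsewhere."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.34, random_state=seed)
    pred = LinearRegression().fit(X_train, y_train).predict(X_test)
    corr = np.corrcoef(pred, y_test)[0, 1]           # correlation coefficient
    mae = np.mean(np.abs(pred - y_test))             # mean absolute error
    rmse = np.sqrt(np.mean((pred - y_test) ** 2))    # root mean squared error
    return corr, mae, rmse
```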
Several Definitions About “Error” • Mean absolute error: MAE = ( |p1 − a1| + … + |pn − an| ) / n • Root mean squared error: RMSE = sqrt( ( (p1 − a1)² + … + (pn − an)² ) / n ), where p is the predicted value, a the actual value, and n the number of instances.
Several Definitions About “Error” (Cont) • Relative absolute error: RAE = ( |p1 − a1| + … + |pn − an| ) / ( |a1 − ā| + … + |an − ā| ) • Root relative squared error: RRSE = sqrt( ( (p1 − a1)² + … + (pn − an)² ) / ( (a1 − ā)² + … + (an − ā)² ) ), where ā is the mean of the actual values.
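These are the standard Weka regression-error definitions; a small numpy sketch of the two relative measures follows (Weka takes the baseline mean ā from the training data; here it is approximated by the mean of the test actuals).

```python
import numpy as np

def relative_errors(pred, actual):
    """Relative absolute error and root relative squared error against the
    naive predictor that always outputs the mean of the actual values."""
    pred, actual = np.asarray(pred, float), np.asarray(actual, float)
    baseline = np.abs(actual - actual.mean())
    rae = np.sum(np.abs(pred - actual)) / np.sum(baseline)
    rrse = np.sqrt(np.sum((pred - actual) ** 2) / np.sum(baseline ** 2))
    return rae * 100, rrse * 100    # percentages, as in the result slides
```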
Several Statistical Results • Function: Least Mean Square • Training Dataset: 66% of the Linux instances (randomly picked, since 2005). • Test Dataset: the remaining 34% • Test Result: • Correlation coefficient -0.1501 • Mean absolute error 7.6676 • Root mean squared error 30.6038 • Relative absolute error 73.449 % • Root relative squared error 103.6507 % • Total Number of Instances 17967
Several Statistical Results • Function: Multilayer Perceptron • Training Dataset: 66% of the Linux instances (randomly picked, since 2005). • Test Dataset: the remaining 34% • Test Result: • Correlation coefficient 0.9886 • Mean absolute error 0.4068 • Root mean squared error 4.6905 • Relative absolute error 3.7802 % • Root relative squared error 15.1644 % • Total Number of Instances 17967
Several Statistical Results • Function: RBF Network • Training Dataset: 66% of the Linux instances (randomly picked, since 2005). • Test Dataset: the remaining 34% • Test Result: • Linear Regression Model: ttnv = -15.3206 * pCluster_0_1 + 21.6205 • Correlation coefficient 0.1822 • Mean absolute error 10.5857 • Root mean squared error 29.048 • Relative absolute error 101.4023 % • Root relative squared error 98.3814 % • Total Number of Instances 17967
Summary Of Current Results • Linear Regression: Not accurate enough, but looks promising (correlation coefficient: 0.5127). • Least Mean Square: Probably not suitable for our approach (negative correlation coefficient). • Multilayer Perceptron: Looks good, but it cannot provide us with a linear model.
Summary Of Current Results (Cont) • SMOreg: For most vendors, it takes too long to finish (usually more than 80 hours). • RBF Network: Not very accurate. • Gaussian Processes: Runs out of heap memory for most of our experiments.
Possible Ways To Improve The Accuracy Of Our Models • Add CVSS metrics as predictive attributes. • Binarize/discretize the predicted attribute, e.g. divide ttnv/ttpv into several categories (sketched below). • Use regression SVM with multiple kernels.
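One way the discretization idea could be tried is to bin ttnv into ordinal categories; the bin edges below are purely illustrative assumptions, not values we have settled on.

```python
import numpy as np

def discretize_ttnv(ttnv_values, edges=(7, 30, 90)):
    """Map ttnv (in days) to categories: 0 = under a week, 1 = under a month,
    2 = under three months, 3 = longer. The edges are hypothetical."""
    return np.digitize(np.asarray(ttnv_values), bins=list(edges))
```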
Plan For Next Phase • Find an optimal model for our prediction. • If we get a good model, investigate how to apply it with MulVAL; otherwise, find out why it is not accurate enough.