Comparison between classification methods and building a reference set. Costeux Jean-Laurent, Marcin Pietrzyk. COST Meeting in Samos
Why identification? • Many applications in network security, traffic engineering and monitoring. • For example: • To integrate quality of service into next-generation networks. • To detect virus attacks. • To stop connections that belong to a well-defined type of application (e.g. a peer-to-peer application). • Anomaly detection and network troubleshooting.
What methods are available? • Ports [IANA]. • - Other applications may use a well-known port number. • + Still valid for some applications. • + Can serve as a discriminator when combined with other features. • DPI. • Statistical methods. • New ideas.
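The port-based approach above can be sketched as a simple lookup. This is a minimal illustration, not an operational classifier; the port-to-application table below is a hypothetical excerpt, whereas a real deployment would use the full IANA registry.

```python
# Minimal sketch of port-based classification.
# WELL_KNOWN_PORTS is a hypothetical excerpt of the IANA registry.
WELL_KNOWN_PORTS = {
    20: "ftp", 21: "ftp", 25: "mail", 80: "web",
    110: "mail", 119: "news", 443: "web",
    4662: "p2p",  # default eDonkey port
}

def classify_by_port(src_port: int, dst_port: int) -> str:
    """Label a flow by the first well-known port seen; 'unknown' otherwise."""
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"

print(classify_by_port(51234, 80))     # web
print(classify_by_port(51234, 4662))   # p2p
print(classify_by_port(51234, 52000))  # unknown
```

The weakness noted on the slide is visible here: any application that picks port 80 is labelled "web", and anything on ephemeral ports falls into "unknown".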
What methods are available? • Ports. • DPI. + Most accurate to date (or at least believed to be). + Applicable on-line. + Widely used as ground truth for other methods. • - Needs full packet payload (9 TB/24 h in our case…). • - Fails with encryption. • - Other issues ahead. • Statistical methods. • New ideas. Internal tools can overcome this.
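At its core, DPI matches payload bytes against per-application signatures. The sketch below is a toy version of that idea; the regex signatures are illustrative and drastically simplified, whereas engines such as Bro ship far richer rule sets.

```python
import re

# Hypothetical, simplified payload signatures for illustration only.
SIGNATURES = {
    "web": re.compile(rb"^(GET|POST|HEAD) "),   # HTTP request line
    "ftp": re.compile(rb"^220[ -]"),            # FTP server greeting
    "smtp": re.compile(rb"^(HELO|EHLO) "),      # SMTP client hello
}

def classify_payload(payload: bytes) -> str:
    """Return the first application whose signature matches the payload."""
    for app, sig in SIGNATURES.items():
        if sig.search(payload):
            return app
    return "unknown"

print(classify_payload(b"GET /index.html HTTP/1.1\r\n"))  # web
print(classify_payload(b"\x17\x03\x01\x00\x20"))          # unknown
```

The second call shows the limitation stated on the slide: encrypted bytes match no signature, so the flow stays "unknown".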
DPI applied to our platform (LIVE) • Bro [BRO] installed on four probes, signature set from Erman's work [SIGCOMM06]. + Live classification. + No need for payload storage. • Still just an improvement over Wireshark. • Internal tool.
Traffic breakdown by Bro, with the Erman [SIGCOMM06] signature set • A good paper on traffic recognition. • Its signature list does not give satisfying results when applied to an ADSL trace. • Encryption in eMule has become more and more popular since 2006.
What methods are available? • Ports. • DPI. • Statistical methods [SIGCOMM06] [CONEXT06]. + Payload independent. + Could classify encrypted traffic. + A small (1%) payload training set can be sufficient to train a classifier [IMRG]. • - More complex processing. • - Can they do any better than their training set, or are they just more scalable? • - Still easy to evade. • New ideas. In any case, we first need a good test bed we can rely on.
DPI tools as a ground truth? • Internal DPI tool over 20 GB of data.
Our reference set • Dedicated captures on legacy ports, post-processed with DPI. • FTP • Mail • Web • P2P • Unknown • News • Streaming • Control • DB • Games • Intersection of both DPI tools, all users. • Detected only by the internal tool.
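The "intersection of both DPI tools" step can be sketched as follows. This is a hypothetical reconstruction, assuming each tool's output is a per-flow label dictionary keyed by the 5-tuple; only flows where both tools agree enter the reference set, and flows seen only by the internal tool are kept aside.

```python
# Sketch of building a reference set from two DPI verdicts.
# bro_labels / internal_labels: dicts mapping a flow 5-tuple to a label.
def build_reference_set(bro_labels, internal_labels):
    reference, internal_only = {}, {}
    for flow, label in internal_labels.items():
        if bro_labels.get(flow) == label:
            reference[flow] = label       # both tools agree: trusted
        elif flow not in bro_labels:
            internal_only[flow] = label   # detected only by internal tool
        # flows where the tools disagree are discarded
    return reference, internal_only
```

Discarding disagreements trades coverage for confidence, which matches the slide's goal of a reference set one can actually rely on.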
Our first reference set • 6 minutes of traffic from 20,000 users • ~300,000 flows • 20 GB including payload • Class label plus over 80 features describing each flow • Same amount of traffic for each application • In this format we can feed it into statistical tools, e.g. Weka [WEKA] • Unknown traffic is probably mainly eDonkey traffic.
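The "class + features per flow" format that Weka consumes is its ARFF text format. The sketch below is a minimal, hypothetical exporter; the feature names and class values are placeholders, not the actual 80-feature schema used on the platform.

```python
# Sketch: export per-flow features to Weka's ARFF format.
# Feature names and class values below are illustrative placeholders.
def write_arff(path, feature_names, flows):
    """flows: list of (feature_values, class_label) tuples."""
    with open(path, "w") as f:
        f.write("@RELATION flows\n")
        for name in feature_names:
            f.write(f"@ATTRIBUTE {name} NUMERIC\n")
        f.write("@ATTRIBUTE class {web,mail,ftp,p2p,unknown}\n")
        f.write("@DATA\n")
        for values, label in flows:
            f.write(",".join(str(v) for v in values) + f",{label}\n")
```

Once written, such a file can be opened directly in the Weka explorer to train and cross-validate a classifier on the reference set.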
Dataset issues • Still a large fraction of unknown traffic • Is it possible to build a good reference set for unknown traffic? • Yes for eMule, but what happens when a new application comes? • Is our ground truth really a ground truth? • Can a statistical tool do any better than its ground truth? • Even if so, how do we verify it? • Hand-made traces are not representative enough. • Building a reference set is still manual; is it possible to automate it?
Other methods available? • Ports. • DPI. • Statistical methods. • New ideas. • "Googling endpoints" [SIGCOMM08] • Characterizes endpoints by combining information freely available on the web. • Endpoint profiling at a global scale. • Not yet applicable at the micro scale. • Pushing the problem to the edge [PAM08] • The idea of an interface flagging packets at the end hosts. • This approach would solve the issue. • But what about privacy? We would need additional software at each end host. • Combining DPI and statistical properties of the flows (Otarie). • Both promising, but not yet suitable for our case.
Combining DPI and statistical properties (example) • The idea of using the first few packets to identify a flow [CONEXT06]. • Should be combined with DPI.
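The early-identification idea from [CONEXT06] uses the sizes of a flow's first few packets as a feature vector and assigns the flow to the nearest per-application cluster. The sketch below illustrates only the nearest-centroid step; the two centroids and their values are made up for this example and are not from the paper.

```python
# Sketch of early identification from the first packets [CONEXT06]:
# nearest-centroid matching on first-packet sizes.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Hypothetical centroids: mean sizes of the first 3 data packets per app.
CENTROIDS = {
    "web":  (120, 1460, 1460),  # short request, full-size responses
    "smtp": (60, 40, 90),       # short banner/command exchange
}

def early_classify(first_packet_sizes):
    """Assign the flow to the application with the closest centroid."""
    return min(CENTROIDS,
               key=lambda app: euclidean(CENTROIDS[app], first_packet_sizes))

print(early_classify((110, 1400, 1300)))  # web
print(early_classify((70, 50, 80)))       # smtp
```

Because only packet sizes are used, this works on encrypted flows too, which is why the slide proposes combining it with DPI rather than replacing DPI outright.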
CONCLUSIONS • Good reference sets are very hard to obtain. • Many works are based on a rather weak base truth; this makes it hard to compare results. • The process of building test beds is still "manual". • Differences between tools appear even for known protocols. • Not enough flows for less popular applications (several captures are needed). • Even the tools used to obtain ground truth often differ. • Is it a battle we can win?
References • [IANA] Internet Assigned Numbers Authority (IANA), http://www.iana.org/assignments/port-numbers • [SIGCOMM08] I. Trestian, S. Ranjan, A. Kuzmanovic, and A. Nucci, "Unconstrained Endpoint Profiling (Googling the Internet)", in Proceedings of ACM SIGCOMM 2008, Seattle, WA, August 2008. • [PAM08] G. Szabo, D. Orincsay, S. Malomsoky and I. Szabó, "On the Validation of Traffic Classification Algorithms", PAM 2008, Belgium. • [BRO] www.bro-ids.org • [SIGCOMM06] J. Erman, M. Arlitt, A. Mahanti, "Traffic Classification Using Clustering Algorithms", Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data, Pisa, Italy, pages 281-286, September 2006. • [CONEXT06] Laurent Bernaille, Renata Teixeira and Kavé Salamatian, "Early Application Identification", Conference on Future Networking Technologies (CoNEXT), Lisbon, December 2006. • [IMRG] IMRG Workshop on Application Classification and Identification (WACI), Cambridge, MA, USA, October 2007. • [WEKA] Weka 3 - Data Mining with Open Source Machine Learning Software, www.cs.waikato.ac.nz/ml/weka/