Comparison between classification methods and building reference set



  1. Comparison between classification methods and building reference set Jean-Laurent Costeux, Marcin Pietrzyk COST Meeting in Samos

  2. Why identification? • Many applications in network security, traffic engineering and monitoring. For example: • To integrate quality of service into the next generation of networks. • To detect virus attacks. • To block connections that belong to a well-defined type of application (e.g., a peer-to-peer application). • Anomaly detection and network troubleshooting.

  3. Ports 20 and 21 over our platform (DPI) [chart: much of the traffic on these legacy FTP ports is classified as P2P]

  4. What methods are available? • Ports [IANA]. • − Other applications may use a well-known port number. • + Still valid for some applications. • + Can be used as a discriminator if combined with other features. • DPI. • Statistical methods. • New ideas.
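As a point of comparison for the methods above, port-based classification is essentially a table lookup, which also explains why it is so easy to evade. A minimal sketch in Python, assuming a small illustrative subset of the IANA port registry (the table and function names are ours, not part of any tool mentioned here):

```python
# Port-based classification: a plain lookup on well-known port numbers.
# WELL_KNOWN_PORTS is an illustrative subset of the IANA registry.
WELL_KNOWN_PORTS = {
    20: "ftp-data",
    21: "ftp",
    25: "smtp",
    80: "http",
    110: "pop3",
    443: "https",
    4662: "edonkey",  # commonly associated with eDonkey, not IANA-assigned
}

def classify_by_port(src_port: int, dst_port: int) -> str:
    """Label a flow by its well-known port, 'unknown' otherwise."""
    for port in (dst_port, src_port):
        if port in WELL_KNOWN_PORTS:
            return WELL_KNOWN_PORTS[port]
    return "unknown"

# A P2P client listening on port 21 would be misclassified as FTP,
# which is exactly the weakness the previous slide illustrates.
print(classify_by_port(51234, 21))  # -> 'ftp'
```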

  5. What methods are available? • Ports. • DPI. + Most accurate to date (or just believed to be so). + Applicable on-line. + Widely used as ground truth for other methods. • − Needs full packet payload (9 TB per 24 h in our case…). • − Fails with encryption. • − Other issues to come. • Statistical methods. • New ideas. Internal tools can overcome this.
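To make the contrast with ports concrete, here is a minimal sketch of what payload-based DPI does: match signature patterns against the first payload bytes of a flow. The regexes below are illustrative placeholders, not the signature sets used by Bro or by our internal tool:

```python
import re

# Illustrative DPI signatures: map an application name to a pattern
# expected near the start of the flow payload. Real signature sets
# (e.g. those from [SIGCOMM06]) are far larger and more precise.
SIGNATURES = {
    "http": re.compile(rb"^(GET|POST|HEAD) "),
    "smtp": re.compile(rb"^(HELO|EHLO) "),
    "edonkey": re.compile(rb"^\xe3"),  # assumed eDonkey header byte
}

def classify_by_payload(payload: bytes) -> str:
    """Return the first application whose signature matches the payload."""
    for app, pattern in SIGNATURES.items():
        if pattern.search(payload):
            return app
    return "unknown"

print(classify_by_payload(b"GET /index.html HTTP/1.1\r\n"))  # -> 'http'
```

This also shows why DPI needs the full payload and fails on encrypted traffic: once the payload is encrypted or obfuscated, no such pattern survives.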

  6. DPI applied to our platform (LIVE) • Bro [BRO] installed on four probes, signature set from Erman's work [SIGCOMM06]. + Live classification. + No need for payload storage. • Still just an improvement over Wireshark. • Internal tool.

  7. Traffic breakdown by Bro, with the Erman [SIGCOMM06] signature set • A good paper on traffic recognition. • Its signature list does not give satisfactory results when applied to our ADSL traces. • Encryption in eMule has grown more and more popular since 2006.

  8. What methods are available? • Ports. • DPI. • Statistical methods. [SIGCOMM06] [CONEXT06] + Payload independent. + Can classify encrypted traffic. + A small (1%) payload training set can be sufficient to train a classifier [IMRG]. • − More complex processing. • − Can they do any better than their training set, or are they just more scalable? • − Still easy to evade. • New ideas. In any case, we first need a good test bed we can rely on.
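A minimal sketch of the statistical approach, assuming flow features have already been extracted and labelled by a DPI ground truth. We use a supervised scikit-learn classifier for illustration; [SIGCOMM06] actually uses clustering, and the feature layout here stands in for our 80+ per-flow features:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_flow_classifier(X, y):
    """X: one row per flow (e.g. duration, total bytes, mean packet size, ...);
    y: application labels taken from the DPI-based reference set."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    # Note: accuracy is measured against the DPI labels, so this cannot
    # show the classifier doing better than its own ground truth.
    print("accuracy vs. DPI labels:", clf.score(X_test, y_test))
    return clf
```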

  9. DPI tools as a ground truth? • Internal DPI tool applied to 20 GB of data.

  10. Our reference set • Dedicated captures on legacy ports, post-processed with DPI. • Classes: FTP, Mail, Web, P2P, Unknown, News, Streaming, Control, DB, Games. • Intersection of both DPI tools, all users; some flows are detected only by the internal tool.
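A minimal sketch of the intersection step, assuming each DPI tool has produced a per-flow label dictionary (the flow-ID keys and the "unknown" convention are our assumptions):

```python
def build_reference_set(labels_tool_a: dict, labels_tool_b: dict) -> dict:
    """Keep a flow in the reference set only when both DPI tools
    assign it the same (non-unknown) application label."""
    reference = {}
    for flow_id, label_a in labels_tool_a.items():
        label_b = labels_tool_b.get(flow_id)
        if label_b == label_a and label_a != "unknown":
            reference[flow_id] = label_a
    return reference
```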

  11. Our first reference set • 6 minutes of traffic from 20,000 users. • ~300,000 flows. • 20 GB including payload. • Class label plus over 80 features describing each flow. • Same amount of traffic for each application. • In this format we can feed it into statistical tools, e.g. Weka [WEKA]. • Unknown traffic is probably mainly eDonkey traffic.
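Since Weka reads the ARFF format, exporting the reference set in "class + features" form is straightforward. A minimal sketch with illustrative feature names (`write_arff` and its arguments are ours, not a Weka API):

```python
def write_arff(path, feature_names, classes, flows):
    """Write labelled flows to Weka's ARFF format.
    flows: iterable of (feature_vector, class_label) pairs."""
    with open(path, "w") as f:
        f.write("@relation flows\n")
        for name in feature_names:
            f.write(f"@attribute {name} numeric\n")
        f.write("@attribute class {" + ",".join(classes) + "}\n")
        f.write("@data\n")
        for features, label in flows:
            f.write(",".join(str(v) for v in features) + "," + label + "\n")

# Example with two illustrative features:
write_arff("flows.arff", ["duration", "total_bytes"],
           ["WEB", "P2P", "MAIL"],
           [([12.3, 45678], "WEB"), ([310.0, 9876543], "P2P")])
```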

  12. Dataset issues • A large fraction of the traffic is still unknown. • Is it possible to build a good reference set out of unknown traffic? Yes for eMule, but what happens when a new application arrives? • Is our ground truth really a ground truth? • Can a statistical tool do any better than its ground truth? Even if so, how do we verify it? • Hand-made traces are not representative enough. • Building a reference set is still manual; can it be automated?

  13. Other methods available? • Ports. • DPI. • Statistical methods. • New ideas. • "Googling endpoints" [SIGCOMM08]: characterize endpoints by combining information freely available on the web. Endpoint profiling at a global scale; not yet applicable at micro scale. • Pushing the problem to the edge [PAM08]: the idea of an interface flagging packets at the end hosts. This approach would solve the issue, but what about privacy? We would also need additional software at each end host. • Combining DPI and statistical properties of the flows (Otarie). • Both promising, but not yet suitable for our case.

  14. Combining DPI and statistical properties (example) • Idea of using the first few packets to identify a flow [CONEXT06]. • Should be combined with DPI.
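A minimal sketch of the early-classification idea: describe a flow by the sizes and directions of its first few packets and assign it to the nearest per-application centroid. [CONEXT06] actually fits Gaussian mixture clusters; the nearest-centroid rule and the centroid values below are simplified placeholders:

```python
import math

# Signed packet sizes: positive = client-to-server, negative = reverse.
# These centroids are illustrative, not trained values.
CENTROIDS = {
    "http":    [120, -1460, -1460, 60],
    "edonkey": [60, -60, 100, -80],
}

def classify_early(first_packet_sizes, centroids=CENTROIDS):
    """Assign the flow to the application with the closest centroid."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(centroids,
               key=lambda app: dist(first_packet_sizes, centroids[app]))

print(classify_early([110, -1400, -1460, 52]))  # -> 'http'
```

Combining this with DPI means the statistical decision can be made after a handful of packets, while DPI confirms or corrects the label whenever a payload signature is available.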

  15. CONCLUSIONS • Good reference sets are very hard to obtain. • Many works are based on quite weak ground truth, which makes results hard to compare. • The process of building test beds is still "manual". • Tools differ even for well-known protocols. • Not enough flows for less popular applications (several captures are needed). • Even the tools used to obtain the ground truth often differ. • Is this a battle we can win?

  16. References • [IANA] Internet Assigned Numbers Authority (IANA), http://www.iana.org/assignments/port-numbers • [SIGCOMM08] I. Trestian, S. Ranjan, A. Kuzmanovic, and A. Nucci, "Unconstrained Endpoint Profiling (Googling the Internet)", Proceedings of ACM SIGCOMM 2008, Seattle, WA, August 2008. • [PAM08] G. Szabo, D. Orincsay, S. Malomsoky and I. Szabó, "On the Validation of Traffic Classification Algorithms", PAM 2008, Belgium. • [BRO] www.bro-ids.org • [SIGCOMM06] J. Erman, M. Arlitt, A. Mahanti, "Traffic Classification Using Clustering Algorithms", Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data, Pisa, Italy, pages 281-286, September 2006. • [CONEXT06] Laurent Bernaille, Renata Teixeira and Kavé Salamatian, "Early Application Identification", Conference on Future Networking Technologies (CoNEXT), Lisbon, December 2006. • [IMRG] IMRG Workshop on Application Classification and Identification (WACI), Cambridge, MA, USA, October 2007. • [WEKA] Weka 3: Data Mining with Open Source Machine Learning Software, www.cs.waikato.ac.nz/ml/weka/

  17. Thank you!

  18. Accuracy vs. size of training set [IMRG]
