
Jae Yoon Chung 1, Byungchul Park 1, Young J. Won 1, John Strassner 2, and James W. Hong 1, 2

An Effective Similarity Metric for Application Traffic Classification. Jae Yoon Chung 1, Byungchul Park 1, Young J. Won 1, John Strassner 2, and James W. Hong 1, 2 {dejavu94, fates, yjwon, johns, jwkhong}@postech.ac.kr April 20, 2010



Presentation Transcript


  1. An Effective Similarity Metric for Application Traffic Classification • Jae Yoon Chung (1), Byungchul Park (1), Young J. Won (1), John Strassner (2), and James W. Hong (1, 2) • {dejavu94, fates, yjwon, johns, jwkhong}@postech.ac.kr • April 20, 2010 • (1) Dept. of Computer Science and Engineering, POSTECH, Korea • (2) Division of IT Convergence Engineering, POSTECH, Korea

  2. Contents • Introduction • Related Work • Research Goal • Proposed Methodology • Evaluation • Conclusion and Future Work

  3. Introduction • Traffic classification for network management • Network planning • QoS management • Security • Etc. • Diversity of today’s Internet traffic • New types of network applications • Increase in P2P traffic • Various techniques for avoiding detection • Document classification → Traffic classification • Document classification in natural language processing • Comparing packet payload vectors is analogous to document classification

  4. Related Work • Well-known port-based classification • Low complexity • Low accuracy (approximately 50-70%) • Signature-based classification • High reliability • Exhaustive search required to find signatures • E.g., Snort, LASER • Behavior-based classification • Focuses on traffic patterns and connection behaviors • Questionable accuracy • E.g., BLINC • Machine learning-based classification • Utilizes statistical information • High consumption of computing resources • E.g., SVM, Bayesian Network • Similarity-based classification • Utilizes a document classification approach • Questionable scalability • E.g., flow similarity calculation [IPOM ‘09]

  5. Summary of IPOM 2009 • Proposed a new traffic classification approach • Utilizes a document classification approach based on Cosine similarity calculation • Proposed a new packet representation using the Vector Space Model • Proposed a flow similarity calculation methodology that compares the packets in a flow sequentially • Validated the methodology using real-world traffic on our campus backbone network • Cannot classify flows in an asymmetric routing environment • No comparison of Cosine similarity with other similarity metrics • Cosine similarity is a common similarity metric for human-document classification • High variation of the similarity value according to term frequency

  6. Research Goals • Propose a new traffic classification algorithm • Automation of the signature generation step • Generate an application vector, which serves as an alternative signature, using simple vector operations • Form groups according to traffic type and operation within single-application traffic • Accurate and feasible traffic classification algorithm • Classify application traffic using similarity calculation • Solve the asymmetric routing classification problem • Validation using real-world network traffic to compare similarity metrics • Complexity analysis • Compare three similarity metrics for traffic classification • Jaccard similarity: counts fragments of the signature • Cosine similarity: weights the signature heavily • RBF similarity: based on the Euclidean distance between packets

  7. Proposed Methodology

  8. Vector Space Modeling • Vector Space Modeling • An algebraic model representing text documents as vectors • Widely used for document classification • Categorizes electronic documents based on their content (e.g., e-mail spam filtering) • Document classification vs. Traffic classification • Document classification • Find the stored text documents that satisfy certain information queries • Traffic classification • Classify network traffic according to the type of application based on traffic information

  9. Payload Vector Conversion (1/2) • Definition of a word in a payload • Payload data within an i-byte sliding window • |Word set| = 2^(8 × i), where i is the sliding window size in bytes • Definition of payload vector • A term-frequency vector, as in NLP • Payload Vector = [w1 w2 … wn]^T

  10. Payload Vector Conversion (2/2) • [Figure: words marked in a payload] • The word size is 2 bytes and the word set size is 2^16 • The simplest case for representing the order of content in payloads
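The conversion described on slides 9 and 10 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `payload_words` and `payload_vector` are hypothetical, the one-byte step of the sliding window is an assumption (the slides give only the window size), and the vector is stored sparsely as a `Counter` instead of a dense 2^16-element array.

```python
from collections import Counter


def payload_words(payload: bytes, window: int = 2) -> list[bytes]:
    """Return every `window`-byte word seen by a sliding window.

    Assumption: the window advances one byte at a time.
    """
    return [payload[i:i + window] for i in range(len(payload) - window + 1)]


def payload_vector(payload: bytes, window: int = 2) -> Counter:
    """Sparse term-frequency vector: word -> number of occurrences.

    Words that never occur are simply absent, which stands in for the
    zero entries of the full 2^(8 * window)-dimensional vector.
    """
    return Counter(payload_words(payload, window))


if __name__ == "__main__":
    vec = payload_vector(b"GET GET")
    for word, count in vec.items():
        print(word, count)
```

With a 2-byte window, the 7-byte payload `b"GET GET"` yields six overlapping words, and the repeated words `b"GE"` and `b"ET"` each get a term frequency of 2.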

  11. Similarity Metrics for Traffic Classification • Jaccard similarity • The size of the intersection of the sample sets X and Y divided by the size of their union • Cosine similarity • Measures the similarity of two n-dimensional vectors X and Y by finding the cosine of the angle between them • RBF similarity • Radial basis function of the Euclidean distance between two vectors X and Y
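The three metrics can be written against sparse term-frequency vectors as follows. This is a sketch under stated assumptions: the transcript does not preserve the slide's formulas, so the Jaccard variant here treats each vector as the set of words it contains, and the RBF form `exp(-gamma * d^2)` with a default `gamma` of 0.1 is one common parameterization chosen for illustration, not necessarily the one used in the talk.

```python
import math
from collections import Counter


def jaccard(x: Counter, y: Counter) -> float:
    """|X ∩ Y| / |X ∪ Y| over the sets of words present in each vector."""
    wx, wy = set(x), set(y)
    union = wx | wy
    # Convention chosen here: two empty vectors have similarity 0.
    return len(wx & wy) / len(union) if union else 0.0


def cosine(x: Counter, y: Counter) -> float:
    """Cosine of the angle between the two term-frequency vectors."""
    dot = sum(count * y[word] for word, count in x.items())
    norm_x = math.sqrt(sum(c * c for c in x.values()))
    norm_y = math.sqrt(sum(c * c for c in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def rbf(x: Counter, y: Counter, gamma: float = 0.1) -> float:
    """exp(-gamma * squared Euclidean distance); gamma is an assumed parameter."""
    squared_distance = sum((x[w] - y[w]) ** 2 for w in set(x) | set(y))
    return math.exp(-gamma * squared_distance)


if __name__ == "__main__":
    a = Counter({"ab": 2, "bc": 1})
    b = Counter({"ab": 1, "cd": 1})
    print(jaccard(a, b), cosine(a, b), rbf(a, b))
```

All three functions return values between 0 and 1 for non-negative term frequencies, which matches the soft-classification idea of reporting similarity as a number in that range.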

  12. Application Vector Heuristics • Application vector • Represents typical packets that are generated by target applications as the center (basis) of each cluster • Application vector generator • Reads packets from the target application trace • Divides the packets into several types of clusters without any pre-processing • [Figure: an application trace is fed to the application vector generator, which produces traffic clusters 1 and 2 and application vectors 1 to 3]

  13. Application Vector Generation • Unsupervised grouping within single-application traffic • Provides fine-grained classification • Classifies single-application traffic according to traffic types • [Figure: packets 1 to 6 of the application traffic are grouped into Cluster 1 and Cluster 2, represented by application vectors 1 and 2]
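The slides do not spell out the grouping algorithm, so the following is only one plausible sketch of unsupervised application vector generation. It assumes a greedy single-pass clustering with Cosine similarity and a hypothetical `threshold` of 0.5, and it takes the mean of each cluster's payload vectors as that cluster's application vector. The name `generate_application_vectors` is illustrative.

```python
import math
from collections import Counter


def cosine(x, y) -> float:
    """Cosine similarity between two sparse word -> frequency mappings."""
    dot = sum(count * y.get(word, 0) for word, count in x.items())
    norm_x = math.sqrt(sum(c * c for c in x.values()))
    norm_y = math.sqrt(sum(c * c for c in y.values()))
    return dot / (norm_x * norm_y) if norm_x and norm_y else 0.0


def generate_application_vectors(packets, threshold: float = 0.5):
    """Greedy single-pass clustering (an assumed heuristic).

    Each payload vector joins the most similar existing cluster if the
    similarity reaches `threshold`; otherwise it starts a new cluster.
    Returns one application vector (the cluster mean) per cluster.
    """
    sums = []   # per-cluster summed term frequencies
    sizes = []  # per-cluster packet counts
    for vec in packets:
        best, best_sim = None, threshold
        for i, total in enumerate(sums):
            # Cosine is scale-invariant, so the sum stands in for the mean.
            sim = cosine(vec, total)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            sums.append(Counter(vec))
            sizes.append(1)
        else:
            sums[best].update(vec)
            sizes[best] += 1
    return [{w: c / n for w, c in total.items()} for total, n in zip(sums, sizes)]


if __name__ == "__main__":
    trace = [Counter({"a": 2, "b": 1}), Counter({"a": 1, "b": 1}), Counter({"x": 3})]
    for app_vector in generate_application_vectors(trace):
        print(app_vector)
```

In this toy trace the first two packets share most of their words and fall into one cluster, while the third starts its own, giving two application vectors for a single application. The result of a greedy pass depends on packet order and on the threshold, which is one reason the choice of clustering method matters in practice.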

  14. Two-stage Traffic Classification • Packet-level clustering • Classify single packets regardless of flow information • Compare payload vectors with application vectors by calculating similarity values • Mark each packet with its application and priority • Allow permutation of the packet sequence • Flow-level classification • Rearrange packets according to flow information • Ignore mis-clustered packets that are caused by protocol ambiguities • HTTP for Web • HTTP for P2P

  15. Two-stage Traffic Classification • [Figure: backbone traffic passes through two stages. In Stage 1, individual packets (F1 P1 to F2 P4) are clustered against application vectors 1 to 3 into Clusters 1 to 3, with the labels BitTorrent, FileGuri, and Melon; in Stage 2, the packets are regrouped into Flow 1 and Flow 2 and classified as BitTorrent or FileGuri traffic, with mis-clustered packets marked]
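The two stages can be sketched as two small functions. This is an illustration under assumptions, not the method as presented: the per-packet `threshold`, the majority vote used to discard mis-clustered packets in stage 2, the simple word-set Jaccard similarity, and all names (`classify_packets`, `classify_flows`, the toy vectors) are hypothetical. Priorities, which the slides mention, are left out.

```python
from collections import Counter, defaultdict


def word_set_jaccard(x, y) -> float:
    """Jaccard similarity over the sets of words present in each vector."""
    wx, wy = set(x), set(y)
    union = wx | wy
    return len(wx & wy) / len(union) if union else 0.0


def classify_packets(packets, app_vectors, similarity, threshold=0.5):
    """Stage 1: label every packet on its own, ignoring flow information.

    `packets` is a list of (flow_id, payload_vector) pairs. A packet whose
    best similarity stays below `threshold` gets the label None.
    """
    labeled = []
    for flow_id, vec in packets:
        best_app, best_sim = None, threshold
        for app, app_vec in app_vectors.items():
            sim = similarity(vec, app_vec)
            if sim >= best_sim:
                best_app, best_sim = app, sim
        labeled.append((flow_id, best_app))
    return labeled


def classify_flows(labeled):
    """Stage 2: regroup packets by flow and take a majority vote.

    The vote (an assumed rule) outweighs a minority of mis-clustered packets.
    Flows with no labeled packet are omitted from the result.
    """
    votes = defaultdict(Counter)
    for flow_id, app in labeled:
        if app is not None:
            votes[flow_id][app] += 1
    return {flow: counts.most_common(1)[0][0] for flow, counts in votes.items()}


if __name__ == "__main__":
    app_vectors = {"BitTorrent": {"bt": 1, "pr": 1}, "FileGuri": {"fg": 1, "ur": 1}}
    packets = [
        ("F1", {"bt": 1, "pr": 1}),
        ("F1", {"bt": 2, "pr": 1}),
        ("F1", {"fg": 1, "ur": 1}),  # mis-clustered packet inside flow F1
        ("F2", {"fg": 1, "ur": 2}),
    ]
    labeled = classify_packets(packets, app_vectors, word_set_jaccard)
    print(classify_flows(labeled))
```

Because stage 1 never looks at packet order or direction, each packet can be labeled even when only part of a flow is visible; the flow information is only needed afterwards, to reconcile the per-packet labels.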

  16. Evaluation

  17. Classifying Real-world Traffic • Fixed-port Applications • Traffic trace captured at one of the two Internet junctions at POSTECH using an optical tap • Ground-truth traffic • Some active flows among the application traffic, distinguished by their use of an active port number • Target Applications • FileGuri, ClubBox, Melon, BigFile • Untraceable-port Applications • Traffic Measurement Agent (TMA) • Monitors the network interface of the host • Records log data (flow five-tuple, process name, packet count, etc.) • Target Applications • eMule, BitTorrent • [Figure: backbone traffic, ground-truth traffic, and target application traffic]

  18. Classification Accuracy • Classification accuracy comparison • Fixed-port applications • FileGuri, ClubBox, Melon, BigFile • Untraceable-port applications • eMule, BitTorrent • Jaccard similarity • Reliable: counts common segments • Cosine similarity • Emphasizes common segments: cannot distinguish ambiguous packets • RBF similarity • Difficult to set the parameter: a guideline for choosing it is needed • BitTorrent traffic on the backbone network • Traffic over-classification by Cosine similarity • High false positive rate of Cosine similarity

  19. Histogram of Similarity Values

  20. CDF of Distance among Payload Vectors

  21. Complexity Analysis

  22. Conclusion and Future Work • Developed a new approach to traffic classification • Applies a document classification approach to traffic classification • Unsupervised classification to form clusters within single-application traffic • Two-stage classification algorithm to solve the asymmetric routing classification problem • Linear time complexity • Compared three similarity metrics • Provides a guideline for selecting similarity metrics for traffic classification • Provides soft classification that represents similarity as a numerical value ranging from 0 to 1 • Future Work • Enhance the unsupervised classification methodology for automated signature generation • Extract orthogonal application vectors to improve scalability
