Smarter Searching for a Network Packet Database
William (Bill) Kenworthy
School of Information Technology, Murdoch University, Perth, Western Australia

Content
• This presentation is about an alternative way to search and/or classify data travelling over a network
• I will describe the background, methodology and results of the research
• The research spans two seemingly disparate disciplines; because the conference has a communications focus, there is some background on bioinformatics to set the scene
• This presentation is (almost) maths free!
  • if you want the maths, see the paper :)

Motivation
• Test and validate alternative ways to view network data
• Better visualise the intrinsic relationships between packets of data based on structure rather than content
As part of:
• Investigating the possibilities inherent in mining data via structure-based methods
• Searching using statistical ranking of possible answers

About
• Searching for information in high-speed network traffic is difficult, but is basically a "solved" problem!
• What is still a problem, though, is searching for search terms that are partial, obfuscated or spatially separated in the data stream
• The work described here is a successful attempt to use characteristics more commonly associated with biological systems to identify areas of interest in a network data stream

Searching
• Traditional database search results come from exact (yes/no) matching based on some regular expression system
  • e.g., [Bb]ank*
• Instead, the algorithms I am recommending match on the low-level structure of a sequence of characters
  • character value and position/relationship in the stream
  • character/term substitution
  • results are ranked according to identity score and include false error rate data

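To make the contrast concrete, here is a minimal Python sketch (not from the original slides). The regular expression gives a yes/no answer, while the second function ranks every candidate by a similarity score. `difflib.SequenceMatcher` is only a stand-in for a bioinformatics scoring scheme, and the function names and sample strings are hypothetical.

```python
import re
from difflib import SequenceMatcher


def exact_match(pattern: str, text: str) -> bool:
    """Traditional search: the pattern either occurs in the text or it does not."""
    return re.search(pattern, text) is not None


def ranked_matches(query: str, candidates: list[str]) -> list[tuple[float, str]]:
    """Score every candidate against the query and return them best first.

    SequenceMatcher.ratio() is used purely as an illustrative similarity
    measure; it is NOT the scoring used by BLAST-style tools.
    """
    scored = [(SequenceMatcher(None, query, c).ratio(), c) for c in candidates]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # The obfuscated term is missed by the exact match...
    print(exact_match("[Bb]ank*", "my b4nk account"))
    # ...but still appears near the top of a ranked result list.
    for score, candidate in ranked_matches("bank", ["b4nk", "bank", "tree"]):
        print(f"{score:.2f}  {candidate}")
```

The ranked form never says "no"; it leaves the decision about what counts as a match to a threshold chosen by the user.
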
What do we mean by "bioinformatics" algorithms?
• There are useful parallels between the way data is structured in a stream of network data and in a biological genome
• Target the “structures” within a data stream for searching
• Very sophisticated, statistically valid search algorithms were developed for searching biological data
• Results can be statistically correlated and ranked

What is constant? - Structure!
• The property of the bioinformatics algorithms that we rely on primarily targets “structure”:
  • IP addresses will change in the header of an IP packet
  • BUT the position and placement of other tokens near the IP address do not (fixed-size fields)
  • This property extends to data fields
• Example:
  • DNS data packets will have a similar signature, with slight differences depending on the mutable data

Structure
• 00 => A
• 01 => C
• 10 => G
• 11 => T

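A minimal sketch of this mapping in Python, assuming each byte is read as four bit pairs with the most significant pair first (the slide does not state the bit order). The function names are illustrative only.

```python
# Two-bit values mapped to the DNA alphabet, as listed on the slide.
BITS_TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def transcode(packet: bytes) -> str:
    """Turn raw bytes into a DNA-alphabet string, four bases per byte.

    Assumption: bit pairs are taken most significant first.
    """
    bases = []
    for byte in packet:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases)


def decode(sequence: str) -> bytes:
    """Inverse of transcode(), showing that the mapping loses no information."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(sequence), 4):
        byte = 0
        for base in sequence[i:i + 4]:
            byte = (byte << 2) | BASE_TO_BITS[base]
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    print(transcode(b"\x1b"))  # 0x1b is 00 01 10 11, so this prints ACGT
```
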
Methodology
• The software used was standard bioinformatics software, with the input data modified to suit
• Most bioinformatics software has been implemented by large teams over many years, so it is not practical for an individual to re-implement it for a different purpose :(
• Solution: translate packetised network data into bioinformatics-compatible data files by mapping ones and zeros to the DNA alphabet (basic data abstraction)

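As an illustration, a packet could be written out as a FASTA record, a plain-text sequence format that BLAST-style tools accept as input. This is a sketch under assumptions: the slides do not name the file format actually used, and the record names, line width and helper names here are hypothetical.

```python
def transcode(packet: bytes) -> str:
    """Map each bit pair to A/C/G/T (00, 01, 10, 11), most significant pair first."""
    return "".join(
        "ACGT"[(byte >> shift) & 0b11]
        for byte in packet
        for shift in (6, 4, 2, 0)
    )


def fasta_record(name: str, packet: bytes, width: int = 60) -> str:
    """Format one packet as a FASTA record.

    A record is a '>' header line followed by the sequence, wrapped here
    at `width` characters per line (an arbitrary choice for readability).
    """
    sequence = transcode(packet)
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sample: the first bytes of an IPv4 header.
    sample = bytes([0x45, 0x00, 0x00, 0x3C])
    print(fasta_record("packet_0001", sample), end="")
```

Keeping the translation this thin is the point of the approach: the existing search software is used unchanged and only its input is adapted.
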
What?
• What we are proposing is to intelligently identify network traffic using the relationships between structural elements embedded in the data rather than the literal content of the data
• Use this as a method to identify and classify network data into categories against which an event can be notified
• We have created a database of known good and bad data samples, which allows us to place network data in one of three possible categories:
  • known good
  • known bad
  • unknown

Database Creation
• Created on isolated island networks using generic PCs with various operating systems
  • database pollution was a problem
• "Good" samples were typical email, database and browsing traffic
• "Bad" samples were from PCs intentionally infected with botnet, virus and worm examples
• Database is in the form of indexed motifs in a "BLAST" formatted flat file design

Process
• Processing starts by extracting a packet of data to user space (via the Linux kernel netfilter nfqueue module)
• The packet (as a whole) is transcoded and searched against the database
  • the search returns a set of "motifs", with score and false error rate statistics for each motif matched in the database
• Event notification is threshold-based, according to an election process over the top-rated N hits returned (hits are ranked in order of identity score)

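The election step could look something like the following Python sketch. The `Hit` fields, the default `top_n` and `min_votes` values, and the simple vote-counting rule are all assumptions for illustration; the slides do not give the actual voting rule or thresholds.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One motif match returned by the search (hypothetical structure)."""
    motif_id: str
    label: str       # "good" or "bad", taken from the sample database
    identity: float  # identity score; higher means more similar


def elect(hits: list[Hit], top_n: int = 5, min_votes: int = 3) -> str:
    """Classify a packet by a vote among its top-N hits.

    Hits are ranked by identity score, the best `top_n` are kept, and the
    most common label wins only if it reaches `min_votes`. Anything else,
    including an empty result set, is reported as "unknown".
    """
    top = sorted(hits, key=lambda h: h.identity, reverse=True)[:top_n]
    if not top:
        return "unknown"
    label, votes = Counter(h.label for h in top).most_common(1)[0]
    return label if votes >= min_votes else "unknown"


if __name__ == "__main__":
    hits = [
        Hit("m1", "bad", 98.0),
        Hit("m2", "bad", 95.0),
        Hit("m3", "good", 90.0),
        Hit("m4", "bad", 88.0),
        Hit("m5", "good", 70.0),
        Hit("m6", "good", 60.0),
    ]
    print(elect(hits))  # three of the top five hits are "bad"
```

An event would then be raised only for the "bad" outcome, or for "unknown" as well if the site policy treats unclassified traffic as suspicious.
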
Implementation
• The test design has proven less than reliable at higher packet rates
  • mainly due to inefficient design
• Next step is to implement the reference design as a Snort IDS module and link it to Snort's event notification process, where the well-designed data handling processes should alleviate the problems mentioned above

The Future
• These techniques have wide applicability to search problems where data is structured but mutable
• And for something completely different :)
• Using a similar process to detect collusion between student assignments, based on structural similarities in software coding styles
• Create database of motifs based on code … search!

Existing work?
• Considering the advantages I have found, very little work has been undertaken in using these algorithms
• IBM proposed “An Intrusion-Detection System Based on the Teiresias Pattern-Discovery Algorithm” in 1999
• IBM proposed using the Teiresias algorithm for spam filtering in 2004
• Commentators thought it was “interesting” but little further activity ...

Conclusions
• It works :)
• Known/Unknown sorting might be a unique “niche” application
• Ability to statistically rank similarity is a useful tool, opening up access to alternative ways to view search results

Questions?
• William (Bill) Kenworthy
• W.Kenworthy@murdoch.edu.au
Thank you!