RESEARCH TRACK
Proceedings of the 2004 ACM SIGKDD international conference on Knowledge discovery and data/text mining
INDUSTRIAL TRACK
Density-Based Spam Detector
(Acceptance rate = 40/337 = 12%)

Hiromitsu FUJIKAWA, Katsuyuki YAMAZAKI
KDDI R&D Laboratories Inc.
2-1-15 Ohara, Kami-fukuoka, Saitama 356-8502, Japan
{fujikawa,yamazaki}@kddilabs.jp

Fuminori ADACHI, Takashi WASHIO, Hiroshi MOTODA
ISIR, Osaka University
8-1, Mihogaoka, Ibarakishi, Osaka 567-0047, Japan
{adachi,washio,motoda}@ar.sanken.osaka-u.ac.jp
Density-Based Spam Detector
• A new spam detection method that uses document space density information
  • The use of document space density
  • An efficient implementation through the use of a direct-mapped cache
• Purpose
  • A spam filter that runs on a mail server must offer:
    • High processing speed
    • Easy maintenance
    • High accuracy
    • Privacy protection
System Architecture
• Unsupervised learning: solves the privacy problem and the maintenance problem
• Hash table: solves the processing speed problem
• Similarity and threshold: solve the spam filter accuracy problem
[Figure: system architecture. Labels: An incoming email; Find feature by N-gram; Hash feature; Hash table; Hash DB; Read features; Write/update features, similar email, email; Mail corpus; calculate similarity; similarity > SPAM threshold; SPAM]
Related work
• Bayesian-like approach
• Rule-based approach
• Checksum database
  • http://www.dcc-servers.net/dcc/
• Vector representation
• Hash-based text representation
  • Text retrieval, text compression, spam filtering
  • A direct-mapped cache is used in place of LRU
Density-based spam detector
• Document space density
  • Count the number of similar e-mails
• Once the similar e-mails have been counted, a simple threshold on that count is enough to distinguish spam from other e-mails.
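The counting idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `is_similar` is a hypothetical predicate supplied by the caller, `threshold` is an arbitrary value, and the linear scan over earlier e-mails is for clarity only; none of these names or values come from the slides.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def flag_by_density(
    emails: Sequence[T],
    is_similar: Callable[[T, T], bool],  # hypothetical similarity predicate
    threshold: int,                      # illustrative value, not from the slides
) -> List[bool]:
    """Flag an e-mail when it has at least `threshold` similar predecessors."""
    seen: List[T] = []
    flags: List[bool] = []
    for email in emails:
        # Density = number of earlier e-mails that are similar to this one.
        count = sum(1 for prev in seen if is_similar(email, prev))
        flags.append(count >= threshold)
        seen.append(email)
    return flags

# Toy usage: exact equality stands in for the similarity test.
stream = ["buy now", "hello", "buy now", "buy now", "buy now"]
flags = flag_by_density(stream, lambda a, b: a == b, threshold=2)
```

In this toy stream the fourth and fifth copies of the repeated message are flagged, because each has at least two similar predecessors.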
Mail System Design
• Mass Mail Detector
  • Monitors network packets
  • Analyzes SMTP traffic between mail servers and reconstructs the text of emails
  • Transforms the text into a vector representation
Hash design
• Hash-based representation
  • The hash value of each length-L substring is calculated, and the first N of these values are used as the vector representation of the email
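A minimal Python sketch of this representation follows. The slide does not name a hash function or give values for L and N, so CRC32 and the defaults below are assumptions chosen for illustration, as is reading "the first N" as the first N substrings in order of position.

```python
import zlib
from typing import List

def hash_features(text: str, L: int = 9, N: int = 100) -> List[int]:
    """Hash each length-L substring and keep the first N hash values.

    CRC32 and the default L and N are illustrative assumptions.
    """
    data = text.encode("utf-8")
    features: List[int] = []
    # Slide a window of L bytes over the text, left to right.
    for start in range(max(len(data) - L + 1, 0)):
        if len(features) == N:
            break
        features.append(zlib.crc32(data[start:start + L]))
    return features

vector = hash_features("Limited offer! Buy cheap watches today.", L=9, N=5)
```

A text shorter than L bytes yields an empty vector, which a caller would need to handle separately.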
Caching Architecture
• Direct-mapped cache architecture
  • The hash database stores the hash values of each email and the number of similar emails.
  • The direct-mapped cache is a simple hash table; through it, the algorithm can find the entries of emails that have the same hash value.
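The sketch below shows what a direct-mapped cache of this kind might look like in Python. In a direct-mapped cache each key maps to exactly one slot and a colliding key simply overwrites the previous occupant, so no LRU bookkeeping is needed. The class name, slot count, and entry layout are assumptions for illustration, not the authors' implementation.

```python
from typing import List, Optional, Tuple

class DirectMappedCache:
    """Fixed-size table: one slot per index, collisions overwrite (no LRU)."""

    def __init__(self, num_slots: int = 1024) -> None:  # size is illustrative
        self.num_slots = num_slots
        # Each slot holds (hash_value, [email ids]) or None when empty.
        self.slots: List[Optional[Tuple[int, List[str]]]] = [None] * num_slots

    def lookup(self, hash_value: int) -> List[str]:
        """Return ids of emails recorded under this exact hash value."""
        entry = self.slots[hash_value % self.num_slots]
        if entry is not None and entry[0] == hash_value:
            return list(entry[1])
        return []

    def insert(self, hash_value: int, email_id: str) -> None:
        """Record an email under a hash value, evicting any colliding entry."""
        index = hash_value % self.num_slots
        entry = self.slots[index]
        if entry is not None and entry[0] == hash_value:
            entry[1].append(email_id)
        else:
            self.slots[index] = (hash_value, [email_id])

cache = DirectMappedCache(num_slots=4)
cache.insert(1, "mail-a")
cache.insert(1, "mail-b")
hits_before = cache.lookup(1)
cache.insert(5, "mail-c")  # 5 % 4 == 1, so this evicts the entry for hash 1
hits_after = cache.lookup(1)
```

The trade-off is that an eviction silently forgets older entries, in exchange for constant-time lookups and inserts with no replacement policy to maintain.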
Similar emails
• To check a single email, find the previous emails that share S% of its hash values
• Algorithm
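The algorithm figure is not part of this transcript, so the following Python sketch is only one plausible reading of the similarity test. It assumes that the share is measured relative to the new email's hash values, that the S% bound is inclusive, and that stored emails are scanned one by one; the function names and the value of S are hypothetical.

```python
from typing import Dict, List, Sequence

def shares_enough(new_hashes: Sequence[int], old_hashes: Sequence[int], S: float) -> bool:
    """True when at least S% of the new email's hash values occur in the old one."""
    if not new_hashes:
        return False
    old = set(old_hashes)
    shared = sum(1 for h in new_hashes if h in old)
    # Multiply out instead of dividing, to avoid floating-point rounding
    # for integer-valued S.
    return shared * 100 >= S * len(new_hashes)

def find_similar(new_hashes: Sequence[int], stored: Dict[str, List[int]], S: float) -> List[str]:
    """Return the ids of stored emails judged similar to the new one."""
    return [eid for eid, hashes in stored.items() if shares_enough(new_hashes, hashes, S)]

stored = {"mail-a": [1, 2, 3, 4, 5], "mail-b": [1, 9, 9, 9, 9]}
similar = find_similar([1, 2, 3, 4, 6], stored, S=80)
```

Here `mail-a` shares four of the five hash values (80%) and is returned, while `mail-b` shares only one and is not. The linear scan is for clarity; a lookup table keyed by hash value would avoid comparing against every stored email.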
Experimental results
• Distribution of similar e-mail follows Zipf’s law
• Summary of experimental results
Experimental results
• Recall rate
• Cache usage
• Log
Experimental results
• Comparison with other methods
• Testing method
Experimental results: effect of topic change
• bsfilter
  • 2 groups of mailing-list mail: S: 528 mails, H: 1538 mails
  • Training: H + 1/2 S
  • Testing: 1/2 S
• Result: after some period, bsfilter misclassified most of the mails
• Reason: the topics changed
Experimental results Effect of On-line Learning
Maintenance and privacy
• Supervised learning methods require both positive and negative examples of spam
• Someone has to check the contents of the mail manually, which can violate privacy.
• Although our method requires a white list, maintaining a white list is relatively easy, especially compared with maintaining a black list.
Conclusion
• High processing speed
  • 13000 emails per second (1.25 billion emails per day)
• Maintenance free
• 98% recall rate and 100% precision
• Privacy protection