Stylometry Project May 4, 2007 Pace’s Research Day
TEAM MEMBERS
• Rob Goodman, Programmer
  • Currently working at KPMG
  • Completing MS in Computer Science in December 2008
• Matt Hahn, Quality Assurance
  • Currently working at Affiliated Computer Services, Inc.
  • Completing MS in Information Technologies in May 2007
• Madhuri Marella, Programmer
  • Completing MS in Computer Science in May 2007
• Chris Ojar, Team Leader
  • Currently working at Pace’s Evening Support Office in Pleasantville
  • Completing MS in Internet Technologies in May 2007
WHAT IS STYLOMETRY?
• The study of the unique linguistic styles and writing behaviors of individuals in order to determine authorship
• Used to attribute authorship to anonymous or disputed documents; it has legal as well as academic and literary applications
• Uses statistical analysis, pattern recognition, and artificial intelligence techniques. For features, stylometry typically analyzes the text by using word frequencies and identifying patterns in common parts of speech
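As a rough illustration of the word-frequency idea mentioned above, the following sketch computes relative word frequencies for a text. The function name `relative_word_frequencies`, the `top_n` parameter, and the simple tokenizer are assumptions made for this example; the slides do not say how the project counted words.

```python
import re
from collections import Counter


def relative_word_frequencies(text, top_n=5):
    """Return the top_n most common words with their share of all words.

    Hypothetical helper: the tokenizer (lowercase letters and apostrophes)
    is an assumption, not the project's documented method.
    """
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    # Divide each count by the total so texts of different lengths compare.
    return [(word, count / total) for word, count in counts.most_common(top_n)]


if __name__ == "__main__":
    sample = "the cat and the dog and the bird"
    for word, share in relative_word_frequencies(sample, top_n=2):
        print(word, share)
```

Using shares rather than raw counts keeps a long email and a short one on the same scale, which matters once features are compared across authors.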
THE PROGRAM
• A pattern recognition system to identify the author of arbitrary email using stylometry features
• Phase 1 – Data Collection
  • Raw data from Keystroke Biometric Project
  • Plain text emails
• Phase 2 – Feature Extraction
  • Measurements of punctuation, content format, and keystrokes [when applicable]
  • Normalize features to 0-1 range
• Phase 3 – Classification
  • k-Nearest-Neighbor using Euclidean distance
  • Defaulted to 10
RAW DATA EXAMPLES
File Name: Sandy-biometrics.txt
File Name: Goodman-email.txt
Dear Ms. Sanderson: I enjoyed our conversation on February 18th at the Family and Child Development seminar on teaching young children and appreciated your personal input about helping children attend school for the first time. This letter is to follow-up about the Fourth Grade Teacher position as discussed at the seminar. I will be completing my Bachelor of Science Degree in Family and Child Development with a concentration in Early Childhood Education at Pace in May of 2007, and will be available for employment at that time…
DIRTY DATA EXAMPLE
<Shift> I'm on my second take and <Shift> I'm still writing about the same book <Shift> : <Shift> " <Shift> A <Shift> Million <Shift> Little <Shift> Pieces. <Backspace> <Backspace> <Shift> " <Shift> I'm not sure if <Shift> I am supposed to be typing the same this <Backspace> ng <Shift> I typed on submit <Backspace> ssion <Shift> #1 as <Shift> I am on sb <Backspace> ubmission <Shift> #2, but since <Shift> my sister is skiing in <Shift> Vermont, <Shift> I'll just continued <Backspace> . <Shift> In any event, as a <Backspace> soon as <Shift> I found out the book was not true, <Shift> I couldn't pick it up for a few days. <Shift> Then, it got the best of me. <Shift> It is tu <Backspace> <Backspace> a fact that <Shift> James <Shift> Frey is a great ri <Backspace> <Backspace> writer. <Shift> He holds your interest and attention a <Backspace> so <Shift> I go <Backspace> t b <Backspace> past the fact the <Backspace> <Backspace> at he lied, and continued on. <Shift> I have to say <Shift> I endj <Backspace> <Backspace> joyed the book a lot better as a non-fiction book than <Shift> I did as a fiction novel.
CLEAN DATA EXAMPLE
I'm on my second take and I'm still writing about the same book: "A Million Little Pieces." I'm not sure if I am supposed to be typing the same thing I typed on submission #1 as I am on submission #2, but since my sister is skiing in Vermont, I'll just continue. In any event, as soon as I found out the book was not true, I couldn't pick it up for a few days. Then, it got the best of me. It is a fact that James Frey is a great writer. He holds your interest and attention so I got past the fact that he lied, and continued on. I have to say I enjoyed the book a lot better as a non-fiction book than I did as a fiction novel.
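The step from dirty to clean data can be sketched as a replay of the logged keystrokes. This is a minimal sketch, not the project's code: the function name `clean_keystrokes` is hypothetical, and it assumes that only `<Shift>` and `<Backspace>` tokens occur, that each token is set off by single formatting spaces, and that a `<Shift>` marks a word boundary. Under that last assumption, punctuation typed with Shift (such as the colon in the slide) would pick up a stray space, so the slide's full example is not reproduced exactly.

```python
import re

# Assumed token format: "<Shift>" or "<Backspace>", each optionally
# surrounded by one formatting space that was not actually typed.
TOKEN = re.compile(r" ?<(Shift|Backspace)> ?")


def clean_keystrokes(raw):
    """Replay a keystroke log and return the reconstructed text."""
    out = []  # characters of the reconstructed text
    pos = 0
    for match in TOKEN.finditer(raw):
        out.extend(raw[pos:match.start()])  # literal text before the token
        if match.group(1) == "Backspace":
            if out:
                out.pop()  # delete the most recently typed character
        elif out and out[-1] != " ":
            out.append(" ")  # assumption: Shift starts a new word
        pos = match.end()
    out.extend(raw[pos:])  # trailing text after the last token
    return "".join(out)


if __name__ == "__main__":
    dirty = ("typing the same this <Backspace> ng <Shift> I typed on "
             "submit <Backspace> ssion <Shift> #1")
    print(clean_keystrokes(dirty))
```

Run on that excerpt from the dirty example, the sketch yields "typing the same thing I typed on submission #1", matching the corresponding clean text.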
THE PROGRAM
• A pattern recognition system to identify the author of arbitrary email using stylometry features
• Phase 1 – Data Collection
  • Raw data from Keystroke Biometric Project
  • Plain text emails
• Phase 2 – Feature Extraction
  • Measurements of punctuation, content format, and keystrokes [when applicable]
  • Normalize features to 0-1 range
• Phase 3 – Classification
  • k-Nearest-Neighbor using Euclidean distance
  • Defaulted to 10
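Phase 2 can be illustrated with a small sketch that measures a few text features and rescales them to the 0-1 range. The three features chosen here (punctuation rate, commas per word, average word length) and the names `extract_features` and `min_max_normalize` are assumptions for illustration; the slides do not list the project's actual feature set, and min-max scaling is one common way to reach a 0-1 range, not necessarily the one the team used.

```python
import string


def extract_features(text):
    """Return a small, hypothetical feature vector for one email."""
    words = text.split()
    n_chars = max(len(text), 1)   # guard against empty input
    n_words = max(len(words), 1)
    return [
        sum(ch in string.punctuation for ch in text) / n_chars,  # punctuation rate
        text.count(",") / n_words,                               # commas per word
        sum(len(w) for w in words) / n_words,                    # average word length
    ]


def min_max_normalize(rows):
    """Rescale each feature column to the 0-1 range across all rows."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0  # constant column maps to 0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


if __name__ == "__main__":
    emails = ["Hi, you.", "Dear Ms. Sanderson: thank you, again.", "ok"]
    vectors = [extract_features(e) for e in emails]
    for row in min_max_normalize(vectors):
        print(row)
```

Normalizing matters for Phase 3: without it, a feature with a large numeric range (such as word length) would dominate the Euclidean distance over features measured as small rates.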
DESIGN MODEL
(Flowchart; node labels recovered from the slide, branch arrows not recoverable)
• START
• Decision: Single Raw Data File?
• READ RAW DATA
• Email reconstructed, dirty file and feature stats generated
• Base data files of email reconstructed, dirty file and feature stats generated, with file name saved with the extension “- Clean Original”
• Emails reconstructed, dirty files and feature stats generated in one file, with file name saved as Batch.year-month-day and military time
• Select & convert base data files to data set file
• Decision: Compare to Test Case?
• READ TEST CASE
• Run Compare
• Decision: Do You Accept the Program’s Result?
• Enter Author of Test Case
• Decision: Save Test Case to Data Set?
• END
ANALYSIS MODEL
(Flowchart; stages listed in the order they appear on the slide)
• START
• READ RAW DATA
• Feature Extraction
• Feature Statistics
• Normalized Feature Statistics
• K Nearest Neighbor Classifier
• TEST CASE
• K Nearest Neighbor Identification
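The final identification stage can be sketched as a k-nearest-neighbor vote over normalized feature vectors using Euclidean distance. This is an illustrative sketch under stated assumptions: the function name `knn_identify` and the sample data are hypothetical, and the default `k=10` assumes that the slide's "Defaulted to 10" refers to k.

```python
import math
from collections import Counter


def knn_identify(training, query, k=10):
    """Return the author most common among the k nearest training vectors.

    training: list of (feature_vector, author) pairs, already normalized.
    query:    feature vector of the test case, normalized the same way.
    """
    # Sort by Euclidean distance to the query and keep the k closest.
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(author for _, author in nearest)
    # On a tied vote, the author who reached that count first among the
    # distance-sorted neighbors is returned.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical two-feature data set with authors "A" and "B".
    data = [
        ([0.10, 0.20], "A"), ([0.15, 0.25], "A"), ([0.20, 0.10], "A"),
        ([0.90, 0.80], "B"), ([0.85, 0.90], "B"),
    ]
    print(knn_identify(data, [0.12, 0.20], k=3))
```

With a small data set, k should be no larger than the number of samples per author that one expects to be relevant; a k of 10 on five samples simply votes over the whole set.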
PROJECT HOME PAGE http://utopia.csis.pace.edu/cs615/2006-2007/team2/
QUESTIONS
Contact cojar@pace.edu or ctappert@pace.edu for more information, or visit http://utopia.csis.pace.edu/cs615/2006-2007/team2