TCN Spell Checker



Presentation Transcript


  1. TCN Spell Checker Team AZP: Mark Biddlecom, Joshua Correa, Jatinder Singh, Zianeh Kemeh-Gama, Eric Engquist

  2. Team AZP • Team descendant of previous project groups • Primary roles by member: • Joshua Correa – Project Lead, TCN Liaison • Eric Engquist – Materials and Metrics Manager • Mark Biddlecom – Resource and Process Manager • Zianeh Kemeh-Gama – Schedule Manager • Jatinder Singh – Research Lead • Dr. Ludi – Faculty Advisor • Website: http://www.se.rit.edu/~teamazp/index.htm

  3. TCN • Software development and staffing company based here in Rochester, NY http://www.tcnus.com • Developer of web-based search and knowledge management programs • KnowledgeTrac • Customizable multilingual web search tool • Standalone spider • TecTrac, AppTrac, AuditTrac, HelpTrac, TestTrac • Document and database search and management tools

  4. Document Collaboration Tool • Online repository for management documents • Meeting minutes • Metrics • Research links • Presentations and diagrams • Tasks and issues for each team member • Email notifications of changes • Custom developed for this project

  5. Spell Checker • Should compensate for mistyped search terms • Match misspelled words with correct spelling • “atourney” → attorney • Match misspelled words with correct results • “atourney” → legal services, lawyers • Meant to make searches more useful for average web search users • 1) Takes in search terms from user • 2) Checks spelling/matches with known search terms • 3) Returns suggestions to search engine

  6. Spell Checker Requirements Functional Requirements: • Look up search terms in a dictionary • Suggest replacements for misspelled terms (closest match) • Add new terms to dictionary • Process phrases (as opposed to single words) • Support multiple dictionaries

  7. Spell Checker Requirements Non-functional Requirements: • Object-oriented design to be implemented as a web service with VB.NET • Adaptability • Must support ability to work with different data stores • Must support the addition of new components • Performance • Analysis of a search string cannot take longer than one second.

  8. Spell Check Process • Load configuration • Load dictionaries (from cache or rebuild) • Apply rules • Parse search string • Apply algorithm to each term • Short-circuit if enough results have been found • Return result set of suggestions
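The rule-application and short-circuit steps on this slide can be sketched as follows. This is a minimal illustration in Python rather than the project's VB.NET, and every name in it (`suggest`, `exact_rule`, `prefix_rule`, `max_results`) is hypothetical. Configuration and dictionary loading are left out, and the two rules are toy stand-ins for the real algorithms.

```python
def exact_rule(term, dictionary):
    """Toy rule: the term is its own suggestion when it is a known word."""
    return [term] if term in dictionary else []


def prefix_rule(term, dictionary):
    """Toy rule: suggest known words sharing the term's first two letters."""
    return sorted(w for w in dictionary if w[:2] == term[:2])


def suggest(search_string, dictionary, rules, max_results=5):
    """Apply each rule to every term, stopping once enough results exist."""
    suggestions = []
    terms = search_string.lower().split()  # simplest possible parser
    for rule in rules:
        for term in terms:
            for word in rule(term, dictionary):
                if word not in suggestions:
                    suggestions.append(word)
                if len(suggestions) >= max_results:
                    return suggestions  # short-circuit: enough results found
    return suggestions


words = {"attorney", "attic", "lawyer"}
print(suggest("atourney", words, [exact_rule, prefix_rule]))
```

Ordering the rules from cheapest to most expensive means the short-circuit can skip the costly algorithms whenever the cheap ones already produce enough suggestions.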

  9. Configuration • Application configuration file • Provides application-level settings (e.g., maximum memory usage, maximum processor time for search) • Points to search configuration file • Search configuration file • Allows control over how memory is used vs. algorithm performance • Defines dictionaries and methodologies • Methodologies include rules

  10. Loaders • Load a set of words for use in dictionaries • Used to create root dictionaries (<root> in the configuration file) • Word sets returned by loaders are not cached, but instead used to create algorithm dictionaries

  11. Formatters • Provide a dictionary specialized for use with a specific algorithm • Created by <dictionary> tags in the configuration file • Dictionaries created by formatters are cached for use between application sessions

  12. Parsers • Split a search string up into a number of terms • For a given rule, the algorithm is applied to each term supplied by the parser
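As a rough illustration of the parser role, the sketch below splits a search string into terms and applies a caller-supplied algorithm to each one. It is written in Python for brevity; `parse_terms` and `apply_to_terms` are hypothetical names, and the tokenization rule (lowercase runs of letters, digits, and apostrophes) is an assumption, not the team's parser.

```python
import re


def parse_terms(search_string):
    """Split a search string into lowercase terms, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", search_string.lower())


def apply_to_terms(search_string, algorithm):
    """Run the given algorithm on each parsed term, keeping term order."""
    return [(term, algorithm(term)) for term in parse_terms(search_string)]


print(parse_terms("Atourney, Rochester NY"))
print(apply_to_terms("Legal  services", len))
```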

  13. Data Flow

  14. Algorithms – String Similarity • Calculates number of operations to go from one word to another • Insertion, Deletion, Substitution • Few operations → good suggestion • Extra features • Swapping operation • Operation weighting
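The operations listed on this slide correspond to the classic dynamic-programming edit distance. The sketch below is one common formulation (the "optimal string alignment" variant, which adds adjacent-character swaps) with a per-operation weight. It is written in Python for illustration; the function name and the default weights are assumptions, not the team's implementation.

```python
def edit_distance(a, b, ins=1, dele=1, sub=1, swap=1):
    """Weighted edit distance with insert, delete, substitute, and swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * dele          # delete every character of a
    for j in range(1, cols):
        d[0][j] = j * ins           # insert every character of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(
                d[i - 1][j] + dele,      # deletion
                d[i][j - 1] + ins,       # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Swap of two adjacent characters, e.g. "teh" -> "the".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + swap)
    return d[-1][-1]


print(edit_distance("atourney", "attorney"))
print(edit_distance("teh", "the"))
```

Raising a weight makes the corresponding kind of typo count as "further away", so, for example, a low swap weight favors suggestions that differ only by transposed letters.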

  15. Algorithms – String Similarity • Complexity of O(s1*s2) • s1, s2 are the lengths of the strings being compared • Can be improved to O(s1*k) • k is the edit distance
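One way to get the O(s1*k) behavior is to evaluate only the diagonal band of the table where the two positions differ by at most k, and to give up as soon as every cell in a row exceeds k. The Python sketch below illustrates that idea under the assumption that the caller supplies a maximum acceptable distance `k`; `bounded_distance` is a hypothetical name. For readability it still allocates a full-width row, although the inner loop visits at most 2k+1 cells per row.

```python
def bounded_distance(a, b, k):
    """Return the Levenshtein distance if it is <= k, otherwise None."""
    if abs(len(a) - len(b)) > k:
        return None                      # length gap alone exceeds k
    big = k + 1                          # stands in for "more than k"
    prev = [j if j <= k else big for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        lo, hi = max(1, i - k), min(len(b), i + k)
        cur = [big] * (len(b) + 1)       # cells outside the band stay "big"
        if i <= k:
            cur[0] = i
        for j in range(lo, hi + 1):      # only cells with |i - j| <= k
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost, big)
        if min(cur[lo - 1:hi + 1]) > k:
            return None                  # no path can finish within k
        prev = cur
    return prev[len(b)] if prev[len(b)] <= k else None


print(bounded_distance("atourney", "attorney", 2))
print(bounded_distance("atourney", "attorney", 1))
```

Because a spell checker only cares about close matches, a small fixed k lets most dictionary words be rejected after a few rows.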

  16. Algorithms - Phonetic • Several rules used to parse English words into a sequence of phonetic sounds • Example: Phonetic → pntk • Parse dictionary, parse search term • String similarity comparison
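The slide does not list the phonetic rules themselves, so the Python sketch below uses a deliberately small, made-up rule set that happens to reproduce the slide's example: drop vowels and the letters h, w, and y, map a few letters to a similar-sounding one, and collapse repeats. The name `phonetic_key` and all of the rules are assumptions for illustration only.

```python
def phonetic_key(word):
    """Reduce a word to a rough consonant skeleton using toy rules."""
    replacements = {"c": "k", "q": "k", "z": "s"}  # assumed sound-alikes
    dropped = set("aeiouhwy")                      # assumed silent or weak
    key = []
    for ch in word.lower():
        if not ch.isalpha() or ch in dropped:
            continue
        ch = replacements.get(ch, ch)
        # Collapse a code that repeats the previously kept one.
        if not key or key[-1] != ch:
            key.append(ch)
    return "".join(key)


print(phonetic_key("Phonetic"))
print(phonetic_key("atourney"), phonetic_key("attorney"))
```

With rules like these, both the dictionary words and the search term are converted to keys first, and the string similarity comparison is then run on the keys rather than on the original spellings.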

  17. Deliverable Schedule Iteration 1: February 1st 2005 • Complete system design for system iterations 1-3 • Instructions for installation and integration with TCN client software • Research • Analysis of historic search strings and business names from TCN • Dictionaries (common words) • Word search algorithms • Basic System Implementation • Database integration • Testing

  18. Deliverable Schedule Iteration 2: February 18th 2005 • Suggest replacements for words not in the dictionary • Addition of a new search algorithm to provide more intelligent searches • Closest Match • Using multiple dictionaries • Unit Testing for all written code

  19. Deliverable Schedule Iteration 3: March 21st 2005 • Phonetic Matching • Dynamically add words/phrases to the dictionary • Support phrase searching • Addition of further search algorithms • GUI Configuration tool • Algorithm Optimization

  20. Metrics • Schedule/estimation accuracy • Estimation accuracy (hours per task) • Slippage percentages • Defect statistics and analysis • Severity and complexity of defects • Defect source tracking • Average age of defects

  21. Age of Known Defects

  22. Severity of Defects

  23. Complexity of Defects

  24. Sources of Defects

  25. Research References • “Approximate String Matching” by Ricardo Baeza-Yates at University of Chile • “A Guided Tour to Approximate String Matching” by Gonzalo Navarro at University of Chile, 2001 • “An Extension of Ukkonen’s Enhanced Dynamic Programming ASM Algorithm” by Hal Berghel (U of Arkansas) and David Roach (Acxiom Corp.), 1996
