
Text Analytics Workshop Development

Text Analytics Workshop: Development. Tom Reamy, Chief Knowledge Architect, KAPS Group (Knowledge Architecture Professional Services), http://www.kapsgroup.com. Agenda: Development - Foundation; Case Study 1 – Internet News; Case Study 2 – Tale of two taxonomies.


Presentation Transcript


  1. Text Analytics Workshop: Development • Tom Reamy, Chief Knowledge Architect • KAPS Group, Knowledge Architecture Professional Services • http://www.kapsgroup.com

  2. Agenda • Development - Foundation • Case Study 1 – Internet News • Case Study 2 – Tale of two taxonomies • Case Study 3 – Software Evaluation and Beyond • Exercises

  3. Text Analytics Development: Foundation • Articulated Information Management Strategy (K Map) • Content and Structures and Metadata • Search, ECM, applications - and how used in Enterprise • Community information needs and Text Analytics Team • POC establishes the preliminary foundation • Need to expand and deepen • Content – full range, basis for rules-training • Additional SMEs – content selection, refinement • Taxonomy – starting point for categorization / suitable? • Databases – starting point for entity catalogs

  4. Knowledge Architecture Audit: Knowledge Map

  5. Taxonomy Development Process: Progressive Refinement

  6. Text Analytics Development: Categorization Process • Starter Taxonomy • If no taxonomy, develop initial high level (see Chart) • Analysis of taxonomy – suitable for categorization • Structure – not too flat, not too large • Orthogonal categories • Content Selection • Map of all anticipated content • Selection of training sets – if possible • Automated selection of training sets – taxonomy nodes as first categorization rules – apply and get content
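
The "taxonomy nodes as first categorization rules" step on this slide can be sketched roughly as follows. This is a minimal illustration, not the workshop's actual tooling: the function name `seed_training_sets`, the node labels, and the sample documents are all hypothetical, and a real system would use richer matching than a case-insensitive substring test.

```python
from typing import Dict, List


def seed_training_sets(
    taxonomy_nodes: List[str], documents: Dict[str, str]
) -> Dict[str, List[str]]:
    """Use each taxonomy node label as a one-term rule and collect matching docs."""
    selected: Dict[str, List[str]] = {node: [] for node in taxonomy_nodes}
    for doc_id, text in documents.items():
        lowered = text.lower()
        for node in taxonomy_nodes:
            # Assumption: a plain substring match stands in for the first rule.
            if node.lower() in lowered:
                selected[node].append(doc_id)
    return selected


# Hypothetical node labels and documents, for illustration only.
nodes = ["Healthcare", "Travel", "Education"]
docs = {
    "d1": "New healthcare rules affect hospital billing.",
    "d2": "Airlines expect record summer travel.",
    "d3": "Education budgets and healthcare costs both rose.",
}

training_sets = seed_training_sets(nodes, docs)
for node, doc_ids in training_sets.items():
    print(node, doc_ids)
```

The candidate sets this produces are only a starting point; the slide's "if possible" caveat still applies, since node labels alone often match too little or too much content.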

  7. Text Analytics Development: Categorization Process • First Round of Categorization Rules • Term building – from content – basic set of terms that appear often / important to content • Add terms to rule, apply to broader set of content • Repeat for more terms – get recall-precision “scores” • Repeat, refine, repeat, refine, repeat • Get SME feedback – formal process – scoring • Get SME feedback – human judgments • Test against more, new content • Repeat until “done” – 90%?
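
One way to picture the "add terms, apply, get recall-precision scores, repeat" cycle is the sketch below. It assumes a simple OR-of-terms rule and a small set of SME-judged relevant documents; the helper names `apply_rule` and `score_rule` and all sample data are hypothetical.

```python
from typing import Dict, Set, Tuple


def apply_rule(terms: Set[str], documents: Dict[str, str]) -> Set[str]:
    """Return ids of documents containing at least one rule term (simple OR rule)."""
    return {
        doc_id
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in terms)
    }


def score_rule(matched: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of the matched set against SME judgments."""
    true_positives = len(matched & relevant)
    precision = true_positives / len(matched) if matched else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical test collection; `relevant` stands in for SME judgments
# for a "Healthcare" category.
docs = {
    "d1": "Hospital mergers reshape regional care.",
    "d2": "New drug trial shows promise for patients.",
    "d3": "Hotel chains court hospital visitors.",
    "d4": "Stock markets rally on earnings.",
}
relevant = {"d1", "d2"}

# Round 1: a single term. Round 2: one more term added to the rule.
round1 = score_rule(apply_rule({"hospital"}, docs), relevant)
round2 = score_rule(apply_rule({"hospital", "patients"}, docs), relevant)
print("round 1 (precision, recall):", round1)
print("round 2 (precision, recall):", round2)
```

In this toy collection the added term lifts recall while the "hotel" story remains a close miss, which is the kind of case the refine step (for example, adding an exclusion term) is meant to handle.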

  8. Text Analytics Development: Entity Extraction Process • Facet Design – from KA Audit, K Map • Find and Convert catalogs: • Organization – internal resources • People – corporate yellow pages, HR • Include variants • Scripts to convert catalogs – programming resource • Build initial rules – follow categorization process • Differences – scale, “score” • Recall – find all entities • Precision – correct assignment to entity class • Issue – disambiguation – Ford company, person, car
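
The catalog-with-variants and disambiguation points above can be illustrated with a small sketch. Everything here is an assumption made for illustration: the catalog entries, the cue words, and the function names are hypothetical, and counting context cue words is just one simple disambiguation heuristic, not the method used in the workshop.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical catalog: canonical name -> (entity class, known variants).
CATALOG: Dict[str, Tuple[str, List[str]]] = {
    "Ford Motor Company": ("Organization", ["Ford Motor Company", "Ford Motor Co."]),
    "Environmental Protection Agency": (
        "Organization",
        ["Environmental Protection Agency", "EPA"],
    ),
}

# Hypothetical context cues for an ambiguous name (company, person, car).
AMBIGUOUS: Dict[str, Dict[str, Set[str]]] = {
    "Ford": {
        "Organization": {"company", "automaker", "shares"},
        "Person": {"mr", "president", "said"},
        "Product": {"drove", "sedan", "truck"},
    }
}


def build_variant_index(
    catalog: Dict[str, Tuple[str, List[str]]]
) -> Dict[str, Tuple[str, str]]:
    """Map every lowercased variant to (canonical name, entity class)."""
    index: Dict[str, Tuple[str, str]] = {}
    for canonical, (entity_class, variants) in catalog.items():
        for variant in variants:
            index[variant.lower()] = (canonical, entity_class)
    return index


def extract(text: str, index: Dict[str, Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Find catalog entities in text, matching whole variants case-insensitively."""
    found = set()
    for variant, entity in index.items():
        pattern = r"(?<!\w)" + re.escape(variant) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.add(entity)
    return sorted(found)


def disambiguate(name: str, text: str, cues: Dict[str, Dict[str, Set[str]]]) -> str:
    """Pick the class whose cue words appear most often; 'Unknown' if none do."""
    words = set(re.findall(r"\w+", text.lower()))
    scores = {cls: len(words & cue_words) for cls, cue_words in cues[name].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "Unknown"


index = build_variant_index(CATALOG)
print(extract("The EPA fined Ford Motor Co. last year.", index))
print(disambiguate("Ford", "She drove a Ford truck to work.", AMBIGUOUS))
print(disambiguate("Ford", "Ford shares fell after the company cut forecasts.", AMBIGUOUS))
```

Recall here depends on how complete the variant lists are, while precision depends on the disambiguation step, which mirrors the recall/precision split described on the slide.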

  9. Case Study - Background • Inxight Smart Discovery • Multiple Taxonomies • Healthcare – first target • Travel, Media, Education, Business, Consumer Goods • Content – 800+ Internet news sources • 5,000 stories a day • Application – Newsletters • Editors using categorized results • Easier than full automation

  10. Case Study - Approach • Initial High Level Taxonomy • Auto generation – very strange – not usable • Editors High Level – sections of newsletters • Editors & Taxonomy Pros - Broad categories & refine • Develop Categorization Rules • Multiple Test collections • Good stories, bad stories – close misses - terms • Recall and Precision Cycles • Refine and test – taxonomists – many rounds • Review – editors – 2-3 rounds • Repeat – about 4 weeks

  11. Case Study - Issues • Taxonomy Structure • Aggregate nodes vs. independent nodes • Children Nodes – subset – rare • Depth of taxonomy and complexity of rules • Trade-off between need to update and usefulness of categories • Multiple avenues - Facets – source – New York Times – can put into rules or make it a facet to filter results • When to use filter or terms – experimental • Recall more important than precision – editors’ role
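
The choice between putting a source such as the New York Times into a rule and keeping it as a facet for filtering can be sketched as two small functions. The `Story` type, the function names, the second source ("Reuters"), and the sample stories are hypothetical and only illustrate the trade-off.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Story:
    source: str
    text: str


def rule_with_source(stories: List[Story], term: str, source: str) -> List[Story]:
    """Option A: the source is a condition inside the categorization rule."""
    return [s for s in stories if term in s.text.lower() and s.source == source]


def rule_then_facet(stories: List[Story], term: str) -> Dict[str, List[Story]]:
    """Option B: the rule uses terms only; source stays a facet for later filtering."""
    by_source: Dict[str, List[Story]] = {}
    for s in stories:
        if term in s.text.lower():
            by_source.setdefault(s.source, []).append(s)
    return by_source


# Hypothetical stories, for illustration only.
stories = [
    Story("New York Times", "Vaccine rollout expands to rural clinics."),
    Story("Reuters", "Vaccine makers report strong demand."),
    Story("New York Times", "City council debates zoning changes."),
]

in_rule = rule_with_source(stories, "vaccine", "New York Times")
facets = rule_then_facet(stories, "vaccine")
print(len(in_rule), sorted(facets))
```

Option B keeps every matching story and lets an editor narrow by source afterwards, which fits the slide's point that recall matters more than precision when editors review the results.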

  12. Case Study – Lessons Learned • Combination of SME and Taxonomy pros • Combination of Features – Entity extraction, terms, Boolean, filters, facts • Training sets and find similar are weakest • Somewhat useful during development for terms • No best answer – taxonomy structure, format of rules • Need custom development • Plan for ongoing refinement • This stuff actually works!

  13. Enterprise Environment – Case Studies • A Tale of Two Taxonomies • It was the best of times, it was the worst of times • Basic Approach • Initial meetings – project planning • High level K map – content, people, technology • Contextual and Information Interviews • Content Analysis • Draft Taxonomy – validation interviews, refine • Integration and Governance Plans

  14. Enterprise Environment – Case One – Taxonomy, 7 facets • Taxonomy of Subjects / Disciplines: • Science > Marine Science > Marine microbiology > Marine toxins • Facets: • Organization > Division > Group • Clients > Federal > EPA • Instruments > Environmental Testing > Ocean Analysis > Vehicle • Facilities > Division > Location > Building X • Methods > Social > Population Study • Materials > Compounds > Chemicals • Content Type – Knowledge Asset > Proposals

  15. Enterprise Environment – Case One – Taxonomy, 7 facets • Project Owner – KM department – included RM, business process • Involvement of library - critical • Realistic budget, flexible project plan • Successful interviews – build on context • Overall information strategy – where taxonomy fits • Good Draft taxonomy and extended refinement • Software, process, team – train library staff • Good selection and number of facets • Final plans and hand off to client

  16. Enterprise Environment – Case Two – Taxonomy, 4 facets • Taxonomy of Subjects / Disciplines: • Geology > Petrology • Facets: • Organization > Division > Group • Process > Drill a Well > File Test Plan • Assets > Platforms > Platform A • Content Type > Communication > Presentations

  17. Enterprise Environment – Case Two – Taxonomy, 4 facets • Environment Issues • Value of taxonomy understood, but not the complexity and scope • Under budget, under staffed • Location – not KM – tied to RM and software • Solution looking for the right problem • Importance of an internal library staff • Difficulty of merging internal expertise and taxonomy

  18. Enterprise Environment – Case Two – Taxonomy, 4 facets • Project Issues • Project mind set – not infrastructure • Wrong kind of project management • Special needs of a taxonomy project • Importance of integration – with team, company • Project plan more important than results • Rushing to meet deadlines doesn’t work with semantics as well as software

  19. Enterprise Environment – Case Two – Taxonomy, 4 facets • Research Issues • Not enough research – and wrong people • Interference of non-taxonomy – communication • Misunderstanding of research – wanted tinker toy connections • Interview 1 implies conclusion A • Design Issues • Not enough facets • Wrong set of facets – business not information • Ill-defined facets – too complex internal structure

  20. Taxonomy Development Conclusion: Risk Factors • Political-Cultural-Semantic Environment • Not simple resistance - more subtle • Re-interpretation of specific conclusions and sequence of conclusions / Relative importance of specific recommendations • Understanding project scope • Access to content and people • Enthusiastic access • Importance of a unified project team • Working communication as well as weekly meetings

  21. Text Analytics Development: Case Study 3 – POC – Government Agency • Demo of SAS – Teragram / Enterprise Content Categorization

  22. Conclusion • Enterprise Context – strategic, self knowledge • Importance of a good foundation • Importance of Taxonomy Structure – mapped to use • POC a head start on development • Importance of Text Analytics Vision / Strategy • Infrastructure resource, not a project • Balance of expertise and local knowledge • Importance of Usability for refinement cycles • Difference of taxonomy and categorization • Concepts vs. text in documents

  23. Questions? • Tom Reamy, tomr@kapsgroup.com • KAPS Group, Knowledge Architecture Professional Services • http://www.kapsgroup.com
