SCORE
Presentation Overview • Industry Requirements • Capabilities • System Architecture and Technologies • Examples and Scenarios • Measures (Quality, Performance, Scalability, Robustness) • Deployment Information • Questions & Answers: What if • Business Development Issues • Milestones and Schedules
Intelligence Content Management Challenges • The Problem: massive, disparate information • Multiple isolated sources of intelligence information (FBI, CIA, etc.) that are not shared or integrated • Large variety (format, media) of open source, partner, FAA and IC information • The Difficulty: inability to obtain timely, actionable information • Amount of data too overwhelming to use constructively • Manual methods of aggregating data are not scalable • => Lack of a “complete picture” to make decisions • Inability to reach timely, accurate and actionable conclusions based on the information at hand • The Solution: Voquette’s Semantic Technology • Technology to analyze and integrate data from disparate sources, providing a near-real-time, reliable, scalable and actionable solution for intelligence and security applications
New Technical Challenges in Enterprise Content Management • Aggregation • Feed handlers/agents that understand content representation and media semantics • Push and pull; Web, database and file sources; structured, semi-structured and unstructured data of different types from proprietary, partner and open sources • Homogenization and Enhancement • Enterprise-wide common and customizable view (information organization) • Domain model, taxonomy/classification, metadata standards • Semantic metadata, created automatically where possible • Semantic associations/inferences (link analysis) • Semantic Applications (in near real time) • Search, personalization, alerts, knowledge browsing/inference for improved relevance, intelligent personalization and customization
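The aggregation and homogenization steps above can be pictured with a minimal sketch: feed handlers that understand their own source's representation and map records onto one normalized asset schema. The class and field names below are illustrative assumptions, not Voquette's actual interfaces.

```python
# Sketch only: feed handlers mapping heterogeneous records onto one schema.
from dataclasses import dataclass, field

@dataclass
class Asset:
    """Normalized content asset shared by all downstream components."""
    source: str
    body: str
    metadata: dict = field(default_factory=dict)

class RssFeedHandler:
    def normalize(self, item: dict) -> Asset:
        # Map feed-specific tag names onto the common schema.
        return Asset(source="rss", body=item.get("description", ""),
                     metadata={"title": item.get("title"),
                               "date": item.get("pubDate")})

class DatabaseHandler:
    def normalize(self, row: dict) -> Asset:
        return Asset(source="db", body=row.get("text", ""),
                     metadata={"title": row.get("headline"),
                               "date": row.get("published_on")})
```

Downstream components (classification, metadata enhancement, semantic applications) then see only the homogenized asset, regardless of where the content came from.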
Voquette’s Unique Capabilities • Semantics (understanding of content and user needs) • Extreme relevance • Knowledge inferencing (semantic associations) • Near real-time • Multiple applications/usage patterns (not just search) • Automation • Scalability in all aspects
Voquette Semantic Technology System Architecture • WorldModel specifies the enterprise’s normalized view of information (ontology) • Knowledgebase represents the real-world instantiation (entities and relationships) of the WorldModel • Toolkit to design and maintain the Knowledgebase • Distributed agents that automatically extract/mine knowledge from trusted sources • Distributed agents that automatically extract relevant semantic metadata from structured and unstructured content • CACS provides automatic classification (w.r.t. the WorldModel) from unstructured text and extracts contextually relevant metadata • Fast main-memory based query engine with APIs and XML output
Workflow Process • WorldModel™ (Domain Model), taxonomy/classification, knowledge base schema • Classifiers • Knowledge and content extraction agents • Automated or human-supervised run-time (for classification and metadata enhancement, knowledge base maintenance) • Semantic Applications • All components support incremental extensions.
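Assuming the components above expose classify, extract and enhance operations (an assumption for illustration; the real interfaces are not described here), the workflow order can be sketched as a single pipeline:

```python
# Sketch of the workflow order: classify, extract, enhance, then index.
# All component interfaces are placeholders.
def process(asset, world_model, classifier, extractor, knowledge_base, index):
    # 1. Classify the asset against the WorldModel taxonomy.
    category = classifier.classify(asset.body)
    # 2. Extract domain-specific metadata for that category.
    metadata = extractor.extract(asset.body, world_model.attributes(category))
    # 3. Enhance the metadata with related facts from the knowledge base.
    metadata.update(knowledge_base.enhance(metadata))
    # 4. Index the enhanced asset for semantic applications (search, alerts).
    asset.metadata.update({"category": category, **metadata})
    index.add(asset)
    return asset
```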
Technological Innovation • Semantic approach (classification/taxonomy, domain model, entities and relationships) [All components] • Semantic associations/knowledge inferences • Classification committee (multiple technologies, rather than one size fits all) [CACS] • Scalability throughout, with distributed architecture and implementation (number of content and knowledge sources, indexing, etc.) • Main-memory implementation, incremental checkpointing [SSE]
Example. Domain: Intelligence; Sub-domain: People, Org, Places (other sub-domains: Financing, Methods & Training, Materials)
Intelligence WorldModel™ • What is it? WorldModel™: Template infrastructure to organize and index content contextually • What does it consist of? Domains (categories) and domain-specific attributes, with geo-spatial and temporal info • Terrorism Intelligence WorldModel™ (simplified) attributes: Group, Person, Event, Bank, AttackMaterial, NameAlias, AliasEmailAddress, Location, Time • Setting up a Terrorist Intelligence WorldModel™: what are the information pieces of possible interest (that can be modeled as WorldModel™ attributes)? • Groups: Nationalist, Terrorist, Political groups • Person: Terrorist, Suicide Bomber, Hijacker, Personality • Event: Flight hijacking, WTC Crash, Kidnapping, Terrorist training • Bank: Swiss bank, Belgian bank (where groups have accounts) • AttackMaterial: Knives, Plastic Explosives, RDX, AK47 Gun • NameAlias: Aliases of terrorists (Osama BL = Usama BL) • AliasEmailAddress: Email addresses for alias names • Location: Location related to the event of interest • Time: Date/time related to the event of interest
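As a rough illustration, the simplified Terrorism Intelligence WorldModel™ above can be represented as a domain plus a set of attribute types; the Python structure below is an assumed encoding for illustration, not the product's file format.

```python
# Assumed encoding of the simplified WorldModel(TM): a domain with named
# attribute types (real WorldModels also carry taxonomy and geo-spatial/
# temporal structure).
TERRORISM_WORLDMODEL = {
    "domain": "Intelligence",
    "sub_domain": "Terrorism",
    "attributes": [
        "Group", "Person", "Event", "Bank", "AttackMaterial",
        "NameAlias", "AliasEmailAddress", "Location", "Time",
    ],
}

def worldmodel_attributes(world_model: dict) -> list:
    """Attribute types an extractor agent should look for."""
    return world_model["attributes"]
```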
Intelligence Extractor Agents • What is it? Extractor Agents: Intelligent software robots that work on structured content and automatically extract metadata information that is relevant and meaningful to the domain/sub-domain at hand • How do they work? Intelligence extractor agents use the Intelligence WorldModel™ definition for meaningful metadata extraction from trusted Intelligence content • Extractor agents exploit the structure of Intelligence content and automatically “pick up” meaningful Intelligence metadata information (as defined in the WorldModel™) • Example (Extractor Agent for CIA confidential content, against the Terrorism Intelligence WorldModel™): pick up syntax metadata, group name, person, attack material, bank name, location/date/time and name aliases; metadata extracted
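A minimal sketch of such an agent, assuming a hypothetical report layout with labelled fields (the labels and the regular expression are illustrative only, not the patented agent technology):

```python
import re

# Hypothetical extractor agent for structured reports whose fields appear
# as "Group: ...", "Person: ...", etc., one per line.
FIELD_PATTERN = re.compile(
    r"^(Group|Person|Event|Bank|AttackMaterial|Location|Time):\s*(.+)$",
    re.MULTILINE)

def extract_structured(report_text: str) -> dict:
    """Pick up WorldModel(TM) attributes from a labelled, structured report."""
    metadata = {}
    for attribute, value in FIELD_PATTERN.findall(report_text):
        metadata.setdefault(attribute, []).append(value.strip())
    return metadata

# extract_structured("Group: Al Qaeda\nPerson: Bin Laden\nLocation: Kabul")
# -> {"Group": ["Al Qaeda"], "Person": ["Bin Laden"], "Location": ["Kabul"]}
```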
Intelligence Knowledge Base • What is it? Knowledge Base: Network of Intelligence objects (significant pieces of information) and a representation of the real-world relationships (associations) between them • Intelligence Knowledge Base definition (relationship types): has alias, has email, is funded by/works with, works for/leads, account in, involved in, originated in, occurred at (Location), occurred at (Time) • Example associations: “Bin Laden” involved in “WTC Crash”; “Al Qaeda” originated in “Afghanistan”; “Bin Laden” has alias “Mohammed”; “WTC Crash” occurred at “New York, USA”; “Mohammed” has email “mohd@un.com”; “Al Qaeda” has accounts in “Swiss bank”; “WTC Crash” occurred at “0903, 9/11/01”; “Irish IRA” works with “Columbian Group”; “Bin Laden” leads “Al Qaeda”; “Nabil Almarabh” works for “Al Qaeda”
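One simple way to picture the knowledge base is as (subject, relation, object) triples mirroring the associations listed above; the storage model here is an assumption for illustration, not the actual implementation.

```python
# Sketch: associations from the slide as triples, plus a lookup that
# returns everything directly associated with an entity.
KB = [
    ("Bin Laden", "involved in", "WTC Crash"),
    ("Al Qaeda", "originated in", "Afghanistan"),
    ("Bin Laden", "has alias", "Mohammed"),
    ("WTC Crash", "occurred at", "New York, USA"),
    ("Al Qaeda", "has accounts in", "Swiss bank"),
    ("Bin Laden", "leads", "Al Qaeda"),
    ("Nabil Almarabh", "works for", "Al Qaeda"),
]

def neighbors(entity: str) -> list:
    """Entities directly associated with `entity`, with the relation."""
    out = [(rel, obj) for subj, rel, obj in KB if subj == entity]
    out += [(rel, subj) for subj, rel, obj in KB if obj == entity]
    return out

# neighbors("Al Qaeda") -> [("originated in", "Afghanistan"),
#                           ("has accounts in", "Swiss bank"),
#                           ("leads", "Bin Laden"),
#                           ("works for", "Nabil Almarabh")]
```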
Categorization and Auto-Cataloging System (CACS) • What is it? CACS: Module that categorizes content and automatically creates metadata for it • How does it work? Uses a hybrid of statistical, machine learning and Intelligence knowledge-base techniques • Application in Intelligence: CACS can be trained to intelligently process structured or unstructured Intelligence content, classify a content piece as a terrorism-related event (WTC Crash, Flight hijacking, etc.) and exchange information with the Knowledge Base for metadata creation • Example metadata extracted: Event: Pentagon Attack; Terrorist Group: Al Qaeda; Person: Bin Laden; Location: Washington, USA; Time: 0918 hrs; Affiliation Country: Afghanistan; Allied Group: Saudi Misaal; Person Alias: Mohammed
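The hybrid approach can be sketched as a small "committee" in which a statistical classifier and a knowledge-base keyword rule each cast a vote; the use of scikit-learn and the toy training data are assumptions for illustration, not a description of CACS internals.

```python
# Sketch of a classification committee: statistical model + KB keyword rule.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["hijackers boarded the flight", "explosives found in the truck",
               "quarterly earnings beat estimates", "the stock rose on an upgrade"]
train_labels = ["Terrorism Event", "Terrorism Event", "Company News", "Company News"]

statistical = make_pipeline(TfidfVectorizer(), MultinomialNB())
statistical.fit(train_texts, train_labels)

KB_TERMS = {"Terrorism Event": {"hijacking", "bomb", "attack", "explosives"}}

def classify(text: str) -> str:
    votes = [statistical.predict([text])[0]]           # statistical vote
    for label, terms in KB_TERMS.items():
        if any(term in text.lower() for term in terms):
            votes.append(label)                         # knowledge-base vote
    return max(set(votes), key=votes.count)             # simple majority
```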
Intelligence Semantic Engine • What is it? Semantic Engine: Fast main-memory based front-end query engine that enables the end user to retrieve highly relevant and personalized content via custom APIs • Features and Functionality • Minimal input from the security agent; the system is intelligent enough to provide all possible relevant content to the security agent (type in “Bin Laden” and get all relevant information on him and other items related to him) • Applications: Search, personalization, alerts, notifications, directory, intelligent inference, analyst workbench, custom apps • Flow: user query submitted → Semantic Engine (over content enhanced by the technology above) → highly relevant content returned to the agent/analyst
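The "type in Bin Laden and get everything related to him" behaviour amounts to expanding a query with knowledge-base associations before hitting the index; a minimal sketch, assuming the triple store above and an assumed entity-to-documents index (a dict of entity names to document ids):

```python
# Sketch of semantic query expansion: results for the entity itself plus
# results for entities the knowledge base says are associated with it.
def semantic_search(entity, index, kb_neighbors):
    """index: {entity: [doc ids]}; kb_neighbors: entity -> [(relation, entity)]."""
    results = {entity: list(index.get(entity, []))}
    for relation, related in kb_neighbors(entity):
        if related in index:
            results[f"{relation}: {related}"] = list(index[related])
    return results

# semantic_search("Bin Laden", index, neighbors) would return documents
# tagged "Bin Laden" plus documents about Al Qaeda, the WTC Crash, the
# alias "Mohammed", and so on.
```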
Scenario 1: Intelligent Analysis of Confidential Email
Scenario 1: Intelligent Analysis of Email (Contd.) • Information underlined in blue consists of important metadata elements automatically picked up by the Intelligence extractor agents • Information shown in red boxes consists of names of terrorists (stored in our Knowledge Base) that are also automatically picked up by the Intelligence extractor agents • CACS can determine by content analysis that this is “Terrorist Meeting” information • Intelligent inferencing is possible due to the semantic associations of the Knowledge Base • Example: “Mohamed Atta met with Abdulaziz Alomari” is picked up from an explicit mention in the email; the Voquette Knowledge Associations show that Mohamed Atta works for Al Qaeda and Abdulaziz Alomari works for Saudi Misaal • Inference: Al Qaeda and Saudi Misaal have possibly started working together as allied groups • Since Al Qaeda originated in Afghanistan and Saudi Misaal originated in Saudi Arabia, a further inference is that Afghanistan and Saudi Arabia have groups that probably collaborate; look for other relationships
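The inference step above can be sketched as a rule over knowledge-base associations: when two persons are observed meeting, surface a candidate link between the groups they work for. The rule below is an illustrative assumption and produces a hypothesis for an analyst, not an asserted fact.

```python
# Sketch of a human-assisted inference rule over "works for" associations.
def infer_group_link(person_a, person_b, kb_triples):
    groups_a = {o for s, r, o in kb_triples if s == person_a and r == "works for"}
    groups_b = {o for s, r, o in kb_triples if s == person_b and r == "works for"}
    return [(ga, "possibly allied with", gb)
            for ga in groups_a for gb in groups_b if ga != gb]

# Given "works for" triples for Mohamed Atta (Al Qaeda) and Abdulaziz
# Alomari (Saudi Misaal), the rule yields
# [("Al Qaeda", "possibly allied with", "Saudi Misaal")],
# a candidate association an analyst can confirm or discard.
```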
Scenario 2: Analyst Workbench • Voquette’s Semantic Technology enables highly relevant and comprehensive terrorist research • Example: A security agent wishes to perform research on “Bin Laden” (as he is the prime suspect) • News/Information directly about Bin Laden is retrieved (mentions his name explicitly) • News/Information on Al Qaeda is retrieved (Bin Laden / Al Qaeda association in the KB) • News/Information on the WTC Crash is retrieved (WTC Crash / Bin Laden association in the KB) • News/Information on Mohammed is retrieved (Mohammed / Bin Laden alias association in the KB) • News/Information (intelligence) on Afghanistan is retrieved (Al Qaeda / Afghanistan association in the KB) • News/Information (intelligence) on the Swiss bank is retrieved (Al Qaeda / Swiss bank association in the KB) • Combined, this correlated information is extremely valuable in bringing together multiple actionable perspectives and points of view on one screen • Result: less time spent, faster and much better decision making, more security!
Knowledge Inferencing Workflow (diagram): syntax metadata → semantic metadata → knowledge inferencing, including human-assisted inference (e.g., “same entity”, “led by” associations)
Analyst Usage Scenarios/Interfaces for Knowledge Inference Analysts can possibly use: • Search • Knowledge Base Browser / Directory • Personalization/Alerts • APIs for custom applications All options support Reference Pages, Semantic Associations, Knowledge-based browsing
Intelligence Analyst Browsing Scenario
Core Competencies of Voquette’s Semantic Technology • Content Aggregation, Integration and Normalization • Create a Customized WorldModel™ (domain model with customized domain attributes) • Content Aggregation and integration from multiple sources, formats and media (text/audio/video) • Support push or pull delivery/ingestion of content • Patented extractor agent technology • Metadata extraction from structured, semi-structured and unstructured text (fully automated) • Automatically homogenize content feed tags (fully automated) • Categorization and Auto-Cataloging • Automatically categorize structured and unstructured text • Create contextually relevant semantic metadata from unstructured text (fully automated) • Uniquely uses a hybrid of statistical, machine learning and knowledge-base techniques for classification
Core Competencies of Voquette’s Semantic Technology • Content Enhancement using Knowledge Base • Create and maintain a customized Knowledge Base for any domain • Automatically create content tags based on the text itself (fully automated) • Automatically enhance content tags based on information outside of the text (fully automated) by exploiting the Knowledge Base • Provide the end user not only the relevant content he asked for, but also relevant content that he did not explicitly ask for but needs to know • Semantic Engine • Fast, main-memory based Semantic Engine • Response time on the order of tens of milliseconds • Performance: 1 million queries per hour per server • Real-time indexing (stories indexed for search/personalization within a minute) • Near real-time search/personalization of new content and breaking news • Information retrieval based on quality, not quantity • Semantic Applications: Search, Directory, Personalization, Alerts, Notifications, Custom enterprise applications
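The sub-minute indexing and millisecond query times rest on keeping the index in main memory and updating it incrementally; the data structure below is a generic sketch of that idea, not the Semantic Engine's internals.

```python
# Sketch: main-memory inverted index with incremental additions.
from collections import defaultdict

class InMemoryIndex:
    def __init__(self):
        self.postings = defaultdict(set)   # tag/entity -> set of doc ids

    def add(self, doc_id: str, tags: list) -> None:
        """Index one new asset immediately; no batch rebuild required."""
        for tag in tags:
            self.postings[tag].add(doc_id)

    def query(self, tag: str) -> set:
        return self.postings.get(tag, set())

index = InMemoryIndex()
index.add("doc-1", ["Cisco Systems", "Company News"])
index.add("doc-2", ["Nortel Networks", "Company News"])
# index.query("Company News") -> {"doc-1", "doc-2"}
```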
SCORE Implementation Architecture • WorldModel specifies the enterprise’s normalized view of information (ontology) • Knowledgebase represents the real-world instantiation (entities and relationships) of the WorldModel • Toolkit to design and maintain the Knowledgebase • Distributed agents that automatically extract/mine knowledge from trusted sources • Distributed agents that automatically extract relevant semantic metadata from structured and unstructured content • CACS provides automatic classification (w.r.t. the WorldModel) from unstructured text and extracts contextually relevant metadata • Fast main-memory based query engine with APIs and XML output
Example. Domain: Financial Services; Sub-domain: Equity Market (other potential sub-domains: Fixed Income, Mutual Funds, …)
Content Enhancement Workflow (diagram): from syntax metadata to semantic metadata
Content Asset Index Evolution • Stage 1: the Extractor Agent for Bloomberg scans the text, extracts metadata automatically and creates an asset (index) out of the extracted metadata. Syntax metadata: Producer: BusinessWire; Source: Bloomberg; Date: Sept. 10 2001; Location: San Jose, CA; URL: http://bloomberg.com/1.htm; Media: Text. Semantic metadata: Company: Cisco Systems, Inc. • Stage 2: the Categorization & Auto-Cataloging System (CACS) scans the text, classifies the document into a pre-defined category/topic and appends topic metadata to the asset (Topic: Company News) • Stage 3: the Knowledge Base is leveraged to enhance metatagging (e.g., John Chambers is CEO of Cisco Systems; Cisco Systems competes with Nortel Networks; CSCO trades on NASDAQ), and the enhanced content asset is indexed by the Semantic Engine with Ticker: CSCO; Exchange: NASDAQ; Industry: Telecomm.; Sector: Computer Hardware; Executive: John Chambers; Competition: Nortel Networks; Headquarters: San Jose, CA
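The three-stage evolution of the Cisco asset can be traced in a small sketch where each stage adds to the same metadata record; the facts mirror the slide, while the dictionary layout is an illustrative assumption.

```python
# Stage 1: extractor agent captures syntax metadata plus the company name.
asset = {
    "Producer": "BusinessWire", "Source": "Bloomberg", "Date": "Sept. 10 2001",
    "Location": "San Jose, CA", "Media": "Text",
    "Company": "Cisco Systems, Inc.",
}

# Stage 2: CACS classifies the document and appends the topic.
asset["Topic"] = "Company News"

# Stage 3: knowledge base enhancement adds tags not present in the text.
KB_COMPANY = {
    "Cisco Systems, Inc.": {"Ticker": "CSCO", "Exchange": "NASDAQ",
                            "Industry": "Telecomm.", "Sector": "Computer Hardware",
                            "Executive": "John Chambers",
                            "Competition": "Nortel Networks",
                            "Headquarters": "San Jose, CA"},
}
asset.update(KB_COMPANY[asset["Company"]])
# The enhanced asset is then indexed by the Semantic Engine.
```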
Voquette WorldModel™ • What is it? WorldModel™: Template infrastructure to organize and index content contextually • What does it consist of? Domains (categories) and domain-specific attributes • Example (Equity WorldModel™): Domain: Equity; Equity-specific attributes: Company, Ticker, Industry, Sector, Executive, Headquarters • Example (Sports WorldModel™): Domain: Sports; Sports-specific attributes: Sport Name, Location; Sub-domain Golf: Golfer, Tourney, Golf Course; Sub-domain Football: Player, Team, League, Coach
Voquette Extractor Agents • What is it? Extractor Agents: Intelligent software robots that work on structured content and automatically extract metadata information that is relevant and meaningful to the domain/sub-domain at hand • How do they work? Extractor agents use the WorldModel™ definition for metadata extraction • Extractor agents exploit the structure of content and automatically “pick up” meaningful metadata information • Write once, extract permanently; schedulable according to needs • Can work on Web content, feeds, XML, corporate databases, etc. • Extractor agents are specific to the structure of the content at hand • Example (Extractor Agent for CNNfn against the Equity WorldModel™): pick up syntax metadata, company, ticker, industry, sector, executives and headquarters
Voquette Knowledge Base • What is it? Knowledge Base: Network of entity objects (significant pieces of information) and a representation of the real-world relationships (associations) between them • What does it consist of? Entities (person, location, organization, etc.) and entity relationships • How does it work? Structured closely to the structure of the WorldModel™ • Entity and relationship template definitions for the domain at hand • Works with knowledge extractor agents to collect instances of entities from trusted sources • Automatically creates relationships between instances using type definitions • Equity Knowledge Base definition (example): Company is represented by Ticker, trades on Exchange, belongs to Industry and Sector, is located at Headquarters, competes with Competition, and has an Executive who is CEO of the Company (e.g., John Chambers is CEO of Cisco Systems; Cisco Systems belongs to the Telecomm. industry and the Computer Hardware sector; CSCO trades on NASDAQ; Cisco Systems competes with Nortel Networks; headquarters in San Jose)
Voquette Categorization and Auto-Cataloging System (CACS) • What is it? CACS: Module that categorizes content and automatically creates metadata for it • How does it work? Uses a hybrid of statistical, machine learning and knowledge-base techniques • Features • Core competency: not only categorizes, but also catalogs (extracts metadata) • Unique solution for semantic metadata extraction from unstructured content • Flexibly adaptable to diverse domains • Example (structured or unstructured content, with information exchange with the Equity Knowledge Base for metadata creation): Topic: Company News; Company: Convera; Ticker: CNVR; Exchange: NASDAQ; Industry: Content Management; Sector: Computer Software; Headquarters: Vienna, VA; Executives: Ronald Whittier
Voquette Semantic Engine • What is it? Semantic Engine: Fast main-memory based front-end query engine that enables the end user to retrieve highly relevant and personalized content via custom APIs • Features and Functionality • Minimal input from the user; the system is intelligent enough to provide only relevant content to the user • Deep levels of personalization • Applications: Search, personalization, alerts, notifications, directory, routing, syndication • Custom applications: Research Dashboard (demo) • Flow: user query submitted → Semantic Engine (over content enhanced by the technology above) → highly relevant content returned to end users via search, directory, alerts/notifications, syndication, dashboard or custom apps
Semantic Application Example: Research Dashboard • Automatic content aggregation from multiple content providers and feeds • Automatic 3rd-party content integration • Focused relevant content organized by topic (semantic categorization) • Related relevant content not explicitly asked for (semantic associations) • Competitive research inferred automatically
COMTEX Content Enhancement: Value-added Metatagging • Content “enhancement”: COMTEX tagging is limited (mostly syntactic); Voquette adds rich semantic metatagging • Value-added relevant metatags added by Voquette to existing COMTEX tags: private companies, type of company, industry affiliation, sector, exchange, company execs, competitors
COMTEX Content Enhancement: Tag Normalization • The Voquette Knowledge Base holds the canonical company name: Merrill Lynch & Co. • Source A document contains <company_name=Merrill Lynch, Inc.>; Source B document contains <company_name=Merrill Lynch Corp.> • After normalization, both documents carry <company_name=Merrill Lynch & Co.>
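Tag normalization reduces to looking variant names up against the knowledge base's canonical entry; a minimal sketch, with an assumed alias table for the Merrill Lynch example:

```python
# Sketch: map variant company-name tags onto the canonical KB name.
CANONICAL_NAME = {
    "Merrill Lynch, Inc.": "Merrill Lynch & Co.",
    "Merrill Lynch Corp.": "Merrill Lynch & Co.",
    "Merrill Lynch & Co.": "Merrill Lynch & Co.",
}

def normalize_tag(value: str) -> str:
    return CANONICAL_NAME.get(value, value)

# <company_name=Merrill Lynch, Inc.> and <company_name=Merrill Lynch Corp.>
# both become <company_name=Merrill Lynch & Co.> after normalization.
```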
Classification & Extraction Technology Comparisons (Contd.)
ROI Comparative Effort Chart (Activity: traditional effort vs. CET effort; comments) • Categorization of Web pages: 50 pages/day/editor vs. 1,000 pages/day with human supervision (at least an order of magnitude higher without supervision); much higher quality metadata generation, in addition to higher quantity • Metatagging of news feeds: 10-20 feeds (syntactic + semantic metadata) or 100 feeds (syntactic metadata) vs. 5,000-10,000 feeds/day (fully automatic); no human supervision needed • Metatagging of internal/enterprise research content: 50-100 assets/day/research editor vs. 500-1,000 assets with human supervision; human supervision supports higher quality metadata • Metatagging of content from multiple internal or external sources: content editors using internally developed tools typically manage 1 to 5 sources vs. a single person supervising automatic tagging of content from 20-50 sources
Deployment System Architecture • Toolkits (Workstation): Knowledge Base Toolkit, Extractor Toolkit • Enterprise S/W (Server): Categorization and Auto-Cataloging System, Semantic Engine, WorldModel™, Knowledge Base • Platforms: Linux/Solaris, NT (any system supporting a JVM) • Scales with more developers, more sources and additional servers for higher performance, redundancy and more content
Measures • Quality • Categorization accuracy: around 90% (domain and training dependent) • Metadata extraction: limited only by the WorldModel™ and KB (for which we have automated maintenance support) • Relevance: near 100% (unlike IR techniques, typical precision/recall limitations do not apply when we have metadata) • Scalability • Millions of documents per server (for the Semantic Engine) • Unlimited number of documents due to a distributed index seamlessly spanning multiple servers • Few to hundreds of content sources (distributed SW agents)
Measures (Continued) • Performance • Inclusion of a new content source: 2 to 8 hrs • Building the WorldModel™ and Knowledge Base: 2 to 8 weeks per domain for an effort leading to useful results (approx. 1 million entities and relationships) • Extraction: several documents per second (processing time) • Near real-time search/personalization of new content and breaking news (sub-minute, due to incremental indexing) • 1 million queries per hour per server, with 1 to 10s of ms query response/inference time due to main-memory indexing/data structures • Robustness • The Semantic Engine has not needed a reboot for over 400 days! • Many other engineering solutions (HW/SW redundancy) to meet any SLA
Quantitative Measures: Voquette vs. The Rest • Reading and classification (pages read and classified): average human 1 per minute, 60 per hour, 480 per day, 120,000 per year; Voquette 30 per minute, 1,800 per hour, 43,200 per day, 16 million per year • Reading, classification, metadata extraction, normalization and enhancement (pages read, classified, metadata extracted, normalized and enhanced): average human 1 per minute, 60 per hour, 480 per day, 120,000 per year; Voquette 600 - 10,000 per minute (batch mode), 36,000 - 600,000 per hour, 864,000 - 14.5 million per day, 315 million - 5.2 trillion per year
Quantitative Comparison (Continued): Voquette Semantic Engine & Knowledge Base Specs • Queries per hour per server: 1 million • Query response time (lightly loaded server): 1 to 10 ms • Query response time (heavily loaded server): 100 to 200 ms • Semantic associations created per hour: 10,000 • Semantic associations per domain: over 1 million