ATIA 6.2 Preliminary Design Review
1 July 2010
United States Army Combined Arms Center

Agenda
• Requirements
• Function Points
• Natural Language Processor
• Relevancy Algorithm
• Search Personalization
• Document Population
• Search Web Service
• Deployment / Maintenance Process
• Milestones
• Q&A
Requirements
• This is an extension of the existing ATIA requirements
• Current requirements documentation is maintained under CM in project forge
ATIA Service Oriented Architecture
[Figure: diagram of the ATIA Web Services/SOA “Cloud”. Labels recovered from the diagram; the grouping below is approximate.]
• GUIs: ATIA Admin GUI, ATIA Catalog GUI, MCRM (Doc Mgr) GUI
• Web services: Security, LMS, Search, Repository, Mint, Publication, Register, Performance, ProductList, CATS, Profile, Logging, SIS, Content, Generate
• Security Services – SSO and B2B
• Other systems and data stores: RITMS, DTMS, PDM, TDC, ASIS, RITM/IE, BlackBoard LMS, Atlas Pro LMS, ALMS, Interface Engine, ATRRS, Magnolia Content Mgmt Sys, Jackrabbit Content Mgmt Sys, Other Sys, Semantic Triple Store, Oracle DB, ATIA ILMS, RECBASS, ARISS
• Legend labels: ATIA 6.2, Complete, Related Data Source, Future Dev
ATIA 6.2 - Integrated Index and Search
• ATIA 6.2 will provide a replacement for Generate-WS and a search algorithm module
• These changes provide several added features over ATIA 6.1
  • Centralized datastore for registration and indexing
  • Consistent search and relevance
  • Control over semantic term space
  • Customization of search/relevance algorithms
  • Government will own Generate-WS source code
ATIA 6.1 – Eduworks ACE
• Eduworks ACE performs
  • Extract meta-data (Generate-WS)
  • Star-tree relevancy (SearchEdu-WS)
  • Relevancy between search terms and documents (SearchEdu-WS)
• Limitations
  • Unable to add/subtract key words/phrases
  • I/O intensive requests for relevancy
  • Duplication of data across 2 data stores
  • Relevancy inconsistencies between triple store and Eduworks
  • Proprietary and reaching end-of-life
ATIA 6.2 – Key Tasks
• Building a new search algorithm module
• Catalog Search
  • Calculate relevancy between search terms and documents inside catalog using the new search algorithm module
• Generate-WS
  • Support text extraction of common file formats
  • Utilize Natural Language Processor Algorithm to identify most relevant terms from extracted data
  • Store metadata in triplestore
  • Pre-computation of values used in relevancy algorithm
  • Asynchronous call from Publication-WS and Generate-WS Client
Generate-WS Architecture
• Phased implementation – Phase 1: Wrap eduWorks Generate-WS with our implementation
• Asynchronous call to Generate-WS from Publication-WS, directly posting messages to the JMS Queue
[Figure: Phase 1 diagram. Labels: Generate-WS Client (Optional Asynchronous), Publication-WS (Asynchronous Call), JMS Queue, Generate-WS, Generate-WS (eduWorks), Triple Store, Content-WS, Register-WS]
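The asynchronous hand-off described on this slide can be illustrated with a minimal sketch. It uses Python's in-process `queue.Queue` as a stand-in for the JMS Queue; the function names (`publish`, `generate_worker`, `run_demo`) and the message shape are hypothetical and are not taken from the ATIA code base.

```python
import queue
import threading

# Stand-in for the JMS Queue (assumption: a real deployment would use a JMS broker).
generate_queue = queue.Queue()
# Stand-in for writes to the triple store.
processed = []


def generate_worker():
    """Generate-WS side: consume messages until a None sentinel arrives."""
    while True:
        message = generate_queue.get()
        if message is None:
            generate_queue.task_done()
            break
        # Placeholder for metadata generation (text extraction, key terms).
        processed.append({"doc_id": message["doc_id"], "status": "generated"})
        generate_queue.task_done()


def publish(doc_id):
    """Publication-WS side: post a message and return without waiting."""
    generate_queue.put({"doc_id": doc_id})


def run_demo():
    """Publish two documents, then stop the worker and return its results."""
    worker = threading.Thread(target=generate_worker)
    worker.start()
    for doc_id in ("doc-1", "doc-2"):
        publish(doc_id)
    generate_queue.put(None)  # sentinel: no more work
    worker.join()
    return processed
```

The point of the design is that `publish` returns as soon as the message is queued, so publication is not blocked by metadata generation.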
Generate-WS Architecture
• Phased implementation – Phase 2: Eliminate eduWorks
• Provide our own relevancy and key words, keeping the interface the same
[Figure: Phase 2 diagram. Labels: Generate-WS Client (Optional Asynchronous), Publication-WS (Asynchronous Call), JMS Queue, Generate-WS (New Implementation), Triple Store, Content-WS, Register-WS]
Natural Language Processing Framework
• ATIA 6.2 will implement a text processing framework with the following key features
  • Ability to integrate new natural language processors, e.g. processors for non-English documents
  • Flexibility to process new file formats besides common file formats if desired
  • Multithreaded processing pipelines
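One way to picture a framework with pluggable processors and multithreaded pipelines is the sketch below. It is illustrative only: `WhitespaceProcessor`, the `PROCESSORS` registry, and the document dictionary shape are assumptions made for the example, and the tokenizer is a trivial placeholder rather than a real natural language processor.

```python
from concurrent.futures import ThreadPoolExecutor


class WhitespaceProcessor:
    """Hypothetical processor: lower-cases text and splits it on whitespace."""

    def process(self, text):
        return text.lower().split()


# Registry keyed by language code; a processor for another language
# would be added here without changing the pipeline code.
PROCESSORS = {"en": WhitespaceProcessor()}


def process_document(doc):
    """Pick the processor for the document's language and run it."""
    processor = PROCESSORS[doc["lang"]]
    return doc["id"], processor.process(doc["text"])


def run_pipeline(docs, workers=4):
    """Process documents on a thread pool; returns {doc id: list of terms}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_document, docs))
```

For example, `run_pipeline([{"id": "a", "lang": "en", "text": "Ranger Handbook"}])` maps document `a` to its lower-cased terms.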
Selection of Natural Language Processors
• ATIA 6.2 will use OpenNLP as an open source library for text data extraction
  • Java & Open Source
  • Flexibility to modify for our needs
  • Easy-to-use Java API
  • Decent size user base
  • High accuracy on sentence segmentation
  • Ability to train with customized models
  • Less effort to conduct training
Increasing Relevancy
• Increasing search relevancy will require a new implementation of the search algorithm
• The relevancy algorithm will be used by both the catalog search and Generate-WS to give consistent results
• Relevance will be calculated inside the catalog by communicating directly with the AllegroGraph triplestore
• The relevance algorithm will apply cosine-similarity methods to our RDF ontology
• The new Generate-WS will integrate directly with the relevance algorithm and use the AllegroGraph triplestore as its backend data store
Cosine Similarity
• User enters “ranger handbook” into the search box and the search returns documents A and B.
• Documents A and B and the query Q are plotted as vectors in the semantic space.
• The term “ranger” has more weight in the query because “handbook” is so common in the catalog. This is shown by the query vector, which sits more than 45° above the “handbook” axis.
• Documents A and B both emphasize “ranger”, but Document A has a higher relative emphasis on “ranger” than Document B.
• Relevance to the query produces a different angle for each document.
• In cosine similarity, a smaller angle between a document and the query indicates higher relevance. (The computation is performed for every search result (docs C, D, E, F, ...) to sort all by relevance.)
[Figure: vectors Doc A (more “ranger” emphasis than Doc B), Query Q, and Doc B, plotted against “ranger” and “handbook” axes]
Increasing Relevancy with User Profile
• Research Increasing User Relevancy
  • Utilize CAC, MOS, Job Series
  • Log Search & utilize ‘hit’ counts
• Integrate Closely with new Relevancy Algorithm
  • Index PDM to create job series ‘document’
  • Based on user AKO supplied MOS/AOC/job series
  • Include this ‘document’ in the cosine similarity calculations
[Figure: vectors Doc A (more “ranger” emphasis than Doc B), Query Q, and Doc B, plotted against “ranger” and “handbook” axes]
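The slide does not say how the job series ‘document’ enters the calculation. One possible reading, sketched below, blends the query-to-document similarity with a profile-to-document similarity. The blending rule, the `profile_weight` parameter, and the function names are assumptions made for illustration, not the design described here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term-weight dicts; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def personalized_score(query, profile, doc, profile_weight=0.3):
    """Blend query relevance with relevance to the user's job series 'document'.

    profile_weight is a hypothetical tuning knob: 0 ignores the profile.
    """
    return ((1 - profile_weight) * cosine(query, doc)
            + profile_weight * cosine(profile, doc))
```

With this rule, two documents that match the query equally well are separated by how closely each matches the user's job series terms.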
User Feedback
• How can we add tagging by users? Rate me 1-5 stars, higher<->lower, ... to increase relevancy
• Need this slide expanded
Document Manager Improvements
• Provide form for uploading multiple related documents
  • Album style upload
  • Required Entry Title
  • User can set matching metadata across all documents
  • User can set individual metadata for each document
• Provide form for editing multiple related documents
  • User can set matching metadata across all documents
Data Collection Improvements
• Spidering Issues
  • Password Protected Repositories
    • AKO
    • SharePoint
  • Depth
Embed RDF in Results
• Resource Description Framework (RDF) is a standard model for data interchange on the Web
• RDF metadata provides a scalable way to present catalog item data
• Catalog HTML pages should contain RDF metadata
• Catalog XML data lists should be provided in RDF format
• Allows other triple stores to interpret our data
Search WS
• Allows developers to add the Catalog Search to their website
• Provides access to catalog search results in various formats
• Provides search customization to limit search results
• Will enable the creation of multiple catalog gadgets
  • News feed style gadget that displays New or Obsolete Documents
  • Popular Documents
  • Documents that could be useful to the user
Rich Site Summary
• Provides RSS feed of recent catalog activity
Deployment Process
• ATIA 6.2 is an extension of the ATIA 6.1 clusters
Maintenance Process
• High Availability (PLC)
• Backup (will CommVault work with AllegroGraph?)
• Clean Stop/Stop
• Re-aiming links
• Fix links/data (rollback)
Replace Glassfish
• Sun fees on Glassfish usage
Weighting Term Relevance
• Relevancy weight, w, between a document, d, and a term, t, will be determined from two factors
  • tf, term frequency, is a function of the number of occurrences of the term (phrase) in the document
  • idf, inverse document frequency, is a function of the number of documents that the term appears in; idf is used to reduce the relevance weight of terms which occur across many documents
• Note that the term, t, may not be the same as the search input query. Search queries can be long and are treated as documents themselves, with a weight calculated for the terms appearing within them.
• Cosine Similarity is the formula for determining relevance between two documents
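The exact tf and idf functions are not given in the text. A minimal sketch, assuming the common textbook choice of raw term counts for tf and log(N / df) for idf, with the weight taken as their product, looks like this. The function name `tf_idf_weights` and the input shape are hypothetical.

```python
import math
from collections import Counter


def tf_idf_weights(documents):
    """Compute w(t, d) = tf(t, d) * idf(t) for every term in every document.

    documents: dict mapping doc id -> list of terms.
    Assumptions: tf is the raw count, idf is log(N / df).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))  # count each term once per document

    weights = {}
    for doc_id, terms in documents.items():
        counts = Counter(terms)
        weights[doc_id] = {
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights
```

Under these assumptions a term that appears in every document, such as a very common "handbook", gets idf = log(1) = 0 and contributes nothing to relevance, which is the damping effect the slide describes.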
Cosine Similarity of Documents
• Cosine Similarity is the formula for determining relevance between two documents, A and B.
• By treating queries as documents, this formula is used to determine relevance between
  • two documents
  • document and search query
  • search query and related terms
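For reference, the standard definition of cosine similarity is given below, where A_i and B_i are assumed to be the relevancy weights of term i in documents A and B, and n is the number of distinct terms:

```latex
\[
\operatorname{sim}(A, B) = \cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i=1}^{n} A_i B_i}
         {\sqrt{\sum_{i=1}^{n} A_i^{2}} \; \sqrt{\sum_{i=1}^{n} B_i^{2}}}
\]
```

With non-negative weights the value lies between 0 (no shared terms) and 1 (identical direction).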
Cosine Similarity Example
• Cosine similarity can be demonstrated on a two-dimensional chart when there are two search terms
• In this case, the query Q = “ranger handbook” and documents A and B are relevant.
• The weight of “handbook” is plotted on the horizontal axis and the weight of “ranger” on the vertical axis
• Although “ranger” and “handbook” each appear in Q exactly once, the terms may not have equal weight in the query because of the idf
• The cosine similarity is a function of the angle between a document and the query Q
• Although the endpoints of the vectors Q and B are closer together than the endpoints of Q and A, the cosine similarity of Q and A is stronger
• This is because Q has a stronger emphasis on “ranger” and so does document A
[Figure: vectors A, Q, and B plotted with “handbook” on the horizontal axis and “ranger” on the vertical axis]
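The example can be reproduced numerically. The coordinates below are illustrative values chosen to match the description (they are not taken from the chart): B's endpoint is closer to Q than A's, yet A makes the smaller angle with Q.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Vectors are (handbook weight, ranger weight); values are assumptions.
Q = (1.0, 2.0)   # query: "ranger" outweighs "handbook" because of idf
A = (2.0, 5.0)   # strong relative emphasis on "ranger"
B = (1.5, 2.0)   # weaker relative emphasis on "ranger"


def rank(query, docs):
    """Return document names sorted from most to least similar to the query."""
    return sorted(docs,
                  key=lambda name: cosine_similarity(query, docs[name]),
                  reverse=True)
```

Calling `rank(Q, {"A": A, "B": B})` places A ahead of B, even though the Euclidean distance from Q to B is the smaller of the two.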
Applying Cosine Similarity to RDF Ontologies
• The cosine similarity measure is a common method for determining text-based relevance
• In our RDF triplestore we will use cosine similarity to determine relevance between RDF resources
• This is accomplished by using definitions of tf and idf based on the RDF predicates that match our search query
Performance considerations
• The performance of this relevance algorithm applied to a triplestore is hindered by the lack of scalar functions in SPARQL.
• The mitigation is
  • Pre-computation of partial values
  • Refactoring of the algorithm code to reduce roundtrips to the triplestore
• Pre-computation during registration will require additional code in the register-WS
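As an illustration of pre-computing partial values, the sketch below stores each document's vector norm at registration time, so a search only has to compute dot products. It uses an in-memory dict in place of the triplestore, and the `RelevanceIndex` class and its method names are hypothetical; the slides do not specify which values would be pre-computed.

```python
import math


class RelevanceIndex:
    """In-memory stand-in for the triplestore-backed index (assumption)."""

    def __init__(self):
        self.weights = {}  # doc id -> {term: weight}
        self.norms = {}    # doc id -> pre-computed vector norm

    def register(self, doc_id, term_weights):
        """Registration step: store weights and pre-compute the norm once."""
        self.weights[doc_id] = dict(term_weights)
        self.norms[doc_id] = math.sqrt(sum(w * w for w in term_weights.values()))

    def search(self, query_weights):
        """Rank documents by cosine similarity, reusing the stored norms."""
        q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
        scores = {}
        for doc_id, weights in self.weights.items():
            dot = sum(w * weights.get(term, 0.0)
                      for term, w in query_weights.items())
            denom = q_norm * self.norms[doc_id]
            scores[doc_id] = dot / denom if denom else 0.0
        return sorted(scores, key=scores.get, reverse=True)
```

The trade-off is extra work and storage at registration in exchange for less computation, and fewer roundtrips, per search.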
Current Document Population Methods
• Document Manager
  • One document at a time
• Batch
  • Spreadsheet
  • Ziptool