
  1. eTouch Systems Presents NASA Portal Search Implementation September 17th 2003

  2. Agenda • NASA Search Architecture • Indexing Statistics • Content Discovery • Relevance • Metadata • Robots Exclusion Standards • Browsable Categories • Recommendations

  3. NASA Search Architecture • The Global Traffic Manager (GTM) will load-balance cross-data-center traffic to the Verity search boxes.

  4. Content Discovery Cycle

  5. Content Discovery Statistics • Started off with 2600 domains (provided by NASA). • Around 1250 *.nasa.gov domains were found to be inaccessible. • Exclusion criteria discovered during manual cleansing: • 70 domain-level exclusions (provided by NASA), e.g. http://images.ksc.nasa.gov/ • ~2500 specific URL/directory-level exclusions, e.g. http://heasarc.gsfc.nasa.gov/listserv* is a mailing list. • File-type exclusions, e.g. *.old, *.map, *.spec, *.mod, *.log, etc. • 50K–60K documents excluded because of robots restrictions. • ~80 duplicate domains, e.g. aerospace.arc.nasa.gov & aeronautics.arc.nasa.gov. • Exclusion of text-only versions of documents as duplicate content. • Exclusion of documents in binary format. • Sample binary-format document. • 67K documents are larger than 1 MB or 0 bytes in size; of these, 11K fall within the included MIME types (i.e. PDFs/HTMLs/DOCs/XLSs/PPTs). • Sample 0-byte document. • Sample 1 MB document.
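
The file-type and URL exclusions above amount to simple pattern filters applied to the discovered URL list. Below is a minimal Python sketch of such a filter, using illustrative patterns taken from the examples on this slide; the actual Verity spider configuration is not reproduced here.

    from fnmatch import fnmatch

    # Illustrative exclusion patterns drawn from the slide; the real spider
    # configuration contains many more domain, URL, and file-type rules.
    FILE_TYPE_EXCLUSIONS = ["*.old", "*.map", "*.spec", "*.mod", "*.log"]
    URL_PREFIX_EXCLUSIONS = ["http://heasarc.gsfc.nasa.gov/listserv"]

    def is_excluded(url: str) -> bool:
        """Return True if the URL matches a file-type or URL-prefix exclusion."""
        if any(fnmatch(url, pattern) for pattern in FILE_TYPE_EXCLUSIONS):
            return True
        return any(url.startswith(prefix) for prefix in URL_PREFIX_EXCLUSIONS)

    urls = [
        "http://heasarc.gsfc.nasa.gov/listserv/archives",   # mailing list: excluded
        "http://example.nasa.gov/build/output.log",         # excluded file type
        "http://www.nasa.gov/pdf/1968main_strategi.pdf",    # kept
    ]
    print([u for u in urls if not is_excluded(u)])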

  6. Indexing Status • Over 2 million documents crawled. • A total of 420K documents indexed. • 1600 *.nasa.gov domains are indexed. • Included MIME types. • Dynamic content like *.jsp, *.cfm, *.php is also indexed.

  7. Relevance • Relevance logic is the set of business rules that determines the order/rank of documents in the search results. • The NASA Search Engine generates a VQL (Verity Query Language) query based on the search query to produce relevant search results. • A score is automatically assigned to each retrieved document based on its relevance to the search query. • Relevance depends on the presence of the search query in the document's various metadata fields or in the content itself. • Some of the most common words (stop words), like is, what, there, etc., are removed from the search query; these words don't contribute much to relevance.
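
As an illustration of the stop-word step, here is a minimal Python sketch; the stop-word list and tokenization below are illustrative only, not the actual configuration used by the search engine.

    # Illustrative stop-word list; the production list is larger and
    # maintained separately.
    STOP_WORDS = {"is", "what", "there", "a", "an", "the", "of", "in"}

    def clean_query(query: str) -> list:
        """Lower-case and tokenize the query, then drop stop words."""
        return [token for token in query.lower().split() if token not in STOP_WORDS]

    print(clean_query("what is the 2003 strategic plan"))
    # ['2003', 'strategic', 'plan']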

  8. Relevance Logic in Detail… • Based on numerous discussions and reviews of NASA's content, we came up with an efficient and optimal relevance logic. • Relevance factors are expressed in units; a search query present in the title as a phrase carries the maximum weight. • dc is an acronym for the Dublin Core interoperable metadata standard.
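
The weighting idea can be sketched as a sum over metadata fields, with the heaviest weight on a phrase match in the title. The field names and unit values below are illustrative placeholders only; the actual relevance units are defined in the Verity/VQL configuration.

    # Illustrative field weights in relevance units; not the real Verity values.
    FIELD_WEIGHTS = {"title": 10, "dc.subject": 6, "dc.description": 4, "body": 1}

    def score(document: dict, query_phrase: str) -> int:
        """Add the weight of every field that contains the query as a phrase."""
        phrase = query_phrase.lower()
        return sum(weight for field, weight in FIELD_WEIGHTS.items()
                   if phrase in document.get(field, "").lower())

    doc = {"title": "2003 Strategic Plan",
           "dc.subject": "strategic plan",
           "body": "NASA's 2003 strategic plan describes the agency's goals."}
    print(score(doc, "2003 strategic plan"))  # title + body match -> 11 units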

  9. Metadata • Metadata is a critical component of data that describes its content, quality, condition and other characteristics, e.g. <META NAME="dc.subject" CONTENT="news, events">. • A metadata value can be appropriate free text, or it can be selected from a controlled vocabulary, e.g. <META NAME="dc.description" CONTENT="A Remote-sensing ..">. • Metadata fields used in Simple and Advanced Search. "Having these metadata fields in content is very important to achieve the closest affinity to the NASA Search relevance logic."
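
To illustrate how such META tags are read out of an HTML page, here is a small Python sketch using only the standard library; the page fragment is hypothetical and modeled on the examples above.

    from html.parser import HTMLParser

    class MetaCollector(HTMLParser):
        """Collect NAME/CONTENT pairs from META tags in a page."""
        def __init__(self):
            super().__init__()
            self.meta = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attributes = dict(attrs)
                name, content = attributes.get("name"), attributes.get("content")
                if name and content:
                    self.meta[name.lower()] = content

    # Hypothetical page head modeled on the slide's examples.
    page = ('<head>'
            '<meta name="dc.subject" content="news, events">'
            '<meta name="dc.description" content="A Remote-sensing study">'
            '</head>')
    collector = MetaCollector()
    collector.feed(page)
    print(collector.meta)
    # {'dc.subject': 'news, events', 'dc.description': 'A Remote-sensing study'}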

  10. NASA Standard Metadata Fields for Search • Aliases are alternative names for metadata fields, e.g. on one site the description field is defined as <META NAME="dc.description" CONTENT="…"> and on another site it is <META NAME="description" CONTENT="…">.
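
One simple way to handle aliases at index time is to fold the alternative names onto a single canonical field before scoring. Below is a minimal sketch with an illustrative alias table; the authoritative alias list is the one defined in the NASA metadata guidelines referenced later in this deck.

    # Illustrative alias table only; the authoritative mapping lives in the
    # NASA Descriptive Taxonomy and Metadata Guidelines.
    ALIASES = {
        "description": "dc.description",
        "keywords": "dc.subject",
    }

    def normalize(meta: dict) -> dict:
        """Map aliased META names onto their canonical field names."""
        return {ALIASES.get(name, name): value for name, value in meta.items()}

    print(normalize({"description": "A Remote-sensing study", "keywords": "news, events"}))
    # {'dc.description': 'A Remote-sensing study', 'dc.subject': 'news, events'}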

  11. Metadata Continued… • Metadata influences the relevance of documents. • For HTML documents, proper image alt text and anchor text enhance Advanced Search capabilities. • Suitable metadata is equally important for PDFs, Microsoft Word documents, Excel spreadsheets, etc.

  12. Metadata Examples and Guidelines - Recommended The Earth Observatory site can be considered an example of good-quality metadata. http://earthobservatory.nasa.gov/Study/islscp/

  13. Metadata Examples and Guidelines for PDF Documents - Recommended • A search for 2003 strategic plan on search.nasa.gov will return http://www.nasa.gov/pdf/1968main_strategi.pdf on top with 99% relevance. • This document has 2003 strategic plan in its title, subject, and as a phrase in the content. Properly populated metadata resulted in the most relevant document appearing on top. http://www.nasa.gov/pdf/1968main_strategi.pdf

  14. Metadata Examples and Guidelines for MSWord Documents - Recommended Suitable title, subject and keywords should be populated for Microsoft Word and Excel Documents. http://science.ksc.nasa.gov/projects/astwg/vfunct07.doc

  15. Metadata Examples - Not Recommended On km.nasa.gov, many documents have the same value for the metadata field description. Metadata should be pertinent to the content; it improves the efficiency of searching, making it much easier to find something specific and relevant. http://km.nasa.gov/

  16. Metadata Examples - Not Recommended On quest.nasa.gov, many documents have the same value for the metadata fields description and keywords. http://quest.nasa.gov/women/archive/12-07-99aldas.html

  17. Metadata Examples - Not Recommended Inappropriate population of metadata negatively affects the relevance logic. http://amesnews.arc.nasa.gov/releases/2003/03_24AR.html

  18. Metadata Population Tool - Machine Aided Indexing Metadata can be easily generated using the NASA Thesaurus Machine Aided Indexing (MAI) tool. http://mai.larc.nasa.gov Generate metadata: http://www.nasa.gov/vision/earth/environment/HURRICANE_RECIPE.html

  19. Metadata Generation using Machine Aided Indexing • NASA Thesaurus • Paste content here • Generated keywords • Selected keywords as part of the metadata http://mai.larc.nasa.gov/

  20. Robots Exclusion Standards • The Robots Exclusion Protocol: a Web site administrator can indicate which parts of the site should not be visited by a spider/robot by providing a specially formatted file, robots.txt, in the document root of the site. e.g. User-agent: * Disallow: / This file will not allow any spider/robot to crawl the site. • The Robots META tag: this allows HTML authors to indicate to visiting robots whether a document may be indexed or used to harvest more links; no server administrator action is required. e.g. <META NAME="robots" CONTENT="noindex,follow"> A robot should not index this document but should analyze it for links.
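
The robots META directive can be read as two independent flags, index and follow. Here is a minimal Python sketch of that interpretation; it is an illustration, not the spider's actual parser.

    def parse_robots_meta(content: str):
        """Return (may_index, may_follow) for a robots META CONTENT value."""
        tokens = {token.strip().lower() for token in content.split(",")}
        return ("noindex" not in tokens, "nofollow" not in tokens)

    print(parse_robots_meta("noindex,follow"))  # (False, True): follow links only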

  21. Usage of Robots Exclusion Standards • Provide an appropriate robots.txt to allow the NASA spider to crawl the desired content. e.g. User-agent: nasak2spider Disallow: User-agent: * Disallow: / This file will allow nasak2spider, which is the name of the NASA spider, to crawl the site and will deny access to all other robots. • User-agent: nasak2spider Disallow: /cgi-bin/ User-agent: * Disallow: / This file will allow nasak2spider to crawl the site except /cgi-bin/ and will deny access to all other robots.
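
The first robots.txt example above can be verified with Python's standard urllib.robotparser module; the URL used here is only a placeholder.

    from urllib.robotparser import RobotFileParser

    # The "allow nasak2spider, deny everyone else" rules from this slide.
    rules = [
        "User-agent: nasak2spider",
        "Disallow:",
        "",
        "User-agent: *",
        "Disallow: /",
    ]
    parser = RobotFileParser()
    parser.parse(rules)

    url = "http://example.nasa.gov/news/index.html"   # placeholder URL
    print(parser.can_fetch("nasak2spider", url))  # True: the NASA spider may crawl
    print(parser.can_fetch("someotherbot", url))  # False: all other robots denied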

  22. Usage of Robots Exclusion Standards • Put suitable robots meta tags to direct a visiting robot to index or follow the document. e.g. <META NAME="robots" CONTENT="index,follow"> <META NAME="robots" CONTENT="noindex,follow"> <META NAME="robots" CONTENT="index,nofollow"> <META NAME="robots" CONTENT="noindex,nofollow"> • Content discovery and cleansing time can be reduced considerably by using the Robots Exclusion Standards efficiently.

  23. Tips for Frames using Robot Tags • Generally, a framed page can be divided into three parts: parent frame, navigation frame, and content frame. e.g. http://www.sti.nasa.gov/

  24. Tips for Frames using Robot Tags • Put appropriate metadata in the parent frame HTML. • As the navigation frame doesn't add any value to the content, add a robots meta tag directive so that it is not indexed but its links are followed. e.g. <META NAME="robots" CONTENT="noindex,follow"> • The content frame contains the desired information, hence a robots meta tag directive to index the frame and follow the links in it should be added. e.g. <META NAME="robots" CONTENT="index,follow">

  25. Browsable Categories • Defining a browsable taxonomy is an iterative process that evolves by adding new categories and defining new business rules. • Taxonomy and Business Rule Workflow

  26. Recommendations • Populate appropriate metadata (title, keywords, and description). • Meta-tagging should be relevant to the document and the expected search terms. • Today, less than 10% of total documents have the basic metadata associated with them. • Use the suggested metadata aliases (if any). • Follow the NASA Descriptive Taxonomy and Metadata Guidelines, which are available on the NASA Support site: http://portalpub.jpl.nasa.gov:8080/project_doc/IA_and_Taxonomy/NASA_Descriptive_Taxonomy_Spreadsheet_v8.2_04.02.03.xls • Use robots.txt or robots meta tags to direct the NASA spider. • Improve Browsable Categories and Business Rules by contributing your feedback.
