
  1. eTouch Systems Presents NASA Portal Search Implementation September 17th 2003

  2. Agenda • NASA Search Architecture • Indexing Statistics • Content Discovery • Relevance • Metadata • Robots Exclusion Standards • Browsable Categories • Recommendations

  3. NASA Search Architecture • The Global Traffic Manager (GTM) will load-balance cross-data-center traffic to the Verity search boxes.

  4. Content Discovery Cycle

  5. Content Discovery Statistics • Started off with 2600 domains (provided by NASA). • Around 1250 *.nasa.gov domains were found to be inaccessible. • Exclusion criteria discovered during manual cleansing: • 70 domain-level exclusions (provided by NASA), e.g. http://images.ksc.nasa.gov/ • ~2500 specific URL/directory-level exclusions, e.g. http://heasarc.gsfc.nasa.gov/listserv* is a mailing list. • File-type exclusions, e.g. *.old, *.map, *.spec, *.mod, *.log, etc. • 50K–60K documents excluded because of robots restrictions. • ~80 duplicate domains, e.g. aerospace.arc.nasa.gov & aeronautics.arc.nasa.gov. • Exclusion of text-only versions of documents as duplicate content. • Exclusion of documents in binary format. • Sample binary-format document. • 67K documents are larger than 1 MB or 0 bytes in size; of these, 11K fall within the included MIME types (i.e. PDFs/HTMLs/DOCs/XLSs/PPTs). • Sample 0-byte document. • Sample 1 MB document.
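
The file-type and URL exclusions above amount to simple pattern filters applied to the discovered URL list. Below is a minimal Python sketch of such a filter, using illustrative patterns taken from the examples on this slide; the actual Verity spider configuration is not reproduced here.

    from fnmatch import fnmatch

    # Illustrative exclusion patterns drawn from the slide; the real spider
    # configuration contains many more domain, URL, and file-type rules.
    FILE_TYPE_EXCLUSIONS = ["*.old", "*.map", "*.spec", "*.mod", "*.log"]
    URL_PREFIX_EXCLUSIONS = ["http://heasarc.gsfc.nasa.gov/listserv"]

    def is_excluded(url: str) -> bool:
        """Return True if the URL matches a file-type or URL-prefix exclusion."""
        if any(fnmatch(url, pattern) for pattern in FILE_TYPE_EXCLUSIONS):
            return True
        return any(url.startswith(prefix) for prefix in URL_PREFIX_EXCLUSIONS)

    urls = [
        "http://heasarc.gsfc.nasa.gov/listserv/archives",   # mailing list: excluded
        "http://example.nasa.gov/build/output.log",         # excluded file type
        "http://www.nasa.gov/pdf/1968main_strategi.pdf",    # kept
    ]
    print([u for u in urls if not is_excluded(u)])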

  6. Indexing Status • Over 2 million documents crawled. • A total of 420K documents indexed. • 1600 *.nasa.gov domains are indexed. • Included MIME types. • Dynamic content like *.jsp, *.cfm, *.php is also indexed.

  7. Relevance • Relevance logic is the set of business rules that determines the order/rank of documents in the search results. • The NASA Search Engine generates a VQL (Verity Query Language) query based on the search query to produce relevant search results. • A score is automatically assigned to each retrieved document based on its relevance to the search query. • Relevance depends on the presence of the search query in the document's various metadata fields or in the content itself. • Some of the most common words (stop words), like is, what, there, etc., are removed from the search query; these words don't contribute much to relevance.
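
As an illustration of the stop-word step, here is a minimal Python sketch; the stop-word list and tokenization below are illustrative only, not the actual configuration used by the search engine.

    # Illustrative stop-word list; the production list is larger and
    # maintained separately.
    STOP_WORDS = {"is", "what", "there", "a", "an", "the", "of", "in"}

    def clean_query(query: str) -> list:
        """Lower-case and tokenize the query, then drop stop words."""
        return [token for token in query.lower().split() if token not in STOP_WORDS]

    print(clean_query("what is the 2003 strategic plan"))
    # ['2003', 'strategic', 'plan']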

  8. Relevance Logic in Detail… • Based on numerous discussions and reviews of NASA's content, we came up with an efficient and optimal relevance logic. • Relevance factors are expressed in units; a search query present in the title as a phrase carries the maximum weight. • dc is an acronym for the Dublin Core interoperable metadata standard.
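
The weighting idea can be sketched as a sum over metadata fields, with the heaviest weight on a phrase match in the title. The field names and unit values below are illustrative placeholders only; the actual relevance units are defined in the Verity/VQL configuration.

    # Illustrative field weights in relevance units; not the real Verity values.
    FIELD_WEIGHTS = {"title": 10, "dc.subject": 6, "dc.description": 4, "body": 1}

    def score(document: dict, query_phrase: str) -> int:
        """Add the weight of every field that contains the query as a phrase."""
        phrase = query_phrase.lower()
        return sum(weight for field, weight in FIELD_WEIGHTS.items()
                   if phrase in document.get(field, "").lower())

    doc = {"title": "2003 Strategic Plan",
           "dc.subject": "strategic plan",
           "body": "NASA's 2003 strategic plan describes the agency's goals."}
    print(score(doc, "2003 strategic plan"))  # title + body match -> 11 units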

  9. Metadata • Metadata is a critical component of data that describes its content, quality, condition and other characteristics, e.g. <META NAME="dc.subject" CONTENT="news, events">. • A metadata value can be appropriate free text, or it can be selected from a controlled vocabulary, e.g. <META NAME="dc.description" CONTENT="A Remote-sensing ..">. • Metadata fields used in Simple and Advanced Search. "Having these metadata fields in content is very important to achieve the closest affinity to the NASA Search relevance logic."
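
To illustrate how such META tags are read out of an HTML page, here is a small Python sketch using only the standard library; the page fragment is hypothetical and modeled on the examples above.

    from html.parser import HTMLParser

    class MetaCollector(HTMLParser):
        """Collect NAME/CONTENT pairs from META tags in a page."""
        def __init__(self):
            super().__init__()
            self.meta = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attributes = dict(attrs)
                name, content = attributes.get("name"), attributes.get("content")
                if name and content:
                    self.meta[name.lower()] = content

    # Hypothetical page head modeled on the slide's examples.
    page = ('<head>'
            '<meta name="dc.subject" content="news, events">'
            '<meta name="dc.description" content="A Remote-sensing study">'
            '</head>')
    collector = MetaCollector()
    collector.feed(page)
    print(collector.meta)
    # {'dc.subject': 'news, events', 'dc.description': 'A Remote-sensing study'}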

  10. NASA Standard Metadata Fields for Search • Aliases are alternative names for metadata fields, e.g. on one site the description field is defined as <META NAME="dc.description" CONTENT="…"> and on another site it is <META NAME="description" CONTENT="…">.
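
One simple way to handle aliases at index time is to fold the alternative names onto a single canonical field before scoring. Below is a minimal sketch with an illustrative alias table; the authoritative alias list is the one defined in the NASA metadata guidelines referenced later in this deck.

    # Illustrative alias table only; the authoritative mapping lives in the
    # NASA Descriptive Taxonomy and Metadata Guidelines.
    ALIASES = {
        "description": "dc.description",
        "keywords": "dc.subject",
    }

    def normalize(meta: dict) -> dict:
        """Map aliased META names onto their canonical field names."""
        return {ALIASES.get(name, name): value for name, value in meta.items()}

    print(normalize({"description": "A Remote-sensing study", "keywords": "news, events"}))
    # {'dc.description': 'A Remote-sensing study', 'dc.subject': 'news, events'}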

  11. Metadata Continued… • Metadata influences the relevance of documents. • For HTML documents, proper image alt text and anchor text enhance Advanced Search capabilities. • Suitable metadata is equally important for PDFs, Microsoft Word documents, Excel spreadsheets, etc.

  12. Metadata Examples and Guidelines - Recommended The Earth Observatory site can be considered an example of good-quality metadata. http://earthobservatory.nasa.gov/Study/islscp/

  13. Metadata Examples and Guidelines for PDF Documents - Recommended • A search for 2003 strategic plan on search.nasa.gov will return http://www.nasa.gov/pdf/1968main_strategi.pdf on top with 99% relevance. • This document has 2003 strategic plan in its title, subject, and as a phrase in the content. Properly populated metadata resulted in the most relevant document appearing on top. http://www.nasa.gov/pdf/1968main_strategi.pdf

  14. Metadata Examples and Guidelines for MSWord Documents - Recommended Suitable title, subject and keywords should be populated for Microsoft Word and Excel Documents. http://science.ksc.nasa.gov/projects/astwg/vfunct07.doc

  15. Metadata Examples - Not Recommended On km.nasa.gov, many documents have the same value for the metadata field description. Metadata should be pertinent to the content; it improves the efficiency of searching, making it much easier to find something specific and relevant. http://km.nasa.gov/

  16. Metadata Examples - Not Recommended On quest.nasa.gov, many documents have the same value for the metadata fields description and keywords. http://quest.nasa.gov/women/archive/12-07-99aldas.html

  17. Metadata Examples - Not Recommended Inappropriate population of metadata negatively affects the relevance logic. http://amesnews.arc.nasa.gov/releases/2003/03_24AR.html

  18. Metadata Population Tool - Machine Aided Indexing Metadata can be easily generated using the NASA Thesaurus Machine Aided Indexing (MAI) tool. http://mai.larc.nasa.gov Generate metadata: http://www.nasa.gov/vision/earth/environment/HURRICANE_RECIPE.html

  19. Metadata Generation using Machine Aided Indexing • NASA Thesaurus • Paste content here • Generated keywords • Selected keywords as part of the metadata http://mai.larc.nasa.gov/

  20. Robots Exclusion Standards • The Robots Exclusion Protocol: a Web site administrator can indicate which parts of the site should not be visited by a spider/robot by providing a specially formatted file, robots.txt, in the document root of the site. e.g. User-agent: * Disallow: / This file will not allow any spider/robot to crawl the site. • The Robots META tag: this allows HTML authors to indicate to visiting robots whether a document may be indexed or used to harvest more links; no server administrator action is required. e.g. <META NAME="robots" CONTENT="noindex,follow"> A robot should not index this document but should analyze it for links.
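
The robots META directive can be read as two independent flags, index and follow. Here is a minimal Python sketch of that interpretation; it is an illustration, not the spider's actual parser.

    def parse_robots_meta(content: str):
        """Return (may_index, may_follow) for a robots META CONTENT value."""
        tokens = {token.strip().lower() for token in content.split(",")}
        return ("noindex" not in tokens, "nofollow" not in tokens)

    print(parse_robots_meta("noindex,follow"))  # (False, True): follow links only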

  21. Usage of Robots Exclusion Standards • Provide an appropriate robots.txt to allow the NASA spider to crawl the desired content. e.g. User-agent: nasak2spider Disallow: User-agent: * Disallow: / This file will allow nasak2spider, which is the name of the NASA spider, to crawl the site and will deny access to all other robots. • User-agent: nasak2spider Disallow: /cgi-bin/ User-agent: * Disallow: / This file will allow nasak2spider to crawl the site except /cgi-bin/ and will deny access to all other robots.
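
The first robots.txt example above can be verified with Python's standard urllib.robotparser module; the URL used here is only a placeholder.

    from urllib.robotparser import RobotFileParser

    # The "allow nasak2spider, deny everyone else" rules from this slide.
    rules = [
        "User-agent: nasak2spider",
        "Disallow:",
        "",
        "User-agent: *",
        "Disallow: /",
    ]
    parser = RobotFileParser()
    parser.parse(rules)

    url = "http://example.nasa.gov/news/index.html"   # placeholder URL
    print(parser.can_fetch("nasak2spider", url))  # True: the NASA spider may crawl
    print(parser.can_fetch("someotherbot", url))  # False: all other robots denied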

  22. Usage of Robots Exclusion Standards • Put suitable robots meta tags to direct a visiting robot to index or follow the document. e.g. <META NAME="robots" CONTENT="index,follow"> <META NAME="robots" CONTENT="noindex,follow"> <META NAME="robots" CONTENT="index,nofollow"> <META NAME="robots" CONTENT="noindex,nofollow"> • Content discovery and cleansing time can be reduced considerably by using the Robots Exclusion Standards efficiently.

  23. Tips for Frames using Robot Tags • Generally, a framed page can be divided into three parts: parent frame, navigation frame, and content frame. e.g. http://www.sti.nasa.gov/

  24. Tips for Frames using Robot Tags • Put appropriate metadata in the parent frame HTML. • As the navigation frame doesn't add any value to the content, add a robots meta tag directive so that it is not indexed but its links are followed. e.g. <META NAME="robots" CONTENT="noindex,follow"> • The content frame contains the desired information, hence a robots meta tag directive to index the frame and follow the links in it should be added. e.g. <META NAME="robots" CONTENT="index,follow">

  25. Browsable Categories • Defining a browsable taxonomy is an iterative process that evolves by adding new categories and defining new business rules. • Taxonomy and Business Rule Workflow

  26. Recommendations • Populate appropriate metadata (title, keywords, and description). • Meta-tagging should be relevant to the document and the expected search terms. • Today, less than 10% of total documents have the basic metadata associated with them. • Use the suggested metadata aliases (if any). • Follow the NASA Descriptive Taxonomy and Metadata Guidelines, which are available on the NASA Support site: http://portalpub.jpl.nasa.gov:8080/project_doc/IA_and_Taxonomy/NASA_Descriptive_Taxonomy_Spreadsheet_v8.2_04.02.03.xls • Use robots.txt or robots meta tags to direct the NASA spider. • Improve Browsable Categories and Business Rules by contributing your feedback.
