
Using HTML Metadata to Retrieve Relevant Images from the World Wide Web

Ethan V. Munson, University of Wisconsin-Milwaukee


Presentation Transcript


  1. Using HTML Metadata to Retrieve Relevant Images from the World Wide Web Ethan V. Munson University of Wisconsin-Milwaukee

  2. Why is image search important?
     • The Web is becoming the world’s primary information source
     • Images are one of the Web’s key features
     • Few WWW image search engines currently exist
     • Finding images manually with textual search engines is laborious

  3. A Requirement for Web Image Search
     • We need an efficient method of discovering and indexing image content.
     • Two main sources of information about image content:
       • image processing
       • associated text
         • text content
         • markup

  4. Related work
     • QBIC (IBM Almaden Research Center)
       • indexes and retrieves images according to:
         • shape
         • color
         • texture
         • object layout
       • queries are formulated through visual examples
         • a sample image
         • user-provided sketches

  5. Related work QBIC system

  6. Related work QBIC system

  7. Related work QBIC system

  8. QBIC: Advantages and Disadvantages
     • Advantages
       • well-developed visual query language
       • interesting GUI
       • queries are based on image appearance
     • Disadvantages
       • works only at the primitive feature level (color, texture, shape)
       • doesn’t recognize the semantics of an image
       • very sensitive to camera viewpoint
       • doesn’t scale up to the Web

  9. Related work
     • WebSeek (J. Smith & S. Chang, Columbia University)
       • performs a semi-automated classification of the images
         • automatically extracts keywords from image file names
         • computes the keyword histogram
         • manually creates a subject hierarchy
         • manually maps the images into the subject hierarchy
       • User can
         • browse the categories
         • search the categories by keyword
         • search the database using image features
           • color content
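
The two automated WebSeek steps on this slide (keywords from image file names, then a keyword histogram) can be illustrated with a minimal sketch. The slide does not say how WebSeek tokenizes names, so the splitting rule and the function names `filename_keywords` and `keyword_histogram` below are assumptions made for illustration only.

```python
import re
from collections import Counter
from posixpath import basename, splitext
from urllib.parse import urlparse


def filename_keywords(image_url):
    """Split the file name of an image URL into lowercase alphabetic terms.

    Assumed rule: break on anything that is not a letter and drop
    one-letter fragments. This is not WebSeek's documented behavior.
    """
    path = urlparse(image_url).path
    stem, _extension = splitext(basename(path))
    return [t.lower() for t in re.split(r"[^A-Za-z]+", stem) if len(t) > 1]


def keyword_histogram(image_urls):
    """Count how often each file-name keyword occurs across a collection."""
    counts = Counter()
    for url in image_urls:
        counts.update(filename_keywords(url))
    return counts
```

Frequent terms in such a histogram would be natural candidates for the manually built subject hierarchy mentioned on the slide.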

  10. WebSeek: Advantages/Disadvantages
      • Advantages
        • Large index of Web images
        • Supports both text and image search
      • Disadvantages
        • Not clear that the database can scale up
        • Manual categorization is very expensive
        • Relevance feedback mechanism is computationally expensive

  11. Related work
      • WebSeer (M. Swain et al., The University of Chicago)
        • uses associated text and markup to supplement information derived from analyzing image content
        • uses multiple kinds of metadata
          • image file names
          • alternate text
          • text of a hyperlink
        • decides which images are photographs, portraits, or computer-generated drawings
        • research emphasized categorization, not metadata-based search

  12. Why seek new image retrieval methods?
      • The number of WWW documents is growing rapidly, and the documents are constantly changing
      • We need fast and efficient methods for finding images
      • Image processing is
        • complex
        • computationally expensive
        • limited (misses true image semantics)
        • unnecessary

  13. Research Goals
      • Show that images can be found using HTML “metadata”
        • textual content
        • HTML tag structure
        • attribute values
      • Determine which metadata features are the best clues to image content

  14. The URL Filter
      • assembles a list of URLs from the results returned by Alta Vista
      • parses the first page returned by Alta Vista
      • follows the URLs of results pages, retrieves these pages, and parses them
      • extracts list of URLs from the results pages
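
The extraction step of the URL Filter can be sketched with Python's standard `html.parser`. This is an illustration, not the original tool: the names `LinkCollector` and `result_urls` are hypothetical, and the rule that links staying on the search engine's own host are navigation rather than results is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href of every anchor, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def result_urls(results_html, results_page_url):
    """Return the distinct off-site http(s) links found on a results page."""
    collector = LinkCollector(results_page_url)
    collector.feed(results_html)
    engine_host = urlparse(results_page_url).netloc
    seen, urls = set(), []
    for link in collector.links:
        parts = urlparse(link)
        # Assumption: links on the engine's own host are navigation, not hits.
        if parts.scheme in ("http", "https") and parts.netloc != engine_host:
            if link not in seen:
                seen.add(link)
                urls.append(link)
    return urls
```

The same-host links that this sketch discards are the ones a fuller version would follow to reach further results pages.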

  15. The Crawler
      • retrieves the pages
      • saves each page’s HTML source code in a separate file
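
A minimal sketch of the crawler's two steps, using only the standard library. The file naming scheme (`page_0000.html`, ...), the choice to skip unreachable pages, and the names `fetch` and `crawl` are assumptions; the slide specifies only that each page is stored in its own file.

```python
from pathlib import Path
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its raw bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, out_dir, fetch_page=fetch):
    """Save each retrievable page to its own file; return {url: path}."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = {}
    for index, url in enumerate(urls):
        try:
            body = fetch_page(url)
        except OSError:
            # urllib.error.URLError and socket timeouts derive from OSError.
            continue
        path = out / f"page_{index:04d}.html"
        path.write_bytes(body)
        saved[url] = path
    return saved
```

Passing the download function as a parameter keeps the file-handling logic testable without network access.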

  16. “Tidy”
      • converts arbitrary and probably ill-formed HTML into XHTML
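
One way to run this step from a script is to call the HTML Tidy command-line tool. The sketch below assumes a `tidy` executable is installed and on the PATH; the wrapper names `tidy_command` and `to_xhtml` are hypothetical, and the slides do not say which options the original system used.

```python
import shutil
import subprocess


def tidy_command(html_path):
    """Build the argument list: quiet mode, XHTML output written to stdout."""
    return ["tidy", "-quiet", "-asxhtml", html_path]


def to_xhtml(html_path):
    """Return the XHTML that Tidy produces for one saved HTML file."""
    if shutil.which("tidy") is None:
        raise RuntimeError("the tidy executable was not found on PATH")
    result = subprocess.run(tidy_command(html_path),
                            capture_output=True, text=True)
    # Tidy exits with 0 for clean input, 1 for warnings, 2 for errors.
    if result.returncode > 1:
        raise RuntimeError(result.stderr.strip())
    return result.stdout
```

Warnings are treated as acceptable here because ill-formed input is the expected case for pages gathered from the Web.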

  17. XHTML Parser
      • parses an XHTML document
      • builds an XHTML parse tree
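
Because XHTML is well-formed XML, a generic XML parser can build the parse tree. The sketch below uses Python's `xml.etree.ElementTree` as a stand-in for the parser in the original system; `parse_xhtml`, `local_name`, and `outline` are hypothetical helper names. One caveat of this stand-in: named entities such as `&nbsp;` may need to be converted to numeric references before parsing.

```python
import xml.etree.ElementTree as ET


def parse_xhtml(xhtml_text):
    """Parse an XHTML string and return the root element of its tree."""
    return ET.fromstring(xhtml_text)


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def outline(element, depth=0):
    """Return the tree as indented tag names, one line per element."""
    lines = ["  " * depth + local_name(element)]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines
```

For example, `print("\n".join(outline(parse_xhtml(text))))` shows the nesting of elements in a page.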

  18. The Document Analyzer
      • scans the parse tree for image URLs
        • an image URL appears in either an image or anchor element
      • converts relative URLs into absolute URLs
      • uses various heuristics to determine which URLs point to relevant images
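
The first two analyzer steps (finding image URLs in image and anchor elements, then making them absolute) can be sketched as follows. The extension list used to decide whether an anchor points at an image is an assumption, the function name `image_urls` is hypothetical, and the relevance heuristics are not reproduced because the slide does not describe them.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse

# Assumed set of image file extensions; the original list is not given.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def image_urls(root, page_url):
    """Return absolute image URLs from img/@src and image-like a/@href."""
    found = []
    for element in root.iter():
        name = local_name(element)
        if name == "img":
            ref = element.get("src")
        elif name == "a":
            ref = element.get("href")
            # Keep an anchor only if its target looks like an image file.
            if ref and not urlparse(ref).path.lower().endswith(IMAGE_EXTENSIONS):
                ref = None
        else:
            continue
        if ref:
            absolute = urljoin(page_url, ref)
            if absolute not in found:
                found.append(absolute)
    return found
```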

  19. Search Strategies
      • Image’s file name
      • Textual content of the TITLE element
      • Value of the ALT attribute of IMG elements
      • Textual content of anchor elements
      • Value of the title attribute of anchor elements
      • Textual content of the paragraph surrounding an image
      • Textual content of any paragraph located within the same center element as the image
      • Textual content of heading elements
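
As an illustration of how these strategies might be applied, the sketch below gathers the text behind most of them for one image and reports which ones contain a query term. It is a simplified assumption, not the original system: matching is a case-insensitive substring test, the "same center element" strategy is omitted, and the names `metadata_fields` and `matching_strategies` are hypothetical.

```python
import xml.etree.ElementTree as ET
from posixpath import basename
from urllib.parse import urlparse

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def text_of(element):
    """All text inside an element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def metadata_fields(root, img):
    """Collect the text each search strategy would inspect for one image."""
    parents = {child: parent for parent in root.iter() for child in parent}
    fields = {
        "file name": basename(urlparse(img.get("src", "")).path),
        "title": "",
        "alt": img.get("alt", ""),
        "anchor text": "",
        "anchor title": "",
        "paragraph": "",
        "headings": "",
    }
    headings = []
    for element in root.iter():
        name = local_name(element)
        if name == "title":
            fields["title"] = text_of(element)
        elif name in HEADINGS:
            headings.append(text_of(element))
    fields["headings"] = " ".join(headings)
    # Walk up from the image to its nearest enclosing anchor and paragraph.
    anchor = paragraph = None
    node = parents.get(img)
    while node is not None:
        name = local_name(node)
        if name == "a" and anchor is None:
            anchor = node
        elif name == "p" and paragraph is None:
            paragraph = node
        node = parents.get(node)
    if anchor is not None:
        fields["anchor text"] = text_of(anchor)
        fields["anchor title"] = anchor.get("title", "")
    if paragraph is not None:
        fields["paragraph"] = text_of(paragraph)
    return fields


def matching_strategies(fields, query):
    """Names of the strategies whose text contains the query term."""
    q = query.lower()
    return [name for name, text in fields.items() if q in text.lower()]
```

Recording which strategies matched, rather than a single yes/no answer, is what would let an experiment compare how informative each metadata source is.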

  20. Image Retrieval Experiment

  21. Experimental Questions
      • Which HTML features reveal the most information about an image?
      • Do particular patterns of HTML structure carry useful information?
      • Do image search results depend on the type of query?
