
Video Google: A Text Retrieval Approach to Object Matching in Videos



Presentation Transcript


  1. Video Google: A Text Retrieval Approach to Object Matching in Videos
  Josef Sivic and Andrew Zisserman, Robotics Research Group, Department of Engineering Science, University of Oxford, United Kingdom

  2. Goal
  • To retrieve the key frames and shots of a video containing a particular object
  • To do so with the ease, speed and accuracy with which a text search engine retrieves documents

  3. Outline
  • Introduction: object query, scene query
  • Challenging problem
  • Text retrieval overview
  • Viewpoint invariant description
  • Building the Descriptors
  • Building the Visual Word
  • The Visual Analogy
  • Visual indexing using text retrieval methods
  • Experimental evaluation of scene matching using visual words
  • Object retrieval: stop list, spatial consistency
  • Summary and conclusions
  • Video Google Demo

  4. Introduction - Object query (1/2)

  5. Introduction - Scene query (2/2)

  6. Challenging problem (1/2)
  • Changes in viewpoint, illumination and partial occlusion
  • Large amounts of data
  • Real-world data

  7. Challenging problem (2/2)

  8. Text retrieval overview (1/2)
  • The documents are parsed into words
  • Words are represented by their stems: 'walk', 'walking', 'walks' -> 'walk'
  • A stop list filters out common words ('the', 'an', …)
  • The remaining words are represented as a vector, weighted by word frequency (a sketch of this pipeline follows)
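A minimal Python sketch of this pipeline; the stop list here is a toy one, and stem() is a crude suffix stripper standing in for a real stemmer such as Porter's:

```python
from collections import Counter

# Toy stop list; a real system would use a much larger one.
STOP_LIST = {"the", "an", "a", "of", "to", "and"}

def stem(word):
    # Crude suffix stripping, a stand-in for a real stemmer (e.g. Porter).
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(document):
    # Parse into words, stem them, drop stop words, count what remains.
    words = [stem(w) for w in document.lower().split()]
    return Counter(w for w in words if w not in STOP_LIST)

print(term_frequencies("the man walks and the dog walks"))
# Counter({'walk': 2, 'man': 1, 'dog': 1})
```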

  9. Text retrieval overview (2/2)
  • An inverted file is used to facilitate efficient retrieval
  • An inverted file is structured like an ideal book index: for each word it lists the documents in which that word occurs
  • A text is retrieved by computing its vector of word frequencies and returning the documents with the closest vectors
  • The returned documents are ranked (a minimal inverted-file sketch follows)
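A minimal sketch of an inverted file in Python; build_inverted_index and candidates are illustrative names:

```python
from collections import defaultdict

def build_inverted_index(documents):
    # Map each word to the set of document ids that contain it,
    # like an ideal book index.
    index = defaultdict(set)
    for doc_id, words in documents.items():
        for word in words:
            index[word].add(doc_id)
    return index

def candidates(index, query_words):
    # Only documents sharing at least one word with the query need
    # to be scored -- this is what makes retrieval efficient.
    hits = set()
    for word in query_words:
        hits |= index.get(word, set())
    return hits

docs = {"d1": ["car", "road"], "d2": ["car", "tree"], "d3": ["house"]}
index = build_inverted_index(docs)
print(candidates(index, ["car"]))  # {'d1', 'd2'}
```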

  10. Viewpoint invariant description (1/2)
  • Two types of viewpoint covariant regions are computed for each frame:
  • SA – Shape Adapted: corner-like features
  • MS – Maximally Stable: blobs of high contrast with respect to their surroundings
  • Regions are computed in grayscale

  11. Viewpoint invariant description (2/2)
  The MS regions are in yellow; the SA regions are in cyan.

  12. Building the Descriptors (1/2)
  • SIFT – Scale Invariant Feature Transform
  • Each elliptical region is represented by a 128-dimensional SIFT vector
  • SIFT is invariant to a shift of a few pixels (a detection/description sketch follows)
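A rough OpenCV sketch of region detection and description, assuming a placeholder key frame "frame.jpg"; MSER stands in for the MS detector, and the affine Shape Adapted detector is omitted since stock OpenCV has no equivalent:

```python
import cv2

# "frame.jpg" is a placeholder for one video key frame; as in the paper,
# regions are computed on the grayscale image.
gray = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)

# MSER stands in for the paper's Maximally Stable (MS) regions.
mser = cv2.MSER_create()
regions, bboxes = mser.detectRegions(gray)

# Represent each region by a keypoint at the centre of its bounding box.
keypoints = [
    cv2.KeyPoint(x + w / 2.0, y + h / 2.0, float(max(w, h)))
    for (x, y, w, h) in bboxes
]

# One 128-dimensional SIFT descriptor per region.
sift = cv2.SIFT_create()
keypoints, descriptors = sift.compute(gray, keypoints)
print(descriptors.shape)  # (number_of_regions, 128)
```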

  13. Building the Descriptors (2/2)
  • Removing noise – tracking & averaging:
  • Regions are tracked across a sequence of frames using a constant velocity dynamical model
  • Any region which does not survive for more than three frames is rejected
  • The descriptors are averaged throughout the track
  • Descriptors with large covariance are rejected (a minimal sketch of this filtering follows)
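A minimal sketch of this filtering step; the variance threshold is an assumed illustrative value, and the tracking itself is taken as given:

```python
import numpy as np

MIN_TRACK_FRAMES = 3      # regions surviving <= 3 frames are rejected
MAX_MEAN_VARIANCE = 0.1   # illustrative threshold, not from the slides

def stable_descriptors(tracks):
    # tracks: one (num_frames, 128) array of SIFT descriptors per tracked
    # region, gathered by some tracker (e.g. a constant velocity model).
    kept = []
    for track in tracks:
        if len(track) <= MIN_TRACK_FRAMES:
            continue  # region did not survive long enough: likely noise
        if track.var(axis=0).mean() > MAX_MEAN_VARIANCE:
            continue  # descriptor too unstable along the track
        kept.append(track.mean(axis=0))  # average over the track
    return np.array(kept)
```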

  14. Building the Visual Word (1/2)
  • Descriptors are clustered into K groups using the k-means clustering algorithm
  • Each cluster represents a "visual word" in the "visual vocabulary"
  • MS and SA regions are clustered separately
  • This gives different vocabularies for describing the same scene (a clustering sketch follows)
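A sketch of vocabulary construction using scikit-learn's k-means on toy data; the vocabulary size K = 1000 is an assumption (each region type is clustered into its own vocabulary of thousands of words):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the averaged 128-d SIFT descriptors of one region type;
# MS and SA descriptors would each be clustered separately.
descriptors = np.random.rand(10000, 128).astype(np.float32)

K = 1000  # vocabulary size is a design choice
kmeans = KMeans(n_clusters=K, n_init=1).fit(descriptors)

# Each cluster centre is one "visual word"; quantising a new descriptor
# means assigning it to its nearest centre.
new_descriptor = np.random.rand(1, 128).astype(np.float32)
print(kmeans.predict(new_descriptor)[0])  # visual word id in [0, K)
```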

  15. Building the Visual Word (2/2) [figure: example SA and MS visual words]

  16. The Visual Analogy [figure: text vs. visual correspondence]

  17. Visual indexing using text retrieval methods (1/2)
  • tf-idf – "term frequency – inverse document frequency" weighting
  • Given a vocabulary of k words, each document (frame) is represented by a k-vector of weighted word frequencies (the weighting is written out below)
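The slide leaves the weighting formula implicit; the standard tf-idf weighting gives component i of a document's k-vector as

$$ t_i = \frac{n_{id}}{n_d}\,\log\frac{N}{n_i} $$

where n_{id} is the number of occurrences of word i in document (frame) d, n_d the total number of words in d, n_i the number of occurrences of word i in the whole database, and N the number of documents in the database.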

  18. Visual indexing using text retrieval methods (2/2)
  • The query vector is given by the visual words contained in a user-specified sub-part of a frame
  • The other frames are ranked according to the similarity of their weighted vectors to this query vector (a ranking sketch follows)
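A minimal ranking sketch, assuming the normalized scalar product (cosine similarity) standard in text retrieval; the data is synthetic:

```python
import numpy as np

def rank_frames(query_vec, frame_vecs):
    # frame_vecs: (num_frames, k) tf-idf weighted word-frequency vectors;
    # query_vec: the k-vector built from the user-outlined query region.
    # Similarity = normalized scalar product (cosine of the angle).
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    f = frame_vecs / (np.linalg.norm(frame_vecs, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(f @ q))  # frame indices, best match first

frames = np.random.rand(100, 1000)    # toy data: 100 frames, 1000 words
query = 0.9 * frames[42]              # a query that resembles frame 42
print(rank_frames(query, frames)[0])  # -> 42
```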

  19. Experimental evaluation of scene matching using visual words (1/5)
  • Goal: evaluate the method by matching scene locations within a closed world of shots ("ground truth set")
  • Ground truth set: 164 frames, from 48 shots, taken at 19 3D locations in the movie "Run Lola Run" (4-9 frames from each location)
  • There are significant viewpoint changes in the frames for the same location

  20. Experimental evaluation of scene matching using visual words (2/5)

  21. Experimental evaluation of scene matching using visual words (3/5)
  • The entire frame is used as the query region
  • The performance is measured over all 164 frames
  • The correct results were determined by hand
  • Rank calculation (the rank measure is written out below)
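The slide does not spell the measure out; a standard choice consistent with this evaluation is the average normalized rank of the relevant frames, which is 0 for a perfect ranking and about 0.5 for a random one:

$$ \widetilde{\mathrm{Rank}} = \frac{1}{N\,N_{\mathrm{rel}}}\left(\sum_{i=1}^{N_{\mathrm{rel}}} R_i - \frac{N_{\mathrm{rel}}(N_{\mathrm{rel}}+1)}{2}\right) $$

where N_rel is the number of frames relevant to the query, N the size of the test set, and R_i the rank of the i-th relevant frame.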

  22. Experimental evaluation of scene matching using visual words (4/5)

  23. Experimental evaluation of scene matching using visual words (5/5)

  24. Object retrieval (1/7)
  • Goal: searching for objects throughout the entire movie
  • The object of interest is specified by the user as a sub-part of any frame

  25. Object retrieval – Stop list (2/7)
  • The most frequent and rarest visual words are stopped, to reduce the number of mismatches and the size of the inverted file while keeping a sufficiently rich visual vocabulary (a sketch follows this slide)
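A sketch of building such a stop list by document frequency; the 5% / 10% fractions are illustrative, not taken from the slide:

```python
import numpy as np

def build_stop_list(word_counts, top_frac=0.05, bottom_frac=0.10):
    # word_counts: (num_frames, k) matrix of visual-word counts per frame.
    # Very frequent words act like 'the' or 'an' in text, and very rare
    # words mostly encode noise.
    doc_freq = (word_counts > 0).sum(axis=0)  # frames containing each word
    order = np.argsort(doc_freq)              # rarest words first
    k = len(order)
    stopped = set(order[: int(bottom_frac * k)])       # rarest words
    stopped |= set(order[int((1.0 - top_frac) * k):])  # most frequent words
    return stopped  # word ids to drop from vectors and the inverted file
```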

  26. Object retrieval – Spatial Consistency (3/7)
  • When querying by a sub-part of the image, matched covariant regions in the retrieved frames should have a spatial arrangement similar to that of the outlined region in the query image (a re-ranking sketch follows this slide)
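A loose reconstruction of neighbourhood-based spatial consistency voting; the neighbourhood size k = 15 and the exact voting rule are assumptions:

```python
import numpy as np

def spatial_consistency_score(query_pts, frame_pts, k=15):
    # query_pts / frame_pts: (num_matches, 2) positions of the two regions
    # in each putative visual-word match, in matching order.
    # A match is supported by every other match that falls among its k
    # nearest neighbours on BOTH sides; total support re-ranks the frames.
    n = len(query_pts)
    score = 0
    for i in range(n):
        dq = np.linalg.norm(query_pts - query_pts[i], axis=1)
        df = np.linalg.norm(frame_pts - frame_pts[i], axis=1)
        nq = set(np.argsort(dq)[1 : k + 1])  # skip the match itself
        nf = set(np.argsort(df)[1 : k + 1])
        score += len(nq & nf)  # neighbours consistent on both sides
    return score
```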

  27. Object retrieval (4/7)

  28. Object retrieval (5/7)

  29. Object retrieval (6/7)

  30. Object retrieval (7/7)

  31. Summary and conclusions
  • Visual word and visual vocabulary analogy to text retrieval
  • Immediate run-time object retrieval
  • Future work: automatic ways of building the vocabulary are needed
  • Intriguing possibilities: latent semantic indexing to find content, and automatic clustering to find the principal objects that occur throughout the movie

  32. Video Google Demo
  • http://www.robots.ox.ac.uk/~vgg/research/vgoogle/
