Recognizing Indoor Scenes
Ariadna Quattoni and Antonio Torralba
CSAIL, MIT; UC Berkeley EECS & ICSI
32 Vassar St., Cambridge, MA 02139
ariadna@csail.mit.edu torralba@csail.mit.edu
OUTLINE • Introduction • Indoor Database • Scene Prototypes and ROIs (Regions of Interest) • Model & Learning • Experiments & Results • Conclusion
Introduction • Indoor scene recognition is a challenging open problem in high-level vision • Due to the large variation among indoor scenes, they are hard to characterize • Ex: some are characterized by global spatial properties (e.g. corridors) • Others are characterized by the objects they contain (e.g. bookstore) • In this paper, we propose a prototype-based model that can successfully combine both sources of information
Indoor vs. Outdoor • Indoor categories (RAW: 26.5%, Gist: 62.9%, Sift: 61.9%) • Outdoor categories (RAW: 32.6%, Gist: 78.1%, Sift: 79.1%) • Applied to a dataset of 15 scene categories [9,3,7]
Indoor scenes • Why slow progress in this area? • The lack of a large testbed of indoor scenes on which to train and test different approaches • We create a new dataset for indoor scenes (consisting of 67 scene categories) • To improve indoor scene recognition performance, we need to develop image representations specifically tailored for this task • The variation among indoor scenes makes it hard to judge what characterizes a scene • This work is related to work on learning distance functions [4,6,8] for visual recognition
Indoor Database • Image sources: • Online image search tools (Google and Altavista) • Online photo sharing sites (Flickr) • The LabelMe dataset • The database contains 15,620 images with a minimum resolution of 200x200 pixels
Image Database • 12 + 14 + 14 + 11 + 15 = 66 categories, plus Closet = 67
Prototype & ROI • Each prototype Tk is composed of mk ROIs • To define a set of candidate ROIs for a given prototype: • Asked a human annotator to segment the objects contained in it, yielding a total of 2000 manually segmented prototype images • Selected 10 ROIs for each prototype, each occupying at least 1% of the image area • Also produce candidate ROIs from a segmentation obtained using graph-cuts [13]
Image descriptors • Representing a prototype image: • Use the Gist descriptor (code available online [9]), which results in a vector of 384 dimensions • The comparison between two Gist descriptors is computed using Euclidean distance • Representing a ROI: • Use a pyramid of visual words [14] • Create vector-quantized Sift descriptors by applying K-means to a random subset of images (following [7], we use 200 clusters) • Each ROI is decomposed into a 2x2 grid, and histograms of visual words are computed for each window [7,1,12] • Distances between two regions are computed using histogram intersection as in [7]
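The two distance measures on this slide are simple to state in code. Below is a minimal sketch (the function names and the toy data are illustrative, not from the paper's released code):

```python
import numpy as np

def gist_distance(g1, g2):
    """Euclidean (L2) distance between two Gist descriptors
    (384-dimensional vectors, as on the slide)."""
    return float(np.linalg.norm(g1 - g2))

def histogram_intersection(h1, h2):
    """Histogram-intersection similarity between two visual-word
    histograms: the sum of element-wise minima."""
    return float(np.minimum(h1, h2).sum())

# Toy usage with random descriptors (200 visual words, following [7])
rng = np.random.default_rng(0)
g1, g2 = rng.random(384), rng.random(384)
h1 = rng.integers(0, 10, size=200)
h2 = rng.integers(0, 10, size=200)
print(gist_distance(g1, g2), histogram_intersection(h1, h2))
```

Note that histogram intersection is a similarity (larger means more alike), while the Gist comparison is a distance (smaller means more alike).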
ROI Search and Matches • Histograms of visual words can be computed efficiently using integral images • We assume that if two images are similar, their respective ROIs will be roughly aligned (i.e. in similar spatial locations)
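The integral-image trick mentioned above can be sketched as follows: build one integral image per visual word, after which the histogram of any rectangular window needs only four lookups per word (function names are mine, not the paper's):

```python
import numpy as np

def build_integral_histograms(word_map, n_words):
    """word_map: HxW array of visual-word indices per pixel/patch.
    Returns one (H+1)x(W+1) integral image per visual word, so the
    histogram of any rectangular window can be read off in O(n_words)."""
    H, W = word_map.shape
    integrals = np.zeros((n_words, H + 1, W + 1), dtype=np.int64)
    for w in range(n_words):
        mask = (word_map == w).astype(np.int64)
        # 2D cumulative sum = integral image (with a zero-padded border)
        integrals[w, 1:, 1:] = mask.cumsum(axis=0).cumsum(axis=1)
    return integrals

def window_histogram(integrals, top, left, bottom, right):
    """Histogram of words inside rows [top, bottom) and columns
    [left, right), via inclusion-exclusion on the integral images."""
    I = integrals
    return (I[:, bottom, right] - I[:, top, right]
            - I[:, bottom, left] + I[:, top, left])
```

This is what makes sliding-window ROI search over many candidate locations affordable: the cost per window is independent of the window size.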
Model Formulation • Goal: learn a mapping from images x to scene labels y • For simplicity, each y is a binary label indicating whether the image belongs to a given scene category • For the multiclass case: train a one-versus-all classifier for each scene • For each prototype Tk, define a set of feature functions based on its ROIs • Include a global feature gk(x), computed as the L2 norm between the Gist representation of image x and the Gist representation of prototype k • Combine all these features to define a global mapping
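The combined mapping can be sketched as a linear score over the global and ROI features. This is a minimal sketch: the function name and the exact way β and λ enter the sum are assumptions based on the slide, not the paper's exact formulation.

```python
import numpy as np

def prototype_score(g, f, beta, lam):
    """Linear prototype-based score for one image.
    g:    (K,) global features, one Gist distance g_k(x) per prototype
    f:    (K, M) ROI features f_kj(x) for prototype k, ROI j
    beta: (K,) prototype-relevance weights
    lam:  (K, M) per-ROI importance weights inside each prototype
    Sketch: weighted global term plus weighted sum of ROI terms."""
    return float(beta @ g + (lam * f).sum())
```

The sign of the prediction would then give the binary label for the one-versus-all classifier on this slide.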
Model Formulation • β: how relevant the similarity to a prototype k is • λ: captures the importance of a particular ROI inside a given prototype • Define the standard regularized classification objective, using the hinge loss function
Learning • Minimize L(β, λ) over a training set D • Use a gradient-based method; at each optimization step compute: • The subgradient with respect to β • The subgradient with respect to λ
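A single optimization step of this kind can be sketched as a subgradient step on an L2-regularized hinge loss, treating all parameters as one stacked weight vector. The function name, the regularization constant C, and the learning rate are illustrative assumptions, not values from the paper.

```python
import numpy as np

def hinge_subgrad_step(w, X, y, C=1.0, lr=0.01):
    """One subgradient step on the L2-regularized hinge loss
        L(w) = ||w||^2 / 2 + C * sum_i max(0, 1 - y_i * (w . x_i)).
    X: (n, d) stacked features per example, y: (n,) labels in {-1, +1}.
    Returns the updated weight vector."""
    margins = y * (X @ w)
    active = margins < 1          # examples violating the margin
    # Subgradient: regularizer plus the active examples' loss terms
    grad = w - C * (y[active, None] * X[active]).sum(axis=0)
    return w - lr * grad
```

Iterating this step over the training set D is one standard way to realize the gradient-based method the slide refers to.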
Experiments • All methods are trained with 331 prototypes • Train a one-versus-all classifier for each category and combine their scores into a single prediction, taking the label with the maximum confidence score • For each category, sample n positive examples and 3n negative examples
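The two procedures on this slide, combining one-versus-all scores by argmax and sampling an imbalanced training set, can be sketched as follows (function names and data are illustrative):

```python
import numpy as np

def predict_multiclass(scores):
    """scores: (n_images, n_classes) matrix of one-vs-all confidence
    scores. The predicted label is the class with the maximum score."""
    return scores.argmax(axis=1)

def sample_training_set(labels, target, n, rng):
    """Sample n positives and 3n negatives for class `target`,
    as on the slide. Returns two arrays of example indices."""
    pos = np.flatnonzero(labels == target)
    neg = np.flatnonzero(labels != target)
    return (rng.choice(pos, size=n, replace=False),
            rng.choice(neg, size=3 * n, replace=False))
```

The 3:1 negative-to-positive ratio keeps training tractable while still exposing each one-versus-all classifier to many non-target scenes.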
Results • 67 indoor scenes sorted by multiclass average precision (training with 80 images per class; testing on 20 images per class)
Conclusion • Draw the attention of the computer vision community to indoor scene recognition, for which current algorithms seem to perform poorly • Propose a prototype-based model that can successfully combine both sources (local & global) of information • However, the performances presented in this paper are close to the performance of the first attempts on Caltech 101 [2]