
From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts






Presentation Transcript


  1. From Tens to Thousands: Efficient Methods for Learning Large-Scale Video Concepts Rong Yan IBM T. J. Watson Research Center Hawthorne, NY 10532 USA Email: yanr@us.ibm.com

  2. The growth in video search has the potential to benefit both enterprise and consumer segments across the world. Sources: eMarketer Research, Veronis Suhler Stevenson Research, AccuStream Market Research

  3. Issues of Current Video Search Systems
  • Based on text metadata and/or manual tags, which are not available for much video
  • Unable to search inside video clips, which are typically associated only with clip-level metadata
  Though there are numerous video-search options, none of them have yet proven to be reliable and accurate:
  • Google Video — “Basketball”: does not broadly search the Web; does not search inside video; cannot distinguish matches showing basketball; favors the Google silo; YouTube videos prominent
  • SearchVideo (Blinkx) — “Basketball”: 214,000 matches related to “basketball”; no way to limit results to relevant scenes showing basketball games; favors its own silo; results limited to partner content
  • YouTube — “Basketball”: results similar to Google Video; favors its own silo; video quality mixed; user-generated and user-provided content
  • SearchVideo (AOL) — “Basketball”: top matches all related to Imus comments; again limited by the inability to detect basketball scenes; favors its own silo; preference for AOL and AOL partner content

  4. Concept-based Video Search • Exciting new direction • Visual indexing with semantic concept detection • (Semi-)automatically produce frame-level indexing based on statistical learning techniques • Search by text keywords without text metadata (Courtesy of Lyndon Kennedy)

  5. Concept-based Video Search

  6. Thousands of video concepts are required to produce good performance for concept-based video retrieval • ~3000 video concepts are needed to reach performance comparable to web (text) retrieval • Extrapolated from search results on 3 standard large-scale video collections • Concept detection accuracy and combination strategies are calibrated against state-of-the-art results • Details: [Hauptmann, Yan and Lin]

  7. Challenges: Efficient (and Effective) Approaches to Detect Thousands of Video Concepts Are Yet to Be Developed
  [Table on slide: TRECVID concepts grouped by category (Program, Location, People, Objects, Activities, Events, Roles, Graphics, Weather), including Office, Studio, Outdoor, Court, Meeting, Sports, Entertain, Maps, Chart, Crowd, Face, Person, G. Leader, C. Leader, Police, Military, Prisoner, Flag-US, Animal, Screen, Vehicle, Road, Airplane, Car, Boat, Bus, Truck, Building, Walk, March, Explosion/Fire, Natural Disaster, Sky, Snow, Urban, Water, Mountain, Desert, Plants]
  • Case study: TRECVID’05-’07, where 39 video concepts are defined
  • A baseline SVM classifier takes ~7 days to learn on 100,000 frames for 39 concepts using a 2.16 GHz dual-core CPU
  • It takes ~3.5 days to generate predictions on 1 million testing frames for 39 concepts
  • ~30 machines are needed for 39 concepts if processing 100 frames per second

  8. New Approaches for a Wide Spectrum of Video Concepts • Automatic: model-shared subspace boosting (domain-independent concepts) • Semi-automatic: cross-domain concept adaptation (domain-dependent concepts) • Semi-manual: learning-based hybrid manual annotation (out-of-domain concepts) [Diagram on slide: the three concept types arranged along a learnability axis]

  9. Roadmap • Motivation and Challenges: Why Efficiency? • (Automatic) Model-shared Subspace Boosting [KDD’07] • (Semi-automatic) Cross-domain Concept Adaptation [MM’07] • (Semi-manual) Learning-based Hybrid Annotation [Submitted] • Conclusions

  10. Prior Art on Automatic Concept Detection • Standard multi-label classification [City U., 07] [IBM, 07] [Tsinghua, 07] • Need to learn an independent classifier for every possible label using all the data examples and the entire feature space • Other image annotation methods [Snoek et al., 05] [Torralba et al., 04] • No mechanisms to reduce the redundancy among labels other than making use of the multi-label relations • Multi-task learning [Ando and Zhang, 05] [Caruana, 97] [Zhang et al., 05] • Treat each label as a single task and use them in an iterative process • Complex and inefficient inference effort to estimate the task parameters

  11. Related Work: Random Subspace Bagging • Improve computational efficiency by removing the redundancy in both the data space and the feature space • For each concept, select a number of bags of training examples, where each bag is randomly sampled from the training data as well as the feature space • Learn a base model on each bag using an arbitrary learning algorithm • Add the base models into a composite classifier • A.k.a. asymmetrical bagging with random subspaces, or random forests (with decision trees) [Diagram on slide: features and training examples sampled into base models M1, M2, ... that form the classifiers]
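  The bagging procedure above can be summarized in a short sketch. This is an illustrative reconstruction rather than code from the talk; the use of scikit-learn's SVC as the base learner, the sampling ratios, and all function names are assumptions.

```python
import numpy as np
from sklearn.svm import SVC  # assumed base learner; any classifier could be plugged in

def random_subspace_bagging(X, y, n_models=10, data_ratio=0.2, feat_ratio=0.2, seed=0):
    """Learn an ensemble where each base model sees only a random bag of
    training examples and a random subset of the feature dimensions."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    ensemble = []
    for _ in range(n_models):
        rows = rng.choice(n, size=max(1, int(data_ratio * n)), replace=True)   # bootstrap sample
        cols = rng.choice(d, size=max(1, int(feat_ratio * d)), replace=False)  # random subspace
        base = SVC(kernel="rbf", probability=True).fit(X[rows][:, cols], y[rows])
        ensemble.append((base, cols))
    return ensemble

def ensemble_score(ensemble, X):
    """Composite classifier: average the base-model scores."""
    return np.mean([m.predict_proba(X[:, cols])[:, 1] for m, cols in ensemble], axis=0)
```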

  12. Missing Pieces for Random Subspace Bagging • RSBag learns classifiers for each concept separately, and thus cannot reduce information redundancy across multiple concepts • It is possible to share and re-use some base models for different concepts [Diagram on slide: base models M1, M2 learned separately for Label 1: Car and Label 2: Road, illustrating that a base model such as M2 could be reused for both]

  13. Model-shared subspace boosting [with J. Tesic and J. Smith] • Model-shared subspace boosting (MSSBoost) iteratively finds the most useful subspace base models, shares them across concepts, and combines them into a composite classifier for each concept • MSSBoost follows the formulation of LogitBoost [Friedman et al., 1998] • The base models are learned from bootstrapped data samples and selected feature subspaces, and can be trained with any learning algorithm • The classifier for each concept is an ensemble of multiple base models • The base models are shared across multiple concepts, so that the same base model can be re-used in different decision functions

  14. MSSBoost Algorithm: Overview • Step 1 (model initialization): initialize a number of base models for each label, where each base model is learned on one label using a random subspace and data bootstrapping • Step 2 (iterative update): search the model pool for the optimal base model and its weight by minimizing a “joint logistic loss function” over all the concepts; update the classifier of every concept by sharing and combining the selected model; then replace this model with a new subspace model learned on the same concept [Diagram on slide: labels L1–L3, a shared pool of base models, and the composite classifiers F1–F3 they form]
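  A greedy sketch of the shared-model selection loop in Step 2, under stated simplifications: a crude grid line search stands in for the paper's Newton-step weight computation, and the "replace the selected model with a freshly trained one" part of Step 2 is omitted. The base-model scores are assumed to come from something like the bagging sketch above; all names are illustrative.

```python
import numpy as np

def logistic_loss(F, y):
    """Logistic loss for one concept; y in {-1, +1}, F are real-valued scores."""
    return np.sum(np.log1p(np.exp(-y * F)))

def mssboost(base_scores, Y, n_rounds=50):
    """Illustrative model-shared subspace boosting loop.

    base_scores: (n_models, n_samples) outputs of pre-trained subspace base models
    Y:           (n_samples, n_concepts) labels in {-1, +1}
    Returns alpha (n_concepts, n_models); concept c's composite classifier is
    F_c(x) = sum_m alpha[c, m] * base_model_m(x), so base models are shared.
    """
    weight_grid = np.linspace(0.0, 1.0, 11)          # candidate combination weights
    n_models, n_samples = base_scores.shape
    n_concepts = Y.shape[1]
    alpha = np.zeros((n_concepts, n_models))
    F = np.zeros((n_samples, n_concepts))            # current composite scores

    for _ in range(n_rounds):
        best_model, best_w, best_gain = None, None, 0.0
        for m in range(n_models):                    # search the shared model pool
            w = np.zeros(n_concepts)
            gain = 0.0
            for c in range(n_concepts):              # per-concept weight for model m
                current = logistic_loss(F[:, c], Y[:, c])
                losses = [logistic_loss(F[:, c] + a * base_scores[m], Y[:, c])
                          for a in weight_grid]
                w[c] = weight_grid[int(np.argmin(losses))]
                gain += current - min(losses)        # reduction of the joint loss
            if gain > best_gain:
                best_model, best_w, best_gain = m, w, gain
        if best_model is None:                       # no model reduces the joint loss
            break
        alpha[:, best_model] += best_w               # share the model across all concepts
        F += np.outer(base_scores[best_model], best_w)
    return alpha
```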

  15. Experiments • Two large-scale image/video collections • TRECVID’05 sub-collection, including 6525 keyframes with 39 concepts • Consumer collection, including 8390 images with 33 concepts • Low-level visual features • 166-dimensional color correlogram • 81-dimensional color moments • 96-dimensional co-occurrence texture • RBF-kernel support vector machines as base models for MSSBoost • 75%-25% training-testing split

  16. Concept Detection Performance • MSSBoost outperforms baseline SVMs using a small number of base models (100 for 39 labels) with a small data/feature sample ratio (~0.1) • MSSBoost consistently outperforms RSBag and non-sharing boosting (NSBoost) • e.g., the number of models needed to reach 90% of the baseline MAP is only 60% of that of RSBag / NSBoost [Figures on slide: results on the Consumer and TRECVID collections]

  17. Concept Detection Efficiency • MSSBoost vs. baseline SVMs (with the same classification performance) • 60-fold / 170-fold speedup on training and 20-fold / 25-fold speedup on testing [Figures on slide: training time and testing time comparisons]

  18. Roadmap • Motivation and Challenges: Why Efficiency? • (Automatic) Model-shared Subspace Boosting [KDD’07] • Automatically exploit information redundancy across the concepts • (Semi-automatic) Cross-domain Concept Adaptation [MM’07] • (Semi-manual) Learning-based Hybrid Annotation [Submitted] • Conclusions

  19. Cross-domain Concept Detection • Adapt concept classifiers from one domain to other domains • Domains can be genres, data sources, programs, e.g., “CNN”, “CCTV” • Adapt from auxiliary dataset(s) to a target dataset • Adaptation is more critical for video (than text) • Bigger semantic gap, e.g., “tennis” • More sensitive to domain change • e.g., average precision of “anchor” drops from 0.9 on TREC’04 to 0.5 on TREC’05

  20. Prior Art on Cross-Domain Detection • Data-level adaptation [Wu et al., 04] [Liao et al., 05] [Dai et al., 07] • Combine auxiliary and target data for training a new classifier • Computationally expensive due to the large amount of training data • Parametric-level adaptation [Marx et al., 05] [Raina et al., 06] [Zhang et al., 06] • Use the model parameters of the auxiliary data as a prior distribution • Models must be parametric and of the same type • Incremental learning [Syed et al., 99] [Cauwenberghs and Poggio, 00] • Continuously update models with subsets of data • Assumes the same underlying distributions without any domain changes • Related: sample bias correction, concept drift, speaker adaptation...

  21. Function-level Adaptation [with J. Yang and A. Hauptmann] • Function-level adaptation: modifies the decision function of old models • Flexibility: the auxiliary classifier can be a “black-box” classifier of any type • Efficiency: auxiliary data is NOT involved in training • Applicability: works even if the auxiliary data is not accessible [Diagram on slide: auxiliary classifier (trained on auxiliary data) + delta function (learned on target data) = adapted classifier]

  22. Learning the “Delta Function”: Risk Minimization • General framework: regularized empirical loss minimization, combining (1) classification errors measured by any loss function L(y, x), and (2) the complexity (norm) of Δf(x), which equals the distance between the auxiliary and adapted classifiers in the function space.
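  The objective on the original slide was an image; a plausible reconstruction from the two components listed above (with λ as an assumed regularization weight) is:

```latex
f(x) = f^{a}(x) + \Delta f(x), \qquad
\min_{\Delta f}\;\; \sum_{i=1}^{N} L\bigl(y_i,\; f^{a}(x_i) + \Delta f(x_i)\bigr)
\;+\; \lambda\,\lVert \Delta f \rVert^{2}
```

  Here f^a is the auxiliary (“black-box”) classifier, the sum runs over the labeled target examples, and the regularizer ‖Δf‖² is the distance between the auxiliary and adapted classifiers in the function space.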

  23. Illustration of Function-level Adaptation • Intuition: seek a new classification boundary that (1) is close to the original boundary and (2) correctly classifies the labeled target examples • A cost factor C determines the contribution of the auxiliary classifier [Figure on slide: auxiliary data vs. target data with the original and adapted boundaries]

  24. Adaptive SVMs • Adaptive SVMs: a special case of function-level adaptation with the hinge loss • A quadratic programming (QP) problem solved by a modified sequential minimal optimization (SMO) algorithm • Training cost: similar to standard SVMs apart from the one-time cost of computing the auxiliary predictions • Adapted classifier: see the sketch below
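  The QP and the adapted classifier were also shown as images on the slide; a sketch of a standard adaptive-SVM formulation with hinge loss, Δf(x) = wᵀφ(x), and an assumed cost factor C (constants and notation may differ from the paper) is:

```latex
\min_{w,\,\xi}\;\; \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N}\xi_i
\quad \text{s.t.}\quad
y_i\bigl(f^{a}(x_i) + w^{\top}\phi(x_i)\bigr) \ge 1 - \xi_i,\qquad \xi_i \ge 0,
\\[4pt]
f(x) \;=\; f^{a}(x) \;+\; \sum_{i=1}^{N} \alpha_i\, y_i\, K(x_i, x)
```

  The auxiliary prediction f^a(x) enters only as an offset, which is why the auxiliary data itself never needs to be touched during training.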

  25. Experiments • TREC Video Retrieval Evaluation (TRECVID) 2005 • 74,523 video shots, 39 labels, 13 programs from 6 channels • Adapt concepts learned from one program to another program

  26. Cross-Domain Detection Performance • Average Precision: Adapt > Aggr ≈ Ensemble > Aux > Target • Using knowledge from the auxiliary data almost always helps in this setting • More classification results in the paper [MM’07]

  27. Cross-Domain Detection Efficiency • Total training time for 39 concepts and 13 programs • Training cost: Target = Ensemble < Adapt << Aggr • Adaptive SVMs achieve a good tradeoff between concept detection effectiveness and efficiency

  28. Roadmap • Motivation and Challenges: Why Efficiency? • (Automatic) Model-shared Subspace Boosting [KDD’07] • Automatically exploit information redundancy across the concepts • (Semi-automatic) Cross-domain Concept Adaptation [MM’07] • Function-level adaptation with high efficiency and flexibility • (Semi-manual) Learning-based Hybrid Annotation [Submitted] • Conclusions

  29. Manual Concept Annotation • Limitations of automatic annotation • Needs sufficient training data • Sometimes hard to learn from low-level visual features • Popularity of manual annotation • High annotation quality and social bookmarking functionality • But labor-intensive and time-consuming • “Vocabulary mismatch” problem • How about speeding up manual annotation? Let users drive, but have computers suggest the right words / images / interface to annotate [Example on slide: images tagged “Book” on Flickr]

  30. Related Work on Efficient Manual Annotation • Active learning: maximizing automatic annotation accuracy with a minimal amount of manual annotation effort • Aims to optimize learning performance rather than annotation time • Most images are annotated automatically (and thus inaccurately), and the learning performance largely depends on the underlying low-level features • Users annotate the most “ambiguous” images, leading to a poor user experience • Leveraging other modalities: e.g., speech recognition, semantic networks, time/location • Requires support from other information sources

  31. Challenges and Proposed Work • Challenges in investigating manual annotation • No formal time models exist for manual annotation • Requires large-scale user studies, which involve a time-consuming annotation process and high user variance • Existing work offers no guidance on developing better manual annotation approaches • Proposed work • Formal time models for two annotation approaches: tagging and browsing • A much more efficient annotation approach based on these models

  32. Manual Annotation (I): Tagging • Allows users to associate a single image at a time with one or more keywords (the most widely used manual annotation approach) • Advantages • Users freely choose arbitrary keywords to annotate • Only relevant keywords need to be annotated • Disadvantages • “Vocabulary mismatch” problem • Inefficient to design and type keywords • Suitable for annotating rare keywords

  33. Formal Time Model for Tagging • Annotation time for one image (see the reconstruction below) • Factors: number of keywords K, time t'_fk for the k-th word, setup time t'_s for a new image • Total expected annotation time for an image collection • Assumption: the expected time to annotate the k-th word is a constant t_f • User study on TRECVID’05 development data: manually tag 100 images using 303 keywords • If the model is correct, a linear fit should be found in the results • The annotation results fit the model very well: t_f = 6.8 sec, t_s = 5.6 sec
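  The tagging time model itself appeared only as an image; a reconstruction consistent with the factors and assumption listed above (N images, image j carrying K_j keywords) is:

```latex
T_{\mathrm{tag}}(\text{image } j) \;=\; t'_s + \sum_{k=1}^{K_j} t'_{fk},
\qquad
\mathbb{E}\bigl[T_{\mathrm{tag}}\bigr] \;=\; N\,t_s \;+\; t_f \sum_{j=1}^{N} K_j
```

  with t_s and t_f the expected setup time per image and annotation time per word; this linear form in the number of annotated words is what the user-study fit checks.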

  34. Manual Annotation (II): Browsing • Allows users to associate multiple images with a single word at the same time • Advantages • Efficient to annotate each image/word pair • No “vocabulary mismatch” • Disadvantages • Must judge both relevant and irrelevant pairs • Must start with a controlled vocabulary • Suitable for annotating frequent keywords

  35. Formal Time Model for Browsing • Annotation time for all images w.r.t. a keyword (see the reconstruction below) • Factors: number of relevant images L_k, annotation time t'_p (t'_n) for a relevant (irrelevant) image • Total expected annotation time for an image collection • Assumption: the expected time to annotate a relevant (irrelevant) image is constant • User study on TRECVID’05 development data: three users manually browse images for 15 minutes (for 25 keywords) • A linear fit should be found in the results • The annotation results fit the model for all users; on average, t_p = 1.4 sec, t_n = 0.2 sec
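  Likewise, a reconstruction of the browsing time model from the factors above (N images in the collection, L_k of them relevant to keyword k) is:

```latex
T_{\mathrm{browse}}(\text{keyword } k) \;=\; L_k\, t_p \;+\; (N - L_k)\, t_n,
\qquad
\mathbb{E}\bigl[T_{\mathrm{browse}}\bigr] \;=\; \sum_{k} \bigl[\, L_k\, t_p + (N - L_k)\, t_n \,\bigr]
```

  For rare keywords most of the (N - L_k) cheap but wasted "irrelevant" judgements dominate, which is why browsing suits frequent keywords.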

  36. Learning-based Hybrid Annotation [with A. P. Natsev and M. Campbell] • Combine both tagging and browsing interfaces to optimize the time for manually annotating image/video collections • Formally model the annotation time as a function of word frequency, time per word, and annotation interface • Learn the visual patterns of existing annotations on the fly • Automatically suggest the right images, keyword, and annotation interface (tagging vs. browsing) to minimize overall annotation time • Combines the advantages of both tagging and browsing

  37. An Illustrative Example of Hybrid Annotation • Users start the annotation process in the tagging interface • No limitation on the keywords • (Automatically) switch to the browsing interface to annotate a set of selected images • Images predicted as relevant to a given keyword with high confidence • Much less time is spent annotating images, without re-typing the same keyword • Switch back to the tagging interface when necessary
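  One way to make the tagging-vs-browsing switch concrete is a cost comparison based on the two time models above. This decision rule, the precision estimate, and the default constants are an illustrative simplification, not the algorithm from the paper.

```python
def choose_interface(n_candidates, est_precision, t_f=6.8, t_p=1.4, t_n=0.2):
    """Decide whether a keyword should be pushed to the browsing interface.

    n_candidates:  images the on-the-fly model predicts as relevant to the keyword
    est_precision: estimated fraction of those predictions that are correct
    t_f, t_p, t_n: expected seconds per typed keyword, per relevant browsing
                   judgement, and per irrelevant browsing judgement
    """
    n_relevant = est_precision * n_candidates
    browse_cost = n_relevant * t_p + (n_candidates - n_relevant) * t_n
    tag_cost = n_relevant * t_f   # the keyword would otherwise be typed once per relevant image
    return "browsing" if browse_cost < tag_cost else "tagging"

# Example: 200 suggested images at ~70% estimated precision -> browsing is far cheaper
# print(choose_interface(200, 0.7))
```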

  38. Simulation Results • Results on two large-scale collections: TRECVID and Corel • More accurate than automatic annotation (annotations are 100% accurate) • More efficient than tagging or browsing alone (2-fold speedup) • More effective than tagging or browsing in a given amount of time [Figures on slide: results on the Corel and TRECVID collections]

  39. Empirical Results • A user spent 1 hour annotating 10 TRECVID videos using each of tagging, browsing, and hybrid annotation • The proposed time models correctly estimate the true annotation time • Hybrid annotation provides much better annotation results [Figures on slide: estimated annotation time and empirical performance]

  40. Conclusions: Efficient Approaches for Learning Large-Scale Video Concepts • Automatic: model-shared subspace boosting • Automatically exploits information redundancy across concepts • Orders-of-magnitude speedup of both the training and testing processes • Semi-automatic: cross-domain concept adaptation • Function-level adaptation with high efficiency and flexibility • Fast cross-domain model updates with a limited number of training examples • Semi-manual: learning-based hybrid annotation • Optimizes overall annotation time using formal annotation time models • Significantly faster than simple tagging or browsing, with accurate annotations

  41. Thank You!

  42. Backup

  43. Properties of MSSBoost • Adaptive Newton methods compute the combination weights αt by minimizing the joint logistic loss function [Proposition 1] • The learning process is guaranteed to converge after a limited number of steps under some general conditions [Theorem 3] • Computational complexity can be considerably reduced by using small sampling ratios and sharing base models across labels • With 100 base models and a data/feature sampling ratio of 20% for 40 concepts: a 50-fold speedup for training and a 10-fold speedup for testing
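  The joint logistic loss referenced in Proposition 1 is not spelled out on the slides; following the LogitBoost-style formulation MSSBoost builds on, it presumably sums a per-concept logistic loss over all C concepts (the exact scaling may differ in the paper):

```latex
\mathcal{L}(F_1,\dots,F_C) \;=\; \sum_{l=1}^{C} \sum_{i=1}^{N}
\log\!\bigl(1 + e^{-\,y_{il}\, F_l(x_i)}\bigr)
```

  where F_l is concept l's composite classifier and the weight αt of a newly shared base model is obtained by a Newton step on this loss.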

  44. Problem Formulation • Adapt classifiers trained on auxiliary datasets to a target dataset • Assumption 1: the target data follows a different but related distribution • Assumption 2: a limited number of target examples are additionally collected [Diagram on slide, bias-variance tradeoff: a new classifier trained only on the target data has high variance; applying the auxiliary classifier directly is biased; the adapted classifier sits in between]

  45. Example: Synthetic Data Examples • 2-D data examples • 1000 data points w. 3 labels and optimal decision boundary

  46. Example: Results of Random Subspace Bagging • Random Subspace Bagging • Base model: decision stump (1-level tree) • 8 base models • RSBag cannot model the decision boundary well with such a small number of base models

  47. Example: Results of MSSBoost • Model-Shared Subspace Boosting • 8 base models • MSSBoost can model the decision boundary much better than its non-shared counterpart with the same number of base models • In other words, it can improve classification efficiency without hurting performance
