Audio Information Retrieval using Semantic Similarity
Luke Barrington, Antoni Chan, Douglas Turnbull & Gert Lanckriet
Electrical & Computer Engineering, University of California, San Diego
lbarrington@ucsd.edu | http://cosmal.ucsd.edu/cal/

Query By Example: A Semantic Solution

Query-by-example is a method for retrieving content from databases: given an example, return similar content. For sound effects audio, the retrieved results could have similar sound (Query by Acoustic Example, QBAE) or similar meaning (Query by Semantic Example, QBSE). We describe QBSE retrieval and demonstrate that it is both more accurate and more efficient than QBAE.

Audio & Text Features

We experiment on the BBC Sound Effects library, a heterogeneous data set of audio track / caption pairs:
- 1305 tracks, 3 to 600 seconds long
- 348-word vocabulary of semantic concepts (words)
- Each track has a descriptive caption of up to 13 words

Each track's audio is represented as a bag of feature vectors: MFCC features plus 1st and 2nd time deltas, giving roughly 10,000 feature vectors per minute of audio. Each track's caption is represented as a bag of words: a binary document vector of length 348. A feature-extraction sketch follows below.
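The feature extraction described above can be sketched as follows. This is a minimal illustration assuming the librosa library; the frame parameters (n_mfcc, hop_length) are illustrative choices, not values taken from the poster.

```python
# Sketch of bag-of-feature-vectors extraction, assuming librosa is available.
# The poster specifies MFCCs plus 1st and 2nd time deltas; the number of
# coefficients and hop length below are illustrative assumptions.
import numpy as np
import librosa

def track_features(path, n_mfcc=13, hop_length=512):
    """Return a (num_frames, 3 * n_mfcc) bag of feature vectors for one track."""
    y, sr = librosa.load(path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    d1 = librosa.feature.delta(mfcc, order=1)   # 1st time delta
    d2 = librosa.feature.delta(mfcc, order=2)   # 2nd time delta
    # Stack MFCCs with their deltas; each frame becomes one feature vector.
    return np.vstack([mfcc, d1, d2]).T

def caption_vector(caption, vocabulary):
    """Binary document vector over the vocabulary (length 348 in the poster)."""
    words = set(caption.lower().split())
    return np.array([1 if w in words else 0 for w in vocabulary], dtype=int)
```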
Acoustic Similarity

Each database track d is represented as a probability distribution over the audio feature space, approximated by a K-component Gaussian mixture model (GMM). The similarity of a database track to a query track is based on the likelihood of the query's audio features under the database track's distribution: database tracks are rank-ordered by decreasing likelihood. Because every database GMM must be evaluated for each query, QBAE complexity grows with the size of the database. A retrieval sketch follows.
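A minimal QBAE sketch, assuming scikit-learn's GaussianMixture; the number of components and the covariance type are illustrative choices, not values reported on the poster.

```python
# Hedged QBAE sketch: one GMM per database track, rank by query likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_track_gmm(features, n_components=8, seed=0):
    """Model one database track as a K-component GMM over its feature vectors."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed)
    gmm.fit(features)
    return gmm

def qbae_rank(query_features, track_gmms):
    """Rank database tracks by the log-likelihood of the query features under
    each track's GMM; higher means more acoustically similar."""
    # score() returns the mean per-sample log-likelihood.
    scores = {tid: gmm.score(query_features) for tid, gmm in track_gmms.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Every database GMM is scored per query, which is the source of the database-size complexity noted above.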
Semantic Models

For the w-th word in the vocabulary, estimate P(a|w), a 'word' distribution over the audio feature vector space. P(a|w) is modeled with a Gaussian mixture model, estimated using Expectation-Maximization. The training data for word distribution P(a|w_i) is the set of all feature vectors from all tracks labeled with word w_i. The semantic model is the set of 'word' GMM distributions.

Sounds → Semantics

Using the learned 'word' distributions P(a|w_i), compute the posterior probability of word w_i given a track. Assume the track's feature vectors x_m and x_n are conditionally independent given w_i. Estimate the track prior by summing over all words. Normalizing the posteriors of all words, we represent a track as a semantic distribution over the vocabulary terms (these equations are reconstructed after the references).

Semantic Similarity

The semantic distributions are points in a semantic space. A natural measure of similarity in this space is the Kullback-Leibler (KL) divergence: given a query track, QBSE retrieves the database tracks that minimize the KL divergence with the query's semantic distribution. The bulk of QBSE computation lies in calculating the semantic distribution for the query track, so complexity grows with the size of the vocabulary rather than the database (a pipeline sketch follows the references).

Results

(Figures from the original poster's Results panel.)

References

Carneiro & Vasconcelos (2005). Formulating semantic image annotation as a supervised learning problem. IEEE CVPR.
Rasiwasia, Vasconcelos & Moreno (2006). Query by Semantic Example. ICIVR.
Slaney (2002). Semantic-audio retrieval. IEEE ICASSP.
Skowronek, McKinney & van de Par (2006). Ground-truth for automatic music mood classification. ISMIR.
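The equations referenced in the "Sounds → Semantics" section did not survive in this copy. The following is a reconstruction from the surrounding prose; the notation (a track's bag of feature vectors X = {x_1, ..., x_M} and vocabulary words w_1, ..., w_V) is assumed rather than copied from the poster.

```latex
% Hedged reconstruction of the "Sounds -> Semantics" equations, under the
% assumed notation X = {x_1, ..., x_M} (one track's feature vectors) and a
% vocabulary of V words w_1, ..., w_V.
\begin{align}
  P(w_i \mid X) &= \frac{P(X \mid w_i)\, P(w_i)}{P(X)}
    && \text{posterior of word } w_i \text{ given the track} \\
  P(X \mid w_i) &= \prod_{m=1}^{M} P(x_m \mid w_i)
    && \text{conditional independence of the } x_m \text{ given } w_i \\
  P(X) &= \sum_{j=1}^{V} P(X \mid w_j)\, P(w_j)
    && \text{track prior: sum over all words} \\
  \pi_i &= \frac{P(w_i \mid X)}{\sum_{j=1}^{V} P(w_j \mid X)}
    && \text{normalized semantic distribution } \pi = (\pi_1, \dots, \pi_V)
\end{align}
```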
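The sketch below ties together the word models, the semantic distribution and KL-divergence retrieval described above. It assumes scikit-learn for the GMMs, a uniform word prior and log-domain normalization; it is an illustration under those assumptions, not the authors' implementation.

```python
# Hedged QBSE pipeline sketch: word GMMs -> semantic distributions -> KL ranking.
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.special import logsumexp

def fit_word_gmms(tracks, captions, vocabulary, n_components=8):
    """One GMM P(a|w) per word, trained on all feature vectors from all tracks
    whose caption contains that word (EM runs inside GaussianMixture.fit)."""
    word_gmms = {}
    for w in vocabulary:
        vecs = [f for f, c in zip(tracks, captions) if w in c]
        if not vecs:          # skip words with no labeled tracks
            continue
        word_gmms[w] = GaussianMixture(n_components=n_components,
                                       covariance_type="diag").fit(np.vstack(vecs))
    return word_gmms

def semantic_distribution(features, word_gmms):
    """Posterior over the vocabulary for one track, assuming a uniform word
    prior; computed in the log domain for numerical stability."""
    log_lik = np.array([gmm.score_samples(features).sum()
                        for gmm in word_gmms.values()])   # log P(X | w_i)
    log_post = log_lik - logsumexp(log_lik)               # normalize over words
    return np.exp(log_post)

def qbse_rank(query_dist, database_dists, eps=1e-12):
    """Rank database tracks by increasing KL divergence KL(query || track)."""
    def kl(p, q):
        p, q = p + eps, q + eps
        return float(np.sum(p * np.log(p / q)))
    scores = {tid: kl(query_dist, d) for tid, d in database_dists.items()}
    return sorted(scores, key=scores.get)
```

Only the query's semantic distribution requires GMM evaluations (one per vocabulary word); the KL comparison against precomputed database distributions is cheap, which is the efficiency argument made in the Semantic Similarity section.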