1 / 18

NM-Tree : Flexible Approximate Similarity Search in Metric and Non-metric Spaces

NM-Tree : Flexible Approximate Similarity Search in Metric and Non-metric Spaces. Tom áš Skopal Jakub Lokoč. Charles University in Prague Department of Software Engineering Czech Republic DEXA 2008, Turin, Italy, Sep 1-5. Presentation outline. Metric similarity search Semimetric tuning

mira
Download Presentation

NM-Tree : Flexible Approximate Similarity Search in Metric and Non-metric Spaces

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. NM-Tree: Flexible Approximate Similarity Search in Metric and Non-metric Spaces Tomáš Skopal Jakub Lokoč Charles University in PragueDepartment of Software Engineering Czech Republic DEXA 2008, Turin, Italy, Sep 1-5

  2. Presentation outline • Metric similarity search • Semimetric tuning • M-Tree • NM-Tree • Experimental results • Discussion

  3. query object Metric similarity search • How to search in multimedia databases (MDB)? • textual annotation is expensive and ambiguous • MDB objects are not structured (we cannot use a structured query language, like SQL) • solution: content based similarity searching;similarity between a query and DB object is interpreted as a relevance • similarity is often modelled by a distance function d satisfying metric properties-> metric searching -> metric access methods (MAMs) • the distance functiond is supposed to be computationallyexpensive-> sequential search is unfeasible -> external indexing structures • metric properties are advantageous for indexing, but may be unsuitable for domain experts • mainly thetriangle inequality -> let's d be relaxed to a semimetric (i.e., reflexive, non-negative, symmetric distance) • we use MAMsalso with semimetric distances, but we have to take more or less incorrect behavior into account • false dismissals & false positives for semimetrics

  4. Semimetric tuning • semimetric brings less limitations for domain experts, but… • semimetric doesn’t guarantee triangle inequality for every triplet of objects in the database -> a lot of non-triangle triplets causes a lot of false dismissals in MAMs and vice versa • exact metric searching = all distance triplets must be triangle triplets • non-triangle triplets cause only approximate but faster search • semimetric tuning = changing the proportion of non-triangle triplets (generated by ds) by applying modifierf(real function) on original semimetric, e.g. ds* = f (ds) • ds* should - satisfy triangle inequality to some extent (controlled precision of searching) - generate lower-dimensional distance distributions (faster searching) a a c c b b Triangle triplet: a + b >= c Non-triangle triplet: a + b < c

  5. Properties of the modifier f f is increasing – preservesoriginal query orderings triangle-generating f = concave f = turns more distance triplets into triangle ones = slow but precise searching triangle-violating f = convex f = turns more distance triplets into non-triangle ones = fast but approximate searching parametric T-bases TriGEN – an algorithm for finding an f that satisfies the user-requested retrieval error e and maximizes the search efficiency (lowest intrinsic dimensionality), see [2] Modified distancesmay form the triangle triplet f f guarantees the fish is always more similar to sea-maid than to girl ds* ds* ds ds

  6. range query Q (euclidean 2D space) M-tree • dynamic, balanced, and paged tree structure (like e.g. B+-tree, R-tree) • the leaves are clusters of indexed objects Oj (ground objects) • routing entries in the inner nodes represent hyper-spherical metric regions (Oi , rOi), recursively bounding the object clusters in leaves • the triangle inequality allows to discard irrelevant M-tree branches (metric regions resp.) during query evaluation

  7. M-tree filtering a) basic filtering (expensive)b) parent filtering (cheap) d(R, Q) > rR + rQ |d(P, Q) – d(P, R)| > rR + rQ

  8. NM-tree motivation • separated usage of TriGen and M-tree - limitations • M-treehierarchy depends on the topology of ds (specific to f used) • For another measureds*the database must be re-indexed • To provide user with choice of precision/efficiency tradeoff we need to maintain more M-trees- each for particular dissimilarity measure !!! • T-bases (returned by TriGEN)natively support inverse modification (as proposed in [2])ds = fe(fe(ds , w), -w) notation : fe-1~ fe(ds, -w) ds* We can mimic multiple M-trees with just one M-tree and an appropriate set of modifiers fe -> NM-tree

  9. NM-tree setbacks Native combination of M-tree and TriGEN brings some problems… • determining modifiers for an empty index (no data received) • solution : gather first k objects into a sequential file then find modifiers • guarantee of exact searching • solution : use metric dm = fm(ds) for inserting -> it is necessary to find fm modifier in the initial phase • aggregated distances stored in M-tree as covering radii • cannot be correctly remodified by f • solution : approximate search only at the preleaf & leaf level

  10. NM-tree • structure • the same structure as for the M-tree • maintains modifiers fe for distance modification • construction • Initial phase • inserting into the sequential file until sufficient number of objects is gathered • then finding modifiers for requested retrieval errors including fm for error = 0 • all objects from the sequential file are inserted into the NM-tree • Second phase • inserting into the NM-tree under dm = fm(ds) -> this makes possible exact searching • querying • additional query parameter – an error threshold e (such that fe is available in NM-tree) • Exact search –NM-tree querying under dm (+ result distances remodified by fm-1) • Approximate search – for stored values (in the index) it is necessary to makeconversion from dm to ds using fm-1 and subsequentlyconversion from ds to desired ds* using fe

  11. NM-tree approximate querying • For upper levels the metric search is performed • For the preleaf and leaf level all the distances used for pruning are modified to the desired semimetric depending on the user-defined error threshold e entry modification d2p* = fe(fm-1(d2p)) e_radius* = fe(fm-1(e_radius)) query modification q_radius* = fe(q_radius) e2qd* = fe(e2qd) dm ds*

  12. NM-tree querying example

  13. Experiments • We have compared multiple M-trees (each for semimetric determined by user defined error) with NM-tree • We have performed our tests on two databases • 68,040 32-dimensional Corel features • 250,000 synthetic 2D polygons, each consisting of 5 to 15 vertices • We have tested one semimetric and one metric on each database • COREL – L0.75 and L2 • Polygons – DTW and Hausdorff • For TriGEN we have selected 10 modifiers for each dissimilarity measure and T-errors (values correlated with retrieval error)within [0 – 0,32]

  14. Experimental results

  15. Experimental results

  16. Experimental results

  17. References • [1] Ciaccia P., Patella M., Zezula P.M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In: VLDB 1997, pp. 426–435 (1997) • [2] Skopal T.Unified framework for fast exact and approximate search in dissimilarity spaces. ACM Transactions on Database Systems 32(4), 1–46 (2007) • [3] Chávez E., Navarro G.A Probabilistic Spell for the Curse of Dimensionality.In: Buchsbaum, A.L., Snoeyink, J. (eds.) ALENEX 2001. LNCS, vol. 2153, pp. 147–160. Springer, Heidelberg (2001)

  18. ... Thank you for attention Questions ??

More Related