A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes

Bilal Hawashin, Farshad Fotouhi, Traian Marius Truta. Department of Computer Science


Presentation Transcript


  1. A Privacy Preserving Efficient Protocol for Semantic Similarity Join Using Long String Attributes. Bilal Hawashin, Farshad Fotouhi, Traian Marius Truta. Department of Computer Science, Wayne State University; Northern Kentucky University

  2. Outline • What is Similarity Join • Long String Values • Our Contribution • Privacy Preserving Protocol For Long String Values • Experiments and Results • Conclusions/Future Work • Contact Information

  3. Motivation Is Natural Join always suitable?

  4. Similarity Join • Joining a pair of records if they have SIMILAR values in the join attribute. • Formally, similarity join consists of grouping pairs of records whose similarity is greater than a threshold, T. • Studied widely in the literature, and referred to as record linkage, entity matching, duplicate detection, citation resolution, …
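The formal definition on this slide can be sketched as a brute-force join. This is only an illustration: the token-set Jaccard measure and the function names `jaccard` and `similarity_join` are assumptions made for the example, not the similarity method used in the presentation.

```python
from itertools import product


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a stand-in for any similarity measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def similarity_join(left, right, threshold, sim=jaccard):
    """Return (i, j, score) for every record pair whose similarity exceeds threshold."""
    matches = []
    for (i, x), (j, y) in product(enumerate(left), enumerate(right)):
        score = sim(x, y)
        if score > threshold:
            matches.append((i, j, score))
    return matches
```

Any other similarity function with the same two-argument shape can be passed as `sim`.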

  5. Our Previous Contribution: Long String Values (ICDM MMIS10) • The term long string refers to the data type representing any string value with unlimited length. • The term long attribute refers to any attribute of long string data type. • Most tables contain at least one attribute with long string values. • Examples are Paper Abstract, Product Description, Movie Summary, User Comment, … • Most of the previous work studied similarity join on short fields. • In our previous work, we showed that using long attributes as join attributes under supervised learning can enhance the similarity join performance.

  6. Example

  7. Our Paper (Motivation) • Some sources may not allow their whole data to be shared in the similarity join process. • Solution: Privacy Preserved Similarity Join. • Using long attributes as join attributes can increase the similarity join accuracy. • To our knowledge, all the current Privacy Preserved SJ algorithms use short attributes. • Most of the current privacy preserved SJ algorithms ignore the semantic similarities among the values.

  8. Problem Formulation • Our goal is to find a Privacy Preserved Similarity Join algorithm for the case where the join attribute is a long attribute, one that also considers the semantic similarities among such long values.

  9. Our Work Plan • Phase 1: Compare multiple similarity methods for long attributes when similarity thresholds are used. • Phase 2: Use the best method as part of the privacy preserved SJ protocol.

  10. Phase1: Finding Best SJ Method for Long Strings with Threshold • Candidate Methods: • Diffusion Maps. • Latent Semantic Indexing. • Locality Preserving Projection.

  11. Performance Measurements F1 Measurement: the harmonic mean of recall R and precision P, $F_1 = \frac{2PR}{P + R}$, where recall is the fraction of the relevant data that is retrieved, and precision is the fraction of the retrieved data that is relevant.
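A small sketch of the F1 computation, using the standard definitions (precision = correct retrieved pairs / all retrieved pairs, recall = correct retrieved pairs / all relevant pairs). The function name and the representation of matches as sets of index pairs are assumptions made for illustration.

```python
def precision_recall_f1(retrieved: set, relevant: set):
    """Compute precision, recall and F1 for a set of retrieved pairs."""
    true_pos = len(retrieved & relevant)
    precision = true_pos / len(retrieved) if retrieved else 0.0
    recall = true_pos / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```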

  12. Performance Measurements(Cont.) Preprocessing time is the time needed to read the dataset and generate matrices that could be used later as an input to the semantic operation. Operation time is the time needed to apply the semantic method. Matching time is the time required by the third party, C, to find the cosine similarity among the records provided by both A and B in the reduced space and compare the similarities with the predefined similarity threshold.

  13. Datasets • IMDB Internet Movies Dataset: Movie Summary field • Amazon Dataset: Product Title, Product Description

  14. Phase1 Results Finding best dimensionality reduction method using Movie Summary from IMDB Dataset (Left) and Product Descriptions from Amazon (Right).

  15. Phase2 Results Preprocessing Time:

  16. Phase2 Results Operation Time for the best performing methods from phase 1. • Matching Time is negligible.

  17. Our Protocol • Both sources A and B share the Threshold value T to decide similar pairs later.

  18. Our Protocol Source A Source B

  19. Find Term_LSV Frequency Matrix for Each Source (Ma for source A)

  20. Find TD_Weighted Matrix Using TF.IDF Weighting (WeightedMa for source A)

  21. TF.IDF Weighting TF.IDF weighting of a term w in a long string value x is given as $\mathrm{TF.IDF}(w, x) = \log(tf_{w,x} + 1) \cdot \log(idf_w)$, where $tf_{w,x}$ is the frequency of the term w in the long string value x, and $idf_w$ is $N / n_w$, where N is the number of long string values in the relation, and $n_w$ is the number of long string values in the relation that contain the term w.
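The two steps above (term frequency matrix, then TF.IDF weighting) can be sketched as follows. This assumes the log-scaled form log(tf + 1) · log(N / n_w), whitespace tokenization, and a term-by-value matrix layout; the function names are hypothetical.

```python
import math
from collections import Counter


def term_frequency_matrix(long_strings):
    """Rows are terms, columns are long string values; entries are raw counts."""
    counts = [Counter(s.lower().split()) for s in long_strings]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in vocab]
    return vocab, matrix


def tfidf_weight(matrix):
    """Apply log(tf + 1) * log(N / n_w) to every entry (assumed weighting form)."""
    n_values = len(matrix[0]) if matrix else 0
    weighted = []
    for row in matrix:
        # n_w: number of long string values containing this term (always >= 1).
        n_w = sum(1 for tf in row if tf > 0)
        idf = n_values / n_w
        weighted.append([math.log(tf + 1) * math.log(idf) for tf in row])
    return weighted
```

Under this form a term that occurs in every long string value gets weight 0 everywhere, since log(N / N) = 0.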

  22. MeanTF.IDF Feature Selection • MeanTF.IDF is an unsupervised feature selection method. • Every feature (term) is assigned a value according to its importance. • The value of a term feature w is given as $\mathrm{MeanTF.IDF}(w) = \frac{1}{N} \sum_{x} \mathrm{TF.IDF}(w, x)$, where TF.IDF(w, x) is the weighting of feature w in long string value x, and N is the total number of long string values.
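A minimal sketch of MeanTF.IDF selection over a weighted term-by-value matrix. The slide does not say how many features are kept, so the top-`k` cutoff is an assumption, as are the function names.

```python
def mean_tfidf(weighted):
    """Mean TF.IDF weight of each term (row) over all N long string values (columns)."""
    return [sum(row) / len(row) for row in weighted]


def select_important(vocab, weighted, k):
    """Return the k terms with the highest MeanTF.IDF value (k is an assumed cutoff)."""
    values = mean_tfidf(weighted)
    ranked = sorted(zip(vocab, values), key=lambda pair: pair[1], reverse=True)
    return [term for term, _ in ranked[:k]]
```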

  23. Apply MeanTF.IDF on WeightedMa and Get Important Features Imp_Fe_a. • Add random features to Imp_Fe_a to get Rand_Imp_Fe_a. • Rand_Imp_Fe_a and Rand_Imp_Fe_b are returned to C. • C finds the intersection and returns the shared important features SF to both A and B.
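The exchange on this slide can be sketched as below. The slide does not say where the random features come from, so the `decoy_pool` argument is a hypothetical source of filler terms; the function names are also assumptions.

```python
import random


def add_random_features(important, decoy_pool, n_decoys, seed=None):
    """Source side: hide the real important features among random decoys."""
    rng = random.Random(seed)
    candidates = sorted(set(decoy_pool) - set(important))
    decoys = rng.sample(candidates, min(n_decoys, len(candidates)))
    mixed = list(important) + decoys
    rng.shuffle(mixed)  # so position does not reveal which entries are real
    return mixed


def shared_features(rand_imp_fe_a, rand_imp_fe_b):
    """Third party C: intersect the two padded lists and return SF."""
    return sorted(set(rand_imp_fe_a) & set(rand_imp_fe_b))
```

In this sketch a decoy that happens to appear in both padded lists would also land in SF, so the two sources would need decoy pools that are unlikely to collide.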

  24. Reduce WeightedM Dimensions in Both Sources Using SF (giving SFa at source A)

  25. Add Random Vectors to SFa to Get Rand_Weighted_a
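A sketch of this padding step, assuming the matrix is laid out term-by-value so that each random vector is an extra column. The uniform distribution, the append-at-the-end placement, and the function name are assumptions; the slides do not specify how the random vectors are generated.

```python
import random


def add_random_vectors(weighted, n_random, seed=None):
    """Append n_random random columns (fake long string values) to each row.

    Returns the padded matrix and the column indices of the fake values,
    which the owning source keeps private so it can strip them later.
    """
    rng = random.Random(seed)
    n_values = len(weighted[0]) if weighted else 0
    # Draw fake weights from the same range as the real ones.
    high = max((v for row in weighted for v in row), default=1.0)
    padded = [
        list(row) + [rng.uniform(0.0, high) for _ in range(n_random)]
        for row in weighted
    ]
    fake_cols = list(range(n_values, n_values + n_random))
    return padded, fake_cols
```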

  26. Find Wa (The Kernel) |Wa| = D x D, where D is the total number of columns in Rand_Weighted_a

  27. Use Diffusion Maps to Find Red_Rand_Weighted_a • [Red_Rand_Weighted_a, Sa, Va, Aa] = Diffusion_Map(Wa, 10, 1, red_dim), red_dim < D • Row i of Red_Rand_Weighted_a is the diffusion map representation of row i of Wa (first row, second row, third row, and so on).
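The kernel and diffusion map steps can be sketched with NumPy as below. This is a generic diffusion map, not a reproduction of the `Diffusion_Map` routine called on the slide: the Gaussian kernel, the `sigma` and `t` parameters, and both function signatures are assumptions, and the meaning of the slide's arguments `10, 1` is not stated in the transcript.

```python
import numpy as np


def gaussian_kernel(term_by_value, sigma=1.0):
    """D x D Gaussian kernel between the D columns (long string values)."""
    x = np.asarray(term_by_value, dtype=float).T  # one row per long string value
    sq = np.sum(x ** 2, axis=1)
    # Squared Euclidean distances, clamped at 0 against rounding error.
    dist2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    return np.exp(-dist2 / (2.0 * sigma ** 2))


def diffusion_map(kernel, red_dim, t=1):
    """Embed each of the D points into red_dim diffusion coordinates (red_dim < D)."""
    w = np.asarray(kernel, dtype=float)
    degree = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    # D^{-1/2} W D^{-1/2} has the same eigenvalues as the Markov matrix D^{-1} W
    # and is symmetric, so eigh can be used.
    sym = w * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(sym)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Convert to right eigenvectors of the Markov matrix.
    psi = eigvecs * d_inv_sqrt[:, None]
    # Skip the trivial first eigenvector (eigenvalue 1, constant coordinates).
    return psi[:, 1:red_dim + 1] * eigvals[1:red_dim + 1] ** t
```

Each row of the returned array is the reduced representation of one long string value, which matches the row-per-record layout the slide describes.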

  28. C Finds Pairwise Similarity Between Red_Rand_Weighted_a and Red_Rand_Weighted_b

  29. If Cos_Sim > T, Insert the Tuple in Matched
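The matching done by the third party C can be sketched as a pairwise cosine comparison against T. The helper `drop_fake`, which each source would run afterwards to discard pairs involving its own random vectors, is a hypothetical name, as are the other function names.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match(reduced_a, reduced_b, threshold):
    """Third party C: return (i, j, cos_sim) for every pair with cos_sim > threshold."""
    matched = []
    for i, u in enumerate(reduced_a):
        for j, v in enumerate(reduced_b):
            score = cosine(u, v)
            if score > threshold:
                matched.append((i, j, score))
    return matched


def drop_fake(matched, fake_a, fake_b):
    """Sources A and B: remove pairs that involve a random (fake) vector."""
    return [(i, j, s) for i, j, s in matched if i not in fake_a and j not in fake_b]
```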

  30. Matched is returned to both A and B. • A and B remove random vectors from Matched and share their matrices.

  31. Our Protocol (Part1)

  32. Our Protocol (Part2)

  33. Phase2 Results Effect of adding random columns on the accuracy.

  34. Phase2 Results Effect of adding random columns on the number of suggested matches.

  35. Conclusions • An efficient, secure semantic SJ protocol for long string attributes is proposed. • Diffusion maps is the best method (among those compared) for semantically joining long string attributes when threshold values are used. • Mapping into the diffusion maps space and adding random records can hide the original data without affecting the accuracy.

  36. Future Work Potential further work: • Compare diffusion maps with more candidate semantic methods for joining long string attributes. • Study the performance of the protocol on huge databases.

  37. Thank You … Dr. Farshad Fotouhi. Dr. Traian Marius Truta.

  38. Questions!
