
Overcoming the Quality Curse



Presentation Transcript


  1. Overcoming the Quality Curse. Sharad Mehrotra, University of California, Irvine. Collaborators/Students (current): Dmitri Kalashnikov, Yasser Altowim, Hotham Altwaijry, Jeffrey Xu, Liyan Zhang. Alumni: Stella Zhaoqi Chen, Rabia Nuray-Turan, Virag Kothari.

  2. Beyond the DASFAA 2003 paper: three directions, improving quality, improving efficiency, and new domains. New domains include video data, image data, speech data, and sensor data, as well as entity search, people search, and location search.

  3. Data cleaning: a vital component of the enterprise data processing workflow. Data sources (OLTP, point of sale, organizational customer data) feed ETL and data cleaning, which feed analysis/mining: historical data analyses; trends, patterns, rules, models; long-term strategies; business decisions. The quality of the data determines the quality of the analysis and, in turn, the quality of the decisions: Quality(Data) → Quality(Decisions).

  4. The Entity Resolution Problem: mapping references in the digital world back to the entities they denote in the real world.

  5. Standard Approach to Entity Resolution. Deciding whether two references u and v co-refer (e.g., 'J. Smith' with email js@google.com vs. 'John Smith' with email sm@yahoo.com) by analyzing their features: compute a feature-based similarity s(u, v) = f(u, v) using a similarity function over the features, and if s(u, v) > t then u and v are declared to co-refer.
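The threshold test above can be sketched as follows. This is an illustrative sketch, not the talk's implementation: the token-set Jaccard measure, the feature weights, and the records are all assumptions chosen to mirror the slide's J. Smith / John Smith example.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity between two strings (illustrative choice)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def similarity(u: dict, v: dict, weights: dict) -> float:
    """Feature-based similarity s(u, v) = f(u, v): a weighted sum over features."""
    return sum(w * jaccard(u[f], v[f]) for f, w in weights.items())

def co_refer(u: dict, v: dict, weights: dict, t: float = 0.5) -> bool:
    """Declare co-reference when the similarity exceeds the threshold t."""
    return similarity(u, v, weights) > t

u = {"name": "J. Smith", "email": "js@google.com"}
v = {"name": "John Smith", "email": "sm@yahoo.com"}
print(co_refer(u, v, {"name": 0.8, "email": 0.2}))
```

Here the names partially match but the emails do not, so a purely feature-based score stays below the threshold, which is exactly the failure mode the "quality curse" slides go on to illustrate.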

  6. Measuring Quality of Entity Resolution
     • Entity dispersion: for an entity, into how many clusters its representations are clustered (ideal is 1).
     • Cluster diversity: for a cluster, how many distinct entities it contains (ideal is 1).
     • Measures: F-measure, B-cubed F-measure, Variation of Information (VI), Generalized Merge Distance (GMD), ...
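The two ideal-is-1 measures above can be computed directly from gold entity labels and predicted cluster ids. A minimal sketch, with averaging as an assumed aggregation (the slide does not specify how per-entity and per-cluster values are combined):

```python
from collections import defaultdict

def dispersion(entities, clusters):
    """Average number of distinct clusters each entity's references land in."""
    per_entity = defaultdict(set)
    for e, c in zip(entities, clusters):
        per_entity[e].add(c)
    return sum(len(s) for s in per_entity.values()) / len(per_entity)

def diversity(entities, clusters):
    """Average number of distinct entities each predicted cluster contains."""
    per_cluster = defaultdict(set)
    for e, c in zip(entities, clusters):
        per_cluster[c].add(e)
    return sum(len(s) for s in per_cluster.values()) / len(per_cluster)

entities = ["A", "A", "A", "B", "B"]   # gold entity labels
clusters = [1, 1, 2, 2, 3]             # predicted clustering
print(dispersion(entities, clusters), diversity(entities, clusters))
```

In this toy example each entity is split across two clusters (dispersion 2.0) and cluster 2 mixes both entities (diversity 4/3), so the clustering fails on both axes.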

  7. The Quality Curse: why the standard feature-based approach leads to poor results. Four references with overlapping features:
     • "S Mehrotra has joined the faculty at University of Illinois. He received his PhD from UT, Austin. He got his bachelors from IIT, Kanpur in India."
     • "Photo collection of Sharad Mehrotra from Beijing, China. June 2007 SIGMOD trip."
     • "S. Mehrotra, PhD from University of Illinois, is visiting UT, Austin to give a talk on prefetching on multiprocessor machines. He received his bachelors from India."
     • "Sharad Mehrotra, research interests: data management, Professor, UC Irvine."
     Feature-based matching yields significant entity dispersion and significant cluster diversity.

  8. Overcoming the Quality Curse (1): look more carefully at the data for additional evidence.

  9. Exploiting Relationships among Entities.
     Author table (clean): A1, 'Dave White', 'Intel'; A2, 'Don White', 'CMU'; A3, 'Susan Grey', 'MIT'; A4, 'John Black', 'MIT'; A5, 'Joe Brown', unknown; A6, 'Liz Pink', unknown.
     Publication table (to be cleaned): P1, 'Databases . . .', 'John Black', 'Don White'; P2, 'Multimedia . . .', 'Sue Grey', 'D. White'; P3, 'Title3 . . .', 'Dave White'; P4, 'Title5 . . .', 'Don White', 'Joe Brown'; P5, 'Title6 . . .', 'Joe Brown', 'Liz Pink'; P6, 'Title7 . . .', 'Liz Pink', 'D. White'.
     Context Attraction Principle (CAP): nodes of the ER graph that are more connected have a higher chance of co-referring to the same entity.
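The CAP intuition on this slide can be sketched by counting short paths in a small ER graph built from the toy tables. Counting simple paths is only one possible connection-strength measure, assumed here for illustration; the thesis work cited on the next slide formalizes this far more carefully.

```python
def num_paths(graph, src, dst, max_len=6):
    """Count simple paths from src to dst with at most max_len edges."""
    count = 0
    stack = [(src, [src])]
    while stack:
        node, path = stack.pop()
        if node == dst and len(path) > 1:
            count += 1
            continue
        if len(path) > max_len:
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:           # keep paths simple (no repeats)
                stack.append((nxt, path + [nxt]))
    return count

def add_edge(g, a, b):
    g.setdefault(a, []).append(b)
    g.setdefault(b, []).append(a)

# A fragment of the ER graph from the toy example: P2's 'D. White'
# connects to Don White via Sue Grey -> MIT -> John Black -> P1.
g = {}
for a, b in [("Don White", "P1"), ("P1", "John Black"),
             ("John Black", "MIT"), ("MIT", "Susan Grey"),
             ("Susan Grey", "P2"), ("Dave White", "Intel"),
             ("Dave White", "P3")]:
    add_edge(g, a, b)

print(num_paths(g, "P2", "Don White"), num_paths(g, "P2", "Dave White"))
```

P2 is connected to Don White through the MIT affiliation chain but not to Dave White at all, so CAP resolves the ambiguous 'D. White' on P2 to Don White.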

  10. Exploiting Relationships for ER (Ph.D. thesis, Stella Chen)
     • Formalizing the CAP principle [SDM 05, IQIS 05]
     • Scaling to large graphs [TODS 06]
     • Self-tuning [DASFAA 07, JCDL 07, Journal IQ 11]: not all relationships are equal; e.g., a mutual interest in Bruce Lee movies is likely not as important as being colleagues at a university for predicting co-authorship.
     • Merging relationship evidence with other evidence [SIGMOD 09]
     • Applying to people search on the Web [ICDE 07, TKDE 08, ICDE 09 (demo)]

  11. Effectiveness of Exploiting Relationships (results on the WePS and multimedia datasets).

  12. Smart Video Surveillance. A camera array in the CS building at UC Irvine tracks human activities; the collected surveillance video goes into a video database, semantic extraction populates an event database, and the event database supports query/analysis.

  13. Event Model. An extracted event records who (face recognition), what (activity recognition), when (temporal placement), where (localization), and other properties; event extraction turns the surveillance video database into an event database for query/analysis. Query examples: Who was the last visitor to Mike Carey's office yesterday? Who spends more time in labs: database students or embedded computing students?
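The who/what/when/where record described above can be sketched as a simple data type. The field types and the sample values are illustrative assumptions, not the system's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Event:
    who: Optional[str]   # face recognition; None while the person is unresolved
    what: str            # activity recognition
    when: datetime       # temporal placement
    where: str           # localization
    # "other properties" from the slide would be added as further fields

# A hypothetical extracted event awaiting person identification:
e = Event(who=None, what="enters office",
          when=datetime(2011, 5, 3, 14, 0), where="CS Building, UC Irvine")
print(e.what, "at", e.where)
```

Queries like "who was the last visitor to Mike Carey's office yesterday?" then become filters and aggregations over a table of such events.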

  14. Person Identification Challenge. The "who" component of the event model is the hardest to fill: given an extracted event, face recognition must decide which person (Bob? Alice?) it depicts.

  15. Traditional Approach: face detection followed by face recognition. Performance is poor: only about 70 faces detected per 1,000 images, and only 2-3 images per person.

  16. Rationale for Poor Performance: poor quality of data (no faces, small faces, low resolution, low temporal resolution). Performance degrades with resolution and sampling rate: relative to the original 1 frame/sec, at 1/2 frame/sec performance drops to 53%-70% of original, and at 1/3 frame/sec to 30%-35%.

  17. Effectiveness of Exploiting Relationships (WePS and multimedia datasets) [IQ2S PerCom 2011].

  18. Results on Face Clustering [ACM ICMR 2013 Best Paper Award]

  19. Results. At high precision, the baseline produces 662 clusters for 31 real people (631 merges needed to consolidate them); the proposed approach produces 203 clusters for the same 31 people (172 merges), roughly 4 times fewer.

  20. Overcoming the Quality Curse (2): look outside the box.

  21. Exploiting Search Engine Statistics. Google search results for "Andrew McCallum" contain context entities (Sebastian Thrun, Tom Mitchell, Machine Learning, Text Retrieval, CRF, UAI 2003). Correlations among these context entities provide an additional source of information to resolve entities; search engine queries are used to learn the correlations, e.g.:
     • Sebastian Thrun AND Tom Mitchell
     • Andrew McCallum AND Sebastian Thrun AND Tom Mitchell
     • (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003)
     • Andrew McCallum AND (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003)
     • Andrew McCallum AND Sebastian Thrun AND (CRF OR UAI 2003)
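One standard way to turn such hit counts into a correlation score is pointwise mutual information. A minimal sketch: `hit_count` is a hypothetical stand-in for a real search engine API, and all counts below are fabricated for illustration only.

```python
import math

FAKE_HITS = {
    (): 25_000_000_000,  # assumed total number of indexed documents
    ("Andrew McCallum",): 120_000,
    ("Sebastian Thrun",): 150_000,
    ("Andrew McCallum", "Sebastian Thrun"): 9_000,
}

def hit_count(*terms):
    """Hypothetical wrapper around a search engine's hit-count API."""
    return FAKE_HITS[tuple(sorted(terms))]

def pmi(a, b):
    """Pointwise mutual information between two context entities."""
    n = hit_count()
    pa, pb = hit_count(a) / n, hit_count(b) / n
    pab = hit_count(a, b) / n
    return math.log(pab / (pa * pb))

print(pmi("Andrew McCallum", "Sebastian Thrun"))  # > 0: strongly correlated
```

A strongly positive PMI says the two names co-occur far more often than chance, which is the kind of evidence that lets the Web-query approach group references sharing correlated contexts.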

  22. Exploiting Web Search Engine Statistics (Ph.D. thesis, Rabia Nuray)
     • Web queries to learn correlations [SIGIR 08]
     • Application to Web people search [WePS 09]
     • Cluster refinement to overcome the singleton cluster problem [TODS 11-a]
     • Making Web querying robust to server-side fluctuations [tech. report]
     • Scaling up the Web query technique [TODS 11-a]

  23. Comparing with the State-of-the-art on WEPS-2 Dataset

  24. Observation/Conclusion
     • Additional evidence can be exploited to improve data quality, BUT it is expensive!
     • Example: the Web queries approach issues about 4K² queries (~40K for K = 100 results), far too many to submit to a search engine while expecting real-time results: roughly 6-8 minutes, given network costs and search engine load.
     • Solutions: local caching of the Web, and asking only the important queries. This reduces the time to 1-2 minutes without degrading quality much.
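The local-caching idea can be sketched with a memoized query wrapper: each distinct query hits the (here simulated) search engine only once, and repeats are served locally. `simulated_search` and its hit counts are fabricated stand-ins for a live web query.

```python
from functools import lru_cache

def simulated_search(query: str) -> int:
    """Stand-in for an expensive live web query; returns a fabricated count."""
    return len(query) * 1000

CALLS = {"n": 0}  # how many times the "search engine" was actually contacted

@lru_cache(maxsize=None)
def cached_hits(query: str) -> int:
    """Local cache of the Web: each distinct query is issued only once."""
    CALLS["n"] += 1
    return simulated_search(query)

queries = ["a AND b", "a AND c", "a AND b"]  # the third repeats the first
results = [cached_hits(q) for q in queries]
print(CALLS["n"])  # 2: the repeated query was answered from the cache
```

The second reduction mentioned on the slide, asking only the important queries, would be a filtering step before this loop, pruning queries whose answers are unlikely to change the clustering.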

  25. (Near) Future: Addressing the Efficiency Curse (continuing the improving-quality / improving-efficiency / new-domains directions beyond DASFAA 2003). Two complementary approaches:
     • Pay-as-you-go data cleaning: a progressive algorithm to obtain the best quality under a given budget constraint.
     • Query-driven data cleaning: perform the minimal cleaning needed to answer the query/analysis task, preventing having to clean unnecessary data.
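The pay-as-you-go idea above can be sketched as a budgeted loop: order candidate pairs by a cheap score and spend the expensive-comparison budget on the most promising pairs first. All names here (the scores, the matcher, the records) are illustrative, not the actual progressive algorithm.

```python
import heapq

def progressive_resolve(pairs, cheap_score, expensive_match, budget):
    """Spend at most `budget` expensive comparisons, best candidates first."""
    ranked = [(-cheap_score(u, v), i) for i, (u, v) in enumerate(pairs)]
    heapq.heapify(ranked)  # max-heap on the cheap score via negation
    merges = []
    while ranked and budget > 0:
        _, i = heapq.heappop(ranked)
        budget -= 1
        if expensive_match(*pairs[i]):
            merges.append(pairs[i])
    return merges

pairs = [("J. Smith", "John Smith"), ("J. Smith", "Jane Doe"),
         ("Liz Pink", "L. Pink")]
cheap = lambda u, v: len(set(u.lower().split()) & set(v.lower().split()))
match = lambda u, v: cheap(u, v) >= 1   # stand-in for a costly matcher
print(progressive_resolve(pairs, cheap, match, budget=2))
```

With a budget of 2, the two high-scoring pairs are resolved and the unpromising ("J. Smith", "Jane Doe") pair is never examined, which is the essence of getting the best quality under a budget constraint.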
