Information Management on the World-Wide Web Junghoo “John” Cho UCLA Computer Science
10 Years Ago • Reading papers for research • Stacks of papers • Long wait
Challenges (1) • Information overload • Too much information, too little time
Information Overload • “XML” to Google • 14 Million matching documents! • “XML” to Amazon • 464 matching books! • Which one to read?
Challenges (2) • Hidden Web • Not indexed by Search Engines • “Hidden” from an average user • Browse every site manually? …
Challenges (3) • Transience
Challenges (4) • Scattered & unstructured data • All Computer Science faculty members and graduate students in the US?
Projects In Our Group • Web Archive • Hidden Web Integration • Page Ranking Algorithm • User Recommendation System
User Recommendation System • 464 books on XML • Which one to read? • The one that my colleagues and friends recommend?
Amazon’s Recommendation System • 1 – 5 star rating by individual users • Books can be sorted by “average user rating”
My Typical Scenario • Sort books by their average user rating • Browse top 20 books to decide what to read
Questions • Is “5 star” by one user better than “4.9 star” by 100 users? • Intuitively, I prefer 4.9 star by 100 users • More “reliable” rating • How much can I trust the rating of a particular person? • How do I know that the person’s rating is reliable?
Our Approach • “Inherent quality” or “rating” of a book • How many users would recommend the book (i.e., give a high rating) if all users had read the book? • More user ratings → more information on the “quality” of the book • An average user is likely to give a high rating to a high-quality book
Probabilistic Rating Model • How likely is it that the book has a “4 star rating”? • Rating probability distribution [Figure: probability density vs. book rating/quality]
Update of Rating Probability • As more users provide ratings, we update our probability distribution • After a five-star rating by a user • After a one-star rating by a user • After many ratings [Figures: probability density vs. book rating/quality at each stage]
Bayesian Inference Theory • Given a user rating UR, what is the inherent rating IR? • P(UR | IR) P(IR) = P(IR | UR) P(UR) • P(IR): probability of book rating BEFORE user rating • P(IR | UR): probability of book rating AFTER user rating
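The update described on the preceding slides can be sketched numerically. This is only an illustration: the talk does not state the form of P(UR | IR), so the Gaussian likelihood, the `noise` parameter, the 0.1-step grid, and the sample ratings below are all assumptions.

```python
import math

# Candidate inherent ratings (IR) on a grid from 1.0 to 5.0 in steps of 0.1.
GRID = [1.0 + 0.1 * i for i in range(41)]


def likelihood(user_rating, inherent_rating, noise=1.0):
    """Assumed P(UR | IR): a Gaussian bump centred on the inherent rating."""
    z = (user_rating - inherent_rating) / noise
    return math.exp(-0.5 * z * z)


def update(prior, user_rating):
    """One Bayesian step: posterior is proportional to likelihood times prior."""
    unnormalized = [likelihood(user_rating, ir) * p for ir, p in zip(GRID, prior)]
    total = sum(unnormalized)  # plays the role of P(UR)
    return [u / total for u in unnormalized]


def mean(dist):
    """Expected inherent rating under a distribution over GRID."""
    return sum(ir * p for ir, p in zip(GRID, dist))


def spread(dist):
    """Standard deviation of a distribution over GRID."""
    m = mean(dist)
    return math.sqrt(sum(p * (ir - m) ** 2 for ir, p in zip(GRID, dist)))


prior = [1.0 / len(GRID)] * len(GRID)  # no ratings yet: flat distribution
after_five = update(prior, 5)          # mass moves toward high quality
after_one = update(prior, 1)           # mass moves toward low quality

after_many = prior
for r in [5, 4, 5, 5, 4, 5, 4, 5, 5, 5]:  # hypothetical ratings
    after_many = update(after_many, r)    # distribution narrows
```

Comparing `mean(after_five)`, `mean(after_one)`, and `spread(after_many)` against the flat prior reproduces the qualitative picture in the figures: single ratings shift the distribution, many ratings sharpen it.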
User Model • The characteristics of a user • Sensitivity: Slope of the curve • +1: good, –1: bad, 0: not useful [Figure: user rating vs. book quality for a “Good” user and a “Bad” user]
User Model • The characteristics of a user • Bias: Average “height” of the curve [Figure: user rating vs. book quality for a user with positive bias and a user with negative bias]
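One way to make sensitivity and bias concrete is to fit a straight line to a user's ratings against book quality. The sketch below is an assumption, not the talk's estimator: it takes sensitivity to be the least-squares slope and bias to be the user's average rating minus the average quality of the books they rated. The function name `fit_user` and the sample data are hypothetical.

```python
def fit_user(qualities, ratings):
    """Return (sensitivity, bias) for one user.

    sensitivity: least-squares slope of rating against book quality.
    bias: the user's mean rating minus the mean quality of the rated books.
    """
    n = len(qualities)
    mean_q = sum(qualities) / n
    mean_r = sum(ratings) / n
    var_q = sum((q - mean_q) ** 2 for q in qualities)
    cov = sum((q - mean_q) * (r - mean_r) for q, r in zip(qualities, ratings))
    sensitivity = cov / var_q if var_q > 0 else 0.0
    bias = mean_r - mean_q
    return sensitivity, bias


quality = [1.0, 2.0, 3.0, 4.0]  # hypothetical book qualities

good_user = fit_user(quality, [1, 2, 3, 4])           # tracks quality
bad_user = fit_user(quality, [4, 3, 2, 1])            # runs opposite to quality
uninformative_user = fit_user(quality, [3, 3, 3, 3])  # same rating for everything
generous_user = fit_user(quality, [2, 3, 4, 5])       # tracks quality, one star high
```

Under these assumptions the four users come out with slopes of +1, –1, 0, and +1, and only the last two carry a non-zero bias.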
Iterative Model Refinement • As more users rate a book, we get better estimates of the book’s quality • As we estimate a book’s quality better, we get a better idea of a user’s sensitivity and bias
Iterative Model Refinement [Diagram: User-provided Rating, Book Rating Estimate, User Characteristics]
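The refinement loop can be sketched as an alternation between the two estimates. The talk does not give its update rules, so everything below is an assumed heuristic: users are fitted by least squares, book quality is a weighted average of bias-corrected ratings, and users with non-positive sensitivity get weight zero. The names `fit_user`, `refine`, and the sample data are hypothetical.

```python
from collections import defaultdict


def fit_user(pairs):
    """Least-squares slope (sensitivity) and mean offset (bias) for one user."""
    n = len(pairs)
    mean_q = sum(q for q, _ in pairs) / n
    mean_r = sum(r for _, r in pairs) / n
    var_q = sum((q - mean_q) ** 2 for q, _ in pairs)
    cov = sum((q - mean_q) * (r - mean_r) for q, r in pairs)
    return (cov / var_q if var_q > 0 else 0.0), mean_r - mean_q


def refine(ratings, rounds=10):
    """Alternate between book-quality estimates and user characteristics.

    ratings: list of (user, book, stars) triples.
    """
    by_book = defaultdict(list)
    for user, book, stars in ratings:
        by_book[book].append((user, stars))

    # Start from the plain average rating of each book.
    quality = {b: sum(s for _, s in rs) / len(rs) for b, rs in by_book.items()}
    users = {}

    for _ in range(rounds):
        # Step 1: given book quality, estimate each user's sensitivity and bias.
        by_user = defaultdict(list)
        for user, book, stars in ratings:
            by_user[user].append((quality[book], stars))
        users = {u: fit_user(pairs) for u, pairs in by_user.items()}

        # Step 2: given user characteristics, re-estimate book quality.
        for book, rs in by_book.items():
            weights = [max(users[u][0], 0.0) for u, _ in rs]
            if sum(weights) > 0:
                corrected = [s - users[u][1] for u, s in rs]
                quality[book] = (
                    sum(w * c for w, c in zip(weights, corrected)) / sum(weights)
                )
    return quality, users


# Three users who agree with each other, one whose ratings run the opposite way.
ratings = []
for u in ("u1", "u2", "u3"):
    ratings += [(u, "X", 5), (u, "Y", 3), (u, "Z", 1)]
ratings += [("u4", "X", 1), ("u4", "Y", 3), ("u4", "Z", 5)]

quality, users = refine(ratings)
```

On this toy data the plain averages start at 4, 3, and 2 for X, Y, and Z; once `u4` is fitted with a negative slope its ratings stop counting and the estimates settle at 5, 3, and 1.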
Final Recommendation • Recommend the book with the highest expected rating
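Ranking by expected rating can be illustrated with a much simpler stand-in than the full model. The sketch below is not the system from the talk: it uses a pseudo-count prior that pulls each book's average toward an assumed `prior_mean` of 3 stars with an assumed `prior_weight` of 5 imaginary ratings. The function name and both book entries are hypothetical.

```python
def expected_rating(ratings, prior_mean=3.0, prior_weight=5.0):
    """Posterior mean under a simple pseudo-count prior (an assumed stand-in)."""
    return (prior_mean * prior_weight + sum(ratings)) / (prior_weight + len(ratings))


books = {
    "one five-star rating": [5],
    "100 ratings averaging 4.9": [5] * 90 + [4] * 10,
}

# Recommend the book with the highest expected rating first.
ranked = sorted(books, key=lambda title: expected_rating(books[title]), reverse=True)
```

With these assumed prior settings the single five-star rating is pulled most of the way back toward 3 stars, while the 100 ratings barely move, so the heavily rated book ranks first.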
Initial Results • Our system prefers a 4.9-star book by 100 people to a 5-star book by 1 user • If a user gives random ratings, the system ignores the user’s rating • More thorough evaluation on the way
Other Projects • Web Archive • Hidden Web Integration • Page Ranking Algorithm
Ph.D. Students on the Projects • Alex Ntoulas • Rob Adams • Victor Liu • In Dr. Chu’s group
Thank You • Questions?