Information Management on the World-Wide Web

Junghoo “John” Cho, UCLA Computer Science

The Web and Information Galore

10 Years Ago • Reading papers for research • Stacks of papers • Long wait

With Web

Challenges (1) • Information overload • Too much information, too little time

Information Overload • “XML” to Google • 14 Million matching documents! • “XML” to Amazon • 464 matching books! • Which one to read?

Challenges (2) • Hidden Web • Not indexed by Search Engines • “Hidden” from an average user • Browse every site manually? …

Challenges (3) • Transience

Challenges (4) • Scattered & unstructured data • All Computer Science faculty members and graduate students in the US?

Projects In Our Group • Web Archive • Hidden Web Integration • Page Ranking Algorithm • User Recommendation System

User Recommendation System • 464 books on XML • Which one to read? • The one that my colleagues and friends recommend?

Amazon’s Recommendation System • 1 – 5 star rating by individual users • Books can be sorted by “average user rating”

My Typical Scenario • Sort books by their average user rating • Browse top 20 books to decide what to read

Questions • Is “5 star” by one user better than “4.9 star” by 100 users? • Intuitively, I prefer 4.9 star by 100 users • More “reliable” rating • How much can I trust the rating of a particular person? • How do I know that the person’s rating is reliable?

Our Approach • “Inherent quality” or “rating” of a book • How many users would recommend the book (i.e., give it a high rating) if all users had read it? • More user ratings → more information on the “quality” of the book • An average user is likely to give a high rating to a high-quality book
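One way to make the "more ratings, more information" point concrete is a small sketch in which each rating is reduced to a recommend / do-not-recommend vote and the unknown fraction of recommending users gets a Beta prior. The Beta prior, the vote reduction, the function name `posterior_mean`, and the vote counts are all assumptions made for illustration; the slides do not specify the model at this level of detail.

```python
from fractions import Fraction

def posterior_mean(recommend: int, total: int,
                   prior_a: int = 1, prior_b: int = 1) -> Fraction:
    """Posterior mean of the fraction of users who would recommend a book.

    Assumes a Beta(prior_a, prior_b) prior and treats every rating as an
    independent recommend / do-not-recommend vote (illustrative assumption).
    """
    if not 0 <= recommend <= total:
        raise ValueError("recommend must be between 0 and total")
    return Fraction(prior_a + recommend, prior_a + prior_b + total)

# Hypothetical vote counts: one enthusiastic rater vs. 100 raters, 98 positive.
one_rater = posterior_mean(1, 1)
many_raters = posterior_mean(98, 100)
print(float(one_rater), float(many_raters))
```

Under these assumptions the single perfect vote moves the estimate only part of the way from the prior, while the 100 votes pin it close to the observed fraction.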

Probabilistic Rating Model • How likely is it that the book has a “4 star rating”? • Rating probability distribution • [Figure: probability density vs. book rating/quality]

Update of Rating Probability • As more users provide ratings, we update our probability distribution • [Figure: probability density vs. book rating/quality, shown in four stages: starting distribution, after a five-star rating by a user, after a one-star rating by a user, and after many ratings]

Bayesian Inference Theory • Given a user rating UR, what is the inherent rating IR? • P(UR | IR) P(IR) = P(IR | UR) P(UR) • P(IR): probability of book rating BEFORE user rating • P(IR | UR): probability of book rating AFTER user rating
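Rearranged, the identity gives P(IR | UR) = P(UR | IR) P(IR) / P(UR). The sketch below applies it on a discrete grid of candidate inherent ratings. The grid, the bell-shaped form of P(UR | IR), the `sigma` parameter, and all function names are assumptions chosen for illustration, not the model used in the talk.

```python
import math

GRID = [1.0 + 0.5 * i for i in range(9)]   # candidate inherent ratings 1.0 .. 5.0
STARS = [1, 2, 3, 4, 5]                     # ratings a user can give

def likelihood(ur: int, ir: float, sigma: float = 1.0) -> float:
    """P(UR | IR): assumed bell-shaped around IR, normalized over the star values."""
    weights = [math.exp(-((s - ir) ** 2) / (2 * sigma ** 2)) for s in STARS]
    return weights[ur - 1] / sum(weights)

def update(prior: list, ur: int) -> list:
    """Return P(IR | UR) on GRID, given P(IR) on GRID and one user rating."""
    unnormalized = [likelihood(ur, ir) * p for ir, p in zip(GRID, prior)]
    evidence = sum(unnormalized)            # P(UR)
    return [u / evidence for u in unnormalized]

def mean(dist: list) -> float:
    """Expected inherent rating under a distribution on GRID."""
    return sum(ir * p for ir, p in zip(GRID, dist))

prior = [1 / len(GRID)] * len(GRID)         # uniform: nothing known yet
after_five = update(prior, 5)               # shifts probability mass upward
after_one = update(prior, 1)                # shifts probability mass downward
print(mean(prior), mean(after_five), mean(after_one))
```

Calling `update` repeatedly, once per incoming rating, corresponds to the "after many ratings" stage: each posterior becomes the prior for the next rating.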

User Model • The characteristics of a user • Sensitivity: slope of the curve (+1: good, –1: bad, 0: not useful) • [Figure: two plots of user rating vs. book quality, labeled “Good” and “Bad”]

User Model • The characteristics of a user • Bias: average “height” of the curve • [Figure: two plots of user rating vs. book quality, labeled “Positive bias” and “Negative bias”]
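The two user characteristics can be sketched as a straight-line fit of one user's ratings against book quality. Here sensitivity is taken as the least-squares slope and bias as the average gap between the user's rating and the book quality; the linear form, the function name `fit_user`, and the sample numbers are assumptions for illustration only.

```python
def fit_user(qualities, ratings):
    """Fit rating ≈ slope * quality + offset for one user by least squares.

    Returns (sensitivity, bias): sensitivity is the slope of the line and
    bias is the mean of (rating - quality), i.e. how high the user's curve
    sits on average. Both definitions are illustrative assumptions.
    """
    n = len(qualities)
    if n != len(ratings) or n < 2:
        raise ValueError("need at least two (quality, rating) pairs")
    mean_q = sum(qualities) / n
    mean_r = sum(ratings) / n
    spread = sum((q - mean_q) ** 2 for q in qualities)
    if spread == 0:
        raise ValueError("book qualities must not all be equal")
    slope = sum((q - mean_q) * (r - mean_r)
                for q, r in zip(qualities, ratings)) / spread
    return slope, mean_r - mean_q

quality = [1, 2, 3, 4]                       # hypothetical book qualities
good = fit_user(quality, [1, 2, 3, 4])       # follows quality
bad = fit_user(quality, [4, 3, 2, 1])        # rates against quality
useless = fit_user(quality, [3, 3, 3, 3])    # rating carries no information
generous = fit_user(quality, [2, 3, 4, 5])   # follows quality, one star high
print(good, bad, useless, generous)
```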

Iterative Model Refinement • As more users rate a book, we get better estimates of the book’s quality • As we estimate a book’s quality better, we get a better idea of a user’s sensitivity and bias

Iterative Model Refinement • [Diagram with three labeled elements: User-provided Rating, Book Rating Estimate, User Characteristics]
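The refinement loop can be sketched as two alternating steps: re-estimate each user's characteristics from the current book estimates, then re-estimate each book from the users' ratings. This sketch substitutes a simple heuristic for the probabilistic model: a user's weight is the Pearson correlation between their ratings and the current estimates (clipped at zero), and ratings are corrected by the user's average offset. The function names, the weighting rule, and the sample data are all hypothetical.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation; 0.0 when either input has no spread."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def refine(ratings, iterations=10):
    """ratings: {(user, book): stars} -> (book_quality, user_weight, user_bias)."""
    by_user, by_book = defaultdict(dict), defaultdict(dict)
    for (user, book), stars in ratings.items():
        by_user[user][book] = stars
        by_book[book][user] = stars
    # Start from the plain average rating of each book.
    quality = {b: sum(r.values()) / len(r) for b, r in by_book.items()}
    weight, bias = {}, {}
    for _ in range(iterations):
        # Step 1: user characteristics from the current book estimates.
        for user, rated in by_user.items():
            stars = [rated[b] for b in rated]
            current = [quality[b] for b in rated]
            bias[user] = sum(s - q for s, q in zip(stars, current)) / len(stars)
            weight[user] = max(pearson(stars, current), 0.0)
        # Step 2: book estimates from bias-corrected, weighted ratings.
        for book, raters in by_book.items():
            total = sum(weight[u] for u in raters)
            if total > 0:
                quality[book] = sum(weight[u] * (s - bias[u])
                                    for u, s in raters.items()) / total
    return quality, weight, bias

# Hypothetical data: three consistent users and one who rates against them.
data = {("u1", "A"): 5, ("u1", "B"): 4, ("u1", "C"): 2,
        ("u2", "A"): 4, ("u2", "B"): 3, ("u2", "C"): 1,
        ("u3", "A"): 5, ("u3", "B"): 4, ("u3", "C"): 2,
        ("u4", "A"): 1, ("u4", "B"): 3, ("u4", "C"): 5}
quality, weight, bias = refine(data)
print(quality, weight, bias)
```

In this toy data the contrarian user ends up with zero weight, so the book estimates are driven by the three consistent users. The absolute level of the estimates depends on the starting averages, so only the ordering should be read from this sketch.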

Final Recommendation • Recommend the book with the highest expected rating
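As a minimal sketch of this step, assume each book already has a probability distribution over its rating, stored as a mapping from rating value to probability. The expected rating is the probability-weighted mean, and the recommendation is the book that maximizes it. The data layout, function names, book titles, and probabilities below are hypothetical.

```python
def expected_rating(dist):
    """dist: {rating_value: probability}. Returns the probability-weighted mean."""
    total = sum(dist.values())
    if total <= 0:
        raise ValueError("distribution must have positive total probability")
    return sum(rating * p for rating, p in dist.items()) / total

def recommend(books):
    """books: {title: distribution}. Returns the title with the highest expected rating."""
    return max(books, key=lambda title: expected_rating(books[title]))

# Hypothetical rating distributions for two books.
books = {
    "Book A": {3: 0.2, 4: 0.5, 5: 0.3},   # mostly 4 stars, little spread
    "Book B": {1: 0.3, 3: 0.1, 5: 0.6},   # often 5 stars, but risky
}
print(recommend(books))
```

Here "Book B" has the larger chance of being a five-star book, yet "Book A" has the higher expected rating, which is the criterion the slide names.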

Initial Results • Our system prefers a 4.9-star book by 100 people to a 5-star book by 1 user • If a user gives random ratings, the system ignores the user’s rating • More thorough evaluation on the way

Other Projects • Web Archive • Hidden Web Integration • Page Ranking Algorithm

Ph.D. Students on the Projects • Alex Ntoulas • Rob Adams • Victor Liu • In Dr. Chu’s group

Thank You • Questions?
