Information Extraction from Social Media

Information Extraction from Social Media. Tim Finin, 10 October 2006. Overview: motivation, blogs and feeds, UMBC research, seedling opportunities, conclusion.


Presentation Transcript


  1. Information Extraction from Social Media Tim Finin 10 October 2006

  2. Overview • Motivation • Blogs and feeds • UMBC research • Seedling opportunities • Conclusion

  3. Motivation “Social media describes the online tools and platforms that people use to share opinions, insights, experiences, and perspectives with each other.” (Wikipedia, Sept 06) It’s a dynamic and growing area that includes blogs, wikis, forums, photo and video sharing sites, etc.

  4. Motivation • We started looking at blogs a year ago because they were rich in metadata • Encoded in RDF and other formats • We’ve found that blogs and other social media are a rich source of problems and opportunities, including • Information integration on the Web • Modeling trust • Extracting facts, opinions and sentiment • Event and trend detection • If static pages form the Web’s long term memory, then the Blogosphere is its stream of consciousness

  5. Overview • Motivation • Blogs and feeds • UMBC research • Seedling opportunities • Conclusion

  6. State of the Blogosphere • 52 million blogs • Doubling in size every six months • 40 new blog posts per second • 57% of online US teens generate content, 40% read blogs, 20% have them • 53% of companies are blogging • One third of blog posts are in English • Sources: State of the Blogosphere (Technorati), Fortune 500 Business Blogging Wiki, Pew 11/05, Guideware 10/05, UMBC studies

  7. 50,000,000 Weblogs (July 2006) • Doubling in size every 6 months for the past 3 years Weblogs Cumulative: 03/03 – 07/06

  8. June 2006: Posts by language

  9. Feeds • RSS: Really Simple Syndication, Rich Site Summary or RDF Site Summary • 1997: David Winer introduced an XML syndication format for blogs • 1999: Netscape defined RSS using RDF • Very important for blogs and other social media • An efficient way to distribute new items, changes, updates • Simplifies infrastructure, obviating crawling • Google Blog Search is really Google feed search • Feeds for “most recent” blog posts, Wikipedia changes, news articles, sensor information, photos, data elements, etc.
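
To make the "feeds obviate crawling" point concrete, here is a minimal sketch of reading new items from an RSS 2.0 feed using only Python's standard library. The sample feed, its URLs and the `parse_items` helper are illustrative assumptions, not part of the talk; a real aggregator would also need to handle Atom, RSS 1.0 (RDF) and malformed feeds.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document standing in for a fetched blog feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>http://example.org/1</link>
      <pubDate>Tue, 10 Oct 2006 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.org/2</link>
      <pubDate>Tue, 10 Oct 2006 10:30:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item> element in an RSS 2.0 document."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in parse_items(SAMPLE_RSS):
        print(entry["published"], entry["title"], entry["link"])
```

Because the publisher lists its newest items in one small document, a consumer only has to poll that document instead of crawling and diffing the whole site.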

  10. Overview • Motivation • Blogs and feeds • UMBC research • Seedling opportunities • Conclusion

  11. Relevant UMBC Research • Splog detection • Feeds that matter • BlogVox: Extracting opinions from blogs • Modeling influence in blog communities • Semnews: NLP for information extraction on the Web • Semdis: Modelling trust in social networks

  12. Knowing and influencing the market • Your goal is to market Apple’s iPod phone • How can you track the buzz about it? • What are the relevant communities and blogs? • Which communities are fans, which are suspicious, which are put off by the hype? • Is your advertising having an effect? The desired effect? • Which bloggers are influential in this market? Of these, which are already onboard and which are lost causes? • To whom should you send details or evaluation samples?

  13. Modeling influence in social media • Key individuals in a social network are those that are influential • Influential nodes often rely on connectors and information propagators for new topics • Influence is topical • Aggregated beliefs and opinions of the masses can have an influence • Influence is polar • Influence is temporal

  14. Modeling influence in social media • Key individuals in a social network are those that are influential • Influential nodes often rely on connectors and information propagators for new topics • Influence is topical • Aggregated opinions of the masses can have an influence • Influence is polar • Influence is temporal

  15. Post was Influenced by NPR and eWeek Influence on the Blogosphere

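  16. Influence Models for Blogs • Blog Graph → Influence Graph • W_{u,v} = C_{u,v} / d_v • U links to V => U is influenced by V [Figure: a blog graph with link counts on its edges, shown beside the corresponding influence graph with fractional edge weights such as 1/3, 2/5, 1/5 and 1/2]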
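
The weighting rule on this slide can be sketched in a few lines of Python. The slide does not define its symbols, so this sketch assumes that C_{u,v} counts the hyperlinks behind an influence edge and that d_v is the total number of links made by the influenced blog, which makes the weights arriving at each blog sum to 1. The function name and input format are hypothetical.

```python
from collections import defaultdict


def influence_weights(links):
    """Turn blog-graph links into weighted influence edges.

    links: iterable of (source_blog, target_blog) pairs, one per hyperlink.
    Returns {(influencer, influenced): weight}.
    Assumption: weight = links from influenced to influencer, divided by
    all links made by the influenced blog.
    """
    counts = defaultdict(int)   # (influencer, influenced) -> link count
    totals = defaultdict(int)   # influenced blog -> total links it made
    for src, dst in links:
        # src links to dst, so src is influenced by dst: reverse the edge.
        counts[(dst, src)] += 1
        totals[src] += 1
    return {(u, v): c / totals[v] for (u, v), c in counts.items()}


if __name__ == "__main__":
    blog_links = [("a", "b"), ("a", "b"), ("a", "c"), ("c", "b")]
    for edge, weight in sorted(influence_weights(blog_links).items()):
        print(edge, round(weight, 3))
```

Normalizing by the influenced blog's link total is a design choice that keeps each node's incoming weights in [0, 1], which is what a threshold-based activation rule expects.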

  17. Basic Influence Models • Linear Threshold Model: node v becomes active when Σ w_{u,v} ≥ θ_v, where the sum runs over the active neighbors u of v • Cascade Model: p_{u,v} is the probability with which a node can activate each of its neighbors, independent of history [Figure: influence graph with active and inactive nodes, fractional edge weights and a node threshold θ_v]
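
As an illustration of the Linear Threshold Model, the sketch below repeatedly activates any node whose active in-neighbors carry enough total weight. The edge weights, thresholds and function name are made-up inputs for the example; in the literature the thresholds are usually drawn at random, whereas here they are fixed so the run is deterministic.

```python
def linear_threshold(weights, thresholds, seeds):
    """Spread activation until no further node crosses its threshold.

    weights:    {(u, v): w} meaning u influences v with weight w
    thresholds: {v: theta_v} for every node that can be activated
    seeds:      initially active nodes
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, theta in thresholds.items():
            if v in active:
                continue
            # Sum the weights of edges arriving from already-active nodes.
            incoming = sum(w for (u, t), w in weights.items()
                           if t == v and u in active)
            if incoming >= theta:
                active.add(v)
                changed = True
    return active


if __name__ == "__main__":
    weights = {("a", "b"): 0.6, ("a", "c"): 0.2, ("b", "c"): 0.5}
    thresholds = {"b": 0.5, "c": 0.6}
    # "a" activates "b" directly; "c" needs both "a" and "b" to be active.
    print(sorted(linear_threshold(weights, thresholds, {"a"})))
```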

  18. Greedy Node Selection Heuristic • At each time step, select the next node to add to the target set such that it maximizes: • the number of ‘influential’ nodes • the increase in the activated node set caused by adding the new node • consistent with Technorati rank [Figure: influence graph; distribution of Technorati ranks in the 100 most frequently selected nodes using greedy heuristics (averaged over 50+ runs)]
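
A greedy selection loop of this kind might look like the following sketch. It is not the authors' implementation: the spread function is a deliberately simple stand-in (plain reachability in the influence graph), and `reachable`, `greedy_select` and the toy graph are hypothetical names and data. Any activation model, such as linear threshold or cascade, could be passed in as `spread` instead.

```python
def reachable(graph, seeds):
    """Stand-in spread model: every node reachable from the seeds is active."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        node = frontier.pop()
        for nxt in graph.get(node, ()):
            if nxt not in active:
                active.add(nxt)
                frontier.append(nxt)
    return active


def greedy_select(graph, k, spread=reachable):
    """Pick up to k seed nodes, each time adding the node whose inclusion
    enlarges the activated set the most. Stops early if nothing helps."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    chosen = []
    for _ in range(k):
        current = len(spread(graph, chosen))
        best, best_gain = None, 0
        # Sorted iteration makes tie-breaking deterministic.
        for node in sorted(nodes - set(chosen)):
            gain = len(spread(graph, chosen + [node])) - current
            if gain > best_gain:
                best, best_gain = node, gain
        if best is None:
            break
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    influence = {"a": ["b", "c"], "b": ["c"], "d": ["e"], "e": []}
    print(greedy_select(influence, 2))
```

Each round re-evaluates the marginal gain of every remaining candidate, which is simple but costs one spread computation per candidate per round.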

  19. Modeling influence in social media • Key individuals in a social network are those that are influential • Influential nodes often rely on connectors and information propagators for new topics • Influence is topical • Aggregated opinions of the masses can have an influence • Influence is polar • Influence is temporal

  20. Influence is topical • Gizmodo is very popular • It’s influential for consumer electronics, e.g., PDAs, mobile phones, gadgets • DailyKOS is very popular • It’s influential for politics, especially liberal politics • What’s a good ontology for blog topics? • How can we categorize blogs w.r.t. a topic ontology?

  21. Readership Based Influence Feeds That Matter: http://ftm.umbc.edu/ • 83K publicly listed subscribers • 2.8M feeds, 500K are unique • 26K users (35%) use folders to organize subscriptions • Data collected in May 2006

  22. Tag Cloud Before Merge

  23. Tag Cloud After Merge

  24. Tag Merging Folder names are used as topics. Lower ranked folders are merged into a higher ranked folder if there is an overlap and a high cosine similarity.
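
The merging rule can be sketched as follows, under several assumptions the slide does not spell out: each folder name is represented as a vector of feed counts, folders are ranked by total usage, and "high" similarity means a cosine of at least 0.5. The threshold, the ranking criterion and all names below are illustrative only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0


def merge_folders(folders, threshold=0.5):
    """folders: {folder_name: Counter(feed -> times filed under that name)}.

    Visits folders from highest to lowest rank and merges each one into the
    first already-kept folder it is similar enough to. A positive cosine
    implies the two folders share at least one feed (the overlap condition).
    Returns (merged_folders, alias_map).
    """
    ranked = sorted(folders, key=lambda n: sum(folders[n].values()),
                    reverse=True)
    merged, alias = {}, {}
    for name in ranked:
        vec = folders[name]
        target = next((m for m in merged
                       if cosine(merged[m], vec) >= threshold), None)
        if target is None:
            merged[name] = Counter(vec)
        else:
            merged[target].update(vec)
            alias[name] = target
    return merged, alias


if __name__ == "__main__":
    folders = {
        "politics": Counter({"dailykos": 5, "instapundit": 3}),
        "political": Counter({"dailykos": 2, "instapundit": 1}),
        "gadgets": Counter({"gizmodo": 4}),
    }
    merged, alias = merge_folders(folders)
    print(sorted(merged), alias)
```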

  25. Finding Influential Feeds using “Co-Citations” Feed recommendations: leading blogs about “Politics”. The seed set is the top blogs in “politics” from Bloglines, and the blog graph used is from the Blogpulse dataset.

  26. Modeling influence in social media • Key individuals in a social network are those that are influential. • Influential nodes often rely on connectors and information propagators for new topics. • Influence is topical. • Aggregated facts and opinions of the masses can have an influence (‘wisdom of the crowds’) • Influence is polar. • Influence is temporal.

  27. Extracting facts and opinions • 2006 TREC blog track: finding opinionated blog posts about a given topic • SemNews: extracting facts from Web documents using the OntoSem NLP system • Note: there are several startups and other companies trying to commercialize opinion mining

  28. TREC Opinion Extraction • Finding opinionated posts, either positive or negative, about a query • 2006 TREC Blog corpus: • 80K blogs • 300K posts • 50 test queries

  29. BlogVox: Opinion Extraction • Pipeline: Query Terms → Lucene Search Results → Result Scoring → SVM Score Combiner → Opinionated Ranked Results • Result scorers: 1 Query Word Proximity Scorer, 2 Query Word Count Scorer, 3 Title Word Scorer, 4 First Occurrence Scorer, 5 Context Words Scorer, 6 Lucene Relevance Score • Supporting Lexicons: Positive Word List, Negative Word List • External Resources: Google Context Words, Amazon Review Words
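
The idea of scoring each retrieved post with several simple scorers and then combining the scores can be sketched as below. This is not BlogVox itself: only three of the six scorers are approximated, an opinion-word density feature stands in for the lexicon-based scoring, and a fixed weighted sum replaces the SVM combiner. The word lists, weights and function names are all hypothetical.

```python
import re

# Tiny hypothetical lexicons; real systems use much larger word lists.
POSITIVE = {"love", "great", "excellent"}
NEGATIVE = {"hate", "awful", "terrible"}

# Hypothetical weights standing in for a trained SVM score combiner.
WEIGHTS = {"count": 1.0, "first": 0.5, "title": 1.0, "opinion": 2.0}


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def features(query, title, body):
    """Compute one score per (simplified) scorer for a single post."""
    q = set(tokenize(query))
    words = tokenize(body)
    n = len(words) or 1
    hits = [i for i, w in enumerate(words) if w in q]
    return {
        # Query word count: share of body tokens that are query terms.
        "count": len(hits) / n,
        # First occurrence: earlier mentions of the query score higher.
        "first": 1.0 - hits[0] / n if hits else 0.0,
        # Title words: share of query terms that appear in the title.
        "title": len(q & set(tokenize(title))) / (len(q) or 1),
        # Opinion density: share of tokens found in either lexicon.
        "opinion": sum(w in POSITIVE or w in NEGATIVE for w in words) / n,
    }


def score(query, title, body):
    f = features(query, title, body)
    return sum(WEIGHTS[name] * value for name, value in f.items())


def rank(query, posts):
    """Order posts (dicts with 'title' and 'body') by combined score."""
    return sorted(posts, key=lambda p: score(query, p["title"], p["body"]),
                  reverse=True)


if __name__ == "__main__":
    posts = [
        {"title": "Weather", "body": "It rained today"},
        {"title": "My ipod review", "body": "I love my ipod it is great"},
    ]
    for post in rank("ipod", posts):
        print(round(score("ipod", post["title"], post["body"]), 3),
              post["title"])
```

A learned combiner would be fitted on labeled relevance judgments, such as those provided for the TREC queries, rather than using hand-set weights.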

  30. Spam in the Blogosphere • Types: comment spam, ping spam, splogs • Akismet: “87% of all comments are spam” • 75% of update pings are spam (ebiquity 2005) • 56% of blogs are spam (ebiquity 2005) • 20% of the blogs indexed by popular blog search engines are spam (Umbria 2006, ebiquity 2005) • Spam blogs (splogs) are weblogs used to promote affiliated websites or host ads • “Spings, or ping spam, are pings that are sent from spam blogs” (Wikipedia)

  31. Motivation: host ads

  32. Motivation: index affiliates, promote pageRank

  33. Some queries returned mostly splogs hybrid cars cholesterol

  34. Post Content Identification • Baseline Heuristic • SVM Method

  35. Effect of sidebar content

  36. Preliminary results

  37. Modeling influence in social media • Key individuals in a social network are those that are influential • Influential nodes often rely on connectors and information propagators for new topics • Influence is topical • Aggregated opinions of the masses can have an influence • Influence is polar • Influence is temporal

  38. Link Polarity / Citation Signal • Linking alone is not an indicator of influence • Polarity can indicate the type of influence • Not all links are made equal: • Post • Comment • Trackback • Blogroll • Advertising • Polarity is useful in other applications like trust and bias [Figure: graph of blogs A, B, C and D whose links carry <topic, polarity> labels such as <books, -0.9>, <Movies, +0.9>, <food, +0.3>, <cars, +0.5>, <Movies, +0.8> and <Music, -0.6>]
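
One simple way to attach a polarity in [-1, +1] to a link is to count sentiment words in the text surrounding it. The slide does not say how polarity is computed, so the sketch below is only an assumed approach with hypothetical word lists, a hypothetical window size and a hypothetical `link_polarity` helper; here the link is identified by the name of the cited blog as it appears in the text.

```python
import re

# Small hypothetical sentiment lexicons.
POSITIVE = {"love", "great", "insightful", "agree"}
NEGATIVE = {"hate", "terrible", "misleading", "wrong"}


def link_polarity(text, anchor, window=5):
    """Score the words within `window` tokens of the first mention of
    `anchor`: +1 if only positive words appear, -1 if only negative,
    0 when no sentiment words are found or the anchor is absent."""
    words = re.findall(r"[a-z']+", text.lower())
    anchor = anchor.lower()
    if anchor not in words:
        return 0.0
    i = words.index(anchor)
    context = words[max(0, i - window): i + window + 1]
    pos = sum(w in POSITIVE for w in context)
    neg = sum(w in NEGATIVE for w in context)
    return (pos - neg) / (pos + neg) if pos + neg else 0.0


if __name__ == "__main__":
    print(link_polarity("I really love the analysis at gizmodo today",
                        "gizmodo"))
    print(link_polarity("What a terrible misleading post from dailykos",
                        "dailykos"))
```

Pairing such a score with a topic label would yield edge annotations of the <topic, polarity> form shown in the slide's figure.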

  39. Modeling influence in social media • Key individuals in a social network are those that are influential • Influential nodes often rely on connectors and information propagators for new topics • Influence is topical • Aggregated opinions of the masses can have an influence • Influence is polar • Influence is temporal

  40. Unwind the Influence in Time • Who started the initial wave? • Who jumped on the story at the same time? • How far did the wave propagate? [Figure: a story spreading from a source S to other blogs at times t1 through t5]

  41. Visualizing Influence in Time

  42. SemNews: News to OWL • Semantically search and browse news • Aggregators collect the RSS news descriptions from various sources • The sentences are processed by OntoSem and converted into TMRs • And then into RDF and OWL • Provides intelligent agents with the latest news in a machine-readable format • http://semnews.umbc.edu/

  43. [Architecture diagram, flattened in extraction] Panels: Data Aggregators, Language Processing, Fact Repository Interface, Semantic Web Tools • Components (numbered 1 to 15 in the figure): RSS Aggregator, News Feeds, OntoSem, TMRs, FR, OntoSem2OWL, OntoSem Ontology (OWL), Inferred Triples, Semantic RSS, Swoogle Index, Text Search, RDQL Query, Ontology & Instance browser, Dekade Editor, Knowledge Editor Environment
