
Language and Geography

Language and Geography. Brendan O’Connor Social Media Analysis, 3/18/2010. http://anyall.org/blog/2009/05/where-tweets-get-sent-from/. Analyze Geography and Language. Using Twitter data: (1) Identify author & message locations (2) Side note: opinions about self’s location Applications:


Presentation Transcript


  1. Language and Geography Brendan O’Connor Social Media Analysis, 3/18/2010

  2. http://anyall.org/blog/2009/05/where-tweets-get-sent-from/

  3. Analyze Geography and Language
     Using Twitter data:
     (1) Identify author & message locations
     (2) Side note: opinions about self’s location
     Applications:
     (3) Analyze language use by geography
         • Example: find regional dialects
     (4) Predict geographically embedded real-world phenomena
         • Example: per-state retail sales

  4. Application: Retail Forecasting

  5. Identify author locations

  6. U.S. State Identification
     • String-matching approach
     • Match on
       • Full names (“Pennsylvania”)
         • Case-insensitive
       • Abbreviations (“PA”)
         • Case-sensitive
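The matching rule on this slide can be sketched in Python roughly as follows. This is a minimal illustration, not the presenter's code: the `STATE_NAMES` table is a small hypothetical subset, the function name `identify_state` is invented here, and trying full names before abbreviations is an assumption the slide does not state.

```python
import re

# Hypothetical subset; a real table would cover every state and territory.
STATE_NAMES = {
    "Pennsylvania": "PA",
    "California": "CA",
    "New York": "NY",
    "Michigan": "MI",
}

# Full names are matched case-insensitively, so "new york" is accepted.
NAME_RE = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in STATE_NAMES) + r")\b",
    re.IGNORECASE,
)

# Abbreviations are matched case-sensitively, so "pa" or "Pa" is rejected.
ABBREV_RE = re.compile(
    r"\b(" + "|".join(sorted(set(STATE_NAMES.values()))) + r")\b"
)

NAME_LOOKUP = {name.lower(): abbrev for name, abbrev in STATE_NAMES.items()}


def identify_state(location):
    """Return a state abbreviation for a free-text location, or None."""
    match = NAME_RE.search(location)
    if match:
        return NAME_LOOKUP[match.group(1).lower()]
    match = ABBREV_RE.search(location)
    if match:
        return match.group(1)
    return None


print(identify_state("new york"))              # full name, lower case
print(identify_state("Sacramento, CA"))        # abbreviation, upper case
print(identify_state("somewhere in pa"))       # lower-case abbreviation is ignored
```

Because abbreviations are only accepted in upper case, ordinary lower-case words such as "in" or "or" never match, which is presumably why the slide treats the two cases differently.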

  7. Examples (detected state, then the user's location string)
     AZ  Scottsdale, AZ
     MO  St. Louis, MO
     MI  Michigan
     CA  Sacramento, CA
     FL  Jacksonville, FL
     CA  Santa Cruz, CA
     IN  Indianapolis, Indiana
     CA  2OH!9, California
     TX  Dallas, TX
     NY  new york
     IL  Chicago, IL
     CT  Hartford, CT
     GA  Georgia
     HI  Hawaii
     WA  Seattle, WA, USA
     CT  Watertown, CT
     CA  Bay Area, California
     DC  DC Metro Area
     IA  Iowa
     NC  Raleigh, NC
     CA  California
     CA  southern california
     GA  Atlanta, GA
     CA  Porn Valley, CA
     TN  Newbern, TN
     CA  Westlake Village, CA, USA
     MS  Dourados, MS
     ME  U GOTTA CATCH ME!
     CA  Malibu, California
     NC  North Carolina
     NY  Windsor, NY

  9. Problems? AL AK AS AZ AR CA CO CT DE DC FM FL GA GU HI ID IL IN IA KS KY LA ME MH MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND MP OH OK OR PW PA PR RI SC SD TN TX UT VT VI VA WA WV WI WY

  10. Brazil
      • Brazilian states use two-letter abbreviations like the U.S., and many overlap with U.S. state abbreviations
        • Belém,PA
        • São Luís, MA
        • Maceió AL
      • “SC”
        • Myrtle Beach, SC
        • Charleston, SC, U.S.A.
        • Joinville - SC
        • Mafra - SC
        • Palmtios – SC
        • FLORIANÓPOLIS, SC, BRASIL

  11. U.S. State Identification
      • String-matching approach
      • Match on
        • Full names (“Pennsylvania”)
          • Case-insensitive
        • Abbreviations (“PA”)
          • Case-sensitive
      • Brazil check
      • Common words check
        • DE, ME
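The slides do not say how the Brazil check or the common words check were implemented, so the following Python sketch is only one plausible reading. Everything in it is an assumption made for illustration: the function name `passes_checks`, the hint that a location is Brazilian if it mentions "Brasil"/"Brazil" or contains non-ASCII characters, and the rule that DE and ME count only in a "City, ST" shape.

```python
import re

# Abbreviations shared by U.S. and Brazilian states that appear in the slide examples.
BRAZIL_OVERLAP = {"PA", "MA", "AL", "SC", "MS"}

# Abbreviations that are also common words (English "me", Portuguese/Spanish "de").
COMMON_WORD_ABBREVS = {"DE", "ME"}

# Assumed hint: the location names the country explicitly.
BRAZIL_HINT_RE = re.compile(r"\b(brasil|brazil)\b", re.IGNORECASE)


def passes_checks(location, abbrev):
    """Return False if a matched abbreviation looks like a false positive."""
    if abbrev in BRAZIL_OVERLAP:
        # Assumed heuristic: accented characters suggest a Portuguese place name.
        has_non_ascii = any(ord(ch) > 127 for ch in location)
        if BRAZIL_HINT_RE.search(location) or has_non_ascii:
            return False
    if abbrev in COMMON_WORD_ABBREVS:
        # Assumed heuristic: accept only the "City, ST" shape.
        if not re.search(r",\s*" + re.escape(abbrev) + r"\b", location):
            return False
    return True


print(passes_checks("FLORIANÓPOLIS, SC, BRASIL", "SC"))
print(passes_checks("Myrtle Beach, SC", "SC"))
print(passes_checks("U GOTTA CATCH ME!", "ME"))
```

A heuristic this simple still lets strings such as "Joinville - SC" through, since nothing in them marks the location as Brazilian; a gazetteer of Brazilian city names would be one way to close that gap.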

  12. Experiment
      • 4,793,729 messages (stream sample)
      • 2,309,284 unique users
      • 1,624,983 unique users with non-blank location
      • Detections
        • 838,012 U.S. State
        • 346,553 Latitude, Longitude
        • 3,163 Five-digit Zip Code
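The slide lists latitude/longitude pairs and five-digit zip codes as detection types but does not show how they were found. A rough Python sketch, with patterns and function names that are assumptions rather than the presenter's method, might look like this:

```python
import re

# Assumed pattern: two decimal numbers separated by a comma, e.g. "40.443,-79.943".
LATLON_RE = re.compile(r"(-?\d{1,3}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")

# Assumed pattern: exactly five digits, not part of a longer digit run.
ZIP_RE = re.compile(r"(?<!\d)\d{5}(?!\d)")


def detect_coordinates(location):
    """Return (lat, lon) as floats if the string contains a plausible pair."""
    match = LATLON_RE.search(location)
    if not match:
        return None
    lat, lon = float(match.group(1)), float(match.group(2))
    # Reject number pairs that cannot be geographic coordinates.
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return lat, lon
    return None


def detect_zip(location):
    """Return a five-digit zip code string if one is present."""
    match = ZIP_RE.search(location)
    return match.group(0) if match else None


# Illustrative inputs, not taken from the data set.
print(detect_coordinates("iPhone: 40.443,-79.943"))
print(detect_zip("Pittsburgh 15213"))
```

A five-digit run is only a candidate zip code; checking it against a list of valid codes would be needed to rule out other numbers.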

  13. Examples (detected state, then the user's location string)
      OH  clevelandohio sadly :(
      NY  Syracuse, NY :)
      IL  Close to ur heart =],Illinois
      TX  S.A TX :D
      MN  Minnesota :)
      CA  California, Newport Beach :)
      SC  JERSEY but in Cola SC 4 now:-)
      NC  Charlotte,NC =(
      CA  Playboy Mansion California. :)
      NY  Bronx,NY :)

  14. States, happy:sad, %happy
      State  happy:sad  %happy
      ND     2:3        0.400
      NV     2:3        0.400
      MO     6:7        0.462
      ID     2:2        0.500
      WY     2:2        0.500
      RI     5:3        0.625
      UT     5:3        0.625
      KY     12:6       0.667
      MT     2:1        0.667
      NE     6:3        0.667
      NH     2:1        0.667
      SD     4:2        0.667
      MA     13:6       0.684
      WI     11:5       0.688
      WV     7:3        0.700
      NM     5:2        0.714
      AR     6:2        0.750
      CT     12:4       0.750
      DE     6:2        0.750
      SC     10:3       0.769
      MS     11:3       0.786
      OR     11:3       0.786
      IN     19:5       0.792
      AK     4:1        0.800
      GU     4:1        0.800
      KS     12:3       0.800
      MN     17:4       0.810
      IA     9:2        0.818
      OH     41:9       0.820
      HI     15:3       0.833
      MD     22:4       0.846
      MI     29:5       0.853
      AL     18:3       0.857
      VA     25:4       0.862
      PA     19:3       0.864
      NC     15:2       0.882
      CO     8:1        0.889
      TN     16:2       0.889
      WA     18:2       0.900
      NJ     40:4       0.909
      PR     10:1       0.909
      OK     11:1       0.917
      FL     90:8       0.918
      GA     24:2       0.923
      ME     12:1       0.923
      LA     55:4       0.932
      AZ     31:2       0.939
      DC     16:1       0.941
      NY     146:9      0.942
      IL     19:1       0.950
      TX     151:5      0.968
      CA     211:6      0.972
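The %happy column is consistent with happy / (happy + sad), reported as a fraction rather than a percentage. A short Python check, using three rows copied from the table (the function name `happy_fraction` is just a label chosen here):

```python
def happy_fraction(happy, sad):
    """Share of emoticon-bearing locations that are happy."""
    return happy / (happy + sad)


# (happy, sad) counts copied from three rows of the table.
counts = {"ND": (2, 3), "PA": (19, 3), "CA": (211, 6)}

for state, (happy, sad) in counts.items():
    print(state, f"{happy}:{sad}", f"{happy_fraction(happy, sad):.3f}")
# ND 2:3 0.400
# PA 19:3 0.864
# CA 211:6 0.972
```

Many states have only a handful of observations (ND, for example, has five in total), so the fractions at the low end of the table are especially noisy.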

  15. Emoticon parsing
      Emoticon counts (count, emoticon):
      5658 :)    2032 :D    1391 ;)    845 =)     701 :]     583 :/
      554 =]     461 ;D     437 :P     338 =D     278 :(     245 ;]
      197 :-)    138 ;-)    128 =P     122 :p     93 :O      67 ;P
      51 :o      44 =/      42 ;p      33 =p      31 =(      26 :\
      25 ;o      22 =[      20 :-D     20 :[      15 :-P     15 ;O
      14 =O      11 :-p     9 :-/      9 ;(       8 :-(      8 ;/
      7 :d       7 ;d       7 :-]      5 =o       3 ;-P      3 :-O
      3 ;-D      3 ;[       2 ;-p      2 =d       2 ;-(      2 =\
      1 :-d      1 :-[      1 ;\       1 ;-]      1 =-]      1 =-)

      Regular expression components (Python):
      # Character classes for each part of an emoticon
      NormalEyes = r'[:=]'
      Wink = r'[;]'
      NoseArea = r'(|o|O|-)'  # the nose is optional
      HappyMouths = r'[D\)\]]'
      SadMouths = r'[\(\[]'
      Tongue = r'[pP]'
      OtherMouths = r'[doO/\\]'
      # Only normal-eyed emoticons count as happy or sad
      Happy = NormalEyes + NoseArea + HappyMouths
      Sad = NormalEyes + NoseArea + SadMouths
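As a usage illustration, the Happy and Sad patterns from this slide can be compiled and applied to location strings like the ones on slide 13. The `classify` function and the choice to test Happy before Sad are assumptions made for this sketch; the slides do not say how a string containing both kinds of emoticon was handled.

```python
import re

# Pattern components as given on the slide.
NormalEyes = r'[:=]'
NoseArea = r'(|o|O|-)'
HappyMouths = r'[D\)\]]'
SadMouths = r'[\(\[]'

HAPPY_RE = re.compile(NormalEyes + NoseArea + HappyMouths)
SAD_RE = re.compile(NormalEyes + NoseArea + SadMouths)


def classify(location):
    """Label a location string 'happy', 'sad', or None (no emoticon found)."""
    if HAPPY_RE.search(location):
        return "happy"
    if SAD_RE.search(location):
        return "sad"
    return None


# Location strings taken from slide 13.
for text in ["clevelandohio sadly :(", "Close to ur heart =],Illinois",
             "JERSEY but in Cola SC 4 now:-)", "Charlotte,NC =("]:
    print(classify(text), "|", text)
```

Winks and tongue emoticons are deliberately left out of both classes here, matching the slide's definitions of Happy and Sad, which start from NormalEyes only.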
