High-Level Text Analysis and Techniques

Duke University Libraries, Digital Scholarship. Text > Data, October 25. Angela Zoss, Data Visualization Coordinator, 226 Perkins Library, angela.zoss@duke.edu.

Presentation Transcript


  1. Duke University Libraries, Digital Scholarship. Text > Data, October 25. Angela Zoss, Data Visualization Coordinator, 226 Perkins Library, angela.zoss@duke.edu. High-Level Text Analysis and Techniques

  2. Documents as Context

  3. But first, Angela As Context

  4. How I learned to love the document. B.A. courses: Linguistics, Communication. M.S. courses: Communication, Human-Computer Interaction. Employment: arXiv.org Administrator. Ph.D. courses: • Bibliometrics/Scientometrics • Computer-Mediated Discourse Analysis • Latent Structure Analysis • Natural Language Processing

  5. Now, Documents as Context

  6. Text analysis from… • documents down to words (“low-level”) • words up to documents (“high-level”)

  7. Using documents to learn about language (or other social phenomena). Analyzing documents as records/proxies of language, social structures, events, etc. Linguistic studies: • morphology, word counts, syntax, etc. • … over time (e.g., Google Ngram Viewer) • language across corpora (e.g., political speeches). Underwood, T. (2012). Where to start with text mining.

  8. Using documents to learn about language. Example: Historical culturomics of pronoun frequencies

  9. Using documents to learn about language. Example: Universal properties of mythological networks

  10. Using language to learn about documents. Analyzing documents as artifacts themselves, with their own properties and dynamics. Literary, documentary studies: • Structural/rhetorical/stylistic analysis • Document categorization, classification • Detecting clusters of document features (topic modeling). Underwood, T. (2012). Where to start with text mining.

  11. Using language to learn about documents. Example: Literary Empires, Mapping Temporal and Spatial Settings in Swinburne

  12. Using language to learn about documents. Example: Using Word Clouds for Topic Modeling Results

  13. What are documents? For this discussion, digital versions of works of spoken or written language. Examples: books, articles, transcripts, emails, tweets…

  14. Documents as context. Documents have: • form(at) • style • provenance • entities • intentions

  15. Studies of Documents

  16. Why study documents? • Describe a corpus • Compare/organize documents • Locate relevant information/filter out irrelevant information

  17. Describing a corpus • Finding regularities/differences across groups of documents • Developing theories of structure, style, etc. that can then be tested or applied • May be manual (content analysis) or computer-assisted (statistical)

  18. Example: Storylines http://xkcd.com/657/

  19. Differences of format, genre, participants… • Articles may have sections, but these will vary by discipline and type of article • Books may be fiction or non-fiction (or both) • Transcripts may refer to multiple speakers, non-text content • …ad infinitum

  20. Example: Literature Fingerprinting. Keim, D. A., & Oelke, D. (2007). Literature fingerprinting: A new method for visual literary analysis. In IEEE Symposium on Visual Analytics Science and Technology, VAST 2007 (pp. 115-122). doi: 10.1109/VAST.2007.4389004

  21. Organizing documents. Detect similarity between documents and a known category (or simply among themselves). Supports browsing, sentiment analysis, authorship detection.
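
One common way to "detect similarity" between documents is to compare their word-count vectors with cosine similarity. The sketch below is only an illustration under that assumption; the talk does not name a specific measure, and the helper names (`word_counts`, `cosine_similarity`) and the three toy documents are made up for the example.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


# Toy corpus (hypothetical): two documents about whales, one about markets.
docs = {
    "a": "the whale swam across the sea",
    "b": "the whale dived into the sea",
    "c": "stock prices fell sharply today",
}
vectors = {name: word_counts(text) for name, text in docs.items()}
sim_ab = cosine_similarity(vectors["a"], vectors["b"])
sim_ac = cosine_similarity(vectors["a"], vectors["c"])
```

Here `sim_ab` comes out higher than `sim_ac` because the first two documents share vocabulary and the third shares none. Real studies usually weight or filter very common words before comparing.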

  22. Example: Bohemian Bookshelf Thudt, A., Hinrichs, U., & Carpendale, S. (2012). The Bohemian Bookshelf: Supporting Serendipitous Book Discoveries through Information Visualization. In CHI '12: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, to appear.

  23. Similarity based on… • common document attributes: authorship, genre • common language patterns: topics, phrases • common entity references: characters, citations

  24. Example: Quantitative Formalism Allison, S., Heuser, R., Jockers, M., Moretti, F., & Witmore, M. (2011). Quantitative formalism: An experiment. Pamphlets of the Stanford Literary Lab (vol. 1).

  25. Example: Clinton’s DNC Speech http://b.globe.com/TogUqq

  26. Example: View DHQ http://digitalliterature.net/viewDHQ/vis3.html

  27. Classification • assigning an object to a single class • often supervised, using an existing classification scheme and a tagged corpus
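
As a rough illustration of supervised classification from a tagged corpus, the sketch below trains a small multinomial naive Bayes classifier with add-one smoothing and assigns each new document to exactly one class. Naive Bayes is an assumed stand-in, not a method named in the talk, and the genre labels and example sentences are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(tagged_corpus):
    """Collect per-class document and word counts from (text, label) pairs."""
    class_docs = Counter()
    class_words = defaultdict(Counter)
    vocabulary = set()
    for text, label in tagged_corpus:
        tokens = tokenize(text)
        class_docs[label] += 1
        class_words[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, class_words, vocabulary


def classify(text, model):
    """Return the single class with the highest log-probability."""
    class_docs, class_words, vocabulary = model
    total_docs = sum(class_docs.values())
    tokens = [t for t in tokenize(text) if t in vocabulary]  # skip unseen words
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # class prior
        total_words = sum(class_words[label].values())
        for token in tokens:
            # Add-one smoothing keeps unseen (class, word) pairs above zero.
            count = class_words[label][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical tagged corpus: each document carries exactly one genre label.
tagged_corpus = [
    ("the detective examined the clue", "mystery"),
    ("a murder clue puzzled the detective", "mystery"),
    ("the spaceship orbited a distant planet", "scifi"),
    ("robots repaired the damaged spaceship", "scifi"),
]
model = train(tagged_corpus)
predicted = classify("the detective found another clue", model)
```

The classification scheme (the set of labels) is fixed in advance by the tagged corpus, which is what makes this supervised.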

  28. Example: Relative signatures Jankowska, M., Keselj, V., & Milios, E. (2012). Relative n-gram signatures: Document visualization at the level of character n-grams. In Proceedings of IEEE Conference on Visual Analytics Science and Technology 2012 (pp. 103-112).

  29. Categorization • assigning documents to one or more categories • suggestive of unsupervised clustering techniques • design choices made to fit particular tasks or goals
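
To contrast with single-class classification, the sketch below groups documents without any labels and lets a document fall into more than one group. It uses a simple leader-style clustering over Jaccard word overlap. Both the technique and the threshold of 0.2 are assumptions chosen for illustration, and the result depends on document order.

```python
import re


def word_set(text):
    """Lowercase the text and return its set of distinct word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Share of distinct words the two sets have in common."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def overlapping_clusters(docs, threshold=0.2):
    """Put each document in every cluster whose seed it resembles.

    A document that resembles no existing seed starts a new cluster and
    becomes that cluster's seed. No labels are needed.
    """
    seeds = []     # word set of the first document in each cluster
    clusters = []  # document indices per cluster
    for index, text in enumerate(docs):
        words = word_set(text)
        matches = [i for i, seed in enumerate(seeds)
                   if jaccard(words, seed) >= threshold]
        if matches:
            for i in matches:
                clusters[i].append(index)
        else:
            seeds.append(words)
            clusters.append([index])
    return clusters


# Hypothetical article titles; the last one bridges both topics.
docs = [
    "gene expression in cancer cells",
    "galaxy formation and dark matter",
    "cancer cells and gene therapy",
    "dark matter effects on gene expression",
]
clusters = overlapping_clusters(docs)
```

The last title lands in both clusters, which is the "one or more categories" behavior. Changing the threshold or the similarity measure is one of the design choices that would be tuned to the task.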

  30. Example: UCSD Map of Science Börner, K., Klavans, R., Patek, M., Zoss, A. M., Biberstine, J. R., Light, R. P., Larivière, V., & Boyack, K. W. (2012). Design and update of a classification system: The UCSD Map of Science. PLoS ONE, 7(7), e39464.

  31. Example: NIH Map Viewer https://app.nihmaps.org/nih/browser/

  32. Reference systems, infrastructure. What do we gain by adding structure? What do we lose?

  33. Summarizing Documents

  34. Text is only one component of a document. Research questions often push us to be creative with how we operationalize constructs. The richness of language and documents is best preserved by using multiple, complementary approaches.

  35. angela.zoss@duke.edu Questions?
