Introduction of the ASV Subproject Report on recent state of work and other activities

Introduction of the ASV Subproject: Report on recent state of work and other activities. eTRACES Sponsor Meeting, Leipzig, 2012/05/07. Marco Büchler, Natural Language Processing Group, Department of Computer Science, University of Leipzig. What do you do with a million books?

Presentation Transcript


  1. Introduction of the ASV Subproject Report on recent state of work and other activities eTRACES Sponsor Meeting Leipzig, 2012/05/07 Marco Büchler Natural Language Processing Group Department of Computer Science University of Leipzig

  2. What do you do with a million books?

  3. We do not have any native speakers for ancient languages like ancient Greek and Latin ...

  4. Agenda • Scope of ULEI's subproject • Who is involved? • ACID for the eHumanities as a paradigm for successful projects

  5. Basics for ULEI's subproject

  6. A fundamental problem: How to find relevant information in massive data?

  7. Two initially associated documents

  8. Documents are linked with a direction

  9. Documents are linked with a direction: such as web links

  10. Documents are linked in both directions: A loop

  11. Detecting relevance: a document can be linked by more than one doc

  12. Detecting relevance: a document can be linked by more than one doc

  13. Detecting relevance on an entire digital library

  14. Computing relevance weights (by reliability) on an entire digital library. This strategy is known as Google's PageRank algorithm. Source: http://en.wikipedia.org/wiki/PageRank

  15. Some aspects of Google's PageRank algorithm • Ranking is done by relevance weights (weighted links to a page) • Benefit for humanities applications: • Ranking does not necessarily need term weights as computed with tf-idf • e. g. Shakespeare's „to be, or not to be“ In humanities-relevant data, however, we do not have a link structure like the one in web-based HTML files.
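The link-based ranking described on slides 6 to 15 can be illustrated with a minimal sketch of PageRank by power iteration. This is not code from the presentation: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the four-document example graph are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a relevance weight per node.

    `links` maps each document to the list of documents it links to.
    Hypothetical helper for illustration only.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Weight held by documents without outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n
            for node in nodes
        }
        # Every document passes its weight on to the documents it links to.
        for source, targets in links.items():
            if not targets:
                continue
            share = damping * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical example: A, B and D all link to C, and C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for doc in sorted(ranks, key=ranks.get, reverse=True):
    print(doc, round(ranks[doc], 3))
```

In this toy graph the document linked by the most others (C) receives the highest weight, and A outranks B and D because it is linked by the highly weighted C. No term weights are involved at any point.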

  16. A similar problem: Two initial documents with text re-use

  17. Given e. g. dating information: text re-use with direction I Our assumption: A quotation always implies that the quoting author considers the quoted author relevant, either in a positive or in a negative way.

  18. Given e. g. dating information: text re-use with direction II

  19. Given e. g. dating information: text re-use with direction III
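The step from undirected text re-use to directed links (slides 16 to 19) can be sketched as follows. This is an assumption-laden illustration, not the project's method: the function `direct_reuse_links`, the document identifiers, and the years are hypothetical, and a pair is simply skipped whenever dating information cannot decide the direction.

```python
def direct_reuse_links(reuse_pairs, year_of):
    """Turn undirected re-use pairs into (quoting, quoted) edges.

    `reuse_pairs` is a list of (doc_a, doc_b) tuples with detected re-use.
    `year_of` maps a document id to a year (negative values for BCE).
    Hypothetical helper for illustration only.
    """
    edges = []
    for doc_a, doc_b in reuse_pairs:
        year_a, year_b = year_of.get(doc_a), year_of.get(doc_b)
        if year_a is None or year_b is None or year_a == year_b:
            continue  # direction cannot be decided from dating alone
        if year_a > year_b:
            edges.append((doc_a, doc_b))  # the later text quotes the earlier
        else:
            edges.append((doc_b, doc_a))
    return edges


# Hypothetical documents and dates; text_d has no dating information.
year_of = {"text_a": -700, "text_b": -400, "text_c": 150}
pairs = [("text_a", "text_b"), ("text_c", "text_a"), ("text_b", "text_d")]
edges = direct_reuse_links(pairs, year_of)
print(edges)
```

The resulting edges point from the quoting text to the quoted text, which is the same shape as a web link and could therefore feed a link-based ranking.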

  20. An old discipline: Text re-use in traditional humanities Manually produced record of text re-use.

  21. Some research objectives In addition to Google's PageRank: • Differentiate by • Text re-use temperature • Text re-use coverage • Relevance by • high score • low score

  22. Some answers to the initial questions/statements What do you do with a million books? • Cultural heritage of textual re-uses • Text re-use graphs for something like a „Cultural Heritage aware PageRanking“ We do not have any native speakers for ancient languages like ancient Greek and Latin ... - Crowd sourcing yields good-quality results on historical texts; humanists, however, are not native speakers - The „Cultural Heritage aware PageRanking“ approach aims to capture the relevance judgements of native speakers even though they are no longer available today

  23. Who is involved?

  24. Active collaborators eTRACES/ULEI (Prof. Dr. Gerhard Heyer) 'The Team' Interface:projects (Dr. Uwe Crenze) The business partner Fragmentary texts (Dr. Monica Berti) 'The Humanist' Perseus Digital Library (Prof. Dr. Gregory Crane) 'The Content Provider'

  25. „ACID for the eHumanities“

  26. A new paradigm for successful eHumanities projects • The million dollar question: How to manage an eHumanities project successfully? • After 4 years of activities in the eHumanities, you need just four questions: Acceptance: How do you get humanists' acceptance for your techniques? Complexity: Understand the complexity of necessary subtasks! e. g.: What is the archetypus? Interoperability: How can components or data interact with each other? Diversity: Understand your data! e. g.: What does text re-use mean for your digital library? The ACID paradigm for the eHumanities

  27. „ACID for the eHumanities“: Interoperability

  28. „ACID for the eHumanities“: (Data) Interoperability I Perseus DdbDP (XML) vs. Epiduke (XML) Source: Pansch, D. 2010, Data Integration Methods for Structural Heterogeneous Data in an eHumanities' Context, Bachelor thesis, 2010.

  29. „ACID for the eHumanities“: (Data) Interoperability II Source: Pansch, D. 2010, Data Integration Methods for Structural Heterogeneous Data in an eHumanities' Context, Bachelor thesis, 2010.

  30. „ACID for the eHumanities“: (Data) Interoperability III • Several kinds of interoperability issues on • Horizontal: • Data level • Algorithm level • Tool/application level • Vertical: • e. g. between data and algorithm

  31. „ACID for the eHumanities“: Diversity

  32. „ACID for the eHumanities“: (Node) Diversity Understand your data: Understand the re-used text chunks. (a knowledge thing)

  33. „ACID for the eHumanities“: (Relation) Diversity Understand your data: Understand how text is re-used in your data. (an experience thing)

  34. „ACID for the eHumanities“: Diversity - 6 levels of text re-use Text re-use is about unsupervised quotation detection in textual data. - Level 1: Pre-processing (Cleaned and prepared data) - Level 2: Featuring (Digital fingerprint of a re-use unit) - Level 3: Feature selection (Signature of a digital fingerprint) - Level 4: Linking (Match of re-use units that have at least one feature in common) - Level 5: Scoring (Weighting of linked re-use units) - Level 6: Post-processing (e. g. post selection or views that depend on research questions) Implemented in TRACER (http://etraces.e-humanities.net/TRACER): - Tool available in 2013 - Teaching courses (full week) are planned for 2013 - More than one million permutations of implementations of the 6 levels possible (05/2012)
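The six levels listed on slide 34 can be sketched as a small pipeline. This is a minimal illustration under stated assumptions and not TRACER's implementation: word bigrams as features, a document-frequency cut-off for feature selection, Jaccard overlap for scoring, a score threshold for post-processing, and all function names and example sentences are hypothetical choices.

```python
import re
from collections import defaultdict
from itertools import combinations


def preprocess(text):
    """Level 1: lower-case the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def featurize(tokens, n=2):
    """Level 2: the fingerprint of a re-use unit is its set of word n-grams."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def select(features, unit_freq, max_freq):
    """Level 3: keep only features that occur in at most `max_freq` units."""
    return {f for f in features if unit_freq[f] <= max_freq}


def link(signatures):
    """Level 4: pair all units that have at least one feature in common."""
    index = defaultdict(set)
    for unit_id, signature in signatures.items():
        for feature in signature:
            index[feature].add(unit_id)
    pairs = set()
    for unit_ids in index.values():
        pairs.update(combinations(sorted(unit_ids), 2))
    return pairs


def score(pair, signatures):
    """Level 5: weight a linked pair by the Jaccard overlap of its signatures."""
    a, b = signatures[pair[0]], signatures[pair[1]]
    return len(a & b) / len(a | b)


def detect_reuse(units, max_freq=2, min_score=0.2):
    fingerprints = {uid: featurize(preprocess(t)) for uid, t in units.items()}
    unit_freq = defaultdict(int)
    for fingerprint in fingerprints.values():
        for feature in fingerprint:
            unit_freq[feature] += 1
    signatures = {
        uid: select(fp, unit_freq, max_freq) for uid, fp in fingerprints.items()
    }
    scored = {pair: score(pair, signatures) for pair in link(signatures)}
    # Level 6: post-selection, here a simple score threshold.
    return {pair: s for pair, s in scored.items() if s >= min_score}


units = {
    "u1": "To be, or not to be, that is the question.",
    "u2": "He asked: to be or not to be?",
    "u3": "The weather is fine today.",
}
result = detect_reuse(units)
print(result)
```

Each level is a separate function, so any one of them can be swapped for a different strategy without touching the others. That kind of substitution is presumably how a large number of level combinations, as mentioned on the slide, can arise.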

  35. „ACID for the eHumanities“: Acceptance

  36. Interdisciplinary collaborations: The problem!

  37. Computer Scientists: Change your view to understand humanists. How can you gain the acceptance of humanists if text mining is a black box that cannot be looked into?

  38. What we need! Transparency: How to provide user-friendly insights into complex mining techniques and machine learning?

  39. Jumping into the mining process: Level 0 – Initial request

  40. Jumping into the mining process: Level 1 - Preprocessing

  41. Jumping into the mining process: Level 2 - Featuring

  42. Jumping into the mining process: Level 3 - Selection

  43. Jumping into the mining process: Level 4 - Linking

  44. Jumping into the mining process: Level 5 - Scoring

  45. „ACID for the eHumanities“: Complexity

  46. „ACID for the eHumanities“: Complexity I • Archetypus detection means identifying the origin of a thought or a chunk of text (or at least the earliest occurrence). • Sentiment (acceptance) detection means determining whether a text passage is re-used in a „positive“ or „negative“ way An example: • German: „Gleich und gleich gesellt sich gern.“ • English: „Like will to like.“ „Birds of a feather flock together.“ („to bring like and like together“) Question: How would/do you use this phrase regarding sentiments in your daily life?

  47. „ACID for the eHumanities“: Complexity II Hom. Od. 17 215-219: As he saw them, he spoke and addressed them, and reviled them in terrible and unseemly words, and stirred the heart of Odysseus: “Lo, now, in very truth the vile leads the vile. As ever, the god is bringing like and like together. Whither, pray, art thou leading this filthy wretch, thou miserable swineherd, ...

  48. „ACID for the eHumanities“: Complexity III • German phrase: „jemandem aufs Dach steigen“ • English (literally translated): „to climb onto someone's roof“ • English (semantically translated): „to put someone down“, „to tell someone off“ • Understanding the example: • Goes back to a German tradition between the 7th and 12th century • Young men climbed onto the roofs of those who did not follow the rules of the community in order to remove them • Happened especially during (German) carnival and on Shrove Tuesday • There was no legal rule about it ... • ... in the early Middle Ages, however, this became a fundamental part of early adaptations of constitutions

  49. „ACID for the eHumanities“: Complexity III The home is inviolable. Article 13 of the current German constitution Focus here: the task of tracing how constitutions evolve in different societies.

  50. „ACID for the eHumanities“: Complexity IV Article 13: The home is inviolable. vs. the judgement on online surveillance by federal institutions in the context of terrorism ... The interest protected by this fundamental right is the spatial sphere in which private life unfolds [...]. In addition to private homes, business and office premises also fall within the scope of protection of Art. 13 GG [...]. The protection of the fundamental right is not limited to warding off physical intrusion into the home. Measures by which state authorities use special aids to gain insight into events inside the home that are withdrawn from natural perception from outside the protected area must also be regarded as an interference with Art. 13 GG. These include not only acoustic or optical surveillance of living space [...], but also, for example, the measurement of electromagnetic emissions by which the use of an information technology system in the home can be monitored. This can also concern a system that works offline. ... Decision about online surveillance by the German government (translated from the German original) Source: http://www.bundesverfassungsgericht.de/entscheidungen/rs20080227_1bvr037007.html