
Text-Based Topic Segmentation


Presentation Transcript


  1. Text-Based Topic Segmentation • Vaibhav Mallya • EECS 767 (Radev)

  2. Agenda • Definitions • Applications • Hearst’s TextTiling • Probabilistic LSA • Unsupervised Bayes • Discussion

  3. Definitions • Topic Segmentation – Given a single piece of language data, how can we effectively divide it into topical chunks? • For example, a single news story might cover: • The economic situation • A train wreck in Belize • Industrial espionage

  4. Definitions • But what does a topic within a document consist of? • Usually we consider it to have: • An internally consistent subject (nouns, verbs) • Gradual elaboration or exposition of that subject • Content “less related” to adjacent topics

  5. Definitions • “Discourse Model” – How do we expect this text was generated, or what is it trying to get across? • Multiple parties sharing points of view? • A single person positing theories? • A debate? • Some algorithms are designed for specific discourse models, others are more generic • Are results better or worse with one or the other? • How feasible is it to deliver general-purpose algorithms? • At the very least, tokenization strategies must differ (?)

  6. Definitions • Lexical chain – Sequence of related words in text • Somewhat independent of grammatical structure • A good lexical chain captures the “cohesive structure” of the text • John bought a Jag. He loves the car. • Car -> Jag • He -> John

  7. Applications • Applications lie primarily in unstructured dialogue and text • Figuring out how broad-based a news story or article may be • Topic shifts in dialogue (does Google Voice transcription use this?) • Assisting with meeting note transcription

  8. Applications • A lot of topic segmentation is already done by hand and used in search. • Wikipedia, Java: http://www.google.com/search?q=sorting+algorithms

  9. Hearst’s TextTiling • UC Berkeley and Xerox PARC • Early topic segmentation algorithm • Two possible goals • Identify topical units • Label contents meaningfully • Paper focuses on the former – simply identifying unmarked borders

  10. Hearst’s TextTiling • Some prior works model discourse as hierarchical • Topics, sub-topics, sub-sub-topics • Hearst focused on coarse-grained linear model • Hence “tiling”

  11. Hearst’s TextTiling • “The more similar two blocks of text are, the more likely it is the current subtopic continues” • Tokenization • Similarity Determination • Boundary Identification

  12. Hearst’s TextTiling • 1) Tokenization • Basic tokens are “pseudosentences” aka token-sequences • Token-sequences – strings of tokens of length ‘w’ • Stopword list used (frequent words eliminated) • Each (stemmed) token stored in table, along with how frequently it occurs in each token-sequence
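A minimal Python sketch of this tokenization step, assuming a simple regex tokenizer and a tiny illustrative stopword list (the paper uses a full stopword list plus stemming):

```python
import re
from collections import Counter

# Illustrative only; the paper eliminates a much larger set of frequent words.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that"}

def token_sequences(text, w=20):
    """Split text into pseudosentences ("token-sequences") of w tokens each,
    returning a frequency table per sequence. The last sequence may be short."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    return [Counter(tokens[i:i + w]) for i in range(0, len(tokens), w)]
```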

  13. Hearst’s TextTiling • 2) Similarity Determination • Use a sliding window • Compare blocks of token-sequences for similarity • These blocks are the “paragraphs” in this scheme • Block size parameter = k • Blockwise similarity calculated via the cosine measure

  14. Hearst’s TextTiling • Blocks b1 and b2, k token-sequences each • sim(b1, b2) = Σ_t w_{t,b1} · w_{t,b2} / √( (Σ_t w_{t,b1}²) (Σ_t w_{t,b2}²) ) • t ranges over all tokenized terms • w_{t,b1} is the weight assigned to term t in block b1 • Weights = term frequency within the block • High similarity: closer to 1 • Low similarity: closer to 0
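A matching sketch of the cosine measure, assuming blocks are term-frequency tables built by summing k consecutive token-sequence tables from the step above:

```python
import math
from collections import Counter

def cosine_sim(b1: Counter, b2: Counter) -> float:
    """Cosine similarity between two blocks' term-frequency vectors."""
    num = sum(b1[t] * b2[t] for t in b1.keys() & b2.keys())
    den = math.sqrt(sum(v * v for v in b1.values())
                    * sum(v * v for v in b2.values()))
    return num / den if den else 0.0
```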

  15. Hearst’s TextTiling • But this is a sliding window • The first and second blocks span [i−k, i] and [i+1, i+k+1] respectively • We are actually assigning a score to the gap between token-sequences i and i+1 • The resulting score sequence is smoothed with a window size of three

  16. Hearst’s TextTiling • 3) Boundary Identification • Now we can use our sequence of similarity scores • Find “changes” along the curve to calculate “depth scores” • Find every peak p_i • Now find its relative height: h_i = (p_i − p_{i+1}) + (p_i − p_{i−1}) • The “highest” h_i values correspond to boundaries • As described in the paper, some experimentation is necessary; the authors derive a threshold from the distribution of depth scores
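The same computation can be framed around valleys in the similarity curve: climb from each gap to the nearest peak on either side and sum the two drops. A minimal sketch; the cutoff below (mean minus half a standard deviation of the depth scores) is my reading of the paper's thresholding and should be treated as approximate:

```python
import statistics

def depth_scores(sims):
    """Depth of each gap: climb to the nearest peak on each side of the
    gap's similarity score and sum the two drops."""
    depths = []
    for i, s in enumerate(sims):
        left = right = s
        j = i
        while j > 0 and sims[j - 1] >= left:               # climb left peak
            left = sims[j - 1]; j -= 1
        j = i
        while j < len(sims) - 1 and sims[j + 1] >= right:  # climb right peak
            right = sims[j + 1]; j += 1
        depths.append((left - s) + (right - s))
    return depths

def boundaries(sims):
    d = depth_scores(sims)
    cutoff = statistics.mean(d) - statistics.stdev(d) / 2  # assumed threshold
    return [i for i, depth in enumerate(d) if depth > cutoff]
```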

  17. Hearst’s TextTiling • Evaluation criteria • Compare against human judgment of topic segments • This paper uses the Stargazers text, a popular-science article

  18. Hearst’s TextTiling • [figure slide; image not captured in the transcript]

  19. Demo • Implementation example • Python Natural Language Toolkit • Not fully faithful to the original paper, but a good demonstration (it snaps boundaries to existing paragraph breaks)
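A minimal usage sketch of that NLTK implementation; `article.txt` is a hypothetical input file, and the text must contain blank-line paragraph breaks, since the tokenizer snaps boundaries to paragraphs:

```python
import nltk
from nltk.tokenize import TextTilingTokenizer

nltk.download("stopwords")                 # corpus required by the tokenizer

tt = TextTilingTokenizer(w=20, k=10)       # token-sequence size, block size
with open("article.txt") as f:             # hypothetical input file
    segments = tt.tokenize(f.read())

for i, seg in enumerate(segments):
    print(f"--- segment {i} ---")
    print(seg[:120], "...")
```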

  20. Probabilistic LSA • Brants, Chen, Tsochantaridis • PARC, PARC, Brown University • Applies PLSA to the topic segmentation problem, then selects segmentation points based on the similarity values between pairs of adjacent blocks

  21. Probabilistic LSA • Review of Latent Semantic Analysis • Matches synonymous words • Begin with a straight high-dimensional word-count matrix • Apply Singular Value Decomposition • Obtain simpler “semantic space” • Similar terms and documents should be close or even adjacent
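A toy numpy sketch of that pipeline: a made-up term-document count matrix, a truncated SVD, and document comparison in the reduced space:

```python
import numpy as np

# Toy term-document count matrix (rows = terms, columns = documents).
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 1, 2, 2]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = 2                                        # latent dimensions to keep
docs = (np.diag(s[:r]) @ Vt[:r]).T           # documents in "semantic space"

# Cosine similarities between documents in the reduced space.
norms = np.linalg.norm(docs, axis=1)
print(np.round(docs @ docs.T / np.outer(norms, norms), 2))
```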

  22. Probabilistic LSA • Review of Probabilistic Latent Semantic Analysis as described in the paper • The conditional dependence between documents d and words w is modeled through a latent variable z • P(w|z), P(z|d) • z is a kind of class or topic • The joint probability is then P(d, w) = P(d) Σ_z P(z|d) P(w|z) • Then apply Expectation-Maximization to maximize the log-likelihood Σ_d Σ_w n(d, w) log P(d, w), where n(d, w) is the count of word w in document d
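A compact numpy sketch of the EM loop for this model; `N` is the term-document count matrix n(d, w), and the random initialization, fixed iteration count, and lack of tempering/smoothing are simplifications:

```python
import numpy as np

def plsa(N, n_topics, n_iter=50, seed=0):
    """Fit P(z|d) and P(w|z) to the count matrix N (documents x words)."""
    rng = np.random.default_rng(seed)
    D, W = N.shape
    Pz_d = rng.random((D, n_topics)); Pz_d /= Pz_d.sum(1, keepdims=True)
    Pw_z = rng.random((n_topics, W)); Pw_z /= Pw_z.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: P(z|d,w) proportional to P(z|d) P(w|z)
        Pz_dw = Pz_d[:, :, None] * Pw_z[None, :, :]      # shape (D, Z, W)
        Pz_dw /= Pz_dw.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate both factors from expected counts
        C = N[:, None, :] * Pz_dw                        # n(d,w) P(z|d,w)
        Pw_z = C.sum(0); Pw_z /= Pw_z.sum(1, keepdims=True)
        Pz_d = C.sum(2); Pz_d /= Pz_d.sum(1, keepdims=True)
    return Pz_d, Pw_z
```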

  23. Probabilistic LSA • 1) Preprocessing • Tokenize (ignoring stop-words) • Normalize (lower-case) • Stem • Identify sentence boundaries

  24. Probabilistic LSA • 2) Blockify • An elementary block is (in this case) a “real” sentence • Blocks are sequences of consecutive elementary blocks • In actual segmentation, use a sliding window to create blocks • Each block is composed of a constant number h of elementary blocks (sketched below)
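In code, blockification is just a sliding window over sentences; a minimal sketch:

```python
def blockify(sentences, h=3):
    """Join each run of h consecutive elementary blocks (sentences)."""
    return [" ".join(sentences[i:i + h])
            for i in range(len(sentences) - h + 1)]
```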

  25. Probabilistic LSA • 2) Blockify (continued) • Each block is represented by a term vector f(w|b) • Experimentally “good” number of latent classes: • Z ≈ 2 × number of human-assigned topics

  26. Probabilistic LSA • 3) Segmentation • Locations between paragraphs are used as starting points • Folding-in is performed on each block b to compute its distribution • Compute P(z|b), P(w|b) • P(w|b) = estimated distribution of words for each block b = Σ_z P(w|z) P(z|b)
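A hedged sketch of folding-in under this formulation: the trained P(w|z) matrix is held fixed while EM re-estimates only P(z|b) from the block's term counts f(w|b); names and iteration counts are illustrative:

```python
import numpy as np

def fold_in(f_wb, Pw_z, n_iter=30):
    """Given block term counts f_wb (length W) and fixed Pw_z (Z x W),
    estimate P(z|b) and return it with P(w|b) = sum_z P(w|z) P(z|b)."""
    Z = Pw_z.shape[0]
    Pz_b = np.full(Z, 1.0 / Z)                    # uniform initialization
    for _ in range(n_iter):
        Pz_bw = Pz_b[:, None] * Pw_z              # prop. to P(z|b,w), (Z, W)
        Pz_bw /= Pz_bw.sum(0, keepdims=True) + 1e-12
        Pz_b = (Pz_bw * f_wb[None, :]).sum(1)     # expected counts per topic
        Pz_b /= Pz_b.sum()
    return Pz_b, Pz_b @ Pw_z                      # P(z|b), P(w|b)
```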

  27. Probabilistic LSA • 3) Segmentation (continued) • This is done for all words w • Calculate blockwise similarity, find “dips” (local minima) • Calculate the relative size of each dip (equation in paper) • A priori knowledge of the number of segments N lets us terminate after finding N dips • Otherwise termination is determined by a threshold (the paper provides a value of 1.2)
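A sketch of dip selection; the slides do not reproduce the paper's exact relative-dip equation, so the ratio below (neighboring values relative to the dip) is an assumed stand-in, kept only to show the control flow with N known or a threshold otherwise:

```python
def select_dips(sims, n_dips=None, threshold=1.2):
    """Rank local minima of the blockwise similarity curve and return the
    positions chosen as segment boundaries."""
    scored = []
    for i in range(1, len(sims) - 1):
        if sims[i] <= sims[i - 1] and sims[i] <= sims[i + 1]:  # local minimum
            rel = (sims[i - 1] + sims[i + 1]) / (2 * sims[i] + 1e-12)
            scored.append((rel, i))
    scored.sort(reverse=True)
    if n_dips is not None:                     # number of segments known
        return sorted(i for _, i in scored[:n_dips])
    return sorted(i for rel, i in scored if rel >= threshold)
```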

  28. Probabilistic LSA • Evaluation • The authors choose a fixed training corpus and a fixed test corpus • They use word error rate and sentence error rate as metrics • WER: probability that a randomly chosen pair of words a distance of k_w words apart is erroneously classified, i.e., the hypothesis and reference disagree on whether the two words fall in the same segment • SER: same as above, but for sentences • Comparison against some other algorithms (including TextTiling) is done as well

  29.–32. Probabilistic LSA • [results figures and tables; images not captured in the transcript]

  33. Unsupervised Bayes • Jacob Eisenstein and Regina Barzilay, CSAIL, MIT • Relatively recent paper (2008)

  34. Unsupervised Bayes • As we’ve seen so far, text has been treated as raw data • “Lexical cohesion” has thus far been the only measure of topic structure • No semantic information is explicitly retained or utilized • For the purposes of topic segmentation, there is one obvious semantic element that could be incorporated:

  35. Unsupervised Bayes • Transition Words and Cue Phrases • “Now”, “Then”, “Next” • “As previously discussed”, “On a related note” • These are strong, explicit signals that a topic is probably about to change

  36. Unsupervised Bayes • This method “situates lexical cohesion within a Bayesian framework” • Still uses a linear discourse structure • Words are drawn from a generative language model • Known cue phrases are used as a guide

  37. Unsupervised Bayes • [lots of math…]

  38. Unsupervised Bayes • Evaluation functions: • WindowDiff (Pevzner and Hearst, 2002) • P_k (Beeferman et al., 1999) • Both pass a “window” through the document • Assess whether sentences at the “edges” of the window are segmented consistently w.r.t. each other • WindowDiff is slightly “stricter”
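A minimal sketch of P_k, assuming segmentations are given as per-sentence segment labels; NLTK also ships reference implementations (nltk.metrics.segmentation.pk and windowdiff):

```python
def p_k(ref, hyp, k=None):
    """Probability that a window of width k straddles a boundary in one
    segmentation but not the other. ref/hyp are per-sentence segment ids."""
    if k is None:
        # Common convention: half the average reference segment length.
        k = max(1, round(len(ref) / (2 * len(set(ref)))))
    errors = sum((ref[i] == ref[i + k]) != (hyp[i] == hyp[i + k])
                 for i in range(len(ref) - k))
    return errors / (len(ref) - k)

# e.g. p_k([0,0,0,1,1,2,2,2], [0,0,1,1,1,2,2,2])
```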

  39. Unsupervised Bayes • [results figure; image not captured in the transcript]

  40. Unsupervised Bayes • Results • Cue phrases are useful, but their overall effectiveness is dataset-dependent • Writers do not always use cue phrases consistently • Cue phrases may be more useful for speech/meeting transcription and analysis than for narration or literature

  41. Discussion • Potential future or unexplored applications? • Analogues possible in other kinds of text? • Could it be used to assign complexity scores to literature? • Maybe incorporate into Flesch-Kincaid? • The focus is on complete articles, stories, etc. • What about streaming or live news?
