Topic Detection and Tracking

Presentation Transcript


  1. Topic Detection and Tracking Presented by CHU Huei-Ming 2004/03/17

  2. Reference • Pattern Recognition in Speech and Language Processing, Chap. 12 "Modeling Topics for Detection and Tracking" • James Allan • University of Massachusetts Amherst • Publisher: CRC Press LLC, published 2003/02 • UMass at TDT 2004 • Margaret Connell, Ao Feng, Giridhar Kumaran, Hema Raghavan, Chirag Shah, James Allan • University of Massachusetts Amherst • TDT 2004 workshop

  3. Topic Detection and Tracking (1/6) • The goal of TDT research is to organize news stories by the events that they describe. • The TDT research program began in 1996 as a collaboration between Carnegie Mellon University, Dragon Systems, the University of Massachusetts and DARPA • To find out how well classic IR technologies addressed TDT, they created a small collection of news stories and identified some topics within them

  4. Topic Detection and Tracking (2/6) • Event • Something that happens at some specific time and place, along with all necessary preconditions and unavoidable consequences • Topic • Captures the larger set of happenings that are related to some triggering event • By forcing the additional events to be directly related, the topic is prevented from spreading out to include too much news

  5. Topic Detection and Tracking (3/6) • TDT Tasks • Segmentation • Break an audio track into discrete stories, each on a single topic • Cluster Detection (Detection) • Place all arriving news stories into groups based on their topics • If no existing group fits, the system must decide whether to create a new topic • Each story is placed in precisely one cluster • Tracking • Starts with a small set of news stories that a user has identified as being on the same topic • The system must monitor the stream of arriving news to find all additional stories on the same topic

  6. Topic Detection and Tracking (4/6) • New Event Detection (first story detection) • Focuses on the cluster creation aspect of cluster detection • Evaluated on its ability to decide when a new topic (event) appears • Link Detection • Determine whether or not two randomly presented stories discuss the same topic • A solution to this task could be used to solve new event detection

  7. Topic Detection and Tracking (5/6) • Corpora • TDT-2: during 2002 it is being augmented with some Arabic news from the same time period • TDT-3: created for the 1999 evaluation; stories from four Arabic sources are being added during 2002

  8. Topic Detection and Tracking (6/6) • Evaluation • P(target) is the prior probability that a story will be on topic • Cx are the user-specified values that reflect the cost associated with each error • P(miss) and P(fa) are the actual system error rates • Within TDT evaluations, Cmiss = 10, Cfa = 1 • P(target) = 1 - P(off-target) = 0.02 (derived from training data)
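The slide lists the ingredients of the measure but not how they combine. As a reconstruction (this is the detection cost function used in the official TDT evaluation plans; it is not written out on the slide itself):

    C_Det = C_miss * P(miss) * P(target) + C_fa * P(fa) * (1 - P(target))

    normalized: C_Det / min(C_miss * P(target), C_fa * (1 - P(target)))

With the values above, the normalizer works out to min(10 * 0.02, 1 * 0.98) = 0.2.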

  9. Basic Topic Model • Vector Space • Represent items (stories or topics) as vectors in a high-dimensional space • The most common comparison function is the cosine of the angle between the two vectors • Language Models • A topic is represented as a probability distribution over words • The initial probability estimates come from the maximum likelihood estimate based on the document • Uses of the topic model • See how likely it is that a particular story could be generated by the model • Compare two models directly: symmetric version of Kullback-Leibler divergence
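A minimal sketch of the two representations above, assuming stories arrive as token lists; the raw-count weighting is a simplification, since the chapter does not fix a particular term-weighting scheme:

    import math
    from collections import Counter

    def cosine(tokens_a, tokens_b):
        """Cosine of the angle between two bag-of-words vectors (raw counts)."""
        va, vb = Counter(tokens_a), Counter(tokens_b)
        dot = sum(va[w] * vb[w] for w in va if w in vb)
        norm = math.sqrt(sum(c * c for c in va.values())) * \
               math.sqrt(sum(c * c for c in vb.values()))
        return dot / norm if norm else 0.0

    def unigram_model(tokens):
        """Maximum-likelihood unigram topic model: P(w | topic) from raw counts."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}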

  10. Implementing the Models (1/3) • Named Entities • News is usually about people, so it seems reasonable that their names could be treated specially • Treat the named entities as a separate part of the model and then merge the parts • Boost the weight of any words in the stories that come from names, giving them a larger contribution to the similarity when the names are in common • Improves the results slightly, but no strong gains so far

  11. Implementing the Models (2/3) • Document Expansion • In the segmentation task, a possible segmentation boundary could be checked by comparing the models generated by the text on either side • The text could be used as a query to retrieve a few dozen related stories, and then the most frequently occurring words from those stories could be used for the comparison • Relevance models result in substantial improvements in the link detection task
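A rough sketch of that expansion step; retrieve() is a hypothetical retrieval helper (not something named in the chapter) that returns the token lists of the top-k related stories:

    from collections import Counter

    def expand(text_tokens, retrieve, k=30, top_n=50):
        """Use the text as a query, pull back a few dozen related stories,
        and keep the most frequent terms from them for the comparison."""
        related = retrieve(text_tokens, k)        # hypothetical retrieval call
        pooled = Counter(t for doc in related for t in doc)
        return [term for term, _ in pooled.most_common(top_n)]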

  12. Implementing the Models (3/3) • Time decay • The likelihood that two stories discuss the same topic diminishes as the stories are further separated in time • In a vector space model, the cosine similarity function can be changed so that it includes a time decay
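One plausible way to fold such a decay into the cosine score; the exponential form and the half-life value are assumptions, since the slide only says the similarity function can be changed to include a decay:

    def decayed_similarity(cosine_score, days_apart, half_life_days=30.0):
        """Scale a cosine score down as the two stories get further apart in time.
        The exponential decay and the 30-day half-life are assumed, not from the source."""
        return cosine_score * 0.5 ** (days_apart / half_life_days)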

  13. Comparing Models (1/3) • Nearest Neighbors • In the vector space model, a topic might be represented as a single vector • To determine whether or not a story is on any of the existing topics, we consider the distance between the story's vector and the closest topic vector • If it falls outside the specified distance, the story is likely to be the seed of a new topic and a new vector can be formed
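A sketch of that nearest-neighbor decision, reusing the cosine() sketch from earlier; the 0.3 threshold is borrowed from the UMass baseline described later in these slides and is otherwise an assumption:

    def assign_or_seed(story_vec, topic_vecs, similarity, threshold=0.3):
        """Return the index of the closest existing topic, or None if the story
        should seed a new topic (its best similarity falls below the threshold)."""
        if not topic_vecs:
            return None
        best = max(range(len(topic_vecs)),
                   key=lambda i: similarity(story_vec, topic_vecs[i]))
        return best if similarity(story_vec, topic_vecs[best]) >= threshold else None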

  14. Comparing Models (2/3) • Decision Trees • The best place for decision trees within TDT may be the segmentation task • There are numerous training instances (hand-segmented stories) • Finding features that are indicative of a story boundary is possible and achieves good quality

  15. Comparing Models (3/3) • Model-to-Model • Direct comparison of statistical language models that represent topics • Kullback-Leibler divergence • To finesse the measure, calculate it both ways and add the two together • One approach that has been used penalizes the comparison if the models are too much like background news
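A minimal sketch of the two-way (symmetric) Kullback-Leibler comparison; the models are assumed to be dictionaries over a shared vocabulary with no zero probabilities on either side (in practice this means smoothing against a background model first, which is not shown here):

    import math

    def kl_divergence(p, q):
        """D(p || q); assumes q[w] > 0 wherever p[w] > 0."""
        return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

    def symmetric_kl(p, q):
        """Calculate the divergence both ways and add them together."""
        return kl_divergence(p, q) + kl_divergence(q, p)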

  16. Miscellaneous Issues (1/3) • Deferral • All of the tasks are envisioned as "on-line" tasks • The decision about a story is expected before the next story is presented • In fact, TDT provides a moderate amount of look-ahead for the tasks • First, stories are always presented to the system grouped into "files" that correspond to about a half hour of news • Second, the formal TDT evaluation incorporates a notion of deferral that allows a system to explore the advantage of deferring decisions until several files have passed

  17. Miscellaneous Issues (2/3) • Multi-modal Issues • The sources TDT systems must deal with are either written text (newswire) or read text (audio) • Speech recognizers make numerous mistakes, inserting, deleting, and even completely transforming words into other words • The key difference between the two modes is score normalization • For a pair of stories drawn from different sources the score distribution is different, so for the scores to be comparable a system needs to normalize depending on those modes

  18. Miscellaneous Issues (3/3) • Multi-lingual Issues • The TDT research program has a strong interest in evaluating the tasks across multiple languages • From 1999 to 2001, sites were required to handle English and Chinese news stories • In 2002, sites will be incorporating Arabic as a third language

  19. Using TDT Interactively (1/2) • Demonstrations • Lighthouse is a prototype system that visually portrays inter-document similarities to help the user find relevant material more quickly

  20. Using TDT Interactively (2/2) • Timelines • Using a timeline to show not only what the topics are, but how they occur in time • Using a χ² measure to determine whether or not a feature is occurring on a given day in an unusual way
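A sketch of one way to apply that χ² measure, computed by hand from a 2x2 contingency table (feature vs. all other words, this day vs. all other days); the exact table layout is an assumption:

    def chi_square_day(feat_today, other_today, feat_rest, other_rest):
        """2x2 chi-square statistic; a large value suggests the feature's
        frequency on this day is unusual relative to the rest of the corpus."""
        a, b, c, d = feat_today, other_today, feat_rest, other_rest
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        return n * (a * d - b * c) ** 2 / denom if denom else 0.0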

  21. UMass at TDT 2004 • Hierarchical Topic Detection • Topic Tracking • New Event Detection • Link Detection

  22. Hierarchical Topic Detection - Model Description (1/8) • This task replaces Topic Detection in previous TDT evaluations • Uses the vector space model as the baseline • Bounded clustering reduces time complexity, with some simple parameter tuning • Since stories in the same event tend to be close in time, we only need to compare a story to its "local" stories instead of the whole collection • Two steps • Bounded 1-NN for event formation • Bounded agglomerative clustering for building the hierarchy

  23. Hierarchical Topic Detection - Model Description (2/8) • Bounded 1-NN for event formation • All stories in the same original language and from the same source are taken out and time-ordered • Stories are processed one by one, and each incoming story is compared to a certain number of stories (100 for the baseline) before it • If the similarity between the current story and the most similar previous story is larger than a given threshold (0.3 for the baseline), the current story is assigned to the event that the most similar previous story belongs to; otherwise, a new event is created • There is a list of events for each source/language class • The events within each class are sorted by time according to the time stamp of their first story
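A sketch of that bounded 1-NN pass over one source/language stream, reusing the cosine() sketch from earlier; the windowed comparison and the baseline values (100 prior stories, threshold 0.3) follow the description above:

    def bounded_1nn(stories, similarity, nstory=100, threshold=0.3):
        """stories: time-ordered story vectors from a single source/language.
        Returns one event id per story; a story joins the event of its most
        similar story in the preceding window, or seeds a new event."""
        event_of = []
        next_event = 0
        for i, story in enumerate(stories):
            window = range(max(0, i - nstory), i)
            best = max(window, key=lambda j: similarity(story, stories[j]), default=None)
            if best is not None and similarity(story, stories[best]) > threshold:
                event_of.append(event_of[best])
            else:
                event_of.append(next_event)
                next_event += 1
        return event_of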

  24. Hierarchical Topic Detection - Model Description (3/8) • Bounded 1-NN for event formation • [diagram: per-source story streams S1, S2, S3 for Language A and S1, S2 for Language B]

  25. Hierarchical Topic Detection - Model Description (4/8) • Each source is segmented into several parts, and the parts are sorted by time according to the time stamp of their first story • [diagram: sorted event list]

  26. Hierarchical Topic Detection - Model Description (5/8) • Bounded agglomerative clustering for building the hierarchy • Take a certain number of events (the number is called WSIZE; the default is 120) from the sorted event list • At each iteration, find the closest event pair and combine the later event into the earlier one

  27. Hierarchical Topic Detection - Model Description (6/8) • Each iteration finds the closest event pair and combines the later event into the earlier one • [diagram: iterations I1, I2, I3, ..., Ir-1, Ir]

  28. Hierarchical Topic Detection - Model Description (7/8) • Bounded agglomerative clustering for building the hierarchy • This continues for (BRANCH-1)·WSIZE/BRANCH iterations, so the number of clusters left is WSIZE/BRANCH • Take the first half out, bring in WSIZE/2 new events, and continue the agglomerative clustering until WSIZE/BRANCH clusters are left • The optimal value of BRANCH is around 3; BRANCH = 3 is used as the baseline
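A sketch of one window of that agglomerative step (the sliding of the window over the event list is left out); clusters are represented simply as lists of story vectors, and the centroid bookkeeping is omitted for brevity:

    def agglomerate_window(clusters, similarity, wsize=120, branch=3):
        """clusters: time-ordered list of clusters (each a list of story vectors),
        at most wsize long.  Repeatedly merge the closest pair, folding the later
        cluster into the earlier one, until wsize // branch clusters remain."""
        clusters = [list(c) for c in clusters]
        target = max(1, wsize // branch)
        while len(clusters) > target:
            _, i, j = max(((similarity(clusters[a], clusters[b]), a, b)
                           for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                          key=lambda t: t[0])
            clusters[i].extend(clusters.pop(j))   # later event folded into the earlier one
        return clusters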

  29. Hierarchical Topic Detection - Model Description (8/8) • Then all clusters in the same language but from different sources are combined • Finally, clusters from all languages are mixed and clustered until only one cluster is left, which becomes the root • Machine translation is used for Arabic and Mandarin stories to simplify the similarity calculation

  30. Hierarchical Topic Detection - Training (1/4) • Training corpus: TDT4 – newswire and broadcast stories • Testing corpus: TDT5 – newswire only • The newswire stories taken from the TDT4 corpus include NYT, APW, ANN, ALH, AFP, ZBN, XIN; 420,000 stories • [table: TDT-4 Corpus Overview]

  31. Hierarchical Topic Detection - Training (2/4)

  32. Hierarchical Topic Detection - Training (3/4) • Parameters • BRANCH: average branching factor in the bounded agglomerative clustering algorithm • THRESHOLD: used during event formation to decide whether a new event will be created • STOP: within each source, stop when the number of clusters is smaller than the square root of the number of stories • WSIZE: the maximum window size in agglomerative clustering • NSTORY: each story will be compared to at most NSTORY stories before it in the 1-NN event clustering; the idea comes from temporal locality in event threading
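For convenience, the baseline values quoted elsewhere in these slides, gathered in one place (the STOP rule is a per-source condition rather than a single number):

    # Baseline settings quoted on these slides (TDT4 training / TDT5 testing).
    BASELINE_PARAMS = {
        "BRANCH": 3,        # average branching factor in the hierarchy
        "THRESHOLD": 0.3,   # 1-NN similarity needed to join an existing event
        "WSIZE": 120,       # maximum window size in agglomerative clustering
        "NSTORY": 100,      # prior stories each story is compared to in 1-NN
    }
    # STOP: per source, stop when #clusters < sqrt(#stories) for that source.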

  33. Hierarchical Topic Detection - Training (4/4) • Among the clusters very close to the root node, some contain thousands of stories • Both the 1-NN and agglomerative clustering algorithms favor large clusters • The similarity calculation is modified to give smaller clusters more of a chance • Sim(v1,v2) is the similarity of the cluster centroids • |cluster1| is the number of stories in the first cluster • a is a constant that controls how much of an advantage smaller clusters get
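The modified formula itself is not reproduced on the slide; one plausible form, consistent with the quantities it names and offered purely as an illustration, discounts the centroid similarity by the cluster size raised to the power a:

    def size_penalized_sim(sim_centroids, cluster1_size, a=0.1):
        """Assumed form only: divide the centroid similarity Sim(v1, v2) by
        |cluster1|**a so that smaller clusters get more of a chance; the real
        formula and the value of a are not given on the slide."""
        return sim_centroids / (cluster1_size ** a)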

  34. Hierarchical Topic Detection - Results (1/2) • Three runs for each condition: UMASSv1, UMASSv12 and UMASSv19

  35. Hierarchical Topic Detection - Results (2/2) • A small branching factor can reduce both detection cost and travel cost • With a small branching factor, there are more clusters with different granularities • The assumption of temporal locality is useful in event threading; more experiments after the submission show that a larger window size can improve performance

  36. Conclusion • Discussed several of the techniques that systems have used to build or enhance topic models, and listed the merits of many of them • TDT research explores the extent to which IR technology can be used to solve TDT problems
