Topic Detection and Tracking

Presentation Transcript


  1. Topic Detection and Tracking Presented by CHU Huei-Ming 2004/03/17

  2. Reference • Pattern Recognition in Speech and Language Processing, Chap. 12 "Modeling Topics for Detection and Tracking" • James Allan • University of Massachusetts Amherst • Publisher: CRC Press LLC, published 2003/02 • UMass at TDT 2004 • Margaret Connell, Ao Feng, Giridhar Kumaran, Hema Raghavan, Chirag Shah, James Allan • University of Massachusetts Amherst • TDT 2004 workshop

  3. Topic Detection and Tracking (1/6) • The goal of TDT research is to organize news stories by the events that they describe. • The TDT research program began in 1996 as a collaboration between Carnegie Mellon University, Dragon Systems, the University of Massachusetts and DARPA • To find out how well classic IR technologies addressed TDT, they created a small collection of news stories and identified some topics within them

  4. Topic Detection and Tracking (2/6) • Event • Something that happens at some specific time and place, along with all necessary preconditions and unavoidable consequences • Topic • Captures the larger set of happenings that are related to some triggering event • By forcing the additional events to be directly related, the topic is prevented from spreading out to include too much news

  5. Topic Detection and Tracking (3/6) • TDT Tasks • Segmentation • Break an audio track into discrete stories, each on a single topic • Cluster Detection (Detection) • Place all arriving news stories into groups based on their topics • If no existing group fits, the system must decide whether to create a new topic • Each story is placed in precisely one cluster • Tracking • Starts with a small set of news stories that a user has identified as being on the same topic • The system must monitor the stream of arriving news to find all additional stories on the same topic

  6. Topic Detection and Tracking (4/6) • New Event Detection (first story detection) • Focuses on the cluster creation aspect of cluster detection • Evaluated on its ability to decide when a new topic (event) appears • Link Detection • Determine whether or not two randomly presented stories discuss the same topic • A solution to this task could be used to solve new event detection

  7. Topic Detection and Tracking (5/6) • Corpora • TDT-2: during 2002 it is being augmented with some Arabic news from the same time period • TDT-3: created for the 1999 evaluation; stories from four Arabic sources are being added during 2002

  8. Topic Detection and Tracking (6/6) • Evaluation • P(target) is the prior probability that a story will be on topic • Cx are the user-specified values that reflect the cost associated with each error • P(miss) and P(fa) are the actual system error rates • Within TDT evaluations, Cmiss = 10, Cfa = 1 • P(target) = 1 - P(off-target) = 0.02 (derived from training data)
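The slide lists the ingredients of the measure but not how they combine. As a reconstruction (this is the detection cost function used in the official TDT evaluation plans; it is not written out on the slide itself):

    C_Det = C_miss * P(miss) * P(target) + C_fa * P(fa) * (1 - P(target))

    normalized: C_Det / min(C_miss * P(target), C_fa * (1 - P(target)))

With the values above, the normalizer works out to min(10 * 0.02, 1 * 0.98) = 0.2.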

  9. Basic Topic Model • Vector Space • Represent items (stories or topics) as vectors in a high-dimensional space • The most common comparison function is the cosine of the angle between the two vectors • Language Models • A topic is represented as a probability distribution over words • The initial probability estimates come from the maximum likelihood estimate based on the document • Uses of the topic model • See how likely it is that a particular story could be generated by the model • Compare two models directly: symmetric version of Kullback-Leibler divergence
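A minimal sketch of the two representations above, assuming stories arrive as token lists; the raw-count weighting is a simplification, since the chapter does not fix a particular term-weighting scheme:

    import math
    from collections import Counter

    def cosine(tokens_a, tokens_b):
        """Cosine of the angle between two bag-of-words vectors (raw counts)."""
        va, vb = Counter(tokens_a), Counter(tokens_b)
        dot = sum(va[w] * vb[w] for w in va if w in vb)
        norm = math.sqrt(sum(c * c for c in va.values())) * \
               math.sqrt(sum(c * c for c in vb.values()))
        return dot / norm if norm else 0.0

    def unigram_model(tokens):
        """Maximum-likelihood unigram topic model: P(w | topic) from raw counts."""
        counts = Counter(tokens)
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}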

  10. Implementing the Models (1/3) • Named Entities • News is usually about people, so it seems reasonable that their names could be treated specially • Treat the named entities as a separate part of the model and then merge the parts • Boost the weight of any words in the stories that come from names, giving them a larger contribution to the similarity when the names are in common • Improves the results slightly, but no strong gains so far

  11. Implementing the Models (2/3) • Document Expansion • In the segmentation task, a possible segmentation boundary could be checked by comparing the models generated by the text on either side • The text could be used as a query to retrieve a few dozen related stories, and then the most frequently occurring words from those stories could be used for the comparison • Relevance models result in substantial improvements in the link detection task
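A rough sketch of that expansion step; retrieve() is a hypothetical retrieval helper (not something named in the chapter) that returns the token lists of the top-k related stories:

    from collections import Counter

    def expand(text_tokens, retrieve, k=30, top_n=50):
        """Use the text as a query, pull back a few dozen related stories,
        and keep the most frequent terms from them for the comparison."""
        related = retrieve(text_tokens, k)        # hypothetical retrieval call
        pooled = Counter(t for doc in related for t in doc)
        return [term for term, _ in pooled.most_common(top_n)]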

  12. Implementing the Models (3/3) • Time decay • The likelihood that two stories discuss the same topic diminishes as the stories are further separated in time • In a vector space model, the cosine similarity function can be changed so that it includes a time decay
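One plausible way to fold such a decay into the cosine score; the exponential form and the half-life value are assumptions, since the slide only says the similarity function can be changed to include a decay:

    def decayed_similarity(cosine_score, days_apart, half_life_days=30.0):
        """Scale a cosine score down as the two stories get further apart in time.
        The exponential decay and the 30-day half-life are assumed, not from the source."""
        return cosine_score * 0.5 ** (days_apart / half_life_days)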

  13. Comparing Models (1/3) • Nearest Neighbors • In the vector space model, a topic might be represented as a single vector • To determine whether or not a story is on any of the existing topics, we consider the distance between the story's vector and the closest topic vector • If it falls outside the specified distance, the story is likely to be the seed of a new topic and a new vector can be formed
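A sketch of that nearest-neighbor decision, reusing the cosine() sketch from earlier; the 0.3 threshold is borrowed from the UMass baseline described later in these slides and is otherwise an assumption:

    def assign_or_seed(story_vec, topic_vecs, similarity, threshold=0.3):
        """Return the index of the closest existing topic, or None if the story
        should seed a new topic (its best similarity falls below the threshold)."""
        if not topic_vecs:
            return None
        best = max(range(len(topic_vecs)),
                   key=lambda i: similarity(story_vec, topic_vecs[i]))
        return best if similarity(story_vec, topic_vecs[best]) >= threshold else None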

  14. Comparing Models (2/3) • Decision Trees • The best place for decision trees within TDT may be the segmentation task • There are numerous training instances (hand-segmented stories) • Finding features that are indicative of a story boundary is possible and achieves good quality

  15. Comparing Models (3/3) • Model-to-Model • Direct comparison of statistical language models that represent topics • Kullback-Leibler divergence • To finesse the measure, calculate it both ways and add the two together • One approach that has been used penalizes the comparison if the models are too much like background news
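A minimal sketch of the two-way (symmetric) Kullback-Leibler comparison; the models are assumed to be dictionaries over a shared vocabulary with no zero probabilities on either side (in practice this means smoothing against a background model first, which is not shown here):

    import math

    def kl_divergence(p, q):
        """D(p || q); assumes q[w] > 0 wherever p[w] > 0."""
        return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

    def symmetric_kl(p, q):
        """Calculate the divergence both ways and add them together."""
        return kl_divergence(p, q) + kl_divergence(q, p)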

  16. Miscellaneous Issues (1/3) • Deferral • All of the tasks are envisioned as "on-line" tasks • The decision about a story is expected before the next story is presented • In fact, TDT provides a moderate amount of look-ahead for the tasks • First, stories are always presented to the system grouped into "files" that correspond to about a half hour of news • Second, the formal TDT evaluation incorporates a notion of deferral that allows a system to explore the advantage of deferring decisions until several files have passed

  17. Miscellaneous Issues (2/3) • Multi-modal Issues • The sources TDT systems must deal with are either written text (newswire) or read text (audio) • Speech recognizers make numerous mistakes, inserting, deleting, and even completely transforming words into other words • The key difference between the two modes is score normalization • For a pair of stories drawn from different sources the score distribution is different, so for the scores to be comparable a system needs to normalize depending on those modes

  18. Miscellaneous Issues (3/3) • Multi-lingual Issues • The TDT research program has a strong interest in evaluating the tasks across multiple languages • From 1999 to 2001, sites were required to handle English and Chinese news stories • In 2002, sites will be incorporating Arabic as a third language

  19. Using TDT Interactively (1/2) • Demonstrations • Lighthouse is a prototype system that visually portrays inter-document similarities to help the user find relevant material more quickly

  20. Using TDT Interactively (2/2) • Timelines • Using a timeline to show not only what the topics are, but how they occur in time • Using a χ² measure to determine whether or not a feature is occurring on a given day in an unusual way
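A sketch of one way to apply that χ² measure, computed by hand from a 2x2 contingency table (feature vs. all other words, this day vs. all other days); the exact table layout is an assumption:

    def chi_square_day(feat_today, other_today, feat_rest, other_rest):
        """2x2 chi-square statistic; a large value suggests the feature's
        frequency on this day is unusual relative to the rest of the corpus."""
        a, b, c, d = feat_today, other_today, feat_rest, other_rest
        n = a + b + c + d
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        return n * (a * d - b * c) ** 2 / denom if denom else 0.0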

  21. UMass at TDT 2004 • Hierarchical Topic Detection • Topic Tracking • New Event Detection • Link Detection

  22. Hierarchical Topic Detection - Model Description (1/8) • This task replaces Topic Detection in previous TDT evaluations • Uses the vector space model as the baseline • Bounded clustering reduces time complexity, with some simple parameter tuning • Since stories in the same event tend to be close in time, we only need to compare a story to its "local" stories instead of the whole collection • Two steps • Bounded 1-NN for event formation • Bounded agglomerative clustering for building the hierarchy

  23. Hierarchical Topic Detection - Model Description (2/8) • Bounded 1-NN for event formation • All stories in the same original language and from the same source are taken out and time-ordered • Stories are processed one by one, and each incoming story is compared to a certain number of stories (100 for the baseline) before it • If the similarity between the current story and the most similar previous story is larger than a given threshold (0.3 for the baseline), the current story is assigned to the event that the most similar previous story belongs to; otherwise, a new event is created • There is a list of events for each source/language class • The events within each class are sorted by time according to the time stamp of their first story
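A sketch of that bounded 1-NN pass over one source/language stream, reusing the cosine() sketch from earlier; the windowed comparison and the baseline values (100 prior stories, threshold 0.3) follow the description above:

    def bounded_1nn(stories, similarity, nstory=100, threshold=0.3):
        """stories: time-ordered story vectors from a single source/language.
        Returns one event id per story; a story joins the event of its most
        similar story in the preceding window, or seeds a new event."""
        event_of = []
        next_event = 0
        for i, story in enumerate(stories):
            window = range(max(0, i - nstory), i)
            best = max(window, key=lambda j: similarity(story, stories[j]), default=None)
            if best is not None and similarity(story, stories[best]) > threshold:
                event_of.append(event_of[best])
            else:
                event_of.append(next_event)
                next_event += 1
        return event_of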

  24. Hierarchical Topic Detection - Model Description (3/8) • Bounded 1-NN for event formation • [diagram: per-source story streams S1, S2, S3 for Language A and S1, S2 for Language B]

  25. Hierarchical Topic Detection - Model Description (4/8) • Each source is segmented into several parts, and the parts are sorted by time according to the time stamp of their first story • [diagram: sorted event list]

  26. Hierarchical Topic Detection - Model Description (5/8) • Bounded agglomerative clustering for building the hierarchy • Take a certain number of events (the number is called WSIZE; the default is 120) from the sorted event list • At each iteration, find the closest event pair and combine the later event into the earlier one

  27. Hierarchical Topic Detection - Model Description (6/8) • Each iteration finds the closest event pair and combines the later event into the earlier one • [diagram: iterations I1, I2, I3, ..., Ir-1, Ir]

  28. Hierarchical Topic Detection - Model Description (7/8) • Bounded agglomerative clustering for building the hierarchy • This continues for (BRANCH-1)·WSIZE/BRANCH iterations, so the number of clusters left is WSIZE/BRANCH • Take the first half out, bring in WSIZE/2 new events, and continue the agglomerative clustering until WSIZE/BRANCH clusters are left • The optimal value of BRANCH is around 3; BRANCH = 3 is used as the baseline
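A sketch of one window of that agglomerative step (the sliding of the window over the event list is left out); clusters are represented simply as lists of story vectors, and the centroid bookkeeping is omitted for brevity:

    def agglomerate_window(clusters, similarity, wsize=120, branch=3):
        """clusters: time-ordered list of clusters (each a list of story vectors),
        at most wsize long.  Repeatedly merge the closest pair, folding the later
        cluster into the earlier one, until wsize // branch clusters remain."""
        clusters = [list(c) for c in clusters]
        target = max(1, wsize // branch)
        while len(clusters) > target:
            _, i, j = max(((similarity(clusters[a], clusters[b]), a, b)
                           for a in range(len(clusters))
                           for b in range(a + 1, len(clusters))),
                          key=lambda t: t[0])
            clusters[i].extend(clusters.pop(j))   # later event folded into the earlier one
        return clusters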

  29. Hierarchical Topic Detection - Model Description (8/8) • Then all clusters in the same language but from different sources are combined • Finally, clusters from all languages are mixed and clustered until only one cluster is left, which becomes the root • Machine translation is used for Arabic and Mandarin stories to simplify the similarity calculation

  30. Hierarchical Topic Detection - Training (1/4) • Training corpus: TDT4 – newswire and broadcast stories • Testing corpus: TDT5 – newswire only • The newswire stories taken from the TDT4 corpus include NYT, APW, ANN, ALH, AFP, ZBN, XIN; 420,000 stories • [table: TDT-4 Corpus Overview]

  31. Hierarchical Topic Detection - Training (2/4)

  32. Hierarchical Topic Detection - Training (3/4) • Parameters • BRANCH: average branching factor in the bounded agglomerative clustering algorithm • THRESHOLD: used during event formation to decide whether a new event will be created • STOP: within each source, stop when the number of clusters is smaller than the square root of the number of stories • WSIZE: the maximum window size in agglomerative clustering • NSTORY: each story will be compared to at most NSTORY stories before it in the 1-NN event clustering; the idea comes from temporal locality in event threading
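For convenience, the baseline values quoted elsewhere in these slides, gathered in one place (the STOP rule is a per-source condition rather than a single number):

    # Baseline settings quoted on these slides (TDT4 training / TDT5 testing).
    BASELINE_PARAMS = {
        "BRANCH": 3,        # average branching factor in the hierarchy
        "THRESHOLD": 0.3,   # 1-NN similarity needed to join an existing event
        "WSIZE": 120,       # maximum window size in agglomerative clustering
        "NSTORY": 100,      # prior stories each story is compared to in 1-NN
    }
    # STOP: per source, stop when #clusters < sqrt(#stories) for that source.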

  33. Hierarchical Topic Detection - Training (4/4) • Among the clusters very close to the root node, some contain thousands of stories • Both the 1-NN and agglomerative clustering algorithms favor large clusters • The similarity calculation is modified to give smaller clusters more of a chance • Sim(v1,v2) is the similarity of the cluster centroids • |cluster1| is the number of stories in the first cluster • a is a constant that controls how much of an advantage smaller clusters get
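The modified formula itself is not reproduced on the slide; one plausible form, consistent with the quantities it names and offered purely as an illustration, discounts the centroid similarity by the cluster size raised to the power a:

    def size_penalized_sim(sim_centroids, cluster1_size, a=0.1):
        """Assumed form only: divide the centroid similarity Sim(v1, v2) by
        |cluster1|**a so that smaller clusters get more of a chance; the real
        formula and the value of a are not given on the slide."""
        return sim_centroids / (cluster1_size ** a)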

  34. Hierarchical Topic Detection - Results (1/2) • Three runs for each condition: UMASSv1, UMASSv12 and UMASSv19

  35. Hierarchical Topic Detection - Results (2/2) • A small branching factor can reduce both detection cost and travel cost • With a small branching factor, there are more clusters with different granularities • The assumption of temporal locality is useful in event threading; more experiments after the submission show that a larger window size can improve performance

  36. Conclusion • Discussed several of the techniques that systems have used to build or enhance topic models, and listed the merits of many of them • TDT research explores the extent to which IR technology can be used to solve TDT problems
