Mapping document collections in non-standard geometries

Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak. Institute of Computer Science, Polish Academy of Sciences, Warsaw.

Presentation Transcript


  1. Mapping document collections in non-standard geometries. Slawomir Wierzchon, Mieczyslaw Klopotek, Michal Draminski, Krzysztof Ciesielski, Mariusz Kujawiak. Institute of Computer Science, Polish Academy of Sciences, Warsaw

  2. Agenda • Motivation • Our approach • Architecture • User interface • Visualization • Map creation • Clustering • Experimental results • Future directions Mining Document Maps

  3. Motivation • The Web and intranets are becoming increasingly content-rich: simple ranked lists or even hierarchies of results no longer seem adequate • A good way of presenting massive document sets in an understandable form will be crucial in the near future • The BEATCA project aims at creating a full-fledged search engine for moderate-size document collections (millions of documents), capable of presenting on-line replies to queries in a user-friendly graphical form on a document map (based on the WebSOM approach)

  4. Our approach • The presentation method is based on WebSOM's map idea and is enriched with novel methods of document analysis, clustering and visualization • A special architecture has been developed to enable experiments with various kinds of map creation, visualization, clustering and labelling algorithms • BEATCA: Bayesian Evolutionary Approach to Text Connectivity Analysis

  5. BEATCA architecture • Documents are prepared by an indexer, which turns the HTML (or other) representation of a document into a vector-space model representation • The indexer also identifies frequent phrases in the document set for clustering and labelling purposes • Subsequently, dictionary optimization is performed: terms with extreme entropy and extremely frequent terms are excluded • The map creator is then applied, turning the vector-space representation into a form appropriate for on-the-fly map generation • The ‘best’ map (with respect to some similarity measure) is used by the query processor in response to the user’s query
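The dictionary optimization step can be illustrated with a minimal sketch. It is not the BEATCA implementation: the function names, both thresholds, and the reading of "extreme entropy" as "very low entropy of a term's distribution over documents" are assumptions made for illustration only.

```python
import math
from collections import Counter


def term_entropy(counts):
    """Shannon entropy (in bits) of a term's occurrence counts over documents."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def optimize_dictionary(docs, max_doc_fraction=0.8, min_entropy=0.5):
    """Return the set of terms kept after pruning.

    docs: list of token lists. Both thresholds are hypothetical defaults.
    """
    per_doc = [Counter(tokens) for tokens in docs]
    vocabulary = set().union(*per_doc)
    kept = set()
    for term in vocabulary:
        # Occurrence counts of the term in the documents that contain it.
        counts = [c[term] for c in per_doc if term in c]
        too_frequent = len(counts) / len(docs) > max_doc_fraction
        too_concentrated = term_entropy(counts) < min_entropy
        if not (too_frequent or too_concentrated):
            kept.add(term)
    return kept
```

Here a term present in almost every document carries little discriminative information, and a term confined to a single document has zero entropy, so both are dropped.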

  6. BEATCA architecture

  7. User interface • Search results are presented on a document map • Compact (fuzzy) topical areas are extracted • Query-related summaries are generated on-line • Maps can have one of the following topologies: • the traditional flat map (square or hexagonal cells) • rotating 3D map (torus, sphere, cylinder) • hyperbolic map (Poincaré or Klein projections) • growing map (Growing Neural Gas)

  8. User interface

  9. Map visualizations in 3D

  10. Kohonen learning overview • Unsupervised neural network learning model • Each neuron is represented by a reference vector in the document space • Each vector element (term dimension) equals the term's TFxIDF weight • Iterative regression of the reference vectors onto the document vector space • Similarity is computed as the cosine of the angle between the corresponding vectors
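A single Kohonen learning step can be sketched as follows. This is a generic SOM update under stated assumptions, not the modified method used in BEATCA: the document vector is assumed to hold TFxIDF weights already, the map is a flat grid keyed by `(row, col)`, and the Gaussian neighbourhood function is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def som_step(cells, doc, alpha, radius):
    """Perform one learning step in place and return the winning cell.

    cells: dict mapping (row, col) -> reference vector (list of floats).
    alpha: learning rate; radius: neighbourhood width on the grid.
    """
    # The winner is the cell whose reference vector is most similar to the document.
    winner = max(cells, key=lambda pos: cosine(cells[pos], doc))
    for pos, ref in cells.items():
        grid_dist = math.hypot(pos[0] - winner[0], pos[1] - winner[1])
        if grid_dist <= radius:
            # Assumed Gaussian neighbourhood; with radius 0 only the winner moves.
            h = alpha * math.exp(-grid_dist ** 2 / (2 * radius ** 2)) if radius > 0 else alpha
            cells[pos] = [r + h * (d - r) for r, d in zip(ref, doc)]
    return winner
```

Repeating this step over all documents while shrinking `alpha` and `radius` gives the iterative regression of reference vectors described on the slide.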

  11. How the maps are created • A modified WebSOM method is used: • compact reference vector representation • broad-topic initialization method • joint winner search method • multi-level (hierarchical) maps • three-phase document clustering: • initial grouping via PLSA/PHITS • WebSOM on document groups • fuzzy cell cluster extraction and labelling

  12. Reference vector representation • Vectors are sparse by nature • During the learning process they become even sparser • Represented as balanced red-black trees • A tolerance threshold is imposed • Terms (dimensions) below the threshold are removed • Significant complexity reduction without negative impact on quality
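The effect of the tolerance threshold can be sketched with a sparse update. The slide describes red-black trees; this sketch swaps in a plain Python `dict` as the sparse container, and the function name and parameters are hypothetical.

```python
def update_sparse(ref, doc, alpha, tolerance):
    """Move sparse vector `ref` toward `doc` by factor `alpha`.

    Both vectors are dicts mapping term -> weight; missing terms count as 0.
    Entries whose magnitude falls below `tolerance` are dropped, which keeps
    the reference vector sparse.
    """
    updated = {}
    for term in ref.keys() | doc.keys():
        old = ref.get(term, 0.0)
        value = old + alpha * (doc.get(term, 0.0) - old)
        if abs(value) >= tolerance:
            updated[term] = value
    return updated
```

Terms absent from the documents mapped to a cell decay toward zero at each step and eventually fall under the threshold, which is one way to read the claim that vectors become sparser during learning.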

  13. Topic-sensitive initialization • Inter-topic similarities are important both for map learning and for visualization/cluster extraction • Simple approach: • Use LSI to select K main broad topics • Select K map cells (evenly spread over the map) as the fixpoints for individual topics • Initialize the selected fixpoints with the broad topics • Initialize the remaining cells with "in-between values"
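One possible reading of "in-between values" is an interpolation between the fixpoint vectors. The sketch below assumes the K broad-topic vectors are already available (the LSI step is not shown) and uses inverse-distance weighting on the grid, which is an assumption of this example and not necessarily the scheme the authors used.

```python
import math


def init_map(rows, cols, fixpoints):
    """Initialize every cell of a rows x cols map.

    fixpoints: dict mapping (row, col) -> broad-topic vector (list of floats).
    Fixpoint cells copy their topic vector; every other cell receives an
    inverse-distance-weighted average of all fixpoint vectors.
    """
    dim = len(next(iter(fixpoints.values())))
    cells = {}
    for r in range(rows):
        for c in range(cols):
            if (r, c) in fixpoints:
                cells[(r, c)] = list(fixpoints[(r, c)])
                continue
            # Closer fixpoints contribute more to the cell's initial vector.
            weights = {p: 1.0 / math.hypot(r - p[0], c - p[1]) for p in fixpoints}
            total = sum(weights.values())
            cells[(r, c)] = [
                sum(weights[p] * fixpoints[p][i] for p in fixpoints) / total
                for i in range(dim)
            ]
    return cells
```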

  14. Joint winner search • Global winner search: accurate but slow • Local winner search: faster but can be inaccurate during rapid changes • Start with a single phase of global search • Document movements become smoother during the learning process: usually local search is enough • Use global search when occasional sudden moves occur (e.g. outliers, neighbourhood width decrease)
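The combination of local and global search can be sketched as below. The fallback rule (switch to global search when the best local similarity drops below `min_similarity`) is a hypothetical trigger chosen for illustration; the slides do not say how sudden moves are detected.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def global_winner(cells, doc):
    """Scan every cell: accurate but slow."""
    return max(cells, key=lambda pos: cosine(cells[pos], doc))


def local_winner(cells, doc, previous, reach=1):
    """Scan only cells within `reach` grid steps of the previous winner."""
    nearby = [p for p in cells
              if abs(p[0] - previous[0]) <= reach and abs(p[1] - previous[1]) <= reach]
    return max(nearby, key=lambda pos: cosine(cells[pos], doc))


def joint_winner(cells, doc, previous=None, min_similarity=0.5):
    """Use local search when possible and fall back to global search."""
    if previous is None:
        # First phase: no earlier position is known for this document.
        return global_winner(cells, doc)
    candidate = local_winner(cells, doc, previous)
    if cosine(cells[candidate], doc) < min_similarity:
        # Assumed signal of a sudden move: the local neighbourhood fits poorly.
        return global_winner(cells, doc)
    return candidate
```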

  15. Hierarchical maps • Bottom-up approach • Feasible (with the joint winner search method) • Start with the most detailed map • Compute weighted centroids of map areas: #FORMULA# • Use them as seeds for the coarser map • A top-down approach is possible but requires fixpoints

  16. Clustering document groups • Numerous methods exist, but none of them is directly applicable: • Extremely fuzzy structure of topical groups in SOM cells • Necessity of taking into account similarity measures both in the original document space and in the map space • Outlier-handling problem during cluster formation • No a priori estimate of the number of topical groups • Fuzzy C-Means on the lattice of map cells is applied • Graph-theoretical approach (density- and distance-based MST) combined with fuzzy clustering • Clustered documents are labelled by weighted centroids of cell reference vectors scaled with between-group entropy
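For reference, the textbook Fuzzy C-Means iteration can be sketched as follows. This is the standard algorithm on generic points with Euclidean distance, not the lattice-aware variant combined with MST that the slide describes; the fuzzifier `m`, the iteration count, and the need to supply initial centers are assumptions of this sketch.

```python
import math


def fuzzy_c_means(points, centers, m=2.0, iterations=20):
    """Standard fuzzy c-means on lists of equal-length float vectors.

    Returns (centers, memberships), where memberships[i][k] is the degree to
    which point i belongs to cluster k, as computed in the final iteration.
    """
    centers = [list(c) for c in centers]
    dim = len(points[0])
    memberships = []
    for _ in range(iterations):
        memberships = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if 0.0 in dists:
                # A point sitting exactly on a center belongs fully to it.
                row = [1.0 if d == 0.0 else 0.0 for d in dists]
            else:
                row = [d ** (-2.0 / (m - 1.0)) for d in dists]
            total = sum(row)
            memberships.append([x / total for x in row])
        for k in range(len(centers)):
            # Each center becomes the membership-weighted mean of all points.
            w = [row[k] ** m for row in memberships]
            total = sum(w)
            if total > 0:
                centers[k] = [
                    sum(wi * p[i] for wi, p in zip(w, points)) / total
                    for i in range(dim)
                ]
    return centers, memberships
```

Because memberships are graded rather than 0/1, a cell lying between two topical areas can belong partly to both, which matches the fuzzy structure of topical groups noted on the slide.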

  17. Experiments with map convergence • We examined the convergence of the maps to a stable state depending on: • type of alpha function (search radius reduction) • type of winner search method • type of initialization method

  18. Convergence – alpha functions

  19. Convergence – winner search

  20. Experiments with execution time • The impact of the following factors on the speed of map creation was investigated: • Map size (total number of cells) • Optimization methods: • dictionary optimization • reference vector representation • Map quality assessment: • Compare with an ‘ideal’ map (e.g. one built without optimizations) • Identical initialization and learning parameters • Compute the sum of squared distances between each document's locations on the two maps
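The map quality measure from the last bullet can be written down directly. The sketch assumes a flat map with `(row, col)` cell coordinates and plain Euclidean distance; on a torus or sphere a different distance would be needed. The function name is hypothetical.

```python
def map_discrepancy(ideal, optimized):
    """Sum of squared distances between each document's cells on two maps.

    ideal, optimized: dicts mapping document id -> (row, col) of its cell.
    A value of 0 means every document landed in the same cell on both maps.
    """
    return sum(
        (ideal[doc][0] - optimized[doc][0]) ** 2
        + (ideal[doc][1] - optimized[doc][1]) ** 2
        for doc in ideal
    )
```

For example, if one document moves from cell (2, 3) to cell (3, 5) and all others stay put, the measure is 1 + 4 = 5.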

  21. Execution time - map size

  22. Execution time - optimizations

  23. Future research • Maps for a joint term-citation model, taking into account between-group link flow direction • Fully distributed map creation • Adaptive document retrieval and clustering: • Bayesian-network-based relevance measure • Survival models for document update rate estimation • Dead link propagation methods for page freshness estimation • We also intend to integrate Bayesian and immune system methodologies with WebSOM in order to achieve new clustering effects

  24. Future research • Bayesian networks will be applied in particular to: • measure relevance and classify documents • accelerate document clustering processes • construct a thesaurus supporting query enrichment • extract keywords • estimate between-topic dependencies

  25. Thank you! Any questions?
