
Introduction

Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement.



  1. Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement
  Rie Kubota Ando. Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. In Proceedings of the 23rd Annual International ACM SIGIR Conference (SIGIR 2000), 2000. Presenter: 游斯涵

  2. Introduction
  • Some studies apply modified or generalized SVD to improve the precision of similarity measurement:
  • SDD (semi-discrete decomposition), T. G. Kolda and D. P. O'Leary: proposed to reduce the storage and computational costs of LSI.
  • R-SVD (Riemannian SVD), E. P. Jiang and M. W. Berry: allows user feedback to be integrated into LSI models.
  • Theoretical studies of LSI: MDS (multidimensional scaling), Bayesian regression models, probabilistic models.

  3. Introduction
  • The problem with SVD:
  • The topics underlying outlier documents tend to be lost as the number of dimensions is lowered.
  • Dimensionality reduction discards information from two sources: outlier documents and minor terms.
  • The idea of this paper:
  • Do not treat outlier documents as "noise"; all documents are assumed to be equally important.
  • Try to eliminate noise from the minor terms without eliminating the influence of the outlier documents.
  • Outlier documents: documents very different from all other documents.

  4. Introduction

  5. Comparison with SVD
  • Same: both try to find a smaller set of basis vectors for a reduced space.
  • Different: this method scales the length of each residual vector, and it treats documents and terms in a nonsymmetrical way.

  6. Algorithm: basis vector creation
  • Input: m×n term-document matrix D (m terms, n documents), scaling factor q
  • Output: basis vectors
  • For (i = 1; until reaching some criterion; i = i + 1):
  • Scale each residual document vector by its length raised to the power q.
  • Set b_i to the first unit eigenvector of the m×m matrix formed from the scaled residual matrix.
  • Subtract from every residual vector its component along b_i.
  • End for
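The loop on this slide can be sketched in Python. This is one reading of the slide, not the paper's code: the function name, the SVD-based eigenvector step, and running for a fixed count of k iterations (rather than "some criterion") are our assumptions.

```python
import numpy as np

def create_basis(D, q, k):
    """Iteratively extract k basis vectors from the m x n term-document
    matrix D, rescaling each residual document vector by its length
    raised to the power q before each eigenvector step."""
    R = D.astype(float).copy()          # residual matrix; columns are residual document vectors
    basis = []
    for _ in range(k):
        # rescale each residual column by its length^q
        norms = np.linalg.norm(R, axis=0)
        S = R * (norms ** q)            # broadcasts the per-column scale factors
        # first unit eigenvector of the m x m matrix S S^T,
        # computed as the first left singular vector of S for stability
        b = np.linalg.svd(S, full_matrices=False)[0][:, 0]
        basis.append(b)
        # remove the component along b from every residual vector
        R = R - np.outer(b, b @ R)
    return np.column_stack(basis)       # m x k matrix of basis vectors
```

Because each deflation step removes the b_i component from every residual column, the extracted basis vectors come out mutually orthonormal.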

  7. Algorithm: basis vector creation
  • (Figure: matrix-dimension diagram — D and the residual matrix are m×n; the matrix whose first eigenvector is taken is m×m.)

  8. Algorithm: document vector creation
  • Dimension reduction: each m-dimensional document vector is projected onto the k basis vectors, yielding a k-dimensional document vector.
  • There are two important variables in this algorithm: q (the scaling factor) and k (the number of dimensions).
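A minimal sketch of the projection step, assuming the reduced document vector is simply the vector of inner products with the k basis vectors (the function names are ours, and the paper's exact normalisation may differ):

```python
import numpy as np

def reduce_documents(D, B):
    """Project each document (a column of the m x n matrix D) onto the
    k basis vectors (columns of the m x k matrix B), yielding a
    k x n matrix of reduced document vectors."""
    return B.T @ D

def cosine_similarity(x, y):
    """Inter-document similarity between two reduced document vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
```

Inter-document similarity is then measured between the k-dimensional columns rather than the original m-dimensional term vectors.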

  9. Example
  • Find the first eigenvector of the example matrix.

  10. Example
  • Find its eigenvector (continued).
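Since the matrix whose eigenvector is sought (S S^T) is symmetric positive semidefinite, the eigenvector computation in these examples can be done with plain power iteration. This is a generic sketch, not the slide's worked numbers:

```python
import numpy as np

def first_eigenvector(A, iters=200, tol=1e-10):
    """Power iteration for the dominant unit eigenvector of a symmetric
    positive semidefinite matrix A (such as S S^T)."""
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])   # arbitrary start vector
    for _ in range(iters):
        w = A @ v
        w = w / np.linalg.norm(w)       # renormalise each step
        if np.linalg.norm(w - v) < tol:  # converged
            return w
        v = w
    return v
```

Any off-the-shelf eigensolver would do equally well; power iteration just makes the "first unit eigenvector" step concrete.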

  11. Probabilistic model
  • Basis vectors are assumed to follow a Gaussian distribution.
  • Multivariate normal (MVN) distribution

  12. Probabilistic model
  • The log-likelihood of the document vectors reduced to dimension k is computed following Ding's model.
  • The dimension that maximizes this log-likelihood is selected.
  • One term of the likelihood is negligible because it changes slowly.

  13. Parameters
  • q (scaling factor): set from 1 to 10 in increments of 1.
  • k (number of dimensions): selected by the log-likelihood criterion.
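The two-variable search can be sketched as a grid over q = 1..10 (as on the slide), choosing k by whatever likelihood criterion is plugged in; `log_likelihood` below is a hypothetical callable standing in for Ding's model, and the basis routine is a compact restatement of the earlier algorithm:

```python
import numpy as np

def rescaled_basis(D, q, k):
    """Compact basis-creation loop: rescale each residual document
    vector by its length to the power q, take the first left singular
    vector, deflate, and repeat k times."""
    R = D.astype(float).copy()
    basis = []
    for _ in range(k):
        S = R * (np.linalg.norm(R, axis=0) ** q)
        b = np.linalg.svd(S, full_matrices=False)[0][:, 0]
        basis.append(b)
        R = R - np.outer(b, b @ R)
    return np.column_stack(basis)

def select_parameters(D, k_max, log_likelihood):
    """Sweep q = 1..10 in steps of 1 and, for each q, pick the number
    of dimensions k that maximises log_likelihood(reduced_vectors),
    a hypothetical stand-in for Ding's criterion."""
    best_q, best_k, best_ll = None, None, -np.inf
    for q in range(1, 11):
        B = rescaled_basis(D, q, k_max)
        for k in range(1, k_max + 1):
            ll = log_likelihood(B[:, :k].T @ D)
            if ll > best_ll:
                best_q, best_k, best_ll = q, k, ll
    return best_q, best_k
```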

  14. Experiment
  • Test data: TREC collections, 20 topics; the total number of documents is 684.
  • The documents are split into two disjoint pools: pool 1 (training data) and pool 2 (test data), each holding 15 document sets.
  • Each set ranges from 31 to 126 documents and covers 6 to 20 topics.

  15. Baseline algorithms
  • Three algorithms are compared:
  • SVD, taking the left singular vectors as the basis vectors
  • The term-document matrix without any basis conversion (term frequency)
  • The proposed algorithm
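The SVD baseline can be sketched in a few lines, assuming it is plain truncated SVD of the term-document matrix (the function name is ours):

```python
import numpy as np

def svd_basis(D, k):
    """Baseline from the slide: take the first k left singular vectors
    of the term-document matrix as the basis (standard LSI/SVD)."""
    U, _, _ = np.linalg.svd(D.astype(float), full_matrices=False)
    return U[:, :k]
```

The term-frequency baseline needs no code at all: documents are compared directly as columns of D.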

  16. Evaluation
  • Assumption: similarity should be higher for any document pair relevant to the same topic (an intra-topic pair).
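One way to turn this assumption into a metric is to measure, among the top-n most-similar document pairs, the fraction that are intra-topic. This is a hypothetical sketch of such a precision measure, not necessarily the paper's exact evaluation:

```python
def precision_at_n(ranked_pairs, topic_of, n):
    """Fraction of the top-n pairs (sorted by descending similarity)
    whose two documents are relevant to the same topic.
    ranked_pairs: list of (doc_a, doc_b) tuples, most similar first.
    topic_of: mapping from document id to its topic."""
    top = ranked_pairs[:n]
    hits = sum(1 for a, b in top if topic_of[a] == topic_of[b])
    return hits / n
```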

  17. Evaluation
  • Preservation rate (of document length)
  • Reduction rate (larger is better): 1 − preservation rate
  • Dimensional reduction rate (larger is better): 1 − (# of dimensions / max # of dimensions)
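These rates can be computed directly; the sketch below assumes the preservation rate is the ratio of a document vector's length after dimension reduction to its original length, which the slide only hints at:

```python
def rates(reduced_norm, original_norm, k, k_max):
    """Evaluation rates from the slide.
    reduced_norm / original_norm: document-vector lengths after and
    before dimension reduction; k / k_max: chosen and maximum number
    of dimensions. Both reduction rates are larger-is-better."""
    preservation = reduced_norm / original_norm
    reduction = 1.0 - preservation
    dim_reduction = 1.0 - k / k_max
    return preservation, reduction, dim_reduction
```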

  18. Selecting the dimension
  • Log-likelihood method
  • Training-based method: choose the dimension that makes the preservation rate closest to the average preservation rate on the training data.
  • Random-guess-based method
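The training-based method can be sketched as a nearest-match search over candidate dimensions; the dict-based signature below is our assumption about how the per-dimension rates would be supplied:

```python
def select_dimension_training(preservation_rates, target):
    """Training-based method from the slide: pick the number of
    dimensions whose preservation rate is closest to the average
    preservation rate observed on the training data (target).
    preservation_rates: dict mapping k -> preservation rate at k."""
    return min(preservation_rates,
               key=lambda k: abs(preservation_rates[k] - target))
```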

  19. Results
  • (Figure: precision of similarity measurement; the proposed algorithm improves precision by 17.8%.)

  20. Results
  • The dimension reduction rate is 43% higher than SVD on average.
  • The algorithm also shows a 35.8% higher reduction rate than SVD.

  21. Results

  22. Conclusion
  • The algorithm achieved higher precision of similarity measurement (up 17.8%) with a higher reduction rate (43% higher) than the baseline algorithms.
  • The scaling factor could be made dynamic to further improve performance.
