
Information Retrieval through Various Approximate Matrix Decompositions


Presentation Transcript


  1. Information Retrieval through Various Approximate Matrix Decompositions Kathryn Linehan Advisor: Dr. Dianne O’Leary

  2. Querying a Document Database • We want to return documents that are relevant to the entered search terms • Given data: • Term-Document Matrix, A • Entry (i, j): importance of term i in document j • Query Vector, q • Entry (i): importance of term i in the query
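To make the setup concrete, here is a minimal sketch (not part of the original slides) that builds a term-document matrix and query vector from a hypothetical toy corpus, using raw term counts as one simple notion of "importance":

```python
import numpy as np

# Hypothetical toy corpus: 5 terms, 4 documents.
terms = ["matrix", "query", "sparse", "rank", "norm"]
docs = [
    "matrix rank matrix norm",
    "query matrix",
    "sparse matrix rank",
    "query norm norm",
]

# Term-document matrix A: entry (i, j) = count of term i in document j,
# a simple stand-in for "importance of term i in document j".
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# Query vector q: entry i = importance of term i in the query.
q = np.array(["matrix rank".split().count(t) for t in terms], dtype=float)
```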

  3. Solutions • Literal Term Matching • Compute score vector: s = qᵀA • Return the highest-scoring documents • May not return relevant documents that do not contain the exact query terms • Latent Semantic Indexing (LSI) • Same process as above, but use an approximation to A
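A sketch of both solutions, assuming A and q as above (the rank k is a tuning parameter):

```python
import numpy as np

def literal_scores(A, q):
    # Literal term matching: score vector s = q^T A.
    return q @ A

def lsi_scores(A, q, k):
    # LSI: identical scoring, but against the rank-k SVD approximation of A,
    # which can give nonzero scores to documents sharing no exact query term.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return q @ A_k

# Return the highest-scoring documents, e.g.:
# ranking = np.argsort(-lsi_scores(A, q, k=2))
```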

  4. Term-Document Matrix Approximation • Standard approximation used in LSI: rank-k SVD • Project Goal: evaluate use of term-document matrix approximations other than rank-k SVD in LSI • Nonnegative Matrix Factorization (NMF) • CUR Decomposition

  5. Matrix Approximation Validation • Let Ã be an approximation to A • As the rank of Ã increases, we expect the relative error, ‖A − Ã‖_F / ‖A‖_F, to go to zero • Matrix approximation can be applied to any matrix A • Preliminary test matrix A: 50 x 30 random sparse matrix • Future test matrices: three large sparse term-document matrices
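A sketch of this validation loop, using the rank-k SVD as the approximation (the Frobenius norm and the roughly 10% density of the random test matrix are assumptions; the slides do not specify either):

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 x 30 random sparse test matrix (density assumed here).
A = rng.random((50, 30)) * (rng.random((50, 30)) < 0.1)

# Relative error of the rank-k SVD approximation for increasing k.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
for k in (1, 5, 10, 20, 30):
    A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    rel_err = np.linalg.norm(A - A_k, "fro") / np.linalg.norm(A, "fro")
    print(f"rank {k:2d}: relative error = {rel_err:.3f}")
```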

  6. Nonnegative Matrix Factorization (NMF) • A ≈ WH, where A is m x n, W is m x k, and H is k x n • Term-document matrix is nonnegative • W and H are nonnegative

  7. NMF • Multiplicative update algorithm of Lee and Seung found in [1] • Find W, H to minimize ‖A − WH‖_F² • Random initialization for W, H • Convergence is not guaranteed, but in practice it is very common • Slow due to the matrix multiplications in each iteration
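A minimal sketch of the Lee-Seung multiplicative updates for this objective (the iteration count and the eps safeguard are arbitrary choices, not from [1]):

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, seed=0):
    # Multiplicative updates of Lee and Seung minimizing ||A - WH||_F^2.
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))  # random nonnegative initialization
    H = rng.random((k, n))
    for _ in range(iters):
        # Each update multiplies elementwise by a nonnegative ratio, so
        # W and H stay nonnegative; eps avoids division by zero. Note the
        # several matrix products per iteration, which is why it is slow.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H
```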

  8. NMF Validation • [Plot: relative error vs. rank k] A: 50 x 30 random sparse matrix. Average over 5 runs.

  9. CUR Decomposition • A ≈ CUR, where A is m x n, C is m x c, U is c x r, R is r x n, and k is a rank parameter • Term-document matrix is sparse • C (R) holds c (r) sampled and rescaled columns (rows) of A • U is computed using C and R

  10. CUR Implementations • CUR algorithm in [2] by Drineas, Kannan, and Mahoney • Linear time algorithm • Modification: use ideas in [3] by Drineas, Mahoney, Muthukrishnan (no longer linear time) • Improvement: Compact Matrix Decomposition (CMD) in [5] by Sun, Xie, Zhang, and Faloutsos • Other Modifications: our ideas • Deterministic CUR code by G. W. Stewart

  11. Sampling • Column (Row) norm sampling [2] • Prob(col j) = ‖A(:, j)‖² / ‖A‖_F² (similar for row i) • Subspace Sampling [3] • Uses rank-k SVD of A for column probabilities • Prob(col j) = ‖(V_k)(j, :)‖² / k, where V_k holds the top k right singular vectors of A • Uses “economy size” SVD of C for row probabilities • Prob(row i) = ‖(U_C)(i, :)‖² / c, where U_C holds the left singular vectors of C • Sampling without replacement
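A sketch of column-norm sampling [2]; the 1/sqrt(c·p_j) rescaling is the standard choice in [2], and rows are handled the same way on Aᵀ:

```python
import numpy as np

def sample_columns(A, c, seed=0):
    # Column-norm sampling with replacement, as in [2]:
    # Prob(col j) = ||A[:, j]||^2 / ||A||_F^2.
    rng = np.random.default_rng(seed)
    p = np.sum(A**2, axis=0) / np.sum(A**2)
    idx = rng.choice(A.shape[1], size=c, p=p)   # sample with replacement
    C = A[:, idx] / np.sqrt(c * p[idx])         # rescale sampled columns
    return C, idx
```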

  12. Sampling Comparison • [Plot: relative error vs. rank] A: 50 x 30 random sparse matrix. Average over 5 runs. Legend: Sampling, U, Scaling (scaling shown only for sampling without replacement)

  13. Computation of U • Linear algorithm U [2]: approximately solves min_U ‖A − CUR‖_F by taking U = W_k⁺, the pseudoinverse of the best rank-k approximation to W, where W is the intersection of the sampled rows and columns of A • Optimal U: solves min_U ‖A − CUR‖_F exactly; the solution is U = C⁺AR⁺
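The optimal U has a closed form in terms of pseudoinverses; a one-line sketch:

```python
import numpy as np

def optimal_U(A, C, R):
    # argmin over U of ||A - C U R||_F  is  U = C^+ A R^+.
    return np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
```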

  14. U Comparison • [Plot: relative error vs. rank] A: 50 x 30 random sparse matrix. Average over 5 runs. Legend: Sampling, U

  15. Compact Matrix Decomposition (CMD) Improvement • Remove repeated columns (rows) in C (R) • Decreases storage while still achieving the same relative error [5] • [Figure] A: 50 x 30 random sparse matrix, k = 15. Average over 10 runs.
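A sketch of the duplicate removal, assuming C and the sampled indices idx come from a with-replacement sampler like the one above. Scaling each kept column by the square root of its multiplicity preserves C·Cᵀ, which is why the relative error is unchanged [5]:

```python
import numpy as np

def dedup_columns(C, idx):
    # CMD-style duplicate removal: keep one copy of each repeated column,
    # scaled by sqrt(multiplicity), so that C' C'^T = C C^T.
    _, first, counts = np.unique(idx, return_index=True, return_counts=True)
    return C[:, first] * np.sqrt(counts)
```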

  16. Deterministic CUR • Code by G. W. Stewart • Uses an RRQR (rank-revealing QR) algorithm that does not store Q • We only need the permutation vector • Gives us the columns (rows) for C (R) • Uses the optimal U
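Stewart's code itself is not reproduced here; the sketch below substitutes SciPy's QR with column pivoting for the RRQR and keeps only the pivot (permutation) vectors, as described above:

```python
import numpy as np
from scipy.linalg import qr

def deterministic_cur(A, c, r):
    # Column pivoting on A picks columns for C; pivoting on A^T picks rows
    # for R. mode="r" avoids forming Q explicitly.
    _, col_perm = qr(A, mode="r", pivoting=True)
    _, row_perm = qr(A.T, mode="r", pivoting=True)
    C, R = A[:, col_perm[:c]], A[row_perm[:r], :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # optimal U = C^+ A R^+
    return C, U, R
```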

  17. CUR Comparison • [Plot: relative error vs. rank] A: 50 x 30 random sparse matrix. Average over 5 runs. Legend: Sampling, U, Scaling (scaling shown only for sampling without replacement)

  18. Future Project Goals • Finish investigation of CUR improvement • Validate NMF and CUR using term-document matrices • Investigate storage, computation time and relative error of NMF and CUR • Test performance of NMF and CUR in LSI • Use average precision and recall, where the average is taken over all queries in the data set

  19. Precision and Recall • Measurements of performance for document retrieval • Let Retrieved = number of documents retrieved, Relevant = total number of documents relevant to the query, RetRel = number of retrieved documents that are relevant • Precision: RetRel / Retrieved • Recall: RetRel / Relevant
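A small sketch of the two measures (the document IDs in the example are hypothetical):

```python
def precision_recall(retrieved_ids, relevant_ids):
    # Precision = RetRel / Retrieved;  Recall = RetRel / Relevant.
    retrieved, relevant = set(retrieved_ids), set(relevant_ids)
    ret_rel = len(retrieved & relevant)
    return ret_rel / len(retrieved), ret_rel / len(relevant)

# Example: 3 of the 4 retrieved documents are among the 6 relevant ones:
# precision_recall([0, 2, 5, 7], [0, 1, 2, 5, 8, 9])  ->  (0.75, 0.5)
```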

  20. Further Topics • Time-permitting investigations: • Parallel implementations of matrix approximations • Testing performance of matrix approximations in forming a multidocument summary

  21. References
  [1] Michael W. Berry, Murray Browne, Amy N. Langville, V. Paul Pauca, and Robert J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics and Data Analysis, 52(1):155-173, September 2007.
  [2] Petros Drineas, Ravi Kannan, and Michael W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36(1):184-206, 2006.
  [3] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844-881, 2008.
  [4] Tamara G. Kolda and Dianne P. O'Leary. A semidiscrete matrix decomposition for latent semantic indexing in information retrieval. ACM Transactions on Information Systems, 16(4):322-346, October 1998.
  [5] Jimeng Sun, Yinglian Xie, Hui Zhang, and Christos Faloutsos. Less is more: Sparse graph mining with compact matrix decomposition. Statistical Analysis and Data Mining, 1(1):6-22, February 2008.
