Word Hashing for Efficient Search in Document Image Collections

Word Hashing for Efficient Search in Document Image Collections Anand Kumar Advisors: Dr. C. V. Jawahar IIIT Hyderabad Dr. R. Manmatha University of Massachusetts, Amherst, USA

Overview • Introduction • The problem • Previous work • Contributions • Searching in document images • Annotation for retrieval • Summary • Future work

Introduction Processing Input Query Scanning Database Image Matching Retrieved Documents Documents Matching images of words, NOT the text (ASCII) words.

Challenges • Direct matching of images is an expensive process. • Represent word images as feature vectors and match. • Representation should capture the characteristics (mainly content) of words. • On every query, searching in large word image database by matching is time consuming. • The scalability issues arise with the increase in size of the document image collection.

Optical Character Recognition (OCR) Input Query Scanning Text Database Text Search Engine Retrieved Documents Documents Basic Directions of Solution • Convert the images into text using recognizers and build index using text search methods. • If the converted text has errors, will the text search methods deliver the expected performance?

Processing Input Query Scanning Database Text Search Engine Retrieved Documents Annotate Words Documents Basic Directions of Solution • Group similar words in the document image collection and annotate (label with text) the groups. • Apply text search methods for accessing the documents. • Is it possible to annotate large groups of words found in a large collection of document images?

The Problem • Building an index using matching or other existing methods is not scalable for even moderate collections. • Given a large collection of document images, how to search efficiently for similar words so that queries are answered quickly (in milli seconds)?

Recognition based methods Chan et al. use bi-gram letter transition model for recognition of words. BYBLOS system uses similar approach for line recognition. The recognizers may fail in presence of degradations. There are no good recognizers and language modeling approaches for Indian languages. Jim Chan, Celal Ziftci, and David A. Forsyth. “Searching Online Arabic Documents”. In Proc. of Conference on Computer Vision and Pattern Recognition (CVPR) (2), pages 1455-1462, 2006. Zhidong Lu, Richard Schwartz, Premkumar Natarajan, Issam Bazzi, and John Makhoul. “Advances in the BBN BYBLOS OCR System”. In Proc. of International Conference on Document Analysis and Recognition (ICDAR), pages 337-340, 1999. U. Pal and B.B. Chaudhuri. “Indian Script Character Recognition: A Survey”. Pattern Recognition, 37:1887-1899, 2004. Previous Work

Recognition free methods Word spotting in handwritten documents Words are clustered and the clusters are annotated to enable search. Dynamic time warping (DTW) is used for matching words. George Washington’s handwritten documents. Similar approach for printed Indian language documents. Word spotting in Ottoman documents Successive pruning stages eliminate wrong words. Toni M. Rath and R. Manmatha. “Word Image Matching Using Dynamic Time Warping”. In Proc. of Conference on Computer Vision and Pattern Recognition (CVPR)(2), pages 521-527, 2003. A. Balasubramanian, M. Meshesha, and C. V. Jawahar. “Retrieval from Document Image Collections”. In International Workshop on Document Analysis Systems (DAS), pages 1-12, 2006. Esra Ataer and Pinar Duygulu. “Retrieval of Ottoman documents”. In Multimedia Information Retrieval (MIR) workshop, pages 155-162, 2006. Previous Work

Contribution of This Work • Data is processed quickly using the proposed technique to help search efficiently in large collection. • Effect of word image representation and document types on the proposed technique are analyzed. • Scalability of the proposed method is demonstrated on a collection of Kalidasa’s books. • The group of similar words retrieved using the proposed approach are labeled (automatically) for annotation based access to documents. • A method to improve the automatic word labeling (annotation) accuracy is presented.

Overview • Introduction • The problem • Previous work • Contributions • Searching in document images • Word image representation • Similarity search • Content sensitive hashing • Fitting in retrieval system • Experimental results • Annotation for retrieval • Summary • Future work

Word Image Representation • Profile Features • Ink transitions • Number of black to white pixel transitions in the image row or column. Calculated for both rows and columns. • Projection profiles • Sum over the pixel values of a column

Profile Features Upper word profiles Black pixel distance from top boundary of the image. Lower word profiles Black pixel distance from bottom of the image. If no pixel is found in a column, the value is taken as height of the image. Word Image Representation

Word Image Representation • Region based moments • Central moments • Discrete Fourier Transform (DFT) coefficients. • Projection and word profile features are segmented vertically into four equal parts. • 1D Fourier transform of the segmented profile features is obtained. • n=4 real parts and last n-1=3 imaginary parts of the DFT are taken as features. • Total 84 Fourier coefficients are taken from each image. 3 x (7 x 4) = 84 features x (coefficients x segments) = total coefficients for every image

Similarity search • Given word image representations as vectors (points) in some space, • We need to search for similar vectors (points) i.e., nearest neighbor search (NNS). • k-d tree, B-tree or R-tree can be used for the NNS. • How to handle the slight differences in the representation of similar words? • Approximate nearest neighbor search has to be carried out. • Since the representations are in high dimension (more than 84 in our case), traditional way of searching is inefficient. • Locality sensitive hashing (LSH) is an approximate nearest neighbor search method for sub-linear time complexity. Jon Louis Bentley. “Multidimensional Binary Search Trees Used for Associative Searching”. Communications of the ACM, 18(9):509-517, 1975. Sunil Arya and David M. Mount. “Approximate Nearest Neighbor Queries in Fixed Dimensions”. In SODA '93. pages 271-280, 1993. Rudolf Bayer and E. McCreight. “Organization and Maintenance of Large Ordered Indexes”. Acta Informatica, 1(3):173-189, 1972. M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. “Locality-Sensitive Hashing Scheme Based on p-Stable Distributions”. In ACM SOCG, pages 253-262, 2004.

Content Sensitive Hashing • A similarity search problem in which it is not necessary to find exact answer; instead determine approximate answer. • The key idea is: • To hash points using several hash functions so as to ensure that for each function the probability of collision is much higher for objects which are close to each other than for those which are far apart. • When a query point is given, • Hash the query point and retrieve elements stored in buckets containing that point.

Hashing Technique Given: set P of n points and number of hash tables L. for each hash table Ti, i = 1,…,L for each point pj, j=1,…n store pj on bucket gi(pj) of hash table Ti. where gi(p), i=1,…,L is hash function of table Ti Hash function can be combination of other functions. Some Examples: g(v1,…,vk) = a1.v1+…+ak.vk mod M where M is hash table size and a1,…,ak are random numbers from interval [0…M-1] g(p) = h1(p),…,hk(p) where hi(p) = (ai.p+bi)/w, ai is a d dimensional vector gL(p) = v1(p)…vL(p) v(p) = Unaryc(x1)…Unaryc(xd). Unaryc(x) = x 1s followed by c-x 0s vi(p) => select some bits from v(p), i = 1..L Content Sensitive Hashing

ρ L = (n / B) Let n is the size of data and B is the bucket size. log (1/p1) ρ = If p1is the probability that a point is found and p2 is probability that a point is found in given radius r. log (1/p2) k = log (n/B) 1/p2 Content Sensitive Hashing • Querying • To process a query q • we search all indices of g1(q),…,gL(q) and collect all points from L indices of hash tables. • Linear search on the collected points. • Output points within distance R from query.

Content Sensitive Hashing • Example • Let, p={1,3,2}, q={1,2,3}, r={3,1,1}, s={2,1,1} are d=3 dimensional points and c=3 is max value in the dimensions. • v(p) = Unaryc(x1)…Unaryc(xd). • Unaryc(x) = x 1s followed by c-x 0s • v(p) = v(1,3,2) = 100 111110 • A new dimensions d’ = cd = 9 is obtained i.e., a set I = {1,2,3,4,5,6,7,8,9}. • Let number of hash tables L=2, and I1={1,5,6}, I2={2,3,7,9} be L subsets from of I. • Hash function is gL(p) = v1(p)…vL(p) • vi(p) => select Ii bits from v(p), i = 1…L Unary(1) = 100 Unary(3) = 111 Unary(2) = 110

Content Sensitive Hashing • Example • v(p) = v(1,3,2) = 100 111110 • g1(p) = 111, g2(p) = 0010. (7, 2) • v(q) = v(1,2,3) = 100 110111 • g1(q) = 110, g2(q) = 0011. (6, 3) • v(r) = v(3,1,1) = 111 100100 • g1(r) = 100, g2(r) = 1110. (4, 13) • Query s = {2,1,1} • v(s) = 110 100100 • g1(s) = 100, g2(s) = 1010. (4, 10) • Resulting point is r I1={1,5,6} and I2={2,3,7,9} v(p) = v(1,3,2) = 100 111110 g1(p) = 1 1 1

Fitting in Retrieval System Document Images Textual Query Relevant Documents Cross Lingual Pre-processing Segmentation and word detection Word Rendering Hashed Words Feature Extraction Feature Extraction Hashing Offline Process Online Process

Experimental Results Performance on different data sets of English language query results

Experimental Results Performance on different data sets of English language Performance of individual features Performance with combination of features

Experimental Results Searching in Kalidasa’s Collection. Cross-lingual search

Experimental Results Searching in Kalidasa’s Collection. Comparison with Dynamic Time Warping based NNS

Overview • Introduction • The problem • Previous work • Contributions • Searching in document images • Annotation for retrieval • Annotation based search • Annotation correction • Experimental results • Summary • Future work

Labeling word segments with corresponding text words. vijayavaaDa paalakulu maarinaa konni Annotation for Retrieval • Annotation is the process of identifying objects in images and labeling with meaningful description. • Search is easy and efficient in annotated document images. • Challenges • Recognition for annotation may be inaccurate. • Manual annotation is impractical

Annotation for Retrieval • Can we use image search to speed up annotation and increase accuracy? • Image search produces clusters of similar words. • A single representative is required to annotate words of the whole cluster. • Cluster of recognized words can be obtained to get the representative. • The cluster information can be used to obtain correct annotation of the cluster.

Annotation Based Search Word Annotation by Recognition Document Images Relevant Documents Cluster of Word Images Pre-processing Text Search Engine Segmentation and word detection Hashed Words Textual Query Feature Extraction Hashing Online Process Offline Process

ambiderous anbiderous ambidextrous ambidextro4s ambidextrous ambidextrous abidextro4s ambideous ambiderous Annotation Correction Correction by Majority Voting What if too erroneous? ambidextrous ambidextro4s ambidextrous ambidextrous ab idex tro4s ambiderous an biderous ambiderous ambideous recognition Ordered words Word length = 12 ambidextrous Final word Word image cluster Text words of cluster

Annotation Correction Correction by Majority Voting • Input:Cluster C of words. • Output:Representative word WR for C • S = Sort C based on string length • Get M = {S | for all A, B in S edit distance of A and B is less than half of the lengths of A and B} • If l is the length of most of the strings (majority) the cluster representative WR has length l. • For each character i = 1,…,l do • Get all k words of length l • Find majority of characters for position i of WR

Annotation Correction Correction by Alignment ambidextrous a m b i d e x t r o u s a b i d e x t r o 4 s a m b i d e r o u s a m b i d e x t r o u s a m b i d e x t r o 4 s abidextro4s ambiderous ambidextrous ambidextro4s Aligned words Text words of cluster a m b i d e x t r o 4 s a m b i d e x t r o u s Word obtained by majority voting Final word

Annotation Correction Correction by Alignment • Input: Cluster C of Wi = 1,…,n words • Output: Cluster representative WR of C • for each i = 1,…,ndo • for each j = 1,…,ndo • ifj ≠ i then do • Align word Wi and Wj • Record errors Ek, k = 1,…,m in Wi • Record possible correction Gp, p = 1,…,q for Ek from Wj • end if • end for • Find correction Ch = Gp by majority voting • Correct Ek with Ch • O ← O U Wi • end for • Find correct word WR from the alignments O with majority voting.

Experimental Results

Experimental Results Effect of cluster size on the retrieval performance

Summary • Direct hashing of the word features eliminates costly processing before building an index. • Query results can be obtained in milliseconds using the content sensitive hashing (CSH). • Scalability of the proposed method is demonstrated on a collection of Kalidasa’s books. • Two methods to improve the automatic word labeling (annotation) accuracy are presented. • Demonstrated annotation based retrieval technique using the automatic annotations of document images.

Future Work • Indexing of documents images in different fonts. • Searching in Multi-lingual documents is one of the challenging tasks. • Many Indian language documents are translated to other languages. • Usage of cluster information • for improving the accuracy of character recognizers. • Annotation becomes difficult in presence of errors in every recognized word of a cluster. • Need to explore new techniques for annotation

Related Publications • Anand Kumar, C.V.Jawahar and R. Manmatha. "Efficient Search in Document Image Collections". Asian Conference on Computer Vision (ACCV), pages 586-595, November 18-22, 2007, Tokyo, Japan. • C.V.Jawahar and Anand Kumar. "Content Level Annotation of Large Collection of Printed Document Images". International Conference on Document Analysis and Recognition (ICDAR), pages 799-803, September 23-26, 2007, Brazil. • Anand Kumar, A. Balasubramanian, Anoop M. Namboodiri and C.V. Jawahar. "Model-Based Annotation of Online Handwritten Datasets", International Workshop on Frontiers in Handwriting Recognition (IWFHR), October 23-26, 2006, La Baule, France.

Questions ? Thank You

Dynamic Time Warping

Partial Matching

Word Hashing for Efficient Search in Document Image Collections

Word Hashing for Efficient Search in Document Image Collections

Presentation Transcript

MOVING IMAGE COLLECTIONS

Efficient Approximate Search on String Collections Part II

Recognition and Retrieval from Document Image Collections

Document Image Analysis Lecture 11: Word Recognition and Segmentation

Data-dependent Hashing for Similarity Search

Minwise Hashing and Efficient Search

Clustering Algorithms for Perceptual Image Hashing

Efficient Approximate Search on String Collections Part I

MOVING IMAGE COLLECTIONS

Image collections online

Image Hashing for DWT SPIHT Coded Images

Cataloguing image collections in Intute

Web Services for Image Conversion and Document Database Search

Word Hashing for Efficient Search in Document Image Collections

Document Image Processing

Document Image Analysis Lecture 12: Word Segmentation

Efficient Approximate Search on String Collections Part II

Document Collections 2

Document Collections 3

Efficient Approximate Search on String Collections Part II