This article provides an overview of various methods used for text line segmentation in historical documents. It discusses the characteristics and representation of text lines, the influence of author style and poor image quality, and the three main axes of document complexity. It also explores preprocessing techniques such as removing non-textual elements and binarization using global or local thresholding. Projection-based methods, including vertical projection profiles, are also examined.
Survey of Text Line Segmentation Methods of Historical Documents • Article written by Laurence Likforman-Sulem, Abderrazak Zahour, Bruno Taconet (2006) • Presenting: Erez Lefel and Koby Israel
1. Introduction • Text line extraction is generally seen as a preprocessing step for tasks such as: • document structure extraction • printed character or handwriting recognition • Text line extraction is most common in ancient and historical documents – printed or handwritten.
2. Characteristics and representation of text lines Some definitions: Baseline: fictitious line which follows and joins the lower part of the character bodies in a text line. Median line: fictitious line which follows and joins the upper part of the character bodies in a text line. Upper line: fictitious line which joins the tops of ascenders. Lower line: fictitious line which joins the bottoms of descenders.
Characteristics and representation of text lines cont. Overlapping components: descenders and ascenders located in the region of an adjacent line. Touching components: ascenders and descenders belonging to consecutive lines which are thus connected.
Characteristics and representation of text lines cont. Text line segmentation: labeling process which consists in assigning the same label to spatially aligned units
Characteristics and representation of text lines cont. Influence of author style Baseline fluctuation: the baseline may vary due to writer movement. It may be straight, straight by segments, or curved. Line spacing: lines that are widely spaced are easy to find; problems start when line spacing is very small, if it exists at all. Insertions: words or short text lines may appear between the principal text lines or in the margins.
Characteristics and representation of text lines cont. Influence of poor image quality Imperfect preprocessing: smudges, the presence of seeping ink or a variable background intensity make image preprocessing difficult and produce binarization errors. Stroke fragmentation and merging: dots and broken strokes due to low-quality images and/or binarization may produce many connected components.
Characteristics and representation of text lines cont. Three main axes of document complexity for text line segmentation
3.1 Preprocessing • In an ideal situation, text line extraction would be performed on a clean document: • without background noise and non-textual elements. • with well-contrasted writing. • with as little fragmentation as possible. • In reality, preprocessing is often necessary. • Preprocessing methods have to be tailored to each document. • Noise and non-textual elements have to be removed before using any text line extraction method.
Preprocessing – cont. • Non-textual elements such as: • book bindings • book sides • thumb marks from someone holding the book open • can be removed based on criteria such as position and intensity level.
Preprocessing – cont. • Other non-textual elements such as: • stamps • seals • ornamentation (decorations) • decorated initials • All of these can be removed using knowledge about the shape, the color or the position of these elements.
Preprocessing – cont. Extracting text from figures can also be performed using texture or morphological filters. Linear graphical elements such as big crosses (called “St Andrew’s crosses”) appear in some of Flaubert’s manuscripts. Removing these elements is performed through a GUI using Kalman filtering.
Kalman filter (from Wikipedia, the free encyclopedia): In statistics, the Kalman filter is a mathematical method named after Rudolf E. Kalman. Its purpose is to use measurements that are observed over time and contain noise (random variations) and other inaccuracies, and to produce values that tend to be closer to the true values of the measurements and their associated calculated values. The Kalman filter has many applications in technology and is an essential part of the development of space and military technology.
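To make the idea concrete, here is a minimal one-dimensional Kalman filter sketch: it smooths a noisy scalar signal toward its underlying value. The parameters (process noise `q`, measurement noise `r`) are illustrative assumptions; the cited manuscript work tracks stroke trajectories and needs a richer state model than this.

```python
# Minimal 1-D Kalman filter sketch (illustrative parameters, not the
# paper's actual tracking setup): estimate a scalar value from noisy
# measurements observed over time.

def kalman_1d(measurements, q=1e-4, r=0.25):
    """Return filtered estimates for a scalar random-walk state."""
    x, p = measurements[0], 1.0     # initial state estimate and variance
    estimates = []
    for z in measurements:
        p += q                      # predict: variance grows by process noise
        k = p / (p + r)             # Kalman gain: trust in the new measurement
        x += k * (z - x)            # update with the measurement residual
        p *= (1 - k)                # variance shrinks after the update
        estimates.append(x)
    return estimates

noisy = [5.1, 4.8, 5.3, 4.9, 5.2, 5.0]   # noisy observations of ~5.0
smoothed = kalman_1d(noisy)
```

The filtered sequence stays close to the true value while damping the measurement noise, which is why the technique suits tracking a (roughly linear) graphical stroke through a noisy image.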
Preprocessing – cont. • Textual but unwanted elements such as bleed-through text can be removed by: • filtering • combining the back-side image with the front-side image
Preprocessing – cont. • Binarization using global thresholding • usually does not work with historical documents, because the background is not uniform. • Binarization using local thresholding • the threshold value is determined based on the local properties of the image, e.g. pixel by pixel or region by region.
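A small sketch of the difference, assuming a toy grey-level "image" whose background is darker on the left than on the right (the non-uniform background case the slide describes). The mean-based local rule and its window/bias values are illustrative; practical systems use Otsu, Niblack, Sauvola, etc.

```python
# Global vs. local thresholding on a toy grey-level image (lower value =
# darker). Pure-Python sketch, illustrative thresholds only.

def global_threshold(img, t):
    """One threshold for the whole image; 1 marks ink (dark pixels)."""
    return [[1 if px < t else 0 for px in row] for row in img]

def local_threshold(img, window=1, bias=30):
    """Pixel is ink if it is clearly darker than its local neighbourhood mean."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - window), min(h, y + window + 1))
                    for i in range(max(0, x - window), min(w, x + window + 1))]
            mean = sum(vals) / len(vals)
            out[y][x] = 1 if img[y][x] < mean - bias else 0
    return out

# Left half: background 100, ink 40. Right half: background 200, ink 140.
img = [
    [100, 100, 100, 200, 200, 200],
    [100,  40, 100, 200, 140, 200],
    [100, 100, 100, 200, 200, 200],
]
g = global_threshold(img, 70)   # misses the ink on the bright half
l = local_threshold(img)        # recovers ink on both halves
```

With a single global threshold of 70, the ink at grey level 140 on the bright half is lost; the local rule finds both strokes because each pixel is compared only to its own neighbourhood.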
Preprocessing –cont. Writing may be faint so that over-segmentation or under-segmentation may occur.
3.2. Projection-based methods Projection profiles are commonly used for printed document segmentation. The vertical projection profile, Profile(y), is obtained by summing pixel values along the horizontal axis for each y value. The gaps between the text lines can then be observed in the vertical direction.
Projection-based methods – cont. • The vertical profile is not sensitive to writing fragmentation. • Other ways of obtaining a profile curve: • counting connected components • projecting black/white transitions
Projection-based methods – cont. The profile curve can be smoothed by a median filter or a Gaussian filter to eliminate spurious local maxima. The profile curve is then analyzed to find its maxima and minima. Cuts are made at significant minima.
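The whole pipeline fits in a few lines. This sketch uses a moving average as a stand-in for the median/Gaussian smoothing, and treats zero-valued rows as the "significant minima" where cuts are made; real profiles need a less naive minimum test.

```python
# Vertical projection profile on a toy binary image (1 = ink), with
# smoothing and cuts at minima. Simplified: cuts are placed at zero rows.

def projection_profile(img):
    """Sum ink pixels along the horizontal axis for each y value."""
    return [sum(row) for row in img]

def smooth(profile, k=3):
    """Moving average (a stand-in for a median or Gaussian filter)."""
    half = k // 2
    return [sum(profile[max(0, i - half):i + half + 1]) /
            len(profile[max(0, i - half):i + half + 1])
            for i in range(len(profile))]

def cut_rows(profile):
    """Rows where the profile vanishes: gaps between text lines."""
    return [i for i, v in enumerate(profile) if v == 0]

img = [
    [0, 1, 1, 1, 0],   # text line 1
    [0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0],   # inter-line gap
    [1, 1, 0, 1, 1],   # text line 2
]
profile = projection_profile(img)
cuts = cut_rows(profile)
```

Here the profile is [3, 2, 0, 4] and the single cut falls on row 2, the blank row between the two text lines.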
Projection-based methods – cont. • Drawbacks: • Short lines will produce low peaks • Narrow lines will not produce significant peaks • In its naïve form, the method cannot handle skew in the text
Projection-based methods – cont. In Shapiro's work, the global orientation (skew angle) of a handwritten page is first estimated by applying a Hough transform to the entire image. Once this skew angle is obtained, projections are performed along this angle.
3.3. Smearing methods For printed and binarized documents, smearing methods can be applied. Consecutive black pixels along the horizontal direction are smeared: the white space between them is filled with black pixels if their distance is within a predefined threshold. The bounding boxes of the connected components in the smeared image enclose text lines.
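The smearing rule described above (often called run-length smoothing, RLSA) is easy to sketch: white runs shorter than a threshold that sit between two black runs are filled in. The threshold value here is an arbitrary illustration.

```python
# Run-length smearing sketch along the horizontal direction: a white gap
# between two black runs is filled if its length is within the threshold.

def smear_row(row, threshold):
    out = row[:]
    gap_start, seen_black = None, False
    for i, px in enumerate(row):
        if px == 1:
            if seen_black and gap_start is not None and i - gap_start <= threshold:
                for j in range(gap_start, i):
                    out[j] = 1            # fill the short white gap
            seen_black, gap_start = True, None
        elif gap_start is None:
            gap_start = i                 # start of a white run
    return out

def smear(img, threshold=2):
    return [smear_row(row, threshold) for row in img]

row = [1, 0, 0, 1, 0, 0, 0, 1]
smeared = smear_row(row, 2)   # the length-2 gap is filled, the length-3 gap kept
```

After smearing, the connected components of the image are solid black stripes, and their bounding boxes enclose the text lines.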
3.4. Grouping methods These methods build alignments by aggregating units in a bottom-up strategy. The units may be pixels or higher-level entities, such as connected components, blocks, etc. Units are joined together to form alignments. The joining scheme relies on both local and global criteria.
Grouping methods – cont. • Every method has to face the following: • Initiating alignments: one or several seeds for each alignment. • Defining a unit's neighborhood for reaching the next unit (generally a rectangular or angular area). • Solving conflicts: as one unit may belong to several alignments under construction, a choice has to be made: discard one alignment or keep both.
Grouping methods – cont. Defining a unit’s neighborhood for reaching the next unit:
Grouping methods – cont. Unlike in printed documents, a simple nearest-neighbor joining scheme would often fail to group complex handwritten units, as the nearest neighbor often belongs to another line.
Grouping methods cont. • When there is a conflict, a choice has to be made! • The decision can be made from given alignment quality measures. • The decision can be made by comparing the quality measures of the competing units in the neighborhood in the next iteration. • Quality measures generally include the strength of the alignment (number of units included). • Other quality elements may concern component size, component spacing, etc.
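A minimal sketch of the conflict-resolution step, assuming the simplest quality measure mentioned above: alignment strength = number of units already grouped. The data structures and names here are hypothetical, not from the survey.

```python
# Conflict resolution sketch: when a unit is claimed by several alignments
# under construction, keep the claim of the strongest alignment (here,
# strength = number of units it already holds) and discard the others.

def resolve_conflicts(alignments, unit_claims):
    """alignments: {name: set of units}; unit_claims: {unit: [claimant names]}."""
    for unit, claimants in unit_claims.items():
        if len(claimants) > 1:
            winner = max(claimants, key=lambda a: len(alignments[a]))
            for name in claimants:
                if name != winner:
                    alignments[name].discard(unit)   # loser gives up the unit
    return alignments

alignments = {"line1": {"u1", "u2", "u3", "ux"}, "line2": {"u7", "ux"}}
resolved = resolve_conflicts(alignments, {"ux": ["line1", "line2"]})
```

The ambiguous unit "ux" stays with the stronger alignment "line1"; richer quality measures (component size, spacing) would simply replace the `len(...)` key function.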
Grouping methods cont. Example of text lines extracted on church registers
Grouping methods cont. Likforman-Sulem and Faure have developed an iterative method based on perceptual grouping for forming alignments, which has been applied to handwritten pages. Anchors are detected by selecting connected components elongated in specific directions (0°, 45°, 90°, 125°). Each of these anchors becomes the seed of an alignment. First each anchor, then each alignment, is extended to the left and to the right according to given rules. A penalty is given when the alignment includes anchors of different directions.
3.5. Methods based on the Hough transform The Hough transform is a very popular technique for finding straight lines in images. This method can extract oriented text lines and sloped annotations under the assumption that such lines are almost straight.
Methods based on the Hough transform – cont. The centroids of the connected components are the units for the Hough transform. A set of aligned units in the image along a line with parameters (ρ, θ) is included in the corresponding cell (ρ, θ) of the Hough domain
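A compact sketch of the voting step, assuming component centroids as units and a coarse (ρ, θ) grid (the discretization here is an arbitrary illustration). Three collinear centroids accumulate three votes in the same cell, which marks a candidate text line.

```python
# Hough voting sketch: each centroid (x, y) votes for every cell
# (rho, theta) with rho = x*cos(theta) + y*sin(theta); a peak in the
# accumulator corresponds to a line of aligned units.

import math
from collections import Counter

def hough_votes(points, thetas, rho_step=1.0):
    acc = Counter()
    for x, y in points:
        for t in thetas:
            rho = x * math.cos(t) + y * math.sin(t)
            # quantize (rho, theta) into accumulator cells
            acc[(round(rho / rho_step), round(math.degrees(t)))] += 1
    return acc

# three centroids on the horizontal line y = 5, plus one outlier
points = [(0, 5), (10, 5), (20, 5), (4, 12)]
thetas = [math.radians(d) for d in range(0, 180, 15)]
acc = hough_votes(points, thetas)
best_cell, votes = acc.most_common(1)[0]
```

The winning cell is (ρ = 5, θ = 90°): the horizontal line y = 5 through the three aligned centroids, with the outlier left out.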
3.6. Repulsive-Attractive network method This method is based on repulsive-attractive forces. It works directly on grey-level images and consists in iteratively adapting the y-position of a predefined number of baseline units. This method has been applied to ancient Ottoman document archives and Latin texts.
Repulsive-Attractive network method – how it works Baselines are constructed one by one from the top of the image to the bottom. Pixels of the image act as attractive forces for baselines. Already extracted baselines act as repulsive forces. The baseline to be extracted is initialized just under the previously examined one, in order to be repelled by it and attracted by the pixels of the line below. The lines must have similar lengths. The result is a set of baselines, each one passing through word bodies.
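A toy one-dimensional sketch of the dynamics described above. The force definitions (mean pull toward ink rows, inverse-distance push from the previous baseline) and all constants are illustrative assumptions, not the paper's exact formulation.

```python
# Repulsive-attractive sketch in 1-D: a baseline's y-position is pulled
# toward the rows carrying ink mass and pushed away from the previously
# extracted baseline above it. Illustrative forces and constants.

def refine_baseline(y, ink_rows, prev_baseline, steps=80, lr=0.1, rep=4.0):
    for _ in range(steps):
        # attraction: mean pull toward the ink rows of the line below
        attract = sum(r - y for r in ink_rows) / len(ink_rows)
        # repulsion: inverse-distance push away from the previous baseline
        d = y - prev_baseline
        repel = rep / d if d != 0 else rep
        y += lr * (attract + repel)
    return y

ink_rows = [10, 10, 11, 12, 10, 11]   # y-positions of ink mass of the next line
prev = 4.0                            # baseline already extracted above
y = refine_baseline(prev + 1, ink_rows, prev)   # initialized just below it
```

The baseline starts just under the previous one, is repelled downward past the gap, and settles inside the ink mass of the line below (around y ≈ 11 here), exactly the behaviour the slide describes.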
Repulsive-Attractive network method cont. Pseudo baselines extracted by a Repulsive-Attractive network on Ancient Ottoman text
3.7. Processing of overlapping and touching components Overlapping and touching components are the main challenge for text line extraction, since no white space is left between lines.
Processing of overlapping and touching components – cont. • Detection of ambiguous components can be done in several ways: • component size. • the component belongs to several alignments. • the component belongs to no alignment.
Processing of overlapping and touching components – cont. • Once a component is detected as ambiguous, it must be classified into one of the following two categories: • the component is an overlapping component (belongs to the upper/lower alignment) • the component is a touching component • In grouping methods (seen in 3.4) it is common to use the component's ambiguity attribute in order to decide whether to add the component to the group or not.
Processing of overlapping and touching components – cont. In the Likforman-Sulem method, touching and overlapping components are detected after the text line extraction process described in 3.5 (methods based on the Hough transform). These components are those which are intersected by at least two different lines (ρ, θ) corresponding to primary cells of validated alignments.
Processing of overlapping and touching components – cont. • Zahour's method for detecting touching and overlapping components: • Cut the text into 8 columns. • A projection profile is computed on each column. • In each histogram, two consecutive minima delimit a text block. • Classify text blocks into 3 categories – small, average, big (using the k-means algorithm). • Overlapping components necessarily belong to big physical blocks. • The average height of the small and average blocks is used to decide into how many pieces each big text block should be cut.
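The classification step can be sketched with a one-dimensional k-means (k = 3) over block heights. The toy heights and initial centers below are illustrative; the survey does not specify the initialization.

```python
# 1-D k-means (Lloyd iteration) sketch: classify text-block heights into
# small / average / big. "Big" blocks are the candidates for containing
# overlapping lines. Toy data and seed centers, for illustration only.

def kmeans_1d(values, centers, iters=20):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            # assign each value to the nearest center
            idx = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[idx].append(v)
        # move each center to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

heights = [4, 5, 5, 12, 13, 11, 30, 28]        # toy text-block heights
centers, clusters = kmeans_1d(heights, [4.0, 12.0, 30.0])
small, average, big = clusters
```

The two tall blocks (heights 28 and 30) land in the "big" cluster; dividing their height by the average height of the other clusters then suggests how many line pieces each big block should be cut into.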
3.8 Non-Latin documents The inter-line space in Latin documents is filled with single dots, ascenders and descenders. The Arabic script is connected and cursive; ancient Arabic documents include diacritical points (e.g. ثك). Ancient Hebrew documents can include decorated words.
Non-Latin documents – cont. In the alphabets of some Indian scripts, many basic characters have a horizontal line (the head line) in the upper part.
3.8.1 Ancient Arabic documents The writing in these documents is very dense, and the line spacing is quite small. The method developed in Zahour et al. begins with the detection of overlapping and touching components presented in 3.7.
3.8.2 Ancient Hebrew documents The manuscripts studied in Likforman-Sulem et al. are written in Hebrew, using “Dfus” letters, in which most characters are made of horizontal and vertical strokes. The Scrolls, intended to be used in the synagogue, do not include diacritics, and there is no separation between words or sentences.