Judith A. Molka-Danielsen

IN350: Document Management and Information Steering
Class 5: Text properties and processing, File Organization and Indexes
Judith A. Molka-Danielsen
September 10, 2001
Class 5 notes are based on Chapter 6 and Appendix A of the Article Collection

Review: Guest Lectures by Michael Spring
• The Document Processing Revolution by MBS.
• How do you define a document?
• Revolutions: reprographics, communications
• Transition: create, compose, render; WWW/XML
• New processing model for e-docs, future doc forms, changes.
• XML: a first look (see MBS notes page).
• Namespace rules allow for a modular document, reuse.
• DTDs are historical, Schema is the future.
• XPath and pointers: used for accessing the document as a tree of nodes.
• XSL style: rendering, XSLT uses XSL and FO. Any look.
• XSLT transformations of docs to multiple forms.
• E-business: B2B applications will use concepts, specifications, and tools based on the XML family.

Modeling Natural Languages and Compression
• Information theory - the amount of information in a text is quantified by its entropy, a measure of information uncertainty. If one symbol appears all the time, it conveys little information. Higher-entropy text cannot be compressed as much as lower-entropy text.
• Symbols - there are 26 letters in the English alphabet and 29 in the Norwegian alphabet.
• Frequency of occurrence - the frequency of the symbols in text differs between languages. In English, 'e' has the highest occurrence. Variable-length coding schemes such as Huffman coding can represent the symbols based on their frequency of occurrence, giving shorter codes to more frequent symbols.
• Compression for transferring data can be based on the frequency of symbols. More on this in another lecture.
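The two ideas on this slide, entropy and frequency-based coding, can be tried out on a short string. The following is a minimal Python sketch, not taken from the readings: the function names `entropy_bits_per_symbol` and `huffman_code` are chosen here for illustration, and the entropy is computed from single-symbol frequencies only (an order-0 model, which is an assumption).

```python
import heapq
import math
from collections import Counter


def entropy_bits_per_symbol(text):
    """Shannon entropy of the symbol frequencies in text, in bits per symbol."""
    counts = Counter(text)
    total = len(text)
    # Each symbol with probability p = n/total contributes p * log2(1/p).
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def huffman_code(text):
    """Build a Huffman code (symbol -> bit string) from symbol frequencies."""
    counts = Counter(text)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): "0"}
    # Heap entries are (weight, tie_breaker, partial code table).
    # The unique tie_breaker keeps Python from ever comparing the dicts.
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; prefix their codes with 0 and 1.
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {sym: "0" + code for sym, code in codes1.items()}
        merged.update({sym: "1" + code for sym, code in codes2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


if __name__ == "__main__":
    sample = "aaaabbc"
    code = huffman_code(sample)
    encoded_bits = sum(len(code[symbol]) for symbol in sample)
    print(code)
    print(encoded_bits, entropy_bits_per_symbol(sample) * len(sample))
```

A string of one repeated symbol such as "aaaa" has entropy 0, while "abab" has exactly 1 bit per symbol. In the sample above, the most frequent symbol 'a' receives the shortest code, and the total encoded length cannot fall below the entropy times the text length.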

Modeling Natural Languages and Compression
• The creation of indices is often based on the frequency of occurrence of words within a text.
• Zipf's Law - named after the Harvard linguistics professor George Kingsley Zipf (1902-1950), is the observation that the frequency of occurrence of some event ($P$), as a function of its rank ($i$) when the rank is determined by that frequency of occurrence, is a power-law function $P_i \sim 1/i^{a}$ with the exponent $a$ close to unity.
• Zipf's distribution - the frequencies of words in a document approximately follow this distribution. In the English language, the probability of encountering the $i$th most common word is given roughly by Zipf's law for $i$ up to 1000 or so. The law breaks down for less frequent words, since the harmonic series diverges.
• Stopwords - a few hundred words take up 50% of the text. These can be disregarded to reduce the size of the indices.
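To see what a rank-frequency table looks like, and why a small set of top-ranked words can cover so much of a text, the following Python sketch may help. It is an illustration under assumptions, not part of the readings: the function names are invented here, words are split with a simple `\w+` pattern, and the ideal Zipf distribution is normalized over a finite vocabulary so that the probabilities sum to 1.

```python
import re
from collections import Counter


def rank_frequency(text):
    """Return [(word, count), ...] sorted from most to least frequent."""
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common()


def zipf_probabilities(vocab_size, a=1.0):
    """Ideal Zipf probabilities P_i proportional to 1 / i**a for ranks 1..vocab_size."""
    weights = [1.0 / (rank ** a) for rank in range(1, vocab_size + 1)]
    total = sum(weights)  # finite normalizing constant for a finite vocabulary
    return [w / total for w in weights]


def coverage_of_top(probabilities, k):
    """Share of all word occurrences covered by the k most frequent words."""
    return sum(probabilities[:k])


if __name__ == "__main__":
    table = rank_frequency("the cat and the dog and the bird")
    for rank, (word, count) in enumerate(table, start=1):
        # Under Zipf's law with a close to 1, rank * count stays roughly constant
        # on a large text; this toy sentence is far too short to show that.
        print(rank, word, count, rank * count)

    probabilities = zipf_probabilities(1000)
    print(coverage_of_top(probabilities, 100))
```

With the exponent set to 1, the most common word is exactly twice as likely as the second most common one. `coverage_of_top` gives a rough feel for the stopword observation: in this idealized model the top-ranked words account for a large share of all occurrences.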

Modeling Natural Languages and Compression
• Document vocabulary - the number of distinct (different) words within a document.
• Heaps' Law - is shown in the right graph of Figure 6.2 in the readings. In general, the number of new words found in a document grows sublinearly with increasing text size: for a text of $n$ words, the vocabulary size follows roughly $V = K n^{\beta}$ with $0 < \beta < 1$. So, if you have an encyclopedia, most of the words are probably found in the first volume.
• Heaps' Law also implies that the length of the words in the vocabulary increases logarithmically with the text size, but the average word length in the text is constant. That is because the shorter words occur far more often.
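Heaps' Law is usually written as $V = K n^{\beta}$, where $n$ is the number of words in the text and $V$ is the vocabulary size. The Python sketch below measures vocabulary growth on a token list and evaluates that formula. The function names and the values of `K` and `beta` in the example are assumptions picked for illustration; in practice both constants are fitted to a particular collection.

```python
def vocabulary_growth(words):
    """Number of distinct words seen after each successive word of the text."""
    seen = set()
    growth = []
    for word in words:
        seen.add(word)
        growth.append(len(seen))
    return growth


def heaps_estimate(n, K, beta):
    """Vocabulary size predicted by Heaps' Law, V = K * n**beta."""
    return K * n ** beta


if __name__ == "__main__":
    tokens = "a b a c b d".split()
    print(vocabulary_growth(tokens))

    # Hypothetical constants: K=10 and beta=0.5 are illustrative, not measured.
    for n in (10_000, 40_000, 160_000):
        print(n, heaps_estimate(n, K=10, beta=0.5))
```

With `beta=0.5`, quadrupling the text size only doubles the predicted vocabulary, which matches the encyclopedia intuition: later volumes add fewer and fewer new words.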

Text size and Indexing
• Today, we discuss in class: What is the purpose of an index and how is it made? Based on: "Appendix A: File Organization and Storage Structures"
• Text Processing: large text collections are often indexed using inverted files. A vector of distinct words forms a vocabulary, with pointers to a list of all the documents that contain the word(s).
• Indexes can be compressed to allow quicker access. (We discuss this later with Ch.7 in the collection.)
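The inverted file described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the storage structure from Appendix A: the function names, the sample documents, and the choice to intersect document lists for multi-word queries are all assumptions made here.

```python
import re
from collections import defaultdict


def build_inverted_index(documents):
    """Map each distinct word to the sorted list of document ids containing it.

    documents is a dict of {doc_id: text}.
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            postings[word].add(doc_id)
    # Sorting the words gives the vocabulary; each entry points to its document list.
    return {word: sorted(ids) for word, ids in sorted(postings.items())}


def search_all(index, query):
    """Return the ids of documents that contain every word of the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


if __name__ == "__main__":
    # Hypothetical sample collection.
    docs = {
        1: "XML documents and schemas",
        2: "Indexing large text collections",
        3: "Text compression and indexing",
    }
    index = build_inverted_index(docs)
    print(index["text"])
    print(search_all(index, "text compression"))
```

Looking up a word touches only its own document list rather than scanning every document, which is the purpose of the index. Dropping stopwords such as "and" before indexing would remove the longest lists.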
