900 likes | 983 Views
Processing of large document collections. Part 1 Helena Ahonen-Myka Fall 2002. Organization of the course. Classes: 30.9., 28.10., 29.10., 18.11. lectures (Helena Ahonen-Myka): 13.15-17 exercise sessions (Juha Makkonen): 10.15-12 Exercises are given and returned each week
E N D
Processing of large document collections Part 1 Helena Ahonen-Myka Fall 2002
Organization of the course • Classes: 30.9., 28.10., 29.10., 18.11. • lectures (Helena Ahonen-Myka): 13.15-17 • exercise sessions (Juha Makkonen): 10.15-12 • Exercises are given and returned each week • deadline: Thursday midnight • Exam: Tue 3.12. at 16-20 (Auditorio) • Points: exam 40 pts, exercises 20 pts • required: exam 20 pts, exercises 10 pts • exercise sessions give extra points (1/session)
Schedule • 30.9. • preprocessing of text; representation of textual data: vector model; term selection; text categorization • 28.-29.10. • character set issues; text summarization; text compression • 18.11. • text indexing and querying
Schedule • self-study • basic transformations for text data • using Unix tools, Perl and XSLT • get to know some linguistic tools • morphological and grammatical analysis • Wordnet
1. Large document collections • What is a document? • “a document records a message from people to people” (Wilkinson et al., 1998) • each document has content, structure, and metadata (context) • in this course, we concentrate on content • particularly: textual content
Large document collections • large? • some person may have written a document, but it is not possible later to process the document manually -> automatic processing is needed • large w.r.t to the capacity of a device (e.g. a mobile phone) • collection? • documents somehow similar -> automatic processing is possible
Applications • text categorization • text summarization • text compression • text indexing and retrieval • information extraction • question answering • machine translation …
2. Representation of textual information • Text cannot be directly interpreted by the many document processing applications • we need a compact representation of the content • which are the meaningful units of text?
Terms • Words • typical choice • set of words, bag of words • phrases • syntactical phrases (e.g. noun phrases) • statistical phrases (e.g. frequent pairs of words) • usefulness not yet known?
Terms • Part of the text is not considered as terms • very common words (function words): • articles (a, the) , prepositions (of, in), conjunctions (and, or), adverbs (here, then) • numerals (30.9.2002, 2547) • these words can be removed • stopword list • other preprocessing possible • stemming (recognization -> recogn), base words (skies -> sky)
Vector model • A document is usually represented as a vector of term weights • the vector has as many dimensions as there are terms in the whole collection of documents • the weight represents how much the term contributes to the semantics of the document
Vector model • in our sample document collection, there are 118 words (terms) • in alphabetical order, the list of terms starts with: • absorption • agriculture • anaemia • analyse • application • …
Vector model • each document can be represented by a vector of 118 dimensions • we can think a document vector as an array of 118 elements, one for each term, indexed, e.g. 0-117
Vector model • let d1 be the vector for document 1 • record only which terms occur in document: • d1[0] = 0 -- absorption doesn’t occur • d1[1] = 0 -- agriculture -”- • d1[2] = 0 -- anaemia -”- • d1[3] = 0 -- analyse -”- • d1[4] = 1 -- application occurs • ... • d1[21] = 1 -- current occurs • …
Weighting terms • Usually we want to say that some terms are more important than the others -> weighting • weights usually range between 0 and 1 • binary weights may be used • 1 denotes presence, 0 absence of the term in the document
Weighting terms • if a word occurs many times in a document, it may be more important • but what about very frequent words? • often the TF*IDF function is used • higher weight, if the term occurs often in the document • lower weight, if the term occurs in many documents
Weighting terms: TF*IDF • TF*IDF = term frequency * inversed document frequency • weight of term tk in document dj: • where • #(tk,dj): the number of times tk occurs in dj • #Tr(tk): the number of documents in Tr in which tk occurs • Tr: the documents in the collection
Weighting terms: TF*IDF • in document 1: • term ’application’ occurs once, and in the whole collection it occurs in 2 documents: • tfidf (application, d1) = 1 * log(10/2) = log 5 ~ 0.7 • term ´current´occurs once, in the whole collection in 9 documents: • tfidf(current, d1) = 1 * log(10/9) ~ 0.05
Weighting terms: TF*IDF • if there were some word that occurs 7 times in doc 1 and only in doc 1, the TF*IDF weight would be: • tfidf(doc1word, d1) = 7 * log(10/1) = 7
Weighting terms: normalization • in order for the weights to fall in the [0,1] interval, the weights are often normalized (T is the set of terms):
Effect of structure • Either the full text of the document or selected parts of it are indexed • e.g. in a patent categorization application • title, abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention • some parts of a document may be considered more important • e.g. higher weight for the terms in the title
Term selection • a large document collection may contain millions of words -> document vectors would contain millions of dimensions • many algorithms cannot handle high dimensionality of the term space (= large number of terms) • usually only a part of terms are used • how to select terms that are used? • term selection (often called feature selection or dimensionality reduction) methods
Term selection • Goal: select terms that yield the highest effectiveness in the given application • wrapper approach • the reduced set of terms is found iteratively and tested with the application • filtering approach • keep the terms that receive the highest score according to a function that measures the ”importance” of the term for the task
Term selection • Many functions available • document frequency: keep the high frequency terms • stopwords have been already removed • 50% of the words occur only once in the document collection • e.g. remove all terms occurring in at most 3 documents
Term selection functions: document frequency • document frequency is the number of documents in which a term occurs • in our sample, the ranking of terms: • 9 current • 7 project • 4 environment • 3 nuclear • 2 application • 2 area … 2 water • 1 use …
Term selection functions: document frequency • we might now set the threshold to 2 and remove all the words that occur only once • result: 29 words of 118 words (~25%) selected
Term selection: other functions • Information-theoretic term selection functions, e.g. • chi-square • information gain • mutual information • odds ratio • relevancy score
3. Text categorization • Text classification, topic classification/spotting/detection • problem setting: • assume: a predefined set of categories, a set of documents • label each document with one (or more) categories
Text categorization • for instance • Categorizing newspaper articles based on the topic area, e.g. into the following 17 “IPTC” categories: • Arts, culture and entertainment • Crime, law and justice • Disaster and accident • Economy, business and finance • Education • Environmental issue • Health • …
Text categorization • categorization can be hierarchical • Arts, culture and entertainment • archaeology • architecture • bullfighting • festive event (including carnival) • cinema • dance • fashion • ...
Text categorization • ”Bullfighting as we know it today, started in the village squares, and became formalised, with the building of the bullring in Ronda in the late 18th century. From that time,...” • class: • Arts, culture and entertainment • Bullfighting • or both?
Text categorization • Another example: filtering spam • ”Subject: Congratulation! You are selected! • It’s Totally FREE! EMAIL LIST MANAGING SOFTWARE! EMAIL ADDRESSES RETRIEVER from web! GREATEST FREE STUFF!” • two classes only: Spam and Not-spam
Text categorization • Two major approaches: • knowledge engineering -> end of 80’s • manually defined set of rules encoding expert knowledge on how to classify documents under the given gategories • machine learning, 90’s -> • an automatic text classifier is built by learning, from a set of preclassified documents, the characteristics of the categories
Text categorization • Let • D: a domain of documents • C = {c1, …, c|C|} : a set of predefined categories • T = true, F = false • The task is to approximate the unknown target function ’: D x C -> {T,F} by means of a function : D x C -> {T,F}, such that the functions ”coincide as much as possible” • function ’ : how documents should be classified • function : classifier (hypothesis, model…)
We assume... • Categories are just symbolic labels • no additional knowledge of their meaning is available • No knowledge outside of the documents is available • all decisions have to be made on the basis of the knowledge extracted from the documents • metadata, e.g., publication date, document type, source etc. is not used
-> general methods • Methods do not depend on any application-dependent knowledge • but: in operational applications all kind of knowledge can be used (e.g. in spam filtering) • content-based decisions are necessarily subjective • it is often difficult to measure the effectiveness of the classifiers • even human classifiers do not always agree
Single-label, multi-label TC • Single-label text categorization • exactly 1 category must be assigned to each dj D • Multi-label text categorization • any number of categories may be assigned to the same dj D • Special case of single-label: binary • each dj must be assigned either to category ci or to its complement ¬ ci
Single-label, multi-label TC • The binary case (and, hence, the single-label case) is more general than the multi-label • an algorithm for binary classification can also be used for multi-label classification • the converse is not true
Single-label, multi-label TC • in the following, we will use the binary case only: • classification under a set of categories C = set of |C| independent problems of classifying the documents in D under a given category ci, for i = 1, ..., |C|
Hard-categorization vs. ranking categorization • Hard categorization • the classifier answers T or F • Ranking categorization • given a document, the classifier might rank the categories according to their estimated appropriateness to the document • respectively, given a category, the classifier might rank the documents
Machine learning approach • A general inductive process (learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci or ci by a domain expert • from these characteristics the learner extracts the characteristics that a new unseen document should have in order to be classified under ci • supervised learning (= supervised by the knowledge of the training documents)
Machine learning approach • The learner is domain independent • usually available ’off-the-shelf’ • the inductive process is easily repeated, if the set of categories changes • manually classified documents often already available • manual process may exist • if not, it is still easier to manually classify a set of documents than to build and tune a set of rules
Training set, test set, validation set • Initial corpus of manually classified documents • let dj belong to the initial corpus • for each pair <dj, ci> it is known if dj should be filed under ci • positive examples, negative examples of a category
Training set, test set, validation set • The initial corpus is divided into two sets • a training (and validation) set • a test set • the training set is used to build the classifier • the test set is used for testing the effectiveness of the classifier • each document is fed to the classifier and the decision is compared to the manual category
Training set, test set, validation set • The documents in the test set are not used in the construction of the classifier • alternative: k-fold cross-validation • k different classifiers are built by partitioning the initial corpus into k disjoint sets and then iteratively applying the train-and-test approach on pairs, where k-1 sets construct a training set and 1 set is used as a test set • individual results are then averaged
Training set, test set, validation set • Training set can be split to two parts • one part is used for optimising parameters • test which values of parameters yield the best effectiveness • test set and validation set must be kept separate
Inductive construction of classifiers • A ranking classifier for a category ci • definition of a function that, given a document, returns a categorization status value for it, i.e. a number between 0 and 1 • documents are ranked according to their categorization status value
Inductive construction of classifiers • A hard classifier for a category • definition of a function that returns true or false, or • definition of a function that returns a value between 0 and 1, followed by a definition of a threshold • if the value is higher than the threshold -> true • otherwise -> false
Learners • probabilistic classifiers (Naïve Bayes) • decision tree classifiers • decision rule classifiers • regression methods • on-line methods • neural networks • example-based classifiers (k-NN) • support vector machines
Rocchio method • learner • for each category, an explicit profile (or prototypical document) is constructed • the same representation as for the documents • benefit: profile is understandable even for humans