
Web Corpus Construction



Presentation Transcript


  1. Web Corpus Construction Summary of Schäfer & Bildhauer (2013), with techniques applied to LA Times Columnist Corpus Construction and Processing. Kristen Howe

  2. Web Corpora • Static collection of a # of documents downloaded from the web • Intended for work in: • empirical linguistics • computational linguistics • search engine applications

  3. Why use Web Corpora? Problems With Existing Corpora: • unbalanced  not suitable for research on genre or register • limited accessibility or export options • size may be too small • too expensive Some Preexisting Web Corpora: • Leipzig Corpora Collection • http://corpora.informatik.uni-leipzig.de/?dict=eng_news_2008 • available in many languages (over 20) • includes data from newspapers and Wikipedia • sentences shuffled for legal reasons  only useful at word and sentence level • WaCky Corpora • http://wacky.sslmit.unibo.it/doku.php?id=start • Available for English, French, German, Italian • over a billion words • includes data from Wikipedia • POS tagged, lemmatized, dependency parse

  4. Why use Web Corpora?

  5. Collecting Data: Web Crawling • Start with a set of seed URLs • Use links between web pages to discover huge numbers of links • Recursively follow some or all links from downloaded pages Some Limitations: • Deep Web (Unlinked pages, private web, etc.) • Robot Exclusion • Link Rot

  6. National Top Level Domains • TLD: highest level of the domain name (.com, .org, etc) • A National TLD is the label used by a specific country • Examples: • Australia  .au • Estonia  .ee • India  .in • Using National TLDs is a crude way to build a monolingual corpus • For countries with more than one dominant language (e.g. India) language filtering or classification is required • Doesn’t work equally well for all national TLDs especially when the TLD doesn’t have high prestige (e.g. .us)

  7. National TLDs and Dialect Comparison? • Example: use .uk and .ca to compare British and Canadian English • Not necessarily safe to assume documents from a national TLD are representative of a specific dialect  remains an open question • Corpus of English web documents classified according to Google geo-location information (Davies, 2013): • http://corpus2.byu.edu/glowbe/

  8. Crawling: Basic Steps • Collect a set of URLs to start with • Configure the crawler software to accept only desired content: • TLD restrictions • file name patterns • MIME types • file sizes • languages • encodings • Decide on crawler politeness settings (e.g. the # of requests within a certain period of time) • Run the crawler and observe the progress

  9. Crawler Components • Fetcher: massively multi-threaded component which downloads documents corresponding to URLs in the “frontier” • Parser/Harvester: extracts new URLs from web pages • URL Filters: code which discards URLs which are duplicates or don’t conform to certain criteria (e.g. length, robot exclusion) • Frontier: data structures which store, queue, and prioritize URLs and pass them to the “fetcher”
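A minimal sketch of how these components fit together, using only the Python 3 standard library. The seed URL, the ".com" restriction, the delay, and the page limit are placeholder values, and real crawls need far more robust error handling; this illustrates the loop, it is not a substitute for the industrial-strength crawlers on the next slide.

import re
import time
import urllib.request
import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse

SEEDS = ["http://www.example.com/"]   # hypothetical seed URL
ALLOWED_TLD = ".com"                  # crude TLD restriction
DELAY = 2.0                           # politeness: seconds between requests
MAX_PAGES = 100

def crawl(seeds):
    frontier = deque(seeds)           # frontier: queue of URLs still to visit
    seen = set(seeds)
    robots = {}
    pages = []
    while frontier and len(pages) < MAX_PAGES:
        url = frontier.popleft()
        host = urlparse(url).netloc
        if host not in robots:        # URL filter: robot exclusion, checked once per host
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url("http://%s/robots.txt" % host)
            try:
                rp.read()
            except OSError:
                pass
            robots[host] = rp
        if not robots[host].can_fetch("*", url):
            continue
        try:                          # fetcher: download the document
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue                  # link rot, timeouts, etc.
        pages.append((url, html))
        # parser/harvester: extract new URLs, filter them, and queue them
        for link in re.findall(r'''href=["']([^"']+)["']''', html, re.I):
            link = urljoin(url, link)
            if urlparse(link).netloc.endswith(ALLOWED_TLD) and link not in seen:
                seen.add(link)
                frontier.append(link)
        time.sleep(DELAY)             # politeness setting
    return pages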

  10. Freely Available Industrial-Strength Crawlers

  11. Project Summary: Creating a Small Corpus for Authorship Classification Goal: Authorship classification of LA Times columnists. A full-scale web crawler is not necessary; articles can be extracted by using just one seed URL: http://www.latimes.com/local/columnists/#axzz2s85Zw6wx

  12. Project Summary: Building LA Times Corpus I
SEED URL: http://www.latimes.com/local/columnists/#axzz2s85Zw6wx
• Use python library urllib to open the seed URL
• Use a regular expression to find all URLs on the page and store in a list:
url_list=re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I)
• Remove duplicates from the list
link1: http://www.latimes.com/nation/politics/politicsnow/
link2: http://www.latimes.com/entertainment/envelope/cotown/
link3: http://www.latimes.com/news/opinion/la-columnist-jnewton,0,3127679.columnist
… linkx
• Use regular expressions to restrict links to those containing “la-columnist” but not “bio.columnist”
• This gives us a list of links referring to the pages of specific columnists
link1 (Newton): http://www.latimes.com/news/opinion/la-columnist-jnewton,0,3127679.columnist
link2 (Lazarus): http://www.latimes.com/business/la-columnist-dlazarus,0,3677159.columnist
… linkx (Author_Name)
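The fragments above can be combined into one short script. Below is a Python 3 sketch (the slide's own snippets use Python 2's urllib); the seed URL and the "la-columnist" / "bio.columnist" filters come from the slide, the rest is an assumption.

import re
from urllib.request import urlopen

seed = "http://www.latimes.com/local/columnists/#axzz2s85Zw6wx"

# 1. Open the seed URL and harvest every href on the page.
html = urlopen(seed).read().decode("utf-8", "replace")
url_list = re.findall(r'''href=["']([^"']+)["']''', html, re.I)

# 2. Remove duplicates while preserving order (dict keys keep insertion order).
url_list = list(dict.fromkeys(url_list))

# 3. Keep only links to columnist pages:
#    they contain "la-columnist" but not "bio.columnist".
columnist_pages = [u for u in url_list
                   if "la-columnist" in u and "bio.columnist" not in u]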

  13. Project Summary: Building LA Times Corpus II
List of links referring to pages of specific columnists: [banksURL, daumURL, dwyreURL, elliottURL…authorURL]
• Use python library urllib to open all links corresponding to pages of columnists
• Use a regular expression to find all URLs on the page and store in a list:
url_list=re.findall('''href=["'](.[^"']+)["']''', urllib.urlopen(myurl).read(), re.I)
• Remove duplicates from the list
• Use a regular expression to find the author’s last name:
author=re.findall("la-columnist-[a-z]([a-z]+)", newurl)
link1: http://www.latimes.com/sports/hockey/nhl/
link2: http://www.latimes.com/news/opinion/opinionla/la-oe-morrison-abrams-20130612,0,3286143.column
… linkx
• Use regular expressions to restrict links to those containing .column + the author’s last name
• We now have a list of links pertaining to specific articles for each columnist
link1: http://www.latimes.com/features/la-hm-erskine20-2008dec20,0,5884345.column
link2: http://www.latimes.com/news/local/la-hm-erskine3-2008jul03,0,4566804.column
… linkx:

  14. Project Summary: Building LA Times Corpus III
List of links referring to articles by specific columnists: [banks_article1_URL, banks_article2_URL…author_article#_URL]
• Use python library urllib to open each link:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html itemscope itemtype="http://schema.org/NewsArticle" xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta itemprop="dateModified" content="2013-12-12 18:32:50-0800"/>
<meta itemprop="datePublished" content="2013-12-12 18:30:00-0800"/>
<meta http-equiv="Content-Language" content="en-US" />
• Need to remove “boilerplate” and extract clean text…

  15. Project Summary: LA Times Corpus Summary • Used links containing “la-columnist” on the seed page to find pages for each columnist • On each columnist’s page, used links containing “.column” and the author’s last name to extract links for articles • Used urllib to open each link; each article is written to a file, stored in a folder for each specific columnist

  16. Post-Processing (non-linguistic cleanup) • HTML stripping • Character References • e.g. &euro;  € • need translation table to convert references to Unicode • Character Encoding: • UTF-8 is most common • ASCII is comprehensive enough for English texts • Each download should be checked for its encoding and converted if necessary
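A small Python 3 sketch of both cleanup steps: html.unescape resolves named and numeric character references (so a hand-built translation table is not needed when Python's html module is available), and the decoding logic is a simplified assumption that falls back to Latin-1 when UTF-8 fails rather than honouring each page's declared charset.

import html

def to_unicode(raw_bytes):
    # Decode a downloaded page, preferring UTF-8.
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Crude fallback; a real pipeline should honour the declared charset.
        text = raw_bytes.decode("latin-1")
    # "&euro;" and "&#8364;" both become "€".
    return html.unescape(text)

# to_unicode(b"price: &euro;5")  ->  "price: €5"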

  17. Post-Processing (non-linguistic cleanup) De-hyphenation • MERGE: remove the hyphen and concatenate the string: graph- ing  graphing • CONCATENATE: keep the hyphen and concatenate: self- esteem  self-esteem • DASHIFY: insert a space before the hyphen and replace it with a dash: some cases- and interpretation  some cases – and interpretation • NULL: leave as is
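A sketch of the MERGE / CONCATENATE / NULL decision for line-break hyphens (DASHIFY is left out). KNOWN_WORDS is a placeholder for whatever word list or corpus-derived 1-gram list is available, as discussed on the next slide.

import re

KNOWN_WORDS = {"graphing", "self-esteem"}   # placeholder lexicon

def dehyphenate(text, known=KNOWN_WORDS):
    def fix(match):
        left, right = match.group(1), match.group(2)
        merged = left + right                # "graph" + "ing"   -> "graphing"
        hyphened = left + "-" + right        # "self" + "esteem" -> "self-esteem"
        if merged.lower() in known:
            return merged                    # MERGE: drop the hyphen
        if hyphened.lower() in known:
            return hyphened                  # CONCATENATE: keep the hyphen
        return match.group(0)                # NULL: leave the text as is
    # A word, a hyphen, whitespace (often a line break), then another word.
    return re.sub(r"(\w+)-\s+(\w+)", fix, text)

# dehyphenate("graph- ing and self- esteem")  ->  "graphing and self-esteem"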

  18. Post-Processing (non-linguistic cleanup) De-hyphenation Options… • Use an available word list (noise and hapax legomena) • Look at frequencies of 1- and 2-grams bootstrapped from the corpus

  19. Post-Processing (non-linguistic cleanup) Additional Post-Processing Steps: • URLs, emails, addresses, phone #s • remove or replace with a generic tag (e.g. ###-###-####, @email.blank) • Reduction of repetition • Characters: • ????, agreeeeeeees, noooooooo!!! • Depending on the application, remove or standardize (e.g. allow at most 3 repeated characters) • Multiple lines or blocks
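A sketch of the substitutions listed above: URLs, e-mails and phone numbers are replaced with generic tags, and character repetition is capped at three. The tag strings and patterns are illustrative choices, not the project's actual ones.

import re

def normalize(text):
    text = re.sub(r"https?://\S+", "URL", text)                          # URLs
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "@email.blank", text)  # e-mails
    text = re.sub(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b", "###-###-####", text)  # US phone #s
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)                         # "noooooo" -> "nooo"
    return text

# normalize("call 555-123-4567, I agreeeeeeee!!!!")
#   -> "call ###-###-####, I agreee!!!"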

  20. Post-Processing (non-linguistic cleanup) Duplicate Removal: • Jaccard Coefficient: measures the similarity of two sets (e.g. sets of unigrams, bigrams, etc.) • If the similarity is above a certain threshold, the document can be excluded from the corpus
Example using unigrams:
a = set(doc1.split())
b = set(doc2.split())
similarity = float(len(a.intersection(b))) / len(a.union(b))

  21. Post Processing: Boilerplate Removal • Navigational elements (e.g. menus) • Linked content • Addresses, contact information Automatic Boilerplate Removal Features  • Higher precision = cleaner document • Higher recall = more complete document

  22. Post Processing: Boilerplate Removal Software for removing boilerplate:

  23. Post Processing: Language Identification • Two approaches: (1) character n-gram statistics (2) function word statistics • Multilingual documents: n-gram approach can be used to classify blocks within one document • Non-text detection (tag clouds, dictionary lists, lists of products/product advantages): function word method can be used • New research focus: identifying language for very short texts like messages in microblogging or query strings
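A toy sketch of the character n-gram approach: build a trigram frequency profile per language and pick the profile with the largest overlap. The two "training texts" are tiny stand-ins; a real identifier is trained on large per-language samples or uses an off-the-shelf tool.

from collections import Counter

def trigram_profile(text):
    # Character trigram counts, with padding spaces at the edges.
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

PROFILES = {
    "en": trigram_profile("the quick brown fox jumps over the lazy dog and then some"),
    "de": trigram_profile("der schnelle braune fuchs springt über den faulen hund hinweg"),
}

def identify(text, profiles=PROFILES):
    doc = trigram_profile(text)
    def overlap(lang_profile):
        # Shared trigram mass between the document and a language profile.
        return sum(min(doc[g], lang_profile[g]) for g in doc)
    return max(profiles, key=lambda lang: overlap(profiles[lang]))

# identify("the dog jumps")      ->  "en"
# identify("der hund springt")   ->  "de"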

  24. Project Summary: LA Times Corpus Clean-up • Used the Python wrapper for boilerpipe to open the URL, extract the text, and save the text to a folder for each author: https://github.com/ptwobrussell/python-boilerpipe/
Boilerpipe usage:
from boilerpipe.extract import Extractor
extractor = Extractor(extractor="ArticleExtractor", url=myurl)
extracted_text = extractor.getText()

  25. Project Summary: LA Times Corpus Clean-up

  26. Project Summary: LA Times Corpus Clean-up • Total Articles: 2,523 • Example of one extracted article, with boilerplate still present:
Create a custom date range From: HELENE ELLIOTT Anne Pankowski is not used to not making the cut The 19-year-old from Laguna Hills is on the bubble for U.S. women's Olympic hockey team. Two players won't make the squad for Sochi. Comments 0 Anne Pankowski, who just turned 19, is currently the youngest player on the U.S. women's hockey team. (NancieBattaglia / June 24, 2013) By Helene Elliott December 12, 2013, 6:30 p.m. As a natural athlete, Anne Pankowski of Laguna Hills was accustomed to being among the best at any sport she tried. That held true when she followed her older brother John into hockey, first on roller skates and then on ice. She always played with older kids and, because there weren't many girls' teams, she played alongside boys into her freshman year at Santa Margarita Catholic High. "I think there was a lot support. The boys that I always played with were respectful and weren't trying to hit me or do that kind of stuff, so that was good," she said…..

  27. Project Summary: LA Times Corpus Clean-up II Use regular expressions to replace: • Author’s Name • Media Credits • Dates • E-mails ** After initial substitutions, use list of top n-grams to identify other boilerplate words and phrases
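Illustrative substitutions of the kind listed above, in Python 3. The exact patterns used for the LA Times pages are not shown on the slide, so the regexes, the DATE/EMAIL tags, and the default author name here are assumptions based on the example article on slide 26 (the AUTHOR tag does appear in the output on slide 29).

import re

MONTHS = "January|February|March|April|May|June|July|August|September|October|November|December"

def strip_known_boilerplate(text, author="Helene Elliott"):
    text = text.replace(author, "AUTHOR")                                # author's name
    text = re.sub(r"\([^()]*/[^()]*\d{4}\)", "", text)                   # photo/media credits, e.g. "(NancieBattaglia / June 24, 2013)"
    text = re.sub(r"\b(?:%s) \d{1,2}, \d{4}\b" % MONTHS, "DATE", text)   # dates
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b", "EMAIL", text)         # e-mails
    return text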

  28. Project Summary: LA Times Corpus Clean-up II

  29. Project Summary: LA Times Corpus Clean-up II Removing Duplicates
• Common n-grams show that there may be some duplicated articles, e.g. “delay National Zoo panda cam” occurs 43 times when 6-grams are counted
• Problem with boilerplate removal: a “See more stories » ” section remains which includes headlines or descriptions of other articles
• Solution: remove the section between “See more stories” and the rest of the article, which is separated by a newline at the end:
bad_text_block=re.findall("See\smore\sstories(?:.+?)\nX\n(?:.+?)\n", text)
["See more stories \xc2\xbb\nX\nSyria and the perils of proxy war Syria's Assad: Still the wrong choice Moonwalking in Syria What to do about Syria The diplomatic track in Syria To protect Syria's antiquities \xe2\x80\x94 don't buy them A mathematical approach to Syria On foreign policy, a consistently inconsistent president For 'Buck' McKeon, it's Syria or the sequester Obama's bait-and-switch on Syria Treaties \xe2\x80\x94 as American as Ben Franklin In America, not isolationism but skepticism Trying diplomacy in Syria The Syria dilemma: Can global atrocities be ranked? The Putin Doctrine The end of U.S. exceptionalism After Obama's Syria speech, readers call for action Obama, the reluctant warrior on Syria Obama's speech on Syria: Why our analysts are mostly hopeful Securing WMD? It's been done Russian plan just a 'get out of jail free' card for Syria's Assad The road to Damascus How not to deal with Syria John Kerry's me-and-my-big-mouth solution on Syria Five reasons not to attack Syria, and one elegant solution To strike, or not to strike, Syria? A dilemma for Syria's minorities Congress and the power to declare war Don't use U.S. credibility as a reason to attack Syria As Obama hesitates, Israel worries The Syria vote's political stakes Making the case against Syria 'My Dinner with Assad' haunts John Kerry Finally, on Syria, we've rediscovered the Constitution Iraq Syndrome afflicts critics of intelligence on Syria Syria: Regretting the 'red line' Obama and the power to go to war On Syria, Congress should have a say On Syria, a measured response The risk of taking on Syria On Syria, let's be clear: What we're about to do is go to war Striking Syria: Memories of Iraq spark reader skepticism Chemical weapons and Syria: How do you deter a desperate despot? Enough with the phony 'red line' on chemical weapons in Syria Enforcing a 'red line' in Syria Goldberg: Syria's religious war AUTHOR The long haul in Syria AUTHOR Obama plays for time to avoid 'red line' AUTHOR Inching closer to entanglement in Syria AUTHOR Inching toward Syria AUTHOR A ticking clock on Syria AUTHOR A measured U.S. response in Syria\n"]

  30. Project Summary: LA Times Corpus Clean-up II Removing Duplicates • Remove complete duplicates by changing list of articles to a set • Remove near duplicates by using Jaccard Similarity with unigrams and threshold of 0.9: New Document Total: 2,497
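A sketch of how slide 30's two steps might look, assuming articles is a list of cleaned article strings. Converting to a set drops exact duplicates; each remaining article is kept only if its unigram Jaccard similarity to every already-kept article stays below 0.9. The pairwise comparison is quadratic, which is acceptable at this corpus size (about 2,500 documents).

def jaccard(a, b):
    # Jaccard coefficient of two sets, 0.0 for two empty sets.
    union = a | b
    return float(len(a & b)) / len(union) if union else 0.0

def remove_near_duplicates(articles, threshold=0.9):
    kept, kept_sets = [], []
    for text in set(articles):                 # exact duplicates drop out here
        words = set(text.split())
        if all(jaccard(words, other) < threshold for other in kept_sets):
            kept.append(text)
            kept_sets.append(words)
    return kept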

  31. Linguistic Processing: Tokenization Not every instance of white space marks a word boundary: • High degree of lexicalization  Entire sequence can be considered a single token (e.g. European Union) Punctuation • Typically a separate token, except with abbreviations (e.g. Mr.), emoticons (e.g. :-) ), etc.

  32. Linguistic Processing: Tokenization Rule-based: • tokenizers match a number of regular expressions against the input in a particular order • common to use string lists for strings that shouldn’t be split apart (abbreviations, multi-word expressions) • Sentiment-aware tokenizer using regular expressions: • http://sentiment.christopherpotts.net/code-data/happyfuntokenizing.py Machine learning-based: • Regularities of punctuation are extracted automatically from text
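A minimal rule-based tokenizer in the spirit described above: alternatives are tried in a fixed order, and a small list of protected strings (abbreviations, emoticons) keeps them from being split apart. Both lists are illustrative and nowhere near the coverage of the sentiment-aware tokenizer linked above.

import re

PROTECTED = ["Mr.", "Mrs.", "Dr.", "e.g.", ":)", ":-(", "U.S."]

TOKEN_RE = re.compile(
    "|".join(re.escape(p) for p in PROTECTED)   # 1. protected strings first
    + r"|\w+(?:[-']\w+)*"                        # 2. words, incl. internal hyphens/apostrophes
    + r"|[^\w\s]"                                # 3. any other single non-space character
)

def tokenize(text):
    return TOKEN_RE.findall(text)

# tokenize("Mr. Smith isn't happy :) about the U.S. economy.")
#   -> ['Mr.', 'Smith', "isn't", 'happy', ':)', 'about', 'the', 'U.S.', 'economy', '.']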

  33. Linguistic Processing: Noisy Data • faulty punctuation, omitted punctuation, omitted whitespace after punctuation • non-standard orthography: errors or writing conventions used in certain web genres (e.g. dunno, lol) • incomplete cleansing of HTML markup • strings resisting straightforward tokenization • foreign-language material • non-words * noise results in an unrealistic count of lexical types, particularly hapax legomena

  34. Linguistic Processing: Orthographic Normalization • Detect misspelled words • character n-grams • lexicon look-up • Replacement: • Edit distance to a real word fixes most misspellings • Some replacements require looking at context and semantics (e.g. tru has the same edit distance to true and to try) • Approaches: • weight edit options differently (add, remove, replace) • weight according to individual characters (e.g. vowels v. consonants) • give less weight to more frequent errors • use the context of surrounding words (POS and word n-grams) • use domain specific knowledge
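A sketch of lexicon-based replacement by minimum edit distance, with uniform costs. The lexicon is a placeholder; the bullet points above describe how real systems weight operations, characters, error frequencies and context instead of treating every edit equally.

def edit_distance(a, b):
    # Classic Levenshtein distance with unit costs, computed row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

LEXICON = ["true", "try", "through", "tree"]            # placeholder word list

def best_correction(word, lexicon=LEXICON):
    return min(lexicon, key=lambda w: edit_distance(word, w))

# best_correction("tru") returns "true" here only because it comes first in the list;
# "try" is equally close, which is exactly the ambiguity the slide points out.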

  35. Linguistic Processing: POS Tagging • Challenges: • noise • lexical characteristics of web texts (domain specific terms) • emoticons • lexicon gaps • named entities • Reducing errors: • sort unknown words by highest frequency and hand-tag words • extend lexicon with genre specific spellings • simplify tagset (NN, NNS, NP, NPS  N) • retrain tagger on web data • Twitter tagger: http://www.ark.cs.cmu.edu/TweetNLP/

  36. Available Software for Linguistic Post-Processing • TreeTagger (HMM tagger) • RFTagger (POS & morphological analyzer for German, Czech, Slovene, Hungarian) • FreeLing • Stanford Tagger (Maximum Entropy tagger) • Frog (memory based tagger and morphological analyzer) • Ucto (tokenizes Unicode) • Splitta (sentence boundary detection based on Naïve Bayes or SVM) • NLTK • Apache Open NLP

  37. Project Summary: Linguistic Processing • Sentence Boundary Detection • Splitta: https://code.google.com/p/splitta/

  38. Project Summary: Linguistic Processing • Tokenization & POS Tagging • NLTK http://www.nltk.org
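Example NLTK calls for these two steps (recent NLTK versions need the punkt and averaged_perceptron_tagger data packages downloaded once; the exact tags may vary by version).

import nltk
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

sentence = "Anne Pankowski is not used to not making the cut."
tokens = nltk.word_tokenize(sentence)   # tokenization
tagged = nltk.pos_tag(tokens)           # POS tagging
# tagged is roughly [('Anne', 'NNP'), ('Pankowski', 'NNP'), ('is', 'VBZ'), ...]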

  39. Project Summary: Linguistic Processing • Parsing • Apache Open NLP: https://opennlp.apache.org/documentation/1.5.3/manual/opennlp.html • Training Models: http://opennlp.sourceforge.net/models-1.5/

  40. Corpus Evaluation Rough Quality Checks: • Word and sentence length: massive abnormalities in distribution can be detected by visual inspection of plotted data • Duplication: massive duplication of larger n-grams can be indicative of shortcomings in (near-)duplicate detection and problems with boilerplate removal • Project: Both of these quality checks were done prior to linguistic processing
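A sketch of both rough checks in Python: a sentence-length histogram to plot and inspect visually, and counts of large n-grams that recur suspiciously often (the 43 occurrences of "delay National Zoo panda cam" on slide 29 would surface here). The n-gram size and the count threshold are arbitrary.

from collections import Counter

def sentence_length_histogram(sentences):
    # Token-count histogram; plot it (e.g. with matplotlib) and eyeball the shape.
    return Counter(len(s.split()) for s in sentences)

def frequent_ngrams(tokens, n=6, min_count=5):
    # Large n-grams that recur often enough to suggest boilerplate or duplicates.
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [(ng, c) for ng, c in counts.most_common() if c >= min_count]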

  41. Project Summary: Constructing LA Times Corpus • Extracted web pages from LA Times Columnists • Used boilerpipe to remove HTML tags and extract text • Looked at n-gram frequencies to identify extra boilerplate phrases and duplicates • Used regular expression substitution to remove remaining boilerplate • Used Jaccard Similarity to remove near-duplicates • Identified sentence boundaries using splitta • Used NLTK to tokenize and POS tag each document • Parsed each article with Open NLP • All files are stored by author and completing the linguistic processing steps will facilitate extracting features such as: • Frequencies of function words • POS n-grams • Frequencies of syntactic productions (e.g. NP  DT NN) • Average word and sentence length • Lexical Density, etc.
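A sketch of the kind of feature extraction the processed files support, assuming the one-sentence-per-line, whitespace-tokenized format described above; the function-word list is a tiny placeholder, and POS and lexical-density features are omitted.

from collections import Counter

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]

def author_features(path):
    # Read one processed file: one tokenized sentence per line.
    with open(path) as f:
        sentences = [line.split() for line in f if line.strip()]
    tokens = [t.lower() for s in sentences for t in s]
    counts = Counter(tokens)
    total = float(len(tokens))
    features = {"fw_" + w: counts[w] / total for w in FUNCTION_WORDS}   # function-word rates
    features["avg_sentence_len"] = total / len(sentences)               # average sentence length
    features["avg_word_len"] = sum(len(t) for t in tokens) / total      # average word length
    return features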

  42. Homework
• Create a web corpus that can be used for authorship classification. Your corpus should contain text from at least three authors and at least 50 documents for each author.*
• When choosing your authors, authors should be from one genre. This can be any genre EXCEPT newspaper articles. By genre, I mean the type of text (blog, newspaper, forum, novel, tweets, movie reviews, etc.), NOT the topic of the text.
STEPS:
1. Identify three authors whose texts you can extract from a single webpage such as a blog (you may have one seed URL or a separate seed URL for each author)
2. If you are using python, you can use urllib to open your seed URL; otherwise find a way to open webpages in your programming language of choice
3. Write a regular expression to find and restrict additional hyperlinks. These hyperlinks should correspond to pages of specific text files (if the page corresponds to a specific author, repeat step 2)
4. Extract the text from each URL using boilerpipe or another type of software (see slide 22 for a list of software)
5. Use splitta or another software of your choice to detect sentence boundaries
6. Use regular expressions, NLTK, OpenNLP, or other software to tokenize the text
7. Once the text has been extracted, split into sentences, and tokenized, write the text to a file. Each file should contain one sentence per line and each file should be stored in a folder that corresponds to the author.
8. You should make sure that your corpus doesn’t contain duplicates or boilerplate features. One way to do this is to look at n-gram frequencies. If large n-grams occur frequently, the n-gram is probably a boilerplate feature or part of a duplicated text. Remove these by using Jaccard coefficients and/or regular expressions.
* If texts are long (e.g. a novel) you can split the document into documents of at least 20 sentences to achieve the minimum of 50 documents.
** You may use parts of my code on slide 25 for reference.

  43. Corpus Format

  44. Genre Example 1: Literature or Poems from Gutenberg • Extract links from author pages: • http://www.gutenberg.org/ebooks/author/60 (Jules Verne) • Make sure that the same books aren’t being extracted • Make sure that the same language is used for all texts

  45. Genre Example 2: Blogs • Website with lists of top 10 blogs by genre: http://www.blogs.com • Movies: http://www.blogs.com/topten/top-10-film-blogs/ • Birding: http://www.blogs.com/topten/top-10-bird-and-birding-blogs/ • Etc… • Make sure that blogs are written by only one author • Files can be from a single blog genre or multiple genres

  46. Homework • Create a zipped file of your corpus • Turn in a short PowerPoint or PDF to be presented in which you explain: • The genre of your text corpus • The number of authors and files per author • The steps you took to 1) find documents for each author, 2) extract text, 3) split sentence boundaries, 4) tokenize text, and 5) remove duplicates and boilerplate items. • You should include links for any software used

  47. Sources: Schäfer, Roland & Felix Bildhauer (2013). Web corpus construction. In Graeme Hirst (ed.), Synthesis Lectures on Human Language Technologies, Lecture #22. San Rafael, CA: Morgan & Claypool Publishers
