
CSCE 590 Web Scraping Lecture 6



  1. CSCE 590 Web Scraping Lecture 6 • Topics • More Beautiful Soup • Crawling • Crawling by Hand • Readings: • Chapter 3 • January 26, 2017

  2. Overview • Last Time: • Regular Expressions Again • BeautifulSoup findAll revisited • Today: • A little more Beautiful Soup • Starting to Crawl • Finding links • References • Chapter 2 • Chapter 3

  3. Chapter 2 Examples

  4. Example HTML page for examples 3-6 • http://www.pythonscraping.com/pages/page3.html • Structure • html • body • div id=wrapper • h1 Totally Normal • div id=content • table id=giftList • tr • th • … • th • tr id=gift1

  5. Navigation examples • bsObj.body.h1 – finds the first "h1" tag that is a descendant of the body tag • bsObj.div.findAll("img") – finds the first "div" tag, then finds all img tags contained within it
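A minimal runnable sketch of these two navigation calls against the example page above (assuming the page structure is unchanged from the text):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")

    # First h1 tag that is a descendant of body
    print(bsObj.body.h1)

    # First div on the page, then every img tag inside it
    for img in bsObj.div.findAll("img"):
        print(img["src"])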

  6. HTML Div and Span • In HTML, span and div elements are used to define parts of a document so that they are identifiable when no other HTML element is suitable. • While other HTML elements such as p (paragraph), em (emphasis) and so on accurately represent the semantics of the content, • the use of span and div leads to better accessibility for readers and easier maintainability for authors. • Div = Document division, but really just a semantics-less container for changing attributes such as color or associating an image with its caption • Span represents an inline portion of a document, for example words within a sentence. https://en.wikipedia.org/wiki/Span_and_div
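Because span and div are just generic containers, BeautifulSoup treats them like any other tag. A small self-contained sketch (the HTML fragment here is made up for illustration, not taken from the course page):

    from bs4 import BeautifulSoup

    # Hypothetical fragment: a div as a block container, a span as an inline one
    snippet = """
    <div id="caption-box">
      <img src="cat.jpg"/>
      <p>A cat, shown <span class="highlight">mid-yawn</span>.</p>
    </div>
    """
    bsObj = BeautifulSoup(snippet, "html.parser")
    print(bsObj.find("div", {"id": "caption-box"}).get_text())
    print(bsObj.find("span", {"class": "highlight"}).get_text())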

  7. http://www.w3schools.com/html/html5_semantic_elements.asp

  8. Some HTTP Status Codes • https://docs.python.org/3/library/http.html
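A hedged sketch of how those status codes surface in urllib: urlopen raises HTTPError for 4xx/5xx responses, and the exception object carries the numeric code (the URL is just the familiar example page):

    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError

    try:
        html = urlopen("http://www.pythonscraping.com/pages/page3.html")
        print(html.getcode())                      # 200 on success
    except HTTPError as e:
        print("Server returned status", e.code)    # e.g. 404 or 500
    except URLError as e:
        print("Could not reach the server:", e.reason)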

  9. #Chapter 2 – findDescendants.py • # Actually this code finds children • from urllib.request import urlopen • from bs4 import BeautifulSoup • html = urlopen("http://www.pythonscraping.com/pages/page3.html") • bsObj = BeautifulSoup(html, "html.parser") • for child in bsObj.find("table",{"id":"giftList"}).children: • print(child) • Replacing the call to children with a call to descendants catches many more tags
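A quick way to see the difference (same page assumption as above): children yields only the direct children of the table, while descendants walks the whole subtree, so the second count is much larger.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")
    table = bsObj.find("table", {"id": "giftList"})

    # Direct children only: the tr tags (plus whitespace strings)
    print(len(list(table.children)))

    # Every nested tag and string: rows, cells, images, text nodes
    print(len(list(table.descendants)))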

  10. #Chapter 2: 4-findSiblings.py • from urllib.request import urlopen • from bs4 import BeautifulSoup • html = urlopen("http://www.pythonscraping.com/pages/page3.html") • bsObj = BeautifulSoup(html, "html.parser") • for sibling in bsObj.find("table",{"id":"giftList"}).tr.next_siblings: • print(sibling) • Finds the table, then its first "tr" tag (the title row), then repeatedly moves to each following row and prints it; next_siblings returns only the rows after the starting one, so the title row itself is skipped
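The sibling navigation also works in reverse; a short sketch (same page assumption) that takes the last row returned by findAll and walks backwards with previous_siblings:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")

    # findAll returns rows in document order; start from the last one
    lastRow = bsObj.find("table", {"id": "giftList"}).findAll("tr")[-1]
    for sibling in lastRow.previous_siblings:
        print(sibling)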

  11. Code snippets from Text • All of the code from the text is available online (again): • https://github.com/REMitchell/python-scraping • Mitchell, Ryan. Web Scraping with Python: Collecting Data from the Modern Web (Kindle Location 103). O'Reilly Media. Kindle Edition. • In class we will start omitting essential parts of the code in order to shorten it for presentation (of course it will not run as shown!) • Imports – the usual suspects • Exception handling!! • The code in the text is not production code because of things like the exception handling! Web Scraping with Python: Collecting Data from the Modern Web by Ryan Mitchell

  12. Chapter 2: 5-findParents.py • print(bsObj.find("img",{"src":"../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
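Spelled out with the imports it needs (a sketch; the navigation assumes the price cell still sits immediately before the image cell in each row of the gift table): find the img tag, step up to its parent td, then over to the td just before it, which holds the price.

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")

    img = bsObj.find("img", {"src": "../img/gifts/img1.jpg"})
    priceCell = img.parent.previous_sibling    # the td before the image's td
    print(priceCell.get_text())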

  13. Chapter 2: 6-regularExpressions.py • … • images = bsObj.findAll("img", {"src":re.compile("\.\.\/img\/gifts/img.*\.jpg")}) • for image in images: • print(image["src"])

  14. Chapter 2: 7-lambdaExpressions.py • … • tags = bsObj.findAll(lambda tag: len(tag.attrs) == 2) • for tag in tags: • print(tag)
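findAll accepts any function that takes a tag and returns a boolean, not just this one. One more hedged example in the same spirit (the filter here is chosen for illustration):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")

    # Every tag that carries an id attribute, whatever its value
    for tag in bsObj.findAll(lambda tag: tag.has_attr("id")):
        print(tag.name, tag.attrs["id"])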

  15. Accessing Attributes • myTag.attrs – returns the dictionary of attributes and values for myTag • So • myTag.attrs['href'] would return the value of the href attribute of myTag
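Since attrs is an ordinary Python dictionary, the usual dict idioms apply; a small sketch using .get so that a missing attribute returns None instead of raising KeyError:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    html = urlopen("http://www.pythonscraping.com/pages/page3.html")
    bsObj = BeautifulSoup(html, "html.parser")

    for tag in bsObj.findAll("img"):
        print(tag.attrs)                 # full attribute dictionary
        print(tag.attrs.get("src"))      # value of src, or None if absent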

  16. Chapter 3: Crawling • Each web document is the root node of a tree. • Well actually a graph • Well actually a multigraph • Well actually a multigraph with loops • 6-degrees of Kevin Bacon example (Wikipedia style) • Find the shortest path of Wikipedia links from a given page to the page for Kevin Bacon

  17. How do you want to crawl? • First we must limit the search somehow. Why? • Depth First • Breadth First • Random

  18. dfs • #!/usr/bin/env python3 • # -*- coding: utf-8 -*- • """ • Created on Thu Jan 26 06:47:35 2017 • @author: matthews • """ • def dfs(graph, vertex, path = []): • path.append(vertex) • for newv in graph[vertex]: • if newv not in path: • path = dfs(graph, newv, path) • return path • graph = {1: [4, 5], • 2: [1, 3], • 3: [1, 2], • 4: [1, 2, 3], • 5: [2, 4, 6], • 6: [2, 5]} • print(dfs(graph, 1))
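For the Kevin Bacon shortest-path idea, breadth-first search is the natural fit, since the first time BFS reaches a page it has found a shortest link path to it. A sketch of a BFS counterpart over the same toy graph (this is an added illustration, not part of the lecture code):

    from collections import deque

    def bfs(graph, start, goal):
        # Queue of paths; the first path that reaches goal is a shortest one
        queue = deque([[start]])
        visited = {start}
        while queue:
            path = queue.popleft()
            vertex = path[-1]
            if vertex == goal:
                return path
            for newv in graph[vertex]:
                if newv not in visited:
                    visited.add(newv)
                    queue.append(path + [newv])
        return None

    graph = {1: [4, 5], 2: [1, 3], 3: [1, 2],
             4: [1, 2, 3], 5: [2, 4, 6], 6: [2, 5]}
    print(bfs(graph, 1, 6))    # [1, 5, 6]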

  19. Chapter 3: findLinks.py • from urllib.request import urlopen • from bs4 import BeautifulSoup • html = urlopen("http://en.wikipedia.org/wiki/Kevin_Bacon") • bsObj = BeautifulSoup(html) • for link in bsObj.findAll("a"): • if 'href' in link.attrs: • print(link.attrs['href'])

  20. Good Links – Bad links • Bad links • //wikimediafoundation.org/wiki/Privacy_policy • //en.wikipedia.org/wiki/Wikipedia:Contact_us • /wiki/Category:Articles_with_statements_from_April_2014 • /wiki/Talk:Kevin_Bacon • So Good links have three things in common: • They reside within the div with the id set to bodyContent • The URLs do not contain colons • The URLs begin with /wiki/

  21. Refined inner loop • for link in bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")): • if 'href' in link.attrs: • print(link.attrs['href'])

  22. Chapter 3: 1-getWikiLinks.py • #Web Scraping with Python by Ryan Mitchell • #Chapter 3: 1-getWikiLinks.py • from urllib.request import urlopen • from bs4 import BeautifulSoup • import datetime • import random • import re • (continued on next slide; henceforth the imports, etc. will be left off)

  23. random.seed(datetime.datetime.now()) • def getLinks(articleUrl): • html = urlopen("http://en.wikipedia.org"+articleUrl) • bsObj = BeautifulSoup(html, "html.parser") • return bsObj.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")) • links = getLinks("/wiki/Kevin_Bacon") • while len(links) > 0: • newArticle = links[random.randint(0, len(links)-1)].attrs["href"] • print(newArticle) • links = getLinks(newArticle)

  24. #Chapter 3: 2-crawlWikipedia.py • … • pages = set() • def getLinks(pageUrl): • global pages • html = urlopen("http://en.wikipedia.org"+pageUrl) • bsObj = BeautifulSoup(html, "html.parser") • try: • print(bsObj.h1.get_text()) • print(bsObj.find(id="mw-content-text").findAll("p")[0]) • print(bsObj.find(id="ca-edit").find("span").find("a").attrs['href']) • except AttributeError: • print("This page is missing something! No worries though!")

  25. for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")): • if 'href' in link.attrs: • if link.attrs['href'] not in pages: • #We have encountered a new page • newPage = link.attrs['href'] • print("----------------\n"+newPage) • pages.add(newPage) • getLinks(newPage) • getLinks("")
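One caveat: getLinks calls itself for every new page it finds, and Python's default recursion limit (1000 frames) will eventually stop a crawl of a site as large as Wikipedia. A hedged sketch of the same crawl written iteratively with an explicit to-visit list (names here are chosen for illustration):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import re

    pages = set()

    def crawl(startUrl):
        toVisit = [startUrl]        # explicit stack instead of recursion
        while toVisit:
            pageUrl = toVisit.pop()
            html = urlopen("http://en.wikipedia.org" + pageUrl)
            bsObj = BeautifulSoup(html, "html.parser")
            for link in bsObj.findAll("a", href=re.compile("^(/wiki/)")):
                newPage = link.attrs.get("href")
                if newPage is not None and newPage not in pages:
                    print(newPage)
                    pages.add(newPage)
                    toVisit.append(newPage)

    crawl("/wiki/Kevin_Bacon")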

  26. Surface web vs Deep Web • Surface web – the part Google sees • Deep Web – the part Google doesn't see • e.g. content behind forms, etc. • The Dark Web or Darknet • the Dark Web is a collection of websites that are publicly visible, yet hide the IP addresses of the servers that run them. • That means anyone can visit a Dark Web site, but it can be very difficult to figure out where they're hosted, or by whom. • http://www.wired.com/2014/11/hacker-lexicon-whats-dark-web/

  27. #Chapter 3: 3-crawlSite.py • pages = set() • random.seed(datetime.datetime.now()) • #Retrieves a list of all Internal links found on a page • def getInternalLinks(bsObj, includeUrl): • internalLinks = [] • #Finds all links that begin with a "/" • for link in bsObj.findAll("a", href=re.compile("^(/|.*"+includeUrl+")")): • if link.attrs['href'] is not None: • if link.attrs['href'] not in internalLinks: • internalLinks.append(link.attrs['href']) • return internalLinks

  28. 3-crawlSite.py continued • #Retrieves a list of all external links found on a page • def getExternalLinks(bsObj, excludeUrl): • externalLinks = [] • #Finds all links that start with "http" or "www" that do • #not contain the current URL • for link in bsObj.findAll("a", href=re.compile("^(http|www)((?!"+excludeUrl+").)*$")): • if link.attrs['href'] is not None: • if link.attrs['href'] not in externalLinks: • externalLinks.append(link.attrs['href']) • return externalLinks

  29. 3-crawlSite.py continued • def splitAddress(address): • addressParts = address.replace("http://", "").split("/") • return addressParts • def getRandomExternalLink(startingPage): • html = urlopen(startingPage) • bsObj = BeautifulSoup(html, "html.parser") • externalLinks = getExternalLinks(bsObj, splitAddress(startingPage)[0]) • if len(externalLinks) == 0: • internalLinks = getInternalLinks(startingPage) • return getNextExternalLink(internalLinks[random.randint(0, • len(internalLinks)-1)]) • else: • return externalLinks[random.randint(0, len(externalLinks)-1)] • (Note: as shown, the zero-external-links branch calls getInternalLinks with the wrong arguments and an undefined getNextExternalLink; the urlparse-based version on slides 31–33 avoids both problems.)

  30. 3-crawlSite.py continued • def followExternalOnly(startingSite): • externalLink = getRandomExternalLink("http://oreilly.com") • print("Random external link is: "+externalLink) • followExternalOnly(externalLink) • followExternalOnly("http://oreilly.com") • (Note: as shown, the startingSite argument is ignored in favor of the hard-coded URL; the version on slide 34 passes startingSite through.)

  31. # Chapter3: 4-getExternalLinks.py • pages = set() • random.seed(datetime.datetime.now()) • #Retrieves a list of all Internal links found on a page • def getInternalLinks(bsObj, includeUrl): • includeUrl = urlparse(includeUrl).scheme+"://"+urlparse(includeUrl).netloc • internalLinks = [] • #Finds all links that begin with a "/" • for link in bsObj.findAll("a", href=re.compile("^(/|.*"+includeUrl+")")): • if link.attrs['href'] is not None: • if link.attrs['href'] not in internalLinks: • if(link.attrs['href'].startswith("/")): • internalLinks.append(includeUrl+link.attrs['href']) • else: • internalLinks.append(link.attrs['href']) • return internalLinks

  32. #Retrieves a list of all external links found on a page • def getExternalLinks(bsObj, excludeUrl): • externalLinks = [] • #Finds all links that start with "http" or "www" that do • #not contain the current URL • for link in bsObj.findAll("a", href=re.compile( • "^(http|www)((?!"+excludeUrl+").)*$")): • if link.attrs['href'] is not None: • if link.attrs['href'] not in externalLinks: • externalLinks.append(link.attrs['href']) • return externalLinks

  33. def getRandomExternalLink(startingPage): • html = urlopen(startingPage) • bsObj = BeautifulSoup(html, "html.parser") • externalLinks = getExternalLinks(bsObj, urlparse(startingPage).netloc) • if len(externalLinks) == 0: • print("No external links, looking around the site for one") • domain = urlparse(startingPage).scheme+"://"+urlparse(startingPage).netloc • internalLinks = getInternalLinks(bsObj, domain) • return getRandomExternalLink(internalLinks[random.randint(0,len(internalLinks)-1)]) • else: • return externalLinks[random.randint(0, len(externalLinks)-1)]

  34. def followExternalOnly(startingSite): • externalLink = getRandomExternalLink(startingSite) • print("Random external link is: "+externalLink) • followExternalOnly(externalLink) • followExternalOnly("http://oreilly.com")

  35. # Chapter3: 5-getAllExternalLinks.py • #Collects a list of all external URLs found on the site • allExtLinks = set() • allIntLinks = set() • def getAllExternalLinks(siteUrl): • html = urlopen(siteUrl) • domain = urlparse(siteUrl).scheme+"://"+urlparse(siteUrl).netloc • bsObj = BeautifulSoup(html, "html.parser") • internalLinks = getInternalLinks(bsObj,domain) • externalLinks = getExternalLinks(bsObj,domain)

  36. for link in externalLinks: • if link not in allExtLinks: • allExtLinks.add(link) • print(link) • for link in internalLinks: • if link not in allIntLinks: • allIntLinks.add(link) • getAllExternalLinks(link) • followExternalOnly("http://oreilly.com") • allIntLinks.add("http://oreilly.com") • getAllExternalLinks("http://oreilly.com")

  37. CSCE 590 HW 2 - Regular expressions • Due Jan 29 Sunday night 11:55PM • 1) Give regular expressions that denote the languages: • a) { strings x such that x starts with 'aa' followed by any number of b's and c's and then ends in an 'a' } • b) Phone numbers with optional area codes and optional extensions of the form " ext 432". • c) Email addresses • d) A Python function definition that just has pass as the body • e) A link in a web page • 2) What languages (sets of strings that match the re) are denoted by the following regular expressions: • a) (a|b)[cde]{3,4}b • b) \w+\W+ • c) \d{3}-\d{2}-\d{4}

  38. 3) Give a regular expression that extracts the "login" for a USC student from their email address "login@email.sc.edu" • (after the match one could use login=match.group(1)) • 4) Write a Python program that processes a file line by line and cleans it by removing (re.sub) • social security numbers (replacing with <social-security>) • email addresses (replacing with "<email>") • phone numbers (replacing with <phone-number>) • For extra credit, instead of removing Social Security numbers, keep the number but swap its first three digits with its last three, i.e. 123-45-6789 becomes 789-45-6123.
