02 Web Scraping

Requests: HTTP for Humans


Requests is an elegant and simple HTTP library for Python, built for human
beings.

https://docs.python-requests.org/en/latest/index.html
Requests allows you to send HTTP/1.1 requests extremely easily. There’s no need
to manually add query strings to your URLs, or to form-encode your POST data.
Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.
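As a hedged sketch of what that means in practice (httpbin.org is used below only as a convenient echo service; it is an assumption, not something this lab requires):

import requests

# Query strings are encoded for you: this fetches .../get?q=web+scraping&page=1
response = requests.get("https://httpbin.org/get",
                        params={"q": "web scraping", "page": 1})

# Form data is encoded for you (application/x-www-form-urlencoded)
response = requests.post("https://httpbin.org/post",
                         data={"user": "alice", "lang": "python"})

# A Session reuses the underlying connection (keep-alive / pooling via urllib3)
with requests.Session() as session:
    for _ in range(3):
        session.get("https://httpbin.org/get")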
Lab 1: Get HTML Page
import requests [as xxx]

● get
● post
● put
● delete
Response
● cookies
● headers
● json(**kwargs)
● status_code
● text
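A quick sketch exercising the Response fields listed above (the JSON endpoint at httpbin.org is an illustrative assumption, not part of the lab):

import requests

response = requests.get("https://httpbin.org/json")
print(response.status_code)               # e.g. 200
print(response.headers["Content-Type"])   # e.g. 'application/json'
print(response.cookies)                   # a RequestsCookieJar
data = response.json()                    # parses the body as JSON
print(data)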
import requests

response = requests.get("https://pypi.org/project/requests/")
print(response.text)
Result
Beautiful Soup
But a raw HTML string alone does not help much!

We may want to parse the result further!

Let’s check Beautiful Soup.


Beautiful Soup is a Python library for pulling data out of HTML and XML files. It
works with your favorite parser to provide idiomatic ways of navigating, searching,
and modifying the parse tree. It commonly saves programmers hours or days of
work.

https://beautiful-soup-4.readthedocs.io/en/latest/
Parse
soup = BeautifulSoup(html_doc, 'html.parser')
import requests
from bs4 import BeautifulSoup

response = requests.get("https://pypi.org/project/requests/")
# pass an explicit parser to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
Searching the tree
Beautiful Soup supports CSS selectors (the same selector style popularized by jQuery) via the select() method.

https://www.w3schools.com/jquery/jquery_selectors.asp
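For instance, a small sketch on a made-up document (the ids and classes here are illustrative assumptions):

from bs4 import BeautifulSoup

html_doc = """
<div id="main">
  <p class="note">first</p>
  <p class="note">second</p>
  <a href="https://example.com">link</a>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select("#main"))      # by id
print(soup.select("p.note"))     # tag + class
print(soup.select("a[href]"))    # tag with a given attribute
print(soup.select("div > p"))    # direct children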
Lab 2: Extract Codes from the Page
Tag
A Tag object corresponds to an XML or HTML tag in the original document:
Name

Every tag has a name, accessible as .name:


Attributes

A tag may have any number of attributes. The tag <b id="boldest"> has an
attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating
the tag like a dictionary:
text

The embedded text of the tag.

tag.text
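Putting name, attributes, and text together; a minimal sketch reusing the <b id="boldest"> tag from above:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b id="boldest">extremely bold</b>', 'html.parser')
tag = soup.b

print(tag.name)    # 'b'
print(tag['id'])   # 'boldest' -- dictionary-style attribute access
print(tag.attrs)   # {'id': 'boldest'}
print(tag.text)    # 'extremely bold'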
import requests
from bs4 import BeautifulSoup

response = requests.get("https://pypi.org/project/requests/")
soup = BeautifulSoup(response.text, 'html.parser')
pres = soup.select("pre")
for pre in pres:
    print(pre.text)
Lab 3: Crawl the Content of a Site
urllib — URL handling modules
urllib is a package that collects several modules for working with URLs:

● urllib.request for opening and reading URLs


● urllib.error containing the exceptions raised by urllib.request
● urllib.parse for parsing URLs
● urllib.robotparser for parsing robots.txt files
urllib.parse — Parse URLs into components
This module defines a standard interface to break Uniform Resource Locator
(URL) strings up into components (addressing scheme, network location, path, etc.),
to combine the components back into a URL string, and to convert a “relative
URL” to an absolute URL given a “base URL.”
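A quick sketch of that breakdown with urlparse (the query and fragment here are illustrative):

from urllib.parse import urlparse

parts = urlparse("https://pypi.org/project/requests/?page=2#history")
print(parts.scheme)    # 'https'    -- addressing scheme
print(parts.netloc)    # 'pypi.org' -- network location
print(parts.path)      # '/project/requests/'
print(parts.query)     # 'page=2'
print(parts.fragment)  # 'history'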
urllib.parse.urljoin(base, url, allow_fragments=True)
Construct a full (“absolute”) URL by combining a “base URL” (base) with another
URL (url). Informally, this uses components of the base URL, in particular the
addressing scheme, the network location and (part of) the path, to provide missing
components in the relative URL. For example:
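First, urljoin on its own (the relative segment below is just an illustrative value):

from urllib.parse import urljoin

base = "https://pypi.org/project/requests/"
print(urljoin(base, "2.31.0/"))
# -> https://pypi.org/project/requests/2.31.0/  (relative path resolved)
print(urljoin(base, "https://example.com/other"))
# -> https://example.com/other  (already absolute, so returned as-is)

The same idea applied to every link on the page: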
import requests
from bs4 import BeautifulSoup
import urllib.parse

response = requests.get("https://pypi.org/project/requests/")
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.select("a")
for link in links:
    if link.has_attr('href'):
        # resolve relative hrefs against the page URL
        print("\t" + urllib.parse.urljoin("https://pypi.org/project/requests/", link['href']))
Wait a minute!

Most of the hrefs point outside this site!

Perhaps we should exclude them.

import requests
from bs4 import BeautifulSoup
import urllib.parse

response = requests.get("https://pypi.org/project/requests/")
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.select("a")
for link in links:
    if link.has_attr('href'):
        absUrl = urllib.parse.urljoin("https://pypi.org/project/requests/", link['href'])
        # keep only links that stay inside this project's pages
        if absUrl.startswith("https://pypi.org/project/requests/"):
            print(absUrl)
You may want to further exclude these:
import re

# Example pattern: match 'test-' followed by digits (and nothing else)
regex = re.compile(r'test-\d+')
correct_string = 'test-251'
wrong_string = 'test-123x'
if regex.fullmatch(correct_string):
    print('Matching correct string.')

# match() would also accept 'test-123x', because it only anchors at the
# start of the string; fullmatch() requires the entire string to match.
if regex.fullmatch(wrong_string):
    print('Matching wrong string.')
https://docs.python.org/3/library/re.html
import requests
from bs4 import BeautifulSoup
import urllib.parse
import re

response = requests.get("https://pypi.org/project/requests/")
soup = BeautifulSoup(response.text, 'html.parser')

# matches links to individual release pages (path ends in a version number)
p = re.compile(r'https://pypi\.org/project/requests/[\d.]+')

links = soup.select("a")
for link in links:
    if link.has_attr('href'):
        absUrl = urllib.parse.urljoin("https://pypi.org/project/requests/", link['href'])
        if absUrl.startswith("https://pypi.org/project/requests/"):
            if not p.match(absUrl):
                print(absUrl)
Lab 4: Extract Content from All Pages
If you are considering copy-pasting this code once for every page, don’t:
Refactor
Recursion
# (imports from the previous slides are assumed)
def extractPageContent(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    p = re.compile(r'https://pypi\.org/project/requests/[\d.]+')
    links = soup.select("a")
    for link in links:
        if link.has_attr('href'):
            absUrl = urllib.parse.urljoin("https://pypi.org/project/requests/", link['href'])
            if absUrl.startswith("https://pypi.org/project/requests/"):
                if not p.match(absUrl):
                    print(absUrl)
                    # recurse into every in-site link we find
                    extractPageContent(absUrl)
This will never stop. Why? Pages link back to pages we have already visited, so the recursion keeps revisiting the same URLs.
linkSet = set()

def extractPageContent(url):
    if url in linkSet:
        # skip links that have already been parsed
        return
    linkSet.add(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... the rest of the crawl loop continues as on the previous slide
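
To start the crawl, call the function once on the root page (this kick-off call is implied by the slides rather than shown):

extractPageContent("https://pypi.org/project/requests/")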
