03 Web Scraping
https://docs.python-requests.org/en/latest/index.html
Requests allows you to send HTTP/1.1 requests extremely easily. There’s no need
to manually add query strings to your URLs, or to form-encode your POST data.
Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.
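A quick illustration of both points (httpbin.org is just a stand-in echo server for testing; any endpoint would do):

import requests

# the query string is built from the params dict:
# -> https://httpbin.org/get?q=python&page=2
r = requests.get("https://httpbin.org/get", params={"q": "python", "page": 2})

# the POST body is form-encoded from the data dict automatically
r = requests.post("https://httpbin.org/post", data={"name": "Alice"})
print(r.status_code)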
Lab 1: Get an HTML Page
import requests  (optionally aliased: import requests as xxx)
One function per HTTP method:
● get
● post
● put
● delete
Response object attributes (see the sketch after this list):
● cookies
● headers
● json(**kwargs)
● status_code
● text
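A minimal sketch of these attributes in action (again using httpbin.org as a test endpoint; any JSON API would do):

import requests

response = requests.get("https://httpbin.org/get", params={"q": "python"})
print(response.status_code)              # 200
print(response.headers["Content-Type"])  # application/json
print(response.cookies)                  # a CookieJar (empty here)
print(response.json()["args"])           # {'q': 'python'} -- parsed JSON body
print(response.text[:60])                # the raw body as a string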
import requests

response = requests.get("https://pypi.org/project/requests/")
print(response.text)
Result: the raw HTML of the page is printed.
Beautiful Soup
But a raw HTML string by itself does not help much!
https://beautiful-soup-4.readthedocs.io/en/latest/
Parse
soup = BeautifulSoup(html_doc, 'html.parser')
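A runnable version of that line; html_doc stands for any HTML string:

from bs4 import BeautifulSoup

html_doc = "<html><body><p>Hello, soup!</p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.p.text)  # Hello, soup!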
import requests
from bs4 import BeautifulSoup

response = requests.get("https://pypi.org/project/requests/")
# name the parser explicitly; omitting it triggers a warning
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
Searching the tree
Beautiful Soup supports CSS selectors (the same selector syntax jQuery uses) via the select method.
https://www.w3schools.com/jquery/jquery_selectors.asp
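A few selector examples (the tag, class, and id names here are made up for illustration):

from bs4 import BeautifulSoup

html = '<div id="main"><p class="note">hi</p><pre>code</pre></div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.select("pre"))        # by tag name
print(soup.select(".note"))      # by class
print(soup.select("#main"))      # by id
print(soup.select("div > pre"))  # direct-child combinator, jQuery style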
Lab 2: Extract Code Snippets from the Page
Tag
A Tag object corresponds to an XML or HTML tag in the original document:
Name
Every tag has a name, accessible as tag.name.
Attributes
A tag may have any number of attributes. The tag <b id="boldest"> has an
attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating
the tag like a dictionary:
text
The text attribute gives the human-readable text inside a tag:
tag.text
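Putting name, attributes, and text together on the example tag from above:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b id="boldest">bold text</b>', 'html.parser')
tag = soup.b
print(tag.name)   # b
print(tag['id'])  # boldest -- dictionary-style attribute access
print(tag.text)   # bold text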
import requests
from bs4 import BeautifulSoup

response = requests.get("https://pypi.org/project/requests/")
soup = BeautifulSoup(response.text, 'html.parser')
# every <pre> block on the page holds a code snippet
pres = soup.select("pre")
for pre in pres:
    print(pre.text)
Lab 3: Crawl the Content of a Site
urllib — URL handling modules
urllib is a package that collects several modules for working with URLs:
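For this lab we only need urllib.parse.urljoin, which resolves a possibly relative href against the page it was found on (the paths below are illustrative):

import urllib.parse

base = "https://pypi.org/project/requests/"
print(urllib.parse.urljoin(base, "2.31.0/"))
# https://pypi.org/project/requests/2.31.0/
print(urllib.parse.urljoin(base, "/help/"))
# https://pypi.org/help/
print(urllib.parse.urljoin(base, "https://example.com/"))
# absolute URLs are returned unchanged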
import requests
import urllib.parse
from bs4 import BeautifulSoup

response = requests.get("https://pypi.org/project/requests/")
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.select("a")
for link in links:
    if link.has_attr('href'):
        # turn relative hrefs into absolute URLs
        absUrl = urllib.parse.urljoin("https://pypi.org/project/requests/", link['href'])
        # keep only links that stay inside the site we are crawling
        if absUrl.startswith("https://pypi.org/project/requests/"):
            print(absUrl)
you may want to further exclude some of these (the version-specific URLs):
import re

# Example pattern: 'test-' followed by digits, and nothing else after
regex = re.compile(r'test-\d+')

correct_string = 'test-251'
wrong_string = 'test-123x'

# use fullmatch: plain match() only anchors at the start of the string,
# so it would also accept 'test-123x' (matching the 'test-123' prefix)
if regex.fullmatch(correct_string):
    print('Matching correct string.')
if regex.fullmatch(wrong_string):
    print('Matching wrong string.')
https://docs.python.org/3/library/re.html
import requests
import urllib.parse
import re
from bs4 import BeautifulSoup

response = requests.get("https://pypi.org/project/requests/")
soup = BeautifulSoup(response.text, 'html.parser')
# matches the version-specific pages, e.g. .../project/requests/2.31.0/
p = re.compile(r'https://pypi\.org/project/requests/[\d.]+')
links = soup.select("a")
for link in links:
    if link.has_attr('href'):
        absUrl = urllib.parse.urljoin("https://pypi.org/project/requests/", link['href'])
        if absUrl.startswith("https://pypi.org/project/requests/"):
            if not p.match(absUrl):
                print(absUrl)
Lab 4: Extract Content from All Pages
if you are considering copy-pasting this code once per page, don't:
Refactor
Recursion
def extractPageContent(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    p = re.compile(r'https://pypi\.org/project/requests/[\d.]+')
    links = soup.select("a")
    for link in links:
        if link.has_attr('href'):
            # resolve relative hrefs against the page they came from
            absUrl = urllib.parse.urljoin(url, link['href'])
            if absUrl.startswith("https://pypi.org/project/requests/"):
                if not p.match(absUrl):
                    print(absUrl)
                    extractPageContent(absUrl)
this will never stop, why? (pages link back to each other, so the recursion keeps revisiting the same URLs)
linkSet = set()

def extractPageContent(url):
    if url in linkSet:
        # skip links that have already been parsed
        return
    linkSet.add(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')