02 Web Scraping

Requests: HTTP for Humans


Requests is an elegant and simple HTTP library for Python, built for human
beings.

https://docs.python-requests.org/en/latest/index.html
Requests allows you to send HTTP/1.1 requests extremely easily. There’s no need
to manually add query strings to your URLs, or to form-encode your POST data.
Keep-alive and HTTP connection pooling are 100% automatic, thanks to urllib3.
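As a hedged sketch of what that means in practice (httpbin.org is used below only as a convenient echo service; it is an assumption, not something this lab requires):

import requests

# Query strings are encoded for you: this fetches .../get?q=web+scraping&page=1
response = requests.get("https://httpbin.org/get",
                        params={"q": "web scraping", "page": 1})

# Form data is encoded for you (application/x-www-form-urlencoded)
response = requests.post("https://httpbin.org/post",
                         data={"user": "alice", "lang": "python"})

# A Session reuses the underlying connection (keep-alive / pooling via urllib3)
with requests.Session() as session:
    for _ in range(3):
        session.get("https://httpbin.org/get")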
Lab 1: Get HTML Page
import requests [as xxx]

● get
● post
● put
● delete
Response
● cookies
● headers
● json(**kwargs)
● status_code
● text
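A quick sketch exercising the Response fields listed above (the JSON endpoint at httpbin.org is an illustrative assumption, not part of the lab):

import requests

response = requests.get("https://httpbin.org/json")
print(response.status_code)               # e.g. 200
print(response.headers["Content-Type"])   # e.g. 'application/json'
print(response.cookies)                   # a RequestsCookieJar
data = response.json()                    # parses the body as JSON
print(data)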
import requests

response = requests.get("https://pypi.org/project/requests/")
print(response.text)
Result
Beautiful Soup
But a raw HTML string alone does not help much!

We may want to parse the result further!

Let’s check Beautiful Soup.


Beautiful Soup is a Python library for pulling data out of HTML and XML files. It
works with your favorite parser to provide idiomatic ways of navigating, searching,
and modifying the parse tree. It commonly saves programmers hours or days of
work.

https://beautiful-soup-4.readthedocs.io/en/latest/
Parse
soup = BeautifulSoup(html_doc, 'html.parser')
import requests
from bs4 import BeautifulSoup

response = requests.get("https://pypi.org/project/requests/")
# pass an explicit parser to avoid the "no parser was explicitly specified" warning
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
Searching the tree
Beautiful Soup supports CSS selectors (the same selector style popularized by jQuery) via the select() method.

https://www.w3schools.com/jquery/jquery_selectors.asp
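For instance, a small sketch on a made-up document (the ids and classes here are illustrative assumptions):

from bs4 import BeautifulSoup

html_doc = """
<div id="main">
  <p class="note">first</p>
  <p class="note">second</p>
  <a href="https://example.com">link</a>
</div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.select("#main"))      # by id
print(soup.select("p.note"))     # tag + class
print(soup.select("a[href]"))    # tag with a given attribute
print(soup.select("div > p"))    # direct children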
Lab 2: Extract Codes from the Page
Tag
A Tag object corresponds to an XML or HTML tag in the original document:
Name

Every tag has a name, accessible as .name:


Attributes

A tag may have any number of attributes. The tag <b id="boldest"> has an
attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating
the tag like a dictionary:
text

The embedded text of the tag.

tag.text
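Putting name, attributes, and text together; a minimal sketch reusing the <b id="boldest"> tag from above:

from bs4 import BeautifulSoup

soup = BeautifulSoup('<b id="boldest">extremely bold</b>', 'html.parser')
tag = soup.b

print(tag.name)    # 'b'
print(tag['id'])   # 'boldest' -- dictionary-style attribute access
print(tag.attrs)   # {'id': 'boldest'}
print(tag.text)    # 'extremely bold'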
import requests
from bs4 import BeautifulSoup

response = requests.get("https://pypi.org/project/requests/")
soup = BeautifulSoup(response.text, 'html.parser')
pres = soup.select("pre")
for pre in pres:
    print(pre.text)
Lab 3: Crawl the Content of a Site
urllib — URL handling modules
urllib is a package that collects several modules for working with URLs:

● urllib.request for opening and reading URLs


● urllib.error containing the exceptions raised by urllib.request
● urllib.parse for parsing URLs
● urllib.robotparser for parsing robots.txt files
urllib.parse — Parse URLs into components
This module defines a standard interface to break Uniform Resource Locator
(URL) strings up into components (addressing scheme, network location, path, etc.),
to combine the components back into a URL string, and to convert a “relative
URL” to an absolute URL given a “base URL.”
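A quick sketch of that breakdown with urlparse (the query and fragment here are illustrative):

from urllib.parse import urlparse

parts = urlparse("https://pypi.org/project/requests/?page=2#history")
print(parts.scheme)    # 'https'    -- addressing scheme
print(parts.netloc)    # 'pypi.org' -- network location
print(parts.path)      # '/project/requests/'
print(parts.query)     # 'page=2'
print(parts.fragment)  # 'history'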
urllib.parse.urljoin(base, url, allow_fragments=True)
Construct a full (“absolute”) URL by combining a “base URL” (base) with another
URL (url). Informally, this uses components of the base URL, in particular the
addressing scheme, the network location and (part of) the path, to provide missing
components in the relative URL. For example:
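First, urljoin on its own (the relative segment below is just an illustrative value):

from urllib.parse import urljoin

base = "https://pypi.org/project/requests/"
print(urljoin(base, "2.31.0/"))
# -> https://pypi.org/project/requests/2.31.0/  (relative path resolved)
print(urljoin(base, "https://example.com/other"))
# -> https://example.com/other  (already absolute, so returned as-is)

The same idea applied to every link on the page: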
import requests
from bs4 import BeautifulSoup
import urllib.parse

response = requests.get("https://pypi.org/project/requests/")
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.select("a")
for link in links:
    if link.has_attr('href'):
        # resolve relative hrefs against the page URL
        print("\t" + urllib.parse.urljoin("https://pypi.org/project/requests/", link['href']))
Wait a minute!

Most of the hrefs point outside this site!

Perhaps we should exclude them.

import requests
from bs4 import BeautifulSoup
import urllib.parse

response = requests.get("https://pypi.org/project/requests/")
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.select("a")
for link in links:
    if link.has_attr('href'):
        absUrl = urllib.parse.urljoin("https://pypi.org/project/requests/", link['href'])
        # keep only links that stay inside this project's pages
        if absUrl.startswith("https://pypi.org/project/requests/"):
            print(absUrl)
You may want to further exclude these:
import re

# Example pattern: match 'test-' followed by digits (and nothing else)
regex = re.compile(r'test-\d+')
correct_string = 'test-251'
wrong_string = 'test-123x'
if regex.fullmatch(correct_string):
    print('Matching correct string.')

# match() would also accept 'test-123x', because it only anchors at the
# start of the string; fullmatch() requires the entire string to match.
if regex.fullmatch(wrong_string):
    print('Matching wrong string.')
https://docs.python.org/3/library/re.html
import requests
from bs4 import BeautifulSoup
import urllib.parse
import re

response = requests.get("https://pypi.org/project/requests/")
soup = BeautifulSoup(response.text, 'html.parser')

# matches links to individual release pages (path ends in a version number)
p = re.compile(r'https://pypi\.org/project/requests/[\d.]+')

links = soup.select("a")
for link in links:
    if link.has_attr('href'):
        absUrl = urllib.parse.urljoin("https://pypi.org/project/requests/", link['href'])
        if absUrl.startswith("https://pypi.org/project/requests/"):
            if not p.match(absUrl):
                print(absUrl)
Lab 4: Extract Content from All Pages
If you are considering copy-pasting this code once for every page, don’t:
Refactor
Recursion
# (imports from the previous slides are assumed)
def extractPageContent(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    p = re.compile(r'https://pypi\.org/project/requests/[\d.]+')
    links = soup.select("a")
    for link in links:
        if link.has_attr('href'):
            absUrl = urllib.parse.urljoin("https://pypi.org/project/requests/", link['href'])
            if absUrl.startswith("https://pypi.org/project/requests/"):
                if not p.match(absUrl):
                    print(absUrl)
                    # recurse into every in-site link we find
                    extractPageContent(absUrl)
This will never stop. Why? Pages link back to pages we have already visited, so the recursion keeps revisiting the same URLs.
linkSet = set()

def extractPageContent(url):
    if url in linkSet:
        # skip links that have already been parsed
        return
    linkSet.add(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... the rest of the crawl loop continues as on the previous slide
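
To start the crawl, call the function once on the root page (this kick-off call is implied by the slides rather than shown):

extractPageContent("https://pypi.org/project/requests/")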
