PYTHON MODULE-4
Module-4
Web Scraping And Numerical Analysis
Topics to be studied
• Submitting a form
• CSS Selectors.
• The main purpose of web scraping is to collect and analyze data from
websites for various applications, such as research, business intelligence, or
creating datasets.
• Developers use tools and libraries like BeautifulSoup (for Python), Scrapy, or
Puppeteer to automate the process of fetching and parsing web data.
Python Libraries
• requests
• Beautiful Soup
• Selenium
Requests
• Beautiful Soup is used to extract tables, lists, and paragraphs, and you can also apply filters to extract
specific information from web pages.
• Beautiful Soup does not fetch the web page for us, so we use requests to download it.
• Install Beautiful Soup with: pip install beautifulsoup4
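The division of labour described above can be sketched as follows: requests downloads the page and Beautiful Soup parses it. The helper name fetch_soup and the URL in the usage comment are assumptions made for illustration; the inline HTML string lets the parsing step run without network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url):
    """Download a page with requests and parse it with Beautiful Soup."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error for 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")


# Parsing works the same way on any HTML string, fetched or not
sample_html = "<html><body><p>Hello, scraping!</p></body></html>"
soup = BeautifulSoup(sample_html, "html.parser")
print(type(soup))   # <class 'bs4.BeautifulSoup'>
print(soup.p.text)  # Hello, scraping!

# Hypothetical usage against a live site:
# soup = fetch_soup("https://example.com")
```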
BeautifulSoup
print(type(soup))
Tag Object
• A Tag object corresponds to an HTML tag in the original document and is usually used to extract a tag from the whole document.
• Beautiful Soup is not an HTTP client, which means that to scrape online websites
you first have to download them using the requests module and then pass
them to Beautiful Soup for parsing.
• Accessing a tag by name (for example, soup.b) returns only the first matching tag if your document has multiple tags with the same name.
from bs4 import BeautifulSoup
# Initialize the object with an HTML page
soup = BeautifulSoup('''
<html>
<b>RNSIT</b>
<b> Knowx Innovations</b>
</html>
''', "html.parser")
# Get the first <b> tag
tag = soup.b
# Print the tag and its type
print(tag)
print(type(tag))
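• A tag has many methods and attributes. Its two most important features are
its name and its attributes.
• Name: The name of the tag can be accessed with the '.name' suffix.
• Attributes: A tag may have any number of attributes (such as class or id). They are accessed like dictionary keys, for example tag["class"], or all at once through '.attrs'.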
# Import Beautiful Soup
from bs4 import BeautifulSoup
# Initialize the object with an HTML page
soup = BeautifulSoup('''
<html>
<b>Knowx Innovations</b>
</html>
''', "html.parser")
# Get the tag
tag = soup.b
# Print the output
print(tag.name)
# Rename the tag (HTML tag names are conventionally lowercase)
tag.name = "strong"
print(tag)
from bs4 import BeautifulSoup
# Initialize the object with an HTML page
soup = BeautifulSoup('''
<html>
<b class="RNSIT" name="knowx">Knowx Innovations</b>
</html>
''', "html.parser")
# Get the tag
tag = soup.b
print(tag["class"])
# modifying class
tag["class"] = "ekant"
print(tag)
# delete the class attributes
del tag["class"]
print(tag)
• Some attributes, such as class, are multi-valued. Beautiful Soup returns their values as a list when they are accessed by key.
• The select method accepts CSS selectors, which lets you navigate and
extract data from the parsed document easily.
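Both points can be seen in a small sketch. The HTML snippet and the class names in it are made up for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<p class="note important" id="intro">Read this first</p>',
    "html.parser",
)
tag = soup.p

# class is multi-valued, so its value comes back as a list
print(tag["class"])  # ['note', 'important']
# id is single-valued, so its value is a plain string
print(tag["id"])     # intro

# select() takes a CSS selector and returns a list of matching tags
matches = soup.select("p.important")
print(matches[0].text)  # Read this first
```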
CSS Selector
• Id selector (#)
• Class selector (.)
• Universal Selector (*)
• Element Selector (tag)
• Grouping Selector(,)
• Id selector (#): The ID selector targets a specific HTML element based on its unique
identifier attribute (id). An ID is intended to be unique within a webpage, so using the ID
selector allows you to style or apply CSS rules to a particular element with a specific ID.
#header {
color: blue;
font-size: 16px;
}
• Class selector (.): The class selector is used to select and style HTML elements based on
their class attribute. Unlike IDs, multiple elements can share the same class, enabling
you to apply the same styles to multiple elements throughout the document.
.highlight {
background-color: yellow;
font-weight: bold;
}
• Universal Selector (*): The universal selector selects all HTML elements on the webpage.
It can be used to apply styles or rules globally, affecting every element. However, it is
important to use the universal selector judiciously to avoid unintended consequences.
* {
margin: 0;
padding: 0;
}
• Element Selector (tag): The element selector targets all instances of a specific HTML
element on the page. It allows you to apply styles universally to elements of the same
type, regardless of their class or ID.
p {
color: green;
font-size: 14px;
}
• Grouping Selector (,): The grouping selector allows you to apply the same styles to
multiple selectors at once. Selectors are separated by commas, and the styles specified
will be applied to all the listed selectors.
h1, h2, h3 {
font-family: 'Arial', sans-serif;
color: #333;
}
• These selectors are fundamental to CSS and provide a powerful way to target and style
different elements on a webpage.
Creating a basic HTML page
<!DOCTYPE html>
<html>
<head>
<title>Sample Page</title>
</head>
<body>
<div id="content">
<h1>Heading 1</h1>
<p class="paragraph">This is a sample paragraph.</p>
<ul>
<li>Item 1</li>
<li>Item 2</li>
<li>Item 3</li>
</ul>
<a href="https://example.com">Visit Example</a>
</div>
</body>
</html>
Scraping example using CSS selectors
from bs4 import BeautifulSoup
# Load the sample page, assumed to be saved locally as web.html
# (for a live site, use requests.get(url).text instead)
with open("web.html", encoding="utf-8") as f:
    html = f.read()
soup = BeautifulSoup(html, 'html.parser')
# 1. Select by tag name
heading = soup.select('h1')
print("1. Heading:", heading[0].text)
# 2. Select by class
paragraph = soup.select('.paragraph')
print("2. Paragraph:", paragraph[0].text)
# 3. Select by ID
div_content = soup.select('#content')
print("3. Div Content:", div_content[0].text)
# 4. Select by attribute
link = soup.select('a[href="https://example.com"]')
print("4. Link:", link[0]['href'])
# 5. Select all list items
list_items = soup.select('ul li')
print("5. List Items:")
for item in list_items:
    print("-", item.text)
Selenium
• Selenium is an open-source browser automation and testing tool, so it is free
to download and use.
X = 10000
While Python’s array object provides efficient storage of array-based data, NumPy adds to
this efficient operations on that data.
Creating Arrays from Python Lists
import numpy as np
Unlike Python lists, NumPy arrays are constrained to contain elements of a single type. If the types do not match, NumPy will upcast if possible (for example, integers are upcast to floating point).
If we want to explicitly set the data type of the resulting array, we can use the dtype keyword.
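The code for this slide is not shown above; a minimal sketch of what it describes might look like this. The variable names and values are illustrative.

```python
import numpy as np

# A list of integers becomes an integer array
# (the exact integer width depends on the platform)
ints = np.array([1, 4, 2, 5, 3])
print(ints.dtype.kind)  # i

# Mixing a float with integers upcasts everything to floating point
mixed = np.array([3.14, 4, 2, 3])
print(mixed)        # [3.14 4.   2.   3.  ]
print(mixed.dtype)  # float64

# The dtype keyword sets the type explicitly
floats = np.array([1, 2, 3, 4], dtype="float32")
print(floats.dtype)  # float32
```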
• NumPy arrays can explicitly be multidimensional; here’s one way of initializing a
multidimensional array using a list of lists:
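As an illustration (the values are arbitrary), each inner list is treated as one row of the resulting two-dimensional array:

```python
import numpy as np

# Each inner list becomes one row of the array
grid = np.array([[2, 3, 4],
                 [4, 5, 6],
                 [6, 7, 8]])
print(grid.ndim)   # 2
print(grid.shape)  # (3, 3)
```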
Creating Arrays from Scratch
NumPy Standard Data Types
• While constructing an array, you can specify them using a string:
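A short sketch of both spellings, using np.zeros as an arbitrary constructor: the type can be given as a string or as the corresponding NumPy object.

```python
import numpy as np

# Data type given as a string
a = np.zeros(4, dtype="int16")
# The same data type given as a NumPy object
b = np.zeros(4, dtype=np.int16)

print(a.dtype)             # int16
print(a.dtype == b.dtype)  # True
```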
NumPy arrays have a fixed type. This means, for example, that if you attempt to insert a floating-point value
to an integer array, the value will be silently truncated.
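The truncation can be seen in a small example (the array values are arbitrary):

```python
import numpy as np

x1 = np.array([5, 0, 3, 3, 7, 9])  # integer array
x1[0] = 3.14159                    # the fractional part is dropped, with no warning
print(x1)  # [3 0 3 3 7 9]
```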
Array Slicing: Accessing Subarrays
One-dimensional subarrays
Multidimensional subarrays
Subarray dimensions can even be reversed together:
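For instance, with an arbitrary 3x4 array, a step of -1 along both axes reverses the rows and the columns at once:

```python
import numpy as np

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

# Reverse the row order and the column order together
print(x2[::-1, ::-1])
# [[7 7 6 1]
#  [8 8 6 7]
#  [4 2 5 3]]
```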
Accessing array rows and columns
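The slide body is missing here; a likely illustration, using an arbitrary 3x4 array, combines an index with an empty slice (:) to pick out a single row or column.

```python
import numpy as np

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

print(x2[:, 0])  # first column: [3 7 1]
print(x2[0, :])  # first row:    [3 5 2 4]
print(x2[0])     # shorthand for x2[0, :]
```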
Subarrays as no-copy views
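The heading refers to the fact that a slice of a NumPy array is a view onto the same data, not a copy. A minimal sketch with arbitrary values:

```python
import numpy as np

x2 = np.array([[3, 5, 2, 4],
               [7, 6, 8, 8],
               [1, 6, 7, 7]])

# A slice is a view: writing to it changes the original array
sub = x2[:2, :2]
sub[0, 0] = 99
print(x2[0, 0])  # 99

# An explicit copy is independent of the original
sub_copy = x2[:2, :2].copy()
sub_copy[0, 0] = 42
print(x2[0, 0])  # 99
```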
• Where possible, the reshape method returns a no-copy view of the initial array, but with noncontiguous
memory buffers this is not always the case, and a copy is made instead.
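One way to observe the difference is np.shares_memory, which reports whether two arrays overlap in memory. In this sketch the transposed array serves as an example of a noncontiguous buffer.

```python
import numpy as np

x = np.arange(9)

# Reshaping a contiguous array gives a view on the same memory
grid = x.reshape(3, 3)
print(np.shares_memory(x, grid))  # True

# A transposed array is noncontiguous, so flattening it forces a copy
flat = grid.T.reshape(9)
print(np.shares_memory(grid, flat))  # False
print(flat)  # [0 3 6 1 4 7 2 5 8]
```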
Another common reshaping pattern is the conversion of a one-dimensional array into a
two-dimensional row or column matrix.
• Reshaping can be done with the reshape method, or more easily by making use of the
newaxis keyword within a slice operation.
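Both approaches can be compared side by side; the array values are arbitrary.

```python
import numpy as np

x = np.array([1, 2, 3])

# Row vector: via reshape, then via newaxis
row_a = x.reshape((1, 3))
row_b = x[np.newaxis, :]
print(row_b.shape)  # (1, 3)

# Column vector: via reshape, then via newaxis
col_a = x.reshape((3, 1))
col_b = x[:, np.newaxis]
print(col_b.shape)  # (3, 1)
```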
Array Concatenation and Splitting
• Concatenation of arrays
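The slide body is missing here. As a sketch of the usual routines, np.concatenate and np.vstack join arrays, while np.split cuts an array at the given indices. The values are arbitrary.

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([3, 2, 1])

# Join one-dimensional arrays end to end
joined = np.concatenate([x, y])
print(joined)  # [1 2 3 3 2 1]

# Stack a 1-D array on top of a 2-D array (vertical stack)
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
stacked = np.vstack([x, grid])
print(stacked.shape)  # (3, 3)

# Split at index 3 into two pieces
left, right = np.split(joined, [3])
print(left, right)  # [1 2 3] [3 2 1]
```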
• Computation on NumPy arrays can be very fast, or it can be very slow. The key to making
it fast is to use vectorized operations, generally implemented through NumPy’s universal
functions (ufuncs).
• NumPy’s ufuncs can be used to make repeated calculations on array elements much
more efficient.
The Slowness of Loops
Each time the reciprocal is computed, Python first examines the object's type and does a
dynamic lookup of the correct function to use for that type. If we were working in
compiled code instead, this type specification would be known before the code executes
and the result could be computed much more efficiently.
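The loop being discussed is not shown above. A minimal sketch of such a loop, with the function name compute_reciprocals and the input values chosen for illustration, might be:

```python
import numpy as np


def compute_reciprocals(values):
    """Compute 1/x for each element using an explicit Python loop."""
    output = np.empty(len(values))
    for i in range(len(values)):
        # Python checks the type and looks up the division
        # routine again on every iteration
        output[i] = 1.0 / values[i]
    return output


rng = np.random.default_rng(0)
values = rng.integers(1, 10, size=5)
result = compute_reciprocals(values)
print(result)
```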
• For many types of operations, NumPy provides a convenient interface into this kind of
statically typed, compiled routine. This is known as a vectorized operation.
• This vectorized approach is designed to push the loop into the compiled layer that
underlies NumPy, leading to much faster execution.
• Looking at the execution time for our big array, we see that it completes orders of
magnitude faster than the Python loop:
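A rough way to check this claim is to time both versions on the same array. The function name loop_reciprocals and the array size are assumptions, and the measured times will vary by machine, so no timings are quoted here.

```python
import timeit

import numpy as np


def loop_reciprocals(values):
    """Reciprocals computed one element at a time in Python."""
    output = np.empty(len(values))
    for i in range(len(values)):
        output[i] = 1.0 / values[i]
    return output


big_array = np.random.default_rng(0).integers(1, 100, size=100_000)

# Time one run of each approach
loop_time = timeit.timeit(lambda: loop_reciprocals(big_array), number=1)
vector_time = timeit.timeit(lambda: 1.0 / big_array, number=1)

print(f"Python loop: {loop_time:.4f} s")
print(f"Vectorized:  {vector_time:.4f} s")
```

Both versions produce the same values; only the place where the loop runs differs.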
Introducing UFuncs
• Vectorized operations in NumPy are implemented via ufuncs, whose main purpose is
to quickly execute repeated operations on values in NumPy arrays.
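A ufunc can write its result directly into an existing array through the out argument, for example np.power(2, x, out=y[::2]).
If we had instead written y[::2] = 2 ** x, this would have resulted in the creation of
a temporary array to hold the results of 2 ** x, followed by a second operation copying those values into y.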
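The statement above appears to refer to the out argument of ufuncs. A small sketch, with arbitrary array sizes, that writes powers of two into every other element of y:

```python
import numpy as np

x = np.arange(5)
y = np.zeros(10)

# Write 2**x straight into every other element of y,
# without building a temporary array first
np.power(2, x, out=y[::2])
print(y)  # [ 1.  0.  2.  0.  4.  0.  8.  0. 16.  0.]
```

For arrays this small the saving is negligible, but for very large arrays avoiding the temporary can matter.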
Aggregates
• For binary ufuncs, there are some interesting aggregates that can be computed directly
from the object. For example, the reduce method of any ufunc reduces an array with a particular operation.
• A reduce method repeatedly applies a given operation to the elements of an array until
only a single result remains.
• For example, calling reduce on the add ufunc returns the sum of all elements in the
array:
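For instance, with an arbitrary array of the integers 1 to 5:

```python
import numpy as np

x = np.arange(1, 6)  # [1 2 3 4 5]

# Repeatedly apply addition until a single value remains
total = np.add.reduce(x)
print(total)  # 15
```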
• Similarly, calling reduce on the multiply ufunc results in the product of all array elements:
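Using the integers 1 to 5 as an arbitrary example:

```python
import numpy as np

x = np.arange(1, 6)  # [1 2 3 4 5]

# Repeatedly apply multiplication until a single value remains
product = np.multiply.reduce(x)
print(product)  # 120
```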
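Note that for these particular cases, there are dedicated NumPy functions to compute the results
(np.sum, np.prod, np.cumsum, np.cumprod). np.cumsum and np.cumprod correspond to the accumulate method, which keeps all the intermediate results.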
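The correspondence can be checked directly. This sketch uses the accumulate method of a ufunc, which stores every intermediate result of the reduction.

```python
import numpy as np

x = np.arange(1, 6)  # [1 2 3 4 5]

# accumulate keeps the running results of the reduction
running_sum = np.add.accumulate(x)
running_product = np.multiply.accumulate(x)
print(running_sum)      # [ 1  3  6 10 15]
print(running_product)  # [  1   2   6  24 120]

# The dedicated functions give the same answers
print(np.array_equal(running_sum, np.cumsum(x)))       # True
print(np.array_equal(running_product, np.cumprod(x)))  # True
```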
Outer products
• Finally, any ufunc can compute the output of all pairs of two different inputs using the
outer method. This allows you, in one line, to do things like create a multiplication table:
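A multiplication table for the integers 1 to 5, as a minimal example:

```python
import numpy as np

x = np.arange(1, 6)

# Apply multiply to every pair (x[i], x[j])
table = np.multiply.outer(x, x)
print(table)
# [[ 1  2  3  4  5]
#  [ 2  4  6  8 10]
#  [ 3  6  9 12 15]
#  [ 4  8 12 16 20]
#  [ 5 10 15 20 25]]
```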
Broadcasting
Broadcasting in NumPy is a powerful mechanism that allows arithmetic operations on arrays of
different shapes and sizes without explicitly creating additional copies of the data. It simplifies
element-wise operations on such arrays, making code more
concise and efficient.
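A few small cases illustrate the idea; the arrays are arbitrary. In each case the smaller operand is treated as if it were stretched to match the larger one, without the data actually being duplicated.

```python
import numpy as np

a = np.array([0, 1, 2])

# Scalar and array: the scalar is applied to every element
print(a + 5)  # [5 6 7]

# 2-D and 1-D: the 1-D array is applied to every row
M = np.ones((3, 3))
print(M + a)
# [[1. 2. 3.]
#  [1. 2. 3.]
#  [1. 2. 3.]]

# Column (3, 1) and row (3,): both are stretched to shape (3, 3)
b = np.arange(3)[:, np.newaxis]
print(a + b)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]]
```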