
P.S.N.A. COLLEGE OF ENGINEERING & TECHNOLOGY

(An Autonomous Institution affiliated to Anna University, Chennai)
Kothandaraman Nagar, Muthanampatti (PO), Dindigul – 624 622.
Phone: 0451-2554032, 2554349 Web Link: www.psnacet.org
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
Subject Code / Name : OCS353 / DATA SCIENCE FUNDAMENTALS
Year / Semester : IV/ VII ‘A’

SYLLABUS
UNIT II DATA MANIPULATION
Python Shell - Jupyter Notebook - IPython Magic Commands - NumPy Arrays-Universal
Functions – Aggregations – Computation on Arrays – Fancy Indexing – Sorting arrays –
Structured data – Data manipulation with Pandas – Data Indexing and Selection – Handling
missing data – Hierarchical indexing – Combining datasets – Aggregation and Grouping – String
operations – Working with time series – High performance

PYTHON SHELL

The Python shell, also known as the Python interactive interpreter, is a command-line interface where
you can execute Python code interactively. It allows you to type in Python commands and see the results
immediately, making it an excellent tool for testing small code snippets, debugging, or learning Python.
Key Features of the Python Shell:
1. Interactive Environment: You can write and run Python commands one at a time. The results
are displayed immediately, which is great for experimentation.
2. Quick Feedback: The shell provides instant feedback on your code, making it easy to test ideas
and learn syntax.
3. Data Types and Structures: You can easily create and manipulate variables, lists, dictionaries,
and other data structures.
4. Import Libraries: You can import and use Python libraries right from the shell, allowing you to
leverage existing functionality.
5. Function Definition: You can define functions and test them on the fly.
6. Exit Easily: You can exit the shell whenever you want by using commands like exit() or pressing
Ctrl + Z (Windows) or Ctrl + D (Mac/Linux).
How to Access the Python Shell:
• Command Line: Open a terminal or command prompt and type python or python3, depending
on your installation.
• IDEs: Many integrated development environments (IDEs), like PyCharm or Jupyter Notebook,
provide a Python shell or console as part of their features.
Example Interaction:

Here's what a typical session in the Python shell looks like:
$ python
Python 3.x.x (default, ... )
>>> print("Hello, World!")
Hello, World!
>>> x = 5
>>> y = 10
>>> x + y
15
>>> my_list = [1, 2, 3]
>>> my_list.append(4)
>>> my_list
[1, 2, 3, 4]
In this example:
• You start the shell and can run Python commands.
• You see the output immediately after running each command.
The Python shell is an invaluable tool for beginners and experienced developers alike for quick
prototyping and testing.

JUPYTER NOTEBOOK

Jupyter Notebook is an open-source web application that allows you to create and share documents
containing live code, equations, visualizations, and narrative text. It is widely used in data science,
machine learning, and scientific research for its interactive features and ease of use.
Key Features of Jupyter Notebook:
1. Interactive Computing: You can write and execute code in chunks (called cells) and see the
results immediately, making it ideal for experimenting with code and data.
2. Support for Multiple Languages: While it primarily supports Python, Jupyter can work with
many programming languages (such as R, Julia, and Scala) through the use of different kernels.
3. Rich Text Support: You can combine code with rich text elements, such as Markdown, to
document your process, explain your code, or present results in a clear and readable format.
4. Data Visualization: It supports various libraries like Matplotlib, Seaborn, and Plotly, making it
easy to create and display graphs and visualizations inline.
5. Export Options: You can export your notebooks to various formats, including HTML, PDF, and
Markdown, for sharing or publishing.
6. Notebook Sharing: Jupyter Notebooks can be shared easily with others. You can share the
notebook file directly or host it on platforms like GitHub or JupyterHub.

Common Uses:
• Data Analysis: Data scientists use Jupyter Notebooks to analyze datasets, visualize results, and
document their findings.
• Machine Learning: It's commonly used for building, testing, and sharing machine learning
models.
• Educational Purposes: Jupyter Notebooks are often used in teaching programming and data
science because they provide an interactive learning experience.
How to Get Started:
1. Installation: You can install Jupyter Notebook using the Anaconda distribution, which includes
Python and many scientific libraries, or by using pip:
pip install notebook

2. Launching Jupyter Notebook: Once installed, you can launch it from the command line by
typing:
jupyter notebook

This will open a new tab in your web browser showing the Jupyter Notebook interface.
Example of a Jupyter Notebook Cell:
Here's a simple example of what you might see in a Jupyter Notebook:

# This is a code cell
import numpy as np
import matplotlib.pyplot as plt

# Generate some data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Create a plot
plt.plot(x, y)
plt.title("Sine Wave")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()

In this example, the code cell imports libraries, generates data, and creates a plot. The plot will be
displayed directly below the code when executed.

IPYTHON MAGIC COMMANDS

IPython magic commands are special commands in IPython and Jupyter Notebooks that enhance the
interactive computing experience. They provide a way to perform various tasks in a more convenient
way than standard Python syntax. These commands start with one or two percent signs (% or %%).

Types of Magic Commands


1. Line Magics: These commands start with a single % and operate on a single line of input.
2. Cell Magics: These commands start with %% and operate on the entire cell.
Commonly Used Magic Commands
Here are some useful IPython magic commands:
Line Magics
%run: Runs a Python script.
%run my_script.py

%time: Times the execution of a single statement.

%time sum(range(10000))
%timeit: Provides a more accurate measurement of execution time by running the statement
multiple times.
%timeit sum(range(10000))
%load: Loads a Python script into the current cell.
%load my_script.py
%matplotlib inline: Ensures that Matplotlib plots are displayed inline within the notebook.
%matplotlib inline
%who: Lists all variables in the current namespace.
%who
%history: Shows command history.
%history

Cell Magics
• %%time: Times the execution of the entire cell.
%%time
total = 0
for i in range(10000):
    total += i

%%writefile: Writes the contents of the cell to a file.
%%writefile my_file.py
def my_function():
return "Hello, World!"
%%capture: Captures the output of the cell (stdout and stderr).
%%capture captured_output
print("This will not be printed")
%%bash: Executes the cell content as a bash script.
%%bash
echo "Hello from bash"

How to Use Magic Commands


You can use magic commands directly in your Jupyter Notebook cells. For example, to time a code
block, you would do:
%%time

result = [x**2 for x in range(10000)]

IPython magic commands provide a powerful way to enhance your workflow in Jupyter Notebooks.
They simplify common tasks and offer functionalities that go beyond standard Python capabilities,
making your coding experience more efficient and effective.

NUMPY ARRAYS

NumPy (Numerical Python) is a powerful library in Python that provides support for large, multi-
dimensional arrays and matrices, along with a collection of mathematical functions to operate on these
arrays. NumPy arrays are the fundamental building blocks for numerical computations in Python.
Key Features of NumPy Arrays
1. Homogeneous Data: All elements in a NumPy array must be of the same data type, which
allows for efficient storage and computation.
2. N-dimensional: NumPy arrays can be one-dimensional (1D), two-dimensional (2D), or multi-
dimensional (ND), enabling complex data structures like matrices and tensors.
3. Fast and Efficient: NumPy operations are optimized for performance, making them faster than
traditional Python lists for numerical operations.
4. Element-wise Operations: You can perform operations on entire arrays without needing to write
loops, which leads to cleaner and more readable code.
5. Broadcasting: This feature allows NumPy to perform arithmetic operations on arrays of
different shapes, making it easier to work with arrays of varying dimensions.
Creating NumPy Arrays
To use NumPy, you first need to import it:
import numpy as np
Here are some common ways to create NumPy arrays:
1. From a Python List:
arr = np.array([1, 2, 3, 4])
2. Using Built-in Functions:
o Zeros: Create an array filled with zeros.
zeros = np.zeros((3, 4)) # 3 rows, 4 columns
o Ones: Create an array filled with ones.
ones = np.ones((2, 3)) # 2 rows, 3 columns
o Full: Create an array filled with a specified value.
full_array = np.full((2, 2), 7) # 2x2 array filled with 7
o Arange: Create an array with a range of values.
arange_array = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
o Linspace: Create an array with evenly spaced values over a specified range.
linspace_array = np.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]
Accessing and Modifying Arrays
• Indexing: Access elements using zero-based indexing.
first_element = arr[0] # Access the first element
• Slicing: Get a sub-array.
sub_array = arr[1:3] # Access elements from index 1 to 2
• Shape: Get the shape of the array.
shape = arr.shape # Returns a tuple with the array's dimensions
• Reshaping: Change the shape of an array without changing its data.
reshaped_array = arr.reshape((2, 2))
Common Operations
1. Mathematical Operations: Perform element-wise operations.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
sum_array = arr1 + arr2 # [5, 7, 9]
2. Statistical Functions: Calculate statistics like mean, median, and standard deviation.
mean = np.mean(arr) # Calculate mean
std_dev = np.std(arr) # Calculate standard deviation

3. Matrix Operations: Use functions like dot product.
dot_product = np.dot(arr1, arr2) # Calculate dot product
4. Boolean Indexing: Filter elements based on conditions.
filtered_array = arr[arr > 2] # Get elements greater than 2

UNIVERSAL FUNCTIONS IN NUMPY


Universal functions, or ufuncs, are a key feature of the NumPy library in Python. They are highly
optimized functions that operate element-wise on NumPy arrays, allowing for fast and efficient
computations. Ufuncs can take one or more arrays as input and perform operations on their elements in a
vectorized manner.
Key Features of Universal Functions
1. Element-wise Operations: Ufuncs apply operations to each element of an array independently,
which is efficient and often faster than using Python loops.
2. Broadcasting: Ufuncs support broadcasting, enabling operations on arrays of different shapes, as
long as the shapes are compatible.
3. Support for Various Data Types: Ufuncs can handle different data types, including integers,
floats, and complex numbers.
4. Performance: Ufuncs are implemented in C, making them significantly faster than equivalent
Python functions.
Common Universal Functions
Here are some commonly used universal functions in NumPy:
1. Mathematical Functions
• np.add: Adds two arrays element-wise.
import numpy as np
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
result = np.add(a, b) # [5, 7, 9]
• np.subtract: Subtracts one array from another.
result = np.subtract(a, b) # [-3, -3, -3]
• np.multiply: Multiplies two arrays element-wise.
result = np.multiply(a, b) # [4, 10, 18]
• np.divide: Divides one array by another.
result = np.divide(a, b) # [0.25, 0.4, 0.5]
2. Trigonometric Functions
• np.sin, np.cos, np.tan: Compute the sine, cosine, and tangent of each element.
angles = np.array([0, np.pi/2, np.pi])
sine_values = np.sin(angles) # [0.0, 1.0, ~0.0] (sin(pi) comes out as ~1.2e-16 in floating point)
3. Exponential and Logarithmic Functions
• np.exp: Computes the exponential of each element.
result = np.exp(a) # [e^1, e^2, e^3]
• np.log: Computes the natural logarithm of each element.
result = np.log(a) # [0.0, log(2), log(3)]
4. Aggregate Functions
• np.sum: Computes the sum of array elements.
total = np.sum(a) # 6
• np.mean: Computes the average of array elements.
average = np.mean(a) # 2.0
• np.std: Computes the standard deviation of array elements.
std_dev = np.std(a) # 0.816
5. Comparison Functions
• np.equal, np.greater, np.less: Element-wise comparison.
comparison = np.greater(a, 2) # [False, False, True]
Broadcasting Example
Broadcasting allows ufuncs to work with arrays of different shapes. For example:
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([10, 20, 30])
result = a + b # Adds b to each row of a
Output:
[[11, 22, 33], [14, 25, 36]]

AGGREGATIONS IN NUMPY
Aggregation functions in NumPy are used to compute a single value from an array of values. These
functions allow you to summarize or reduce data, making it easier to analyze and interpret. Common
aggregation functions include sum, mean, median, standard deviation, and more.
Key Aggregation Functions
Here are some commonly used aggregation functions in NumPy:
1. Sum
• np.sum: Computes the sum of array elements.
import numpy as np

arr = np.array([1, 2, 3, 4, 5])


total = np.sum(arr) # 15
• Axis Parameter: You can specify an axis to sum along.
matrix = np.array([[1, 2, 3], [4, 5, 6]])
row_sum = np.sum(matrix, axis=0) # Sum along columns: [5, 7, 9]
column_sum = np.sum(matrix, axis=1) # Sum along rows: [6, 15]
2. Mean
• np.mean: Computes the average of array elements.
average = np.mean(arr) # 3.0
• Axis Parameter: You can also compute the mean along a specific axis.
row_mean = np.mean(matrix, axis=0) # Mean along columns: [2.5, 3.5, 4.5]
3. Median
• np.median: Computes the median of array elements.
median_value = np.median(arr) # 3.0
• Axis Parameter: Compute the median along a specific axis.
median_row = np.median(matrix, axis=1) # Median along rows: [2.0, 5.0]
4. Standard Deviation
• np.std: Computes the standard deviation of array elements.
std_dev = np.std(arr) # 1.4142135623730951
• Axis Parameter: Compute the standard deviation along a specific axis.
std_row = np.std(matrix, axis=0) # Standard deviation along columns
5. Variance
• np.var: Computes the variance of array elements.
variance = np.var(arr) # 2.0
6. Minimum and Maximum
• np.min and np.max: Compute the minimum and maximum values.
min_value = np.min(arr) # 1
max_value = np.max(arr) # 5
• Axis Parameter: Find min and max along a specific axis.
min_row = np.min(matrix, axis=1) # Minimum along rows: [1, 4]

max_column = np.max(matrix, axis=0) # Maximum along columns: [4, 5, 6]

Example of Aggregation in Action


Here’s a complete example demonstrating various aggregation functions:
import numpy as np
# Create a 2D array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Sum
total_sum = np.sum(data) # 45
column_sum = np.sum(data, axis=0) # [12, 15, 18]
# Mean
mean_value = np.mean(data) # 5.0
row_mean = np.mean(data, axis=1) # [2.0, 5.0, 8.0]
# Median
median_value = np.median(data) # 5.0
# Standard Deviation
std_dev = np.std(data) # 2.581988897471611
# Minimum and Maximum
min_value = np.min(data) # 1
max_value = np.max(data) # 9

COMPUTATIONS ON ARRAYS
• Computation on NumPy arrays can be very fast, or it can be very slow. Fast computation is achieved through vectorized operations, implemented via NumPy's universal functions (ufuncs).
• A universal function (ufunc) is a function that operates on ndarrays in an element-by-element fashion, supporting array broadcasting, type casting, and several other standard features. A ufunc is a "vectorized" wrapper for a function that takes a fixed number of specific inputs and produces a fixed number of specific outputs.
• Functions that work on both scalars and arrays are known as ufuncs. For arrays, ufuncs apply the function in an element-wise fashion. Use of ufuncs is an essential aspect of vectorization and is typically much more computationally efficient than an explicit loop over each element.
NumPy's Ufuncs :
• Ufuncs are of two types: unary ufuncs and binary ufuncs.
• Unary ufuncs operate on a single input; binary ufuncs operate on two inputs.
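A minimal sketch contrasting the two kinds (np.negative is a unary ufunc, np.add a binary one; both are standard NumPy ufuncs):
import numpy as np
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(np.negative(a))  # unary: one input array -> [-1 -2 -3]
print(np.add(a, b))    # binary: two input arrays -> [11 22 33]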



• Arithmetic operators implemented in NumPy are as follows (each operator is backed by a ufunc):
+ : np.add (addition)
- : np.subtract (subtraction)
- : np.negative (unary negation)
* : np.multiply (multiplication)
/ : np.divide (division)
// : np.floor_divide (floor division)
** : np.power (exponentiation)
% : np.mod (modulus/remainder)
• Example of arithmetic operators (plain Python code):
# Taking input
num1 = input('Enter first number: ')
num2 = input('Enter second number: ')
# Addition
total = float(num1) + float(num2)
# Subtraction
diff = float(num1) - float(num2)
# Multiplication
mul = float(num1) * float(num2)
# Division
div = float(num1) / float(num2)
# Modulus
mod = float(num1) % float(num2)
# Exponentiation
exp = float(num1) ** float(num2)
# Floor division
floordiv = float(num1) // float(num2)
print('The sum of {0} and {1} is {2}'.format(num1, num2, total))
print('The subtraction of {0} and {1} is {2}'.format(num1, num2, diff))
print('The multiplication of {0} and {1} is {2}'.format(num1, num2, mul))
print('The division of {0} and {1} is {2}'.format(num1, num2, div))
print('The modulus of {0} and {1} is {2}'.format(num1, num2, mod))
print('The exponentiation of {0} and {1} is {2}'.format(num1, num2, exp))
print('The floor division of {0} and {1} is {2}'.format(num1, num2, floordiv))
Absolute value:
• Besides Python's built-in arithmetic operators, NumPy also understands Python's built-in absolute value function. The abs() function returns the absolute magnitude of the input passed to it as an argument, i.e. the value of the input without regard to its sign.
• The abs() function accepts a single argument, which must be a number, and returns its absolute magnitude. If the input is an integer or a float, abs() returns the absolute value. If the input is a complex number, abs() returns only the magnitude portion of the number.
Syntax: abs(number)
where number can be an integer, a floating-point number or a complex number.
• Example:
num = -25.79
print("Absolute value:", abs(num))
• Output:
Absolute value: 25.79
Trigonometric functions:
• The numpy package provides trigonometric functions which can be used to calculate trigonometric
ratios for a given angle in radians.

Example:
import numpy as np
arr = np.array([0, 30, 60, 90])
# converting the angles to radians
arr = arr * np.pi / 180
print("\nThe sin value of angles:")
print(np.sin(arr))
print("\nThe cos value of angles:")
print(np.cos(arr))
print("\nThe tan value of angles:")
print(np.tan(arr))
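NumPy also provides np.deg2rad() (and its inverse np.rad2deg()) for this conversion; a brief sketch equivalent to the manual multiplication above:
import numpy as np
angles_deg = np.array([0, 30, 60, 90])
angles_rad = np.deg2rad(angles_deg)     # same as angles_deg * np.pi / 180
print(np.round(np.sin(angles_rad), 4))  # [0.     0.5    0.866  1.    ]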

FANCY INDEXING
Fancy indexing in NumPy allows you to access and manipulate array elements using integer arrays or
boolean arrays. Here’s a quick overview:
1. Integer Array Indexing
You can use an array of indices to access specific elements:
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
indices = np.array([1, 3, 4])
result = arr[indices]
# Output: [20, 40, 50]
2. Multi-Dimensional Fancy Indexing
For multi-dimensional arrays, you can use two arrays: one for row indices and one for column indices.
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
row_indices = np.array([0, 2])
col_indices = np.array([1, 2])
result = arr_2d[row_indices, col_indices]
# Output: [2, 9]
3. Boolean Indexing
You can also use boolean arrays to filter elements that meet a condition.
arr = np.array([10, 20, 30, 40, 50])
condition = arr > 30
filtered = arr[condition]
# Output: [40, 50]
4. Setting Values with Fancy Indexing
You can modify elements using fancy indexing as well:
arr = np.array([10, 20, 30, 40, 50])
arr[[1, 3]] = [99, 88] # Set elements at index 1 and 3
# Output: [10, 99, 30, 88, 50]
Example Summary
Here’s a complete example demonstrating fancy indexing:
import numpy as np
# Create an array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Fancy indexing
rows = np.array([0, 2])
cols = np.array([1, 2])
selected_elements = data[rows, cols] # Output: [2, 9]
# Boolean indexing
condition = data > 5
filtered_elements = data[condition] # Output: [6, 7, 8, 9]
print("Selected Elements:", selected_elements)
print("Filtered Elements:", filtered_elements)

SORTING ARRAYS
Sorting arrays in NumPy is straightforward and can be done using several functions. Here’s an overview
of how to sort arrays, including both one-dimensional and multi-dimensional arrays.
1. Sorting a 1D Array
You can sort a one-dimensional array using np.sort() or the sort() method of the array.
import numpy as np
arr = np.array([3, 1, 4, 1, 5, 9, 2])
sorted_arr = np.sort(arr) # Returns a sorted copy
print(sorted_arr) # Output: [1, 1, 2, 3, 4, 5, 9]
# Using the sort method (in-place)
arr.sort() # Sorts in place
print(arr) # Output: [1, 1, 2, 3, 4, 5, 9]
2. Sorting a 2D Array
When sorting a 2D array, you can specify the axis along which to sort.
arr_2d = np.array([[3, 1, 4], [1, 5, 9], [2, 6, 5]])
# Sort along the rows (default axis=1)
sorted_rows = np.sort(arr_2d, axis=1)
print(sorted_rows)
# Output:
# [[1, 3, 4],
# [1, 5, 9],
# [2, 5, 6]]
# Sort along the columns (axis=0)
sorted_cols = np.sort(arr_2d, axis=0)
print(sorted_cols)
# Output:
# [[1, 1, 4],
# [2, 5, 5],
# [3, 6, 9]]
3. Getting the Indices of Sorted Elements
You can use np.argsort() to get the indices that would sort the array.
arr = np.array([3, 1, 4, 1, 5, 9, 2]) # the original, unsorted array
indices = np.argsort(arr)
print(indices) # Output: [1 3 6 0 2 4 5]
4. Sorting with Custom Order
NumPy's sorting functions have no key parameter like Python's built-in sorted(). To sort by a custom criterion, compute a key array and index the original array with the result of np.argsort, as sketched below.
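A minimal sketch of this pattern, sorting by absolute value (the key choice is illustrative):
import numpy as np
arr = np.array([-5, 2, -1, 4, -3])
order = np.argsort(np.abs(arr))  # indices that sort the array by |x|
print(arr[order])                # [-1  2 -3  4 -5]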
5. Sorting with np.lexsort
You can use np.lexsort for sorting based on multiple keys.
# Example: sort primarily by the second column, breaking ties with the first
# (np.lexsort treats the LAST key in the sequence as the primary sort key)
data = np.array([[3, 1], [1, 2], [4, 1], [1, 3]])
sorted_indices = np.lexsort((data[:, 0], data[:, 1]))
sorted_data = data[sorted_indices]
print(sorted_data)
# Output:
# [[3, 1],
#  [4, 1],
#  [1, 2],
#  [1, 3]]
Summary Example
Here’s a complete example that showcases sorting:
import numpy as np
# Create a 2D array
data = np.array([[3, 2, 5], [1, 0, 4], [8, 7, 6]])
# Sort the entire array (flattened)
sorted_flattened = np.sort(data, axis=None)
print("Sorted (flattened):", sorted_flattened)
# Sort by rows
sorted_by_rows = np.sort(data, axis=1)
print("Sorted by rows:\n", sorted_by_rows)
# Sort by columns
sorted_by_cols = np.sort(data, axis=0)
print("Sorted by columns:\n", sorted_by_cols)
# Get sorted indices
sorted_indices = np.argsort(data.flatten())
print("Sorted indices:", sorted_indices)

STRUCTURED ARRAYS
• A structured NumPy array is an array of structures. NumPy arrays are homogeneous, i.e. they can contain data of one type only; so, instead of creating a numpy array of int or float, we can create a numpy array of homogeneous structures.
• First of all, import the numpy module:
import numpy as np
• Now, to create a structured numpy array, we can pass a list of tuples containing the structure elements:
[('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)]
• But since the elements of a NumPy array are homogeneous, how are the size and type of the structure decided? We need to pass the type of the above structure, i.e. its schema, in the dtype parameter.
• Let's create a dtype for the above structure:
# Creating the type of a structure
dtype = [('Name', (np.str_, 10)), ('Marks', np.float64), ('GradeLevel', np.int32)]
• Let's create a numpy array based on this dtype:
# Creating a structured NumPy array
structuredArr = np.array([('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)], dtype=dtype)
• It will create a structured numpy array whose contents are:
[('Ram', 22.2, 3) ('Rutu', 39.4, 5) ('Rupu', 55.5, 6) ('Iresh', 99.9, 7)]
• Let's check the data type of the numpy array created above:
print(structuredArr.dtype)
Output:
[('Name', '<U10'), ('Marks', '<f8'), ('GradeLevel', '<i4')]
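Individual fields of a structured array can then be accessed, and filtered, by name; a short sketch continuing the example above:
# Access a single field by its name
print(structuredArr['Name'])  # ['Ram' 'Rutu' 'Rupu' 'Iresh']
# Boolean masks built from one field can filter whole records
print(structuredArr[structuredArr['Marks'] > 50]['Name'])  # ['Rupu' 'Iresh']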

Creating structured arrays:
• Structured array data types can be specified in a number of ways.
1. Dictionary method :
np.dtype({'names': ('name', 'age', 'weight'),
'formats': ('U10', 'i4', 'f8')})
Output: dtype([('name', '<U10'), ('age', '<i4'), ('weight', '<f8')])
2. Numerical types can be specified with Python types or NumPy dtypes instead:
np.dtype({'names': ('name', 'age', 'weight'),
'formats':((np.str_, 10), int, np.float32)})
Output: dtype([('name', '<U10'), ('age', '<i8'), ('weight', '<f4')])
3. A compound type can also be specified as a list of tuples :
np.dtype([('name', 'S10'), ('age', 'i4'), ('weight', 'f8')])
Output: dtype([('name', 'S10'), ('age', '<i4'), ('weight', '<f8')])
NumPy data types:
• Below is a listing of the data type character codes available in NumPy; a short example follows the list.
1) i - integer
2) b - boolean
3) u - unsigned integer
4) f - float
5) c - complex float
6) m - timedelta
7) M - datetime
8) O - object
9) S - (byte) string
10) U - unicode string
11) V - fixed chunk of memory for other types (void)
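A brief sketch of how these character codes are used in practice (the arrays and values are illustrative):
import numpy as np
print(np.array([1, 2, 3], dtype='i4').dtype)      # int32
print(np.array([1.0, 2.0], dtype='f8').dtype)     # float64
print(np.array(['ab', 'cd'], dtype='U10').dtype)  # <U10 (unicode, up to 10 chars)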

DATA MANIPULATION WITH PANDAS


• Pandas is a high-level data manipulation tool developed by Wes McKinney. It is built on the Numpy
package and its key data structure is called the DataFrame.
• DataFrames allow you to store and manipulate tabular data in rows of observations and columns of
variables.

• Pandas is built on top of the NumPy package, meaning a lot of the structure of NumPy is used or
replicated in Pandas. Data in pandas is often used to feed statistical analysis in SciPy, plotting functions
from Matplotlib and machine learning algorithms in Scikit-learn.
• Pandas is the library for data manipulation and analysis, and it is usually the starting point for data science tasks. It allows you to read and write data from multiple sources, and to process missing data, align your data, reshape it, merge and join it with other data, search it, group it and slice it.
Create DataFrame with Duplicate Data
• Duplicate data creates problems for a data science project. If the database is large, processing duplicate data wastes time.
• Finding duplicates is important because it saves time and space and avoids false results. Duplicate data can be removed easily and efficiently using the drop_duplicates() function in pandas.
• Create Dataframe with Duplicate data
import pandas as pd
raw_data={'first_name': ['rupali', 'rupali', 'rakshita','sangeeta', 'mahesh', 'vilas'],
'last_name': ['dhotre', 'dhotre', 'dhotre','Auti', 'jadhav', 'bagad'],
'RNo': [12, 12, 1111111, 36, 24, 73],
'TestScore1': [4, 4, 4, 31, 2, 3],
'TestScore2': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'RNo', 'TestScore1', 'TestScore2'])
df

Drop duplicates
df.drop_duplicates()
• Drop duplicates in the first name column, but take the last observation in the duplicated set
df.drop_duplicates(['first_name'], keep='last')
Creating a Data Map and Data Plan
• A data map gives an overview of a dataset. It is used for finding potential problems in the data, such as redundant variables, possible errors, missing values and variable transformations.

• Try creating a Python script that converts a Python dictionary into a Pandas DataFrame, then print the
DataFrame to screen.
import pandas as pd
scottish_hills={'Ben Nevis': (1345, 56.79685, -5.003508),
'Ben Macdui': (1309, 57.070453, -3.668262),
'Braeriach': (1296, 57.078628, -3.728024),
'Cairn Toul': (1291, 57.054611, -3.71042),
'Sgòr an Lochain Uaine': (1258, 57.057999, -3.725416)}
dataframe = pd.DataFrame(scottish_hills)
print(dataframe)
Manipulating and Creating Categorical Variables
• Categorical variable is one that has a specific value from a limited selection of values. The number of
values is usually fixed.
• Categorical features can only take on a limited, and usually fixed, number of possible values. For
example, if a dataset is about information related to users, then you will typically find features like
country, gender, age group, etc. Alternatively, if the data you are working with is related to products,
you will find features like product type, manufacturer, seller and so on.
• Method for creating a categorical variable and then using it to check whether some data falls within the
specified limits.
import pandas as pd
cycle_colors = pd.Series(['Blue', 'Red', 'Green'], dtype='category')
cycle_data = pd.Series(pd.Categorical(['Yellow', 'Green', 'Red', 'Blue', 'Purple'],
categories=cycle_colors, ordered=False))
find_entries = pd.isnull(cycle_data)
print(cycle_colors)
print()
print(cycle_data)
print()
print(find_entries[find_entries == True])
• Here cycle_colors is a categorical variable. It contains the values Blue, Red and Green as colors.
Renaming Levels and Combining Levels
• Data frame variable names are typically used many times when wrangling data. Good names for these
variables make it easier to write and read wrangling programs.
• Categorical data has a categories and an ordered property, which list the possible values and whether the ordering matters or not.

• Renaming categories is done by assigning new values to the Series.cat.categories property or by using
the Categorical.rename_categories() method :
In [41]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
In [42]: s
Out[42]:
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): [a, b, c]
In [43]: s.cat.categories = ["Group %s" % g for g in s.cat.categories]
In [44]: s
Out[44]:
0    Group a
1    Group b
2    Group c
3    Group a
dtype: category
Categories (3, object): [Group a, Group b, Group c]
In [45]: s.cat.rename_categories([1, 2, 3])
Out[45]:
0    1
1    2
2    3
3    1
dtype: category
Categories (3, int64): [1, 2, 3]
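The snippet above covers renaming; to combine (merge) levels, a simple approach is to map several categories onto one value and re-declare the result as categorical. A minimal sketch (the grouping chosen is illustrative):
import pandas as pd
s = pd.Series(["a", "b", "c", "a"], dtype="category")
# Merge the levels "b" and "c" into a single "other" level
combined = s.astype("object").replace({"b": "other", "c": "other"}).astype("category")
print(combined.cat.categories)  # Index(['a', 'other'], dtype='object')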
Dealing with Dates and Times Values
• Dates are often provided in different formats and must be converted into datetime objects of a single format before analysis.
• Python provides two methods of formatting date and time:
1. str(): turns a datetime value into a string without any formatting.
2. strftime(): lets the user define how the datetime value should appear after conversion.
1. Using pandas.to_datetime() with a date
import pandas as pd
# input in dd.mm.yyyy format
date = ['21.07.2020']
# output in yyyy-mm-dd format
print(pd.to_datetime(date))
2. Using pandas.to_datetime() with a date and time
import pandas as pd
# date (dd.mm.yyyy) and time (H:MM:SS)
date = ['21.07.2020 11:31:01 AM']
# output in yyyy-mm-dd HH:MM:SS
print(pd.to_datetime(date))
• We can convert a string to datetime using strptime() function. This function is available in datetime and
time modules to parse a string to datetime and time objects respectively.
• Python strptime() is a class method in datetime class. Its syntax is :
datetime.strptime(date_string, format)
• Both arguments are mandatory and must be strings.
import datetime
format = "%a %b %d %H:%M:%S %Y"
today = datetime.datetime.today()
print('ISO:', today)
s = today.strftime(format)
print('strftime:', s)
d = datetime.datetime.strptime(s, format)
print('strptime:', d.strftime(format))
$ python datetime_datetime_strptime.py
ISO: 2013-02-21 06:35:45.707450
strftime: Thu Feb 21 06:35:45 2013
strptime: Thu Feb 21 06:35:45 2013

• Time Zones: Within datetime, time zones are represented by subclasses of tzinfo. Since tzinfo is an
abstract base class, you need to define a subclass and provide appropriate implementations for a few
methods to make it useful.
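Since Python 3.2, the standard library's concrete datetime.timezone class covers the common fixed-offset case, so a custom tzinfo subclass is rarely needed; a minimal sketch:
from datetime import datetime, timezone, timedelta
utc_now = datetime.now(timezone.utc)            # timezone-aware current time in UTC
ist = timezone(timedelta(hours=5, minutes=30))  # a fixed +05:30 offset
print(utc_now.astimezone(ist))                  # the same instant, shown at +05:30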
Missing Data
• Data can have missing values for a number of reasons such as observations that were not recorded and
data corruption. Handling missing data is important as many machine learning algorithms do not support
data with missing values.
• You can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.
# load and summarize the dataset
from pandas import read_csv
# load the dataset
dataset = read_csv('csv file name', header=None)
# summarize the dataset
print(dataset.describe())
• In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN. Values with
a NaN value are ignored from operations like sum, count, etc.
• Use the isnull() method to detect missing values. The Pandas DataFrame function isnull() returns a new dataframe of the same size as the calling dataframe, containing only True and False: True where the original dataframe has NaN, and False elsewhere.
Encoding missingness:
• The fillna() function is used to fill NA/NaN values using the specified method; a short example follows the parameter list below.
• Syntax:
DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
Where
1. value: It is a value that is used to fill the null values.
2. method: A method that is used to fill the null values.
3. axis: It takes int or string value for rows/columns.
4. inplace: If it is True, it fills values at an empty place.
5. limit: It is an integer value that specifies the maximum number of consecutive forward/backward NaN
value fills.
6. downcast: It takes a dict that specifies what to downcast, e.g. float64 to int64.
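A short sketch exercising the value, method and limit parameters on a toy Series (the data is illustrative):
import pandas as pd
import numpy as np
s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.fillna(0))                        # fill every NaN with the value 0
print(s.fillna(method='ffill', limit=1))  # forward-fill at most one consecutive NaN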

DATA INDEXING AND SELECTION


In pandas, data indexing and selection are essential for manipulating and analyzing data efficiently.
Here’s a comprehensive overview of how to work with indexing and selection in pandas.
1. Creating a DataFrame
First, let's create a sample DataFrame to work with:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [24, 27, 22, 32],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
2. Basic Indexing and Selection
Selecting Columns
You can select a column by its name:
# Select a single column
ages = df['Age']
print(ages)
# Select multiple columns
subset = df[['Name', 'City']]
print(subset)
Selecting Rows
You can use .loc[] and .iloc[] for row selection.
• Using .loc[]: This is label-based indexing.
# Select a single row by index label
row_bob = df.loc[1]
print(row_bob)
# Select multiple rows by index labels
rows = df.loc[[0, 2]]
print(rows)
Using .iloc[]: This is position-based indexing.
# Select a single row by position
row_charlie = df.iloc[2]
print(row_charlie)
# Select multiple rows by position

rows = df.iloc[0:2] # Rows 0 and 1
print(rows)
3. Conditional Selection
You can filter rows based on conditions:
# Select rows where Age is greater than 25
adults = df[df['Age'] > 25]
print(adults)
4. Setting Index
You can set a specific column as the index:
df.set_index('Name', inplace=True)
print(df)
# Now you can select by the new index
charlie_data = df.loc['Charlie']
print(charlie_data)
5. Slicing Rows
You can slice rows using .loc[] and .iloc[]:
# Using .loc[] to slice by index labels
slice_loc = df.loc['Alice':'David']
print(slice_loc)
# Using .iloc[] to slice by position
slice_iloc = df.iloc[1:3]
print(slice_iloc)
6. Selecting with .at[] and .iat[]
• .at[]: For fast access to a single value by label.
age_bob = df.at['Bob', 'Age']
print(age_bob)
• .iat[]: For fast access to a single value by position.
age_charlie = df.iat[2, 0] # Row 2 ('Charlie'), Column 0 ('Age', since 'Name' is now the index)
print(age_charlie)
7. Using .query()
The .query() method allows for a more intuitive way to filter data:
# Select rows where Age is less than 30

young_people = df.query('Age < 30')
print(young_people)
Summary Example
Here’s a complete example demonstrating various indexing and selection techniques:
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [24, 27, 22, 32],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)

# Select specific column


print("Ages:\n", df['Age'])

# Select rows based on condition


print("Adults:\n", df[df['Age'] > 25])

# Set index and select by index


df.set_index('Name', inplace=True)
print("Data with 'Name' as index:\n", df)

# Accessing a single value


print("Bob's Age:", df.at['Bob', 'Age'])

# Query for specific conditions


print("Young People:\n", df.query('Age < 30'))

HANDLING MISSING DATA


Handling missing data is an essential part of data cleaning and preprocessing in pandas. Here’s a
comprehensive guide on how to detect, handle, and fill missing values in a pandas DataFrame.
1. Creating a DataFrame with Missing Data
First, let's create a sample DataFrame that contains some missing values:
import pandas as pd
import numpy as np
data = {
'Name': ['Alice', 'Bob', np.nan, 'David'],
'Age': [24, np.nan, 22, 32],
'City': ['New York', 'Los Angeles', 'Chicago', np.nan]
}
df = pd.DataFrame(data)
print(df)
2. Detecting Missing Values
You can use several methods to detect missing values in a DataFrame.
Check for Missing Values
# Check for missing values
print(df.isnull())
Summarize Missing Values
To get a summary of missing values in each column:
# Count of missing values per column
print(df.isnull().sum())
3. Dropping Missing Values
You can drop rows or columns that contain missing values using dropna().
Drop Rows with Any Missing Values
# Drop rows with any missing values
df_dropped = df.dropna()
print(df_dropped)
Drop Columns with Any Missing Values
# Drop columns with any missing values
df_dropped_columns = df.dropna(axis=1)
print(df_dropped_columns)
4. Filling Missing Values
Instead of dropping missing values, you can fill them using fillna().

Fill with a Specific Value
# Fill missing values with a specific value
df_filled = df.fillna({'Name': 'Unknown', 'Age': 0, 'City': 'Unknown'})
print(df_filled)
Fill with Mean, Median, or Mode
You can also fill missing values using statistical measures:
# Fill Age with the mean
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
print(df)
5. Forward and Backward Filling
You can fill missing values using the values from previous or next rows.
Forward Fill
# Forward fill
df_ffill = df.fillna(method='ffill')
print(df_ffill)
Backward Fill
# Backward fill
df_bfill = df.fillna(method='bfill')
print(df_bfill)
6. Interpolating Missing Values
You can use interpolation to estimate missing values based on surrounding data:
# Interpolate missing values
df_interpolated = df.interpolate()
print(df_interpolated)
Summary Example
Here’s a complete example demonstrating how to handle missing data:
import pandas as pd
import numpy as np
# Sample DataFrame with missing values
data = {
'Name': ['Alice', 'Bob', np.nan, 'David'],

'Age': [24, np.nan, 22, 32],
'City': ['New York', 'Los Angeles', 'Chicago', np.nan]
}

df = pd.DataFrame(data)
# Detecting missing values
print("Missing Values Summary:\n", df.isnull().sum())

# Filling missing values with a specific value


df_filled = df.fillna({'Name': 'Unknown', 'Age': 0, 'City': 'Unknown'})
print("\nFilled DataFrame:\n", df_filled)

# Forward filling
df_ffill = df.fillna(method='ffill')
print("\nForward Filled DataFrame:\n", df_ffill)

# Backward filling
df_bfill = df.fillna(method='bfill')
print("\nBackward Filled DataFrame:\n", df_bfill)

# Interpolating missing values


df_interpolated = df.interpolate()
print("\nInterpolated DataFrame:\n", df_interpolated)

HIERARCHICAL INDEXING
• Hierarchical indexing is a method of creating structured group relationships in data.
• A MultiIndex or Hierarchical index comes in when our DataFrame has more than two dimensions. As
we already know, a Series is a one-dimensional labelled NumPy array and a DataFrame is usually a two-
dimensional table whose columns are Series. In some instances, in order to carry out some sophisticated
data analysis and manipulation, our data is presented in higher dimensions.
• A MultiIndex adds at least one more dimension to the data. A Hierarchical Index as the name suggests
is ordering more than one item in terms of their ranking.
• To create a DataFrame with player ratings of a few players from the FIFA 19 dataset:
In [1]: import pandas as pd
In [2]: data = {'Position': ['GK', 'GK', 'GK', 'DF', 'DF', 'DF',
'MF', 'MF', 'MF', 'CF', 'CF', 'CF'],
'Name': ['De Gea', 'Coutois', 'Allison', 'Van Dijk',
'Ramos', 'Godin', 'Hazard', 'Kante', 'De Bruyne', 'Ronaldo',
'Messi', 'Neymar'],
'Overall': ['91', '88', '89', '89', '91', '90', '91', '90', '92', '94', '93', '92'],
'Rank': ['1st', '3rd', '2nd', '3rd', '1st', '2nd', '2nd', '3rd', '1st', '1st', '2nd', '3rd']}
In [3]: fifa19 = pd.DataFrame(data, columns=['Position', 'Name', 'Overall', 'Rank'])
In [4]: fifa19
Out[4]: (the 12-row DataFrame with columns Position, Name, Overall and Rank, shown with the default integer index)
• From above Dataframe, we notice that the index is the default Pandas index; the columns 'Position' and
'Rank' both have values or objects that are repeated. This could sometimes pose a problem for us when
we want to analyse the data. What we would like to do is to use meaningful indexes that uniquely
identify each row and makes it easier to get a sense of the data we are working with. This is where
MultiIndex or Hierarchical Indexing comes in.
• We do this by using the set_index() method. For Hierarchical indexing, we use set_index() method for
passing a list to represent how we want the rows to be identified uniquely.
In [5]: fifa19 = fifa19.set_index(['Position', 'Rank'], drop=False)
In [6]: fifa19
Out[6]: (the DataFrame now indexed by Position and Rank, with the two columns also retained because of drop=False)

• We can see from the code above that we have set our new indexes to 'Position' and 'Rank', but there is a replication of these columns. This is because we passed drop=False, which keeps the columns where they are. The default, however, is drop=True, so without indicating drop=False the two columns will be set as the indexes and deleted automatically.
In [7]: fifa19.set_index(['Position', 'Rank'])
Out[7]:
                     Name  Overall
Position Rank
GK       1st       De Gea       91
         3rd      Coutois       88
         2nd      Allison       89
DF       3rd     Van Dijk       89
         1st        Ramos       91
         2nd        Godin       90
MF       2nd       Hazard       91
         3rd        Kante       90
         1st    De Bruyne       92
CF       1st      Ronaldo       94
         2nd        Messi       93
         3rd       Neymar       92

• We use set_index() with an ordered list of column labels to make the new indexes. To verify that we
have indeed set our DataFrame to a hierarchical index, we call the .index attribute.
In [8]: fifa19 = fifa19.set_index(['Position', 'Rank'])
In [9]: fifa19.index
Out[9]: MultiIndex(levels=[['CF', 'DF', 'GK', 'MF'],
                           ['1st', '2nd', '3rd']],
                   codes=[[2, 2, 2, 1, 1, 1, 3, 3, 3, 0, 0, 0],
                          [0, 2, 1, 2, 0, 1, 1, 2, 0, 0, 1, 2]],
                   names=['Position', 'Rank'])
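With the MultiIndex in place, rows can be selected by the outer level alone or by both levels through .loc; a short sketch continuing the session above:
In [10]: fifa19.loc['GK']           # all goalkeepers (outer level only)
In [11]: fifa19.loc[('GK', '1st')]  # a single row, selected by both levels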

COMBINING DATASETS
• Whether it is to concatenate several datasets from different csv files or to merge sets of aggregated data
from different google analytics accounts, combining data from various sources is critical to drawing the
right conclusions and extracting optimal value from data analytics.
• When using pandas, data scientists often have to concatenate multiple pandas DataFrame; either
vertically (adding lines) or horizontally (adding columns).
DataFrame.append
• This method allows you to add another dataframe to an existing one. While columns with matching names are concatenated together, columns with different labels are filled with NA. (Note: DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pandas.concat, described below, is the recommended replacement.)
>>> df1
   ints  bools
0     0   True
1     1  False
2     2   True
>>> df2
   ints  floats
0     3     1.5
1     4     2.5
2     5     3.5
>>> df1.append(df2)
   ints  bools  floats
0     0   True     NaN
1     1  False     NaN
2     2   True     NaN
0     3    NaN     1.5
1     4    NaN     2.5
2     5    NaN     3.5
• In addition to this, DataFrame.append provides other flexibilities such as resetting the resulting index,
sorting the resulting data or raising an error when the resulting index includes duplicate records.
Pandas.concat
• We can concat dataframes both vertically (axis=0) and horizontally (axis=1) by using the
Pandas.concat function. Unlike DataFrame.append, Pandas.concat is not a method but a function that
takes a list of objects as input. On the other hand, columns with different labels are filled with NA values
as for DataFrame.append.
>>> df3
bools floats
0 False 4.5
1 True 5.5
2 False 6.5
>>> pd.concat([df1, df2, df3])
ints bools floats
0 0.0 True NaN
1 1.0 False NaN
2 2.0 True NaN
0 3.0 NaN 1.5
1 4.0 NaN 2.5
2 5.0 NaN 3.5
0 NaN False 4.5
1 NaN True 5.5
2 NaN False 6.5
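As noted above, the same function concatenates horizontally with axis=1; a minimal sketch reusing df1 and df2, whose rows are aligned on the shared index:
>>> pd.concat([df1, df2], axis=1)
   ints  bools  ints  floats
0     0   True     3     1.5
1     1  False     4     2.5
2     2   True     5     3.5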

AGGREGATION AND GROUPING


• Pandas aggregation methods are as follows:
a) count(): Total number of items
b) first(), last(): First and last item
c) mean(), median(): Mean and median
d) min(), max(): Minimum and maximum
e) std(), var(): Standard deviation and variance
f) mad(): Mean absolute deviation
g) prod(): Product of all items
h) sum(): Sum of all items
• The examples below use a sample file, phone_data.csv, in which each row is one phone record, with columns including date, duration, item, month and network.
• The date column can be parsed using the extremely handy dateutil library.
import pandas as pd
import dateutil.parser
# Load data from the csv file
data = pd.read_csv('phone_data.csv')
# Convert date from string to datetime
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)
• Once the data has been loaded into Python, Pandas makes the calculation of different statistics very
simple. For example, mean, max, min, standard deviations and more for columns are easily calculable:
# How many rows are in the dataset?
data['item'].count()
Out[38]: 830
# What was the longest phone call / data entry?
data['duration'].max()
Out[39]: 10528.0
# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()
Out[40]: 92321.0
# How many entries are there for each month?
data['month'].value_counts()
Out[41]:
2014-11 230
2015-01 205
2014-12 157
2015-02 137
2015-03 101
dtype: int64
# Number of non-null unique network entries
data['network'].nunique()
Out[42]: 9
groupby() function :
• groupby essentially splits the data into different groups depending on a variable of user choice.
• The groupby() function returns a GroupBy object, which essentially describes how the rows of the original dataset have been split. The GroupBy object's groups variable is a dictionary whose keys are the computed unique groups, with the corresponding values being the axis labels belonging to each group; a short example follows the points below.
• Functions like max(), min(), mean(), first(), last() can be quickly applied to the GroupBy object to
obtain summary statistics for each group.
• The GroupBy object supports column indexing in the same way as the DataFrame and returns a
modified GroupBy object.

STRING OPERATIONS

Pandas provides a robust set of string operations that allow you to manipulate text data in DataFrames
easily. Here's a comprehensive guide on common string operations in pandas.

1. Creating a DataFrame with String Data

First, let's create a sample DataFrame containing string data:

import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david@example.com']
}
df = pd.DataFrame(data)
print(df)

2. String Methods

Pandas has a .str accessor that allows you to apply string functions on Series objects.

a. Accessing String Length

You can calculate the length of each string using .str.len():

df['Name Length'] = df['Name'].str.len()

print(df[['Name', 'Name Length']])

b. Converting Case

You can convert strings to upper or lower case:

df['Uppercase Name'] = df['Name'].str.upper()

df['Lowercase City'] = df['City'].str.lower()

print(df[['Name', 'Uppercase Name', 'City', 'Lowercase City']])

c. Stripping Whitespace

You can remove leading and trailing whitespace:

df['Name'] = df['Name'].str.strip()

d. Replacing Substrings

You can replace parts of strings with the .str.replace() method:

df['Email'] = df['Email'].str.replace('@example.com', '@newdomain.com')

print(df['Email'])

3. String Splitting and Joining

a. Splitting Strings

You can split strings into lists using .str.split():

df['City Words'] = df['City'].str.split(' ')

print(df[['City', 'City Words']])

b. Joining Strings
You can join a list of strings into a single string with Python's str.join(); the pandas .str.join() method applies the same operation element-wise to a Series of lists:

city_list = ['New York', 'Los Angeles', 'Chicago', 'Houston']

joined_cities = ', '.join(city_list)

print(joined_cities) # Output: New York, Los Angeles, Chicago, Houston

4. Extracting Substrings

You can extract parts of strings using .str.extract() with regex patterns:

# Extract the username from the email

df['Username'] = df['Email'].str.extract(r'([^@]+)', expand=False)

print(df[['Email', 'Username']])

5. Finding and Checking Substrings

a. Checking if Strings Contain a Substring

You can check if a substring is present using .str.contains():

df['Has New'] = df['City'].str.contains('New')

print(df[['City', 'Has New']])

b. Finding Positions of Substrings

You can find the position of a substring using .str.find():

df['Position of "a"'] = df['Name'].str.find('a')

print(df[['Name', 'Position of "a"']])

6. String Length and Counting

You can count occurrences of a substring using .str.count():

df['Count of "a"'] = df['Name'].str.count('a')

print(df[['Name', 'Count of "a"']])

Summary Example

Here's a complete example demonstrating various string operations:

import pandas as pd

# Sample DataFrame

data = {
'Name': [' Alice ', 'Bob', 'Charlie', 'David'],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com', 'david@example.com']
}
df = pd.DataFrame(data)

# Strip whitespace

df['Name'] = df['Name'].str.strip()

# Convert to uppercase

df['Uppercase Name'] = df['Name'].str.upper()

# Replace email domain

df['Email'] = df['Email'].str.replace('@example.com', '@newdomain.com')

# Extract username from email

df['Username'] = df['Email'].str.extract(r'([^@]+)', expand=False)

# Check for a substring

df['Has New'] = df['City'].str.contains('New')

# Count occurrences of 'a'

df['Count of "a"'] = df['Name'].str.count('a')

# Display the final DataFrame

print(df)

WORKING WITH TIME SERIES

Working with time series data in pandas is straightforward and powerful, allowing for easy
manipulation, analysis, and visualization of time-based data. Here’s a comprehensive guide on how to
work with time series in pandas.

1. Creating a Time Series

You can create a time series using pd.date_range() and pd.Series():

import pandas as pd

# Create a date range

date_range = pd.date_range(start='2024-01-01', periods=5, freq='D')

# Create a time series

data = [100, 200, 300, 400, 500]

time_series = pd.Series(data, index=date_range)

print(time_series)

2. Date and Time Data Types

Pandas has specific data types for date and time:

# Convert a string to datetime

date_str = '2024-01-01'

date_time = pd.to_datetime(date_str)

print(date_time)

# Check the type

print(type(date_time)) # Output: <class 'pandas._libs.tslibs.timestamps.Timestamp'>

3. Indexing and Slicing Time Series

You can slice a time series using date indexing:

# Slice the time series

print(time_series['2024-01-02':'2024-01-04'])

4. Resampling Time Series

Resampling allows you to change the frequency of your time series data.

a. Downsampling

You can downsample to a lower frequency:

# Create a time series with daily frequency

date_range = pd.date_range(start='2024-01-01', periods=10, freq='D')

data = [i * 10 for i in range(10)]

time_series = pd.Series(data, index=date_range)

# Downsample to a weekly frequency and calculate the mean

weekly_mean = time_series.resample('W').mean()

print(weekly_mean)

b. Upsampling

You can also upsample to a higher frequency:

# Upsample to hourly frequency and fill with forward fill


hourly_data = time_series.resample('H').ffill()

print(hourly_data)

5. Handling Missing Data in Time Series

You can handle missing data using methods like forward fill (ffill) or backward fill (bfill):

# Introducing missing values

time_series_with_nan = time_series.reindex(pd.date_range(start='2024-01-01', end='2024-01-10', freq='D'))

# Forward fill missing values

filled_data = time_series_with_nan.ffill()

print(filled_data)

6. Time Series Analysis

You can perform various analyses, such as calculating moving averages:

# Calculate a rolling mean (moving average) with a window of 3 days

rolling_mean = time_series.rolling(window=3).mean()

print(rolling_mean)

7. Time Zone Handling

Pandas makes it easy to work with time zones:

# Localize to a specific time zone

localized_ts = time_series.tz_localize('UTC')

print(localized_ts)

# Convert to another time zone

converted_ts = localized_ts.tz_convert('America/New_York')

print(converted_ts)

8. Plotting Time Series

Pandas integrates well with Matplotlib for plotting time series data:

import matplotlib.pyplot as plt

# Plot the time series

time_series.plot(title='Time Series Data', xlabel='Date', ylabel='Values')

plt.show()

Summary Example

Here’s a complete example demonstrating various time series operations:

import pandas as pd

import matplotlib.pyplot as plt

# Create a time series

date_range = pd.date_range(start='2024-01-01', periods=10, freq='D')

data = [i * 10 for i in range(10)]

time_series = pd.Series(data, index=date_range)

# Resample to weekly frequency

weekly_mean = time_series.resample('W').mean()

# Plot the original and resampled time series

plt.figure(figsize=(12, 6))

plt.plot(time_series, label='Daily Data', marker='o')

plt.plot(weekly_mean, label='Weekly Mean', marker='s', linestyle='--')

plt.title('Time Series Example')

plt.xlabel('Date')

plt.ylabel('Values')

plt.legend()

plt.show()

HIGH PERFORMANCE

Achieving high performance in pandas involves optimizing data manipulation and analysis tasks to
minimize memory usage and processing time. Here are several strategies and best practices to enhance
the performance of your pandas operations:
1. Use Appropriate Data Types
Using the most suitable data types can significantly reduce memory usage.
• Convert to Categorical: If you have string columns with repeated values, convert them to the
category type.
df['Category'] = df['Category'].astype('category')
• Use Smaller Numeric Types: For integer and float columns, use smaller data types if possible
(int32, float32, etc.).
df['int_column'] = df['int_column'].astype('int32')
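To confirm that such conversions actually help, memory usage can be inspected before and after with DataFrame.memory_usage(deep=True); a small sketch (the column name and data are illustrative):
import pandas as pd
df = pd.DataFrame({'Category': ['a', 'b', 'a', 'b'] * 250_000})
print(df.memory_usage(deep=True).sum())  # total bytes with object dtype
df['Category'] = df['Category'].astype('category')
print(df.memory_usage(deep=True).sum())  # far fewer bytes as category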
2. Efficient Data Importing
When reading large datasets, use the dtype parameter to specify data types and usecols to load only
necessary columns.
df = pd.read_csv('large_file.csv', dtype={'col1': 'category', 'col2': 'float32'}, usecols=['col1', 'col2'])
3. Vectorized Operations
Pandas is optimized for vectorized operations, so avoid using loops over DataFrame rows.
# Instead of this
for i in range(len(df)):
    df['new_col'][i] = df['col1'][i] + df['col2'][i]
# Use this
df['new_col'] = df['col1'] + df['col2']
4. Use apply() Wisely
While apply() is useful, it can be slow for large DataFrames. Prefer built-in pandas functions whenever
possible.
# Instead of using apply
df['new_col'] = df['col1'].apply(lambda x: x + 10)
# Use vectorized operation
df['new_col'] = df['col1'] + 10
5. Chunking for Large Datasets
If the dataset is too large to fit into memory, read it in chunks.
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    process(chunk)  # Your processing function
6. Use Efficient Merging and Joining
When merging DataFrames, ensure the keys you’re joining on are indexed to speed up the operation.
df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)
result = df1.join(df2, how='inner')
7. Avoid Copies
Be mindful of operations that create unnecessary copies of DataFrames. Use inplace=True where
applicable to modify DataFrames without creating new ones.
# Instead of creating a new DataFrame

df = df.drop(columns=['unnecessary_col'])

# Use inplace
df.drop(columns=['unnecessary_col'], inplace=True)
8. Profile Your Code
Use the %time and %%time magic commands in Jupyter notebooks, or the time module, to measure the performance of different approaches.
%time df['new_col'] = df['col1'] + df['col2']
9. Use Dask for Larger-than-Memory Computation
If your data exceeds memory limits, consider using Dask, a parallel computing library that integrates
with pandas.
import dask.dataframe as dd
df = dd.read_csv('large_file.csv')
result = df.groupby('column').sum().compute()
10. Parallelize Operations with modin
You can use Modin, a library that provides a drop-in replacement for pandas, allowing operations to run
in parallel.
import modin.pandas as mpd
df = mpd.read_csv('large_file.csv')
