Dsf - Unit II Notes
SYLLABUS
UNIT II DATA MANIPULATION
Python Shell – Jupyter Notebook – IPython Magic Commands – NumPy Arrays – Universal
Functions – Aggregations – Computation on Arrays – Fancy Indexing – Sorting arrays –
Structured data – Data manipulation with Pandas – Data Indexing and Selection – Handling
missing data – Hierarchical indexing – Combining datasets – Aggregation and Grouping – String
operations – Working with time series – High performance
PYTHON SHELL
The Python shell, also known as the Python interactive interpreter, is a command-line interface where
you can execute Python code interactively. It allows you to type in Python commands and see the results
immediately, making it an excellent tool for testing small code snippets, debugging, or learning Python.
Key Features of the Python Shell:
1. Interactive Environment: You can write and run Python commands one at a time. The results
are displayed immediately, which is great for experimentation.
2. Quick Feedback: The shell provides instant feedback on your code, making it easy to test ideas
and learn syntax.
3. Data Types and Structures: You can easily create and manipulate variables, lists, dictionaries,
and other data structures.
4. Import Libraries: You can import and use Python libraries right from the shell, allowing you to
leverage existing functionality.
5. Function Definition: You can define functions and test them on the fly.
6. Exit Easily: You can exit the shell whenever you want by using commands like exit() or pressing
Ctrl + Z (Windows) or Ctrl + D (Mac/Linux).
How to Access the Python Shell:
• Command Line: Open a terminal or command prompt and type python or python3, depending
on your installation.
• IDEs: Many integrated development environments (IDEs), like PyCharm or Jupyter Notebook,
provide a Python shell or console as part of their features.
Example Interaction:
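A minimal illustrative session (the >>> prompt is printed by the interpreter):
>>> x = 5
>>> y = 3
>>> x + y
8
>>> name = "Python"
>>> print(f"Hello, {name}!")
Hello, Python!
>>> exit()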
JUPYTER NOTEBOOK
Jupyter Notebook is an open-source web application that allows you to create and share documents
containing live code, equations, visualizations, and narrative text. It is widely used in data science,
machine learning, and scientific research for its interactive features and ease of use.
Key Features of Jupyter Notebook:
1. Interactive Computing: You can write and execute code in chunks (called cells) and see the
results immediately, making it ideal for experimenting with code and data.
2. Support for Multiple Languages: While it primarily supports Python, Jupyter can work with
many programming languages (such as R, Julia, and Scala) through the use of different kernels.
3. Rich Text Support: You can combine code with rich text elements, such as Markdown, to
document your process, explain your code, or present results in a clear and readable format.
4. Data Visualization: It supports various libraries like Matplotlib, Seaborn, and Plotly, making it
easy to create and display graphs and visualizations inline.
5. Export Options: You can export your notebooks to various formats, including HTML, PDF, and
Markdown, for sharing or publishing.
6. Notebook Sharing: Jupyter Notebooks can be shared easily with others. You can share the
notebook file directly or host it on platforms like GitHub or JupyterHub.
Launching Jupyter Notebook: Once installed (for example, with pip install notebook), you can
launch it from the command line by typing:
jupyter notebook
This will open a new tab in your web browser showing the Jupyter Notebook interface.
Example of a Jupyter Notebook Cell:
Here's a simple example of what you might see in a Jupyter Notebook:
import numpy as np
import matplotlib.pyplot as plt
# Generate sample data for the plot
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)
# Create a plot
plt.plot(x, y)
plt.title("Sine Wave")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()
IPYTHON MAGIC COMMANDS
IPython magic commands are special commands in IPython and Jupyter Notebooks that enhance the
interactive computing experience. They provide a more convenient way to perform various tasks
than standard Python syntax allows. These commands start with one or two percent signs (% or %%).
Line Magics
%time: Times the execution of a single statement.
%time sum(range(10000))
%timeit: Provides a more accurate measurement of execution time by running the statement
multiple times.
%timeit sum(range(10000))
%load: Loads a Python script into the current cell.
%load my_script.py
%matplotlib inline: Ensures that Matplotlib plots are displayed inline within the notebook.
%matplotlib inline
%who: Lists all variables in the current namespace.
%who
%history: Shows command history.
%history
Cell Magics
• %%time: Times the execution of the entire cell.
%%time
total = 0
for i in range(10000):
    total += i
IPython magic commands provide a powerful way to enhance your workflow in Jupyter Notebooks.
They simplify common tasks and offer functionalities that go beyond standard Python capabilities,
making your coding experience more efficient and effective.
NUMPY ARRAYS
NumPy (Numerical Python) is a powerful library in Python that provides support for large, multi-
dimensional arrays and matrices, along with a collection of mathematical functions to operate on these
arrays. NumPy arrays are the fundamental building blocks for numerical computations in Python.
Key Features of NumPy Arrays
1. Homogeneous Data: All elements in a NumPy array must be of the same data type, which
allows for efficient storage and computation.
2. N-dimensional: NumPy arrays can be one-dimensional (1D), two-dimensional (2D), or multi-
dimensional (ND), enabling complex data structures like matrices and tensors.
3. Fast and Efficient: NumPy operations are optimized for performance, making them faster than
traditional Python lists for numerical operations.
4. Element-wise Operations: You can perform operations on entire arrays without needing to write
loops, which leads to cleaner and more readable code.
5. Broadcasting: This feature allows NumPy to perform arithmetic operations on arrays of
different shapes, making it easier to work with arrays of varying dimensions.
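For instance, a minimal sketch of broadcasting (with made-up values):
import numpy as np
matrix = np.ones((3, 3))   # shape (3, 3)
row = np.array([1, 2, 3])  # shape (3,)
result = matrix + row      # row is broadcast across each row of matrix
# result:
# [[2. 3. 4.]
#  [2. 3. 4.]
#  [2. 3. 4.]]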
Creating NumPy Arrays
To use NumPy, you first need to import it:
import numpy as np
Here are some common ways to create NumPy arrays:
1. From a Python List:
arr = np.array([1, 2, 3, 4])
2. Using Built-in Functions:
o Zeros: Create an array filled with zeros.
zeros = np.zeros((3, 4)) # 3 rows, 4 columns
o Ones: Create an array filled with ones.
ones = np.ones((2, 3)) # 2 rows, 3 columns
o Full: Create an array filled with a specified value.
full_array = np.full((2, 2), 7) # 2x2 array filled with 7
o Arange: Create an array with a range of values.
arange_array = np.arange(0, 10, 2) # [0, 2, 4, 6, 8]
o Linspace: Create an array with evenly spaced values over a specified range.
linspace_array = np.linspace(0, 1, 5) # [0.0, 0.25, 0.5, 0.75, 1.0]
Accessing and Modifying Arrays
• Indexing: Access elements using zero-based indexing.
first_element = arr[0] # Access the first element
• Slicing: Get a sub-array.
sub_array = arr[1:3] # Access elements from index 1 to 2
• Shape: Get the shape of the array.
shape = arr.shape # Returns a tuple of dimensions, e.g. (rows, columns) for a 2D array
• Reshaping: Change the shape of an array without changing its data.
reshaped_array = arr.reshape((2, 2))
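A small end-to-end sketch combining these operations (using the four-element array created earlier):
import numpy as np
arr = np.array([1, 2, 3, 4])
first_element = arr[0]   # 1
sub_array = arr[1:3]     # [2, 3]
shape = arr.shape        # (4,)
reshaped_array = arr.reshape((2, 2))
# reshaped_array:
# [[1, 2],
#  [3, 4]]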
Common Operations
1. Mathematical Operations: Perform element-wise operations.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
sum_array = arr1 + arr2 # [5, 7, 9]
2. Statistical Functions: Calculate statistics like mean, median, and standard deviation.
mean = np.mean(arr) # Calculate mean
std_dev = np.std(arr) # Calculate standard deviation
AGGREGATIONS IN NUMPY
Aggregation functions in NumPy are used to compute a single value from an array of values. These
functions allow you to summarize or reduce data, making it easier to analyze and interpret. Common
aggregation functions include sum, mean, median, standard deviation, and more.
Key Aggregation Functions
Here are some commonly used aggregation functions in NumPy:
1. Sum
• np.sum: Computes the sum of array elements.
import numpy as np
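A short illustrative example (values are made up):
arr = np.array([1, 2, 3, 4, 5])
print(np.sum(arr))     # 15
print(np.mean(arr))    # 3.0
print(np.median(arr))  # 3.0
print(np.std(arr))     # 1.414..., the standard deviation
print(np.min(arr), np.max(arr))  # 1 5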
COMPUTATIONS ON ARRAYS
• Computation on NumPy arrays can be very fast, or it can be very slow. Fast computation is possible
through vectorized operations, which are implemented using NumPy's universal functions (ufuncs).
• A universal function (ufunc) is a function that operates on ndarrays in an element-by-element fashion,
supporting array broadcasting, type casting, and several other standard features. A ufunc is a
"vectorized" wrapper for a function that takes a fixed number of specific inputs and produces a fixed
number of specific outputs.
• Functions that work on both scalars and arrays are known as ufuncs. For arrays, ufuncs apply the
function in an element-wise fashion. Using ufuncs is an essential aspect of vectorization and is typically
much more computationally efficient than an explicit loop over each element.
NumPy's Ufuncs:
• Ufuncs are of two types: unary ufuncs and binary ufuncs.
• Unary ufuncs operate on a single input; binary ufuncs operate on two inputs.
Example:
import numpy as np
arr = np.array([0, 30, 60, 90])
# Convert the angles from degrees to radians
arr = arr * np.pi / 180
print("\nThe sin value of angles:")
print(np.sin(arr))
print("\nThe cos value of angles:")
print(np.cos(arr))
print("\nThe tan value of angles:")
print(np.tan(arr))
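np.sin, np.cos and np.tan above are unary ufuncs. A minimal sketch of binary ufuncs (illustrative values):
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(np.add(a, b))       # [11 22 33]  (same as a + b)
print(np.multiply(a, b))  # [10 40 90]  (same as a * b)
print(np.maximum(a, b))   # [10 20 30]  (element-wise maximum)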
FANCY INDEXING
Fancy indexing in NumPy allows you to access and manipulate array elements using integer arrays or
boolean arrays. Here’s a quick overview:
1. Integer Array Indexing
You can use an array of indices to access specific elements:
import numpy as np
arr = np.array([10, 20, 30, 40, 50])
indices = np.array([1, 3, 4])
result = arr[indices]
# Output: [20, 40, 50]
2. Multi-Dimensional Fancy Indexing
For multi-dimensional arrays, you can use two arrays: one for row indices and one for column indices.
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
row_indices = np.array([0, 2])
col_indices = np.array([1, 2])
result = arr_2d[row_indices, col_indices]
# Output: [2, 9]
3. Boolean Indexing
You can also use boolean arrays to filter elements that meet a condition.
arr = np.array([10, 20, 30, 40, 50])
condition = arr > 30
filtered = arr[condition]
# Output: [40, 50]
4. Setting Values with Fancy Indexing
You can modify elements using fancy indexing as well:
arr = np.array([10, 20, 30, 40, 50])
arr[[1, 3]] = [99, 88] # Set elements at index 1 and 3
# Output: [10, 99, 30, 88, 50]
Example Summary
Here’s a complete example demonstrating fancy indexing:
import numpy as np
# Create an array
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# Fancy indexing
rows = np.array([0, 2])
cols = np.array([1, 2])
selected_elements = data[rows, cols] # Output: [2, 9]
# Boolean indexing
condition = data > 5
filtered_elements = data[condition] # Output: [6, 7, 8, 9]
print("Selected Elements:", selected_elements)
print("Filtered Elements:", filtered_elements)
SORTING ARRAYS
Sorting arrays in NumPy is straightforward and can be done using several functions. Here’s an overview
of how to sort arrays, including both one-dimensional and multi-dimensional arrays.
1. Sorting a 1D Array
You can sort a one-dimensional array using np.sort() or the sort() method of the array.
import numpy as np
arr = np.array([3, 1, 4, 1, 5, 9, 2])
sorted_arr = np.sort(arr) # Returns a sorted copy
print(sorted_arr) # Output: [1, 1, 2, 3, 4, 5, 9]
# Using the sort method (in-place)
arr.sort() # Sorts in place
print(arr) # Output: [1, 1, 2, 3, 4, 5, 9]
2. Sorting a 2D Array
When sorting a 2D array, you can specify the axis along which to sort.
arr_2d = np.array([[3, 1, 4], [1, 5, 9], [2, 6, 5]])
# Sort within each row (axis=1; np.sort's default axis is -1, the last axis)
sorted_rows = np.sort(arr_2d, axis=1)
print(sorted_rows)
# Output:
# [[1, 3, 4],
# [1, 5, 9],
# [2, 5, 6]]
# Sort along the columns (axis=0)
sorted_cols = np.sort(arr_2d, axis=0)
print(sorted_cols)
# Output:
# [[1, 1, 4],
# [2, 5, 5],
# [3, 6, 9]]
3. Getting the Indices of Sorted Elements
You can use np.argsort() to get the indices that would sort the array (using the original, unsorted arr):
arr = np.array([3, 1, 4, 1, 5, 9, 2])
indices = np.argsort(arr)
print(indices) # Output: [1 3 6 0 2 4 5]
4. Sorting with Custom Order
If you want to sort based on a custom criterion, note that NumPy's sorting functions do not accept a
key parameter (that belongs to Python's built-in sorted()). Instead, compute the key values and sort
with np.argsort, as in the sketch below.
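A minimal sketch, sorting by absolute value (illustrative values):
arr = np.array([-5, 2, -1, 4, -3])
order = np.argsort(np.abs(arr))  # indices that sort by |x|
print(arr[order])  # Output: [-1  2 -3  4 -5]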
5. Sorting with np.lexsort
You can use np.lexsort for sorting based on multiple keys.
# Example: Sort by second column, then by first column
data = np.array([[3, 1], [1, 2], [4, 1], [1, 3]])
sorted_indices = np.lexsort((data[:, 0], data[:, 1]))  # the last key is the primary sort key
sorted_data = data[sorted_indices]
print(sorted_data)
# Output:
# [[3, 1],
#  [4, 1],
#  [1, 2],
#  [1, 3]]
Summary Example
Here’s a complete example that showcases sorting:
import numpy as np
# Create a 2D array
data = np.array([[3, 2, 5], [1, 0, 4], [8, 7, 6]])
# Sort the entire array (flattened)
sorted_flattened = np.sort(data, axis=None)
print("Sorted (flattened):", sorted_flattened)
# Sort by rows
sorted_by_rows = np.sort(data, axis=1)
print("Sorted by rows:\n", sorted_by_rows)
# Sort by columns
sorted_by_cols = np.sort(data, axis=0)
print("Sorted by columns:\n", sorted_by_cols)
# Get sorted indices
sorted_indices = np.argsort(data.flatten())
print("Sorted indices:", sorted_indices)
STRUCTURED ARRAYS
• A structured NumPy array is an array of structures. NumPy arrays are homogeneous, i.e. they can
contain data of only one type, so instead of creating a NumPy array of ints or floats, we can create a
NumPy array of homogeneous structures.
• First of all, import the numpy module:
import numpy as np
• Now to create a structure numpy array we can pass a list of tuples containing the structure elements i.e.
[('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)]
• But as the elements of a NumPy array are homogeneous, how will the size and type of the structure
be decided? For that we need to pass the type of the above structure, i.e. its schema, in the dtype parameter.
• Let's create a dtype for the above structure:
# Creating the type of a structure
dtype = [('Name', (np.str_, 10)), ('Marks', np.float64), ('GradeLevel', np.int32)]
• Let's create a numpy array based on this dtype i.e.
# Creating a structured NumPy array
structuredArr = np.array([('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)],
dtype=dtype)
• It will create a structured numpy array and its contents will be,
[('Ram', 22.2, 3), ('Rutu', 39.4, 5), ('Rupu', 55.5, 6), ('Iresh', 99.9, 7)]
• Let's check the data type of the above created NumPy array:
print(structuredArr.dtype)
Output:
[('Name', '<U10'), ('Marks', '<f8'), ('GradeLevel', '<i4')]
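Individual fields can then be accessed by name; a short sketch using the array above:
# Access a single field as a column
print(structuredArr['Name'])  # ['Ram' 'Rutu' 'Rupu' 'Iresh']
# Boolean filtering on a field
print(structuredArr[structuredArr['Marks'] > 50])  # the Rupu and Iresh records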
DATA MANIPULATION WITH PANDAS
Drop duplicates
• Drop duplicate rows:
df.drop_duplicates()
• Drop duplicates in the first_name column, but keep the last observation in the duplicated set:
df.drop_duplicates(['first_name'], keep='last')
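A minimal sketch (hypothetical first_name data):
import pandas as pd
df = pd.DataFrame({'first_name': ['Asha', 'Ravi', 'Asha'],
                   'score': [80, 75, 90]})
print(df.drop_duplicates())  # all rows differ, so nothing is dropped
print(df.drop_duplicates(['first_name'], keep='last'))  # keeps Ravi and the second Asha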
Creating a Data Map and Data Plan
• A data map gives an overview of the dataset. The data map is used for finding potential problems in
the data, such as redundant variables, possible errors, missing values, and needed variable transformations.
HANDLING MISSING DATA
import pandas as pd
import numpy as np
# Hypothetical data containing missing values
data = {'A': [1, 2, np.nan, 4], 'B': [np.nan, 2, 3, np.nan]}
df = pd.DataFrame(data)
# Detecting missing values
print("Missing Values Summary:\n", df.isnull().sum())
# Forward filling
df_ffill = df.ffill()  # equivalent to fillna(method='ffill'), which is deprecated in newer pandas
print("\nForward Filled DataFrame:\n", df_ffill)
# Backward filling
df_bfill = df.bfill()  # equivalent to fillna(method='bfill')
print("\nBackward Filled DataFrame:\n", df_bfill)
HIERARCHICAL INDEXING
• Hierarchical indexing is a method of creating structured group relationships in data.
• A MultiIndex or Hierarchical index comes in when our DataFrame has more than two dimensions. As
we already know, a Series is a one-dimensional labelled NumPy array and a DataFrame is usually a two-
dimensional table whose columns are Series. In some instances, in order to carry out some sophisticated
data analysis and manipulation, our data is presented in higher dimensions.
• A MultiIndex adds at least one more dimension to the data. A Hierarchical Index as the name suggests
is ordering more than one item in terms of their ranking.
• Let's create a DataFrame with player ratings of a few players from the FIFA 19 dataset.
In [1]: import pandas as pd
In [2]: data = {'Position': ['GK', 'GK', 'GK', 'DF', 'DF', 'DF',
'MF', 'MF', 'MF', 'CF', 'CF', 'CF'],
'Name': ['De Gea', 'Courtois', 'Alisson', 'Van Dijk',
'Ramos', 'Godin', 'Hazard', 'Kante', 'De Bruyne', 'Ronaldo',
'Messi', 'Neymar'],
'Overall': ['91', '88', '89', '89', '91', '90', '91', '90', '92', '94', '93', '92'],
'Rank': ['1st', '3rd', '2nd', '3rd', '1st', '2nd', '2nd', '3rd', '1st', '1st', '2nd', '3rd']}
In [3]: fifa19 = pd.DataFrame(data, columns=['Position', 'Name', 'Overall', 'Rank'])
In [4]: fifa19
Out[4]:
• From the above DataFrame, we notice that the index is the default Pandas index; the columns 'Position'
and 'Rank' both have values or objects that are repeated. This could sometimes pose a problem when we
want to analyse the data. What we would like to do is use meaningful indexes that uniquely identify each
row and make it easier to get a sense of the data we are working with. This is where MultiIndex or
Hierarchical Indexing comes in.
• We do this by using the set_index() method, passing a list of columns that represents how we want the
rows to be identified uniquely.
In [5]: fifa19 = fifa19.set_index(['Position', 'Rank'], drop=False)
In [6]: fifa19
Out[6]:
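With the MultiIndex in place, rows can be selected by level values (a minimal sketch, assuming the fifa19 DataFrame above):
# All goalkeepers
fifa19.loc['GK']
# The 1st-ranked goalkeeper
fifa19.loc[('GK', '1st')]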
COMBINING DATASETS
• Whether it is to concatenate several datasets from different csv files or to merge sets of aggregated data
from different google analytics accounts, combining data from various sources is critical to drawing the
right conclusions and extracting optimal value from data analytics.
• When using pandas, data scientists often have to concatenate multiple pandas DataFrame; either
vertically (adding lines) or horizontally (adding columns).
DataFrame.append
• This method allows appending another DataFrame to an existing one. While columns with matching
names are concatenated together, columns with different labels are filled with NA. Note that
DataFrame.append was deprecated and has been removed in pandas 2.0; pd.concat (described below) is
the recommended replacement.
>>>df1
   ints  bools
0     0   True
1     1  False
2     2   True
>>> df2
ints floats
0 3 1.5
1 4 2.5
2 5 3.5
>>> df1.append(df2)
ints bools floats
0 0 True NaN
1 1 False NaN
2 2 True NaN
0 3 NaN 1.5
1 4 NaN 2.5
2 5 NaN 3.5
• In addition to this, DataFrame.append provides other flexibilities such as resetting the resulting index,
sorting the resulting data or raising an error when the resulting index includes duplicate records.
Pandas.concat
• We can concat dataframes both vertically (axis=0) and horizontally (axis=1) by using the
Pandas.concat function. Unlike DataFrame.append, Pandas.concat is not a method but a function that
takes a list of objects as input. As with DataFrame.append, columns with different labels are filled
with NA values.
>>> df3
bools floats
0 False 4.5
1 True 5.5
2 False 6.5
>>>pd.concat([df1, df2, df3])
ints bools floats
0 0.0 True NaN
1 1.0 False NaN
2 2.0 True NaN
0 3.0 NaN 1.5
1 4.0 NaN 2.5
2 5.0 NaN 3.5
0 NaN False 4.5
1 NaN True 5.5
2 NaN False 6.5
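Horizontal concatenation is selected with axis=1; a minimal sketch combining df1 and df3 from above (note the duplicated bools column label):
>>> pd.concat([df1, df3], axis=1)
   ints  bools  bools  floats
0     0   True  False     4.5
1     1  False   True     5.5
2     2   True  False     6.5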
AGGREGATION AND GROUPING
• The date column can be parsed using the extremely handy dateutil library.
import pandas as pd
import dateutil
# Load data from a csv file (DataFrame.from_csv has been removed from pandas; use read_csv)
data = pd.read_csv('phone_data.csv')
# Convert date from string to datetime
data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True)
• Once the data has been loaded into Python, Pandas makes the calculation of different statistics very
simple. For example, mean, max, min, standard deviations and more for columns are easily calculable:
# How many rows are in the dataset?
data['item'].count()
Out[38]: 830
# What was the longest phone call / data entry?
data['duration'].max()
Out[39]: 10528.0
# How many seconds of phone calls are recorded in total?
data['duration'][data['item'] == 'call'].sum()
Out[40]: 92321.0
# How many entries are there for each month?
data['month'].value_counts()
Out[41]:
2014-11 230
2015-01 205
2014-12 157
2015-02 137
2015-03 101
dtype: int64
# Number of non-null unique network entries
data['network'].nunique()
Out[42]: 9
The groupby() function:
• groupby essentially splits the data into different groups depending on a variable of the user's choice.
• The groupby() function returns a GroupBy object, which essentially describes how the rows of the
original dataset have been split. The GroupBy object's groups attribute is a dictionary whose keys are the
computed unique groups and whose values are the axis labels belonging to each group.
• Functions like max(), min(), mean(), first(), last() can be quickly applied to the GroupBy object to
obtain summary statistics for each group.
• The GroupBy object supports column indexing in the same way as the DataFrame and returns a
modified GroupBy object.
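A short sketch on the phone data loaded above (assuming the month, item, duration and date columns shown earlier):
# Mean duration of calls per month
data[data['item'] == 'call'].groupby('month')['duration'].mean()
# Number of entries per month and item type
data.groupby(['month', 'item'])['date'].count()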
STRING OPERATIONS
Pandas provides a robust set of string operations that allow you to manipulate text data in DataFrames
easily. Here's a comprehensive guide on common string operations in pandas.
1. Creating a Sample DataFrame
import pandas as pd
# Sample DataFrame (hypothetical names and emails)
data = {
    'Name': ['  alice  ', 'Bob ', ' charlie'],
    'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com']
}
df = pd.DataFrame(data)
print(df)
2. String Methods
Pandas has a .str accessor that allows you to apply string functions on Series objects. The one-liners
below are illustrative examples on the hypothetical data above.
a. Converting Case
df['Name'] = df['Name'].str.title()
b. Stripping Whitespace
df['Name'] = df['Name'].str.strip()
c. Replacing Substrings
df['Email'] = df['Email'].str.replace('example.com', 'sample.org')
print(df['Email'])
3. Splitting and Joining Strings
a. Splitting Strings
df['Domain'] = df['Email'].str.split('@').str[1]
b. Joining Strings
You can join a list of strings into a single string using .str.join():
df['Rejoined'] = df['Email'].str.split('@').str.join(' at ')
4. Extracting Substrings
You can extract parts of strings using .str.extract() with regex patterns:
df['Username'] = df['Email'].str.extract(r'([^@]+)')
print(df[['Email', 'Username']])
Summary Example
import pandas as pd
# Sample DataFrame (hypothetical values)
data = {
    'Name': ['  alice  ', 'Bob ', ' charlie'],
    'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com']
}
df = pd.DataFrame(data)
# Strip whitespace
df['Name'] = df['Name'].str.strip()
# Convert to uppercase
df['Name'] = df['Name'].str.upper()
# Extract the username from the email address
df['Username'] = df['Email'].str.extract(r'([^@]+)')
print(df)
WORKING WITH TIME SERIES
Working with time series data in pandas is straightforward and powerful, allowing for easy
manipulation, analysis, and visualization of time-based data. Here’s a comprehensive guide on how to
work with time series in pandas.
import pandas as pd
import numpy as np
# Create a daily time series (illustrative values)
dates = pd.date_range('2024-01-01', periods=10, freq='D')
time_series = pd.Series(np.arange(10.0), index=dates)
print(time_series)
You can convert date strings to pandas Timestamps with pd.to_datetime():
date_str = '2024-01-01'
date_time = pd.to_datetime(date_str)
print(date_time)
Date-based slicing works directly on the DatetimeIndex:
print(time_series['2024-01-02':'2024-01-04'])
Resampling allows you to change the frequency of your time series data.
a. Downsampling
weekly_mean = time_series.resample('W').mean()
print(weekly_mean)
b. Upsampling
hourly_data = time_series.resample('H').ffill()  # newer pandas prefers the lowercase alias 'h'
print(hourly_data)
You can handle missing data using methods like forward fill (ffill) or backward fill (bfill):
# Illustrative series with a missing value
time_series_with_nan = time_series.copy()
time_series_with_nan.iloc[2] = np.nan
filled_data = time_series_with_nan.ffill()
print(filled_data)
rolling_mean = time_series.rolling(window=3).mean()
print(rolling_mean)
localized_ts = time_series.tz_localize('UTC')
print(localized_ts)
converted_ts = localized_ts.tz_convert('America/New_York')
print(converted_ts)
Pandas integrates well with Matplotlib for plotting time series data:
import matplotlib.pyplot as plt
# Plot the daily series against its weekly mean
weekly_mean = time_series.resample('W').mean()
plt.figure(figsize=(12, 6))
plt.plot(time_series, label='Daily values')
plt.plot(weekly_mean, label='Weekly mean')
plt.xlabel('Date')
plt.ylabel('Values')
plt.legend()
plt.show()
HIGH PERFORMANCE
Achieving high performance in pandas involves optimizing data manipulation and analysis tasks to
minimize memory usage and processing time. Here are several strategies and best practices to enhance
the performance of your pandas operations:
1. Use Appropriate Data Types
Using the most suitable data types can significantly reduce memory usage.
• Convert to Categorical: If you have string columns with repeated values, convert them to the
category type.
df['Category'] = df['Category'].astype('category')
• Use Smaller Numeric Types: For integer and float columns, use smaller data types if possible
(int32, float32, etc.).
df['int_column'] = df['int_column'].astype('int32')
2. Efficient Data Importing
When reading large datasets, use the dtype parameter to specify data types and usecols to load only
necessary columns.
df = pd.read_csv('large_file.csv', dtype={'col1': 'category', 'col2': 'float32'}, usecols=['col1', 'col2'])
3. Vectorized Operations
Pandas is optimized for vectorized operations, so avoid using loops over DataFrame rows.
# Instead of this
for i in range(len(df)):
    df['new_col'][i] = df['col1'][i] + df['col2'][i]
# Use this
df['new_col'] = df['col1'] + df['col2']
4. Use apply() Wisely
While apply() is useful, it can be slow for large DataFrames. Prefer built-in pandas functions whenever
possible.
# Instead of using apply
df['new_col'] = df['col1'].apply(lambda x: x + 10)
# Use vectorized operation
df['new_col'] = df['col1'] + 10
5. Chunking for Large Datasets
If the dataset is too large to fit into memory, read it in chunks.
chunk_size = 10000
for chunk in pd.read_csv('large_file.csv', chunksize=chunk_size):
    process(chunk)  # Your processing function
6. Use Efficient Merging and Joining
When merging DataFrames, ensure the keys you’re joining on are indexed to speed up the operation.
df1.set_index('key', inplace=True)
df2.set_index('key', inplace=True)
result = df1.join(df2, how='inner')
7. Avoid Copies
Be mindful of operations that create unnecessary copies of DataFrames. Use inplace=True where
applicable to modify DataFrames without creating new ones.
# Instead of creating a new DataFrame
df = df.drop(columns=['unnecessary_col'])
# Use inplace
df.drop(columns=['unnecessary_col'], inplace=True)
8. Profile Your Code
Use the %time or %%time magic commands in Jupyter notebooks, or the time module, to measure the
performance of different approaches.
%time df['new_col'] = df['col1'] + df['col2']
9. Use Dask for Larger-than-Memory Computation
If your data exceeds memory limits, consider using Dask, a parallel computing library that integrates
with pandas.
import dask.dataframe as dd
df = dd.read_csv('large_file.csv')
result = df.groupby('column').sum().compute()
10. Parallelize Operations with modin
You can use Modin, a library that provides a drop-in replacement for pandas, allowing operations to run
in parallel.
import modin.pandas as mpd
df = mpd.read_csv('large_file.csv')