
DATA FRAMING IN CSV

Dataset: Student data.csv (referred to below as sdata.csv)

Columns in Dataset:
• id: Identifier for each individual (likely numeric or string)
• name: Name of the individual
• nationality: Nationality of the individual
• city: City of the individual
• latitude: Latitude of the city (numeric)
• longitude: Longitude of the city (numeric)
• gender: Gender of the individual
• ethnic.group: Ethnic group of the individual
• age: Age of the individual (numeric)
• english.grade: Grade in English (numeric or categorical)
• math.grade: Grade in Mathematics (numeric or categorical)
• sciences.grade: Grade in Sciences (numeric or categorical)
• language.grade: Grade in Language (numeric or categorical)
• portfolio.rating: Rating of the portfolio (numeric or categorical)
• coverletter.rating: Rating of the cover letter (numeric or categorical)
• refletter.rating: Rating of the reference letter (numeric or categorical)

Steps to Frame CSV File (sdata.csv) in Jupyter Notebook


1. Read the CSV File into a DataFrame.
2. Inspect the Data to understand its structure.
3. Perform Necessary Data Manipulations (e.g., handle missing values, convert data types, filter rows).
4. Display the Data after framing it.
5. Save the Manipulated Data if needed.

Step 1: Import Pandas and Read the CSV File


• Importing Pandas: Pandas is the library we use to handle data in Python. import pandas as pd makes Pandas available for us to work with.

• Reading the CSV file: The pd.read_csv() function loads the contents of sdata.csv into a DataFrame (df). It automatically detects the column headers (the first row of the CSV) and creates the DataFrame based on them.

• df.head(): This function displays the first five rows of the dataset so you can quickly get an overview of the structure of the data.
Code and Output:
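A minimal sketch of this step; the file name sdata.csv comes from the section heading, so adjust the path to wherever the file actually lives:

import pandas as pd

# Load the CSV; pandas uses the first row as column headers by default
df = pd.read_csv('sdata.csv')

# Show the first five rows for a quick overview
print(df.head())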

Step 2: Inspect the Data


Use df.info() to check the data types and missing values, and df.describe() to view
summary statistics.

• Display data types and check for missing values: df.info()
• Display summary statistics for numeric columns: df.describe()
• Check for any missing data: df.isnull().sum()

Commands to inspect the data:

1. df.info(): This provides useful metadata about the DataFrame, such as:

• Column names: The names of all columns.
• Data types: The type of data in each column (e.g., integer, float, object/string).
• Non-null count: The number of non-null entries in each column (useful for identifying missing data).

2. df.describe(): This gives a quick statistical summary of numerical columns. It
provides:
• Count: Number of non-null entries.
• Mean: The average value for numeric columns.
• Standard deviation: The variation in the data.
• Min/Max: The smallest/largest value in each column.
• Quartiles: The 25th, 50th, and 75th percentiles.

3. df.isnull().sum(): This checks for missing values (NaNs) across all columns. The
result will show the count of missing values in each column. This helps identify if there are
any columns with missing data that may need to be cleaned.
Inspecting the data ensures that we understand the structure, identify any issues like missing
data, and decide on how to clean or manipulate the data.
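A minimal sketch putting the three inspection calls together (df is the DataFrame loaded in Step 1):

# Column names, dtypes, and non-null counts
df.info()

# Statistical summary of the numeric columns
print(df.describe())

# Count of missing values per column
print(df.isnull().sum())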

Step 3: Data Manipulation


After inspecting the data, the next step is to manipulate it to make it more useful or clean it up.
Here are common manipulations you can perform:

Handling Missing Values


If there are missing values in your dataset, you need to decide how to handle them. You can either
fill the missing values with a default value (like 'Unknown') or drop the rows/columns
containing them.

1. Handle Columns with All NaN Values:


• For Numeric Columns: Before filling with the median or mean, check if the column is
empty and then fill it with a default value.
• For Categorical Columns: You may want to fill them with a placeholder string, such as
'Unknown'.
code:
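A minimal sketch of this logic, assuming numeric and categorical columns are told apart by their dtype:

import pandas as pd

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        if df[col].isna().all():
            df[col] = df[col].fillna(0)  # entirely empty numeric column: default to 0
        else:
            df[col] = df[col].fillna(df[col].median())  # fill gaps with the median
    else:
        df[col] = df[col].fillna('Unknown')  # categorical column: placeholder string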

Explanation:

df[col].isna().all():
• This checks whether all the values in a column (df[col]) are NaN.
• If the column has only NaN values, the expression returns True, and you can handle it by assigning a default value.

Filling the column:
• If a column is not entirely NaN, we fill it with the median (for numeric columns) or a placeholder string like 'Unknown' (for categorical columns).
• If it is completely NaN, we assign a default value such as 0 for numeric columns or 'Unknown' for categorical ones.

2. Renaming Columns
Sometimes, column names need to be more descriptive. You can rename columns if necessary:

code and output:
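A minimal sketch; the new names below are illustrative, not taken from the original:

# Rename the dotted grade columns to underscore style (example choice)
df = df.rename(columns={'english.grade': 'english_grade',
                        'math.grade': 'math_grade'})
print(df.columns.tolist())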


3. Converting Data Types
If any columns need to be converted to appropriate data types (e.g., converting age from string to
integer):
code:
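A minimal sketch for the age example; it assumes 'age' has no missing values left (convert after filling NaNs, since plain int columns cannot hold NaN):

# Convert 'age' from string to integer
df['age'] = df['age'].astype(int)
print(df['age'].dtype)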

4. Filtering Data Based on Conditions


You can filter the DataFrame based on specific conditions, for example selecting only the individuals older than 20 years:
code and output:
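A minimal sketch using boolean indexing:

# Keep only the rows where age is greater than 20
df_over_20 = df[df['age'] > 20]
print(df_over_20.head())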
5. Sorting Data
Sorting Rows: The df.sort_values() function sorts the DataFrame by the specified column
(age in this case). The ascending=True parameter sorts the data in ascending order. If you
wanted descending order, you’d set ascending=False.

Code and output:
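A minimal sketch:

# Sort rows by 'age' in ascending order
df_sorted = df.sort_values(by='age', ascending=True)
print(df_sorted.head())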


6. Grouping Data (GroupBy)
Grouping data is useful when you want to perform aggregate operations, like computing the mean,
sum, or count for each group.

Group by a column:

# Group by 'city' and calculate the mean of 'age' for each city

Multiple aggregations:

# Group by 'city' and calculate both the mean and sum of 'age'
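A minimal sketch of both variants:

# Mean age per city
mean_age = df.groupby('city')['age'].mean()
print(mean_age)

# Mean and sum of age per city in a single call
age_stats = df.groupby('city')['age'].agg(['mean', 'sum'])
print(age_stats)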
7. Adding New Columns
You can add new columns based on existing ones by performing operations or calculations.
# Add a new column 'age_group' based on 'age'
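A minimal sketch; the age bins and labels below are illustrative assumptions, not values from the original:

# Derive 'age_group' from 'age' (bin edges and labels are example choices)
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 25, 120],
                         labels=['minor', 'young adult', 'adult'])
print(df[['age', 'age_group']].head())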

8. Merging DataFrames

You can merge multiple DataFrames based on common columns using the merge() function,
similar to SQL joins.

8.1. Inner Join
An inner join keeps only the rows where there is a match in both DataFrames based on the join
key(s).

# Inner join on 'id' column
df_inner = pd.merge(df1, df2, on='id', how='inner')


code:

output:
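A small self-contained sketch; df1 and df2 are toy DataFrames invented here for illustration:

import pandas as pd

# Two toy DataFrames sharing an 'id' column (illustrative data)
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Ana', 'Ben', 'Cleo']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'city': ['Oslo', 'Rome', 'Lima']})

# Inner join: only ids 2 and 3 appear in both frames
df_inner = pd.merge(df1, df2, on='id', how='inner')
print(df_inner)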

8.2. Left Join
A left join keeps all rows from the left DataFrame and adds matching rows from the right DataFrame. If there is no match, it fills with NaN.

Code:

output:
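A minimal sketch, reusing the toy df1 and df2 from the inner-join example:

# Left join: id 1 from df1 is kept, with NaN in the 'city' column
df_left = pd.merge(df1, df2, on='id', how='left')
print(df_left)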

8.3. Right Join
A right join keeps all rows from the right DataFrame and adds matching rows from the left
DataFrame. If there is no match, it fills with NaN.

# Right join on 'id' column
df_right = pd.merge(df1, df2, on='id', how='right')
code and output:

8.4. Outer Join


An outer join keeps all rows from both DataFrames. If there is no match, it fills with NaN.

# Outer join on 'id' column
df_outer = pd.merge(df1, df2, on='id', how='outer')
8.5. Merging on Multiple Columns


You can merge on multiple columns, just like in SQL joins. The on parameter accepts a list of
column names.
Code and output:
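A minimal sketch; the key columns here are illustrative, and both DataFrames must contain every column listed in on:

# Merge on two key columns at once (column names are example choices)
df_multi = pd.merge(df1, df2, on=['id', 'city'], how='inner')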

9. Saving the Data:


• df.to_csv('sdata_modified.csv', index=False) writes the modified DataFrame to a new CSV file; index=False omits the pandas row index.
code and output:

Exercises:
1. Load a CSV file called company_data.csv. Display the first 10 rows and show the
column names and data types using .info().

2. Load a CSV file called sales_data.csv. Fill all missing numerical values with the
column's median, and replace missing categorical values with 'Unknown'. Verify if all
missing data is handled.
3. Load employee_data.csv. Filter and display only the employees whose age is greater
than 40.
4. Load product_data.csv. Add a new column called discounted_price that
applies a 15% discount to the price column.

5. Load sales_data.csv. Group the data by region and calculate the total sales for each
region.
6. Load two CSV files: orders.csv and customers.csv. Merge these DataFrames on
the customer_id column. Display the first 5 rows of the merged DataFrame.
7. Load student_scores.csv. Sort the data based on the math.grade column in
descending order, and display the top 5 rows.
8. Load review_data.csv. Identify and remove duplicate rows based on the review_id
column. Save the cleaned data to a new file cleaned_review_data.csv.

9. Load attendance_data.csv. Convert the attendance_date column to a datetime object. Filter the data to show records from the last 60 days.
10. Load performance_data.csv. Add a new column performance_score as
the average of the task1_score, task2_score, and task3_score. Save the
updated data to a new file performance_with_scores.csv.
