
DATA FRAMING IN CSV

Dataset: Student data.csv (referred to below as sdata.csv)

Columns in Dataset:
• id: Identifier for each individual (likely numeric or string)
• name: Name of the individual
• nationality: Nationality of the individual
• city: City of the individual
• latitude: Latitude of the city (numeric)
• longitude: Longitude of the city (numeric)
• gender: Gender of the individual
• ethnic.group: Ethnic group of the individual
• age: Age of the individual (numeric)
• english.grade: Grade in English (numeric or categorical)
• math.grade: Grade in Mathematics (numeric or categorical)
• sciences.grade: Grade in Sciences (numeric or categorical)
• language.grade: Grade in Language (numeric or categorical)
• portfolio.rating: Rating of the portfolio (numeric or categorical)
• coverletter.rating: Rating of the cover letter (numeric or categorical)
• refletter.rating: Rating of the reference letter (numeric or categorical)

Steps to Frame CSV File (sdata.csv) in Jupyter Notebook


1. Read the CSV File into a DataFrame.
2. Inspect the Data to understand its structure.
3. Perform Necessary Data Manipulations (e.g., handle missing values, convert data types, filter rows).
4. Display the Data after framing it.
5. Save the Manipulated Data if needed.

Step 1: Import Pandas and Read the CSV File


• Importing Pandas: Pandas is the library we use to handle data in Python. import pandas as pd makes Pandas available for us to work with.

• Reading the CSV file: The pd.read_csv() function loads the contents of sdata.csv into a DataFrame (df). It automatically detects the column headers (the first row of the CSV) and creates the DataFrame based on them.

• df.head(): This function displays the first five rows of the dataset so you can quickly get an overview of the structure of the data.
Code and Output:
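A minimal sketch of this step; the file name sdata.csv comes from the section heading, so adjust the path to wherever the file actually lives:

import pandas as pd

# Load the CSV; pandas uses the first row as column headers by default
df = pd.read_csv('sdata.csv')

# Show the first five rows for a quick overview
print(df.head())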

Step 2: Inspect the Data


Use df.info() to check the data types and missing values, and df.describe() to view
summary statistics.

• Display data types and check for missing values: df.info()
• Display summary statistics for numeric columns: df.describe()
• Check for any missing data: df.isnull().sum()

Commands to inspect the data:

1. df.info(): This provides useful metadata about the DataFrame, such as:

• Column names: The names of all columns.
• Data types: The type of data in each column (e.g., integer, float, object/string).
• Non-null count: The number of non-null entries in each column (useful for identifying missing data).

2. df.describe(): This gives a quick statistical summary of numerical columns. It
provides:
• Count: Number of non-null entries.
• Mean: The average value for numeric columns.
• Standard deviation: The variation in the data.
• Min/Max: The smallest/largest value in each column.
• Quartiles: The 25th, 50th, and 75th percentiles.

3. df.isnull().sum(): This checks for missing values (NaNs) across all columns. The
result will show the count of missing values in each column. This helps identify if there are
any columns with missing data that may need to be cleaned.
Inspecting the data ensures that we understand the structure, identify any issues like missing
data, and decide on how to clean or manipulate the data.
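A minimal sketch putting the three inspection calls together (df is the DataFrame loaded in Step 1):

# Column names, dtypes, and non-null counts
df.info()

# Statistical summary of the numeric columns
print(df.describe())

# Count of missing values per column
print(df.isnull().sum())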

Step 3: Data Manipulation


After inspecting the data, the next step is to manipulate it to make it more useful or clean it up.
Here are common manipulations you can perform:

Handling Missing Values


If there are missing values in your dataset, you need to decide how to handle them. You can either
fill the missing values with a default value (like 'Unknown') or drop the rows/columns
containing them.

1. Handle Columns with All NaN Values:


• For Numeric Columns: Before filling with the median or mean, check if the column is
empty and then fill it with a default value.
• For Categorical Columns: You may want to fill them with a placeholder string, such as
'Unknown'.
code:
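A minimal sketch of this logic, assuming numeric and categorical columns are told apart by their dtype:

import pandas as pd

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        if df[col].isna().all():
            df[col] = df[col].fillna(0)  # entirely empty numeric column: default to 0
        else:
            df[col] = df[col].fillna(df[col].median())  # fill gaps with the median
    else:
        df[col] = df[col].fillna('Unknown')  # categorical column: placeholder string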

Explanation:

df[col].isna().all():
• This checks whether all the values in a column (df[col]) are NaN.
• If the column has only NaN values, the expression returns True, and you can handle it by assigning a default value.

Filling the column:
• If a column is not entirely NaN, we fill it with the median (for numeric columns) or a placeholder string like 'Unknown' (for categorical columns).
• If it is completely NaN, we assign a default value such as 0 for numeric columns or 'Unknown' for categorical ones.

2. Renaming Columns
Sometimes, column names need to be more descriptive. You can rename columns if necessary:

code and output:
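A minimal sketch; the new names below are illustrative, not taken from the original:

# Rename the dotted grade columns to underscore style (example choice)
df = df.rename(columns={'english.grade': 'english_grade',
                        'math.grade': 'math_grade'})
print(df.columns.tolist())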


3. Converting Data Types
If any columns need to be converted to appropriate data types (e.g., converting age from string to
integer):
code:
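A minimal sketch for the age example; it assumes 'age' has no missing values left (convert after filling NaNs, since plain int columns cannot hold NaN):

# Convert 'age' from string to integer
df['age'] = df['age'].astype(int)
print(df['age'].dtype)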

4. Filtering Data Based on Conditions


You can filter the DataFrame based on specific conditions, for example selecting only the individuals older than 20 years:
code and output:
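A minimal sketch using boolean indexing:

# Keep only the rows where age is greater than 20
df_over_20 = df[df['age'] > 20]
print(df_over_20.head())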
5. Sorting Data
Sorting Rows: The df.sort_values() function sorts the DataFrame by the specified column
(age in this case). The ascending=True parameter sorts the data in ascending order. If you
wanted descending order, you’d set ascending=False.

Code and output:
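A minimal sketch:

# Sort rows by 'age' in ascending order
df_sorted = df.sort_values(by='age', ascending=True)
print(df_sorted.head())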


6. Grouping Data (GroupBy)
Grouping data is useful when you want to perform aggregate operations, like computing the mean,
sum, or count for each group.

Group by a column:

# Group by 'city' and calculate the mean of 'age' for each city

Multiple aggregations:

# Group by 'city' and calculate both the mean and sum of 'age'
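A minimal sketch of both variants:

# Mean age per city
mean_age = df.groupby('city')['age'].mean()
print(mean_age)

# Mean and sum of age per city in a single call
age_stats = df.groupby('city')['age'].agg(['mean', 'sum'])
print(age_stats)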
7. Adding New Columns
You can add new columns based on existing ones by performing operations or calculations.
# Add a new column 'age_group' based on 'age'
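A minimal sketch; the age bins and labels below are illustrative assumptions, not values from the original:

# Derive 'age_group' from 'age' (bin edges and labels are example choices)
df['age_group'] = pd.cut(df['age'], bins=[0, 18, 25, 120],
                         labels=['minor', 'young adult', 'adult'])
print(df[['age', 'age_group']].head())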

8. Merging DataFrames

You can merge multiple DataFrames based on common columns using the merge() function,
similar to SQL joins.

8.1. Inner Join
An inner join keeps only the rows where there is a match in both DataFrames based on the join
key(s).

# Inner join on 'id' column
df_inner = pd.merge(df1, df2, on='id', how='inner')


code:

output:
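A small self-contained sketch; df1 and df2 are toy DataFrames invented here for illustration:

import pandas as pd

# Two toy DataFrames sharing an 'id' column (illustrative data)
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Ana', 'Ben', 'Cleo']})
df2 = pd.DataFrame({'id': [2, 3, 4], 'city': ['Oslo', 'Rome', 'Lima']})

# Inner join: only ids 2 and 3 appear in both frames
df_inner = pd.merge(df1, df2, on='id', how='inner')
print(df_inner)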

8.2. Left Join
A left join keeps all rows from the left DataFrame and adds matching rows from the right DataFrame. If there is no match, it fills with NaN.

Code:

output:
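A minimal sketch, reusing the toy df1 and df2 from the inner-join example:

# Left join: id 1 from df1 is kept, with NaN in the 'city' column
df_left = pd.merge(df1, df2, on='id', how='left')
print(df_left)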

8.3. Right Join
A right join keeps all rows from the right DataFrame and adds matching rows from the left
DataFrame. If there is no match, it fills with NaN.

# Right join on 'id' column
df_right = pd.merge(df1, df2, on='id', how='right')
code and output:

8.4. Outer Join


An outer join keeps all rows from both DataFrames. If there is no match, it fills with NaN.

# Outer join on 'id' column
df_outer = pd.merge(df1, df2, on='id', how='outer')
8.5. Merging on Multiple Columns


You can merge on multiple columns, just like in SQL joins. The on parameter accepts a list of
column names.
Code and output:
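A minimal sketch; the key columns here are illustrative, and both DataFrames must contain every column listed in on:

# Merge on two key columns at once (column names are example choices)
df_multi = pd.merge(df1, df2, on=['id', 'city'], how='inner')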

9. Saving the Data:


• df.to_csv('sdata_modified.csv', index=False) writes the modified DataFrame to a new CSV file; index=False omits the pandas row index.
code and output:

Exercises:
1. Load a CSV file called company_data.csv. Display the first 10 rows and show the
column names and data types using .info().

2. Load a CSV file called sales_data.csv. Fill all missing numerical values with the
column's median, and replace missing categorical values with 'Unknown'. Verify if all
missing data is handled.
3. Load employee_data.csv. Filter and display only the employees whose age is greater
than 40.
4. Load product_data.csv. Add a new column called discounted_price that
applies a 15% discount to the price column.

5. Load sales_data.csv. Group the data by region and calculate the total sales for each
region.
6. Load two CSV files: orders.csv and customers.csv. Merge these DataFrames on
the customer_id column. Display the first 5 rows of the merged DataFrame.
7. Load student_scores.csv. Sort the data based on the math.grade column in
descending order, and display the top 5 rows.
8. Load review_data.csv. Identify and remove duplicate rows based on the review_id
column. Save the cleaned data to a new file cleaned_review_data.csv.

9. Load attendance_data.csv. Convert the attendance_date column to a datetime object. Filter the data to show records from the last 60 days.
10. Load performance_data.csv. Add a new column performance_score as
the average of the task1_score, task2_score, and task3_score. Save the
updated data to a new file performance_with_scores.csv.
