dataframing_in_csv
Dataset: Student data.csv
Columns in Dataset:
• id: Identifier for each individual (likely numeric or string)
• name: Name of the individual
• nationality: Nationality of the individual
• city: City of the individual
• latitude: Latitude of the city (numeric)
• longitude: Longitude of the city (numeric)
• gender: Gender of the individual
• ethnic.group: Ethnic group of the individual
• age: Age of the individual (numeric)
• english.grade: Grade in English (numeric or categorical)
• math.grade: Grade in Mathematics (numeric or categorical)
• sciences.grade: Grade in Sciences (numeric or categorical)
• language.grade: Grade in Language (numeric or categorical)
• portfolio.rating: Rating of the portfolio (numeric or categorical)
• coverletter.rating: Rating of the cover letter (numeric or categorical)
• refletter.rating: Rating of the reference letter (numeric or categorical)
• Reading the CSV file: The pd.read_csv() function loads the contents of sdata.csv
into a DataFrame (df). It automatically detects the column headers (the first row of the
CSV) and builds the DataFrame from them.
• df.head(): Displays the first five rows of the dataset so you can quickly get an
overview of the structure of the data.
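The two steps above can be sketched as follows. Since the real sdata.csv is not reproduced here, this example feeds a small made-up stand-in through io.StringIO; with the actual file you would pass the filename instead.

```python
import io
import pandas as pd

# A small stand-in for sdata.csv (the real file has more rows and columns).
csv_text = """id,name,city,age,math.grade
1,Kiana,Seoul,19,3.2
2,Joukje,Busan,22,2.8
3,Ali,Daegu,21,3.9
4,Sara,Seoul,20,3.5
5,Hanna,Busan,23,2.4
6,Dani,Seoul,18,3.1
"""

# pd.read_csv() detects the header row automatically and builds the DataFrame.
df = pd.read_csv(io.StringIO(csv_text))  # with a real file: pd.read_csv("sdata.csv")

# df.head() returns the first five rows for a quick look at the data.
print(df.head())
```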
Code and Output:
1. df.info(): This provides useful metadata about the DataFrame, such as the column
names and data types, the number of non-null entries per column, and the memory usage.
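A minimal sketch of df.info(), again using a small made-up DataFrame in place of the real file:

```python
import io
import pandas as pd

# Stand-in data; note the missing age for the second row.
df = pd.read_csv(io.StringIO("id,name,age\n1,Ann,20\n2,Ben,\n"))

# Prints the column names, dtypes, non-null counts, and memory usage.
df.info()
```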
2. df.describe(): This gives a quick statistical summary of numerical columns. It
provides:
• Count: Number of non-null entries.
• Mean: The average value for numeric columns.
• Standard deviation: The variation in the data.
• Min/Max: The smallest/largest value in each column.
• Quartiles: The 25th, 50th, and 75th percentiles.
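For example, on a small made-up numeric DataFrame, df.describe() returns all of the statistics listed above as a DataFrame indexed by statistic name:

```python
import pandas as pd

df = pd.DataFrame({"age": [18, 20, 22, 24], "math.grade": [2.0, 3.0, 3.5, 4.0]})

# count, mean, std, min, 25%/50%/75% quartiles, and max per numeric column
summary = df.describe()
print(summary)
```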
3. df.isnull().sum(): This checks for missing values (NaNs) across all columns. The
result will show the count of missing values in each column. This helps identify if there are
any columns with missing data that may need to be cleaned.
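A short sketch of the missing-value check, using made-up data with one missing entry per column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Ann", "Ben", None],
    "age": [20, np.nan, 22],
})

# Per-column count of missing values (NaN / None).
missing = df.isnull().sum()
print(missing)
```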
Inspecting the data ensures that we understand its structure, identify issues such as missing
data, and decide how to clean or manipulate it.
Explanation:
df[col].isna().all() returns True only when every value in column col is missing,
which makes it useful for detecting columns that are entirely empty.
2. Renaming Columns
Sometimes, column names need to be more descriptive. You can rename columns if necessary:
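For instance, the dotted column names in this dataset can be renamed to underscore style (the new names here are just illustrative):

```python
import pandas as pd

df = pd.DataFrame({"english.grade": [3.5], "math.grade": [3.0]})

# rename() takes an {old_name: new_name} mapping; the original df is unchanged
# unless you assign the result back (or pass inplace=True).
df = df.rename(columns={"english.grade": "english_grade",
                        "math.grade": "math_grade"})
```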
4. Grouping Data
Group by a column:
# Group by 'city' and calculate the mean of 'age' for each city
Multiple aggregations:
# Group by 'city' and calculate both the mean and sum of 'age'
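Both group-by variants described above can be sketched like this, using a small made-up city/age table:

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Seoul", "Busan", "Seoul", "Busan"],
    "age": [20, 22, 24, 26],
})

# Group by 'city' and calculate the mean of 'age' for each city.
mean_age = df.groupby("city")["age"].mean()

# Group by 'city' and calculate both the mean and sum of 'age'.
stats = df.groupby("city")["age"].agg(["mean", "sum"])
```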
7. Adding New Columns
You can add new columns based on existing ones by performing operations or calculations.
# Add a new column 'age_group' based on 'age'
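One way to derive the 'age_group' column is pd.cut(), which bins a numeric column into labeled ranges; the bin edges and labels below are illustrative, not from the original material:

```python
import pandas as pd

df = pd.DataFrame({"age": [17, 21, 35, 50]})

# Add a new column 'age_group' based on 'age' (bin edges are illustrative).
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 30, 120],
    labels=["minor", "young adult", "adult"],
)
```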
8. Merging DataFrames
You can merge multiple DataFrames based on common columns using the merge() function,
similar to SQL joins.
8.1. Inner Join
An inner join keeps only the rows where there is a match in both DataFrames based on the join
key(s).
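A sketch of an inner join on two small made-up DataFrames; only the ids present in both survive:

```python
import pandas as pd

students = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cleo"]})
grades = pd.DataFrame({"id": [2, 3, 4], "math.grade": [3.0, 3.5, 2.5]})

# Inner join: ids 2 and 3 appear in both DataFrames, so only those rows remain.
inner = pd.merge(students, grades, on="id", how="inner")
```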
8.2. Left Join
A left join keeps all rows from the left DataFrame and adds matching rows from the right
DataFrame. If there is no match, it fills with NaN.
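The same made-up DataFrames joined with how="left"; the unmatched left row is kept with NaN in the right-hand columns:

```python
import pandas as pd

students = pd.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Ben", "Cleo"]})
grades = pd.DataFrame({"id": [2, 3], "math.grade": [3.0, 3.5]})

# Left join: every row of `students` is kept; id 1 has no grade, so it gets NaN.
left = pd.merge(students, grades, on="id", how="left")
```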
8.3. Right Join
A right join keeps all rows from the right DataFrame and adds matching rows from the left
DataFrame. If there is no match, it fills with NaN.
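And the mirror-image sketch with how="right"; every row of the right DataFrame is kept, with NaN where the left side has no match:

```python
import pandas as pd

students = pd.DataFrame({"id": [1, 2], "name": ["Ann", "Ben"]})
grades = pd.DataFrame({"id": [2, 3], "math.grade": [3.0, 2.5]})

# Right join: every row of `grades` is kept; id 3 has no student, so name is NaN.
right = pd.merge(students, grades, on="id", how="right")
```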
Exercises:
1. Load a CSV file called company_data.csv. Display the first 10 rows and show the
column names and data types using .info().
2. Load a CSV file called sales_data.csv. Fill all missing numerical values with the
column's median, and replace missing categorical values with 'Unknown'. Verify that all
missing data has been handled.
3. Load employee_data.csv. Filter and display only the employees whose age is greater
than 40.
4. Load product_data.csv. Add a new column called discounted_price that
applies a 15% discount to the price column.
5. Load sales_data.csv. Group the data by region and calculate the total sales for each
region.
6. Load two CSV files: orders.csv and customers.csv. Merge these DataFrames on
the customer_id column. Display the first 5 rows of the merged DataFrame.
7. Load student_scores.csv. Sort the data based on the math.grade column in
descending order, and display the top 5 rows.
8. Load review_data.csv. Identify and remove duplicate rows based on the review_id
column. Save the cleaned data to a new file cleaned_review_data.csv.