
DSE - COMPUTER SCIENCE SEM - III

DATA ANALYSIS AND VISUALIZATION USING PYTHON

PRACTICAL FILE

Name          Aarav Sharma
Course        B.Sc. (Hons.) Electronic Science
Roll no.      2303047
Submitted to  Dr. Jayant Sharma
Practical 6. Using the Titanic dataset, do the following:
a. Clean the data by dropping the column which has the largest number of missing values.
b. Find the total number of passengers with age more than 30.
c. Find the total fare paid by passengers of second class.
d. Compare the number of survivors of each passenger class.
e. Compute descriptive statistics for the age attribute, gender wise.
f. Draw a scatter plot for passenger fare paid by female and male passengers separately.
g. Compare the density distributions of the features age and passenger fare.
h. Draw a pie chart for three groups labelled class 1, class 2 and class 3, displayed in different colours. The occurrence of each group, converted into a percentage, should be displayed in the pie chart. Label the chart appropriately.
i. Find the percentage of survived passengers for each class and answer the question "Did class play a role in survival?".

- Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load Titanic dataset (assuming it's a CSV file)


titanic_df = pd.read_csv("titanic.csv") # Replace with the actual path to your Titanic dataset

# a. Drop the column with the largest number of missing values


missing_counts = titanic_df.isnull().sum()
column_to_drop = missing_counts.idxmax() # Find column with maximum missing values
print(f"Column with most missing values: {column_to_drop}
({missing_counts[column_to_drop]} missing values)")
titanic_df.drop(columns=[column_to_drop], inplace=True)

# b. Total number of passengers with age > 30


passengers_above_30 = titanic_df[titanic_df['Age'] > 30]
total_passengers_above_30 = len(passengers_above_30)
print(f"\nTotal passengers with age > 30: {total_passengers_above_30}")

# c. Total fare paid by second-class passengers


second_class_fare = titanic_df[titanic_df['Pclass'] == 2]['Fare'].sum()
print(f"\nTotal fare paid by second-class passengers: {second_class_fare}")

# d. Compare number of survivors of each passenger class


survivors_by_class = titanic_df.groupby('Pclass')['Survived'].sum()
print("\nNumber of survivors by passenger class:")
print(survivors_by_class)
- Output
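The listing above covers parts (a)-(d) only. A minimal sketch for the remaining parts (e)-(i), assuming the standard Kaggle column names Sex, Age, Fare, Pclass and Survived:

# e. Descriptive statistics for age, gender wise
print(titanic_df.groupby('Sex')['Age'].describe())

# f. Scatter plot of fare, plotted separately for female and male passengers
for gender, marker in [('female', 'o'), ('male', 'x')]:
    subset = titanic_df[titanic_df['Sex'].str.lower() == gender]
    plt.scatter(subset.index, subset['Fare'], marker=marker, label=gender, alpha=0.6)
plt.xlabel('Passenger index')
plt.ylabel('Fare')
plt.title('Fare paid by female and male passengers')
plt.legend()
plt.show()

# g. Density distributions of age and fare on a common plot
titanic_df['Age'].plot(kind='density', label='Age')
titanic_df['Fare'].plot(kind='density', label='Fare')
plt.title('Density distributions of Age and Fare')
plt.legend()
plt.show()

# h. Pie chart of the three passenger classes with percentage labels
class_counts = titanic_df['Pclass'].value_counts().sort_index()
plt.pie(class_counts, labels=['Class 1', 'Class 2', 'Class 3'],
        autopct='%1.1f%%', colors=['gold', 'skyblue', 'lightgreen'])
plt.title('Passengers per class (%)')
plt.show()

# i. Percentage of survivors in each class
survival_pct = titanic_df.groupby('Pclass')['Survived'].mean() * 100
print(survival_pct)
# In the standard dataset the survival rate falls from first to third class,
# so class did play a role in survival.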
Practical 7. Consider the following data frame containing a family name, the gender of the family member and her/his monthly income in each record.

FamilyName  Gender  MonthlyIncome (Rs.)
Shah        Male      44000.00
Vats        Male      65000.00
Vats        Female    43150.00
Kumar       Female    66500.00
Vats        Female   255000.00
Kumar       Male     103000.00
Shah        Male      55000.00
Shah        Female   112400.00
Kumar       Female    81030.00
Vats        Male      71900.00

Write a program in Python using Pandas to perform the following:
a. Calculate and display the familywise gross monthly income.
b. Display the highest and lowest monthly income for each family name.
c. Calculate and display the monthly income of all members earning less than Rs. 80000.00.
d. Display the total number of females along with their average monthly income.
e. Delete rows with monthly income less than the average income of all members.

- Code
import pandas as pd

data = {
"FamilyName": ["Shah", "Vats", "Vats", "Kumar", "Vats", "Kumar", "Shah", "Shah", "Kumar",
"Vats"],
"Gender": ["Male", "Male", "Female", "Female", "Female", "Male", "Male", "Female",
"Female", "Male"],
"MonthlyIncome": [44000.00, 65000.00, 43150.00, 66500.00, 255000.00, 103000.00,
55000.00, 112400.00, 81030.00, 71900.00]
}
df = pd.DataFrame(data)

"(A)"
family_gross_income = df.groupby("FamilyName")["MonthlyIncome"].sum()
print("Familywise Gross Monthly Income:")
print(family_gross_income)

"(B)"
family_income_stats = df.groupby("FamilyName")["MonthlyIncome"].agg(["max", "min"])
print("\nHighest and Lowest Monthly Income for Each Family:")
print(family_income_stats)
"(C)"
low_income_members = df[df["MonthlyIncome"] < 80000.00]
print("\nMonthly Income of Members Earning Less Than Rs. 80000:")
print(low_income_members)

"(D)"
female_stats = df[df["Gender"] == "Female"]
total_females = len(female_stats)
average_female_income = female_stats["MonthlyIncome"].mean()
print(f"\nTotal Number of Females: {total_females}")
print(f"Average Monthly Income of Females: Rs. {average_female_income:.2f}")

"(E)"
average_income = df["MonthlyIncome"].mean()
print("\nAverage income is:",average_income)
df_filtered = df[df["MonthlyIncome"] >= average_income]
print("Data Frame After Removing Rows with Income Below Average Income:")
print(df_filtered)
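As a quick check on part (a): Shah = 44000 + 55000 + 112400 = 211400, Vats = 65000 + 43150 + 255000 + 71900 = 435050, and Kumar = 66500 + 103000 + 81030 = 250530. The overall average used in part (e) is 896980 / 10 = 89698, so only the rows with incomes 103000.00, 112400.00 and 255000.00 survive the filter.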
Practical 1. Write programs in Python using the NumPy library to do the following:
a. Create a two-dimensional array, ARR1, having random values from 0 to 1. Compute the mean, standard deviation, and variance of ARR1 along the second axis.
b. Create a 2-dimensional array of size m x n integer elements; print the shape, type and data type of the array, and then reshape it into an n x m array, where n and m are user inputs given at run time.
c. Test whether the elements of a given 1D array are zero, non-zero or NaN. Record the indices of these elements in three separate arrays.
d. Create three random arrays of the same size: Array1, Array2 and Array3. Subtract Array2 from Array3 and store the result in Array4. Create another array, Array5, having two times the values in Array1. Find the covariance of Array1 with Array4 and the correlation of Array1 with Array5.
e. Create two random arrays of the same size 10: Array1 and Array2. Find the sum of the first half of both arrays and the product of the second half of both arrays.
f. Create an array with random values. Determine the size of the memory occupied by the array.
g. Create a 2-dimensional array of size m x n having integer elements in the range (10, 100). Write statements to swap any two rows, reverse a specified column, and store the updated array in another variable.

- Code

import numpy as np
#"A"
array1 = np.random.rand(5,4)
print("array1:", array1)

mean = np.mean(array1, axis = 1)


Std_Deviation = np.std(array1, axis = 1)
variance = np.var(array1, axis =1)

print("Mean along axis 1:",mean)


print("Standard Deviation along axis 1:",Std_Deviation)
print("Variance along axis 1:",variance)

# "(B)"
m = int(input("Enter the no. of rows(m):"))
n = int(input("Enter the no. of columns(n):"))

array2 = np.random.randint(1,100,size=(m,n))
print("Original Array:", array2)
print("Shape", array2.shape)
print("Type:", type(array2))
print("DataType:", array2.dtype)

reshaped_array = array2.reshape(n,m)
print("Reshaped Array:",reshaped_array)

# "(C)"

array3 = np.array([0, 5, np.nan, 7, 0, np.nan, 3])

zero_indices = np.where(array3 == 0)[0]


non_zero_indices = np.where((array3 != 0) & ~np.isnan(array3))[0]  # NaN != 0 is True, so exclude NaNs explicitly
nan_indices = np.where(np.isnan(array3))[0]

print("Indices of zeros:", zero_indices)


print("Indices of non-zeros:", non_zero_indices)
print("Indices of NaNs:", nan_indices)

# "(D)" ARR1 =
np.random.rand(10) ARR2 =
np.random.rand(10) ARR3 =
np.random.rand(10)

ARR4 = ARR3 - ARR2

ARR5 = 2 * ARR1

covariance = np.cov(ARR1, ARR4)[0, 1]


correlation = np.corrcoef(ARR1, ARR5)[0, 1]

print("Covariance between ARR1 and ARR4:", covariance)


print("Correlation between ARR1 and ARR5:", correlation)

# "(E)"

Array1 = np.random.randint(1, 10, size=10)


Array2 = np.random.randint(1, 10, size=10)

first_half_sum = np.sum(Array1[:5]) + np.sum(Array2[:5])


second_half_product = np.prod(Array1[5:]) * np.prod(Array2[5:])

print("Sum of first half:", first_half_sum)


print("Product of second half:", second_half_product)

# "(F)"
array = np.random.rand(10, 10)

memory_size = array.nbytes

print("Memory occupied by the array:", memory_size, "bytes")

#"(G)"

m = int(input("Enter the no. of rows (m): "))


n = int(input("Enter of no. of columns (n): "))

array = np.random.randint(10, 100, size=(m, n))


print("Original Array:", array)

row1, row2 = 0,1


swapped_array = array.copy()
swapped_array[[row1, row2]] = swapped_array[[row2, row1]]
print("Array after swapping rows {} and {}:\n".format(row1, row2), swapped_array)

column = 0
reversed_array = array.copy()
reversed_array[:, column] = reversed_array[::-1, column]
print("Array after reversing column {}:\n".format(column), reversed_array)
Practical 2. Do the following using Pandas Series:
a. Create a series with 5 elements. Display the series sorted on index and sorted on values separately.
b. Create a series with N elements with some duplicate values. Find the minimum and maximum ranks assigned to the values using the 'first' and 'max' methods.
c. Display the index value of the minimum and maximum element of a Series.

- Code
import pandas as pd

# "(A)"
series1 = pd.Series([45,50,23,67,30], index = ['a','b','c','d','e'])
print("Original Series:\n",series1)

sorted_series_by_index = series1.sort_index()
print("Series1 sorted by index:\n", sorted_series_by_index)

sorted_series_by_values = series1.sort_values()
print("Series1 sorted by value:\n", sorted_series_by_values)

# "(B)"

series2 = pd.Series([10, 20, 10, 30, 20, 10])


print("Original Series:\n", series2)

ranks_first = series2.rank(method='first')
print("\nRanks (method='first'):\n", ranks_first)

ranks_max = series2.rank(method='max')
print("\nRanks (method='max'):\n", ranks_max)

min_rank_first = ranks_first.min()
max_rank_first = ranks_first.max()
min_rank_max = ranks_max.min()
max_rank_max = ranks_max.max()

print("\nMinimum and Maximum Ranks:")


print("First method: Min Rank =", min_rank_first, ", Max Rank =", max_rank_first)
print("Max method: Min Rank =", min_rank_max, ", Max Rank =", max_rank_max)

#"(C)"

series = pd.Series([45, 23, 78, 12, 56], index=['a', 'b', 'c', 'd', 'e'])
print("Original Series:\n", series)

min_index = series.idxmin()
max_index = series.idxmax()
print("\nIndex of the minimum element:", min_index)
print("Index of the maximum element:", max_index)
Practical 4. Consider two Excel files having attendance for two workshops, each of duration 5 days. Each file has three fields, 'Name', 'Date' and 'Duration' (in minutes), where names may be repetitive within a file. Note that duration may take one of three values (30, 40, 50) only. Import the data into two data frames and do the following:
a. Perform merging of the two data frames to find the names of students who attended both workshops.
b. Find the names of all students who attended a single workshop only.
c. Merge the two data frames row-wise and find the total number of records in the data frame.
d. Merge the two data frames row-wise and use two columns, viz. names and dates, as multi-row indexes. Generate descriptive statistics for this hierarchical data frame.

- Code

import pandas as pd

workshop1 = pd.read_excel(r"C:\Users\fq1089\OneDrive\Documents\DSA Folder\workshop1.xlsx")
workshop2 = pd.read_excel(r"C:\Users\fq1089\OneDrive\Documents\DSA Folder\workshop2.xlsx")

workshop1.columns = workshop1.columns.str.strip().str.lower()
workshop2.columns = workshop2.columns.str.strip().str.lower()

if 'name' not in workshop1.columns or 'name' not in workshop2.columns:
    raise ValueError("Column 'name' is missing in one or both data files.")

common_attendees = pd.merge(workshop1, workshop2, on='name', how='inner')


common_names = common_attendees['name'].unique()
print("Names of students who attended both workshops:")
print(common_names)

workshop1_names = set(workshop1['name'].unique())
workshop2_names = set(workshop2['name'].unique())
single_workshop_names = (workshop1_names ^ workshop2_names)
print("\nNames of students who attended only one workshop:")
print(single_workshop_names)

merged_rowwise = pd.concat([workshop1, workshop2], axis=0)


total_records = len(merged_rowwise)
print("\nTotal number of records in the merged DataFrame:", total_records)

hierarchical_df = merged_rowwise.set_index(['name', 'date'])


desc_stats = hierarchical_df.describe()
print("\nDescriptive statistics for the hierarchical DataFrame:")
print(desc_stats)

- excel data
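The original file shows the two spreadsheets as screenshots. A minimal sketch that generates files of the same shape, where the names, dates and durations below are illustrative placeholders rather than the original data:

import pandas as pd

# Hypothetical sample attendance; the real workshop files hold the actual records.
pd.DataFrame({
    'Name': ['Asha', 'Ravi', 'Meena', 'Asha', 'Karan'],
    'Date': pd.date_range('2024-01-01', periods=5),
    'Duration': [30, 40, 50, 40, 30],
}).to_excel('workshop1.xlsx', index=False)  # writing .xlsx requires openpyxl

pd.DataFrame({
    'Name': ['Ravi', 'Meena', 'Sara', 'Ravi', 'Sara'],
    'Date': pd.date_range('2024-01-08', periods=5),
    'Duration': [50, 30, 40, 30, 50],
}).to_excel('workshop2.xlsx', index=False)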

- Output
Practical 5. Using the Iris data, plot the following with proper legends and axis labels. (Download the Iris data from https://archive.ics.uci.edu/ml/datasets/iris or import it from sklearn datasets.)
a. Load the data into a pandas data frame. Use the pandas info() method to look at the datatypes in the dataset.
b. Find the number of missing values in each column (check the number of null values in a column using df.isnull().sum()).
c. Plot a bar chart to show the frequency of each class label in the data.
d. Draw a scatter plot for petal length vs sepal length and fit a regression line.
e. Plot the density distribution for the feature petal width.
f. Use a pair plot to show the pairwise bivariate distributions in the Iris dataset.
g. Draw a heatmap for any two numeric attributes.
h. Compute the mean, mode, median, standard deviation, confidence interval and standard error for each numeric feature.
i. Compute correlation coefficients between each pair of features and plot a heatmap.

- Code
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import sem, norm

# a. Load Iris dataset


from sklearn.datasets import load_iris
iris = load_iris()

# Convert the dataset to a Pandas DataFrame


iris_df = pd.DataFrame(
data=iris.data,
columns=iris.feature_names
)
iris_df['species'] = iris.target
iris_df['species'] = iris_df['species'].map({0: 'setosa', 1: 'versicolor', 2: 'virginica'})

# a. Use pandas.info()
print("Dataset Info:")
iris_df.info()

# b. Find number of missing values in each column


print("\nNumber of missing values in each column:")
print(iris_df.isnull().sum())

# c. Plot bar chart for frequency of each class label


species_counts = iris_df['species'].value_counts()
species_counts.plot(kind='bar', color=['skyblue', 'lightgreen', 'orange'])
plt.title('Frequency of Each Class Label')
plt.xlabel('Species')
plt.ylabel('Count')
plt.xticks(rotation=0)
plt.show()

# d. Scatter plot for Petal Length vs Sepal Length with regression line
sns.lmplot(
data=iris_df,
x=iris.feature_names[0], # Sepal Length
y=iris.feature_names[2], # Petal Length
hue='species',
markers=['o', 's', 'D'],
ci=None
)
plt.title('Scatter Plot with Regression Line')
plt.show()

# e. Density distribution for Petal Width


sns.kdeplot(iris_df[iris.feature_names[3]], fill=True, color='purple')
plt.title('Density Distribution of Petal Width')
plt.xlabel('Petal Width (cm)')
plt.show()

# f. Pair plot for pairwise bivariate distribution


sns.pairplot(iris_df, hue='species', diag_kind='kde')
plt.suptitle('Pair Plot of Iris Features', y=1.02)
plt.show()

# g. Heatmap for two numeric attributes


numeric_data = iris_df[iris.feature_names]
sns.heatmap(numeric_data[[iris.feature_names[0], iris.feature_names[1]]].corr(),
annot=True, cmap='coolwarm')
plt.title('Heatmap of Sepal Length and Sepal Width')
plt.show()

# h. Compute statistical measures for each numeric feature


stats = {}
for col in numeric_data.columns:
    mean = iris_df[col].mean()
    mode = iris_df[col].mode()[0]
    median = iris_df[col].median()
    std = iris_df[col].std()
    se = sem(iris_df[col])
    # Normal-approximation 95% confidence interval; reasonable here since n = 150.
    ci_lower, ci_upper = norm.interval(0.95, loc=mean, scale=se)
    stats[col] = {
        "Mean": mean,
        "Mode": mode,
        "Median": median,
        "Standard Deviation": std,
        "Standard Error": se,
        "95% Confidence Interval": (ci_lower, ci_upper),
    }
print("\nStatistical Measures for Each Numeric Feature:")
for col, stat in stats.items():
    print(f"\nFeature: {col}")
    for key, value in stat.items():
        print(f"  {key}: {value}")

# i. Correlation coefficients and heatmap
correlation_matrix = numeric_data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Heatmap of Feature Correlations')
plt.show()

- Output
Practical 3. Create a data frame having at least 3 columns and 50 rows to store numeric data generated using a random function. Replace 10% of the values with null values whose index positions are generated using a random function. Do the following:
a. Identify and count missing values in the data frame.
b. Drop the column having more than 5 null values.
c. Identify the row label having the maximum sum of all values in a row and drop that row.
d. Sort the data frame on the basis of the first column.
e. Remove all duplicates from the first column.
f. Find the correlation between the first and second columns and the covariance between the second and third columns.
g. Discretize the second column into 5 bins.

- Code
import pandas as pd
import numpy as np

np.random.seed(42)
data = np.random.rand(50, 3)
df = pd.DataFrame(data, columns=['Column1', 'Column2', 'Column3'])

num_nulls = int(0.1 * df.size)
null_indices = np.random.choice(df.size, num_nulls, replace=False)
# Assign via .iat so the DataFrame itself is updated (writing through df.values is not guaranteed to stick).
rows, cols = np.unravel_index(null_indices, df.shape)
for r, c in zip(rows, cols):
    df.iat[r, c] = np.nan

missing_values_count = df.isnull().sum().sum()

df_dropped_columns = df.dropna(axis=1, thresh=len(df) - 5)

# Restore any dropped column so parts (c)-(g) below can still reference all three columns.
columns_to_retain = ['Column1', 'Column2', 'Column3']
for col in columns_to_retain:
    if col not in df_dropped_columns.columns:
        df_dropped_columns[col] = df[col]

row_sum = df_dropped_columns.sum(axis=1, skipna=True)


max_sum_row_label = row_sum.idxmax()
df_dropped_row = df_dropped_columns.drop(index=max_sum_row_label)

df_sorted = df_dropped_row.sort_values(by='Column1', ascending=True)

df_no_duplicates = df_sorted.drop_duplicates(subset='Column1')

correlation = df_no_duplicates['Column1'].corr(df_no_duplicates['Column2'])
covariance = df_no_duplicates['Column2'].cov(df_no_duplicates['Column3'])
bins = pd.cut(df_no_duplicates['Column2'], bins=5, labels=False)

print("Total Missing Values:", missing_values_count)


print("\nModified DataFrame (First 5 Rows):")
print(df_no_duplicates.head())
print("\nCorrelation between Column1 and Column2:", correlation)
print("Covariance between Column2 and Column3:", covariance)
print("\nDiscretized Bins for Column2 (First 5 Rows):")
print(bins.head())
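Note: with labels=False, pd.cut returns the integer bin index (0 to 4) for each row rather than interval labels, which is why the printed bins are plain integers.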
THANK YOU
