Data Science Using R - Lab Manual-Complete Ver 2.0 - Nov 2024
Dr.P.RAJASEKAR
Associate Professor,
School of Computing
SRMIST
List of Experiments
6. Write an R program to take input from the user (name and age) and display the values.
7. Write an R program to create a sequence of numbers from 20 to 50 and find the mean of
numbers from 20 to 60 and sum of numbers from 51 to 91.
8. Write an R program to create three vectors a,b,c with 3 integers. Combine the three
vectors to become a 3×3 matrix where each column represents a vector. Print the
content of the matrix.
9. Write an R program to concatenate two given matrices with the same number of columns
but different numbers of rows.
10. Write an R program to create a data frame from four given vectors.
12. Write an R program to count the number of NA values in a data frame column.
13. Write an R program to create a simple bar plot of four subjects’ marks.
14. Write an R program to create a simple bar plot for ozone concentration in air with
“airquality” dataset.
15. Write an R program to create a histogram for maximum daily temperature with
“airquality” dataset.
16. Write an R program to create a boxplot for the variable “wind” with “airquality” dataset.
Experiment No: 1
Aim: To perform the basic mathematical operations in R programming
Theory:
Installing Packages
The most common place to get packages from is CRAN. To install packages from CRAN you
use install.packages("packagename"). For instance, if you want to install the ggplot2
package, which is a very popular visualization package you would type the following in the
console:
# install package from CRAN
install.packages("ggplot2")
Loading Packages
Once the package is downloaded to your computer you can access the functions and
resources provided by the package in two different ways:
# load the package to use in the current R session
library(packagename)
# use a particular function from a package without loading the package
packagename::functionname
Assignment
The first operator you’ll run into is the assignment operator. The assignment operator is used
to assign a value. For instance we can assign the value 3 to the variable x using the <-
assignment operator.
# assignment
x <- 3
Interestingly, R actually allows for five assignment operators:
# leftward assignment
x <- value
x = value
x <<- value
# rightward assignment
value -> x
value ->> x
The original assignment operator in R was <- and it has remained the preferred one among R
users. The = assignment operator was added in 2001, primarily because it is the accepted
assignment operator in many other languages and beginners coming to R from those
languages were prone to use it.
The <<- and ->> operators are normally only used inside functions, which we will not cover in detail here.
Evaluation
We can then evaluate the variable by simply typing x at the command line, which will return
the value of x. Note that before the returned value you’ll see ## [1] in the output.
The [1] is the index of the first element printed on that line; here it shows that 3 is the
first element of the result. Note that you can add comments to your code by preceding them
with the hash (#) symbol. Any values, symbols, and text following # on a line will not be evaluated.
# evaluation
x
## [1] 3
Case Sensitivity
Lastly, note that R is a case-sensitive programming language, meaning all variables,
functions, and objects must be called by their exact spelling:
x <- 1
y <- 3
z <- 4
x*y*z
## [1] 12
x*Y*z
## Error in eval(expr, envir, enclos): object 'Y' not found
Basic Arithmetic
At its most basic, R can be used as a calculator. When applying basic arithmetic, the
PEMDAS order of operations applies: parentheses first, followed by exponentiation,
multiplication and division, and finally addition and subtraction.
8+9/5^2
## [1] 8.36
8 + 9 / (5 ^ 2)
## [1] 8.36
8 + (9 / 5) ^ 2
## [1] 11.24
(8 + 9) / 5 ^ 2
## [1] 0.68
By default R will display seven significant digits, but this can be changed using options().
1/7
## [1] 0.1428571
options(digits = 3)
1/7
## [1] 0.143
pi
## [1] 3.14
options(digits = 22)
pi
## [1] 3.141592653589793115998
We can also perform integer division (%/%) and modulo (%%) operations. Integer division
gives the integer part of a quotient, while modulo gives the remainder.
42 / 4 # regular division
## [1] 10.5
42 %/% 4 # integer division
## [1] 10
42 %% 4 # modulo (remainder)
## [1] 2
The workspace environment will also list your user defined objects such as vectors, matrices,
data frames, lists, and functions. For example, if you type the following in your console:
x <- 2
y <- 3
You will now see x and y listed in your workspace environment. To identify or remove the
objects (i.e. vectors, data frames, user defined functions, etc.) in your current R environment:
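Listing and removing objects, as described above, can be sketched with the base R functions ls(), exists() and rm(). This is a minimal illustration rather than the manual's original listing; the inherits = FALSE argument is added here so that only the current environment is searched.

```r
# Define two objects, as in the text above.
x <- 2
y <- 3

# List the names of all objects in the current environment.
objs <- ls()
print(objs)

# Check whether an object with a given name exists here.
print(exists("x", inherits = FALSE))

# Remove a single object, then confirm that it is gone.
rm(x)
print(exists("x", inherits = FALSE))

# To remove every object at once you could run: rm(list = ls())
```

Removing everything with rm(list = ls()) cannot be undone, so it is shown only as a comment.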
Result:
Experiment No. 2
Theory:
With R, it is important to understand that there is a difference between the actual
R object and the manner in which that R object is printed to the console. Often, the printed
output has additional bells and whistles to make it friendlier to users.
However, these bells and whistles are not inherently part of the object.
R has five basic or “atomic” classes of objects:
• character
• numeric (real numbers)
• integer
• complex
• logical (TRUE/FALSE)
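The five atomic classes listed above can be checked with class(). The sketch below creates one value of each class; the variable names are illustrative only.

```r
# One value of each atomic class; the variable names are illustrative only.
ch  <- "data science"   # character
nu  <- 3.14             # numeric (real number)
int <- 42L              # integer (the L suffix requests an integer)
cx  <- 2 + 3i           # complex
lg  <- TRUE             # logical

# class() reports the class of each object.
classes <- c(class(ch), class(nu), class(int), class(cx), class(lg))
print(classes)
```

Note that a plain number such as 42 is stored as numeric; the L suffix is what makes it an integer.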
The most basic type of R object is a vector. Empty vectors can be created with the
vector() function. There is really only one rule about vectors in R: a vector can
only contain objects of the same class. But of course, like any good rule, there is an
exception, which is a list, which we will get to a bit later. A list is represented as a vector but
can contain objects of different classes. Indeed, that’s usually why we use them.
There is also a class for “raw” objects, but they are not commonly used directly in data
analysis.
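Creating Vectors
The c() function can be used to create vectors of objects by concatenating things together.
The vector() function creates a vector of a given class and length, filled with default values:
> x <- vector("numeric", length = 10)
> x
[1] 0 0 0 0 0 0 0 0 0 0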
Numeric vector
x <- c(1,2,3,4,5,6)
Character vector
To calculate the frequency of each value in a vector, such as a State vector, you can use the table() function.
When a vector contains an NA (not available) value, the mean() function returns NA.
To calculate the mean of a vector excluding NA values, pass na.rm = TRUE
to the mean() function.
You can use subscripts to refer to elements of a vector.
data$x = as.numeric(data$x)
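The vector operations described above (character vectors, table(), mean() with NA values, and subscripts) can be sketched as follows. The State and marks values are made up for illustration and are not taken from the manual.

```r
# Character vector; the State values are made up for illustration.
State <- c("TN", "KA", "TN", "KL", "TN", "KA")

# Frequency of each distinct value.
freq <- table(State)
print(freq)

# Numeric vector with a missing value.
marks <- c(10, 20, NA, 40)
m_all <- mean(marks)                # NA, because one element is missing
m_ok  <- mean(marks, na.rm = TRUE)  # mean of the non-missing values only
print(m_all)
print(m_ok)

# Subscripts: R indexes from 1.
first  <- marks[1]         # first element
picked <- marks[c(1, 4)]   # first and fourth elements
print(first)
print(picked)
```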
Some useful vectors can be created quickly with R. The colon operator is used to generate
integer sequences:
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> -3:4
[1] -3 -2 -1 0 1 2 3 4
> 9:5
[1] 9 8 7 6 5
More generally, the function seq() can generate any arithmetic progression.
> seq(from=2, to=6, by=0.4)
[1] 2.0 2.4 2.8 3.2 3.6 4.0 4.4 4.8 5.2 5.6 6.0
Sometimes it’s necessary to have repeated values, for which we use rep()
> rep(5,3)
[1] 5 5 5
> rep(2:5,each=3)
[1] 2 2 2 3 3 3 4 4 4 5 5 5
> rep(-1:3, length.out=10)
[1] -1 0 1 2 3 -1 0 1 2 3
We can also use R’s vectorization to create more interesting sequences:
> 2^(0:10)
[1] 1 2 4 8 16 32 64 128 256 512 1024
> 1:3 + rep(seq(from=0, by=10, to=30), each=3)
[1] 1 2 3 11 12 13 21 22 23 31 32 33
Lists:
You can use subscripts to select the specific component of the list.
> x <- list(1:3, TRUE, "Hello", list(1:2, 5))
Here x has 4 elements: a numeric vector, a logical, a string and another list.
> x[[3]]
[1] "Hello"
> x[c(1,3)]
[[1]]
[1] 1 2 3
[[2]]
[1] "Hello"
We can also name some or all of the entries in our list, by supplying argument names to list():
> x <- list(y=1:3, TRUE, z="Hello")
> x
$y
[1] 1 2 3
[[2]]
[1] TRUE
$z
[1] "Hello"
Notice that the [[1]] has been replaced by $y, which gives us a clue as to
how we can recover the entries by their name. We can still use the numeric
position if we prefer:
> x$y
[1] 1 2 3
> x[[1]]
[1] 1 2 3
The function names() can be used to obtain a character vector of all the
names of the entries in a list; unnamed entries appear as empty strings.
> names(x)
[1] "y" "" "z"
Result:
Thus, operations on vector and list data objects have been implemented in R.
Experiment No. 3
Theory:
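Matrices are much used in statistics, and so play an important role in R. To create a matrix
use the function matrix(), specifying elements by column first:
> matrix(1:12, nrow=3, ncol=4)
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12
This is called column-major order. Of course, we need only give one of the dimensions:
> matrix(1:12, nrow=3)     # same matrix as above
unless we want vector recycling to fill the matrix for us:
> matrix(1:3, nrow=3, ncol=4)
     [,1] [,2] [,3] [,4]
[1,]    1    1    1    1
[2,]    2    2    2    2
[3,]    3    3    3    3
There are special functions for constructing certain matrices:
> diag(3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    1    0
[3,]    0    0    1
> diag(1:3)
     [,1] [,2] [,3]
[1,]    1    0    0
[2,]    0    2    0
[3,]    0    0    3
> 1:5 %o% 1:5
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    2    4    6    8   10
[3,]    3    6    9   12   15
[4,]    4    8   12   16   20
[5,]    5   10   15   20   25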
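The %o% operator performs an outer product of two vectors x and y, so it creates a matrix
with (i, j)-th entry x_i y_j.
The function outer() generalizes this to any function f on two arguments, to create a matrix
with entries f(x_i, y_j). (More on functions later.)
> outer(1:3, 1:4, "+")
     [,1] [,2] [,3] [,4]
[1,]    2    3    4    5
[2,]    3    4    5    6
[3,]    4    5    6    7
Matrix multiplication is performed using the operator %*%, which is quite distinct from
element-wise multiplication *.
> A <- matrix(c(1:8,10), 3, 3)
> x <- c(1,2,3)
> A %*% x     # matrix multiplication
     [,1]
[1,]   30
[2,]   36
[3,]   45
> A*x     # NOT matrix multiplication
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    4   10   16
[3,]    9   18   30
Standard functions exist for common mathematical operations on matrices.
> t(A)     # transpose
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8   10
> det(A)     # determinant
[1] -3
> diag(A)     # diagonal
[1] 1 5 10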
Array:
Of course, if we have a data set consisting of more than two pieces of categorical information
about each subject, then a matrix is not sufficient. The generalization of matrices to higher
dimensions is the array. Arrays are defined much like matrices, with a call to the array()
command. Here is a 2 × 3 × 3 array:
> arr <- array(1:18, dim=c(2,3,3))
> arr
, , 1
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
, , 2
     [,1] [,2] [,3]
[1,]    7    9   11
[2,]    8   10   12
, , 3
     [,1] [,2] [,3]
[1,]   13   15   17
[2,]   14   16   18
Each 2-dimensional slice defined by the last co-ordinate of the array is shown as a 2 × 3 matrix.
Note that we no longer specify the number of rows and columns separately, but use a single
vector dim whose length is the number of dimensions. You can recover this vector with the
dim() function.
> dim(arr)
[1] 2 3 3
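Arrays can be subsetted and modified in exactly the same way as a matrix, only using the
appropriate number of co-ordinates.
> arr[1,2,3]
[1] 15
> arr[,2,]
     [,1] [,2] [,3]
[1,]    3    9   15
[2,]    4   10   16
> arr[1,1,1] <- 0     # change a value
> arr[,,1,drop=FALSE]
, , 1
     [,1] [,2] [,3]
[1,]    0    3    5
[2,]    2    4    6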
Factors
R has a special data structure to store categorical variables. It tells R that a variable is
nominal or ordinal by making it a factor.
data$x = as.factor(data$x)
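The difference between nominal and ordinal factors mentioned above can be sketched as follows. The gender and size values are made up for illustration; factor(), levels(), table() and is.ordered() are base R functions.

```r
# Nominal factor: the gender values are made up for illustration.
gender <- factor(c("male", "female", "female", "male", "female"))
print(levels(gender))   # levels are sorted alphabetically by default
print(table(gender))    # count of observations per level

# Ordinal factor: supply the level order and set ordered = TRUE.
size <- factor(c("low", "high", "medium", "low"),
               levels = c("low", "medium", "high"),
               ordered = TRUE)
print(is.ordered(size))
print(size[1] < size[2])   # comparisons respect the level order
```

Comparison operators such as < only make sense for ordered factors; on a nominal factor they return NA with a warning.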
Result:
Thus, various operations on matrices, arrays and factors have been implemented in R.
Experiment No. 4
3. The data stored in a data frame can be of numeric, factor or character type.
The structure of the data frame can be seen by using the str() function.
The statistical summary and nature of the data can be obtained by applying the summary()
function.
     emp_id    emp_name             salary        start_date
 Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01
 1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23
 Median :3   Mode  :character   Median :623.3   Median :2014-05-11
 Mean   :3                      Mean   :664.4   Mean   :2014-01-14
 3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15
 Max.   :5                      Max.   :843.2   Max.   :2015-03-27
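A data frame consistent with the summary above could look like the following sketch. This is an assumed reconstruction: only the names Michelle and Gary and their start dates appear in the manual's output, so the remaining names and the pairing of salaries with employees are placeholders.

```r
# Assumed reconstruction of the employee data frame. Only Michelle and Gary
# appear in the manual's output; the other names are placeholders.
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                         "2014-05-11", "2015-03-27")),
  stringsAsFactors = FALSE
)

str(emp.data)              # column types and a preview of the values
print(summary(emp.data))   # per-column statistical summary
```

Setting stringsAsFactors = FALSE keeps emp_name as character, which matches the "Class :character" entry in the summary.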
# Extract 3rd and 5th row with 2nd and 4th column.
result <- emp.data[c(3,5),c(2,4)]
print(result)
When we execute the above code, it produces the following result −
  emp_name start_date
3 Michelle 2014-11-15
5     Gary 2015-03-27
1. Add Column
v <- emp.data
print(v)
2. Add Row
To add more rows permanently to an existing data frame, we need to bring in the new rows
in the same structure as the existing data frame and use the rbind() function.
In the example below we create a data frame with new rows and merge it with the existing
data frame to create the final data frame.
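Adding a column and adding rows, as described above, can be sketched as follows. All values, the dept column, and the names emp.newdata and emp.finaldata are illustrative assumptions, not the manual's original listing.

```r
# Start from a small data frame; all values here are illustrative.
emp.data <- data.frame(
  emp_id = 1:3,
  emp_name = c("Rick", "Dan", "Michelle"),
  salary = c(623.3, 515.2, 611.0),
  stringsAsFactors = FALSE
)

# Add a column: assign a vector with one value per row.
emp.data$dept <- c("IT", "Operations", "IT")

# Add rows: build a data frame with the same columns, then use rbind().
emp.newdata <- data.frame(
  emp_id = 4:5,
  emp_name = c("Ryan", "Gary"),
  salary = c(729.0, 843.25),
  dept = c("HR", "Finance"),
  stringsAsFactors = FALSE
)
emp.finaldata <- rbind(emp.data, emp.newdata)
print(emp.finaldata)
```

rbind() requires both data frames to have the same column names, which is why the new rows also carry a dept column.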
Conclusion:
Thus, data frames and various operations on them have been implemented in R.
Experiment No. 5
Aim: To Create Sample (Dummy) Data in R and perform data manipulation with R
Theory:
This experiment covers how to perform the most frequently used data manipulation tasks in R. It includes
various examples with datasets and code, and gives a quick look at several functions used
in R.
# for multiple
# OR
> DF[keeps]
> DF
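Keeping or dropping columns by name, which the DF[keeps] fragment above refers to, can be sketched as follows. The data frame DF, its columns, and the drops vector are hypothetical and chosen only for illustration.

```r
# Hypothetical data frame; DF and its columns are illustrative only.
DF <- data.frame(x = 1:3, y = c("a", "b", "c"), z = c(TRUE, FALSE, TRUE))

# Keep several columns by name.
keeps <- c("x", "z")
DF_kept <- DF[keeps]

# OR drop columns by name, keeping everything else.
drops <- c("y")
DF_dropped <- DF[, !(names(DF) %in% drops), drop = FALSE]

print(DF_kept)
print(DF_dropped)
```

The drop = FALSE argument keeps the result a data frame even when only one column remains.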
d3=data.frame(roll=c(2,4,6,3,1,5),
name=c('a','b','c','d','e','e'),
marks=c(44,55,22,33,66,77))
> d3
d3[order(d3$roll),]
OR
d3[with(d3,order(roll)),]
Subsets:
roll=c(1:5)
names=c(letters[1:5])
marks=c(12,33,44,55,66)
d4=data.frame(roll,names,marks)
# rows with marks above 33 and roll above 4
sub1=subset(d4,marks>33 & roll>4)
sub1
# same rows, keeping only the roll and names columns
sub1=subset(d4,marks>33 & roll>4,select = c(roll,names))
sub1
Rename Columns in R
colnames(d4)[colnames(d4)=="roll"]="ID"
Sorting a vector
x= sample(1:50)
x = sort(x, decreasing = TRUE)
The function sort() is used for sorting a one-dimensional vector. It is not meant for objects with
more than one dimension, such as data frames; use order() for those, as in the d3 example above.
Experiment No. 6
Write an R program to take input from the user (name and age) and display the values. Also
print the version of the R installation.
R Programming Code :
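One possible solution is sketched below; it is not the manual's original listing. The helper describe_user and the placeholder values are assumptions added so that the script also runs non-interactively, where readline() does not wait for input.

```r
# Build the message from the two inputs; describe_user is a hypothetical helper.
describe_user <- function(name, age) {
  paste("My name is", name, "and I am", age, "years old.")
}

# readline() only waits for input in an interactive session,
# so fall back to placeholder values when run as a script.
if (interactive()) {
  name <- readline(prompt = "Input your name: ")
  age  <- readline(prompt = "Input your age: ")
} else {
  name <- "Asha"
  age  <- "21"
}

print(describe_user(name, age))
print(R.version.string)   # version of the R installation
```

readline() always returns a character string, so age would need as.integer() before any arithmetic.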
Sample Output:
Result:
Thus, the program is executed successfully.
Experiment No. 7
Write an R program to create a sequence of numbers from 20 to 50 and find the mean of
numbers from 20 to 60 and sum of numbers from 51 to 91.
R Programming Code :
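One possible solution is sketched below, using seq(), mean() and sum() from base R; the variable names are illustrative and this is not the manual's original listing.

```r
# Sequence of numbers from 20 to 50.
s <- seq(20, 50)
print("Sequence of numbers from 20 to 50:")
print(s)

# Mean of numbers from 20 to 60.
m <- mean(20:60)
print("Mean of numbers from 20 to 60:")
print(m)

# Sum of numbers from 51 to 91.
total <- sum(51:91)
print("Sum of numbers from 51 to 91:")
print(total)
```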
Sample Output:
Result:
Thus, the program is executed successfully.
Experiment No. 8
Write an R program to create three vectors a,b,c with 3 integers. Combine the three vectors to
become a 3×3 matrix where each column represents a vector. Print the content of the matrix.
R Programming Code :
a<-c(1,2,3)
b<-c(4,5,6)
c<-c(7,8,9)
m<-cbind(a,b,c)
print("Content of the said matrix:")
print(m)
Sample Output:
Result:
Thus, the program is executed successfully.
Experiment No. 9
Write an R program to concatenate two given matrices with the same number of columns but different numbers of rows.
R Programming Code :
x = matrix(1:12, ncol=3)
y = matrix(13:24, ncol=3)
print("Matrix-1")
print(x)
print("Matrix-2")
print(y)
# rbind() stacks y below x; dim() then reports the size of the combined matrix
result = dim(rbind(x,y))
print("After concatenating two given matrices:")
print(result)
Sample Output:
[1] "Matrix-1"
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
[1] "Matrix-2"
     [,1] [,2] [,3]
[1,]   13   17   21
[2,]   14   18   22
[3,]   15   19   23
[4,]   16   20   24
[1] "After concatenating two given matrices:"
[1] 8 3
Result:
Thus, the program is executed successfully.
Experiment No. 10
Write an R program to create a data frame from four given vectors.
R Programming Code :
name = c('Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin',
'Jonas')
score = c(12.5, 9, 16.5, 12, 9, 20, 14.5, 13.5, 8, 19)
attempts = c(1, 3, 2, 3, 2, 3, 1, 1, 2, 1)
qualify = c('yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes')
print("Original data frame:")
print(name)
print(score)
print(attempts)
print(qualify)
df = data.frame(name, score, attempts, qualify)
print(df)
Sample Output:
Experiment No. 11
Write an R program to sort a given data frame by the name and score columns.
R Programming Code :
exam_data = data.frame(
name = c('Anastasia', 'Amsa', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew'),
score = c(12.5, 9, 16.5, 12, 9, 20, 14.5),
attempts = c(1, 3, 2, 3, 2, 3, 1),
qualify = c('yes', 'no', 'yes', 'no', 'no', 'yes', 'yes')
)
print("Original dataframe:")
print(exam_data)
print("dataframe after sorting 'name' and 'score' columns:")
exam_data = exam_data[with(exam_data, order(name, score)), ]
print(exam_data)
Sample Output:
[1] "Original dataframe:"
name score attempts qualify
1 Anastasia 12.5 1 yes
2 Amsa 9.0 3 no
3 Katherine 16.5 2 yes
4 James 12.0 3 no
5 Emily 9.0 2 no
6 Michael 20.0 3 yes
7 Matthew 14.5 1 yes
[1] "dataframe after sorting 'name' and 'score' columns:"
name score attempts qualify
2 Amsa 9.0 3 no
1 Anastasia 12.5 1 yes
5 Emily 9.0 2 no
4 James 12.0 3 no
3 Katherine 16.5 2 yes
7 Matthew 14.5 1 yes
6 Michael 20.0 3 yes
Result:
Thus, the program is executed successfully.
Experiment No. 12
R Programming Code :
exam_data = data.frame(
name = c('Anastasia', 'Dima', 'Katherine', 'James', 'Emily', 'Michael', 'Matthew', 'Laura', 'Kevin',
'Jonas'),
score = c(12.5, 9, 16.5, 12, 9, 20, 14.5, 13.5, 8, 19),
attempts = c(1, NA, 2, NA, 2, NA, 1, NA, 2, 1),
qualify = c('yes', 'no', 'yes', 'no', 'no', 'yes', 'yes', 'no', 'no', 'yes')
)
print("Original dataframe:")
print(exam_data)
print("The number of NA values in attempts column:")
print(sum(is.na(exam_data$attempts)))
Sample Output:
[1] "Original dataframe:"
name score attempts qualify
1 Anastasia 12.5 1 yes
2 Dima 9.0 NA no
3 Katherine 16.5 2 yes
4 James 12.0 NA no
5 Emily 9.0 2 no
6 Michael 20.0 NA yes
7 Matthew 14.5 1 yes
8 Laura 13.5 NA no
9 Kevin 8.0 2 no
10 Jonas 19.0 1 yes
[1] "The number of NA values in attempts column:"
[1] 4
Result:
Thus, the program is executed successfully.
Experiment No. 13
Bar Plot
There are two types of bar plots, horizontal and vertical, which represent data points as horizontal
or vertical bars whose lengths are proportional to the value of the data item. They are generally
used for plotting a continuous variable against a categorical one. By setting the horiz parameter to
TRUE or FALSE, we get a horizontal or vertical bar plot respectively.
13. Write an R program to create a simple bar plot of four subjects’ marks.
marks = c(70, 95, 80, 74)
barplot(marks,main = "Comparing marks of 4 subjects",
xlab = "Subject", ylab = "Marks",
names.arg = c("English", "Science", "Math.", "Hist."),
col = "darkred",
horiz = FALSE)
Output:
> marks = c(70, 95, 80, 74)
> barplot(marks,main = "Comparing marks of 4 subjects",
+ xlab = "Subject",
+ ylab = "Marks",
+ names.arg = c("English", "Science", "Math.", "Hist."),
+ col = "darkred",
+ horiz = FALSE)
Result:
14. Write an R program to create a simple bar plot for ozone concentration in air with “airquality”
dataset.
# Horizontal Bar Plot for
# Ozone concentration in air
barplot(airquality$Ozone,
main = 'Ozone Concentration in air',
xlab = 'ozone levels', horiz = TRUE)
# Vertical Bar Plot for
# Ozone concentration in air (values run along the y-axis here)
barplot(airquality$Ozone, main = 'Ozone Concentration in air',
ylab = 'ozone levels', col = 'blue', horiz = FALSE)
Result:
Histogram
A histogram is like a bar chart in that it uses bars of varying height to represent the distribution of data.
However, in a histogram continuous values are grouped into consecutive intervals called bins,
whose size can be varied.
For a histogram, the parameter xlim can be used to specify the interval within which all values
are to be displayed.
Another parameter, freq, when set to TRUE shows the frequency (count) of values in each bin,
and when set to FALSE shows probability densities on the y-axis, such
that the total area of the histogram adds up to one.
Histograms are used in the following scenarios:
• To verify an equal and symmetric distribution of the data.
• To identify deviations from expected values.
15. Write an R program to create a histogram for maximum daily temperature with “airquality”
dataset.
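One possible solution is sketched below; it is not the manual's original listing. It assumes the Temp column of the built-in airquality dataset holds the maximum daily temperature; the title, colour and xlim values are illustrative choices. The pdf(NULL) line is an addition so the script also runs without a display.

```r
# When run as a script, draw to a null device so that no file is written.
if (!interactive()) pdf(NULL)

# Histogram of maximum daily temperature (Temp column, degrees Fahrenheit).
h <- hist(airquality$Temp,
          main = "Maximum daily temperature",
          xlab = "Temperature (degrees Fahrenheit)",
          xlim = c(50, 100),
          col = "lightblue",
          freq = TRUE)

# hist() also returns the bin boundaries and counts it used.
print(h$breaks)
print(h$counts)
```

Setting freq = FALSE instead would put densities rather than counts on the y-axis.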
Result:
Box Plot
The statistical summary of the given data is presented graphically using a boxplot. A boxplot
depicts information like the minimum and maximum data point, the median value, first and third
quartile, and interquartile range.
16. Write an R program to create a boxplot for the variable “wind” with “airquality” dataset.
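One possible solution is sketched below; it is not the manual's original listing. In the built-in airquality dataset the column is named Wind (capital W); the title, axis label and colour are illustrative choices, and pdf(NULL) is added so the script also runs without a display.

```r
# When run as a script, draw to a null device so that no file is written.
if (!interactive()) pdf(NULL)

# Boxplot of wind speed from the airquality dataset.
b <- boxplot(airquality$Wind,
             main = "Wind speed in the airquality dataset",
             ylab = "Wind speed (mph)",
             col = "orange")

# boxplot() returns the five statistics it drew:
# lower whisker, first quartile, median, third quartile, upper whisker.
print(b$stats)
```

Points beyond the whiskers are drawn individually and are listed in b$out.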
Result: