Data Science and Big Data Analysis Mcqs
UNIT 1
(Prepared by Mrs.Sharmila Agnal)
INTRODUCTION TO DATA SCIENCE
Introduction of Data Science, Basic Data Analytics using R, R Graphical User Interfaces, Data
Import and Export, Attribute and Data Types, Descriptive Statistics, Exploratory Data Analysis,
Visualization Before Analysis, Dirty Data, Visualizing a Single Variable, Examining Multiple
Variables, Data Exploration Versus Presentation, Statistical Methods for Evaluation, Hypothesis
Testing, Difference of Means, Wilcoxon Rank-Sum Test, ANOVA
MULTIPLE CHOICE QUESTIONS
1. Which function is used to display the first six records of a dataset in R? Ref 1: Pg.No 95
a) summary() b) head() c) hist() d) read()
2. Select the return value of the class() function in R. Ref 1: Pg.No.103
a) Datatype of variable b) mean and median c) vector d) array
3. Which function is used to create 2-dimensional arrays? Ref 1: Pg.No.104
a) array() b) matrix() c) vector() d) list()
4. Which is used to provide a structure for storing and accessing several variables of different
data types? Ref 1: Pg.No.105
a) Vector b) matrix c) data frames d) Factors
5. Select the data analysis approach which is used to reveal the important characteristics of a
dataset through visualization. Ref 1: Pg.No.111
a)Exploratory Data Analysis
b) Predictive Data Analysis
c) Statistical Data Analysis
d) none of these
6. Which is used to visualize a single variable? Ref 1: Pg.No.119
a) Dotchart
b) Barplot
c) Density Plot
d) All of these
7. Choose the common technique which is used to assess the difference or the significance of a
difference. Ref 1: Pg.No.133
a) Hypothesis testing
b) Student’s t-test
c) Welch’s t test
d) Wilcoxon Rank-Sum Test
8. Which of the following is the most important language for Data Science?
a) Java b) Ruby c) R d) Python
9. What is the common goal of statistical modeling?
a) Find only one solution for a particular problem
b) Find out the question which is to be answered
c) Find out answer from dataset without asking question
d) None
10. Two vectors M and N are defined as M <- c(3, 2, 4) and N <- c(1, 2). What will be the
output of vector Z that is defined as Z <- M*N?
a) Z <- (3, 2, 4)
b) Z <- (3, 6, 4)
c) Z <- (3, 4, 5)
d) Z <- (3, 4, 4)
11. Functions are defined using the _________ directive and are stored as R objects
a) function()
b) funct()
c) functions()
d) fun()
12. The most convenient way to use R is at a graphics workstation running a ________
system.
a) windowing
b) running
c) interfacing
d) all of the mentioned
13. Which of the following will start the R program ?
a) $ R
b) > R
c) * R
d) #R
14. What is the class defined in the following code? y<-c(FALSE,2)
a) Character
b) Numeric
c) Logical
d) Integer
15. Which one of the following is not a basic datatype?
a) Numeric
b) Character
c) Data frame
d) Integer
16. How is the matrix constructed by the following code?
m <-matrix(1:6, nrow=2, ncol=3)
a) row-wise
b) column-wise
c) any manner
d) data insufficient
17. What is the function used to test objects (returns a logical operator) if they are NaN?
a) as.nan()
b) is.na()
c) as.na()
d) is.nan()
18. R functionality is divided into a number of ________
a) Packages
b) Functions
c) Domains
d) All of the mentioned
19. Which of the following is a base package for R language?
a) util
b) lang
c) tools
d) all of the mentioned
20. Which of the following operator is used to create integer sequences?
a) :
b) ;
c) –
d) ~
4 MARK QUESTIONS
1. How do you install a package in R?
The following command is used to install a package in R: install.packages("<package_name>")
2. Two vectors X and Y are defined as follows – X <- c(3, 2, 4) and Y <- c(1, 2). What will be
the output of vector Z that is defined as Z <- X*Y?
In R, when the vectors have different lengths, the shorter vector is recycled (its elements are
reused in order) until every element of the longer vector has been multiplied.
The output of the above code will be Z <- (3, 4, 4)
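A short R sketch (illustrative, not from the reference text) showing the recycling rule:
M <- c(3, 2, 4)
N <- c(1, 2)
Z <- M * N    # N is recycled: c(3*1, 2*2, 4*1)
Z             # 3 4 4, with a warning that the longer length is not a multiple of the shorter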
3. What is the difference between data frame and a matrix in R?
A data frame can contain heterogeneous inputs while a matrix cannot. In a matrix only similar
data types can be stored, whereas in a data frame the columns can hold different data types such
as characters, integers and numeric values.
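A minimal R illustration of the difference (the values below are hypothetical):
df <- data.frame(id = 1:3, gender = c("F", "F", "M"), total = c(800.64, 217.53, 74.58))
str(df)            # columns keep their own types: integer, character, numeric
m <- cbind(1:3, c("F", "F", "M"))
class(m[1, 1])     # "character" -- the matrix coerced every element to one type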
4. What are the data types in R on which binary operators can be applied?
Scalars, matrices and vectors.
5. How will you create scatterplot matrices in R language?
A matrix of scatterplots can be produced using pairs. Pairs function takes various parameters
like formula, data, subset, labels, etc.
The two key parameters required to build a scatterplot matrix are –
• formula – a formula such as ~a+b+c. Each term gives a separate variable in the pairs
plot, and the terms should be numeric vectors. It basically represents the series of
variables used in pairs.
• data – it basically represents the dataset from which the variables have to be taken for
building a scatterplot.
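A minimal sketch using the built-in iris dataset (an assumption; any data frame with numeric columns works):
pairs(~ Sepal.Length + Sepal.Width + Petal.Length, data = iris,
      labels = c("Sepal length", "Sepal width", "Petal length"))
# each panel is a scatterplot of one pair of the listed variables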
6. What are the different data types/objects in R?
• Vectors
• Matrices
• Lists
• Arrays
• Factors
• Data frames
7. Write a custom function in R
myFunction <- function(arg1, arg2, ... ){
statements
return(object)
}
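For illustration, a hypothetical function built from this template and a call to it:
# returns the mean and standard deviation of a numeric vector
summStats <- function(x) {
  out <- c(mean = mean(x), sd = sd(x))
  return(out)
}
summStats(c(3, 2, 4))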
8. How do you import data in R?
Let’s use CSV as an example, as it’s a very common data format. Simply make sure the file is
saved in a CSV format, then use the read function to import the data.
yourRDateHere <- read.csv("Data.csv", header = TRUE)
9. What is Linear Regression?
Linear regression is a statistical technique where the score of a variable Y is predicted from
the score of a second variable X. X is referred to as the predictor variable and Y as the
criterion variable.
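A minimal R sketch with made-up data (the X and Y values are illustrative assumptions, not from the reference text):
X <- c(1, 2, 3, 4, 5)
Y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
fit <- lm(Y ~ X)                  # fit Y = intercept + slope * X
summary(fit)                      # coefficients, R-squared, p-values
predict(fit, data.frame(X = 6))   # predicted score of Y for a new X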
10. Illustrate with R code some common R functions that provide descriptive
statistics.
sales <- read.csv("c:/data/yearly_sales.csv")
# examine the imported dataset
head(sales)
summary(sales)
# plot num_of_orders vs. sales
plot(sales$num_of_orders, sales$sales_total,
main="Number of Orders vs. Sales")
# perform a statistical analysis (fit a linear regression model)
results <- lm(sales$sales_total ~ sales$num_of_orders)
summary(results)
# perform some diagnostics on the fitted model
# plot histogram of the residuals
hist(results$residuals, breaks = 800)
head(sales)
cust_id sales_total num_of_orders gender
1 100001 800.64 3 F
2 100002 217.53 3 F
3 100003 74.58 2 M
4 100004 498.60 3 M
5 100005 723.11 4 F
11. Explain briefly the hexbin plot and give the syntax showing how hexbin plots can
be used for large datasets.
A hexbinplot combines the ideas of a scatterplot and a histogram. Like a scatterplot, a
hexbinplot visualizes data on the x-axis and y-axis. Data is placed into hexbins, and the third
dimension uses shading to represent the concentration of data in each hexbin.
# plot the data points
plot(log10(MeanHouseholdIncome) ~ MeanEducation, data=zcta)
# add a straight fitted line of the linear regression
abline(lm(log10(MeanHouseholdIncome) ~ MeanEducation, data=zcta),
col="red")
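A hedged sketch of the hexbin version of the same plot, assuming the hexbin package is installed and the same zcta data frame is available:
library(hexbin)
# hexagonal binning of the two variables; the shading of each hexagon
# reflects how many data points fall inside it
hexbinplot(log10(MeanHouseholdIncome) ~ MeanEducation, data = zcta)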
12. Illustrate about type I and type II errors.
A hypothesis test may result in two types of errors, depending on whether the test accepts
or rejects the null hypothesis. These two errors are known as type I and type II errors.
A type I error is the rejection of the null hypothesis when the null hypothesis is
TRUE. The probability of a type I error is denoted by the Greek letter α (alpha).
A type II error is the acceptance of the null hypothesis when the null hypothesis is
FALSE. The probability of a type II error is denoted by the Greek letter β (beta).
13. Construct a program in R to add two matrices.
myMatrixA <- matrix(data = 1:9, nrow = 3, ncol = 3)
myMatrixA
myMatrixB <- matrix(data = 1:9, nrow = 3, ncol = 3)
myMatrixB
myMatrixCAfterAdding <- myMatrixA + myMatrixB
myMatrixCAfterAdding
14. What is power and sample size in hypothesis testing?
The power of a test is the probability of correctly rejecting the null hypothesis. It is denoted
by 1 - β, where β is the probability of a type II error. Because the power of a test improves as the
sample size increases, power is used to determine the necessary sample size. In the
difference of means, the power of a hypothesis test depends on the true difference of the
population means.
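A minimal R sketch using the built-in power.t.test() function (the effect size, standard deviation and power below are illustrative assumptions):
# sample size per group needed to detect a true difference in means of 5 units
# with sd = 10, a 5% significance level and 80% power
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.80)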
15. What R function is used to encode a vector as a category?
The function factor is used to encode a vector as a factor (the terms ‘category’ and
‘enumerated type’ are also used for factors). If argument ordered is TRUE, the factor levels
are assumed to be ordered. For compatibility with S there is also a function ordered.
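A short sketch with hypothetical vectors:
g <- factor(c("M", "F", "F", "M"))        # unordered factor with levels F and M
levels(g)
income <- factor(c("low", "high", "medium"),
                 levels = c("low", "medium", "high"), ordered = TRUE)
income < "high"                           # comparisons are meaningful for ordered factors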
16. Explain briefly about ANOVA.
ANOVA is a generalization of the hypothesis testing of the difference of two population
means. ANOVA tests if any of the population means differ from the other population means.
The null hypothesis of ANOVA is that all the population means are equal. The alternative
hypothesis is that at least one pair of the population means is not equal.
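A minimal R sketch with simulated data (the group names and means are assumptions made only for illustration):
set.seed(1)
offer <- factor(rep(c("offer1", "offer2", "nopromo"), each = 50))
purchase <- c(rnorm(50, 80), rnorm(50, 85), rnorm(50, 60))   # simulated purchase sizes
model <- aov(purchase ~ offer)
summary(model)    # the F statistic and p-value test whether any group mean differs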
12 MARK QUESTIONS
1. Explain in detail about R Graphical User Interfaces.
2. Discuss the various statistical method for evaluation.
3. Explain in brief about (i)Welch’s t-test (ii) students t test
4. How many sections does a box-and-whisker plot divide the data into? What are these sections?
5. What is a type I error? What is a type II error? Is one always more serious than the other?
Why?
6. Explain in detail about hypothesis testing with an example.
7. Describe the techniques used for data exploration of a Multiple Variable with suitable
syntax and example?
8. Explain in detail about exploratory data analysis.
9. Explain in detail about Dotchart, Barplot, scatterplot.
10. Explain in detail about attributes and datatypes in R?
UNIT II
(Prepared by Mrs.Sharmila Agnal)
ADVANCED ANALYTICAL THEORY AND METHODS
Overview of Clustering, K-means, Use Cases, Overview of the Method, Perform a K-means
Analysis using R, Classification, Decision Trees, Overview of a Decision Tree, Decision Tree
Algorithms, Evaluating a Decision Tree, Decision Tree in R, Bayes’ Theorem, Naïve Bayes
Classifier, Smoothing, Naïve Bayes in R
MULTIPLE CHOICE QUESTIONS
1. Which of the following is required by K-means clustering ?
a) defined distance metric
b) number of clusters
c) initial guess as to cluster centroids
d) all of the Mentioned
2. Which of the following function is used for k-means clustering ?
a) k-means
b) k-mean
c) heatmap
d) None of the Mentioned
3. Which method of analysis does not classify variables as dependent or independent?
a) regression analysis
b) discriminant analysis
c) analysis of variance
d) cluster analysis
4. Which statement is not true about cluster analysis?
a. Objects in each cluster tend to be similar to each other and dissimilar to objects in the
other clusters.
b. Cluster analysis is also called classification analysis or numerical taxonomy.
c. Groups or clusters are suggested by the data, not defined a priori.
d. Cluster analysis is a technique for analyzing data when the criterion or dependent
variable is categorical and the independent variables are interval in nature.
5. A _____ or tree graph is a graphical device for displaying clustering results. Vertical
lines represent clusters that are joined together. The position of the line on the scale
indicates the distances at which clusters were joined.
a. dendrogram
b. scattergram
c. scree plot
d. icicle diagram
6. The most important part of _____ is selecting the variables on which clustering is based.
a. interpreting and profiling clusters
b. selecting a clustering procedure
c. assessing the validity of clustering
d. formulating the clustering problem
7. The most commonly used measure of similarity is the _____ or its square.
a. euclidean distance
b. city-block distance
c. Chebychev’s distance
d. Manhattan distance
8. _____ is a clustering procedure characterized by the development of a tree-like structure.
a. Non-hierarchical clustering
b. Hierarchical clustering
c. Divisive clustering
d. Agglomerative clustering
9. _____ is a clustering procedure where all objects start out in one giant cluster. Clusters
are formed by dividing this cluster into smaller and smaller clusters.
a. Non-hierarchical clustering
b. Hierarchical clustering
c. Divisive clustering
d. Agglomerative clustering
10. The _____ method uses information on all pairs of distances, not merely the minimum or
maximum distances.
a. single linkage
b. medium linkage
c. complete linkage
d. average linkage
11. _____ is frequently referred to as k-means clustering.
a. Non-hierarchical clustering
b. Optimizing partitioning
c. Divisive clustering
d. Agglomerative clustering
12. Data used to build a data mining model.
a. validation data
b. training data
c. test data
d. hidden data
13. Supervised learning and unsupervised clustering both require at least one
a. hidden attribute.
b. output attribute.
c. input attribute.
d. categorical attribute.
14. Supervised learning differs from unsupervised clustering in that supervised learning
requires
a. at least one input attribute.
b. input attributes to be categorical.
c. at least one output attribute.
d. output attributes to be categorical.
15. Database query is used to uncover this type of knowledge.
a. deep
b. hidden
c. shallow
d. multidimensional
16. A statement to be tested.
a. theory
b. procedure
c. principle
d. hypothesis
17. A person trained to interact with a human expert in order to capture their knowledge.
a. knowledge programmer
b. knowledge developer
c. knowledge engineer
d. knowledge extractor
18. Which of the following is not a characteristic of a data warehouse?
a. operational database
b. flat file
c. decision tree
d. data warehouse
20. A nearest neighbor approach is best used
a) ID3
b) C4.5
c) CART
d) C4
4 MARK QUESTIONS
1. Define Clustering?
Clustering is one of the most common exploratory data analysis techniques, used to get an
intuition about the structure of the data. It can be defined as the task of identifying
subgroups in the data such that data points in the same subgroup (cluster) are very similar
while data points in different clusters are very different.
2. Write the algorithm for k means clustering?
Algorithmic steps for k-means clustering
Let X = {x1, x2, x3, ..., xn} be the set of data points and V = {v1, v2, ..., vc} be the set of
centers.
1) Randomly select ‘c’ cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from that data point is the
minimum over all the cluster centers.
4) Recalculate the new cluster center using:
   v_i = (1 / c_i) * Σ x_j   (the sum is taken over the data points currently assigned to the ith cluster)
where 'c_i' represents the number of data points in the ith cluster.
5) Recalculate the distance between each data point and the newly obtained cluster centers.
6) If no data point was reassigned then stop, otherwise repeat from step 3).
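A small R sketch of the centre-update step (step 4), assuming X is a numeric matrix of data points and 'assignment' holds the current cluster of each row (both names are illustrative):
update_centers <- function(X, assignment, k) {
  # row i of the result is the mean of the points currently assigned to cluster i
  t(sapply(1:k, function(i) colMeans(X[assignment == i, , drop = FALSE])))
}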
3.What are the advantages of k means clustering?
1) Fast, robust and easier to understand.
2) Relatively efficient: O(tknd), where n is # objects, k is # clusters, d is # dimension of each
object, and t is # iterations. Normally, k, t, d << n.
3) Gives best results when the data sets are distinct or well separated from each other.
4.What are the disadvantages of k means clustering?
1) The learning algorithm requires apriori specification of the number of cluster centers.
2) The use of Exclusive Assignment - If there are two highly overlapping data then k-means
will not be able to resolve that there are two clusters.
3) The learning algorithm is not invariant to non-linear transformations i.e. with different
representation of data we get different results (data represented in form of cartesian co-
ordinates and polar co-ordinates will give different results).
4) Euclidean distance measures can unequally weight underlying factors.
5) The learning algorithm provides the local optima of the squared error function.
6) Random choice of the initial cluster centers may not lead to a fruitful result.
7) Applicable only when mean is defined i.e. fails for categorical data.
8) Unable to handle noisy data and outliers.
9) Algorithm fails for non-linear data set.
5.What are the applications of k-means?
• image processing
• medical
• customer segmentation.
6.Write R program to Perform a K-means Analysis
library(plyr)
library(ggplot2)
library(cluster)
library(lattice)
library(graphics)
library(grid)
library(gridExtra)
#import the student grades
grade_input = as.data.frame(read.csv("c:/data/grades_km_input.csv"))
kmdata_orig = as.matrix(grade_input[, c("Student", "English", "Math", "Science")])
kmdata <- kmdata_orig[,2:4]
kmdata[1:10,]
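The listing above stops after preparing the data; a minimal continuation (assuming three clusters, which is not stated in the answer) would be:
km <- kmeans(kmdata, centers = 3, nstart = 25)
km$centers          # the three cluster centres
table(km$cluster)   # number of students assigned to each cluster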
7. Write short note on decision trees?
A decision tree (also called a prediction tree) uses a tree structure to specify sequences of decisions
and consequences. Given input X = {x1, x2, ..., xn}, the goal is to predict a response or output variable Y. Each
member of the set {x1, x2, ..., xn} is called an input variable. The prediction can be achieved by constructing a
decision tree with test points and branches. At each test point, a decision is made to pick a
specific branch and traverse down the tree. Eventually, a final point is reached, and a prediction
can be made. Each test point in a decision tree involves testing a particular input variable (or
attribute), and each branch represents the decision being made.
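A minimal R sketch using the rpart package and the built-in iris data (both are assumptions, not part of the reference text):
library(rpart)
fit <- rpart(Species ~ ., data = iris, method = "class")   # grow a classification tree
print(fit)                                   # the test points and branches
predict(fit, iris[1:3, ], type = "class")    # traverse the tree to predict the class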
8. List the key terms of decision tree?
Root node
Internal nodes
Leaf nodes
Split
Depth
9.What is a Decision Tree?
A decision tree is a map of the possible outcomes of a series of related choices. It allows an
individual or organization to weigh possible actions against one another based on their costs,
probabilities, and benefits.
As the name goes, it uses a tree-like model of decisions. They can be used either to drive
informal discussion or to map out an algorithm that predicts the best choice mathematically.
A decision tree typically starts with a single node, which branches into possible outcomes. Each
of those outcomes leads to additional nodes, which branch off into other possibilities. This gives
it a tree-like shape.
10. Advantages of Decision Trees
• Decision trees generate understandable rules.
• Decision trees perform classification without requiring much computation.
• Decision trees are capable of handling both continuous and categorical variables.
• Decision trees provide a clear indication of which fields are most important for prediction
or classification.
11 Disadvantages of Decision Trees
• Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.
• Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.
• Decision trees can be computationally expensive to train. The process of growing a
decision tree is computationally expensive. At each node, each candidate splitting field
must be sorted before its best split can be found. In some algorithms, combinations of
fields are used and a search must be made for optimal combining weights. Pruning
algorithms can also be expensive since many candidate sub-trees must be formed and
compared.
12. List the decision tree algorithms
• ID3
• C4.5
• CART
13. What is meant by Naïve Bayes classifier.
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’
Theorem. It is not a single algorithm but a family of algorithms where all of them share a
common principle, i.e. every pair of features being classified is independent of each
other.
14. Explain Naive Bayes algorithm
Naive Bayes model is easy to build and particularly useful for very large data sets. Along
with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods. Bayes theorem provides a way of calculating posterior probability
P(c|x) from P(c), P(x) and P(x|c).
• P(c|x) is the posterior probability of class (c, target) given predictor (x, attributes).
• P(c) is the prior probability of class.
• P(x|c) is the likelihood which is the probability of predictor given class.
• P(x) is the prior probability of predictor.
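A minimal R sketch, assuming the e1071 package (which provides naiveBayes()) and the built-in iris data:
library(e1071)
model <- naiveBayes(Species ~ ., data = iris)
predict(model, iris[1:3, ])                  # predicted classes
predict(model, iris[1:3, ], type = "raw")    # posterior probabilities P(c|x)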
15. Define Smoothing?
Smoothing techniques can be employed to adjust the estimated conditional probabilities and to ensure they are nonzero.
A smoothing technique assigns a small nonzero probability to rare events not included in the
training dataset. Also, the smoothing addresses the possibility of taking the logarithm of zero.
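A common choice is Laplace (add-one) smoothing; a hedged sketch of the adjusted estimate and of how it is requested in the e1071 implementation used above:
# P(x = a | c) is estimated as (count of a within class c + 1) / (count of class c + m),
# where m is the number of distinct values the attribute can take.
# In e1071 the adjustment applies to categorical predictors and is set with 'laplace':
model_s <- naiveBayes(Species ~ ., data = iris, laplace = 1)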
12 MARK QUESTIONS
1. Explain k-means clustering algorithm in detail with an example.
2. Explain how k-means clustering algorithm can be implemented using R.
3. Explain the structure and working of decision tree.
4. Explain any two decision tree algorithms in detail.
5. Illustrate how to develop decision tree algorithm using R.
6. Describe Bayes theorem with an example.
7. Explain how classification is done by Naïve Bayes Classifier.
8. Compare and contrast clustering and classification
9. Explain the pros and cons of Naïve Bayes Classifier.
UNIT III
(Prepared by Mrs.J.Karthika)
BIG DATA FROM DIFFERENT PERSPECTIVES
Analytics for Unstructured Data, Use Cases, MapReduce, Apache Hadoop, The Hadoop
Ecosystem, Pig, Hive, HBase, Mahout, NoSQL, SQL Essentials, Joins, Set Operations,
Grouping Extensions, In-Database Text Analysis, Advanced SQL, Window Functions, User-
defined Functions and Aggregates, Ordered Aggregates, MADlib
MULTIPLE CHOICE QUESTIONS
1. Hadoop is a framework that works with a variety of related tools. Common cohorts include:
a) MapReduce, Hive and HBase
b) MapReduce, MySQL and Google Apps
c) MapReduce, Hummer and Iguana
d) MapReduce, Heron and Trumpet
2. What was Hadoop named after?
a) Creator Doug Cutting’s favorite circus act
b) Cutting’s high school rock band
c) The toy elephant of Cutting’s son
d) A sound Cutting’s laptop made during Hadoop development
3. All of the following accurately describe Hadoop, EXCEPT:
a) Open source
b) Real-time
c) Java-based
d) Distributed computing approach
4. __________ can best be described as a programming model used to develop Hadoop-based
applications that can process massive amounts of data.
a) MapReduce
b) Mahout
c) Oozie
d) All of the mentioned
5. __________ has the world’s largest Hadoop cluster.
a) Apple
b) Datamatics
c) Facebook
d) None of the mentioned
6. Facebook Tackles Big Data With _______ based on Hadoop.
a) ‘Project Prism’
b) ‘Prism’
c) ‘Project Big’
d) ‘Project Data’
7. As companies move past the experimental phase with Hadoop, many cite the need for
additional capabilities, including:
a) Improved data storage and information retrieval
b) Improved extract, transform and load features for data integration
c) Improved data warehousing functionality
d) Improved security, workload management and SQL support
8. Point out the correct statement :
a) Hadoop do need specialized hardware to process the data
b) Hadoop 2.0 allows live stream processing of real time data
c) In Hadoop programming framework output files are divided in to lines or records
d) None of the mentioned
9. According to analysts, for what can traditional IT systems provide a foundation when they’re
integrated with big data technologies like Hadoop ?
a) Big data management and data mining
b) Data warehousing and business intelligence
c) Management of Hadoop clusters
d) Collecting and storing unstructured data
10. Point out the wrong statement :
a) Hadoop processing capabilities are huge and its real advantage lies in the ability to process
terabytes & petabytes of data
b) Hadoop uses a programming model called "MapReduce"; all the programs should conform to
this model in order to work on the Hadoop platform
c) The programming model, MapReduce, used by Hadoop is difficult to write and test
d) All of the mentioned
11. Which utility is used for checking the health of HDFS system?
a) -fchk
b) -fsck
c) –fsch
d) –fcks
12. The datanode and namenode are respectively
a) Master and worker nodes
b) Worker and Master nodes
c) Both are worker nodes
d) None
13. The following is not permitted on HDFS file
a) Deleting
b) Renaming
c) Moving
d) Executing
14. Which of the below apache system deals with ingesting streaming data to Hadoop
a) Oozie
b) Kafka
c) Flume
d) Hive
15. Which of the following is not a goal of HDFS?
a) Fault detection and recovery
b) Handle huge dataset
c) Prevent deletion of data
d) Provide high network bandwidth for data movement
16. The current limiting factor to the size of a Hadoop cluster is
a) Excess heat generated in data center
b) Upper limit of the network bandwidth
c) Upper limit of the RAM in namenode
d) 4000 data nodes
17. The namenode loses its only copy of fsimage file. We can recover this from
a) Datanodes
b) Secondary namenode
c) Checkpoint node
d) Never
18. What are the steps involved in Pig and PigLatin?
a) LOAD
b) TRANSFORM
c) DUMP and STORE
d) All the above
19. What is HQL in hive?
a) Hive Queue Language
b) Hive Query Language
c) Hypertext Query Language
d) Hive Query Learning
20. Flume entities are
a) Source
b) decorators
c) sinks
d) All the above
4 MARK QUESTIONS
1. What are the characteristics of Big Data?
Volume
• The exponential growth in the data storage
• The data can be in the form of videos, music and large images
• Terabytes and Petabytes of the storage
• Database grows and data needs to be reevaluated
• Same data is re-evaluated with multiple angles
• The big volume indeed represents Big Data
Velocity
• The data growth changed how we look at the data.
• Data of yesterday is not recent now.
• Simply discarding old messages and paying attention only to recent updates is not the right approach.
• The data movement is real time
• This high velocity data represent Big Data
Variety
• Multiple formats: Excel, CSV, Access or a simple text file.
• May be in the form of video, SMS, pdf
• Data is not in the same format
• This variety of the data represent Big Data
2. Difference between data in warehouse and data in Hadoop?
S.No | Data in the warehouse | Data in Hadoop
2 | Data goes through a lot of rigor to make it into the warehouse. | Data isn't likely to be distributed like data warehouse data.
3 | Quality: it's cleaned up via cleansing, enrichment, matching, glossary, metadata, master data management, modeling, and other services before it's ready for analysis. | With all the volume and velocity of today's data, there's just no way you can afford to spend the time and resources required to cleanse and document every piece of data properly, because it's just not going to be economical.
4 | Making data warehouse-ready is an expensive process. | Keeping that level of rigor would make many Hadoop use cases cost prohibitive.
3. Explain Hadoop and its components?
NameNode: The heart of Hadoop. The NameNode manages the file system namespace. It stores the
metadata information of the data blocks permanently on the local disk in the form of a
namespace image and an edit log file.
• The NameNode also knows the location of the data blocks on the data nodes and recreates the
block-to-datanode mapping when it is restarted. If the NameNode crashes, then the entire
Hadoop system goes down.
4. Secondary NameNode: Periodically copies and merges the namespace image and the edit log.
5. DataNode: Stores the blocks of data and retrieves them. It reports the block information to
the NameNode periodically.
6. Job Tracker: Schedules the clients' jobs. It creates map and reduce tasks and schedules
them to run on the DataNodes (task trackers). The Job Tracker also checks for any failed tasks and
reschedules the failed tasks on another DataNode. The Job Tracker can run on the NameNode or
a separate node.
7. Task Tracker: Runs on the DataNodes and reports the status of the tasks to the
NameNode.
6. What are the steps involved in Pig and PigLatin?
Pig and PigLatin
• Steps
1. LOAD the data from HDFS.
2. run the data through a set of transformations (which, under the covers, are translated into a
set of mapper and reducer tasks).
3. DUMP the data to the screen or
4. STORE the results in a file somewhere.
7. Define HIVE?
Although Pig is a powerful and simple language to use, the downside is that it's something new
to learn and master.
• Some folks at Facebook developed a runtime Hadoop support structure that allows anyone
who is already fluent with SQL (which is commonplace for relational database developers) to
leverage the Hadoop platform right out of the gate.
• Their creation, called Hive, allows SQL developers to write Hive Query Language (HQL)
statements that are similar to standard SQL statements;
• HQL statements are broken down by the Hive service into MapReduce jobs and executed
across a Hadoop cluster.
• As with any database management system (DBMS), you can run your Hive queries in many
ways, for example from a command line interface (known as the Hive shell).
8.What is Jaql?
Jaql is primarily a query language for JavaScript Object Notation (JSON), but it supports
more than just JSON.
• It allows you to process both structured and nontraditional data and was donated by IBM to
the open source community.
• Specifically, Jaql allows you to select, join, group, and filter data that is stored in HDFS,
much like a blend of Pig and Hive.
9.Define FLUME and its entities?
• A flume is a channel that directs water from a source to some other location where water is
needed.
• Flume was created to allow you to flow data from a source into your Hadoop environment.
• In Flume, the entities you work with are called sources, decorators, and sinks.
• A source can be any data source, and Flume has many predefined source adapters. For
example, some adapters allow the flow of anything coming off a TCP port to enter the flow,
or anything coming to standard input (stdin).
• A sink is the target of a specific operation (and in Flume, among other paradigms that use
this term, the sink of one operation can be the source for the next downstream operation).
• A decorator is an operation on the stream that can transform the stream in some manner,
modify data by adding or removing pieces of information, and more.
10.What are sink types and explain the sink types?
Sink types
• Collector Tier Event sink: This is where you would land a flow (or possibly multiple
flows joined together) into an HDFS formatted file system.
• Agent Tier Event sink: This sink is used when you want the sink to be the input source for
another operation. When you use these sinks, Flume will also ensure the integrity of the flow
by sending back acknowledgments that data has actually arrived at the sink.
• Basic sink: Can be a text file, the console display, a simple HDFS path, or a null bucket
where the data is simply deleted.
where the data is simply deleted.
11. Define Hbase
HBase
• HBase is a column-oriented database management system that runs on top of HDFS. It is
well suited for sparse data sets, which are common in many Big Data use cases.
• HBase does not support a structured query language like SQL; in fact, HBase isn’t a
relational data store at all. HBase applications are written in Java much like a typical
MapReduce application.
• HBase does support writing applications in Avro, REST, and Thrift.
• An HBase system comprises a set of tables. Each table contains rows and columns, much
like a traditional database.
12. Define Oozie
Oozie
• Oozie is an open source project that simplifies workflow and coordination between jobs.
• It provides users with the ability to define actions and dependencies between actions.
• A workflow in Oozie is defined in what is called a Directed Acyclic Graph (DAG).
• A workflow can be scheduled to begin based on a given time or based on the arrival of
some specific data in the file system. After inception, further workflow actions are executed
based on the completion of the previous actions in the graph.
12 MARK QUESTIONS
1. Elaborate the difference between data in the warehouse and data in Hadoop.
2. What are the characteristics of big data?
3. Explain Hadoop and its components?
4. Explain the application development in Hadoop?
5. Define Pig and PigLatin and the steps involved in it
6. What is Flume and explain
a) Source
b) Sink
c) Decorator
d) Sink types
7.Explain other Hadoop components
a) Zookeeper
b) HBase
8.Explain the Hadoop components
a) Oozie
b) Lucene
c) Avro
UNIT IV
(Prepared by Mrs.J.Karthika)
HADOOP DISTRIBUTED FILE SYSTEM ARCHITECTURE
HDFS Architecture, HDFS Concepts, Blocks, NameNode, Secondary NameNode, DataNode,
HDFS Federation, HDFS High Availability, Basic File System Operations, Data Flow, Anatomy
of File Read, Anatomy of File Write, Anatomy of a MapReduce Job Run
MULTIPLE CHOICE QUESTIONS
1. A ________ serves as the master and there is only one NameNode per cluster.
a) Data Node
b) NameNode
c) Data block
d) Replication
2. Point out the correct statement :
a) DataNode is the slave/worker node and holds the user data in the form of Data Blocks
b) Each incoming file is broken into 32 MB by default
c) Data blocks are replicated across different nodes in the cluster to ensure a low degree of fault
tolerance
d) None of the mentioned
3. HDFS works in a __________ fashion.
a) master-worker
b) master-slave
c) worker/slave
d) all of the mentioned
4. ________ NameNode is used when the Primary NameNode goes down.
a) Rack
b) Data
c) Secondary
d) None of the mentioned
5. Point out the wrong statement :
a) Replication Factor can be configured at a cluster level (Default is set to 3) and also at a file
level
b) Block Report from each DataNode contains a list of all the blocks that are stored on that
DataNode
c) User data is stored on the local file system of DataNodes
d) DataNode is aware of the files to which the blocks stored on it belong to
6. Which of the following scenario may not be a good fit for HDFS ?
a) HDFS is not suitable for scenarios requiring multiple/simultaneous writes to the same file
b) HDFS is suitable for storing data related to applications requiring low latency data access
c) HDFS is suitable for storing data related to applications requiring low latency data access
d) None of the mentioned
7. The need for data replication can arise in various scenarios like :
a) Replication Factor is changed
b) DataNode goes down
c) Data Blocks get corrupted
d) All of the mentioned
8. ________ is the slave/worker node and holds the user data in the form of Data Blocks.
a) DataNode
b) NameNode
c) Data block
d) Replication
9. HDFS provides a command line interface called __________ used to interact with HDFS.
a) “HDFS Shell”
b) “FS Shell”
c) “DFS Shell”
d) None of the mentioned
10. HDFS is implemented in _____________ programming language.
a) C++
b) Java
c) Scala
d) None of the mentioned
11. ________ is a programming model designed for processing large volumes of data in parallel
by dividing the work into a set of independent tasks.
a) Hive
b) MapReduce
c) Pig
d) Lucene
12. Point out the correct statement :
a) Data locality means movement of algorithm to the data instead of data to algorithm
b) When the processing is done on the data algorithm is moved across the Action Nodes rather
than data to the algorithm
c) Moving Computation is expensive than Moving Data
d) None of the mentioned
13. The daemons associated with the MapReduce phase are ________ and task-trackers.
a) job-tracker
b) map-tracker
c) reduce-tracker
d) all of the mentioned
14. The JobTracker pushes work out to available _______ nodes in the cluster, striving to keep
the work as close to the data as possible
a) DataNodes
b) TaskTracker
c) ActionNodes
d) All of the mentioned
15. Point out the wrong statement :
a) The map function in Hadoop MapReduce have the following general form:map:(K1, V1) →
list(K2, V2)
b) The reduce function in Hadoop MapReduce have the following general form: reduce: (K2,
list(V2)) → list(K3, V3)
c) MapReduce has a complex model of data processing: inputs and outputs for the map and
reduce functions are key-value pairs
d) None of the mentioned
16. InputFormat class calls the ________ function and computes splits for each file and then
sends them to the jobtracker.
a) puts
b) gets
c) getSplits
d) all of the mentioned
17. On a tasktracker, the map task passes the split to the createRecordReader() method on
InputFormat to obtain a _________ for that split.
a) InputReader
b) RecordReader
c) OutputReader
d) None of the mentioned
18. The default InputFormat is __________ which treats each line of the input as a separate record; the
associated key is the byte offset.
a) TextFormat
b) TextInputFormat
c) InputFormat
d) All of the mentioned
19. __________ controls the partitioning of the keys of the intermediate map-outputs.
a) Collector
b) Partitioner
c) InputFormat
d) None of the mentioned
20. Output of the mapper is first written on the local disk for sorting and _________ process.
a) shuffling
b) secondary sorting
c) forking
d) reducing
4 MARK QUESTIONS
1.Define the goals of HDFS?
• Fault detection and recovery:
Since HDFS includes a large number of commodity hardware components, failure of components is
frequent. Therefore HDFS should have mechanisms for quick and automatic fault detection
and recovery.
• Huge datasets:
HDFS should have hundreds of nodes per cluster to manage the applications having huge
datasets.
• Hardware at data:
A requested task can be done efficiently, when the computation takes place near the data.
Especially where huge datasets are involved, it reduces the network traffic and increases the
throughput.
2.What are the features of HDFS?
• It is suitable for the distributed storage and processing.
• Hadoop provides a command interface to interact with HDFS.
• The built-in servers of namenode and datanode help users to easily check the status of
cluster.
• Streaming access to file system data.
• HDFS provides file permissions and authentication.
3. Define Namenode and what are the tasks performed at the namenode?
• The namenode is the commodity hardware that contains the operating system and the namenode
software.
• It is software that can be run on commodity hardware. The system having the namenode acts
as the master server and it does the following tasks:
– Manages the file system namespace.
– Regulates client’s access to files.
– It also executes file system operations such as renaming, closing, and opening files and
directories.
4.Define Secondary namenode ?
HDFS is based on a master/slave architecture with a single NameNode.
• This simplifies the overall HDFS architecture.
• However, it also creates a single point of failure: losing the NameNode effectively means losing HDFS.
• To somewhat alleviate this problem, Hadoop implements a Secondary NameNode, which
periodically copies and merges the namespace image and the edit log.
5.Define data node?
The datanode is commodity hardware having an operating system and the datanode software.
• For every node in a cluster, there will be a datanode.
• These nodes manage the data storage of their system.
• Datanodes perform read-write operations on the file systems, as per client request.
• They also perform operations such as block creation, deletion, and replication according to
the instructions of the namenode.
6.Explain blocks and its advantages?
A file will be divided into one or more segments and/or stored in individual data nodes. These
file segments are called as blocks.
In other words, the minimum amount of data that HDFS can read or write is called a Block.
• The default block size is 64 MB, but it can be changed as needed in the HDFS
configuration.
ADVANTAGES
First, a file can be larger than any single disk in the network
• Secondly, making the unit of abstraction a block rather than a file simplifies the storage
subsystem (it is easy to calculate how many can be stored on a given disk)
• Furthermore, blocks fit well with replication for providing fault tolerance and availability. If
a block becomes unavailable, a copy can be read from another location
7.What are the two types of block storage services?
• Block Storage Service, which has two parts:
– Block Management (performed in the Namenode)
• Provides Datanode cluster membership by handling registrations
• Processes block reports and maintains location of blocks.
• Supports block related operations create, delete and get block location
• Manages replica placement, block replication
– Storage - is provided by Datanodes by storing blocks on the local file system and allowing
read/write access.
8. Define block pool and its working?
• A Block Pool is a set of blocks that belong to a single namespace. Datanodes store blocks
for all the block pools in the cluster. Each Block Pool is managed independently.
• This allows a namespace to generate Block IDs for new blocks without the need for
coordination with the other namespaces.
• A Namenode failure does not prevent the Datanode from serving other Namenodes in the
cluster.
• A Namespace and its block pool together are called a Namespace Volume. It is a
self-contained unit of management.
• When a Namenode/namespace is deleted, the corresponding block pool at the Datanodes
is deleted. Each namespace volume is upgraded as a unit, during cluster upgrade.
Task tracker
- Daemons that run on the data nodes; the task trackers take care of individual tasks on the slave
nodes as entrusted to them by the job tracker.
12 MARK QUESTIONS
1. Explain HDFS architecture, components and its functionality?
2. Describe the concepts involved in HDFS?
3. Explain HDFS Federation?
4. Describe the basic file system operations involved in HDFS?
5. Explain the data flow anatomy of file read?
6. Explain the data flow anatomy of file write?
UNIT V
(Prepared by Dr.A.Senthilselvi)
PROCESSING YOUR DATA WITH MAPREDUCE
Getting to know MapReduce, MapReduce Execution Pipeline, Runtime Coordination and Task
Management, MapReduce Application, Hadoop Word Count Implementation, Installing and
Running Pig, Hbase Versus RDBMS, Installing and Running ZooKeeper
MULTIPLE CHOICE QUESTIONS
MCQ on Introduction to Mapreduce
1. A ________ node acts as the Slave and is responsible for executing a Task assigned to it by
the JobTracker.
a) MapReduce
b) Mapper
c) TaskTracker
d) JobTracker
Answer: c
Explanation: TaskTracker receives the information necessary for execution of a Task from
JobTracker, Executes the Task, and Sends the Results back to JobTracker.
2. Point out the correct statement :
a) MapReduce tries to place the data and the compute as close as possible
b) Map Task in MapReduce is performed using the Mapper() function
c) Reduce Task in MapReduce is performed using the Map() function
d) All of the mentioned
Answer: a
Explanation: This feature of MapReduce is “Data Locality”.
3. ___________ part of the MapReduce is responsible for processing one or more chunks of data
and producing the output results.
a) Maptask
b) Mapper
c) Task execution
d) All of the mentioned
Answer: a
Explanation: Map Task in MapReduce is performed using the Map() function.
4. _________ function is responsible for consolidating the results produced by each of the Map()
functions/tasks.
a) Reduce
b) Map
c) Reducer
d) All of the mentioned
Answer: a
Explanation: Reduce function collates the work and resolves the results.
5. Point out the wrong statement :
a) A MapReduce job usually splits the input data-set into independent chunks which are
processed by the map tasks in a completely parallel manner
b) The MapReduce framework operates exclusively on pairs
c) Applications typically implement the Mapper and Reducer interfaces to provide the map and
reduce methods
d) None of the mentioned
Answer: d
Explanation: The MapReduce framework takes care of scheduling tasks, monitoring them and
re-executes the failed tasks.
6. Although the Hadoop framework is implemented in Java , MapReduce applications need not
be written in :
a) Java
b) C
c) C#
d) None of the mentioned
Answer: a
Explanation: Hadoop Pipes is a SWIG- compatible C++ API to implement MapReduce
applications (non JNITM based).
7. ________ is a utility which allows users to create and run jobs with any executables as the
mapper and/or the reducer.
a) Hadoop Strdata
b) Hadoop Streaming
c) Hadoop Stream
d) None of the mentioned
Answer: b
Explanation: Hadoop streaming is one of the most important utilities in the Apache Hadoop
distribution.
8. __________ maps input key/value pairs to a set of intermediate key/value pairs.
a) Mapper
b) Reducer
c) Both Mapper and Reducer
d) None of the mentioned
Answer: a
Explanation: Maps are the individual tasks that transform input records into intermediate records.
9. The number of maps is usually driven by the total size of :
a) inputs
b) outputs
c) tasks
d) None of the mentioned
Answer: a
Explanation: Total size of inputs means total number of blocks of the input files.
10. _________ is the default Partitioner for partitioning key space.
a) HashPar
b) Partitioner
c) HashPartitioner
d) None of the mentioned
Answer: c
Explanation: The default partitioner in Hadoop is the HashPartitioner which has a method called
getPartition to partition.
Hadoop – MapReduce Features-1
1. Which of the following is the default Partitioner for Mapreduce ?
a) MergePartitioner
b) HashedPartitioner
c) HashPartitioner
d) None of the mentioned
Answer: c
Explanation: The total number of partitions is the same as the number of reduce tasks for the job.
2. Point out the correct statement :
a) The right number of reduces seems to be 0.95 or 1.75
b) Increasing the number of reduces increases the framework overhead
c) With 0.95 all of the reduces can launch immediately and start transferring map outputs as the
maps finish
d) All of the mentioned
Answer: c
Explanation: With 1.75 the faster nodes will finish their first round of reduces and launch a
second wave of reduces doing a much better job of load balancing.
3. Which of the following partitions the key space ?
a) Partitioner
b) Compactor
c) Collector
d) All of the mentioned
Answer: a
Explanation: Partitioner controls the partitioning of the keys of the intermediate map-outputs.
4. ____________ is a generalization of the facility provided by the MapReduce framework to
collect data output by the Mapper or the Reducer
a) OutputCompactor
b) OutputCollector
c) InputCollector
d) All of the mentioned
Answer: b
Explanation: Hadoop MapReduce comes bundled with a library of generally useful mappers,
reducers, and partitioners.
5. Point out the wrong statement :
a) It is legal to set the number of reduce-tasks to zero if no reduction is desired
b) The outputs of the map-tasks go directly to the FileSystem
c) The Mapreduce framework does not sort the map-outputs before writing them out to the
FileSystem
d) None of the mentioned
Answer: d
Explanation: Outputs of the map-tasks go directly to the FileSystem, into the output path set by
setOutputPath(Path).
6. __________ is the primary interface for a user to describe a MapReduce job to the Hadoop
framework for execution.
a) JobConfig
b) JobConf
c) JobConfiguration
d) All of the mentioned
Answer: b
Explanation: JobConf is typically used to specify the Mapper, combiner (if any), Partitioner,
Reducer, InputFormat, OutputFormat and OutputCommitter implementations.
7. The ___________ executes the Mapper/ Reducer task as a child process in a separate jvm.
a) JobTracker
b) TaskTracker
c) TaskScheduler
d) None of the mentioned
Answer: b
Explanation: The child-task inherits the environment of the parent TaskTracker.
8. Maximum virtual memory of the launched child-task is specified using :
a) mapv
b) mapred
c) mapvim
d) All of the mentioned
Answer: b
Explanation: Admins can also specify the maximum virtual memory of the launched child-task,
and any sub-process it launches recursively, using mapred.
9. Which of the following parameter is the threshold for the accounting and serialization buffers
?
a) io.sort.spill.percent
b) io.sort.record.percent
c) io.sort.mb
d) None of the mentioned
Answer: a
Explanation: When percentage of either buffer has filled, their contents will be spilled to disk in
the background.
10. ______________ is percentage of memory relative to the maximum heap size in which map
outputs may be retained during the reduce.
a) mapred.job.shuffle.merge.percent
b) mapred.job.reduce.input.buffer.percen
c) mapred.inmem.merge.threshold
d) io.sort.factor
Answer: b
Explanation: When the reduce begins, map outputs will be merged to disk until those that remain
are under the resource limit this defines.
Hadoop– Metrics in Hbase
1. _______ can change the maximum number of cells of a column family.
a) set
b) reset
c) alter
d) select
Answer: c
Explanation: Alter is the command used to make changes to an existing table.
2. Point out the correct statement :
a) You can add a column family to a table using the method addColumn()
b) Using alter, you can also create a column family
c) Using disable-all, you can truncate a column family
d) None of the mentioned
Answer: a
Explanation: Columns can also be added through HbaseAdmin.
3. Which of the following is not a table scope operator ?
a) MEMSTORE_FLUSH
b) MEMSTORE_FLUSHSIZE
c) MAX_FILESIZE
d) All of the mentioned
Answer: a
Explanation: Using alter, you can set and remove table scope operators such as
MAX_FILESIZE, READONLY, MEMSTORE_FLUSHSIZE, DEFERRED_LOG_FLUSH, etc.
4. You can delete a column family from a table using the method _________ of HBAseAdmin
class.
a) delColumn()
b) removeColumn()
c) deleteColumn()
d) all of the mentioned
Answer: c
Explanation: Alter command also can be used to delete a column family.
5. Point out the wrong statement :
a) To read data from an HBase table, use the get() method of the HTable class
b) You can retrieve data from the HBase table using the get() method of the HTable class
c) While retrieving data, you can get a single row by id, or get a set of rows by a set of row ids,
or scan an entire table or a subset of rows
d) None of the mentioned
Answer: d
Explanation: You can retrieve an HBase table data using the add method variants in Get class.
6. __________ class adds HBase configuration files to its object.
a) Configuration
b) Collector
c) Component
d) None of the mentioned
Answer: a
Explanation: You can create a configuration object using the create() method of the
HbaseConfiguration class.
7. The ________ class provides the getValue() method to read the values from its instance.
a) Get
b) Result
c) Put
d) Value
Answer: b
Explanation: Get the result by passing your Get class instance to the get method of the HTable
class. This method returns the Result class object, which holds the requested result.
8. ________ communicate with the client and handle data-related operations.
a) Master Server
b) Region Server
c) Htable
d) All of the mentioned
Answer: b
Explanation: Region Server handle read and write requests for all the regions under it.
9. _________ is the main configuration file of HBase.
a) hbase.xml
b) hbase-site.xml
c) hbase-site-conf.xml
d) none of the mentioned
Answer: b
Explanation: Set the data directory to an appropriate location by opening the HBase home folder
in /usr/local/HBase.
10. HBase uses the _______ File System to store its data.
a) Hive
b) Impala
c) Hadoop
d) Scala
Answer: c
Explanation: The data storage will be in the form of regions (tables). These regions will be split
up and stored in region servers.
Hadoop– Introduction to Pig
1. Pig operates in mainly how many modes?
a) Two
b) Three
c) Four
d) Five
Answer: a
Explanation: You can run Pig (execute Pig Latin statements and Pig commands) using various
mode: Interactive and Batch Mode.
2. Point out the correct statement :
a) You can run Pig in either mode using the “pig” command
b) You can run Pig in batch mode using the Grunt shell
c) You can run Pig in interactive mode using the FS shell
d) None of the mentioned
Answer: a
Explanation: You can run Pig in either mode using the “pig” command (the bin/pig Perl script)
or the “java” command (java -cp pig.jar …).
3. You can run Pig in batch mode using __________ .
a) Pig shell command
b) Pig scripts
c) Pig options
d) All of the mentioned
Answer: b
Explanation: Pig script contains Pig Latin statements.
4. Pig Latin statements are generally organized in one of the following ways :
a) A LOAD statement to read data from the file system
b) A series of “transformation” statements to process the data
c) A DUMP statement to view results or a STORE statement to save the results
d) All of the mentioned
Answer: d
Explanation: A DUMP or STORE statement is required to generate output.
5. Point out the wrong statement :
a) To run Pig in local mode, you need access to a single machine
b) The DISPLAY operator will display the results to your terminal screen
c) To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation
d) All of the mentioned
Answer: b
Explanation: The DUMP operator will display the results to your terminal screen.
6. Which of the following function is used to read data in PIG ?
a) WRITE
b) READ
c) LOAD
d) None of the mentioned
Answer: c
Explanation: PigStorage is the default load function.
7. You can run Pig in interactive mode using the ______ shell.
a) Grunt
b) FS
c) HDFS
d) None of the mentioned
Answer: a
Explanation: Invoke the Grunt shell using the “pig” command (as shown below) and then enter
your Pig Latin statements and Pig commands interactively at the command line.
8. Which of the following is the default mode ?
a) Mapreduce
b) Tez
c) Local
d) All of the mentioned
Answer: a
Explanation: To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS
installation.
9. Which of the following will run pig in local mode ?
a) $ pig -x local …
b) $ pig -x tez_local …
c) $ pig …
d) None of the mentioned
Answer: a
Explanation: Specify local mode using the -x flag (pig -x local).
10.$ pig -x tez_local … will enable ________ mode in Pig.
a) Mapreduce
b) Tez
c) Local
d) None of the mentioned
Answer: d
Explanation: Tez Local Mode is similar to local mode, except internally Pig will invoke tez
runtime engine.
Hadoop– Introduction to Zookeeper
1. ZooKeeper itself is intended to be replicated over a sets of hosts called :
a) chunks
b) ensemble
c) subdomains
d) none of the mentioned
Answer: b
Explanation: As long as a majority of the servers are available, the ZooKeeper service will be
available.
2. Point out the correct statement :
a) ZooKeeper can achieve high throughput and high latency numbers.
b) The fault tolerant ordering means that sophisticated synchronization primitives can be
implemented at the client
c) The ZooKeeper implementation puts a premium on high performance, highly available,
strictly ordered access
d) All of the mentioned
Answer: c
Explanation: The performance aspects of ZooKeeper means it can be used in large, distributed
systems.
3. Which of the guarantee is provided by Zookeeper ?
a) Interactivity
b) Flexibility
c) Scalability
d) Reliability
Answer: d
Explanation: Once an update has been applied, it will persist from that time forward until a client
overwrites the update.
4. ZooKeeper is especially fast in ___________ workloads
a) write
b) read-dominant
c) read-write
d) none of the mentioned
Answer: b
Explanation: ZooKeeper applications run on thousands of machines, and it performs best where
reads are more common than writes, at ratios of around 10:1.
5. Point out the wrong statement :
a) Distributed applications use SQL to store important configuration information
b) The service maintains a record of all transactions, which can be used for higher-level
abstractions, like synchronization primitives
c) ZooKeeper maintain a standard hierarchical name space, similar to files and directories
d) ZooKeeper provides superior reliability through redundant services
Answer: d
Explanation: Distributed applications use Zookeeper to store and mediate updates to important
configuration information.
6. When a _______ is triggered the client receives a packet saying that the znode has changed.
a) event
b) watch
c) row
d) value
Answer: b
Explanation: ZooKeeper supports the concept of watches. Clients can set a watch on a znodes.
7. The underlying client-server protocol has changed in version _______ of ZooKeeper.
a) 2.0.0
b) 3.0.0
c) 4.0.0
d) 6.0.0
Answer: b
Explanation: Old pre-3.0.0 clients are not guaranteed to operate against upgraded 3.0.0 servers
and vice-versa.
8. The java package structure has changed from com.yahoo.zookeeper* to :
a) apache.zookeeper
b) org.apache.zookeeper
c) org.apache.zookeeper.package
d) all of the mentioned
Answer: b
Explanation: A number of constants used in the client ZooKeeper API were re-specified using
enums (rather than ints).
9. A number of constants used in the client ZooKeeper API were renamed in order to reduce
________ collision
a) value
b) namespace
c) counter
d) none of the mentioned
Answer: b
Explanation: ZOOKEEPER-18 removed KeeperStateChanged, use KeeperStateDisconnected
instead.
10. ZooKeeper allows distributed processes to coordinate with each other through registers,
known as :
a) znodes
b) hnodes
c) vnodes
d) rnodes
Answer: a
Explanation: Every znode is identified by a path, with path elements separated by a slash.
4 MARK QUESTIONS
1. Describe the Map and Reduce phases of MapReduce.
MapReduce consists of two distinct tasks – Map and Reduce. As the
name MapReduce suggests, the reducer phase takes place after the mapper phase has been
completed. So, the first is the map job, where a block of data is read and processed to
produce key-value pairs as intermediate outputs.
2. List various stages of MapReduce
A MapReduce Program consists of three different phases. They are:
• Mapper
• Sort and Shuffle
• Reducer
3. List out the features of HBase
Scalability: HBase supports scalability in both linear and modular form
Sharding: HBase supports automatic sharding of tables. It is also configurable.
Distributed storage: HBase supports distributed storage like HDFS
Consistency: It supports consistent read and write operations
Failover support: HBase supports automatic failover
API support: HBase supports Java APIs so clients can access it easily
MapReduce support: HBase supports MapReduce for parallel processing of large
volume of data
Back up support: HBase supports back up of Hadoop MapReduce jobs in HBase tables
Real time processing: It supports block cache and Bloom filters. So, real time query
processing is easy
4. What is Zookeeper? Write functions of it.
ZooKeeper is a centralized service for maintaining configuration information,
naming, providing distributed synchronization, and providing group services. All of
these kinds of services are used in some form or another by distributed applications.
Each time they are implemented there is a lot of work that goes into fixing the bugs and
race conditions that are inevitable. Because of the difficulty of implementing these kinds
of services, applications initially usually skimp on them, which make them brittle in the
presence of change and difficult to manage. Even when done correctly, different
implementations of these services lead to management complexity when the applications
are deployed.
5. Difference between RDBMS and HBase:
RDBMS | HBASE
It requires SQL (structured query language). | No SQL.
It has a fixed schema. | No fixed schema.
It is row oriented. | It is column oriented.
It is not scalable. | It is scalable.
It is static in nature. | Dynamic in nature.
Slower retrieval of data. | Faster retrieval of data.
It follows the ACID (Atomicity, Consistency, Isolation and Durability) property. | It follows the CAP (Consistency, Availability, Partition-tolerance) theorem.
It can handle structured data. | It can handle structured, unstructured as well as semi-structured data.
It cannot handle sparse data. | It can handle sparse data.
6. What is Shuffling and Sorting in Hadoop MapReduce?
Shuffle phase in Hadoop transfers the map output from Mapper to a Reducer in
MapReduce. Sort phase in MapReduce covers the merging and sorting of map outputs.
Data from the mapper are grouped by the key, split among reducers and sorted by the
key. Every reducer obtains all values associated with the same key. Shuffle and sort
phase in Hadoop occur simultaneously and are done by the MapReduce framework.
7. What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file
system. It is an open-source project and is horizontally scalable. HBase is a data model
that is similar to Google’s big table designed to provide quick random access to huge
amounts of structured data. It leverages the fault tolerance provided by the Hadoop File
System (HDFS). It is a part of the Hadoop ecosystem that provides random real-time
read/write access to data in the Hadoop File System. One can store the data in HDFS
either directly or through HBase. Data consumer reads/accesses the data in HDFS
randomly using HBase. HBase sits on top of the Hadoop File System and provides read
and write access.
8. Write down the differences between Apache Pig and MapReduce.
MapReduce | Apache Pig
A MapReduce program expects programming language skills for writing the business logic. | In Apache Pig there is no need of much programming skill; the entire program is based on Pig transformations.
Amount of code is very large; we must write huge programming code. | Amount of code is very less when compared to a MapReduce program: 200 lines of MapReduce code are equivalent to about 10 lines of Pig script.
A MapReduce program is compiled and executed directly. | A Pig script is internally converted into a MapReduce program and gets executed.
Writing and executing MapReduce programs is a bit complex task. | Writing and executing a Pig script is a simple task when compared with MapReduce.
9. What is Apache Pig and why do we need it?
Apache Pig is a high-level platform for creating programs that run on Apache
Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop
jobs in MapReduce, Apache Tez, or Apache Spark. Ultimately Apache Pig reduces the
development time by almost 16 times. Pig Latin is SQL-like language and it is easy to
learn Apache Pig when you are familiar with SQL. Apache Pig provides many built-in
operators to support data operations like joins, filters, ordering, etc.
10. List the Benefits of ZooKeeper
a. The Simple Distributed Coordination Process
b. Synchronization
c. Ordered Messages
d. Serialization
e. Speed
f. Scalability
g. Order Beneficial
h. ZooKeeper is Fast
i. Reliability
j. Atomicity
k. Timeliness
11. List the Limitations of Zookeeper
a. Adding New ZooKeeper Servers Can Lead to Data Loss
b. No Migration
c. Node Count
d. Rack-Aware Replication
e. Scaling In
f. Disk Changes
g. Virtual Networks
h. Kerberos
i. Limited Support
j. Complex
12. Explain Avro data serialization technique in MapReduce.
To transfer data over a network or for its persistent storage, you need to serialize
the data. Prior to the serialization APIs provided by Java and Hadoop, we have a special
utility, called Avro, a schema-based serialization technique.
Data is serialized for two objectives −
• For persistent storage
• To transport the data over network
Serialization is the process of translating data structures or object state into binary or
textual form to transport the data over a network or to store on some persistent storage.
Once the data is transported over the network or retrieved from the persistent storage, it
needs to be deserialized again. Serialization is termed as marshalling and deserialization
is termed as unmarshalling.
12 MARK QUESTIONS
1. Explain the various stages of MapReduce
2. Explain storage mechanism of HBase with an example.
3. Explain how RDBMS differs from HBase?
4. Explain in detail about the working of reduce phase of MapReduce with an example.
5. Briefly explain about Shuffling and Sorting in Hadoop MapReduce
6. Explain the Benefits and limitations of ZooKeeper
7. Explain Installing and Running of Pig
8. Explain Installing and Running of zookeeper
9. Explain the application of MapReduce with example