distributed-dbms-2170714-lab-manual
CE Department Vision:
To produce technically sound and ethically responsible Computer Engineers for society by providing quality
education.
CE Department Mission:
1) To provide a healthy learning environment based on current and future industrial demands.
2) To promote curricular, co-curricular and extra-curricular activities for the overall personality development of the
students.
3) To groom technically powerful and ethically dominant engineers having real-life problem-solving capabilities.
4) To provide a platform for effective teaching and learning.
IT Department Vision:
To provide quality education and assistance to students through innovative teaching and learning methodology for
shaping young minds that are technically sound and ethically strong.
IT Department Mission:
1) To serve society by producing technically and ethically sound engineers.
2) To generate groomed and efficient problem solvers as per industrial needs by adopting innovative teaching and
learning methods.
3) To emphasize the overall development of students through various curricular, co-curricular and extra-curricular
activities.
INDEX
Sr.No. Experiment
1. A) Introduction of database management systems, Oracle concepts, and creating a table.
   B) How to insert data into a table using INSERT and display the records in a table.
2. A) Update or delete records of a table and modify the structure of a table using the ALTER and DROP commands.
   B) Study of character functions for manipulation of data items.
3. To perform join operations between various tables.
THEORY:
❖ Introduction of Oracle:
The relational model, proposed by Dr. E. F. Codd of IBM in June 1970, came to be accepted as
the definitive model for RDBMS. The language developed by IBM to manipulate
the data stored within Codd's model was originally called
Structured English Query Language (SEQUEL), with the word English later
dropped in favor of Structured Query Language (SQL).
In 1979 a company called Relational Software, Inc. released the first commercially
available implementation of SQL. Relational Software later came to be known as
Oracle Corporation. Oracle Corporation produces one of the most
widely used server-based, multi-user RDBMSs, named Oracle.
❖ Oracle Tools:
The Oracle product is primarily divided into:
Oracle Server tools: The Oracle Server product is called either Oracle Workgroup
Server or Oracle Enterprise Server. Either one is used for data storage.
Oracle Client tools: The client tool most commonly used for commercial
application development is called Oracle Developer 2000. Oracle Developer
2000 is Oracle's toolbox, consisting of Oracle Forms, Oracle Reports and
Oracle Graphics. This suite of tools is used to capture, validate and display
data according to user and system needs.
❖ Components of SQL:
1) DDL (Data Definition Language):
The commands used to set up, change and remove data structures such as
tables, views and indexes. Examples are CREATE, ALTER and DROP.
2) DML (Data Manipulation Language):
The commands used to enter new rows, change existing rows and remove
unwanted rows from the tables in the database. Examples are INSERT,
UPDATE and DELETE.
3) DCL (Data Control Language):
The commands used to give or remove access rights to both the Oracle
database and the structures within it. Examples are GRANT and REVOKE.
4) DQL (Data Query Language):
The component of SQL that retrieves data from the database and imposes
an ordering upon it. It includes the SELECT statement, which gets data out
of the database so that operations can be performed on it.
EXERCISE:
1) Create a table “emp” with the following fields:
EMPNO ENAME JOB HIREDATE SAL COMM DEPTNO MGR
EVALUATION:
Understanding / Problem solving (4), Involvement (3), Timely Completion (3), Total (10)
DATE: / /
THEORY:
❖ Inserting Data into Tables using INSERT INTO command:
Once a table is created, the most natural thing to do is to load it with data to be
manipulated later.
When inserting a single row of data into the table, the insert operation:
Creates a new (empty) row in the database table.
Loads the values passed (by the SQL insert) into the columns specified.
Note: Character values (expressions) placed within the INSERT INTO statement
must be enclosed in single quotes (').
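The insert operation described above can be tried without an Oracle installation. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Oracle (an assumption of this example). The table emp_demo and its values are hypothetical. The INSERT INTO statement itself is standard SQL.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for Oracle (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_demo (empno INTEGER, ename TEXT, job TEXT)")

# Character values are enclosed in single quotes; numeric values are not.
conn.execute("INSERT INTO emp_demo (empno, ename, job) VALUES (101, 'Asha', 'CLERK')")
conn.execute("INSERT INTO emp_demo (empno, ename, job) VALUES (102, 'Ravi', 'MANAGER')")

# Read the rows back to confirm that both inserts took effect.
rows = conn.execute("SELECT * FROM emp_demo ORDER BY empno").fetchall()
print(rows)
```

Each INSERT creates one new row and loads the listed values into the listed columns, in the same order.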
All Rows and All Columns: When data from all rows and columns of the table
are to be viewed, the SELECT statement is used. The syntax is:
SELECT * FROM <TableName>;
Oracle allows the use of the meta character asterisk (*), which Oracle expands to
mean all columns of the table. Because no WHERE clause is given, all rows are returned.
The DISTINCT clause removes duplicates from the result set. It can only be used
with SELECT statements.
The SELECT DISTINCT * syntax scans entire rows and eliminates rows that have
exactly the same contents in every column.
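To see the effect of DISTINCT, the sketch below again uses Python's sqlite3 module in place of Oracle (an assumption). The table emp_demo and its rows are hypothetical. The `?` placeholders belong to sqlite3's parameter binding and are not part of the SELECT syntax being illustrated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp_demo (ename TEXT, job TEXT)")
conn.executemany(
    "INSERT INTO emp_demo VALUES (?, ?)",
    [("Asha", "CLERK"), ("Ravi", "CLERK"), ("Asha", "CLERK"), ("Meena", "MANAGER")],
)

# Without DISTINCT every row contributes a value, duplicates included.
all_jobs = [r[0] for r in conn.execute("SELECT job FROM emp_demo ORDER BY job")]

# DISTINCT on one column keeps each job once.
distinct_jobs = [r[0] for r in conn.execute("SELECT DISTINCT job FROM emp_demo ORDER BY job")]

# DISTINCT * compares whole rows, so only the repeated ('Asha', 'CLERK') row collapses.
distinct_rows = conn.execute("SELECT DISTINCT * FROM emp_demo ORDER BY ename").fetchall()

print(all_jobs, distinct_jobs, distinct_rows)
```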
SELECT [DISTINCT] {* | column [alias], ...}
FROM table
WHERE condition(s)
GROUP BY column(s)
HAVING group condition(s)
ORDER BY {column | expr} [ASC | DESC];
EXERCISES:
EVALUATION:
Understanding / Problem solving (4), Involvement (3), Timely Completion (3), Total (10)
THEORY:
Example: Update the address details by changing the city name to Ahmedabad.
UPDATE ADDR_DTLS SET City = 'Ahmedabad';
Example: Update the branch details by changing AMP (HO) to Head Office.
UPDATE BRANCH_MSTR SET NAME = 'Head Office'
WHERE NAME = 'AMP (HO)';
❖ Delete Operations:
The DELETE command deletes the rows of a table that satisfy the condition
provided by its WHERE clause, and returns the number of records deleted.
Example: Remove only the savings bank account details from the ACCT_DTLS
table.
DELETE FROM ACCT_DTLS WHERE ACCT_NO LIKE 'SB%';
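The update and delete operations above both report how many rows they affected. The sketch below shows this with Python's sqlite3 module standing in for Oracle (an assumption). The ACCT_DTLS table name comes from the example above, but the BALANCE column and the data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ACCT_DTLS (ACCT_NO TEXT, BALANCE INTEGER)")
conn.executemany(
    "INSERT INTO ACCT_DTLS VALUES (?, ?)",
    [("SB1", 500), ("SB2", 900), ("CA1", 1200)],
)

# UPDATE changes only the rows matched by the WHERE clause.
updated = conn.execute(
    "UPDATE ACCT_DTLS SET BALANCE = BALANCE + 100 WHERE ACCT_NO = 'CA1'"
).rowcount

# DELETE removes the matched rows; rowcount is the number of rows deleted.
deleted = conn.execute("DELETE FROM ACCT_DTLS WHERE ACCT_NO LIKE 'SB%'").rowcount

remaining = conn.execute("SELECT * FROM ACCT_DTLS").fetchall()
print(updated, deleted, remaining)
```

Omitting the WHERE clause in either statement would apply it to every row of the table, as in the first UPDATE example above.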
When rows are copied from a source table into a target table using INSERT INTO
with a SELECT, the WHERE clause is optional. If you do not specify the WHERE
clause, all the data from the source table is copied to the target table.
Example: Insert only the savings bank account details into the target table
ACCT_DTLS from the source table ACCT_MSTR.
❖ Destroying Tables:
Sometimes tables within a particular database become obsolete and need to be
discarded. In such a situation, the DROP TABLE statement with the table name
destroys a specific table. If a table is dropped, all records held within it are lost
and cannot be recovered.
Syntax: DROP TABLE <TableName>;
Example: Remove the table BRANCH_MSTR along with the data held.
DROP TABLE BRANCH_MSTR;
EXERCISES:
1) Add a column “SPOUSE” to the emp table that will hold the name of an
employee’s spouse.
2) Change the job of employees whose job is “trainee” to “programmer”.
3) Delete the records whose location is “Baroda” from the dept table.
4) Drop the table “stud_master”.
5) Create a table “ManagerHist” from emp for employees whose job is “Manager”.
6) Copy all the information of department 20 into the “ManagerHist” table.
EVALUATION:
Understanding / Problem solving (4), Involvement (3), Timely Completion (3), Total (10)
DATE: / /
B) TITLE: Study of character functions for manipulation of data items.
THEORY:
❖ Character functions:
Character functions are described as follows:
EXERCISE:
EVALUATION:
Understanding / Problem solving (4), Involvement (3), Timely Completion (3), Total (10)
❖ Join: A join is used when a SQL query requires data from more than one table in the
database.
There are two main types of join conditions:
• Equi-join
• Non-equi join
❖ Equi-join: The relationship between two tables is an equi-join when a column in one table
corresponds to the same column in the other table, e.g. deptno in the EMP table as well as in
the DEPT table. Here the relationship is obtained using the “=” operator.
❖ Non Equi-join: The relationship between two tables is a non-equi join when no
column in one table corresponds directly to a column in the other table. Here the
relationship is obtained using an operator other than “=”.
❖ Self Joins:
A self join is a join of a table to itself. This table appears twice in the FROM clause
and is followed by table aliases that qualify column names in the join condition.
To perform a self join, Oracle combines and returns rows of the table that satisfy
the join condition.
❖ Inner Joins:
An inner join (sometimes called a "simple join") is a join of two or more tables that
returns only those rows that satisfy the join condition.
❖ Cross Joins:
If two tables in a join query have no join condition, Oracle returns their Cartesian
product. Oracle combines each row of one table with each row of the other. A
Cartesian product always generates many rows and is rarely useful. For example,
the Cartesian product of two tables, each with 100 rows, has 10,000 rows. Always
include a join condition unless you specifically need a Cartesian product.
❖ Outer Joins:
An outer join extends the result of a simple join. An outer join returns all rows
that satisfy the join condition and also returns some or all of those rows from one
table for which no rows from the other satisfy the join condition.
• To write a query that performs an outer join of tables A and B and returns all
rows from A (a left outer join), use the LEFT [OUTER] JOIN syntax in the
FROM clause, or apply the outer join operator (+) to all columns of B in the
join condition in the WHERE clause. For all rows in A that have no matching
rows in B, Oracle returns null for any select list expressions containing
columns of B.
• To write a query that performs an outer join of tables A and B and returns all
rows from B (a right outer join), use the RIGHT [OUTER] JOIN syntax in the
FROM clause, or apply the outer join operator (+) to all columns of A in the
join condition in the WHERE clause. For all rows in B that have no matching
rows in A, Oracle returns null for any select list expressions containing
columns of A.
• To write a query that performs an outer join and returns all rows from A and
B, extended with nulls if they do not satisfy the join condition (a full outer
join), use the FULL [OUTER] JOIN syntax in the FROM clause.
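The join types described above can be compared side by side on a small data set. The sketch below uses Python's sqlite3 module in place of Oracle (an assumption), with hypothetical emp and dept rows. It uses the ANSI JOIN syntax only. The Oracle-specific (+) operator is not available in SQLite and is therefore not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dept (deptno INTEGER, dname TEXT);
CREATE TABLE emp  (empno INTEGER, ename TEXT, mgr INTEGER, deptno INTEGER);
INSERT INTO dept VALUES (10, 'ACCOUNTING'), (20, 'RESEARCH'), (30, 'SALES');
INSERT INTO emp  VALUES (1, 'KING', NULL, 10), (2, 'JONES', 1, 20), (3, 'SCOTT', 2, 20);
""")

# Equi-join (inner join): only rows whose deptno matches in both tables.
inner = conn.execute("""
    SELECT e.ename, d.dname
    FROM emp e JOIN dept d ON e.deptno = d.deptno
    ORDER BY e.empno""").fetchall()

# Left outer join: every dept row is kept; SALES has no employee, so ename is NULL.
left_outer = conn.execute("""
    SELECT d.dname, e.ename
    FROM dept d LEFT OUTER JOIN emp e ON e.deptno = d.deptno
    ORDER BY d.deptno, e.empno""").fetchall()

# Self join: emp appears twice under two aliases (worker w, manager m).
self_join = conn.execute("""
    SELECT w.ename, m.ename
    FROM emp w JOIN emp m ON w.mgr = m.empno
    ORDER BY w.empno""").fetchall()

# Cross join: no join condition, so 3 emp rows x 3 dept rows = 9 rows.
cross_count = conn.execute("SELECT COUNT(*) FROM emp CROSS JOIN dept").fetchone()[0]

print(inner, left_outer, self_join, cross_count)
```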
EXERCISE:
1) Define: Join. Explain self join.
2) Retrieve employee number, employee name and their department name, in department
name order.
3) Show the details of all employees who live in Baroda.
4) Display the name, salary and department number of employees whose salary is more
than 10000.
5) List the employee name, job, salary and department name for everyone in the company
except clerks. Sort on salary, displaying the highest salary first.
6) List all employees by name and number along with their manager’s name and number.
7) Display all the employees who earn less than their managers.
EVALUATION:
Understanding / Problem solving (4), Involvement (3), Timely Completion (3), Total (10)
THEORY:
2. Column constraints
These reference a single column and are defined within the specification for the
owning column.
❖ Constraint types:
You may define the following constraint types:
1. Primary key
2. Foreign key
3. Unique
4. Null /Not null
5. Check
Primary key constraint: A primary key is one or more column(s) in a table used to
uniquely identify each row in the table. None of the fields that are part of the primary key
can contain a null value. A table can have only one primary key.
➢ Foreign keys represent relationships between tables. A foreign key is a column (or a group of
columns) whose values are derived from the primary key or unique key of some other table.
➢ The table in which the foreign key is defined is called a foreign table or detail table.
➢ The table that defines the primary or unique key and is referenced by the foreign key is
called the master table.
➢ The master table can be referenced in the foreign key definition by using the
REFERENCES keyword. If the name of the column is not specified, Oracle by default
references the primary key of the master table.
Unique constraint: The UNIQUE column constraint permits multiple entries of NULL
in a column, whereas a primary key column cannot contain NULL at all. This is the
essential difference between the Primary Key and Unique constraints when applied to
table column(s).
The CHECK constraint: Business rule validations can be applied to a table column by
using a CHECK constraint. It must be specified as a logical expression that evaluates to
either TRUE or FALSE.
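The constraint types listed above can each be seen rejecting a bad row. The sketch below is illustrative only: it uses Python's sqlite3 module rather than Oracle (an assumption), and the tables dept_demo and emp_demo are hypothetical. SQLite enforces foreign keys only after `PRAGMA foreign_keys = ON`, which is a SQLite detail, not Oracle behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
conn.executescript("""
CREATE TABLE dept_demo (
    deptno INTEGER PRIMARY KEY,
    dname  TEXT NOT NULL UNIQUE
);
CREATE TABLE emp_demo (
    empno  INTEGER PRIMARY KEY,
    ename  TEXT NOT NULL,
    sal    INTEGER CHECK (sal > 0),
    deptno INTEGER REFERENCES dept_demo (deptno)
);
INSERT INTO dept_demo VALUES (10, 'ACCOUNTING');
INSERT INTO emp_demo VALUES (1, 'KING', 5000, 10);
""")

def rejected(sql):
    """Return True if the statement violates a constraint."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False

violations = {
    "primary key": rejected("INSERT INTO emp_demo VALUES (1, 'JONES', 3000, 10)"),
    "foreign key": rejected("INSERT INTO emp_demo VALUES (2, 'JONES', 3000, 99)"),
    "check":       rejected("INSERT INTO emp_demo VALUES (3, 'SCOTT', -5, 10)"),
    "not null":    rejected("INSERT INTO emp_demo VALUES (4, NULL, 3000, 10)"),
    "unique":      rejected("INSERT INTO dept_demo VALUES (20, 'ACCOUNTING')"),
}
emp_count = conn.execute("SELECT COUNT(*) FROM emp_demo").fetchone()[0]
print(violations, emp_count)
```

Every attempted insert breaks exactly one rule, so none of them is stored and emp_demo still holds only the original row.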
EXERCISE:
1) Create a table client_master with the following fields:
clientno, name, address, city, pincode, state, bal_due.
Choose an appropriate data type and size for each column. In addition, define
clientno as the primary key column.
EVALUATION:
Understanding / Problem solving (4), Involvement (3), Timely Completion (3), Total (10)
TITLE: How to retrieve data from different tables using sub queries and correlated
queries.
THEORY:
❖ Steps:
1. The inner queries must be enclosed in parentheses, and must be on the right hand
side of the condition.
3. The ORDER BY clause appears at the end of the main select statement.
4. Sub queries are always executed from the most deeply nested to the least deeply
nested, unless they are correlated queries.
5. Logical and SQL operators may be used, as well as ANY and ALL.
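The difference between an ordinary sub query and a correlated one is easiest to see on data. The sketch below uses Python's sqlite3 module in place of Oracle (an assumption), with a hypothetical table emp_demo. ANY and ALL are not shown because SQLite does not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp_demo (ename TEXT, sal INTEGER, deptno INTEGER);
INSERT INTO emp_demo VALUES
    ('KING', 5000, 10), ('CLARK', 2450, 10),
    ('SCOTT', 3000, 20), ('SMITH', 2900, 20);
""")

# Ordinary sub query: the inner query runs once and yields the company average (3337.5).
above_company_avg = [r[0] for r in conn.execute("""
    SELECT ename FROM emp_demo
    WHERE sal > (SELECT AVG(sal) FROM emp_demo)
    ORDER BY ename""")]

# Correlated sub query: the inner query refers to the outer row (e.deptno),
# so it is evaluated for each candidate row of the outer query.
above_dept_avg = [r[0] for r in conn.execute("""
    SELECT e.ename FROM emp_demo e
    WHERE e.sal > (SELECT AVG(i.sal) FROM emp_demo i WHERE i.deptno = e.deptno)
    ORDER BY e.ename""")]

print(above_company_avg, above_dept_avg)
```

In both queries the inner SELECT is enclosed in parentheses on the right-hand side of the condition, and the ORDER BY clause appears only at the end of the main statement.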
EXERCISE:
1. Find the employees who earn the maximum salary for their department. Display the
result in ascending order of salary.
2. Find the most recently hired employees in each department. Order by hire date.
3. Find the employees who earn the highest salary in each job type. Sort in descending
salary order.
4. Show the following details for any employee who earns a salary less than the average
for their department.
ENAME SALARY DNAME JOB
5. Who are the top three earners in the company? Display their name and salary.
6. Display the empno, name, job and deptno for employees whose salary is greater than
the highest salary in any SALES department.
EVALUATION:
Understanding / Problem solving (4), Involvement (3), Timely Completion (3), Total (10)
❖ Creation of views:
Syntax: CREATE VIEW <ViewName> AS
SELECT <ColumnName1>, <ColumnName2>
FROM <TableName>
WHERE <ColumnName> = <Expression>
GROUP BY <GroupingCriteria>
HAVING <Predicate>;
Example: Create a view on the emp table for department 10 that gives access to the
columns empno, ename and sal.
Answer: CREATE VIEW vw_emp10 AS SELECT empno, ename, sal FROM emp
WHERE deptno = 10;
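The view vw_emp10 from the example can be exercised as follows. This is a sketch that uses Python's sqlite3 module in place of Oracle (an assumption), and the emp rows are hypothetical. It also shows that a view stores no rows of its own: a row added to emp afterwards appears in the view immediately.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp (empno INTEGER, ename TEXT, sal INTEGER, deptno INTEGER);
INSERT INTO emp VALUES (1, 'KING', 5000, 10), (2, 'SCOTT', 3000, 20);
CREATE VIEW vw_emp10 AS SELECT empno, ename, sal FROM emp WHERE deptno = 10;
""")

# The view is queried like a table, but exposes only department 10.
before = conn.execute("SELECT * FROM vw_emp10 ORDER BY empno").fetchall()

# A later insert into the base table shows up in the view without any refresh.
conn.execute("INSERT INTO emp VALUES (3, 'CLARK', 2450, 10)")
after = conn.execute("SELECT * FROM vw_emp10 ORDER BY empno").fetchall()

print(before, after)
```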
EXERCISE:
1. Create a view on the emp table for the job “Clerk” that gives access to the columns empno,
ename, job and sal, and rename the column empno as
“empnumber”. Then access the data of the view.
EVALUATION:
Understanding / Problem solving (4), Involvement (3), Timely Completion (3), Total (10)
THEORY:
❖ Introduction of Index:
An index is an ordered list of the contents of a column (or a group of columns) of a
table.
Indexing involves forming a two-dimensional matrix that is completely independent of the
table on which the index is being created. One column of this matrix holds sorted data
extracted from the table column(s) on which the index is created.
Another column, called the address field, identifies the location of the record in the
Oracle database.
❖ Creation of an Index:
An index can be created on one or more columns. Based on the number of columns
included in the index, an index can be:
• Simple Index
• Composite Index
• Unique Index
❖ Creation of Simple Index:
An index created on a single column of a table is called a Simple Index. The syntax
for creating a simple index that allows duplicate values is:
Syntax: CREATE INDEX <IndexName> ON <TableName> (<ColumnName>);
❖ Dropping Index:
Indexes associated with the tables can be removed by using the DROP INDEX
command.
Syntax: DROP INDEX <IndexName>;
When a table that has associated indexes is dropped, the Oracle engine
automatically drops all the associated indexes as well.
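The three kinds of index and the two ways an index disappears can be sketched as follows, with Python's sqlite3 module standing in for Oracle (an assumption). The table orders_demo, its columns and the index names are hypothetical. The sqlite_master catalog query is SQLite-specific; Oracle keeps this information in its own data dictionary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_demo (orderid INTEGER, custno TEXT, odate TEXT)")

conn.execute("CREATE INDEX idx_custno ON orders_demo (custno)")            # simple index
conn.execute("CREATE INDEX idx_cust_date ON orders_demo (custno, odate)")  # composite index
conn.execute("CREATE UNIQUE INDEX idx_orderid ON orders_demo (orderid)")   # unique index

def index_names():
    """List the indexes that currently exist on orders_demo (SQLite catalog)."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'index' AND tbl_name = 'orders_demo' ORDER BY name"
    )
    return [r[0] for r in rows]

created = index_names()

# A unique index rejects a second row with the same indexed value.
conn.execute("INSERT INTO orders_demo VALUES (1, 'C001', '2024-01-05')")
try:
    conn.execute("INSERT INTO orders_demo VALUES (1, 'C002', '2024-01-06')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# DROP INDEX removes one index; dropping the table removes the rest with it.
conn.execute("DROP INDEX idx_custno")
after_drop_index = index_names()
conn.execute("DROP TABLE orders_demo")
after_drop_table = index_names()

print(created, duplicate_rejected, after_drop_index, after_drop_table)
```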
❖ Introduction of View:
A VIEW is a virtual table in the database whose contents are defined by a query.
A view holds no data at all until a specific call to the view is made. This reduces
redundant data on the HDD to a very large extent.
❖ Creation of views:
Syntax: CREATE VIEW <ViewName> AS
SELECT <ColumnName1>, <ColumnName2>
FROM <TableName>
WHERE <ColumnName> = <Expression>
GROUP BY <GroupingCriteria>
HAVING <Predicate>;
Example: Create a view on the emp table for department 10 that gives access to the
columns empno, ename and sal.
Answer: CREATE VIEW vw_emp10 AS SELECT empno, ename, sal FROM emp
WHERE deptno = 10;
❖ Introduction of Sequence:
Most applications require the automatic generation of numeric values.
Sequences are tools used to generate a unique sequential number that can be used
in the data tables. One of the best features of sequences is that they guarantee that
you will get a unique value when you access the sequence.
The value generated can have a maximum of 38 digits.
❖ Creation of Sequence:
Syntax: CREATE SEQUENCE <SequenceName>
[INCREMENT BY <IntegerValue>
START WITH <IntegerValue>
MAXVALUE <IntegerValue> / NOMAXVALUE
MINVALUE <IntegerValue> / NOMINVALUE
CYCLE/ NOCYCLE
CACHE <IntegerValue>/ NOCACHE
ORDER / NOORDER]
Note:
A sequence is always given a name so that it can be referenced later when required.
The ORDER / NOORDER clause has no significance if Oracle is configured with the
Single Server option. It is useful only when you are using Parallel Server in Parallel
mode.
If the CACHE / NOCACHE clause is omitted, Oracle caches 20 sequence numbers by
default.
Example:
Create sequence order_seq, which will generate numbers from 1 to 9999 in ascending
order with an interval of 1. The sequence must restart from the number 1 after
generating number 9999.
CREATE SEQUENCE order_seq INCREMENT BY 1 START WITH 1
MINVALUE 1 MAXVALUE 9999 CYCLE;
❖ Referencing a Sequence:
Once a sequence is created, SQL can be used to view the values held in its cache. To
simply view a sequence value, use a SELECT statement as described below.
SELECT <sequence_name>.NextVal FROM dual;
This will display the next value held in the cache on the screen. Every time
NextVal references a sequence, its output is automatically incremented from the old
value to the new value, ready for use.
After creating a table you can add data by using the INSERT command like this:
INSERT INTO sales_order(o_no, o_date, c_no)
VALUES (order_seq.nextval, sysdate, 'c0001');
To reference the current value of a sequence:
SELECT <sequence_name>.CurrVal FROM dual;
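The behaviour of NextVal, CurrVal and the CYCLE option can be modelled in a few lines. The class below is a toy model written in plain Python for illustration, not Oracle's implementation: it handles ascending sequences only, and it ignores CACHE and ORDER. The class name Sequence and its parameters are hypothetical. MAXVALUE is set to 3 instead of 9999 so that the wrap-around is visible.

```python
class Sequence:
    """Toy model of an ascending sequence with nextval / currval."""

    def __init__(self, start=1, increment=1, minvalue=1, maxvalue=9999, cycle=False):
        self.start = start
        self.increment = increment
        self.minvalue = minvalue
        self.maxvalue = maxvalue
        self.cycle = cycle
        self._current = None  # no current value until nextval is first called

    def nextval(self):
        if self._current is None:
            self._current = self.start
        elif self._current + self.increment > self.maxvalue:
            if not self.cycle:
                raise OverflowError("sequence exceeds MAXVALUE")
            self._current = self.minvalue  # CYCLE: restart at MINVALUE
        else:
            self._current += self.increment
        return self._current

    def currval(self):
        if self._current is None:
            raise RuntimeError("currval is not defined before the first nextval")
        return self._current


order_seq = Sequence(start=1, increment=1, minvalue=1, maxvalue=3, cycle=True)
values = [order_seq.nextval() for _ in range(5)]
current = order_seq.currval()
print(values, current)
```

Calling nextval advances the sequence, while currval only repeats the value most recently handed out.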
❖ Introduction of Synonyms:
A synonym is an alternative name for objects such as tables, views, sequences,
stored procedures, and other database objects.
Syntax: CREATE [OR REPLACE] [PUBLIC] SYNONYM [SCHEMA.]
SYNONYM_NAME FOR [SCHEMA.] OBJECT_NAME [@DBLINK];
Now, users of other schemas can reference the table EMP, which is now called
EMPLOYEES, without having to prefix the table name with the schema name
SCOTT.
EXERCISE:
1. Create a sequence “seq3” with the following parameters:
Increment by -1, cache 20, cycle, noorder, generating the numbers
from 1 to 5000 in descending order.
2. Create a simple index on the “orderid” column of the table ‘order’.
3. Create a synonym “employee” for the table emp.
EVALUATION:
Understanding / Problem solving (4), Involvement (3), Timely Completion (3), Total (10)
THEORY:
❖ WHAT IS NORMALIZATION?
➢ “Normalization is essentially the process of taking a wide table with lots of columns
but few rows and redesigning it as several narrow tables with fewer columns but
more rows.”
A properly normalized design allows you to use storage space efficiently, eliminate
redundant data, reduce or eliminate inconsistent data, and ease the data
maintenance burden. Before looking at the forms of normalization, you need to
know one cardinal rule for normalizing a database:
“You must be able to reconstruct the original flat view of the data.”
❖ Forms of normalization:
Relational database theorists have divided normalization into several rules called
normal forms.
• First Normal Form: No repeating groups.
• Second Normal Form: No non-key attributes depend on a portion of the primary
key.
• Third Normal Form: No attributes depend on other non-key attributes.
• Boyce-Codd normal form (BCNF): Every non-trivial functional dependency in
the table is a dependency on a superkey.
• Fourth Normal Form: Every non-trivial multivalued dependency in the table is a
dependency on a superkey.
• Fifth Normal Form: Every non-trivial join dependency in the table is implied by
the superkeys of the table.
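The cardinal rule, that the original flat view must be reconstructible, can be checked mechanically. The sketch below uses Python's sqlite3 module in place of Oracle (an assumption) and a small hypothetical table flat in which building depends on collection rather than on the key. It splits the table in two and then joins the parts back together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flat (title TEXT, collection TEXT, building TEXT);
INSERT INTO flat VALUES
    ('Book A', 'Main Stacks', 'North Block'),
    ('Book B', 'Main Stacks', 'North Block'),
    ('Book C', 'Law Library', 'South Block');

-- building depends on collection, a non-key attribute, so it moves to its own table
CREATE TABLE collection_demo AS SELECT DISTINCT collection, building FROM flat;
CREATE TABLE book_demo AS SELECT title, collection FROM flat;
""")

flat_rows = conn.execute("SELECT * FROM flat ORDER BY title").fetchall()

# Joining the narrow tables must give back exactly the original wide table.
rebuilt = conn.execute("""
    SELECT b.title, b.collection, c.building
    FROM book_demo b JOIN collection_demo c ON b.collection = c.collection
    ORDER BY b.title""").fetchall()

collection_count = conn.execute("SELECT COUNT(*) FROM collection_demo").fetchone()[0]
print(rebuilt == flat_rows, collection_count)
```

The building of 'Main Stacks' is now stored once instead of once per book, which is the redundancy that the third normal form removes.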
EXERCISE:
6) Normalize the following table up to third normal form:
Author Last Name | Author First Name | Book Title                      | Subject              | Publisher | Collection or Library                      | Building
Berdahl          | Robert            | Politics                        | History              | Wiley     | PCL General Stacks                         | B – Block
Yudof            | Mark              | Child Abuse                     | Legal Procedures     | Person    | Law Library                                | C – Block
Harmon           | Glynn             | Human Memory and Knowledge      | Cognitive Psychology | TMH       | PCL General Stacks                         | B – Block
Graves           | Robert            | The Golden Fleece               | Greek Literature     | Wiley     | Classics Library                           | D – Block
Miksa            | Francis           | Charles Ammi Cutter             | Library Biography    | Person    | Library and Information Science Collection | B – Block
Hunter           | David             | Music Publishing and Collecting | Music Literature     | TMH       | Fine Arts Library                          | A – Block
Graves           | Robert            | English and Scottish Ballads    | Folksong             | Mahajan   | PCL General Stacks                         | B – Block
EVALUATION:
Understanding / Problem solving (4), Involvement (3), Timely Completion (3), Total (10)
THEORY:
❖ WHAT IS NOSQL?
➢ NoSQL is a class of non-relational database management systems that differ from traditional relational
database management systems in some significant ways. It is designed for distributed data stores
with very large-scale data storage needs (for example Google or Facebook, which collect terabits
of data every day for their users). This type of data store may not require a fixed schema, avoids join
operations and typically scales horizontally.
Example:
Social-network graph:
➢ Task: Retrieve all pages regarding athletics of Summer Olympic before 1950.
❖ RDBMS vs NoSQL
➢ RDBMS
- Structured and organized data
- Structured query language (SQL)
- Data and its relationships are stored in separate tables.
- Data Manipulation Language, Data Definition Language
- Tight Consistency
➢ NoSQL
- Stands for Not Only SQL
- No declarative query language
- No predefined schema
- Key-Value pair storage, Column Store, Document Store, Graph databases
- Eventual consistency rather than ACID properties
- Unstructured and unpredictable data
- CAP Theorem
- Prioritizes high performance, high availability and scalability
- BASE Transaction
Consistency - This means that the data in the database remains consistent after the execution of an operation. For
example, after an update operation all clients see the same data.
Availability - This means that the system is always on (guaranteed service availability), with no downtime.
Partition Tolerance - This means that the system continues to function even when communication among the servers
is unreliable, i.e. the servers may be partitioned into multiple groups that cannot communicate with one another.
In theory it is impossible to fulfill all 3 requirements. CAP states that a distributed system can satisfy only
2 of the 3 requirements. Therefore, current NoSQL databases follow different
combinations of C, A and P from the CAP theorem. Here is a brief description of the three combinations CA, CP and AP:
CA - Single site cluster, therefore all nodes are always in contact. When a partition occurs, the system blocks.
CP - Some data may not be accessible, but the rest is still consistent/accurate.
AP - System is still available under partitioning, but some of the data returned may be inaccurate.
❖ NoSQL pros/cons
➢ Advantages :
• High scalability
• Distributed Computing
• Lower cost
• Schema flexibility, semi-structured data
• No complicated Relationships
➢ Disadvantages
• No standardization
• Limited query capabilities (so far)
• Eventual consistency is not intuitive to program for
➢ The BASE
The CAP theorem states that a distributed computer system cannot guarantee all of the following three properties at
the same time:
• Consistency
• Availability
• Partition tolerance
A BASE system gives up on consistency.
• Basically Available indicates that the system does guarantee availability, in terms of the CAP theorem.
• Soft state indicates that the state of the system may change over time, even without input. This is because of
the eventual consistency model.
• Eventual consistency indicates that the system will become consistent over time, given that the system
doesn't receive input during that time.
➢ ACID vs BASE
ACID          BASE
Atomic        Basically Available
Consistent    Soft state
Isolated      Eventual consistency
Durable
❖ NoSQL Categories
There are four general types (most common categories) of NoSQL databases. Each of these categories has its own
specific attributes and limitations. There is no single solution that is better than all the others; however, some
databases are better suited to solving specific problems. To clarify the NoSQL databases, let's discuss the most
common categories:
• Key-value stores
• Column-oriented
• Graph
• Document oriented
➢ Key-value stores
➢ Column-oriented databases
• Column-oriented databases primarily work on columns, and every column is treated individually.
• Values of a single column are stored contiguously.
• A column store keeps data in column-specific files.
• In column stores, query processors work on columns too.
• All data within each column data file has the same type, which makes it ideal for compression.
• Column stores can improve the performance of queries because they can access only the specific column data needed.
• High performance on aggregation queries (e.g. COUNT, SUM, AVG, MIN, MAX).
• Suited to data warehouses and business intelligence, customer relationship management (CRM), library
card catalogs etc.
Examples of column-oriented databases: BigTable, Cassandra, SimpleDB etc.
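The points above can be illustrated without any database at all. The sketch below is plain Python with hypothetical sales records. It turns a row-oriented layout into a column-oriented one, runs an aggregate over a single column, and uses run-length encoding as one simple example of why a column of same-typed values compresses well. Real column stores use their own storage formats and compression schemes.

```python
from itertools import groupby

# Hypothetical sales records in the usual row-oriented layout.
rows = [
    {"id": 1, "city": "Rajkot", "amount": 120},
    {"id": 2, "city": "Rajkot", "amount": 80},
    {"id": 3, "city": "Surat", "amount": 200},
    {"id": 4, "city": "Surat", "amount": 50},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate such as SUM touches only the column it needs.
total_amount = sum(columns["amount"])

# Repeated neighbouring values collapse into (value, run length) pairs.
city_runs = [(city, len(list(run))) for city, run in groupby(columns["city"])]

print(total_amount, city_runs)
```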
➢ Graph databases
A graph data structure consists of a finite (and possibly mutable) set of ordered pairs, called edges or arcs, of certain
entities called nodes or vertices.
The following picture presents a labeled graph of 6 vertices and 7 edges.
Relational model    Graph model
Rows                Vertices
Joins               Edges
• A collection of documents
• Data in this model is stored inside documents.
• A document is a key value collection where the key allows access to its value.
• Documents are not typically forced to have a schema and therefore are flexible and easy to change.
• Documents are stored into collections in order to group different kinds of data.
• Documents can contain many different key-value pairs, or key-array pairs, or even nested documents.
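The document model described above maps naturally onto dictionaries. The sketch below is plain Python and does not use any particular document database. The collection, its fields and the helper find are all hypothetical.

```python
# A collection is a group of documents; documents need not share a schema.
collection = [
    {"_id": 1, "name": "Asha", "skills": ["SQL", "Java"]},                       # key-array pair
    {"_id": 2, "name": "Ravi", "address": {"city": "Rajkot", "pin": "360001"}},  # nested document
]

def find(docs, **criteria):
    """Return the documents whose top-level keys match every given value."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

ravi = find(collection, name="Ravi")
city = ravi[0]["address"]["city"]  # a key gives access to its (nested) value

# A document that lacks a key is simply skipped instead of causing an error.
sql_people = [d["name"] for d in collection if "SQL" in d.get("skills", [])]

print(city, sql_people)
```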
Here is a comparison between the classic relational model and the document model:
Relational model    Document model
Tables              Collections
Rows                Documents
➢ Production deployment
There is a large number of companies using NoSQL. To name a few :
• Google
• Facebook
• Mozilla
• Adobe
• Foursquare
• LinkedIn
• Digg
• McGraw-Hill Education
• Vermont Public Radio
EVALUATION:
Understanding / Problem solving (4), Involvement (3), Timely Completion (3), Total (10)
THEORY:
❖ WHAT IS BIGDATA?
Data that is very large in size is called Big Data. Normally we work with data of size MB (Word documents, Excel sheets)
or at most GB (movies, code), but data in petabytes, i.e. 10^15 bytes, is called Big Data. It is stated that
almost 90% of today's data has been generated in the past 3 years.
o Social networking sites: Facebook, Google and LinkedIn all generate huge amounts of data on a day-to-day
basis, as they have billions of users worldwide.
o E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs from which users'
buying trends can be traced.
o Weather stations: All weather stations and satellites produce very large volumes of data, which are stored and
processed to forecast the weather.
o Telecom companies: Telecom giants like Airtel and Vodafone study user trends and publish their plans
accordingly, and for this they store the data of their millions of users.
o Share market: Stock exchanges across the world generate huge amounts of data through their daily transactions.
❖ WHAT IS HADOOP?
Hadoop is an open source framework from Apache used to store, process and analyze data that is very
huge in volume. Hadoop is written in Java and is not OLAP (online analytical processing). It is used for
batch/offline processing. It is used by Facebook, Yahoo, Google, Twitter, LinkedIn and many more. Moreover,
it can be scaled up just by adding nodes to the cluster.
➢ Modules of Hadoop
1. HDFS: Hadoop Distributed File System. Google published its paper on GFS (Google File System), and HDFS
was developed on the basis of that paper. It states that files will be broken into blocks and stored on nodes
across the distributed architecture.
2. YARN: Yet Another Resource Negotiator is used for job scheduling and for managing the cluster.
3. MapReduce: This is a framework that helps Java programs perform parallel computation on data using
key-value pairs. The Map task takes input data and converts it into a data set that can be computed as
key-value pairs. The output of the Map task is consumed by the Reduce task, and the output of the reducer
gives the desired result.
4. Hadoop Common: These Java libraries are used to start Hadoop and are used by the other Hadoop modules.
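The Map and Reduce steps described in item 3 can be traced on a word-count example. The sketch below is plain, single-process Python, meant only to show the flow of key-value pairs. It does not use Hadoop's Java API, and the function names map_phase, shuffle and reduce_phase are hypothetical.

```python
from collections import defaultdict

def map_phase(line):
    """Map task: turn one input line into (word, 1) key-value pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all values by key, as happens between the Map and Reduce tasks."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce task: combine all values of one key into a single result."""
    return key, sum(values)

lines = ["big data big cluster", "data node"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the grouped pairs are sent over the network to the reducers.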
➢ Advantages of Hadoop
o Fast: In HDFS the data is distributed over the cluster and mapped, which helps in faster retrieval. Even the
tools to process the data are often on the same servers, thus reducing the processing time. It is able to
process terabytes of data in minutes and petabytes in hours.
o Scalable: A Hadoop cluster can be extended just by adding nodes to the cluster.
o Cost effective: Hadoop is open source and uses commodity hardware to store data, so it is really cost effective
compared to a traditional relational database management system.
o Resilient to failure: HDFS can replicate data over the network, so if one
node is down or some other network failure happens, Hadoop takes another copy of the data and uses it.
Normally, data is replicated thrice, but the replication factor is configurable.
➢ Hadoop Installation
Environment required for Hadoop: The production environment of Hadoop is UNIX, but it can also be used on
Windows using Cygwin. Java 1.6 or above is needed to run MapReduce programs. For Hadoop installation from
a tarball on the UNIX environment you need:
1. Java Installation
2. SSH installation
3. Hadoop Installation and File Configuration
➢ 1) Java Installation
Step 1. Type "java -version" at the prompt to find out whether Java is installed. If not, download Java from
http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html . The tar file
jdk-7u71-linux-x64.tar.gz will be downloaded to your system.
Step 3. To make Java available to all users of UNIX, move the extracted JDK directory to /usr/lib and set the path.
At the prompt, switch to the root user and then type the command below to move the JDK to /usr/lib.
# mv jdk1.7.0_71 /usr/lib/
Now add the following lines to the ~/.bashrc file to set up the path.
export JAVA_HOME=/usr/lib/jdk1.7.0_71
export PATH=$PATH:$JAVA_HOME/bin
Now you can check the installation by typing "java -version" at the prompt.
➢ 2) SSH Installation
SSH is used to interact with the master and slave computers without any prompt for a password. First of
all, create a hadoop user on the master and slave systems:
# useradd hadoop
# passwd hadoop
To map the nodes, open the hosts file in the /etc/ folder on all the machines and add the IP addresses along
with their host names.
# vi /etc/hosts
190.12.1.114 hadoop-master
190.12.1.121 hadoop-slave-one
190.12.1.143 hadoop-slave-two
Set up an SSH key on every node so that the nodes can communicate among themselves without a password.
The commands are:
# su hadoop
$ ssh-keygen -t rsa
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-master
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-one
$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@hadoop-slave-two
$ chmod 0600 ~/.ssh/authorized_keys
$ exit
➢ 3) Hadoop Installation
$ mkdir /usr/hadoop
$ sudo tar vxzf hadoop-2.2.0.tar.gz -C /usr/hadoop
Set the Java path in hadoop-env.sh:
export JAVA_HOME=/usr/lib/jdk1.7.0_71
Then configure core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://hadoop-master:9000</value>
  </property>
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
</configuration>
Configure hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/hadoop/dfs/name/data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/hadoop/dfs/name</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
Configure mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:9001</value>
  </property>
</configuration>
$ cd $HOME
$ vi .bashrc
Append the following lines at the end, then save and exit:
#Hadoop variables
export JAVA_HOME=/usr/lib/jdk1.7.0_71
export HADOOP_INSTALL=/usr/hadoop
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export YARN_HOME=$HADOOP_INSTALL
# su hadoop
$ cd /opt/hadoop
$ scp -r hadoop hadoop-slave-one:/usr/hadoop
$ scp -r hadoop hadoop-slave-two:/usr/hadoop
$ vi etc/hadoop/masters
hadoop-master

$ vi etc/hadoop/slaves
hadoop-slave-one
hadoop-slave-two
After this, format the name node and start all the daemons:
# su hadoop
$ cd /usr/hadoop
$ bin/hadoop namenode -format

$ cd $HADOOP_INSTALL/sbin
$ start-all.sh
EVALUATION:
Understanding / Problem solving (4), Involvement (3), Timely Completion (3), Total (10)