Unit 1: Big Data
Velocity. With the growth in the Internet of Things, data streams into businesses at an
unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart
meters are driving the need to deal with these torrents of data in near-real time.
Variety. Data comes in all types of formats – from structured, numeric data in traditional
databases to unstructured text documents, emails, videos, audio, stock ticker data and financial
transactions.
Big data use cases
Product development: Companies like Netflix and Procter & Gamble use big data to anticipate
customer demand. They build predictive models for new products and services by classifying key
attributes of past and current products or services and modeling the relationship between those
attributes and the commercial success of the offerings.
Predictive maintenance: Factors that can predict mechanical failures may be deeply buried in
structured data, such as the year, make, and model of equipment, as well as in unstructured data that
covers millions of log entries, sensor data, error messages, and engine temperature.
Customer experience: Big data enables you to gather data from social media, web visits, call logs, and other sources to improve the interaction experience and maximize the value delivered. Start delivering personalized offers, reduce customer churn, and handle issues proactively.
Fraud and compliance: Big data helps you identify patterns in data that indicate fraud and aggregate large volumes of information to make regulatory reporting much faster.
Machine learning: Big data is one of the reasons machine learning has taken off. We are now able to teach machines instead of program them, and the availability of big data to train machine-learning models is what makes that possible.
What is big data analytics?
Big data analytics describes the process of uncovering trends, patterns, and correlations in large amounts of raw data to help make data-informed decisions. These processes apply familiar statistical analysis techniques, like clustering and regression, to more extensive datasets with the help of newer tools.
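As a small illustration of applying a familiar technique (clustering) to a larger dataset with a newer tool, the sketch below groups customers with scikit-learn's k-means; the file name and column names are assumptions made for the example.

```python
# Minimal sketch: k-means clustering applied to a tabular dataset.
# "customer_metrics.csv" and its columns are illustrative assumptions.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("customer_metrics.csv")              # assumed input file
features = df[["annual_spend", "visits_per_month"]]   # assumed numeric columns

# Group customers into 5 clusters based on the selected features.
model = KMeans(n_clusters=5, n_init=10, random_state=0)
df["cluster"] = model.fit_predict(features)

# Inspect the average profile of each cluster.
print(df.groupby("cluster").mean(numeric_only=True))
```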
How big data analytics works
Big data analytics refers to collecting, processing, cleaning, and analyzing large datasets to help organizations operationalize their big data.
1. Collect Data
Data collection looks different for every organization. With today’s technology,
organizations can gather both structured and unstructured data from a variety of sources
— from cloud storage to mobile applications to in-store IoT sensors and beyond. Some
data will be stored in data warehouses where business intelligence tools and solutions can
access it easily. Raw or unstructured data that is too diverse or complex for a warehouse
may be assigned metadata and stored in a data lake.
2. Process Data
Once data is collected and stored, it must be organized properly to get accurate results on
analytical queries, especially when it’s large and unstructured. Available data is growing
exponentially, making data processing a challenge for organizations. One processing option is
batch processing, which looks at large data blocks over time. Batch processing is useful when
there is a longer turnaround time between collecting and analyzing data. Stream processing
looks at small batches of data at once, shortening the delay time between collection and
analysis for quicker decision-making. Stream processing is more complex and often more
expensive.
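The difference between the two modes can be sketched with PySpark, which supports both batch reads and structured streaming; the path, host, and port below are illustrative assumptions.

```python
# Minimal PySpark sketch contrasting batch and stream processing.
# The input path, host, and port are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch processing: analyze a large block of already-collected data.
batch_df = spark.read.json("s3://example-bucket/events/2024/")
batch_df.groupBy("event_type").count().show()

# Stream processing: analyze small batches as they arrive for quicker decisions.
stream_df = (spark.readStream
             .format("socket")            # simple text source for the sketch
             .option("host", "localhost")
             .option("port", 9999)
             .load())
query = (stream_df.groupBy("value").count()
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```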
3. Clean Data
Data big or small requires scrubbing to improve data quality and get stronger results; all
data must be formatted correctly, and any duplicative or irrelevant data must be eliminated
or accounted for. Dirty data can obscure and mislead, creating flawed insights.
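A small pandas sketch of this scrubbing step is shown below; the file and column names are assumptions.

```python
# Minimal data-cleaning sketch with pandas: fix formats and drop duplicate
# or irrelevant records. "raw_orders.csv" and its columns are assumptions.
import pandas as pd

df = pd.read_csv("raw_orders.csv")

df = df.drop_duplicates()                                 # remove duplicative rows
df = df.dropna(subset=["order_id", "amount"])             # drop rows missing key fields
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")  # consistent format
df = df[df["amount"] > 0]                                 # discard invalid records

df.to_csv("clean_orders.csv", index=False)
```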
4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics
processes can turn big data into big insights. Some of these big data analysis
methods include:
● Hadoop is an open-source framework that efficiently stores and processes big datasets on
clusters of commodity hardware. This framework is free and can handle large amounts of
structured and unstructured data, making it a valuable mainstay for any big data operation.
● NoSQL databases are non-relational data management systems that do not require a fixed schema, making them a great option for big, raw, unstructured data. NoSQL stands for "not only SQL," and these databases can handle a variety of data models.
● MapReduce is an essential component of the Hadoop framework, serving two functions. The first is mapping, which filters data to various nodes within the cluster. The second is reducing, which organizes and reduces the results from each node to answer a query (a small word-count sketch follows this list).
● YARN stands for “Yet Another Resource Negotiator.” It is another component of
second-generation Hadoop. The cluster management technology helps with job
scheduling and resource management in the cluster.
● Spark is an open source cluster computing framework that uses implicit data
parallelism and fault tolerance to provide an interface for programming entire
clusters. Spark can handle both batch and stream processing for fast computation.
● Tableau is an end-to-end data analytics platform that allows you to prep,
analyze, collaborate, and share your big data insights. Tableau excels in
self-service visual analysis, allowing people to ask new questions of governed big
data and easily share those insights across the organization.
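The mapping and reducing functions mentioned above can be sketched as two small scripts in the style of Hadoop Streaming; this is a minimal, illustrative word count, not a definitive implementation.

```python
# mapper.py -- minimal word-count mapper in the Hadoop Streaming style.
# Reads lines from standard input and emits (word, 1) pairs; the framework
# routes pairs with the same key to the same reducer.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sums the counts for each word emitted by the mapper.
# Hadoop Streaming delivers the mapper output sorted by key, so equal words
# arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```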
Big benefits of big data analytics
The ability to analyze more data at a faster rate can provide big benefits to an organization, allowing it to use data more efficiently to answer important questions.
Pig
Pig is a high-level platform or tool used to process large datasets.
Flink
Flink has been designed to run in all common cluster environments, perform computations at
in-memory speed and at any scale.
Spark
Spark is a general-purpose distributed processing system used for big data workloads. It has
been deployed in every type of big data use case to detect patterns, and provide real-time insight.
MongoDB
MongoDB is a document-oriented NoSQL database commonly used to store and query unstructured and semi-structured data.
Big data systems can be considered to have four major layers, each of which is indispensable.
Statistical analytics: Platforms that provide analytics capabilities beyond simple data mining, such as running algorithms that range from simple regressions to advanced neural networks, using tools such as Google TensorFlow or R, fall into this category.
Data management: Data encryption, governance, access control, compliance, and other features salient to any enterprise and production environment form the next basic layer, helping to manage and, in some ways, reduce operational complexity.
End user: The end user of the analytics software forms the final aspect of a big data analytics engagement. The term data scientist is often used to denote individuals who implement the underlying big data analytics capabilities, while business users reap the benefits of faster access and analytics capabilities not available in traditional systems.
Types of Big Data
Structured
Structured data, as the name implies, indicates datasets that have a defined
organizational structure such as Microsoft Excel or CSV files.
Unstructured
Unstructured data consists of any dataset that does not have a predefined organizational schema. Spoken words, music, videos, and even books would be considered unstructured.
Data from social media, such as posts on Twitter, messages from friends on Facebook, and
photos on Instagram, are all examples of unstructured data.
For textual data, technologies such as Lucene/Solr, Elasticsearch, and others are generally used for querying, indexing, and other operations.
Semi-structured
Semi-structured data is usually represented by formats, such as JSON, that can encapsulate both structured data and schemaless or arbitrary associations, generally using key-value pairs.
A more common example is email messages, which have both a structured part, such as the name of the sender and the time when the message was received, and an unstructured portion represented by the body or content of the email.
Platforms such as MongoDB and CouchDB are generally used to store and query semi-structured datasets.
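As a minimal illustration, the email example above can be encoded as JSON, with structured fields alongside an unstructured body; the field names and values are assumptions.

```python
# Minimal sketch of a semi-structured record: structured fields (sender,
# timestamp) plus an unstructured body, held together as JSON key-value pairs.
# All field names and values here are illustrative assumptions.
import json

email = {
    "from": "alice@example.com",            # structured part
    "received_at": "2024-03-01T09:15:00Z",  # structured part
    "tags": ["inbox", "unread"],            # arbitrary, schemaless association
    "body": "Hi team, attached are the quarterly numbers we discussed...",
}

print(json.dumps(email, indent=2))
```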
What is big data mining?
The success of any big data platform depends largely on finding relevant problems in
business units that will deliver measurable value for the department or organization.
Who needs big data mining: Determining which business groups will benefit most significantly from a big data mining solution is the first step in this process.
Most organizations follow a hierarchical structure where the interaction among business
colleagues is likely to be mainly along rank lines.
The key is getting these people involved in the decision-making process and on board with the vision and mission.
As outlined earlier, the implementation process can span multiple steps. These steps are
often iterative in nature and require a trial-and-error approach.
A big data strategy will involve multiple stakeholders, and a collaborative approach often yields the best results. Business sponsors, business support, and IT & Analytics are three broad categories of stakeholders that together create a proper unified solution, catering to the needs of the business to the extent that budget and IT capabilities permit.
Stakeholders of the solution
Business sponsor: The individual or department that provides the support and/or funding for
the project.
Implementation group: The team that implements the solution from a hands-on perspective. This is usually the IT or Analytics department.
Legal: Open source software licenses such as Apache 2.0, MIT, and BSD are generally more permissive relative to the GNU GPL (General Public License). For commercial solutions, the process is more involved as it requires the analysis of vendor-specific agreements.
Implementing the solution
The final implementation of the solution is the culmination of the collaboration between the implementation group, business beneficiaries, and auxiliary departments.
The Hadoop platform provides an improved programming model, which is used to create and run distributed systems quickly and efficiently.
A Hadoop cluster consists of a single master node and multiple worker nodes. The master node contains a NameNode and JobTracker, and a slave or worker node acts as both a DataNode and TaskTracker. Hadoop requires Java Runtime Environment (JRE) 1.6 or a higher version.
There are two main components of Apache Hadoop – the Hadoop Distributed File System
(HDFS) and the MapReduce parallel processing framework.
Who Uses Hadoop?
● British Airways
● Uber
● The Bank of Scotland
● Netflix
● The National Security Agency (NSA), of the United States
● The UK’s Royal Mail system
● Expedia
● Twitter
Installing Hadoop
The most common ones include installing Hadoop from the source files available at https://hadoop.apache.org.
It is an interpreted language like Python and uses a command line interpreter. It
supports procedural as well as generic functions with OOP.
To analyse the datasets, there are many techniques available, some of which are as follows:
Massive Parallelism
Data Distribution
High-Performance Computing
Data Retrieval
Massive Parallelism
According to the simplest definition, a parallel system is a system in which multiple processors work together to carry out concurrent computations.
Massive parallelism refers to a parallel system in which multiple interconnected systems act as a single, powerful conjoint processor and carry out tasks over the datasets in parallel.
In terms of Big Data, it is not only processors but also memory, other hardware, and even the network that can be conjoined to scale up operational efficiency, forming a massive system that can consume humongous datasets in parallel without breaking a sweat.
Data Distribution
There are several approaches to data distribution in a Big Data system, described as follows:
Centralised Approach: A central repository is used to store the data and serve it over the networking infrastructure.
Hierarchical Approach: In a hierarchical approach, the data is fetched from the parent node in the hierarchy.
Task Parallelism
Task parallelism refers to the execution of computer programs across multiple processors on the same or different machines. It emphasises performing diverse operations in parallel to make the best use of the accessible computing resources, such as memory and processors. It is a useful feature when complex applications have multiple tasks that need to be performed at the same time.
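A minimal Python sketch of task parallelism is shown below: two different operations run at the same time on separate processes. The task functions and their inputs are illustrative assumptions.

```python
# Minimal task-parallelism sketch: diverse operations executed in parallel
# on multiple processors, rather than one after the other.
from concurrent.futures import ProcessPoolExecutor

def summarize_sales(path):
    # Stand-in for one kind of task (e.g., aggregating a sales file).
    return f"sales summary built from {path}"

def index_documents(path):
    # Stand-in for a different kind of task (e.g., building a search index).
    return f"index built from {path}"

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        sales_future = pool.submit(summarize_sales, "sales.csv")   # assumed input
        index_future = pool.submit(index_documents, "docs/")       # assumed input
        print(sales_future.result())
        print(index_future.result())
```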
Data Mining and Analytics
Big Data systems utilise Hadoop and the HDFS architecture to retrieve data using MapReduce, a distributed processing framework. It helps programmers solve parallel data problems in which the dataset can be divided into small chunks and handled autonomously.
Machine Learning
Several cloud providers offer services that are relevant specifically to Big Data analytics. Some such providers are as follows:
Amazon
Windows Azure
In-Memory Technology for Big Data
Over the past two to three years, it has become increasingly common for corporations to move their processes to cloud-based environments as a complement to in-house infrastructure.
The term on-premises can be used interchangeably with in-house, on-site, and other similar terminology.
Selection of the hardware stack
The choice of hardware depends on the type of solution, the size of the data (gigabytes versus terabytes versus petabytes), and, to an extent, other considerations.
Multinode architecture: multiple nodes (or servers) are interconnected and work on the principle of multinode or distributed computing.
If the dataset is structured, contains primarily textual data, and is on the order of 1-5 TB, it is entirely possible in today's computing environment to host such datasets on single-node machines.
Cloud-based architecture: Cloud platforms such as Amazon Web Services, Azure from Microsoft, and the Google Compute Environment permit enterprises to provision tens to thousands of nodes at costs starting as low as 1 cent per hour per instance.
The exponential decrease in the cost of hardware: The cost of hardware has gone down
exponentially over the past few years.
Selection of the software stack
The Hadoop ecosystem: Hadoop supports nearly all the types of datasets well known in the big data space, such as structured, unstructured, and semi-structured data. It also has a thriving ecosystem of auxiliary tools that add new functionality.
There are four primary components of Hadoop, apart from the projects present in the large
ecosystem. They are as follows:
Hadoop Common: the common utilities that support the other Hadoop modules
Hadoop Distributed File System (HDFS): a distributed filesystem that provides high-throughput access to application data
Hadoop YARN: a framework for job scheduling and cluster resource management
Hadoop MapReduce: a YARN-based system for parallel processing of large datasets
Apache Spark
Apache Spark is a multinode computing framework first conceived at the University of California, Berkeley's AMPLab as a platform that provides a seamless interface to run parallel computations and overcome limitations of the Hadoop MapReduce framework.
Spark introduces a data structure called Resilient Distributed Datasets (RDDs), which provides a mechanism to store data in memory.
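A minimal PySpark sketch of working with an RDD kept in memory is shown below; the input path is an assumption.

```python
# Minimal RDD sketch: build an RDD, cache it in memory, and reuse it.
# The HDFS path is an illustrative assumption.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

lines = sc.textFile("hdfs:///data/access.log")
errors = lines.filter(lambda line: "ERROR" in line)

errors.cache()                                            # keep the RDD in memory
print(errors.count())                                     # first action materializes and caches it
print(errors.filter(lambda l: "timeout" in l).count())    # reuses the cached data
```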
Cluster manager: The nodes constituting a Spark cluster communicate using cluster
managers, which manage the overall coordination among the nodes that are part of the
cluster
Distributed storage: Spark can access data from a range of underlying distributed storage
systems such as HDFS, S3 (AWS Storage), Cassandra, HBase, Hive, Tachyon, and any
Hadoop data source
Spark can support multiple types of cluster managers as well as backend storage systems.
NoSQL and traditional databases
The term NoSQL came into existence relatively recently and is meant to distinguish databases that do not follow the traditional relational-database model. There are both open source and commercial variations of NoSQL databases, and even cloud-based options have become increasingly common.
Key-value: These NoSQL databases store data on a principle of hashing—a unique key
identifies a set of properties about the key
In-memory: These databases keep data in memory, for example as caches, to provide faster access relative to storing on disk. NoSQL databases such as Redis and KDB+ leverage temporary in-memory storage in order to provide faster access to frequently used data.
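A minimal sketch of the key-value model with an in-memory store is shown below, using the redis-py client; the host, port, and key names are assumptions.

```python
# Minimal key-value / in-memory sketch with Redis (redis-py client).
# Host, port, and key names are illustrative assumptions.
import redis

r = redis.Redis(host="localhost", port=6379)

# A unique key identifies a set of properties about that key.
r.hset("user:1001", mapping={"name": "Alice", "plan": "premium"})

# Frequently used data is served from memory rather than from disk.
print(r.hgetall("user:1001"))
```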
Columnar: These databases store data as columns rather than rows to create a table. The primary advantage of columnar storage over row-based storage is that a columnar layout provides the means to access data faster with reduced I/O overhead.
Document-oriented: These databases store data that does not conform to any specific schema, such as unstructured text like news articles. Document databases such as MongoDB are widely used in media-related organizations such as the NY Times.
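A minimal pymongo sketch of storing and querying a schemaless article document is shown below; the connection string, database, and fields are assumptions.

```python
# Minimal document-store sketch with MongoDB via pymongo.
# Connection string, database, collection, and fields are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
articles = client["newsroom"]["articles"]

# Documents in the same collection need not share a schema.
articles.insert_one({
    "headline": "Example headline",
    "body": "Unstructured article text ...",
    "tags": ["politics", "economy"],
})

for doc in articles.find({"tags": "economy"}):
    print(doc["headline"])
```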
Cloud-based solutions
Finally, cloud-based solutions for large-scale data mining such as AWS Redshift, Azure SQL Data Warehouse, and Google BigQuery permit users to query datasets directly on the cloud vendor's platform without having to create their own architecture.
● HDFS -> Hadoop Distributed File System
● YARN -> Yet Another Resource Negotiator
● MapReduce -> Data processing using programming
● Spark -> In-memory Data Processing
● PIG, HIVE -> Data Processing Services using Query (SQL-like)
● HBase -> NoSQL Database
● Mahout, Spark MLlib -> Machine Learning
● Apache Drill -> SQL on Hadoop
● Zookeeper -> Managing Cluster
● Oozie -> Job Scheduling
● Flume, Sqoop -> Data Ingesting Services
● Solr & Lucene -> Searching & Indexing
● Ambari -> Provision, Monitor and Maintain cluster
Distributed and Parallel Computing for Big Data
•Big Data can’t be handled by traditional data storage and processing systems.
•In distributed computing, multiple computing resources are connected in a network and computing tasks are distributed among them.
•Complex computations are divided into subtasks, handled individually by processing units running in parallel.
•Concept – processing capability will increase with the increase in the level of parallelism.
Data parallelism paradigm
To solve the big data problems, the data parallelism paradigm can be considered.
Given a large scale dataset D, it can be divided into n subsets, denoted as D1, D2, D3,
.....Dn, where each subset may contain different numbers of data samples and each subset
may or may not have duplicate data samples.
Then, a specific data mining algorithm, implemented on n local machines (or computer nodes), is performed individually over each subset.
Finally, the n mining results are combined via one combination component to produce the
final output.
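The paradigm above can be sketched in a few lines of Python: the dataset is split into n subsets, the same (stand-in) mining step runs on each subset in parallel, and a combination step merges the local results. The mining and combination functions are illustrative assumptions.

```python
# Minimal data-parallelism sketch: split D into n subsets, mine each subset
# in parallel, then combine the n local results into the final output.
from multiprocessing import Pool

def mine_subset(subset):
    # Stand-in for a data mining algorithm run on one node:
    # count the records above a threshold.
    return sum(1 for x in subset if x > 0.5)

def combine(local_results):
    # Combination component that merges the n local results.
    return sum(local_results)

if __name__ == "__main__":
    D = [i / 1000 for i in range(1000)]            # illustrative dataset
    n = 4
    subsets = [D[i::n] for i in range(n)]          # D1, D2, ..., Dn

    with Pool(processes=n) as pool:
        local_results = pool.map(mine_subset, subsets)

    print(combine(local_results))                  # final output
```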
BIG DATA PROCESSING TECHNIQUES
With the increase in data, organizations are being forced to adopt a data analysis strategy that can scale.
•A system with added scalability can accommodate the growing amounts of data.
Companies offering these computing services are called cloud providers and typically charge
for cloud computing services based on usage, similar to how you are billed for water or
electricity at home.
A cloud has the resources and services that a business can use on demand, and the business
doesn't have to build, own or maintain the infrastructure.
Thus, the cloud makes big data technologies accessible and affordable to almost any size of
enterprise.
CLOUD COMPUTING AND BIG DATA
FEATURES OF CLOUD COMPUTING
-New hardware may not provide complete support to the software that used to run properly on the older hardware.
-A solution to this problem is using cloud services, which employ distributed computing resources.
•Resource Pooling - multiple organizations, which use similar kinds of resources to carry
out computing practices, have no need to individually hire all the resources.
•Self Service – cloud computing involves a simple user interface that helps customers directly access the cloud services they require.
•Low Cost – cloud offers customized solutions, especially to organizations that cannot afford a large upfront investment in infrastructure. The cloud provides a pay-as-you-use option, in which organizations need to sign up only for those resources they actually use.
•Depending upon the architecture used in forming the network, the services and applications used, and the target consumers, cloud services form various deployment models. They are:
§Public Cloud
§Private Cloud
§Community Cloud
§Hybrid Cloud
Public Cloud (End-User Level Cloud)
-Owned and managed by a company other than the one using it.
-Because resources are shared over the public network, services may suffer from higher latency.
•Private Cloud (Enterprise Level Cloud)
-Can also provide firewall protection to the cloud, solving latency and security concerns.
•Community Cloud
-Type of cloud that is shared among various organizations with a common tie.
Eg. In any state, a community cloud can be provided so that almost all government organizations of that state can share the resources available on the cloud.
Because of the sharing of resources on the community cloud, the data of all citizens of that state can be easily managed by the government organizations.
•Hybrid Cloud
-In a hybrid cloud, various internal or external service providers offer services to many organizations.
-In hybrid clouds, an organization can use both types of cloud, i.e. public and private.
-The organization uses its own computing infrastructure for normal usage and accesses the cloud when there is a high load requirement.
The organization using the hybrid cloud can manage an internal private cloud for general
use and migrate the entire or part of an application to the public cloud during the peak
periods.
Big data services from the top three providers
AWS: Amazon Elastic MapReduce
Google Cloud
Azure: Azure HDInsight, Azure Analysis Services, Azure Databricks
CLOUD SERVICES FOR BIG DATA
•In big data, IaaS, PaaS and SaaS clouds are used in the following manner.
•IaaS: The huge storage and computational power requirements of big data are fulfilled by the limitless storage space and computing ability obtained from an IaaS cloud.
•PaaS: The offerings of various vendors have started adding popular big data platforms such as MapReduce and Hadoop. These offerings save organisations from a lot of the hassles that occur in managing individual hardware components and software applications.
•SaaS: Various organisations need to identify and analyse the voice of customers, particularly on social media. Social media data and platforms are provided by SaaS vendors. In addition, a private cloud facilitates access to enterprise data, which enables these analyses.
Big data as a service (BDaaS) solutions:
comparing IaaS, PaaS and SaaS
There are three different BDaaS models. These closely align with the 3 models of
cloud infrastructure: IaaS, PaaS, and SaaS.
Big Data Infrastructure as a Service (IaaS) – Basic data services from a cloud
service provider.
Big Data Platform as a Service (PaaS) – Offerings of an all-round Big Data stack
like those provided by Amazon S3, EMR or RedShift. This excludes ETL and BI.
Big Data Software as a Service (SaaS) – A complete Big Data stack within a single
tool.
https://hub.packtpub.com/big-data-as-a-service-bdaas-solutions-comparing-iaas-paas-and-saas/
Big Data IaaS
For the data layer you have the option of choosing from among:
Hadoop – The Hadoop ecosystem can be run on an EC2 instance giving you complete control
Commercial ETL tools that can run on Amazon’s infrastructure and use S3
Open source processing tools that run on AWS instances, like Kafka
Big Data PaaS
A standard Hadoop cloud-based Big Data Infrastructure on Amazon contains the
following:
Data Ingestion – Log file data from any data source
Amazon S3 – Data storage layer
Amazon EMR – A scalable set of instances that run Map/Reduce against the S3 data.
Amazon RDS – A hosted MySQL database that stores the results from Map/Reduce
computations.
Analytics and Visualization – Using an in-house BI tool.
A similar set up can be replicated using Microsoft’s Azure HDInsight
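As a small illustration of the ingestion and storage pieces of such a setup, the boto3 sketch below pushes a log file into S3 and lists the EMR clusters that could process it; the bucket name and object key are assumptions, and AWS credentials are taken from the environment.

```python
# Minimal boto3 sketch: push a log file into the S3 storage layer and list
# the EMR clusters available to process it. Bucket and key names are
# illustrative assumptions; AWS credentials come from the environment.
import boto3

s3 = boto3.client("s3")
s3.upload_file("app.log", "example-analytics-bucket", "logs/2024/03/01/app.log")

emr = boto3.client("emr")
response = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])
for cluster in response["Clusters"]:
    print(cluster["Name"])
```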
Big Data SaaS
A fully hosted, complete Big Data stack that includes everything from data storage to data visualization contains the following:
Data Layer – Data needs to be pulled into a basic SQL database. An automated data warehouse
does this efficiently
Integration Layer – Pulls the data from the SQL database into a flexible modeling layer
Processing Layer – Prepares the data based on the custom business requirements and logic
provided by the user
Analytics and BI Layer – Fully featured BI abilities which include visualizations, dashboards, and
charts, etc.
Azure Data Warehouse and AWS Redshift are popular SaaS options that offer a complete data warehouse solution in the cloud.
HADOOP
•Hadoop is a distributed system, like a distributed database.
•Hadoop is a ‘software library’ that allows its users to process large datasets across distributed clusters of computers, thereby enabling them to gather, store and analyse huge volumes of data.
•It provides various tools and technologies, collectively termed the Hadoop Ecosystem.
HADOOP MULTINODE CLUSTER ARCHITECTURE
•A Hadoop cluster consists of a single Master Node and multiple Worker Nodes.
1. HDFS – STORAGE
2. YARN – RESOURCE MANAGER
3. MAPREDUCE – PROCESSING UNIT
HDFS
HDFS stores both structured and unstructured data.
MapReduce is built around two functions:
-Mapper function
-Reducer function