
18CSE391T - BIG DATA TOOLS AND TECHNIQUES
Unit 1: Overview of Big Data Analytics
What is Big Data?
Big data refers to data that is so large, fast or complex that it’s difficult or impossible to process using
traditional methods.

The concept of big data gained momentum in the early 2000s.


The three Vs of big data
Volume. Organizations collect data from a variety of sources, including transactions, smart (IoT) devices, industrial equipment, videos, images, audio, social media and more. In the past, storing all that data would have been too costly, but cheaper storage using data lakes, Hadoop and the cloud has eased the burden.

Velocity. With the growth in the Internet of Things, data streams into businesses at an
unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart
meters are driving the need to deal with these torrents of data in near-real time.

Variety. Data comes in all types of formats, from structured, numeric data in traditional databases to unstructured text documents, emails, video, audio, stock ticker data and financial transactions.
Big data use cases

Product development: Companies like Netflix and Procter & Gamble use big data to anticipate
customer demand. They build predictive models for new products and services by classifying key
attributes of past and current products or services and modeling the relationship between those
attributes and the commercial success of the offerings.

Predictive maintenance: Factors that can predict mechanical failures may be deeply buried in
structured data, such as the year, make, and model of equipment, as well as in unstructured data that
covers millions of log entries, sensor data, error messages, and engine temperature.

Customer experience: Big data enables you to gather data from social media, web visits, call logs, and other sources to improve the interaction experience and maximize the value delivered. Start delivering personalized offers, reduce customer churn, and handle issues proactively.

Fraud and compliance: Big data helps you identify patterns in data that indicate fraud, and aggregate large volumes of information to make regulatory reporting much faster.

Machine learning: Big data is one of the reasons machine learning is now practical at scale. The availability of big data to train machine learning models makes it possible to teach machines rather than program them.
What is big data analytics?
Big data analytics describes the process of uncovering trends, patterns, and correlations in large amounts of raw data to help make data-informed decisions. These processes use familiar statistical analysis techniques, like clustering and regression, and apply them to more extensive datasets with the help of newer tools.
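To make this concrete, here is a minimal, hedged sketch of one such familiar technique (clustering) using scikit-learn on a tiny invented dataset; at big data scale the same pattern would typically run on a distributed engine such as Spark MLlib rather than on a single machine.

# Minimal sketch: k-means clustering on a tiny, invented dataset with scikit-learn.
# At big data scale the same idea would run on a distributed engine instead.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer features: [monthly_spend, visits_per_month]
X = np.array([
    [120.0, 4], [130.0, 5], [500.0, 20],
    [480.0, 18], [60.0, 1], [70.0, 2],
])

model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assignment for each customer
print(model.cluster_centers_)  # centroid of each cluster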
How big data analytics works
1. Collect Data
Big data analytics refers to collecting, processing, cleaning, and analyzing large datasets to
help organizations operationalize their big data.

Data collection looks different for every organization. With today’s technology,
organizations can gather both structured and unstructured data from a variety of sources
— from cloud storage to mobile applications to in-store IoT sensors and beyond. Some
data will be stored in data warehouses where business intelligence tools and solutions can
access it easily. Raw or unstructured data that is too diverse or complex for a warehouse
may be assigned metadata and stored in a data lake.
Process and Clean data

2. Process Data

Once data is collected and stored, it must be organized properly to get accurate results on
analytical queries, especially when it’s large and unstructured. Available data is growing
exponentially, making data processing a challenge for organizations. One processing option is
batch processing, which looks at large data blocks over time. Batch processing is useful when
there is a longer turnaround time between collecting and analyzing data. Stream processing
looks at small batches of data at once, shortening the delay time between collection and
analysis for quicker decision-making. Stream processing is more complex and often more
expensive.
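The contrast can be sketched in plain Python (purely illustrative: the event source and the micro-batch size of 3 are invented). Batch processing summarizes an accumulated block once; stream processing summarizes small windows as records arrive, trading extra complexity for lower latency.

# Illustrative sketch only: batch vs. (micro-)stream processing in plain Python.
from itertools import islice

def events():
    # Pretend event source (e.g. sensor or clickstream records).
    return ({"user": i % 3, "amount": i * 1.5} for i in range(10))

def summarize(batch):
    return sum(r["amount"] for r in batch)

# Batch processing: wait until all data is collected, then analyze once.
print("batch total:", summarize(list(events())))

# Stream processing: analyze small windows as data arrives, for quicker decisions.
stream = events()
while True:
    window = list(islice(stream, 3))   # hypothetical micro-batch of 3 records
    if not window:
        break
    print("window total:", summarize(window))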

3. Clean Data

All data, big or small, requires scrubbing to improve data quality and produce stronger results: it must be formatted correctly, and any duplicate or irrelevant data must be eliminated or accounted for. Dirty data can obscure and mislead, creating flawed insights.
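A small, hedged example of such scrubbing with pandas (the toy records and column names are invented): deduplicate, drop incomplete rows, and coerce fields into consistent formats.

# Minimal data-cleaning sketch with pandas; toy records, invented columns.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["a", "a", "b", None],
    "signup":   ["2023-01-05", "2023-01-05", "2023-02-05", "2023-03-01"],
    "spend":    ["10.5", "10.5", "20", "thirty"],
})

clean = (
    raw.drop_duplicates()              # remove duplicate rows
       .dropna(subset=["customer"])    # drop records missing a key field
       .assign(
           signup=lambda d: pd.to_datetime(d["signup"]),                # enforce a date type
           spend=lambda d: pd.to_numeric(d["spend"], errors="coerce"),  # non-numeric values become NaN
       )
)
print(clean)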
Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics
processes can turn big data into big insights. Some of these big data analysis
methods include:

● Data mining sorts through large datasets to identify patterns and relationships by identifying anomalies and creating data clusters.
● Predictive analytics uses an organization’s historical data to make
predictions about the future, identifying upcoming risks and opportunities.
● Deep learning imitates human learning patterns by using artificial
intelligence and machine learning to layer algorithms and find patterns in
the most complex and abstract data.
Big data analytics tools and technology
Several types of tools work together to help you collect, process, cleanse, and analyze big data. Some of the major players in big data ecosystems are listed below.

● Hadoop is an open-source framework that efficiently stores and processes big datasets on
clusters of commodity hardware. This framework is free and can handle large amounts of
structured and unstructured data, making it a valuable mainstay for any big data operation.
● NoSQL databases are non-relational data management systems that do not require a fixed schema, making them a great option for big, raw, unstructured data. NoSQL stands for "not only SQL," and these databases can handle a variety of data models.
● MapReduce is an essential component of the Hadoop framework, serving two functions.
● The first is mapping, which filters data to various nodes within the cluster.
● The second is reducing, which organizes and reduces the results from each node to answer a
query.
● YARN stands for “Yet Another Resource Negotiator.” It is another component of
second-generation Hadoop. The cluster management technology helps with job
scheduling and resource management in the cluster.
● Spark is an open source cluster computing framework that uses implicit data
parallelism and fault tolerance to provide an interface for programming entire
clusters. Spark can handle both batch and stream processing for fast computation.
● Tableau is an end-to-end data analytics platform that allows you to prep,
analyze, collaborate, and share your big data insights. Tableau excels in
self-service visual analysis, allowing people to ask new questions of governed big
data and easily share those insights across the organization.
Big benefits of big data analytics
The ability to analyze more data at a faster rate can provide big benefits to an organization, allowing it to use data more efficiently to answer important questions.

Some benefits of big data analytics include:

● Cost savings. Helping organizations identify ways to do business more efficiently
● Product development. Providing a better understanding of
customer needs
● Market insights. Tracking purchase behavior and market trends
Pig

Pig is a high-level platform used to process large datasets.
Flink

Flink has been designed to run in all common cluster environments, perform computations at
in-memory speed and at any scale.
Spark

Spark is a general-purpose distributed processing system used for big data workloads. It has been deployed in every type of big data use case to detect patterns and provide real-time insight.
Mongo DB

MongoDB is a document database that provides high performance, high availability, and easy scalability. It is a cross-platform, document-oriented database system classified as a NoSQL database that bridges the gap between key-value stores and traditional RDBMS systems.
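As a brief, hedged illustration of the document model, the sketch below uses pymongo against a MongoDB server assumed to be running locally on the default port; the database, collection, and field names are invented.

# Minimal pymongo sketch (assumes a local MongoDB on the default port 27017).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["demo_db"]   # hypothetical database name

# Documents are schemaless, JSON-like objects.
db.products.insert_one({"name": "sensor-01", "tags": ["iot", "temperature"], "price": 19.9})

# Query by field, combining key-value style lookup with richer structure.
print(db.products.find_one({"name": "sensor-01"}))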
Building blocks of big data analytics

Big data systems can be considered to have four major layers, each of which is indispensable.

Hardware: Servers that provide the computing backbone, storage devices that store the data, and network connectivity across different server components are some of the elements that define the hardware stack.

Software: Software resources that facilitate analytics on the datasets hosted in the hardware layer. Analytics software can be classified into various subdivisions:

Data mining: Software that provides facilities for aggregations, joins across datasets, and pivot tables on large datasets falls into this category. Standard NoSQL platforms such as Cassandra, Redis, and others are high-level data mining tools for big data analytics.

Statistical analytics: Platforms that provide analytics capabilities beyond simple data mining,
such as running algorithms that can range from simple regressions to advanced neural
networks such as Google TensorFlow or R, fall into this category.
Data management: Data encryption, governance, access, compliance, and other features
salient to any enterprise and production environment to manage and, in some ways, reduce
operational complexity form the next basic layer.

End user: The end user of the analytics software forms the final aspect of a big data analytics engagement. The term data scientist is often used to denote individuals who implement the underlying big data analytics capabilities, while business users reap the benefits of faster access and analytics capabilities not available in traditional systems.
Types of Big Data

Data can be broadly classified as being structured, unstructured, or semi-structured.

Structured

Structured data, as the name implies, indicates datasets that have a defined
organizational structure such as Microsoft Excel or CSV files.

Commercial databases such as Teradata and Greenplum, as well as Redis, Cassandra, and Hive in the open source domain, are examples of technologies that provide the ability to manage and query structured data.
Unstructured

Unstructured data consists of any dataset that does not have a predefined organizational
schema as in the table in the prior section. Spoken words, music, videos, and even books,
including this one, would be considered unstructured.

Data from social media, such as posts on Twitter, messages from friends on Facebook, and
photos on Instagram, are all examples of unstructured data.

For textual data, technologies such as Lucene/Solr, Elasticsearch, and others are generally used for querying, indexing, and other operations.
Semi-structured

Semi-structured data is usually represented by formats, such as JSON, that can encapsulate both structured as well as schemaless or arbitrary associations, generally using key-value pairs.

A more common example could be email messages, which have both a structured part, such as the name of the sender and the time the message was received, and an unstructured portion represented by the body or content of the email.

Platforms such as MongoDB and CouchDB are generally used to store and query semi-structured datasets.
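To make the email example concrete, here is a small sketch (with invented field names) of one semi-structured record that mixes structured fields with an unstructured body, parsed using Python's standard json module.

# Semi-structured data sketch: structured fields plus free-form content in one record.
import json

raw = """
{
  "sender": "alice@example.com",
  "received": "2024-05-01T09:30:00Z",
  "subject": "Quarterly report",
  "body": "Hi team, attached is the draft... let me know what you think."
}
"""

message = json.loads(raw)
print(message["sender"])     # structured part: easy to index and query
print(len(message["body"]))  # unstructured part: needs text-analytics tooling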
What is big data mining?

Big data mining refers to the entire life cycle of processing large-scale datasets, from procurement to implementation of the respective tools to analyze them.
Big data mining in the enterprise

The success of any big data platform depends largely on finding relevant problems in
business units that will deliver measurable value for the department or organization.

Who needs big data mining: Determining which business groups will benefit most
significantly from a big data mining solution is the first step in this process

As an example, in a pharmaceutical organization, this could include Commercial Research, Epidemiology, Health Economics, and Outcomes. At a financial services organization, this could include Algorithmic Trading Desks, Quantitative Research, and even Back Office.
Determining the use cases:

Most organizations follow a hierarchical structure where the interaction among business colleagues is likely to be mainly along rank lines.

Identifying the right use cases therefore requires close collaboration between the practitioner and the stakeholder.


Stakeholders' buy-in:

Getting these people involved in the decision-making process and on board with the vision and mission typically involves the following steps:

1. Engage stakeholders as early as possible. ...


2. Provide a clear rationale for the project. ...
3. Speak their language. ...
4. Identify and manage risks. ...
5. Make expectations clear.
Early-wins and the effort-to-reward ratio:

A relatively small use case that can be implemented in a short time and within a smaller budget, to optimize a specific business-critical function, helps in showcasing early-wins, thus adding credibility to the big data solution in question.

Effort is the time and work required to implement the use case. This includes aspects such as how long it would take to procure the relevant hardware and/or software that is part of the solution, the resources or equivalent man-hours it will take to implement the solution, and the overall operational overhead.
Leveraging the early-wins:

The successful implementation of one or more of the projects in the early-wins phase often lays the groundwork to develop a bigger strategy for the big data analytics platform that goes far beyond the needs of just a single department and has a broader organizational-level impact.

An early win is the implementation of a small outcome or objective that can be realized quickly and early in a project's timeframe.
Implementation life cycle

As outlined earlier, the implementation process can span multiple steps. These steps are
often iterative in nature and require a trial-and-error approach.

A Big Data strategy will include multiple stakeholders, and a collaborative approach often yields the best results. Business sponsors, business support, and IT & Analytics are three broad categories of stakeholders that together create a properly unified solution, catering to the needs of the business to the extent that budget and IT capabilities permit.
Stakeholders of the solution

Business sponsor: The individual or department that provides the support and/or funding for
the project.

Implementation group: The team that implements the solution from a hands-on perspective.
This is usually the IT or Analytics department

IT procurement: Compliance with internal IT policies and assessment of other aspects such as licensing costs are some of the services provided by procurement, especially for commercial products.

Legal: Open source software licenses such as Apache 2.0, MIT, and BSD are generally more permissive relative to the GNU GPL (General Public License). For commercial solutions, the process is more involved, as it requires the analysis of vendor-specific agreements.
Implementing the solution
The final implementation of the solution is the
culmination of the collaboration between the
implementation group, business beneficiaries,
and auxiliary departments.

The time to undertake projects from start to end can vary anywhere from 3-6 months for most small-sized projects, as explained in the section on early-wins.

Larger endeavors can take several months to years to accomplish and are marked by an agile framework of product management where capabilities are added incrementally during the implementation and deployment period.
Hadoop

• Hadoop is an open-source platform that provides the analytical technologies and computational power required to work with very large volumes of data.

• The Hadoop platform provides an improved programming model, which is used to create and run distributed systems quickly and efficiently.

• A Hadoop cluster consists of a single MasterNode and multiple worker nodes. The MasterNode contains a NameNode and JobTracker, and a slave or worker node acts as both a DataNode and TaskTracker. Hadoop requires Java Runtime Environment (JRE) 1.6 or a higher version of the JRE.

• There are two main components of Apache Hadoop: the Hadoop Distributed File System (HDFS) and the MapReduce parallel processing framework.
Who Uses Hadoop?

● British Airways
● Uber
● The Bank of Scotland
● Netflix
● The National Security Agency (NSA), of the United States
● The UK’s Royal Mail system
● Expedia
● Twitter
Installing Hadoop

There are several ways to install Hadoop. The most common ones are:

1. Installing Hadoop from the source files available at https://hadoop.apache.org
2. Installing open source distributions from commercial vendors such as Cloudera and Hortonworks

Cloudera Distribution of Apache Hadoop (CDH) is an integrated platform consisting of several Hadoop and Apache-related products. Cloudera is a popular commercial Hadoop vendor that provides managed services for enterprise-scale Hadoop deployments.
Hadoop Distributed File System (HDFS)

• The Hadoop Distributed File System (HDFS) is a fault-tolerant storage system in Hadoop. It stores large files, from terabytes to petabytes, across different machines.
• Data is replicated on three nodes: two on the same rack and one on a different rack.
• Each file in HDFS is split into large blocks (64 MB by default, typically 64 to 128 megabytes), and each block of the file is independently replicated at multiple DataNodes.
• The NameNode actively monitors the number of replicas of each block (3 by default). When a replica of a block is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block.
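As a hedged illustration of interacting with HDFS, the sketch below shells out to the standard hdfs dfs commands from Python; it assumes a configured Hadoop client is on the PATH, and the local file and HDFS paths are invented.

# Minimal sketch: copy a file into HDFS and list it, by calling the `hdfs dfs`
# CLI from Python. Assumes a configured Hadoop client on the PATH; paths are hypothetical.
import subprocess

def hdfs(*args):
    return subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/data/raw")               # create a directory in HDFS
hdfs("-put", "-f", "events.log", "/data/raw/")  # upload a local file (replicated by HDFS)
hdfs("-ls", "/data/raw")                        # list the directory contents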
R
• R is an open source programming language and environment for statistical computing and graphics, developed by the R Foundation for Statistical Computing.

• It is an interpreted language, like Python, and uses a command-line interpreter. It supports procedural as well as generic functions with OOP.

• R is extensively used by data miners and statisticians, providing a vast variety of graphical and statistical techniques, including linear and nonlinear modelling, time-series analysis, classical statistical tests, clustering, classification and others.

• R is easily extendable through functions and available extension packages.


Technical Elements of Big
Data Platform
Big Data Techniques

—To analyse the datasets, there are many techniques available, some of which are as follows:

—Massive Parallelism

—Data Distribution

—High-Performance Computing

—Task and Thread Management

—Data Mining and Analytics

—Data Retrieval
Massive Parallelism

• According to the simplest definition available, a parallel system is a system in which multiple processors work together to carry out concurrent computations.
• Massive parallelism refers to a parallel system in which many interconnected systems act as a single, powerful conjoint processor and carry out tasks over the datasets in parallel.
• In Big Data terms, it is not only processors that can be conjoined but also memory, hardware and even networks, scaling up operational efficiency into a massive system that can process huge datasets in parallel with ease.
Data Distribution

• There are several approaches to data distribution in a Big Data system, described as follows:

• Centralised Approach: A central repository is used to store the dataset, from which virtual machines download the data they need.

• Semi-Centralised Approach: The semi-centralised approach reduces the stress on the networking infrastructure.

• Hierarchical Approach: In a hierarchical approach, the data is fetched from the parent node (virtual machine) in the hierarchy.

• P2P Approach: P2P streaming connections are based on hierarchical multi-trees.


High-Performance Computing

• The terms supercomputing and high-performance computing are often used interchangeably.
• High-performance computing is the simultaneous use of supercomputers and parallel processing techniques for solving intricate computational problems.
• High-performance computing is used for performing research activities and cracking advanced problems through computer simulation, modelling and analysis.
Task and Thread Management

• Task and thread management is a useful feature when complex applications with multiple tasks need to be performed at the same time.
• Task parallelism refers to the execution of computer programmes across multiple processors on different or the same machines. It emphasises performing diverse operations in parallel to make the best use of available computing resources such as memory and processors.
Data Mining and Analytics

• Data mining is the process of extracting data, evaluating it from multiple perspectives and then producing an information summary in a meaningful form that identifies one or more relationships within the dataset.
• Data analytics is about applying an algorithmic or logical process to derive insights from a given dataset. For example, by looking at the past year's weather and pest data for the current month, we can determine that a particular type of fungus often grows when the humidity level reaches a definite point.
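A compact, hedged sketch of that analytics idea with pandas, using invented historical records: filter past observations by a chosen humidity threshold and measure how often the fungus appeared.

# Illustrative analytics sketch (invented data): how often was fungus observed
# when humidity crossed a chosen threshold in past records?
import pandas as pd

history = pd.DataFrame({
    "humidity": [55, 62, 71, 80, 83, 90, 66, 78],
    "fungus":   [0,  0,  0,  1,  1,  1,  0,  1],   # 1 = fungus observed
})

threshold = 75
rate = history.loc[history["humidity"] >= threshold, "fungus"].mean()
print(f"fungus observed in {rate:.0%} of past records with humidity >= {threshold}")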
Data Retrieval

• Big Data refers to the large amounts of multi-structural data that continuously flows around and within organisations, and includes text, video, transactional records and sensor logs.

• Big Data systems utilise Hadoop and the HDFS architecture to retrieve data using MapReduce, a distributed processing framework.

• It helps programmers solve parallel data problems where the dataset can be divided into small chunks and handled autonomously.
Machine Learning

• Machine learning systems utilise multiple algorithms to discover and show the patterns hidden in datasets.
Data Visualisation
Cloud Providers in Big Data Market

Some of these providers render services that are relevant to Big Data analytics only. Some such providers are as follows:

—Amazon

—Google

—Windows Azure
In-Memory Technology for Big Data

• In-memory big data computing tools support processing of high-velocity data in real time as well as faster processing of stationary data. Relevant technologies include event-streaming platforms, in-memory databases and analytics, and high-level messaging structures.
At the technical level, there are two main considerations:
• Selection of the hardware stack
• Selection of the software and BI (business intelligence) platform

Over the recent 2-3 years, it has become increasingly common for corporations to move their processes to cloud-based environments as a complementary solution to in-house infrastructure.

On-premises can be used interchangeably with In-house, On-site, and other similar
terminologies
Selection of the hardware stack

The choice of hardware depends on:
• the type of solution,
• the type of data (structured, unstructured, or semi-structured),
• the size of the data (gigabytes versus terabytes versus petabytes), and, to an extent,
• the frequency with which the data will be updated.


There are three broad models of hardware architecture:

Multinode architecture: Multiple nodes (or servers) are interconnected and work on the principle of multinode or distributed computing. An example is Hadoop, where multiple servers maintain bi-directional communication to coordinate a job. Other technologies, such as the NoSQL database Cassandra and the search and analytics platform Elasticsearch, also run on the principle of multinode computing architecture.
Single-node architecture:

Single-node architecture: Single-node refers to computation done on a single server.

If the dataset is structured, contains primarily textual data, and is in the order of 1-5 TB, in
today’s computing environment, it is entirely possible to host such datasets on single-node
machines

Cloud-based architecture: Cloud platforms such as Amazon Web Services, Azure from
Microsoft, and the Google Compute Environment permit enterprises to provision 10s to
1000s of nodes at costs starting as low as 1 cent per hour per instance.

The exponential decrease in the cost of hardware: The cost of hardware has gone down
exponentially over the past few years.
Selection of the software stack
The Hadoop ecosystem: Hadoop supports nearly all the types of datasets well known in the big data space, such as structured, unstructured, and semi-structured. It also has a thriving ecosystem of auxiliary tools that add new functionality.

There are four primary components of Hadoop, apart from the projects present in the large
ecosystem. They are as follows:

• Hadoop Common: The common utilities that support the other Hadoop modules
• Hadoop Distributed File System (HDFS): A distributed filesystem that provides high-throughput access to application data
• Hadoop YARN: A framework for job scheduling and cluster resource management
• Hadoop MapReduce: A YARN-based system for parallel processing of large datasets
Apache Spark
Spark is a multinode computing framework first conceived at the University of California, Berkeley's AMPLab as a platform that provides a seamless interface for running parallel computations and overcomes limitations of the Hadoop MapReduce framework.

In particular, Spark internally leverages a concept known as DAGs (directed acyclic graphs) to optimize a set of operations into a smaller, or more computationally efficient, set of operations.

Spark exposes several APIs (application programming interfaces) for commonly used languages such as Python (PySpark) and Scala (the natively available interface).

Spark introduces a data structure called the Resilient Distributed Dataset (RDD), which provides a mechanism to store data in memory.
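A short, hedged PySpark sketch of the RDD idea follows: a collection is partitioned across the cluster and transformed in parallel, with the DAG of operations executed lazily when an action is called. It assumes pyspark is installed and uses a local master; the data is invented.

# Minimal PySpark sketch (assumes `pip install pyspark`; runs with a local master).
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# An RDD: the list is partitioned across workers and processed in parallel.
readings = sc.parallelize([3.1, 4.7, 2.2, 9.8, 5.5, 7.0])
high = readings.filter(lambda x: x > 5.0)   # transformation, recorded in the DAG

print(high.count())     # action: triggers execution of the optimized DAG
print(high.collect())

spark.stop()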
Resilient Distributed Datasets (RDD)

Cluster manager: The nodes constituting a Spark cluster communicate using cluster
managers, which manage the overall coordination among the nodes that are part of the
cluster

Distributed storage: Spark can access data from a range of underlying distributed storage
systems such as HDFS, S3 (AWS Storage), Cassandra, HBase, Hive, Tachyon, and any
Hadoop data source

Spark can support multiple types of cluster managers as well as backend storage systems.

NoSQL and traditional databases

The term NoSQL came into existence recently and is meant to distinguish databases that do
not follow the traditional relational-database models. There are both open source and
commercial variations of NoSQL databases and indeed even cloud-based options that have
become increasingly common

Key-value: These NoSQL databases store data on the principle of hashing: a unique key identifies a set of properties about that key.

In-memory: These databases keep data, such as caches, in memory to provide faster access relative to storing it on disk.

NoSQL databases such as Redis and KDB+ leverage in-memory storage in order to provide faster access to frequently used data.
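A small, hedged example of the key-value idea using redis-py against a Redis server assumed to be running locally; the key name and value are invented.

# Minimal key-value sketch with Redis (assumes a local server on port 6379
# and `pip install redis`; key names are hypothetical).
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# A unique key maps to a value; lookups are hash-based and fast.
r.set("session:42", "user=alice;cart=3")
print(r.get("session:42"))    # -> b'user=alice;cart=3'

# Keeping hot data in memory is what makes repeated access fast.
r.expire("session:42", 3600)  # optional TTL of one hour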
Columnar

These databases store data column by column, as opposed to row by row, to create a table. The primary advantage of columnar storage over row-based storage is that a columnar layout provides the means to access data faster with reduced I/O overhead.

Well-known columnar databases include Cassandra, Google BigTable


Document-oriented

Document-oriented databases store data that does not conform to any specific schema, such as unstructured text like news articles.

Document databases such as MongoDB are widely used in media-related organizations such as The New York Times.
Cloud-based solutions

Finally, cloud-based solutions for large-scale data mining, such as AWS Redshift, Azure SQL Data Warehouse, and Google BigQuery, permit users to query datasets directly on the cloud vendor's platform without having to create their own architecture.
● HDFS -> Hadoop Distributed File System
● YARN -> Yet Another Resource Negotiator
● MapReduce -> Data processing using
programming
● Spark -> In-memory Data Processing
● PIG, HIVE-> Data Processing Services using
Query (SQL-like)
● HBase -> NoSQL Database
● Mahout, Spark MLlib -> Machine Learning
● Apache Drill -> SQL on Hadoop
● Zookeeper -> Managing Cluster
● Oozie -> Job Scheduling
● Flume, Sqoop -> Data Ingesting Services
● Solr & Lucene -> Searching & Indexing
● Ambari -> Provision, Monitor and Maintain
cluster
Distributed and parallel Computing
for Big Data
BIG DATA

• Big Data can't be handled by traditional data storage and processing systems.
• For handling such data, distributed and parallel technologies are more suitable.


DISTRIBUTED AND PARALLEL
COMPUTING FOR BIG DATA
• Distributed Computing: Multiple computing resources are connected in a network, and computing tasks are distributed across these resources.
  - Increases speed
  - Increases efficiency
  - More suitable for processing huge amounts of data in a limited time
Parallel Computing

•Also improves the processing capability of a computer system by adding additional

computational resources to it.

•Divide complex computations into subtasks, handled individually by processing units, running in

parallel.

Concept – processing capability will increase with the increase in the level of parallelism.
Data parallelism paradigm

To solve the big data problems, the data parallelism paradigm can be considered.

Given a large scale dataset D, it can be divided into n subsets, denoted as D1, D2, D3,
.....Dn, where each subset may contain different numbers of data samples and each subset
may or may not have duplicate data samples.

Then, a specific data mining algorithm implemented in n local machines (or computer
nodes) individually is performed over each subset.

Finally, the n mining results are combined via one combination component to produce the
final output.
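A compact, hedged Python sketch of this data-parallel pattern: split the dataset into n subsets, run the same "mining" function on each subset in parallel with multiprocessing, then combine the partial results. The counting task stands in for a real data mining algorithm.

# Data parallelism sketch: partition the data, process subsets in parallel,
# then combine the partial results (the "mining" step here is a toy count).
import random
from multiprocessing import Pool

def mine(subset):
    # Stand-in for a real data mining algorithm: count values above a threshold.
    return sum(1 for x in subset if x > 0.5)

def split(data, n):
    k = max(1, len(data) // n)
    return [data[i:i + k] for i in range(0, len(data), k)]

if __name__ == "__main__":
    D = [random.random() for _ in range(1_000_000)]   # large-scale dataset D
    subsets = split(D, 4)                             # D1 ... Dn

    with Pool(processes=4) as pool:                   # n local workers
        partial_results = pool.map(mine, subsets)

    print("combined result:", sum(partial_results))   # combination component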
PARALLEL COMPUTING
BIG DATA PROCESSING TECHNIQUES
The increase in data is forcing organizations to adopt a data analysis strategy that can be used for analysing the entire dataset in a very short time.

This is done with powerful hardware components and new software programs.

The procedure followed by the software applications is:
1) Break up the given task
2) Survey the available resources
3) Assign the subtasks to the nodes
DISTRIBUTED COMPUTING TECHNIQUE
FOR PROCESSING LARGE DATA
MERITS OF THE SYSTEM
Scalability

A system with added scalability can accommodate growing amounts of data more efficiently and flexibly.

Virtualization and Load Balancing Features

Load Balancing - the sharing of workload across various systems.
Virtualization - creates a virtual environment spanning the hardware platform, storage devices and OS.
Distributed System vs Parallel System

Distributed System:
• Independent autonomous systems connected in a network for accomplishing a specific task.
• Coordination is possible between connected computers, each with its own memory and CPU.
• Loose coupling of computers connected in a network, providing access to data and remotely located resources.

Parallel System:
• A computer system with several processing units attached to it.
• Common shared memory can be directly accessed by every processing unit.
• Tight coupling of processing resources that are used for solving a single, complex problem.
CLOUD COMPUTING AND BIG DATA
Cloud can provide almost limitless computing resources and services that make big data
initiatives possible for any business.

Cloud Computing is the delivery of computing services—servers, storage, databases,


networking, software, analytics and more—over the Internet (“the cloud”).

Companies offering these computing services are called cloud providers and typically charge
for cloud computing services based on usage, similar to how you are billed for water or
electricity at home.

A cloud has the resources and services that a business can use on demand, and the business
doesn't have to build, own or maintain the infrastructure.

Thus, the cloud makes big data technologies accessible and affordable to almost any size of
enterprise.
FEATURES OF CLOUD COMPUTING

• Scalability - the addition of new resources to an existing infrastructure.
  - An increase in the amount of data requires an organization to improve its hardware components.
  - The new hardware may not fully support the software that used to run properly on the earlier set of hardware.
  - The solution to this problem is using cloud services, which employ the distributed computing technique to provide scalability.

• Elasticity - hiring certain resources as and when required, and paying only for those resources.
  - No extra payment is required for acquiring specific cloud services.
  - A cloud does not require customers to declare their resource requirements in advance.

• Resource Pooling - multiple organizations that use similar kinds of resources to carry out computing practices have no need to individually hire all the resources.
• Self Service - cloud computing involves a simple user interface that helps customers directly access the cloud services they want.

• Low Cost - the cloud offers customized solutions, especially to organizations that cannot afford too much initial investment.
  - The cloud provides a pay-as-you-use option, in which organizations need to sign up only for those resources that are essential.

• Fault Tolerance - offering uninterrupted services to customers.


Cloud deployment models.

• Depending upon the architecture used in forming the network, the services and applications used, and the target consumers, cloud services form various deployment models. They are:

• Public Cloud
• Private Cloud
• Community Cloud
• Hybrid Cloud
Public Cloud (End-User Level Cloud)
- Owned and managed by a company other than the one using it.
- Administered by a third party.
- E.g., Verizon, Amazon Web Services, and Rackspace.
- The workload is categorized on the basis of service category; hardware customization is possible to provide optimized performance.
- The process of computing becomes very flexible and scalable through customized hardware resources.
- The primary concerns with a public cloud include security and latency.
•Private Cloud (Enterprise Level Cloud)

-Remains entirely in the ownership of the organization using it.


-Infrastructure is solely designed for a single organization.
-Can automate several processes and operations that require manual handling in a public cloud.

-Can also provide firewall protection to the cloud, solving latency and security concerns.

-A private cloud can be either on-premises or hosted externally.

On-premises: the service is exclusively used and hosted by a single organization.


•Community Cloud

- A type of cloud that is shared among various organizations with a common tie.
- Managed by third-party cloud services.
- Available on or off premises.

- E.g., in a state, a community cloud can be provided so that almost all government organizations of that state can share the resources available on the cloud.

- Because of this sharing of resources on the community cloud, the data of all citizens of that state can be easily managed by the government organizations.
•Hybrid Cloud
- Various internal or external service providers offer services to many organizations.

- In hybrid clouds, an organization can use both types of cloud, i.e. public and private together, in situations such as cloud bursting.

- The organization uses its own computing infrastructure for normal load and accesses the cloud when the load requirement is high.

- The organization using the hybrid cloud can manage an internal private cloud for general use and migrate all or part of an application to the public cloud during peak periods.
big data services from the top three
providers
AWS:
• Amazon Elastic MapReduce
• AWS Deep Learning AMIs
• Amazon SageMaker

Google Cloud:
• Google BigQuery
• Google Cloud Dataproc
• Google Cloud AutoML

Microsoft Azure:
• Azure HDInsight
• Azure Analysis Services
• Azure Databricks
CLOUD SERVICES FOR BIG DATA

• In big data, IaaS, PaaS and SaaS clouds are used in the following manner.
• IaaS: The huge storage and computational power requirements for big data are fulfilled by the limitless storage space and computing ability offered by an IaaS cloud.
• PaaS: Offerings of various vendors have started adding popular big data platforms such as MapReduce and Hadoop. These offerings save organisations from a lot of the hassles that occur in managing individual hardware components and software applications.
• SaaS: Various organisations need to identify and analyse the voice of customers, particularly on social media. Social media data and platforms are provided by SaaS vendors. In addition, a private cloud facilitates access to enterprise data, which enables these analyses.
Big data as a service (BDaaS) solutions:
comparing IaaS, PaaS and SaaS
There are three different BDaaS models. These closely align with the 3 models of
cloud infrastructure: IaaS, PaaS, and SaaS.
Big Data Infrastructure as a Service (IaaS) – Basic data services from a cloud
service provider.
Big Data Platform as a Service (PaaS) - Offerings of an all-round Big Data stack like those provided by Amazon S3, EMR or Redshift. This excludes ETL and BI.
Big Data Software as a Service (SaaS) - A complete Big Data stack within a single tool.

https://hub.packtpub.com/big-data-as-a-service-bdaas-solutions-comparing-iaas-paas-and-saas/
Big Data IaaS

For the data layer, you have the option of choosing from among:

Hadoop – The Hadoop ecosystem can be run on an EC2 instance giving you complete control

NoSQL Databases – These include MongoDB or Cassandra

Relational Databases – These include PostgreSQL or MySQL

For the compute layer, you can choose from among:

Self-built ETL scripts that run on EC2 instances

Commercial ETL tools that can run on Amazon’s infrastructure and use S3

Open source processing tools that run on AWS instances, like Kafka
Big Data PaaS
A standard Hadoop cloud-based Big Data Infrastructure on Amazon contains the
following:
Data Ingestion – Logs file data from any data source
Amazon S3 Data Storage Layer
Amazon EMR – A scalable set of instances that run Map/Reduce against the S3 data.
Amazon RDS – A hosted MySQL database that stores the results from Map/Reduce
computations.
Analytics and Visualization – Using an in-house BI tool.
A similar set up can be replicated using Microsoft’s Azure HDInsight
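As a hedged illustration of the S3 storage layer in such a stack, the sketch below uploads a local log file to S3 with boto3; AWS credentials are assumed to be configured, and the bucket name and object key are invented.

# Minimal sketch: push raw log data into the S3 storage layer with boto3.
# Assumes configured AWS credentials; bucket and key names are hypothetical.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    Filename="events.log",           # local file produced by the ingestion step
    Bucket="my-bigdata-raw-bucket",  # hypothetical S3 bucket
    Key="raw/2024/05/events.log",    # partition-style key layout
)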
Big Data SaaS

A fully hosted Big Data stack that includes everything from data storage to data visualization contains the following:

Data Layer – Data needs to be pulled into a basic SQL database. An automated data warehouse
does this efficiently

Integration Layer – Pulls the data from the SQL database into a flexible modeling layer

Processing Layer – Prepares the data based on the custom business requirements and logic
provided by the user

Analytics and BI Layer – Fully featured BI abilities which include visualizations, dashboards, and
charts, etc.

Azure Data Warehouse and AWS Redshift are the popular SaaS options that offer a complete data
warehouse solution in the cloud
HADOOP
•Hadoop is a distributed system like distributed database.

• Hadoop is a 'software library' that allows its users to process large datasets across distributed clusters of computers, thereby enabling them to gather, store and analyse huge sets of data.

• It provides various tools and technologies, collectively termed the Hadoop Ecosystem.
HADOOP MULTINODE CLUSTER
ARCHITECTURE
• A Hadoop cluster consists of a single Master Node and multiple Worker Nodes.
• Master Node: NameNode, JobTracker
• Worker Node: DataNode, TaskTracker
Three major components in Hadoop

1. HDFS - STORAGE
2. YARN - RESOURCE MANAGER
3. MAPREDUCE - PROCESSING UNIT
HDFS

• We can store any format of data (XML, JSON, ...).
• In-place modifications are not possible (data is written once; there is no random access).
• Hadoop is OLAP-oriented, not OLTP.
• Hadoop is written in Java.


HADOOP MULTINODE CLUSTER
ARCHITECTURE
• To process the data, the JobTracker assigns tasks to the TaskTrackers.
• If a DataNode goes down while processing is going on, the NameNode must know that the DataNode is down in the cluster; otherwise it can't continue processing.
• Each DataNode sends a "heartbeat" signal every few minutes to make the NameNode aware of the active/inactive status of DataNodes - the Heartbeat Mechanism.
Hadoop Distributed File System (HDFS)

• A fault-tolerant storage system.
• Stores large files, from terabytes to petabytes.
• Attains reliability by replicating the data over multiple hosts.
• The default replication value is 3.
MAPREDUCE

• A parallel processing framework.

• Helps developers write programs to process large volumes of unstructured data.

- Mapper function
- Reducer function
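To make the mapper/reducer idea concrete, here is a small, hedged Python sketch that simulates the MapReduce flow (map, shuffle/group, reduce) in memory for a word count; in a real Hadoop job these phases run in parallel across the cluster (for example via Hadoop Streaming), and the sample lines are invented.

# Word-count sketch of the MapReduce programming model, simulated in memory.
from collections import defaultdict

lines = ["big data tools", "big data techniques", "hadoop mapreduce"]

# Mapper function: emit (key, value) pairs, here (word, 1).
def mapper(line):
    for word in line.split():
        yield word, 1

# Shuffle phase: group values by key (done by the framework in Hadoop).
groups = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        groups[word].append(count)

# Reducer function: combine the values for each key.
def reducer(word, counts):
    return word, sum(counts)

print([reducer(w, c) for w, c in groups.items()])   # e.g. ('big', 2), ('data', 2), ...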
