Unit 1: Big Data
Velocity. With the growth in the Internet of Things, data streams into businesses at an
unprecedented speed and must be handled in a timely manner. RFID tags, sensors and smart
meters are driving the need to deal with these torrents of data in near-real time.
Variety. Data comes in all types of formats – from structured, numeric data in traditional
databases to unstructured text documents, emails, videos, audio, stock ticker data and financial
transactions.
Big data use cases
Product development: Companies like Netflix and Procter & Gamble use big data to anticipate
customer demand. They build predictive models for new products and services by classifying key
attributes of past and current products or services and modeling the relationship between those
attributes and the commercial success of the offerings.
Predictive maintenance: Factors that can predict mechanical failures may be deeply buried in
structured data, such as the year, make, and model of equipment, as well as in unstructured data that
covers millions of log entries, sensor data, error messages, and engine temperature.
Customer experience: Big data enables you to gather data from social media, web visits, call logs, and other sources to improve the interaction experience and maximize the value delivered. Start delivering personalized offers, reduce customer churn, and handle issues proactively.
Fraud and compliance: Big data helps you identify patterns in data that indicate fraud and aggregate large volumes of information to make regulatory reporting much faster.
Machine learning: Big data is one of the reasons machine learning has taken off. We are now able to teach machines instead of program them, and the availability of big data to train machine-learning models is what makes that possible.
What is big data analytics?
Big data analytics describes the process of uncovering trends, patterns, and correlations in large amounts of raw data to help make data-informed decisions. These processes apply familiar statistical analysis techniques, like clustering and regression, to more extensive datasets with the help of newer tools.
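As a small illustration of applying a familiar technique (clustering) to a larger dataset with a newer tool, the sketch below groups customers with scikit-learn's k-means; the file name and column names are assumptions made for the example.

```python
# Minimal sketch: k-means clustering applied to a tabular dataset.
# "customer_metrics.csv" and its columns are illustrative assumptions.
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("customer_metrics.csv")              # assumed input file
features = df[["annual_spend", "visits_per_month"]]   # assumed numeric columns

# Group customers into 5 clusters based on the selected features.
model = KMeans(n_clusters=5, n_init=10, random_state=0)
df["cluster"] = model.fit_predict(features)

# Inspect the average profile of each cluster.
print(df.groupby("cluster").mean(numeric_only=True))
```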
How big data analytics works
Big data analytics refers to collecting, processing, cleaning, and analyzing large datasets to help organizations operationalize their big data.
1. Collect Data
Data collection looks different for every organization. With today’s technology,
organizations can gather both structured and unstructured data from a variety of sources
— from cloud storage to mobile applications to in-store IoT sensors and beyond. Some
data will be stored in data warehouses where business intelligence tools and solutions can
access it easily. Raw or unstructured data that is too diverse or complex for a warehouse
may be assigned metadata and stored in a data lake.
2. Process Data
Once data is collected and stored, it must be organized properly to get accurate results on
analytical queries, especially when it’s large and unstructured. Available data is growing
exponentially, making data processing a challenge for organizations. One processing option is
batch processing, which looks at large data blocks over time. Batch processing is useful when
there is a longer turnaround time between collecting and analyzing data. Stream processing
looks at small batches of data at once, shortening the delay time between collection and
analysis for quicker decision-making. Stream processing is more complex and often more
expensive.
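The difference between the two modes can be sketched with PySpark, which supports both batch reads and structured streaming; the path, host, and port below are illustrative assumptions.

```python
# Minimal PySpark sketch contrasting batch and stream processing.
# The input path, host, and port are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch processing: analyze a large block of already-collected data.
batch_df = spark.read.json("s3://example-bucket/events/2024/")
batch_df.groupBy("event_type").count().show()

# Stream processing: analyze small batches as they arrive for quicker decisions.
stream_df = (spark.readStream
             .format("socket")            # simple text source for the sketch
             .option("host", "localhost")
             .option("port", 9999)
             .load())
query = (stream_df.groupBy("value").count()
         .writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```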
3. Clean Data
Data big or small requires scrubbing to improve data quality and get stronger results; all
data must be formatted correctly, and any duplicative or irrelevant data must be eliminated
or accounted for. Dirty data can obscure and mislead, creating flawed insights.
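A small pandas sketch of this scrubbing step is shown below; the file and column names are assumptions.

```python
# Minimal data-cleaning sketch with pandas: fix formats and drop duplicate
# or irrelevant records. "raw_orders.csv" and its columns are assumptions.
import pandas as pd

df = pd.read_csv("raw_orders.csv")

df = df.drop_duplicates()                                 # remove duplicative rows
df = df.dropna(subset=["order_id", "amount"])             # drop rows missing key fields
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")  # consistent format
df = df[df["amount"] > 0]                                 # discard invalid records

df.to_csv("clean_orders.csv", index=False)
```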
4. Analyze Data
Getting big data into a usable state takes time. Once it’s ready, advanced analytics
processes can turn big data into big insights. Some of these big data analysis
methods include:
● Hadoop is an open-source framework that efficiently stores and processes big datasets on
clusters of commodity hardware. This framework is free and can handle large amounts of
structured and unstructured data, making it a valuable mainstay for any big data operation.
● NoSQL databases are non-relational data management systems that do not require a fixed schema, making them a great option for big, raw, unstructured data. NoSQL stands for "not only SQL," and these databases can handle a variety of data models.
● MapReduce is an essential component of the Hadoop framework, serving two functions. The first is mapping, which filters data to various nodes within the cluster. The second is reducing, which organizes and reduces the results from each node to answer a query (a small word-count sketch follows this list).
● YARN stands for “Yet Another Resource Negotiator.” It is another component of
second-generation Hadoop. The cluster management technology helps with job
scheduling and resource management in the cluster.
● Spark is an open source cluster computing framework that uses implicit data
parallelism and fault tolerance to provide an interface for programming entire
clusters. Spark can handle both batch and stream processing for fast computation.
● Tableau is an end-to-end data analytics platform that allows you to prep,
analyze, collaborate, and share your big data insights. Tableau excels in
self-service visual analysis, allowing people to ask new questions of governed big
data and easily share those insights across the organization.
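The mapping and reducing functions mentioned above can be sketched as two small scripts in the style of Hadoop Streaming; this is a minimal, illustrative word count, not a definitive implementation.

```python
# mapper.py -- minimal word-count mapper in the Hadoop Streaming style.
# Reads lines from standard input and emits (word, 1) pairs; the framework
# routes pairs with the same key to the same reducer.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sums the counts for each word emitted by the mapper.
# Hadoop Streaming delivers the mapper output sorted by key, so equal words
# arrive consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```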
Big benefits of big data analytics
The ability to analyze more data at a faster rate can provide big benefits to an organization, allowing it to use data more efficiently to answer important questions.
Pig
Pig is a high-level platform or tool used to process large datasets.
Flink
Flink has been designed to run in all common cluster environments, perform computations at
in-memory speed and at any scale.
Spark
Spark is a general-purpose distributed processing system used for big data workloads. It has
been deployed in every type of big data use case to detect patterns, and provide real-time insight.
MongoDB
MongoDB is a document-oriented NoSQL database commonly used to store and query unstructured and semi-structured data.
Big data systems can be considered to have four major layers, each of which is indispensable.
Statistical analytics: Platforms that provide analytics capabilities beyond simple data mining, such as running algorithms that range from simple regressions to advanced neural networks, using tools such as Google TensorFlow or R, fall into this category.
Data management: Data encryption, governance, access control, compliance, and other features salient to any enterprise and production environment form the next basic layer, helping to manage and, in some ways, reduce operational complexity.
End user: The end user of the analytics software forms the final aspect of a big data analytics engagement. The term data scientist is often used to denote individuals who implement the underlying big data analytics capabilities, while business users reap the benefits of faster access and analytics capabilities not available in traditional systems.
Types of Big Data
Structured
Structured data, as the name implies, indicates datasets that have a defined
organizational structure such as Microsoft Excel or CSV files.
Unstructured
Unstructured data consists of any dataset that does not have a predefined organizational schema. Spoken words, music, videos, and even books would be considered unstructured.
Data from social media, such as posts on Twitter, messages from friends on Facebook, and
photos on Instagram, are all examples of unstructured data.
For textual data, technologies such as Lucene/Solr, Elasticsearch, and others are generally used for querying, indexing, and other operations.
Semi-structured
Semi-structured data is usually represented by formats, such as JSON, that can encapsulate both structured data and schemaless or arbitrary associations, generally using key-value pairs.
A more common example is email messages, which have both a structured part, such as the name of the sender and the time when the message was received, and an unstructured portion represented by the body or content of the email.
Platforms such as MongoDB and CouchDB are generally used to store and query semi-structured datasets.
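As a minimal illustration, the email example above can be encoded as JSON, with structured fields alongside an unstructured body; the field names and values are assumptions.

```python
# Minimal sketch of a semi-structured record: structured fields (sender,
# timestamp) plus an unstructured body, held together as JSON key-value pairs.
# All field names and values here are illustrative assumptions.
import json

email = {
    "from": "alice@example.com",            # structured part
    "received_at": "2024-03-01T09:15:00Z",  # structured part
    "tags": ["inbox", "unread"],            # arbitrary, schemaless association
    "body": "Hi team, attached are the quarterly numbers we discussed...",
}

print(json.dumps(email, indent=2))
```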
What is big data mining?
The success of any big data platform depends largely on finding relevant problems in
business units that will deliver measurable value for the department or organization.
Who needs big data mining: Determining which business groups will benefit most significantly from a big data mining solution is the first step in this process.
Most organizations follow a hierarchical structure where the interaction among business
colleagues is likely to be mainly along rank lines.
The key is getting these people involved in the decision-making process and on board with the vision and mission.
As outlined earlier, the implementation process can span multiple steps. These steps are
often iterative in nature and require a trial-and-error approach.
A big data strategy will involve multiple stakeholders, and a collaborative approach often yields the best results. Business sponsors, business support, and IT & Analytics are three broad categories of stakeholders that together create a proper unified solution, catering to the needs of the business to the extent that budget and IT capabilities permit.
Stakeholders of the solution
Business sponsor: The individual or department that provides the support and/or funding for
the project.
Implementation group: The team that implements the solution from a hands-on perspective. This is usually the IT or Analytics department.
Legal: Open source software licenses such as Apache 2.0, MIT, and BSD are generally more permissive relative to the GNU GPL (General Public License). For commercial solutions, the process is more involved as it requires the analysis of vendor-specific agreements.
Implementing the solution
The final implementation of the solution is the culmination of the collaboration between the implementation group, business beneficiaries, and auxiliary departments.
The Hadoop platform provides an improved programming model, which is used to create and run distributed systems quickly and efficiently.
A Hadoop cluster consists of a single master node and multiple worker nodes. The master node contains a NameNode and JobTracker, and a slave or worker node acts as both a DataNode and TaskTracker. Hadoop requires Java Runtime Environment (JRE) 1.6 or a higher version.
There are two main components of Apache Hadoop – the Hadoop Distributed File System
(HDFS) and the MapReduce parallel processing framework.
Who Uses Hadoop?
● British Airways
● Uber
● The Bank of Scotland
● Netflix
● The National Security Agency (NSA), of the United States
● The UK’s Royal Mail system
● Expedia
● Twitter
Installing Hadoop
The most common ones include installing Hadoop from the source files available at https://hadoop.apache.org.
It is an interpreted language like Python and uses a command line interpreter. It
supports procedural as well as generic functions with OOP.
To analyse the datasets, there are many techniques available, some of which are as follows:
Massive Parallelism
Data Distribution
High-Performance Computing
Data Retrieval
Massive Parallelism
According to the simplest definition, a parallel system is a system in which multiple processors work together to carry out concurrent computations.
Massive parallelism refers to a parallel system in which multiple interconnected systems act as a single, powerful conjoint processor and carry out tasks over the datasets in parallel.
In terms of Big Data, it is not only processors but also memory, other hardware, and even the network that can be conjoined to scale up operational efficiency, forming a massive system that can consume humongous datasets in parallel without breaking a sweat.
Data Distribution
There are several approaches to data distribution in a Big Data system, described as follows:
Centralised Approach: A central repository is used to store the data and serve it over the networking infrastructure.
Hierarchical Approach: In a hierarchical approach, the data is fetched from the parent node in the hierarchy.
Task Parallelism
Task parallelism refers to the execution of computer programs across multiple processors on the same or different machines. It emphasises performing diverse operations in parallel to make the best use of the accessible computing resources, such as memory and processors. It is a useful feature when complex applications have multiple tasks that need to be performed at the same time.
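A minimal Python sketch of task parallelism is shown below: two different operations run at the same time on separate processes. The task functions and their inputs are illustrative assumptions.

```python
# Minimal task-parallelism sketch: diverse operations executed in parallel
# on multiple processors, rather than one after the other.
from concurrent.futures import ProcessPoolExecutor

def summarize_sales(path):
    # Stand-in for one kind of task (e.g., aggregating a sales file).
    return f"sales summary built from {path}"

def index_documents(path):
    # Stand-in for a different kind of task (e.g., building a search index).
    return f"index built from {path}"

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        sales_future = pool.submit(summarize_sales, "sales.csv")   # assumed input
        index_future = pool.submit(index_documents, "docs/")       # assumed input
        print(sales_future.result())
        print(index_future.result())
```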
Data Mining and Analytics
Big Data systems utilise Hadoop and the HDFS architecture to retrieve data using MapReduce, a distributed processing framework. It helps programmers solve parallel data problems in which the dataset can be divided into small chunks and handled autonomously.
Machine Learning
Several cloud providers offer services that are relevant specifically to Big Data analytics. Some such providers are as follows:
Amazon
Windows Azure
In-Memory Technology for Big Data
Over the past two to three years, it has become increasingly common for corporations to move their processes to cloud-based environments as a complement to in-house infrastructure.
The term on-premises can be used interchangeably with in-house, on-site, and other similar terminology.
Selection of the hardware stack
The choice of hardware depends on the type of solution, the size of the data (gigabytes versus terabytes versus petabytes), and, to an extent, other considerations.
Multinode architecture: multiple nodes (or servers) are interconnected and work on the principle of multinode or distributed computing.
If the dataset is structured, contains primarily textual data, and is on the order of 1-5 TB, it is entirely possible in today's computing environment to host such datasets on single-node machines.
Cloud-based architecture: Cloud platforms such as Amazon Web Services, Azure from Microsoft, and the Google Compute Environment permit enterprises to provision tens to thousands of nodes at costs starting as low as 1 cent per hour per instance.
The exponential decrease in the cost of hardware: The cost of hardware has gone down
exponentially over the past few years.
Selection of the software stack
The Hadoop ecosystem: Hadoop supports nearly all the types of datasets well known in the big data space, such as structured, unstructured, and semi-structured data. It also has a thriving ecosystem of auxiliary tools that add new functionality.
There are four primary components of Hadoop, apart from the projects present in the large
ecosystem. They are as follows:
Hadoop Common: the common utilities that support the other Hadoop modules
Hadoop Distributed File System (HDFS): a distributed filesystem that provides high-throughput access to application data
Hadoop YARN: a framework for job scheduling and cluster resource management
Hadoop MapReduce: a YARN-based system for parallel processing of large datasets
Apache Spark
Apache Spark is a multinode computing framework first conceived at the University of California, Berkeley's AMPLab as a platform that provides a seamless interface to run parallel computations and overcome limitations of the Hadoop MapReduce framework.
Spark introduces a data structure called Resilient Distributed Datasets (RDDs), which provides a mechanism to store data in memory.
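A minimal PySpark sketch of working with an RDD kept in memory is shown below; the input path is an assumption.

```python
# Minimal RDD sketch: build an RDD, cache it in memory, and reuse it.
# The HDFS path is an illustrative assumption.
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

lines = sc.textFile("hdfs:///data/access.log")
errors = lines.filter(lambda line: "ERROR" in line)

errors.cache()                                            # keep the RDD in memory
print(errors.count())                                     # first action materializes and caches it
print(errors.filter(lambda l: "timeout" in l).count())    # reuses the cached data
```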
Cluster manager: The nodes constituting a Spark cluster communicate using cluster
managers, which manage the overall coordination among the nodes that are part of the
cluster
Distributed storage: Spark can access data from a range of underlying distributed storage
systems such as HDFS, S3 (AWS Storage), Cassandra, HBase, Hive, Tachyon, and any
Hadoop data source
Spark can support multiple types of cluster managers as well as backend storage systems.
NoSQL and traditional databases
The term NoSQL came into existence relatively recently and is meant to distinguish databases that do not follow the traditional relational-database model. There are both open source and commercial variations of NoSQL databases, and even cloud-based options have become increasingly common.
Key-value: These NoSQL databases store data on a principle of hashing—a unique key
identifies a set of properties about the key
In-memory: These databases keep data in memory, for example as caches, to provide faster access relative to storing on disk. NoSQL databases such as Redis and KDB+ leverage temporary in-memory storage in order to provide faster access to frequently used data.
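A minimal sketch of the key-value model with an in-memory store is shown below, using the redis-py client; the host, port, and key names are assumptions.

```python
# Minimal key-value / in-memory sketch with Redis (redis-py client).
# Host, port, and key names are illustrative assumptions.
import redis

r = redis.Redis(host="localhost", port=6379)

# A unique key identifies a set of properties about that key.
r.hset("user:1001", mapping={"name": "Alice", "plan": "premium"})

# Frequently used data is served from memory rather than from disk.
print(r.hgetall("user:1001"))
```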
Columnar: These databases store data as columns rather than rows to create a table. The primary advantage of columnar storage over row-based storage is that a columnar layout provides the means to access data faster with reduced I/O overhead.
Document-oriented: These databases store data that does not conform to any specific schema, such as unstructured text like news articles. Document databases such as MongoDB are widely used in media-related organizations such as the NY Times.
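A minimal pymongo sketch of storing and querying a schemaless article document is shown below; the connection string, database, and fields are assumptions.

```python
# Minimal document-store sketch with MongoDB via pymongo.
# Connection string, database, collection, and fields are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
articles = client["newsroom"]["articles"]

# Documents in the same collection need not share a schema.
articles.insert_one({
    "headline": "Example headline",
    "body": "Unstructured article text ...",
    "tags": ["politics", "economy"],
})

for doc in articles.find({"tags": "economy"}):
    print(doc["headline"])
```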
Cloud-based solutions
Finally, cloud-based solutions for large-scale data mining such as AWS Redshift, Azure SQL Data Warehouse, and Google BigQuery permit users to query datasets directly on the cloud vendor's platform without having to create their own architecture.
● HDFS -> Hadoop Distributed File System
● YARN -> Yet Another Resource Negotiator
● MapReduce -> Data processing using programming
● Spark -> In-memory Data Processing
● PIG, HIVE -> Data Processing Services using Query (SQL-like)
● HBase -> NoSQL Database
● Mahout, Spark MLlib -> Machine Learning
● Apache Drill -> SQL on Hadoop
● Zookeeper -> Managing Cluster
● Oozie -> Job Scheduling
● Flume, Sqoop -> Data Ingesting Services
● Solr & Lucene -> Searching & Indexing
● Ambari -> Provision, Monitor and Maintain cluster
Distributed and Parallel Computing for Big Data
•Big Data can’t be handled by traditional data storage and processing systems.
•In distributed computing, multiple computing resources are connected in a network and computing tasks are distributed among them.
•Complex computations are divided into subtasks, handled individually by processing units running in parallel.
•Concept – processing capability will increase with the increase in the level of parallelism.
Data parallelism paradigm
To solve the big data problems, the data parallelism paradigm can be considered.
Given a large scale dataset D, it can be divided into n subsets, denoted as D1, D2, D3,
.....Dn, where each subset may contain different numbers of data samples and each subset
may or may not have duplicate data samples.
Then, a specific data mining algorithm, implemented on n local machines (or computer nodes), is performed individually over each subset.
Finally, the n mining results are combined via one combination component to produce the
final output.
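The paradigm above can be sketched in a few lines of Python: the dataset is split into n subsets, the same (stand-in) mining step runs on each subset in parallel, and a combination step merges the local results. The mining and combination functions are illustrative assumptions.

```python
# Minimal data-parallelism sketch: split D into n subsets, mine each subset
# in parallel, then combine the n local results into the final output.
from multiprocessing import Pool

def mine_subset(subset):
    # Stand-in for a data mining algorithm run on one node:
    # count the records above a threshold.
    return sum(1 for x in subset if x > 0.5)

def combine(local_results):
    # Combination component that merges the n local results.
    return sum(local_results)

if __name__ == "__main__":
    D = [i / 1000 for i in range(1000)]            # illustrative dataset
    n = 4
    subsets = [D[i::n] for i in range(n)]          # D1, D2, ..., Dn

    with Pool(processes=n) as pool:
        local_results = pool.map(mine_subset, subsets)

    print(combine(local_results))                  # final output
```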
BIG DATA PROCESSING TECHNIQUES
With the increase in data, organizations are being forced to adopt a data analysis strategy that can scale.
•A system with added scalability can accommodate the growing amounts of data.
Companies offering these computing services are called cloud providers and typically charge
for cloud computing services based on usage, similar to how you are billed for water or
electricity at home.
A cloud has the resources and services that a business can use on demand, and the business
doesn't have to build, own or maintain the infrastructure.
Thus, the cloud makes big data technologies accessible and affordable to almost any size of
enterprise.
CLOUD COMPUTING AND BIG DATA
FEATURES OF CLOUD COMPUTING
-New hardware may not provide complete support to the software that used to run properly on the older hardware.
-A solution to this problem is using cloud services, which employ distributed computing resources.
•Resource Pooling - multiple organizations, which use similar kinds of resources to carry
out computing practices, have no need to individually hire all the resources.
•Self Service – cloud computing involves a simple user interface that helps customers directly access the cloud services they require.
•Low Cost – cloud offers customized solutions, especially to organizations that cannot afford a large upfront investment in infrastructure. The cloud provides a pay-as-you-use option, in which organizations need to sign up only for those resources they actually use.
•Depending upon the architecture used in forming the network, the services and applications used, and the target consumers, cloud services form various deployment models. They are:
§Public Cloud
§Private Cloud
§Community Cloud
§Hybrid Cloud
Public Cloud (End-User Level Cloud)
-Owned and managed by a company other than the one using it.
-Because resources are shared over the public network, services may suffer from higher latency.
•Private Cloud (Enterprise Level Cloud)
-Can also provide firewall protection to the cloud, solving latency and security concerns.
•Community Cloud
-Type of cloud that is shared among various organizations with a common tie.
Eg. In any state, a community cloud can be provided so that almost all government organizations of that state can share the resources available on the cloud.
Because of the sharing of resources on the community cloud, the data of all citizens of that state can be easily managed by the government organizations.
•Hybrid Cloud
-In a hybrid cloud, various internal or external service providers offer services to many organizations.
-In hybrid clouds, an organization can use both types of cloud, i.e. public and private.
-The organization uses its own computing infrastructure for normal usage and accesses the cloud when there is a high load requirement.
The organization using the hybrid cloud can manage an internal private cloud for general
use and migrate the entire or part of an application to the public cloud during the peak
periods.
Big data services from the top three providers
AWS: Amazon Elastic MapReduce
Google Cloud
Azure: Azure HDInsight, Azure Analysis Services, Azure Databricks
CLOUD SERVICES FOR BIG DATA
•In big data, IaaS, PaaS and SaaS clouds are used in the following manner.
•IaaS: The huge storage and computational power requirements of big data are fulfilled by the limitless storage space and computing ability obtained from an IaaS cloud.
•PaaS: The offerings of various vendors have started adding popular big data platforms such as MapReduce and Hadoop. These offerings save organisations from a lot of the hassles that occur in managing individual hardware components and software applications.
•SaaS: Various organisations need to identify and analyse the voice of customers, particularly on social media. Social media data and platforms are provided by SaaS vendors. In addition, a private cloud facilitates access to enterprise data, which enables these analyses.
Big data as a service (BDaaS) solutions:
comparing IaaS, PaaS and SaaS
There are three different BDaaS models. These closely align with the 3 models of
cloud infrastructure: IaaS, PaaS, and SaaS.
Big Data Infrastructure as a Service (IaaS) – Basic data services from a cloud
service provider.
Big Data Platform as a Service (PaaS) – Offerings of an all-round Big Data stack
like those provided by Amazon S3, EMR or RedShift. This excludes ETL and BI.
Big Data Software as a Service (SaaS) – A complete Big Data stack within a single
tool.
https://hub.packtpub.com/big-data-as-a-service-bdaas-solutions-comparing-iaas-paas-and-saas/
Big Data IaaS
For the data layer you have the option of choosing from among:
Hadoop – The Hadoop ecosystem can be run on an EC2 instance giving you complete control
Commercial ETL tools that can run on Amazon’s infrastructure and use S3
Open source processing tools that run on AWS instances, like Kafka
Big Data PaaS
A standard Hadoop cloud-based Big Data Infrastructure on Amazon contains the
following:
Data Ingestion – Log file data from any data source
Amazon S3 – Data storage layer
Amazon EMR – A scalable set of instances that run Map/Reduce against the S3 data.
Amazon RDS – A hosted MySQL database that stores the results from Map/Reduce
computations.
Analytics and Visualization – Using an in-house BI tool.
A similar set up can be replicated using Microsoft’s Azure HDInsight
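As a small illustration of the ingestion and storage pieces of such a setup, the boto3 sketch below pushes a log file into S3 and lists the EMR clusters that could process it; the bucket name and object key are assumptions, and AWS credentials are taken from the environment.

```python
# Minimal boto3 sketch: push a log file into the S3 storage layer and list
# the EMR clusters available to process it. Bucket and key names are
# illustrative assumptions; AWS credentials come from the environment.
import boto3

s3 = boto3.client("s3")
s3.upload_file("app.log", "example-analytics-bucket", "logs/2024/03/01/app.log")

emr = boto3.client("emr")
response = emr.list_clusters(ClusterStates=["RUNNING", "WAITING"])
for cluster in response["Clusters"]:
    print(cluster["Name"])
```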
Big Data SaaS
A fully hosted, complete Big Data stack that includes everything from data storage to data visualization contains the following:
Data Layer – Data needs to be pulled into a basic SQL database. An automated data warehouse
does this efficiently
Integration Layer – Pulls the data from the SQL database into a flexible modeling layer
Processing Layer – Prepares the data based on the custom business requirements and logic
provided by the user
Analytics and BI Layer – Fully featured BI abilities which include visualizations, dashboards, and
charts, etc.
Azure Data Warehouse and AWS Redshift are popular SaaS options that offer a complete data warehouse solution in the cloud.
HADOOP
•Hadoop is a distributed system, like a distributed database.
•Hadoop is a ‘software library’ that allows its users to process large datasets across distributed clusters of computers, thereby enabling them to gather, store and analyse huge volumes of data.
•It provides various tools and technologies, collectively termed the Hadoop Ecosystem.
HADOOP MULTINODE CLUSTER ARCHITECTURE
•A Hadoop cluster consists of a single Master Node and multiple Worker Nodes.
1. HDFS – STORAGE
2. YARN – RESOURCE MANAGER
3. MAPREDUCE – PROCESSING UNIT
HDFS
HDFS stores both structured and unstructured data.
MapReduce is built around two functions:
-Mapper function
-Reducer function