Bigdata Mod-1

Syllabus

MODULE-I [8 Hrs]

Introduction: Big Data Overview, BI Versus Data Science, Current Analytical Architecture,
Drivers of Big Data. Data Analytics Lifecycle - Overview, Phases - Discovery, Data Preparation
and Model planning, Model building, Communicate Results and Operationalize.
Industry examples of Big Data

Introduction:

Big Data Overview

90% of the world's data was generated in the last few years.
Due to the advent of new technologies, devices, and communication means like social
networking sites, the amount of data produced by mankind is growing rapidly every year. The
amount of data produced by us from the beginning of time till 2003 was 5 billion gigabytes;
piled up in the form of disks, it might fill an entire football field. The same amount was created
every two days in 2011, and every ten minutes in 2013. This rate is still growing enormously.
Though all this information is meaningful and can be useful when processed, much of it is
being neglected.

What is Big Data?

Big data is a collection of large datasets that cannot be processed using traditional computing
techniques. It is not a single technique or a tool; rather, it has become a complete subject, which
involves various tools, techniques, and frameworks.

What Comes Under Big Data?

Big data involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.
 Black Box Data − This is a component of helicopters, airplanes, jets, etc. It captures
the voices of the flight crew, recordings from microphones and earphones, and the
performance information of the aircraft.
 Social Media Data − Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
 Stock Exchange Data − The stock exchange data holds information about the ‘buy’ and
‘sell’ decisions made by customers on shares of different companies.
 Power Grid Data − The power grid data holds information about the power consumed
by a particular node with respect to a base station.
 Transport Data − Transport data includes model, capacity, distance and availability of a
vehicle.
 Search Engine Data − Search engines retrieve lots of data from different databases.

Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data
in it will be of three types:
 Structured data − Relational data.
 Semi Structured data − XML data.
 Unstructured data − Word, PDF, Text, Media Logs.
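As a rough illustration of these three types, the Python sketch below (with made-up sample records) shows how each form might appear in code: a schema-bound row, a self-describing XML fragment parsed with the standard library, and free text that carries no schema at all.

```python
import xml.etree.ElementTree as ET

# Structured: a fixed schema, like a relational table row
row = {"id": 1, "name": "Alice", "balance": 2500.0}

# Semi-structured: XML carries its own tags, but no rigid table schema
xml_doc = "<user><name>Alice</name><city>Pune</city></user>"
parsed = ET.fromstring(xml_doc)
city = parsed.find("city").text

# Unstructured: free text with no inherent schema; needs parsing or NLP to use
log_line = "User Alice logged in from Pune at 10:32"

print(row["name"], city, "Pune" in log_line)
```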

Benefits of Big Data

 Using the information kept in social networks like Facebook, marketing agencies
learn about the response to their campaigns, promotions, and other advertising
media.
 Using information from social media, such as the preferences and product perceptions
of their consumers, product companies and retail organizations plan their
production.
 Using data from patients' previous medical histories, hospitals provide better and
quicker service.

Big Data Technologies

Big data technologies are important in providing more accurate analysis, which may lead to
more concrete decision-making resulting in greater operational efficiencies, cost reductions, and
reduced risks for the business.
To harness the power of big data, you would require an infrastructure that can manage and
process huge volumes of structured and unstructured data in real time and can protect data
privacy and security.
There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we
examine the following two classes of technology −

Operational Big Data

These include systems like MongoDB that provide operational capabilities for real-time,
interactive workloads where data is primarily captured and stored.
NoSQL Big Data systems are designed to take advantage of new cloud computing architectures
that have emerged over the past decade to allow massive computations to be run inexpensively
and efficiently. This makes operational big data workloads much easier to manage, cheaper, and
faster to implement.
Some NoSQL systems can provide insights into patterns and trends based on real-time data with
minimal coding and without the need for data scientists and additional infrastructure.
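As a toy illustration of spotting a pattern in real-time data with minimal coding, the sketch below groups a hypothetical in-memory event stream; in practice the events would come from a NoSQL store such as MongoDB.

```python
from collections import Counter

# Hypothetical stream of real-time purchase events (illustrative data only)
events = [
    {"item": "phone", "minute": 0},
    {"item": "phone", "minute": 0},
    {"item": "laptop", "minute": 0},
    {"item": "phone", "minute": 1},
]

# Count purchases per item to surface a trending product with minimal code
trend = Counter(e["item"] for e in events)
top_item, top_count = trend.most_common(1)[0]
print(top_item, top_count)  # the most frequent item and its count
```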

Analytical Big Data

These include systems like Massively Parallel Processing (MPP) database systems and
MapReduce that provide analytical capabilities for retrospective and complex analysis that may
touch most or all of the data.
MapReduce provides a method of analyzing data that is complementary to the capabilities
provided by SQL, and a system based on MapReduce can be scaled up from single servers
to thousands of high- and low-end machines.
These two classes of technology are complementary and frequently deployed together.
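The MapReduce style of analysis can be illustrated with the classic word-count example. The sketch below is a single-process simulation of the map, shuffle, and reduce steps, not a distributed implementation:

```python
from itertools import groupby

def map_phase(docs):
    # Map: emit a (word, 1) pair for every word in every document
    for doc in docs:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: sort and group the pairs by word; Reduce: sum the counts
    counts = {}
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        counts[word] = sum(count for _, count in group)
    return counts

docs = ["big data big insight", "big value"]
result = reduce_phase(map_phase(docs))
print(result)  # {'big': 3, 'data': 1, 'insight': 1, 'value': 1}
```

In a real MapReduce framework the map and reduce functions run in parallel across many machines, with the framework handling the shuffle between them.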

Operational vs. Analytical Systems

                  Operational          Analytical

Latency           1 ms - 100 ms        1 min - 100 min

Concurrency       1000 - 100,000       1 - 10

Access Pattern    Writes and Reads     Reads

Queries           Selective            Unselective

Data Scope        Operational          Retrospective

End User          Customer             Data Scientist

Technology        NoSQL                MapReduce, MPP Database


Big Data Analytics (BDA)

Big data analytics is the often complex process of examining big data to uncover information,
such as hidden patterns, correlations, market trends, and customer preferences, that can help
organizations make informed business decisions.

On a broad scale, data analytics technologies and techniques give organizations a way to analyze
data sets and gather new information. Business intelligence (BI) queries answer basic questions
about business operations and performance.

Big data analytics is a form of advanced analytics, which involves complex applications with
elements such as predictive models, statistical algorithms, and what-if analysis powered by
analytics systems.

Why is big data analytics important?

Organizations can use big data analytics systems and software to make data-driven decisions that
can improve business-related outcomes. The benefits may include more effective marketing, new
revenue opportunities, customer personalization and improved operational efficiency. With an
effective strategy, these benefits can provide competitive advantages over rivals.

What is big data analytics with examples?


Big data analytics helps businesses gain insights from today's huge data resources. People,
organizations, and machines now produce massive amounts of data; social media, cloud
applications, and machine sensor data are just some examples.
Big Data Challenges

The major challenges associated with big data are as follows −

 Capturing data
 Curation
 Storage
 Searching
 Sharing
 Transfer
 Analysis
 Presentation
To address the above challenges, organizations normally take the help of enterprise servers.
BI Versus Data Science

Difference Between Data Science and Business Intelligence

Data Science:
Data science is a field in which information and knowledge are extracted from data
using various scientific methods, algorithms, and processes. It can thus be defined as a
combination of mathematical tools, algorithms, statistics, and machine learning
techniques used to find the hidden patterns and insights in data that help
in the decision-making process. Data science deals with both structured and unstructured
data, and is related to both data mining and big data. It involves studying historic
trends and using its conclusions to redefine present trends and predict future ones.
Business Intelligence:
Business intelligence (BI) is a set of technologies, applications, and processes that are
used by enterprises for business data analysis. It converts raw data
into meaningful information, which is then used for business decision-making and profitable
actions. It deals with the analysis of structured and sometimes unstructured data, which paves the
way for new and profitable business opportunities. It supports decision-making based on facts
rather than assumptions, and thus has a direct impact on the business
decisions of an enterprise. Business intelligence tools improve an enterprise's chances of
entering a new market and help in studying the impact of marketing efforts.

Below are the differences between Data Science and Business Intelligence, factor by factor:

Concept:
  Data Science: A field that uses mathematics, statistics, and various other tools to discover the hidden patterns in the data.
  Business Intelligence: A set of technologies, applications, and processes used by enterprises for business data analysis.

Focus:
  Data Science: Focuses on the future.
  Business Intelligence: Focuses on the past and present.

Data:
  Data Science: Deals with both structured and unstructured data.
  Business Intelligence: Mainly deals only with structured data.

Flexibility:
  Data Science: Much more flexible, as data sources can be added as per requirement.
  Business Intelligence: Less flexible, as data sources need to be pre-planned.

Method:
  Data Science: Makes use of the scientific method.
  Business Intelligence: Makes use of the analytic method.

Complexity:
  Data Science: Higher complexity in comparison to business intelligence.
  Business Intelligence: Much simpler when compared to data science.

Expertise:
  Data Science: Its expert is the data scientist.
  Business Intelligence: Its expert is the business user.

Questions:
  Data Science: Deals with the questions "what will happen" and "what if".
  Business Intelligence: Deals with the question "what happened".

Tools:
  Data Science: SAS, BigML, MATLAB, Excel, etc.
  Business Intelligence: InsightSquared Sales Analytics, Klipfolio, ThoughtSpot, Cyfe, TIBCO Spotfire, etc.

Current Analytical Architecture


What is current analytical architecture?

Analytics architecture refers to the systems, protocols, and technology
used to collect, store, and analyze data. The concept is an umbrella term
for a variety of technical layers that allow organizations to more effectively
collect, organize, and parse the multiple data streams they utilize.
What is big data analytics architecture?
A big data architecture is designed to handle the ingestion, processing,
and analysis of data that is too large or complex for traditional database
systems. Big data solutions typically involve one or more of the following
types of workload: Batch processing of big data sources at rest.

1. WHAT IS BIG DATA?


There is no definition per se for Big Data. Big data, in short, is a technique used to solve business
or worldly problems with the help of all the data that makes up the business or the specific
context of a worldly problem, which includes data in any form, and data at any rate. A popular
description of Big Data uses the 3, 4 or 5 V’s to describe the characteristics of data being
consumed by the Big Data ecosystem. The V’s are Volume, Variety, Velocity, Veracity and
Variability. In today’s times, when Big Data is mentioned, it often is equated with advanced
forms of data analytics like data modelling and predictive analytics.

Big Data’s challenges keep changing depending on the context or the ecosystem and include
capturing, analysis, storage technology, visualization, sharing, transfer, and privacy concerns.
Multiple technologies partake in the Big Data model to handle the V’s mentioned earlier as
efficiently as possible. For structured data, RDBMS or distributed RDBMS is used. For
unstructured data, NoSQL databases like MongoDB is used. For data ingestion Apache Flume
and other specialized tools are used. One of the prominent and well known, Big Data
frameworks is Hadoop. There are other custom implementations of the Hadoop ecosystem from
the market’s major players like Amazon, Microsoft, Oracle and IBM.

2. APPLICATIONS
We shall quickly run through a few of the top applications of Big Data in the real world before
we dive into Big Data Architecture.

 Big Data in the Banking and Securities Industry


The banking industry uses Big Data for risk analysis, anti-money laundering, and detection of
other financial frauds, so that such frauds are quickly detected and mitigated. Banks use risk
analysis to analyse the creditworthiness of a potential customer. In many countries, Securities
and Exchange boards use Big Data to perform network analytics and Natural Language
Processing to catch illegal trading activities.

 Big Data in Communications, Media and Entertainment


Media content is driven by analysing customer data at Big Data’s scale to understand
behavioural patterns, likes and dislikes of a broad set of customers. YouTube is a big example of
Big Data at work, driving ad revenues using the information derived by Big Data analytics.

 Big Data in Healthcare


Big Data is being used to deliver evidence-based diagnosis instead of a battery of medical tests,
thus bringing down costs and improving efficiency in medical circles. Big Data is also used in
Machine Learning models to detect a condition based on x-ray images, thus saving valuable time
and lives.

 Big data in Manufacturing


Big Data today is extensively used in manufacturing for Supply Chain Management, Predictive
Maintenance, Predictive Quality, Production Forecasting, improving throughput and yield and
more.

 Big Data in Insurance


Big Data helps insurance companies in minimizing underwriting risks and improving fraud
detection in claims.

 Retail Trade
The biggest application of big data is probably in the retail and e-commerce industry, with data
analysis from various streams, including social media, for producing targeted adverts. Sentiment
analysis also plays a major role in gauging customer feedback on all products, allowing issues
to be fixed quickly before they turn into a major loss.

3. BIG DATA ARCHITECTURE DIAGRAM


Let’s now dive straight into Big Data Architecture.

There are several Big Data products on the market, but you still have to design the system to suit
your business’s particular needs. You will need a Big Data Architect to design your Big Data
solution catering to your unique business ecosystem. Big Data has a generic architecture that
applies to most businesses at a high level, and you may not need all of its components for a
successful implementation.

To start with, Big Data is known to have at least 6 layers to its architecture. They are

 Data Ingestion Layer


You need your Big Data setup to handle all incoming data streams, whether structured,
unstructured or semi-structured and at speeds that match the rate at which data is coming in. This
is achieved at the Data Ingestion Layer. The incoming data is prioritized and categorized for a
smooth flow into further layers down the line.
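One way the ingestion layer's prioritization might be sketched is with a simple priority queue; the priority values and record names below are purely illustrative:

```python
import heapq

# Incoming records tagged with a priority (lower number = more urgent),
# as the ingestion layer might do before handing data downstream
incoming = [
    (2, "clickstream event"),
    (1, "fraud alert"),
    (3, "batch log line"),
]

heap = []
for priority, record in incoming:
    heapq.heappush(heap, (priority, record))

# Downstream layers receive the most urgent data first
ordered = [heapq.heappop(heap)[1] for _ in range(len(heap))]
print(ordered)  # ['fraud alert', 'clickstream event', 'batch log line']
```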

 Data Collector Layer


The data collector layer is concerned with the transportation of data from the data ingestion layer
to the rest of the pipeline. Here the components are decoupled to allow analytical processing.

 Data Processing Layer


The processing layer is where the analytical process begins: the data needed for analysis is
selected, cleaned, and formatted for further analysis and modelling.

 Data Storage Layer


This layer is critical to Big Data. After all, it is all about data. The volume and velocity of
data directly impact the Data Storage Layer. The storage solution should be in line with
the data ingestion requirements of your business ecosystem.

 Data Query Layer


This layer is where active analytical processing of data takes place.
 Data Visualization Layer
This layer is everything to do with a graphical representation of information and value gained
through analysis. Using rich charts, graphs and maps, the tools in this layer help present a
compelling story for a decision to be made by your leadership team.

 Data Monitoring Layer


This layer involves Data Profiling and Lineage, Data Quality, Data Cleansing, Data Loss
Prevention.

Here is a representation of Big Data Architecture with just the Big Data components shown.

4. PATTERNS
There are several design patterns that are like templates that you can select for your business.
These design patterns are based on the layer in context. We shall mention a few patterns in the
ingestion, data storage layers.

Data Source and Ingestion layers

A few design patterns exist for this layer, namely,

 Multisource Extractor
 Multi-destination
 Protocol Converter
 Just-in-time Transformation
 Real-time streaming pattern
Data Storage Layer

With ACID (atomicity, consistency, isolation, and durability), BASE (basically available, soft
state, eventually consistent) and CAP (consistency, availability, and partition tolerance)
paradigms, several design patterns have been built for the storage layer, namely,

 Façade pattern
 NoSQL pattern
 Polyglot pattern
There are a few more design patterns available that you may attempt to explore.

5. EXAMPLES
Here we list out a few well-known examples of Big Data Architecture shaped according to the
business ecosystem they were developed for and evolved into their current state.

 Streaming
This kind of Big Data architecture allows for real-time data ingestion of critical business data
that needs to be taken care of or responded to in real-time. The velocity and variety of data here
is the key pivot around which this architecture has evolved.

 General-purpose
This kind of Big Data Architecture provides generic storage and processing capabilities that are
applicable in most businesses.

 NoSQL engines
Big Data architectures of this kind stress dealing with data that comes in at high velocity, is
high in volume, and arrives in a variety of types (structured, unstructured, or semi-structured).

 Enterprise Datawarehouse
A Big Data Architecture inspired by an enterprise Datawarehouse that stores a separate database
of historical data for years, using it for analytical purposes.

 In-place Analytics
An architecture that allows data to be left “in place” in a low-cost storage engine where ad-hoc
queries can be run without the need for separate and expensive clusters.

CONCLUSION
Without any doubt, it can be said that Big Data will be the technology that businesses run on, be
it on-premises or on the cloud. There is no denying Big Data is the technology that facilitates
Machine Learning and Artificial intelligence and is a backend skill that will be sought after in
every industry and business in the near future.
Or

Big Data Architecture Definition

Big data architecture refers to the logical and physical structure that dictates how high volumes
of data are ingested, processed, stored, managed, and accessed.

‍What is Big Data Architecture?


Big data architecture is the foundation for big data analytics. It is the overarching system used to
manage large amounts of data so that it can be analyzed for business purposes, steer data
analytics, and provide an environment in which big data analytics tools can extract vital business
information from otherwise ambiguous data. The big data architecture framework serves as a
reference blueprint for big data infrastructures and solutions, logically defining how big data
solutions will work, the components that will be used, how information will flow, and security
details.
The architecture of big data analytics typically consists of four logical layers and
involves four major processes:
Big Data Architecture Layers

 Big Data Sources Layer: a big data environment can manage both batch processing and
real-time processing of big data sources, such as data warehouses, relational database
management systems, SaaS applications, and IoT devices.
 Management & Storage Layer: receives data from the source, converts the data into a
format comprehensible for the data analytics tool, and stores the data according to its
format.
 Analysis Layer: analytics tools extract business intelligence from the big data storage
layer.
 Consumption Layer: receives results from the big data analysis layer and presents them to
the pertinent output layer - also known as the business intelligence layer.


Big Data Architecture Processes

 Connecting to Data Sources: connectors and adapters can efficiently connect to any
format of data and to a variety of different storage systems, protocols, and networks.
 Data Governance: includes provisions for privacy and security, operating from the
moment of ingestion through processing, analysis, storage, and deletion.
 Systems Management: highly scalable, large-scale distributed clusters are typically the
foundation for modern big data architectures, which must be monitored continually via
central management consoles.‍
 Protecting Quality of Service: the Quality of Service framework supports the defining of
data quality, compliance policies, and ingestion frequency and sizes.


In order to benefit from the potential of big data, it is crucial to invest in a big data infrastructure
that is capable of handling huge quantities of data. These benefits include: improving
understanding and analysis of big data, making better decisions faster, reducing costs, predicting
future needs and trends, encouraging common standards and providing a common language, and
providing consistent methods for implementing technology that solves comparable problems.
Big data infrastructure challenges include the management of data quality, which requires
extensive analysis; scaling, which can be costly and affect performance if not sufficient; and
security, which increases in complexity with big data sets.

Big Data Architecture

Establishing big data architecture components before embarking upon a big data project is a
crucial step in understanding how the data will be used and how it will bring value to the
business. Implementing the following big data architecture principles for your big data
architecture strategy will help in developing a service-oriented approach that ensures the data
addresses a variety of business needs.

 Preliminary Step: A big data project should be in line with the business vision and have a
good understanding of the organizational context, the key drivers of the organization,
data architecture work requirements, architecture principles and framework to be used,
and the maturity of the enterprise architecture. It is also important to have a thorough
understanding of the elements of the current business technology landscape, such as
business strategies and organizational models, business principles and goals, current
frameworks in use, governance and legal frameworks, IT strategy, and any pre-existing
architecture frameworks and repositories.

 Data Sources: Before any big data solution architecture is coded, data sources should be
identified and categorized so that big data architects can effectively normalize the data to
a common format. Data sources can be categorized as either structured data, which is
typically formatted using predefined database techniques, or unstructured data, which
does not follow a consistent format, such as emails, images, and Internet data.
 Big Data ETL: Data should be consolidated into a single Master Data Management
system for querying on demand, either via batch processing or stream processing. For
processing, Hadoop has been a popular batch processing framework. For querying, the
Master Data Management system can be stored in a data repository such as a NoSQL-based
or relational DBMS.
 Data Services API: When choosing a database solution, consider whether or not there is a
standard query language, how to connect to the database, the ability of the database to
scale as data grows, and which security mechanisms are in place. ‍
 User Interface Service: a big data application architecture should have an intuitive design
that is customizable, available through current dashboards in use, and accessible in the
cloud. Standards like Web Services for Remote Portlets (WSRP) facilitate the serving of
User Interfaces through Web Service calls.
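The Big Data ETL step described above can be sketched in miniature; the two source formats and the master schema below are hypothetical:

```python
# Extract: raw records from two hypothetical sources with different formats
source_a = [{"cust_id": "1", "spend": "100.5"}]
source_b = [{"customer": 2, "amount": 50.0}]

def transform(rec):
    # Normalize both source formats into one common master schema
    if "cust_id" in rec:
        return {"id": int(rec["cust_id"]), "spend": float(rec["spend"])}
    return {"id": rec["customer"], "spend": rec["amount"]}

# Load: consolidate into a single in-memory "master" store, keyed by id
master = {}
for rec in source_a + source_b:
    row = transform(rec)
    master[row["id"]] = row

print(master[1]["spend"], master[2]["spend"])  # 100.5 50.0
```

A real Master Data Management system would replace the in-memory dictionary with a queryable store, but the normalize-then-consolidate pattern is the same.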

How to Build a Big Data Architecture

Designing a big data reference architecture, while complex, follows the same general procedure:

 Analyze the Problem: First determine if the business does in fact have a big data problem,
taking into consideration criteria such as data variety, velocity, and challenges with the
current system. Common use cases include data archival, process offload, data lake
implementation, unstructured data processing, and data warehouse modernization.
 Select a Vendor: Hadoop is one of the most widely recognized big data architecture tools
for managing big data end to end architecture. Popular vendors for Hadoop distribution
include Amazon Web Services, BigInsights, Cloudera, Hortonworks, Mapr, and
Microsoft.
 Deployment Strategy: Deployment can be either on-premises, which tends to be more
secure; cloud-based, which is cost effective and provides flexibility regarding scalability;
or a mix deployment strategy.
 Capacity Planning: When planning hardware and infrastructure sizing, consider daily
data ingestion volume, data volume for one-time historical load, the data retention period,
multi-data center deployment, and the time period for which the cluster is sized.
 Infrastructure Sizing: This is based on capacity planning and determines the number of
clusters/environment required and the type of hardware required. Consider the type of
disk and number of disks per machine, the types of processing memory and memory size,
number of CPUs and cores, and the data retained and stored in each environment.‍
 Plan a Disaster Recovery: In developing a backup and disaster recovery plan, consider
the criticality of data stored, the Recovery Point Objective and Recovery Time Objective
requirements, backup interval, multi datacenter deployment, and whether Active-Active
or Active-Passive disaster recovery is most appropriate.
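The capacity-planning and infrastructure-sizing steps above boil down to simple arithmetic; all figures in this sketch are assumed values for illustration, not recommendations:

```python
# Back-of-envelope cluster storage sizing under assumed figures
daily_ingest_tb = 2.0      # assumed daily data ingestion volume, in TB
retention_days = 365       # assumed data retention period
replication_factor = 3     # copies kept for fault tolerance (HDFS default)
overhead = 1.2             # ~20% headroom for intermediate and temp data

raw_tb = daily_ingest_tb * retention_days
total_tb = raw_tb * replication_factor * overhead
print(round(total_tb, 1))  # total storage to provision, in TB
```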

Drivers of Big Data


Big Data is of three types:
 Structured Data – It is that data that can be processed, stored, and can be retrieved in
a fixed format i.e., we can say that when data is stored and extracted, it should be in a
proper manner/layout.
 Unstructured Data – This type of data lacks any structure and is stored as it is.
Analyzing such data is very time-consuming as well as challenging.

 Semi-structured Data – This type of data is a mix of the above two types, i.e.,
structured and unstructured data. It is also known as hybrid big data.

7 V’s of Big Data

1. Volume
2. Variety
3. Velocity
4. Variability
5. Veracity
6. Visualization
7. Value

1. Volume
This is the main characteristic of big data. The term volume is what makes big data “BIG”.
With a massive amount of data generated daily, gigabytes are no longer enough to store it, so
data is now measured in terms of Zettabytes, Exabytes, and Yottabytes. For instance, almost
50 hours of video are uploaded to YouTube every single minute.
Now imagine how much data is being generated on YouTube alone.
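The storage units mentioned above are related by powers of 1000 (in decimal SI units); the short sketch below makes the relationships concrete:

```python
# Decimal (SI) storage units: each step up is a factor of 1000
GB = 10**9    # gigabyte
TB = 10**12   # terabyte
PB = 10**15   # petabyte
EB = 10**18   # exabyte
ZB = 10**21   # zettabyte

print(ZB // EB)  # 1000 exabytes make one zettabyte
print(ZB // GB)  # a zettabyte is a trillion gigabytes
```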

2. Variety
Here variety means types of data sources. As discussed above, big data can be of various types –
structured, semi-structured, and unstructured.
In today's world, the data generated in large quantities is mostly unstructured, such as
audio files, videos, images, and text files.
Such data is difficult to map because it does not follow any set of rules, which makes
essential data hard to sort out.

3. Velocity
Velocity here refers to how fast the data can be processed and accessed. For example, social
media posts, YouTube videos, audio files, and images that are uploaded by the thousands every
second should be accessible as early as possible.

4. Variability
Variability is different from the variety. Variability refers to the data which keeps on changing
constantly.
Variability mainly focuses on understanding and interpreting the correct meanings of raw data.
For example – A soda shop may offer 6 different blends of soda, but if you get the same blend of
soda every day and it tastes different every day, that is variability.
The same is true of data: if it is continuously changing, it can have an impact on
the quality of your data.

5. Veracity
If your data is not accurate, it is of no use, and here comes the concept of Veracity. It is all about
making sure the data gathered by you is accurate and also keeping the bad data away from your
systems.
It is also the trustworthiness or quality of the data that a company receives and processes to
derive useful insights.
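A minimal veracity check might look like the sketch below, which filters out records with missing identifiers or implausible values; the field names and validation rules are illustrative:

```python
def is_valid(record):
    # Basic veracity checks: required fields present and values plausible
    return (
        record.get("user_id") is not None
        and isinstance(record.get("age"), int)
        and 0 < record["age"] < 120
    )

raw = [
    {"user_id": 1, "age": 34},
    {"user_id": None, "age": 28},  # missing identifier -> bad data
    {"user_id": 3, "age": 999},    # implausible value -> bad data
]

clean = [r for r in raw if is_valid(r)]
print(len(clean))  # only the trustworthy record survives
```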
6. Visualization
Visualization here refers to how you present your data to management for decision-making
purposes.
Data can be presented in many ways, such as Excel files, Word documents, and graphical
charts. Irrespective of the format, the data should be easily readable, understandable, and
accessible, which is why data visualization is important.
7. Value
Value is the end game in big data. Every user needs to understand that the organization
expects some value after efforts are made and resources are spent on the above-mentioned V's.
Big data can provide value if it is done and processed correctly.
Data Analytics Lifecycle – Overview
The Data Analytics Lifecycle is a cyclic process which explains, in six stages, how information is
made, collected, processed, implemented, and analyzed for different objectives.
Data Analytics Lifecycle:
The data analytics lifecycle is designed for Big Data problems and data science projects. The cycle is
iterative to represent a real project. To address the distinct requirements for performing analysis on Big
Data, a step-by-step methodology is needed to organize the activities and tasks involved with
acquiring, processing, analyzing, and repurposing data.
Phase 1: Discovery –
 The data science team learns and investigates the problem.
 Develops context and understanding.
 Comes to know about the data sources needed and available for the project.
 The team formulates initial hypotheses that can later be tested with data.
Phase 2: Data Preparation –
 Steps to explore, preprocess, and condition data prior to modeling and analysis.
 It requires the presence of an analytic sandbox; the team executes extract, load, and transform
(ELT) processes to get data into the sandbox.
 Data preparation tasks are likely to be performed multiple times and not in predefined order.
 Several tools commonly used for this phase are – Hadoop, Alpine Miner, Open Refine, etc.
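A tiny example of the kind of cleaning done in this phase, using only the standard library and a made-up record format: trimming whitespace, normalizing case, handling a missing field, and dropping duplicates.

```python
# Raw records as they might arrive in the analytic sandbox (illustrative)
raw = [
    " Alice ,34",   # stray whitespace
    "BOB,",         # missing age
    " Alice ,34",   # exact duplicate
]

cleaned, seen = [], set()
for line in raw:
    name, _, age = line.partition(",")
    rec = (name.strip().title(), int(age) if age else None)
    if rec not in seen:  # drop exact duplicates
        seen.add(rec)
        cleaned.append(rec)

print(cleaned)  # [('Alice', 34), ('Bob', None)]
```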
Phase 3: Model Planning –
 Team explores data to learn about relationships between variables and subsequently, selects key
variables and the most suitable models.
 In this phase, data science team develop data sets for training, testing, and production purposes.
 Team builds and executes models based on the work done in the model planning phase.
 Several tools commonly used for this phase are – Matlab, STATISTICA.
Phase 4: Model Building –
 Team develops datasets for testing, training, and production purposes.
 Team also considers whether its existing tools will suffice for running the models or if they need
more robust environment for executing models.
 Free or open-source tools – R and PL/R, Octave, WEKA.
 Commercial tools – Matlab, STATISTICA.
Phase 5: Communication Results –
 After executing the model, the team needs to compare the outcomes of the modeling to the
criteria established for success and failure.
 The team considers how best to articulate findings and outcomes to various team members and
stakeholders, taking into account warnings and assumptions.
 Team should identify key findings, quantify business value, and develop narrative to summarize
and convey findings to stakeholders.
Phase 6: Operationalize –
 The team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening it to a full enterprise of users.
 This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before full deployment.
 The team delivers final reports, briefings, and code.
 Free or open-source tools: Octave, WEKA, SQL, MADlib.
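These six phases are not strictly sequential; teams routinely loop back to earlier phases when exit criteria are not met. The following Python sketch models that iterative movement (the phase names come from the list above, but the gate function and fallback logic are invented purely for illustration):

```python
# Hypothetical sketch of the iterative analytics lifecycle.
# Phase names follow the six phases above; the "gate" logic is illustrative.
PHASES = [
    "Discovery",
    "Data Preparation",
    "Model Planning",
    "Model Building",
    "Communicate Results",
    "Operationalize",
]

def run_lifecycle(phase_passes):
    """Walk the phases in order; if a phase's exit criteria fail,
    fall back to the previous phase (the cycle is iterative)."""
    history, i = [], 0
    while i < len(PHASES):
        phase = PHASES[i]
        history.append(phase)
        if phase_passes(phase):
            i += 1             # criteria met: advance
        else:
            i = max(i - 1, 0)  # criteria not met: revisit earlier work
    return history

# Example: suppose Model Building fails once before succeeding,
# forcing a return to Model Planning.
attempts = {"Model Building": 0}
def gate(phase):
    if phase == "Model Building":
        attempts[phase] += 1
        return attempts[phase] > 1  # fail the first attempt only
    return True

print(run_lifecycle(gate))
```

When every gate passes, the phases run straight through once; the failed Model Building attempt above forces one extra pass through Model Planning before the project reaches Operationalize.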
Or
Life Cycle Phases of Data Analytics
The Data Analytics Lifecycle is a cyclic process that explains, in six stages, how information is created, collected, processed, implemented, and analyzed for different objectives.
Data Discovery
This is the initial phase, in which you set your project's objectives and plan how to carry out the complete data analytics lifecycle. Start by defining your business domain and ensuring you have enough resources (time, technology, data, and people) to achieve your goals.
The biggest challenge in this phase is accumulating enough information. You need to draft an analytic plan, which requires some serious legwork.
Accumulate resources
First, you have to analyze the models you intend to develop. Then determine how much domain knowledge you need to acquire to fulfill those models.
The next important thing to do is assess whether you have enough skills and resources to bring
your projects to fruition.
Frame the issue
Problems are most likely to occur while meeting your client's expectations. Therefore, you need
to identify the issues related to the project and explain them to your clients. This process is called
"framing." You have to prepare a problem statement explaining the current situation and
challenges that can occur in the future. You also need to define the project's objective, including
the success and failure criteria for the project.
Formulate initial hypothesis
Once you gather all the clients' requirements, you have to develop initial hypotheses after
exploring the initial data.
Data Preparation and Processing
The Data preparation and processing phase involves collecting, processing, and conditioning
data before moving to the model building process.
Identify data sources
You have to identify various data sources and analyze how much and what kind of data you can
accumulate within a given timeframe. Evaluate the data structures, explore their attributes and
acquire all the tools needed.
Collection of data
You can collect data using three methods:
Data acquisition: You can collect data through external sources.
Data Entry: You can prepare data points through digital systems or manual entry as well.
Signal reception: You can accumulate data from digital devices such as IoT devices and control
systems.
Model Planning
This is a phase where you have to analyze the quality of data and find a suitable model for your
project.
Loading Data in Analytics Sandbox
An analytics sandbox is a part of data lake architecture that allows you to store and process large
amounts of data. It can efficiently process a large range of data such as big data, transactional
data, social media data, web data, and many more. It is an environment that allows your analysts
to schedule and process data assets using the data tools of their choice. The best part of the
analytics sandbox is its agility. It empowers analysts to process data in real-time and get essential
information within a short duration.
Data is loaded into the sandbox in three ways:
ETL − Team specialists transform the data to comply with business rules before loading it into the sandbox.
ELT − The data is loaded into the sandbox and then transformed as per business rules.
ETLT − A hybrid approach that comprises two levels of data transformation, combining ETL and ELT.
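The practical difference between ETL and ELT is where the transformation happens relative to the load. A toy Python sketch (the records and the business rule are invented) shows that both paths end with the same conditioned data in the sandbox:

```python
# Toy illustration of ETL vs ELT. The "business rule" here simply
# normalizes amounts to integer cents; records and rule are hypothetical.
raw = [{"id": 1, "amount": "12.50"}, {"id": 2, "amount": "3.00"}]

def transform(record):
    # Invented business rule: store amounts as integer cents.
    return {"id": record["id"], "cents": int(float(record["amount"]) * 100)}

# ETL: transform BEFORE loading into the sandbox.
sandbox_etl = [transform(r) for r in raw]

# ELT: load the raw data first, then transform it inside the sandbox.
sandbox_elt = list(raw)                             # load as-is
sandbox_elt = [transform(r) for r in sandbox_elt]   # transform later

assert sandbox_etl == sandbox_elt
print(sandbox_etl)
```

ELT trades early conditioning for flexibility: the raw data stays available in the sandbox, so analysts can re-run different transformations without reloading.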
The data you have collected may contain unnecessary features or null values. It may come in a form too complex to anticipate. This is where data exploration can help you uncover the hidden trends in the data.
Steps involved in data exploration:
 Data identification
 Univariate Analysis
 Multivariate Analysis
 Filling Null values
 Feature engineering
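Two of these steps, univariate analysis and filling null values, can be illustrated with a few lines of standard-library Python (the column values below are made up):

```python
import statistics

# Hypothetical column with a missing value (None).
ages = [23, 31, None, 45, 31]

# Univariate analysis: summary statistics over the observed values.
observed = [a for a in ages if a is not None]
summary = {
    "count": len(observed),
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
}

# Filling null values: a common simple strategy is median imputation.
filled = [a if a is not None else summary["median"] for a in ages]
print(summary, filled)
```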
For model planning, data analysts often use regression techniques, decision trees, neural networks, etc. Tools mostly used for model planning and execution include R and PL/R, WEKA, Octave, Statistica, and MATLAB.
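As a toy illustration of one such technique, a one-level decision tree, sometimes called a decision stump, picks the threshold on a single variable that best separates two classes (the samples below are invented):

```python
# Toy "decision stump": choose the threshold on one feature that
# minimizes misclassifications. Samples and labels are invented.
samples = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1), (5.0, 1), (6.0, 1)]

def best_threshold(data):
    """Try midpoints between consecutive values; return the threshold
    with the fewest errors when predicting label 1 for x > threshold."""
    xs = sorted(x for x, _ in data)
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    def errors(t):
        return sum((x > t) != bool(y) for x, y in data)
    return min(candidates, key=errors)

t = best_threshold(samples)
print(t)  # a threshold between 3.0 and 4.0 separates the classes perfectly
```

Real model planning tools fit far richer models, but the principle of searching for the split that best matches the data is the same one a full decision tree applies recursively.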
Model Building
Model building is the process in which you deploy the planned model in a real-time environment. It allows analysts to solidify their decision-making process by gaining in-depth analytical information. This is a repetitive process, as you constantly have to add new features as required by your customers.
Your aim here is to forecast business decisions, customize market strategies, and develop offerings tailored to customer interests. This can be done by integrating the model into your existing production domain.
In some cases, a specific model perfectly aligns with the business objectives and data, and sometimes it requires more than one try. As you start exploring the data, you need to run particular algorithms and compare the outputs with your objectives. In some cases, you may even have to run different variants of models simultaneously until you receive the desired results.
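The core mechanics, fitting a model on a training split and judging it on a held-out test split against the success criteria agreed in Discovery, can be sketched in plain Python (the data points and the error threshold are invented):

```python
# Sketch: fit a simple least-squares line on a training split and
# evaluate it on a held-out test split. Data and threshold are invented.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0), (5, 9.9), (6, 12.1)]
train, test = data[:4], data[4:]

# Least-squares fit of y = a*x + b on the training set.
n = len(train)
sx = sum(x for x, _ in train); sy = sum(y for _, y in train)
sxx = sum(x * x for x, _ in train); sxy = sum(x * y for x, y in train)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

# Evaluate on the test set with mean absolute error.
mae = sum(abs(y - (a * x + b)) for x, y in test) / len(test)
print(round(a, 2), round(b, 2), round(mae, 2))

# Compare against the success criterion set in Discovery (invented here).
assert mae < 0.5, "model does not meet the agreed criterion"
```

If the assertion failed, the lifecycle would loop back to model planning or data preparation rather than proceeding to communicate results.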
Result Communication and Publication
This is the phase in which you have to communicate the data analysis to your clients. It requires several intricate processes in which you present information to clients in a lucid manner. Your clients don't have enough time to determine which data is essential. Therefore, you must do an impeccable job of grabbing their attention.
Check the data accuracy
Does the data provide the information expected? If not, you have to run other processes to resolve the issue. You need to ensure the data you process provides consistent information. This will help you build a convincing argument while summarizing your findings.
Highlight important findings
Each piece of data plays a significant role in building an efficient project. However, some data carries more potent information that can truly serve your audience's needs. While summarizing your findings, try to categorize the data into different key points.
Determine the most appropriate communication format
How you communicate your findings says a lot about you as a professional. We recommend visual presentations and animations, as they help you convey information much faster. However, sometimes you need to go old-school as well. For instance, your clients may have to carry the findings in physical format. They may also have to pick out certain information and share it with others.
Operationalize
As soon as you prepare a detailed report including your key findings, documents, and briefings, your data analytics lifecycle almost comes to an end. The next step is to measure the effectiveness of your analysis before submitting the final reports to your stakeholders.
In this process, you have to move the sandbox data and run it in a live environment. Then you have to closely monitor the results, ensuring they match your expected goals. If the findings fit perfectly with your objective, you can finalize the report. Otherwise, you have to take a step back in your data analytics lifecycle and make some changes.
Industry examples of Big Data
Here is a list of the top 10 industries using big data applications:

1. Banking and Securities
2. Communications, Media and Entertainment
3. Healthcare Providers
4. Education
5. Manufacturing and Natural Resources
6. Government
7. Insurance
8. Retail and Wholesale Trade
9. Transportation
10. Energy and Utilities
1. Banking and Securities

Industry-specific Big Data Challenges

A study of 16 projects in 10 top investment and retail banks shows that the challenges in this
industry include: securities fraud early warning, tick analytics, card fraud detection, archival of
audit trails, enterprise credit risk reporting, trade visibility, customer data transformation, social
analytics for trading, IT operations analytics, and IT policy compliance analytics, among others.
Applications of Big Data in the Banking and Securities Industry

The Securities and Exchange Commission (SEC) is using Big Data to monitor financial market activity. It currently uses network analytics and natural language processing to catch illegal trading activity in the financial markets.

Retail traders, big banks, hedge funds, and other so-called "big boys" in the financial markets use Big Data for the trade analytics used in high-frequency trading, pre-trade decision-support analytics, sentiment measurement, predictive analytics, etc.

This industry also heavily relies on Big Data for risk analytics, including anti-money laundering, enterprise risk management, "Know Your Customer", and fraud mitigation.

Big Data providers specific to this industry include 1010data, Panopticon Software, StreamBase Systems, NICE Actimize, and Quartet FS.

2. Communications, Media and Entertainment

Industry-specific Big Data Challenges

Since consumers expect rich media on demand in different formats and on a variety of devices, some Big Data challenges in the communications, media, and entertainment industry include:

 Collecting, analyzing, and utilizing consumer insights

 Leveraging mobile and social media content

 Understanding patterns of real-time, media content usage

Applications of Big Data in the Communications, Media and Entertainment Industry

Organizations in this industry simultaneously analyze customer data along with behavioral data
to create detailed customer profiles that can be used to:

 Create content for different target audiences

 Recommend content on demand


 Measure content performance

A case in point is the Wimbledon Championships, which leverage Big Data to deliver detailed sentiment analysis on the tennis matches to TV, mobile, and web users in real time.

Spotify, an on-demand music service, uses Hadoop-based Big Data analytics to collect data from its millions of users worldwide and then uses the analyzed data to give informed music recommendations to individual users.

Amazon Prime, which is driven to provide a great customer experience by offering video, music,
and Kindle books in a one-stop-shop, also heavily utilizes Big Data.

Big Data Providers in this industry include Infochimps, Splunk, Pervasive Software, and Visible
Measures.

3. Healthcare Providers
Industry-specific Big Data Challenges
The healthcare sector has access to huge amounts of data but has been plagued by failures in utilizing the data to curb rising healthcare costs and by inefficient systems that stifle faster and better healthcare benefits across the board.
This is mainly because electronic data is unavailable, inadequate, or unusable. Additionally, the
healthcare databases that hold health-related information have made it difficult to link data that
can show patterns useful in the medical field.
Other challenges related to Big Data include the exclusion of patients from the decision-making
process and the use of data from different readily available sensors.

Applications of Big Data in the Healthcare Sector

Some hospitals, like Beth Israel, are using data collected from a cell phone app, from millions of
patients, to allow doctors to use evidence-based medicine as opposed to administering several
medical/lab tests to all patients who go to the hospital. A battery of tests can be efficient, but it
can also be expensive and usually ineffective.

Free public health data and Google Maps have been used by the University of Florida to create
visual data that allows for faster identification and efficient analysis of healthcare information,
used in tracking the spread of chronic disease. Obamacare has also utilized Big Data in a variety of ways.

Big Data Providers in this industry include Recombinant Data, Humedica, Explorys, and Cerner.

4. Education

Industry-specific Big Data Challenges

From a technical point of view, a significant challenge in the education industry is to incorporate Big Data from different sources and vendors and to utilize it on platforms that were not designed for the varying data. There are particular challenges in integrating data from different sources, on different platforms, and from different vendors that were not designed to work with one another.

From a practical point of view, staff and institutions have to learn new data management and analysis tools.

Politically, issues of privacy and personal data protection associated with Big Data used for educational purposes are a challenge.

Applications of Big Data in Education

Big Data is used quite significantly in higher education. For example, the University of Tasmania, an Australian university with over 26,000 students, has deployed a Learning and Management System that tracks, among other things, when a student logs onto the system, how much time is spent on different pages in the system, and the overall progress of a student over time.
In a different use case, Big Data is also used to measure teachers' effectiveness to ensure a pleasant experience for both students and teachers. Teachers' performance can be fine-tuned and measured against student numbers, subject matter, student demographics, student aspirations, behavioral classification, and several other variables.

On a governmental level, the Office of Educational Technology in the U.S. Department of Education is using Big Data to develop analytics that help course-correct students who are going astray while using online courses. Click patterns are also being used to detect boredom.

Big Data Providers in this industry include Knewton, Carnegie Learning, and MyFit/Naviance.

5. Manufacturing and Natural Resources

Industry-specific Big Data Challenges

Increasing demand for natural resources, including oil, agricultural products, minerals, gas,
metals, and so on, has led to an increase in the volume, complexity, and velocity of data that is a
challenge to handle.

Similarly, large volumes of data from the manufacturing industry are untapped. The
underutilization of this information prevents the improved quality of products, energy efficiency,
reliability, and better profit margins.

Applications of Big Data in Manufacturing and Natural Resources

In the natural resources industry, Big Data allows for predictive modeling to support decision making; it has been utilized for ingesting and integrating large amounts of geospatial, graphical, text, and temporal data. Areas of interest where this has been used include seismic interpretation and reservoir characterization.

Big data has also been used in solving today’s manufacturing challenges and to gain a
competitive advantage, among other benefits.

A study by Deloitte (Supply Chain Talent of the Future) shows the supply chain capabilities enabled by Big Data that are currently in use and their expected use in the future.

Big Data Providers in this industry include CSC, Aspen Technology, Invensys, and Pentaho.

6. Government

Industry-specific Big Data Challenges

In governments, the most significant challenges are the integration and interoperability of Big
Data across different government departments and affiliated organizations.

Applications of Big Data in Government

In public services, Big Data has an extensive range of applications, including energy exploration,
financial market analysis, fraud detection, health-related research, and environmental protection.

Some more specific examples are as follows:

Big data is being used in the analysis of large amounts of social disability claims made to the
Social Security Administration (SSA) that arrive in the form of unstructured data. The analytics
are used to process medical information rapidly and efficiently for faster decision making and to
detect suspicious or fraudulent claims.

The Food and Drug Administration (FDA) is using Big Data to detect and study patterns of food-related illnesses and diseases. This allows for a faster response, which has led to more rapid treatment and fewer deaths.

The Department of Homeland Security uses Big Data for several different use cases. Big data is
analyzed from various government agencies and is used to protect the country.
Big Data Providers in this industry include Digital Reasoning, Socrata, and HP.

7. Insurance

Industry-specific Big Data Challenges

Lack of personalized services, lack of personalized pricing, and the lack of targeted services to
new segments and specific market segments are some of the main challenges.

In a survey conducted by Marketforce, challenges identified by professionals in the insurance industry include the underutilization of data gathered by loss adjusters and a hunger for better insight.

Applications of Big Data in the Insurance Industry

Big Data has been used in the industry to provide customer insights for transparent and simpler products by analyzing and predicting customer behavior through data derived from social media, GPS-enabled devices, and CCTV footage. Big Data also allows insurance companies to achieve better customer retention.

When it comes to claims management, predictive analytics from Big Data has been used to offer
faster service since massive amounts of data can be analyzed mainly in the underwriting stage.
Fraud detection has also been enhanced.

Through massive data from digital channels and social media, real-time monitoring of claims
throughout the claims cycle has been used to provide insights.

Big Data Providers in this industry include Sprint, Qualcomm, Octo Telematics, and The Climate Corporation.

8. Retail and Wholesale Trade

Industry-specific Big Data Challenges

From traditional brick-and-mortar retailers and wholesalers to present-day e-commerce traders, the industry has gathered a lot of data over time. This data, derived from customer loyalty cards, POS scanners, RFID, etc., is not being used enough to improve customer experiences on the whole. Any changes and improvements made have been quite slow.

Applications of Big Data in the Retail and Wholesale Industry

Big Data from customer loyalty programs, POS systems, store inventory, and local demographics continues to be gathered by retail and wholesale stores.

At New York's Big Show retail trade conference in 2014, companies like Microsoft, Cisco, and IBM pitched the need for the retail industry to utilize Big Data for analytics and other uses, including:

 Optimized staffing through data from shopping patterns, local events, and so on

 Reduced fraud

 Timely analysis of inventory

Social media also has a lot of potential uses and continues to be adopted, slowly but surely, especially by brick-and-mortar stores. Social media is used for customer prospecting, customer retention, promotion of products, and more.

Big Data Providers in this industry include First Retail, First Insight, Fujitsu, Infor, Epicor, and
Vistex.
9. Transportation
Industry-specific Big Data Challenges
In recent times, huge amounts of data from location-based social networks and high-speed data from telecoms have affected travel behavior. Regrettably, research to understand travel behavior has not progressed as quickly. In most places, transport demand models are still based on poorly understood new social media structures.

Applications of Big Data in the Transportation Industry

Some applications of Big Data by governments, private organizations, and individuals include:

 Governments use of Big Data: traffic control, route planning, intelligent transport systems,
congestion management (by predicting traffic conditions)

 Private-sector use of Big Data in transport: revenue management, technological


enhancements, logistics and for competitive advantage (by consolidating shipments and
optimizing freight movement)

 Individual use of Big Data includes route planning to save on fuel and time, for travel
arrangements in tourism, etc.
Source: Using Big Data in the Transport Sector

Big Data Providers in this industry include Qualcomm and Manhattan Associates.

10. Energy and Utilities

Industry-specific Big Data Challenges


An accompanying image summarizes some of the main challenges in the energy and utility industry.

Applications of Big Data in the Energy and Utility Industry
Smart meter readers allow data to be collected almost every 15 minutes as opposed to once a day
with the old meter readers. This granular data is being used to analyze the consumption of
utilities better, which allows for improved customer feedback and better control of utilities use.
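As an illustration of working with such granular data, 15-minute interval readings can be rolled up for consumption analysis in a few lines of Python (the readings below are invented):

```python
from collections import defaultdict

# Hypothetical smart-meter readings: (timestamp "HH:MM", kWh used in
# that 15-minute interval). Four readings per hour instead of one per day.
readings = [
    ("00:00", 0.3), ("00:15", 0.2), ("00:30", 0.4), ("00:45", 0.1),
    ("01:00", 0.5), ("01:15", 0.6), ("01:30", 0.4), ("01:45", 0.5),
]

# Roll the granular data up to hourly totals for consumption analysis.
hourly = defaultdict(float)
for ts, kwh in readings:
    hour = ts.split(":")[0]
    hourly[hour] += kwh

print(dict(hourly))
```

The same aggregation at daily or monthly granularity, combined with local weather or tariff data, is the kind of analysis the old once-a-day meters could not support.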
In utility companies, the use of Big Data also allows for better asset and workforce management,
which is useful for recognizing errors and correcting them as soon as possible before complete
failure is experienced.

Big Data Providers in this industry include Alstom, Siemens, ABB, and Cloudera.

Conclusion
Having gone through 10 industry verticals, including how Big Data plays a role in each, here are a few key takeaways:

1. There is substantial real spending on Big Data.

2. To capitalize on Big Data opportunities, you need to:

 Familiarize yourself with and understand industry-specific challenges.

 Understand or know the data characteristics of each industry.

 Understand where spending is occurring.

 Match market needs with your own capabilities and solutions.

3. Vertical industry expertise is key to utilizing Big Data effectively and efficiently.
