MODULE-I [8 Hrs]
Introduction: Big Data Overview, BI Versus Data Science, Current Analytical Architecture,
Drivers of Big Data. Data Analytics Lifecycle - Overview, Phases - Discovery, Data Preparation
and Model planning, Model building, Communicate Results and Operationalize.
Industry examples of Big Data
Introduction:
“90% of the world’s data was generated in the last few years.”
Due to the advent of new technologies, devices, and communication means like social
networking sites, the amount of data produced by mankind is growing rapidly every year. The
amount of data produced by us from the beginning of time till 2003 was 5 billion gigabytes. If
you piled up this data in the form of disks, it would fill an entire football field. The same amount
was created every two days in 2011, and every ten minutes in 2013. This rate is still growing
enormously. Though all this information is meaningful and can be useful when processed, much
of it is being neglected.
Big data is a collection of large datasets that cannot be processed using traditional computing
techniques. It is not a single technique or a tool, rather it has become a complete subject, which
involves various tools, techniques and frameworks.
Big data involves the data produced by different devices and applications. Given below are
some of the fields that come under the umbrella of Big Data.
Black Box Data − It is a component of helicopters, airplanes, jets, etc. It captures the
voices of the flight crew, recordings from microphones and earphones, and the
performance information of the aircraft.
Social Media Data − Social media such as Facebook and Twitter hold information and
the views posted by millions of people across the globe.
Stock Exchange Data − The stock exchange data holds information about the ‘buy’ and
‘sell’ decisions that customers make on the shares of different companies.
Power Grid Data − The power grid data holds information about the power consumed by a
particular node with respect to a base station.
Transport Data − Transport data includes model, capacity, distance and availability of a
vehicle.
Search Engine Data − Search engines retrieve lots of data from different databases.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data
will be of three types:
Structured data − Relational data.
Semi Structured data − XML data.
Unstructured data − Word, PDF, Text, Media Logs.
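As a rough illustration of these three types, the Python sketch below reads one example of each; the file names are hypothetical, and JSON stands in alongside XML as a semi-structured format.

import csv, json

# Structured: rows with a fixed schema, e.g., a relational table exported as CSV.
with open("employees.csv") as f:       # hypothetical file
    rows = list(csv.DictReader(f))

# Semi-structured: self-describing records such as JSON or XML.
with open("events.json") as f:         # hypothetical file
    events = json.load(f)

# Unstructured: free text with no schema; any structure must be inferred.
with open("ticket.txt") as f:          # hypothetical file
    text = f.read()
print(len(rows), len(events), len(text.split()))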
Using the information kept in social networks like Facebook, marketing agencies are
learning about the response to their campaigns, promotions, and other advertising
mediums.
Using the information in social media, such as the preferences and product perception of
their consumers, product companies and retail organizations are planning their
production.
Using data regarding the previous medical history of patients, hospitals are providing
better and quicker service.
Big data technologies are important in providing more accurate analysis, which may lead to
more concrete decision-making resulting in greater operational efficiencies, cost reductions, and
reduced risks for the business.
To harness the power of big data, you would require an infrastructure that can manage and
process huge volumes of structured and unstructured data in real time and can protect data
privacy and security.
There are various technologies in the market from different vendors including Amazon, IBM,
Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we
examine the following two classes of technology −
Operational Big Data
This class includes systems like MongoDB that provide operational capabilities for real-time,
interactive workloads where data is primarily captured and stored.
NoSQL Big Data systems are designed to take advantage of new cloud computing architectures
that have emerged over the past decade to allow massive computations to be run inexpensively
and efficiently. This makes operational big data workloads much easier to manage, cheaper, and
faster to implement.
Some NoSQL systems can provide insights into patterns and trends based on real-time data with
minimal coding and without the need for data scientists and additional infrastructure.
Analytical Big Data
This class includes systems like Massively Parallel Processing (MPP) database systems and
MapReduce that provide analytical capabilities for retrospective and complex analysis that may
touch most or all of the data.
MapReduce provides a method of analyzing data that is complementary to the capabilities
provided by SQL, and a system based on MapReduce can be scaled up from a single server
to thousands of high- and low-end machines.
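To make the MapReduce idea concrete, here is a minimal word-count sketch in plain Python. It only illustrates the map/shuffle/reduce pattern; the function names are invented for this sketch and are not part of Hadoop’s actual API.

from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

docs = ["big data is big", "data beats opinions"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'big': 2, 'data': 2, 'is': 1, 'beats': 1, 'opinions': 1}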
These two classes of technology are complementary and frequently deployed together.
(Table: side-by-side comparison of Operational and Analytical Big Data systems.)
Big data analytics is the often complex process of examining big data to uncover information --
such as hidden patterns, correlations, market trends and customer preferences -- that can help
organizations make informed business decisions.
On a broad scale, data analytics technologies and techniques give organizations a way to analyze
data sets and gather new information. Business intelligence (BI) queries answer basic questions
about business operations and performance.
Big data analytics is a form of advanced analytics, which involves complex applications with
elements such as predictive models, statistical algorithms, and what-if analysis powered by
analytics systems.
Organizations can use big data analytics systems and software to make data-driven decisions that
can improve business-related outcomes. The benefits may include more effective marketing, new
revenue opportunities, customer personalization and improved operational efficiency. With an
effective strategy, these benefits can provide competitive advantages over rivals.
The major challenges associated with big data include:
Capturing data
Curation
Storage
Searching
Sharing
Transfer
Analysis
Presentation
To address the above challenges, organizations normally take the help of enterprise servers.
BI Versus Data Science
Data Science:
Data science is a field in which information and knowledge are extracted from data by using
various scientific methods, algorithms, and processes. It can thus be defined as a combination of
mathematical tools, algorithms, statistics, and machine learning techniques that are used to find
hidden patterns and insights in the data, which help in the decision-making process. Data science
deals with both structured and unstructured data. It is related to both data mining and big data.
Data science involves studying historic trends and using its conclusions to redefine present
trends and also predict future trends.
Business Intelligence:
Business intelligence (BI) is a set of technologies, applications, and processes used by
enterprises for business data analysis. It is used to convert raw data into meaningful information
for business decision making and profitable actions. It deals with the analysis of structured and
sometimes unstructured data, which paves the way for new and profitable business
opportunities. It supports decision making based on facts rather than on assumptions. Thus it has
a direct impact on the business decisions of an enterprise. Business intelligence tools enhance
the chances of an enterprise entering a new market as well as help in studying the impact of
marketing efforts.
Comparison of data science and business intelligence:
Questions – Data science deals with the questions “what will happen” and “what if”, whereas
business intelligence deals with the question “what happened”.
Big Data’s challenges keep changing depending on the context or the ecosystem and include
capturing, analysis, storage technology, visualization, sharing, transfer, and privacy concerns.
Multiple technologies partake in the Big Data model to handle the V’s mentioned earlier as
efficiently as possible. For structured data, an RDBMS or distributed RDBMS is used. For
unstructured data, NoSQL databases like MongoDB are used. For data ingestion, Apache Flume
and other specialized tools are used. One of the prominent and well-known Big Data
frameworks is Hadoop. There are also custom implementations of the Hadoop ecosystem from
the market’s major players like Amazon, Microsoft, Oracle, and IBM.
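As a small, hedged example of the NoSQL side, the PyMongo sketch below stores schema-less documents in MongoDB; the connection string, database, and collection names are assumptions for illustration.

from pymongo import MongoClient  # assumes a MongoDB instance is running locally

client = MongoClient("mongodb://localhost:27017/")
collection = client["bigdata_demo"]["posts"]   # hypothetical database and collection

# Documents need not share a schema -- a hallmark of NoSQL stores.
collection.insert_one({"user": "alice", "text": "loving the new phone", "likes": 42})
collection.insert_one({"user": "bob", "tags": ["review", "camera"]})

# Query by a field even though not every document has it.
for doc in collection.find({"likes": {"$gt": 10}}):
    print(doc["user"], doc["likes"])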
2. APPLICATIONS
We shall quickly run through a few of the top applications of Big Data in the real world before
we dive into Big Data Architecture.
Retail Trade
The biggest application of big data is probably in the retail and e-commerce industry, with data
analysis from various streams, including social media, for producing targeted adverts. Sentiment
analysis also plays a major role in gauging customer feedback on all products, thus quickly
fixing any issues before it turns out to be a major loss.
3. ARCHITECTURE
There are several Big Data products on the market, but you still have to design the system to suit
your business’s particular needs. You will need a Big Data Architect to design your Big Data
solution catering to your unique business ecosystem. Big Data has a generic architecture that
applies to most businesses at a high level, and it is not necessary that you need all of the
components used for successful implementation.
To start with, Big Data is known to have at least six layers to its architecture.
(Figure: a representation of Big Data Architecture with just the Big Data components shown.)
4. PATTERNS
There are several design patterns that are like templates that you can select for your business.
These design patterns are based on the layer in context. We shall mention a few patterns in the
ingestion and data storage layers.
Data Ingestion Layer
Multisource Extractor
Multi-destination
Protocol Converter
Just-in-time Transformation
Real-time streaming pattern
Data Storage Layer
With ACID (atomicity, consistency, isolation, and durability), BASE (basically available, soft
state, eventually consistent) and CAP (consistency, availability, and partition tolerance)
paradigms, several design patterns have been built for the storage layer, namely,
Façade pattern
NoSQL pattern
Polyglot pattern
There are a few more design patterns available that you may attempt to explore. A minimal
illustration of the ACID guarantees mentioned above follows.
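The sqlite3 sketch below illustrates atomicity, the “A” in ACID: either both account updates commit, or neither does. The table and values are hypothetical.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])

try:
    with conn:  # opens a transaction; commits on success, rolls back on any error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    pass  # on failure neither update is visible: atomicity

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 70), ('bob', 80)]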
5. EXAMPLES
Here we list out a few well-known examples of Big Data Architecture shaped according to the
business ecosystem they were developed for and evolved into their current state.
Streaming
This kind of Big Data architecture allows for real-time data ingestion of critical business data
that needs to be taken care of or responded to in real-time. The velocity and variety of data here
is the key pivot around which this architecture has evolved.
General-purpose
This kind of Big Data Architecture provides generic storage and processing capabilities that are
applicable in most businesses.
NoSQL engines
Big Data architectures of this kind stress dealing with data that comes in at high velocity, is high
in volume, and comes in a variety of types (structured, unstructured, or semi-structured).
Enterprise Data Warehouse
A Big Data Architecture inspired by an enterprise data warehouse that stores a separate database
of historical data for years, using it for analytical purposes.
In-place Analytics
An architecture that allows data to be left “in place” in a low-cost storage engine where ad-hoc
queries can be run without the need for separate and expensive clusters.
CONCLUSION
Without any doubt, it can be said that Big Data will be the technology that businesses run on, be
it on-premises or on the cloud. There is no denying Big Data is the technology that facilitates
Machine Learning and Artificial Intelligence and is a backend skill that will be sought after in
every industry and business in the near future.
Or
Big data architecture refers to the logical and physical structure that dictates how high volumes
of data are ingested, processed, stored, managed, and accessed.
Big Data Sources Layer: a big data environment can manage both batch processing and
real-time processing of big data sources, such as data warehouses, relational database
management systems, SaaS applications, and IoT devices.
Management & Storage Layer: receives data from the source, converts the data into a
format comprehensible for the data analytics tool, and stores the data according to its
format.
Analysis Layer: analytics tools extract business intelligence from the big data storage
layer.
Consumption Layer: receives results from the big data analysis layer and presents them to
the pertinent output layer - also known as the business intelligence layer.
Big Data Architecture Processes
In order to benefit from the potential of big data, it is crucial to invest in a big data infrastructure
that is capable of handling huge quantities of data. These benefits include: improving
understanding and analysis of big data, making better decisions faster, reducing costs, predicting
future needs and trends, encouraging common standards and providing a common language, and
providing consistent methods for implementing technology that solves comparable problems.
Big data infrastructure challenges include the management of data quality, which requires
extensive analysis; scaling, which can be costly and affect performance if not sufficient; and
security, which increases in complexity with big data sets.
Establishing big data architecture components before embarking upon a big data project is a
crucial step in understanding how the data will be used and how it will bring value to the
business. Implementing the following big data architecture principles for your big data
architecture strategy will help in developing a service-oriented approach that ensures the data
addresses a variety of business needs.
Preliminary Step: A big data project should be in line with the business vision and have a
good understanding of the organizational context, the key drivers of the organization,
data architecture work requirements, architecture principles and framework to be used,
and the maturity of the enterprise architecture. It is also important to have a thorough
understanding of the elements of the current business technology landscape, such as
business strategies and organizational models, business principles and goals, current
frameworks in use, governance and legal frameworks, IT strategy, and any pre-existing
architecture frameworks and repositories.
Data Sources: Before any big data solution architecture is coded, data sources should be
identified and categorized so that big data architects can effectively normalize the data to
a common format. Data sources can be categorized as either structured data, which is
typically formatted using predefined database techniques, or unstructured data, which
does not follow a consistent format, such as emails, images, and Internet data.
Big Data ETL: Data should be consolidated into a single Master Data Management
system for querying on demand, either via batch processing or stream processing. For
processing, Hadoop has been a popular batch processing framework. For querying, the
Master Data Management system can be stored in a data repository such as a NoSQL-
based or relational DBMS. A toy batch-ETL sketch follows this list.
Data Services API: When choosing a database solution, consider whether or not there is a
standard query language, how to connect to the database, the ability of the database to
scale as data grows, and which security mechanisms are in place.
User Interface Service: a big data application architecture should have an intuitive design
that is customizable, available through current dashboards in use, and accessible in the
cloud. Standards like Web Services for Remote Portlets (WSRP) facilitate the serving of
User Interfaces through Web Service calls.
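Relating the Big Data ETL step above to code, here is a toy batch ETL sketch in Python, with a CSV file standing in for the raw source and SQLite standing in for the Master Data Management store; all file, table, and column names are hypothetical.

import csv, sqlite3

# Extract: read raw records from a source file (hypothetical name and columns).
with open("sales_raw.csv") as f:
    raw = list(csv.DictReader(f))

# Transform: normalize records to a common format, dropping malformed rows.
clean = [
    {"sku": r["sku"].strip().upper(), "amount": float(r["amount"])}
    for r in raw
    if r.get("sku") and r.get("amount")
]

# Load: consolidate into a single queryable store (SQLite as a stand-in).
conn = sqlite3.connect("master.db")
conn.execute("CREATE TABLE IF NOT EXISTS sales (sku TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (:sku, :amount)", clean)
conn.commit()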
Designing a big data reference architecture, while complex, follows the same general procedure:
Analyze the Problem: First determine if the business does in fact have a big data problem,
taking into consideration criteria such as data variety, velocity, and challenges with the
current system. Common use cases include data archival, process offload, data lake
implementation, unstructured data processing, and data warehouse modernization.
Select a Vendor: Hadoop is one of the most widely recognized big data architecture tools
for managing big data end to end architecture. Popular vendors for Hadoop distribution
include Amazon Web Services, BigInsights, Cloudera, Hortonworks, MapR, and
Microsoft.
Deployment Strategy: Deployment can be either on-premises, which tends to be more
secure; cloud-based, which is cost-effective and provides flexibility regarding scalability;
or a mixed deployment strategy.
Capacity Planning: When planning hardware and infrastructure sizing, consider daily
data ingestion volume, data volume for one-time historical load, the data retention period,
multi-data center deployment, and the time period for which the cluster is sized. A back-
of-the-envelope sizing calculation follows this list.
Infrastructure Sizing: This is based on capacity planning and determines the number of
clusters/environment required and the type of hardware required. Consider the type of
disk and number of disks per machine, the types of processing memory and memory size,
number of CPUs and cores, and the data retained and stored in each environment.
Plan a Disaster Recovery: In developing a backup and disaster recovery plan, consider
the criticality of data stored, the Recovery Point Objective and Recovery Time Objective
requirements, backup interval, multi datacenter deployment, and whether Active-Active
or Active-Passive disaster recovery is most appropriate.
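To illustrate the Capacity Planning step above, here is a back-of-the-envelope sizing calculation in Python; every figure is an assumed, illustrative input, not a recommendation.

daily_ingest_gb = 500          # assumed daily data ingestion volume
historical_load_gb = 20_000    # assumed one-time historical load
retention_days = 365           # assumed data retention period
replication_factor = 3         # e.g., the HDFS default replication

raw_gb = daily_ingest_gb * retention_days + historical_load_gb
total_gb = raw_gb * replication_factor
print(f"Raw: {raw_gb / 1024:.1f} TB; with replication: {total_gb / 1024:.1f} TB")
# Raw: 197.8 TB; with replication: 593.3 TB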
Semi-structured Data – This type of data is a mix of the other two types of big data,
i.e., structured and unstructured data. It is also known as hybrid big data.
1. Volume
2. Variety
3. Velocity
4. Variability
5. Veracity
6. Visualization
7. Value
1. Volume
This is the main characteristic of big data. The term volume is what makes big data “BIG”.
With a massive amount of data being generated daily, we know gigabytes are not enough to
store such huge amounts of data.
Because of this, data is now stored in terms of zettabytes, exabytes, and yottabytes. For
instance, almost 50 hours of video are uploaded to YouTube every single minute.
Now imagine how much data is being generated on YouTube itself.
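As a rough, assumption-laden calculation of what that upload rate implies, with an assumed average of about 1.5 GB per hour of video:

hours_uploaded_per_minute = 50   # the upload rate quoted above
gb_per_video_hour = 1.5          # assumed average bitrate
minutes_per_day = 60 * 24

daily_tb = hours_uploaded_per_minute * gb_per_video_hour * minutes_per_day / 1024
print(f"~{daily_tb:.0f} TB of video per day")   # ~105 TB/day under these assumptions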
2. Variety
Here variety means types of data sources. As discussed above, big data can be of various types –
structured, semi-structured, and unstructured.
In today’s world, most of the data generated in large quantities is unstructured, like audio files,
videos, images, text files, etc., which are difficult to map due to their nature, i.e., they don’t
follow any set of rules, which makes it difficult to sort the essential data from the rest.
3. Velocity
Velocity here refers to how fast the data can be processed and accessed. For example, social
media posts, YouTube videos, audio files, and images that are uploaded by the thousands every
second should be accessible as early as possible.
4. Variability
Variability is different from variety. Variability refers to data which keeps on changing
constantly.
Variability mainly focuses on understanding and interpreting the correct meanings of raw data.
For example – A soda shop may offer 6 different blends of soda, but if you get the same blend of
soda every day and it tastes different every day, that is variability.
The same is the case with data: if it is continuously changing, it can have an impact on the
quality of your data.
5. Veracity
If your data is not accurate, it is of no use, and here comes the concept of Veracity. It is all about
making sure the data gathered by you is accurate and also keeping the bad data away from your
systems.
It is also the trustworthiness or quality of the data which a company receives and processes to
derive useful insights.
6. Visualization
Visualization here refers to how you can present your data to the management for decision-
making purposes.
We all know that data can be presented in many ways, such as Excel files, Word documents,
graphical charts, etc. Irrespective of the format, the data should be easily readable,
understandable, and accessible, and that is why data visualization is important.
7. Value
Value is known as the end game in big data. Every user needs to understand that the organization
needs some value after efforts are made and resources are spent on the above-mentioned V’s.
Big data can help provide value if it is processed correctly.
Data Analytics Lifecycle – Overview
The Data Analytics Lifecycle is a cyclic process which explains, in six stages, how information is
made, collected, processed, implemented, and analyzed for different objectives.
Data Analytics Lifecycle :
The data analytics lifecycle is designed for Big Data problems and data science projects. The cycle is
iterative, to represent a real project. To address the distinct requirements for performing analysis on Big
Data, a step-by-step methodology is needed to organize the activities and tasks involved with
acquiring, processing, analyzing, and repurposing data.
Phase 1: Discovery –
The data science team learns and investigates the problem.
Develop context and understanding.
Come to know about data sources needed and available for the project.
The team formulates an initial hypothesis that can be later tested with data.
Phase 2: Data Preparation –
Steps to explore, preprocess, and condition data prior to modeling and analysis.
It requires the presence of an analytic sandbox; the team executes extract, load, and transform
(ELT) processes to get data into the sandbox.
Data preparation tasks are likely to be performed multiple times and not in a predefined order.
Several tools commonly used for this phase are Hadoop, Alpine Miner, OpenRefine, etc.
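A minimal pandas sketch of such explore/preprocess/condition steps, assuming a hypothetical CSV extract with customer_id and signup_date columns:

import pandas as pd

df = pd.read_csv("customers_raw.csv")    # hypothetical extract loaded into the sandbox

print(df.describe())                     # explore: summary statistics
df = df.drop_duplicates()                # preprocess: remove duplicate records
df = df.dropna(subset=["customer_id"])   # condition: drop rows missing the key field
df["signup_date"] = pd.to_datetime(df["signup_date"])   # condition: normalize types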
Phase 3: Model Planning –
Team explores data to learn about the relationships between variables and subsequently
selects key variables and the most suitable models.
Several tools commonly used for this phase are MATLAB and STATISTICA.
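For instance, exploring relationships between variables to shortlist candidates might look like the pandas sketch below, with pandas standing in for the tools above; the dataset and the churn target column are hypothetical.

import pandas as pd

df = pd.read_csv("prepared_data.csv")    # hypothetical conditioned dataset

# Inspect pairwise correlations with the target to shortlist key variables.
corr = df.corr(numeric_only=True)["churn"].sort_values(ascending=False)
print(corr.head(10))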
Phase 4: Model Building –
Team develops datasets for testing, training, and production purposes.
Team builds and executes models based on the work done in the model planning phase.
Team also considers whether its existing tools will suffice for running the models or if they need
a more robust environment for executing models.
Free or open-source tools – R and PL/R, Octave, WEKA.
Commercial tools – MATLAB, STATISTICA.
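A hedged scikit-learn sketch of this phase, standing in for the tools listed above; the dataset and target column are hypothetical.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

df = pd.read_csv("prepared_data.csv")                  # hypothetical dataset
X, y = df.drop(columns=["churn"]), df["churn"]

# Develop training and testing datasets, as the phase prescribes.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))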
Phase 5: Communicate Results –
After executing the model, the team needs to compare the outcomes of modeling to the criteria
established for success and failure.
Team considers how best to articulate findings and outcomes to various team members and
stakeholders, taking into account warnings and assumptions.
Team should identify key findings, quantify business value, and develop narrative to summarize
and convey findings to stakeholders.
Phase 6: Operationalize –
The team communicates the benefits of the project more broadly and sets up a pilot project to
deploy the work in a controlled way before broadening the work to a full enterprise of users.
This approach enables the team to learn about the performance and related constraints of the
model in a production environment on a small scale and make adjustments before full
deployment.
The team delivers final reports, briefings, and code.
Free or open source tools – Octave, WEKA, SQL, MADlib.
Or
Life Cycle Phases of Data Analytics
The data analytics lifecycle is a cyclic process which explains, in six stages, how information is
made, collected, processed, implemented, and analyzed for different objectives. During the data
preparation and exploration stages, common tasks include:
Data identification
Univariate Analysis
Multivariate Analysis
Filling Null values
Feature engineering
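The last two tasks above might look like this in pandas (the dataset and column names are hypothetical):

import pandas as pd

df = pd.read_csv("transactions.csv")     # hypothetical dataset

# Filling null values: impute numeric gaps with the median.
df["amount"] = df["amount"].fillna(df["amount"].median())

# Feature engineering: derive a new predictor from existing columns.
df["amount_per_item"] = df["amount"] / df["items"].clip(lower=1)
print(df.head())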
For model planning, data analysts often use regression techniques, decision trees, neural
networks, etc. Tools mostly used for model planning and execution include R and PL/R, WEKA,
Octave, STATISTICA, and MATLAB.
Model Building
Model building is the process where you have to deploy the planned model in a real-time
environment. It allows analysts to solidify their decision-making process by gaining in-depth
analytical information. This is a repetitive process, as you constantly have to add new features as
required by your customers.
Your aim here is to forecast business decisions and customize market strategies and develop
tailor-made customer interests. This can be done by integrating the model into your existing
production domain.
In some cases, a specific model perfectly aligns with the business objectives/data, and
sometimes it requires more than one try. As you start exploring the data, you need to run
particular algorithms and compare the outputs with your objectives. In some cases, you may even
have to run different variants of models simultaneously until you receive the desired results.
Result Communication and Publication
This is the phase where you have to communicate the data analysis to your clients. It requires
several intricate processes in which you must present information to clients in a lucid manner.
Your clients don't have enough time to determine which data is essential. Therefore, you must do
an impeccable job to grab the attention of your clients.
Check the data accuracy
Does the data provide information as expected? If not, then you have to run some other processes
to resolve the issue. You need to ensure the data you process provides consistent information. This
will help you build a convincing argument while summarizing your findings.
Highlight important findings
Each piece of data holds a significant role in building an efficient project. However, some data
carries more potent information that can truly serve your audience’s benefit. While
summarizing your findings, try to categorize data into different key points.
Determine the most appropriate communication format
How you communicate your findings tells a lot about you as a professional. We recommend
going for visual presentations and animations, as they help you convey information much faster.
However, sometimes you need to go old-school as well. For instance, your clients may have
to carry the findings in physical format. They may also have to pick up certain information and
share it with others.
Operationalize
As soon as you prepare a detailed report including your key findings, documents, and briefings,
your data analytics lifecycle almost comes to an end. The next step is to measure the
effectiveness of your analysis before submitting the final reports to your stakeholders.
In this process, you have to move the sandbox data and run it in a live environment. Then you
have to closely monitor the results, ensuring they match your expected goals. If the findings
fit perfectly with your objective, then you can finalize the report. Otherwise, you have to take a
step back in your data analytics lifecycle and make some changes.
Industry examples of Big Data
Here is the list of the top 10 industries using big data applications:
1. Banking and Securities
2. Communications, Media and Entertainment
3. Healthcare Providers
4. Education
5. Manufacturing and Natural Resources
6. Government
7. Insurance
8. Retail and Wholesale Trade
9. Transportation
10. Energy and Utilities
1. Banking and Securities
A study of 16 projects in 10 top investment and retail banks shows that the challenges in this
industry include: securities fraud early warning, tick analytics, card fraud detection, archival of
audit trails, enterprise credit risk reporting, trade visibility, customer data transformation, social
analytics for trading, IT operations analytics, and IT policy compliance analytics, among others.
Applications of Big Data in the Banking and Securities Industry
The Securities and Exchange Commission (SEC) is using Big Data to monitor financial market
activity. They are currently using network analytics and natural language processors to catch
illegal trading activity in the financial markets.
Retail traders, big banks, hedge funds, and other so-called ‘big boys’ in the financial markets use
Big Data for trade analytics used in high-frequency trading, pre-trade decision-support analytics,
sentiment measurement, predictive analytics, etc.
This industry also heavily relies on Big Data for risk analytics, including anti-money laundering,
demand enterprise risk management, “Know Your Customer,” and fraud mitigation.
Big Data providers specific to this industry include 1010data, Panopticon Software,
StreamBase Systems, NICE Actimize, and Quartet FS.
2. Communications, Media and Entertainment
Since consumers expect rich media on demand in different formats and on a variety of devices,
collecting, analyzing, and utilizing consumer insights is a key Big Data challenge in the
communications, media, and entertainment industry.
Organizations in this industry simultaneously analyze customer data along with behavioral data
to create detailed customer profiles that can be used to target content and advertisements at
different audiences.
A case in point is the Wimbledon Championships, which leverage Big Data to deliver detailed
sentiment analysis on the tennis matches to TV, mobile, and web users in real time.
Spotify, an on-demand music service, uses Hadoop Big Data analytics to collect data from its
millions of users worldwide and then uses the analyzed data to give informed music
recommendations to individual users.
Amazon Prime, which is driven to provide a great customer experience by offering video, music,
and Kindle books in a one-stop-shop, also heavily utilizes Big Data.
Big Data Providers in this industry include Infochimps, Splunk, Pervasive Software, and Visible
Measures.
3. Healthcare Providers
Industry-specific Big Data Challenges
The healthcare sector has access to huge amounts of data but has been plagued by failures in
utilizing the data to curb the rising cost of healthcare and by inefficient systems that stifle faster
and better healthcare benefits across the board.
This is mainly because electronic data is unavailable, inadequate, or unusable. Additionally, the
healthcare databases that hold health-related information have made it difficult to link data that
can show patterns useful in the medical field.
Other challenges related to Big Data include the exclusion of patients from the decision-making
process and the use of data from different readily available sensors.
Some hospitals, like Beth Israel, are using data collected from a cell phone app, from millions of
patients, to allow doctors to use evidence-based medicine as opposed to administering several
medical/lab tests to all patients who go to the hospital. A battery of tests can be efficient, but it
can also be expensive and usually ineffective.
Free public health data and Google Maps have been used by the University of Florida to create
visual data that allows for faster identification and efficient analysis of healthcare information,
used in tracking the spread of chronic disease. Obamacare has also utilized Big Data in a variety
of ways. Big Data Providers in this industry include Recombinant Data, Humedica, Explorys,
and Cerner.
4. Education
From a technical point of view, a significant challenge in the education industry is to incorporate
Big Data from different sources and vendors and to utilize it on platforms that were not designed
for the varying data.
From a practical point of view, staff and institutions have to learn new data management and
analysis tools.
On the technical side, there are challenges to integrating data from different sources on different
platforms and from different vendors that were not designed to work with one another.
Politically, issues of privacy and personal data protection associated with Big Data used for
educational purposes are a challenge.
Big data is used quite significantly in higher education. For example, the University of
Tasmania, an Australian university with over 26,000 students, has deployed a Learning and
Management System that tracks, among other things, when a student logs onto the system, how
much time is spent on different pages in the system, as well as the overall progress of a student
over time.
In a different use case of Big Data in education, it is also used to measure teachers’
effectiveness to ensure a pleasant experience for both students and teachers. Teachers’
performance can be fine-tuned and measured against student numbers, subject matter, student
demographics, student aspirations, behavioral classification, and several other variables.
Big Data Providers in this industry include Knewton, Carnegie Learning, and
MyFit/Naviance.
5. Manufacturing and Natural Resources
Increasing demand for natural resources, including oil, agricultural products, minerals, gas,
metals, and so on, has led to an increase in the volume, complexity, and velocity of data that is a
challenge to handle.
Similarly, large volumes of data from the manufacturing industry remain untapped. The
underutilization of this information prevents improvements in product quality, energy
efficiency, reliability, and profit margins.
In the natural resources industry, Big Data allows for predictive modeling to support decision
making; it has been utilized for ingesting and integrating large amounts of geospatial data,
graphical data, text, and temporal data. Areas of interest where this has been used include
seismic interpretation and reservoir characterization.
Big data has also been used in solving today’s manufacturing challenges and to gain a
competitive advantage, among other benefits.
A study by Deloitte, “Supply Chain Talent of the Future,” shows the supply chain capabilities
from Big Data currently in use and their expected use in the future.
Big Data Providers in this industry include CSC, Aspen Technology, Invensys, and Pentaho.
6. Government
In governments, the most significant challenges are the integration and interoperability of Big
Data across different government departments and affiliated organizations.
In public services, Big Data has an extensive range of applications, including energy exploration,
financial market analysis, fraud detection, health-related research, and environmental protection.
Big data is being used in the analysis of large amounts of social disability claims made to the
Social Security Administration (SSA) that arrive in the form of unstructured data. The analytics
are used to process medical information rapidly and efficiently for faster decision making and to
detect suspicious or fraudulent claims.
The Food and Drug Administration (FDA) is using Big Data to detect and study patterns of food-
related illnesses and diseases. This allows for a faster response, which has led to more rapid
treatment and fewer deaths.
The Department of Homeland Security uses Big Data for several different use cases. Big data is
analyzed from various government agencies and is used to protect the country.
Big Data Providers in this industry include Digital Reasoning, Socrata, and HP.
7. Insurance
Lack of personalized services, lack of personalized pricing, and the lack of targeted services to
new segments and specific market segments are some of the main challenges.
Big data has been used in the industry to provide customer insights for transparent and simpler
products, by analyzing and predicting customer behavior through data derived from social media,
GPS-enabled devices, and CCTV footage. Big Data also allows insurance companies to achieve
better customer retention.
When it comes to claims management, predictive analytics from Big Data has been used to offer
faster service since massive amounts of data can be analyzed mainly in the underwriting stage.
Fraud detection has also been enhanced.
Through massive data from digital channels and social media, real-time monitoring of claims
throughout the claims cycle has been used to provide insights.
Big Data Providers in this industry include Sprint, Qualcomm, Octo Telematics, The Climate
Corp.
8. Retail and Wholesale Trade
From traditional brick-and-mortar retailers and wholesalers to current-day e-commerce traders,
the industry has gathered a lot of data over time. This data, derived from customer loyalty cards,
POS scanners, RFID, etc., is not being used enough to improve customer experiences on the
whole. Any changes and improvements made have been quite slow.
Big data from customer loyalty programs, POS, store inventory, and local demographics
continues to be gathered by retail and wholesale stores.
At New York’s Big Show retail trade conference in 2014, companies like Microsoft, Cisco,
and IBM pitched the need for the retail industry to utilize Big Data for analytics and other uses,
including:
Optimized staffing through data from shopping patterns, local events, and so on
Reduced fraud
Social media use also has a lot of potential use and continues to be slowly but surely adopted,
especially by brick and mortar stores. Social media is used for customer prospecting, customer
retention, promotion of products, and more.
Big Data Providers in this industry include First Retail, First Insight, Fujitsu, Infor, Epicor, and
Vistex.
9. Transportation
Industry-specific Big Data Challenges
In recent times, huge amounts of data from location-based social networks and high-speed data
from telecoms have affected travel behavior. Regrettably, research to understand travel behavior
has not progressed as quickly. In most places, transport demand models are still based on poorly
understood new social media structures.
Some applications of Big Data by governments, private organizations, and individuals include:
Government use of Big Data: traffic control, route planning, intelligent transport systems,
congestion management (by predicting traffic conditions)
Individual use of Big Data includes route planning to save on fuel and time, for travel
arrangements in tourism, etc.
Source: Using Big Data in the Transport Sector
Big Data Providers in this industry include Qualcomm and Manhattan Associates.
10. Energy and Utilities
Big Data Providers in this industry include Alstom, Siemens, ABB, and Cloudera.
Conclusion
Having gone through 10 industry verticals and how Big Data plays a role in these
industries, here are a few key takeaways:
3. Vertical industry expertise is key to utilizing Big Data effectively and efficiently.