
Big Data Analytics

Module 1
What is Big Data?
• Big data is data that exceeds the processing
capacity of conventional database systems.
• This data comes from everywhere: sensors used to
gather climate information, posts to social media
sites, digital pictures and videos, purchase
transaction records, and cell phone GPS signals to
name a few. This data is big data.
• Big data usually includes data sets with sizes
beyond the ability of commonly used software
tools to capture, curate, manage, and process the
data within a tolerable elapsed time.
Categories of Big Data
• Structured
• Written in a format that’s easy for machines to
understand.
• Structured data is easily searchable by basic algorithms.
• Examples: fields, tables, columns, RDBMS, spreadsheets

• Semi-structured
• Markers/Tags to separate elements
• XML/HTML
• Unstructured
• No fields/attributes
• More like Human Language
• Free form text (E-mail body, notes, articles,…)
• Audio, video, and image
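• As a rough illustration (not from the original slides), the short Python sketch below shows the same purchase record in each of the three categories; the field names and values are made up.

```python
# Illustrative sketch: the same purchase record expressed in the three
# categories described above. All names and values are hypothetical.
import xml.etree.ElementTree as ET

# Structured: fixed fields, easy for basic algorithms to search and filter.
structured = {"customer_id": 101, "item": "laptop", "amount": 799.00}

# Semi-structured: markers/tags (here, XML) separate the elements.
semi_structured = "<purchase><customer>101</customer><item>laptop</item><amount>799.00</amount></purchase>"
amount = float(ET.fromstring(semi_structured).find("amount").text)

# Unstructured: free-form text; no fields or attributes to query directly.
unstructured = "Customer 101 emailed to say the laptop they bought for $799 arrived late."

print(structured["amount"], amount)
```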
Big Data Analytics

• Big (and small) data analytics is the process of examining data—typically of a variety of sources, types, volumes, and/or complexities—to uncover hidden patterns, unknown correlations, and other useful information.
Characteristics of Big Data
• Big Data is commonly characterized by three Vs: Volume, Velocity, and Variety (the fraud discussion later in this module applies the same three Vs).
Why Big Data?
• Understanding and Targeting Customers
• Understanding and Optimizing Business Processes
• Personal Quantification and Performance
Optimization
• Improving Healthcare and Public Health
• Improving Sports Performance
• Improving Science and Research
• Optimizing Machine and Device Performance
• Improving Security and Law Enforcement.
• Improving and Optimizing Cities and Countries
• Financial Trading
Unstructured data

• Unstructured data is information that either does not have a predefined data model and/or does not fit well into a relational database.
• The term "big data" is closely associated with
unstructured data. Big data refers to extremely
large datasets that are difficult to analyze with
traditional tools.
Web Analytics

• Web analytics is the measurement, collection, analysis, and reporting of web data for purposes of understanding and optimizing web usage.
• Web event data is incredibly valuable
• Web analytics tools are good at delivering the standard reports
that are common across different business types
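• As a hedged illustration of the measurement-and-reporting idea, the sketch below aggregates a few hypothetical page-view events into per-URL views and unique visitors; the event records and field names are invented for the example.

```python
# Minimal sketch of a standard web-analytics report: aggregate raw web
# events into page views and unique visitors per URL. Hypothetical data.
from collections import defaultdict

events = [
    {"user": "u1", "url": "/home"},
    {"user": "u2", "url": "/home"},
    {"user": "u1", "url": "/pricing"},
    {"user": "u1", "url": "/home"},
]

views = defaultdict(int)
visitors = defaultdict(set)
for e in events:
    views[e["url"]] += 1
    visitors[e["url"]].add(e["user"])

for url in views:
    print(url, "views:", views[url], "unique visitors:", len(visitors[url]))
```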
Big Data and Marketing

• Today’s consumers have changed. They’ve put down the newspaper, they fast-forward through TV commercials, and they junk unsolicited email.
• Today’s cross-channel consumer is more
dynamic, informed, and unpredictable than
ever.
• The Right Approach: Cross-Channel Lifecycle
Marketing
Empowering Marketing with Social
Intelligence

• Very intelligent software is required to parse all that social data to define things like the sentiment of a post.
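• A deliberately simple, lexicon-based sketch of scoring post sentiment is shown below; real social-intelligence software uses much richer language models, and the word lists here are invented.

```python
# Toy lexicon-based sentiment scoring for a social post. The word lists
# and the scoring rule are assumptions made purely for illustration.
POSITIVE = {"love", "great", "excellent", "happy"}
NEGATIVE = {"hate", "terrible", "awful", "angry"}

def sentiment(post: str) -> str:
    words = post.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love this product, the support was great"))  # positive
```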
Fraud and Big Data
• Fraud is intentional deception made for personal gain or to
damage another individual.
• To prevent fraud, credit card transactions are
monitored and checked in real time (a minimal rule-based sketch follows this list).
• The Capgemini Financial Services team believes that due to
the nature of data streams and processing required, Big Data
technologies provide an optimal technology solution based on
the following three Vs:
• 1. High volume. Years of customer records and transactions
(150 billion+ records per year)
• 2. High velocity. Dynamic transactions and social media
information
• 3. High variety. Social media plus other unstructured data
such as customer emails, call center conversations, as well as
transactional structured data
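• The sketch below is one minimal way such a real-time check could look: a couple of hand-written rules over a single card's recent history. The thresholds, history, and field names are assumptions for illustration, not the actual approach used by any vendor.

```python
# Hypothetical real-time rule check on a card transaction: flag it if the
# amount is far above the customer's recent average, or if the country
# changed implausibly fast. All values and thresholds are made up.
from statistics import mean

history = [42.0, 18.5, 60.0, 25.0]           # recent amounts for one card
last_country, minutes_since_last = "IN", 12  # hypothetical prior context

def is_suspicious(amount, country):
    too_large = amount > 5 * mean(history)
    country_jump = country != last_country and minutes_since_last < 60
    return too_large or country_jump

print(is_suspicious(900.0, "IN"))  # True: amount far above recent average
print(is_suspicious(30.0, "US"))   # True: country changed within an hour
```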
Risk and Big Data

• Many of the world’s top analytics professionals work in risk management.
• The two most common types of risk
management are credit risk management and
market risk management.
• A third type of risk, operational risk
management, isn’t as common as credit and
market risk.
Credit Risk Management
• Credit risk management is a critical function that
spans a diversity of businesses across a wide
range of industries.
• Whether you’re a small B2B regional plastics
manufacturer or a large global consumer financial
institution, the underlying credit risk principles
are essentially the same: driving the business
using the optimal balance of risk and reward.
• Credit risk professionals are stakeholders in key
decisions that address all aspects of a business,
from finding new and profitable customers to
maintaining and growing relationships with
existing customers.
Big Data and Algorithmic Trading

• Algorithmic trading relies on sophisticated mathematics to determine buy and sell orders for equities, commodities, interest rates, foreign exchange, derivatives, and fixed-income instruments at blinding speed.
• A key component of algorithmic trading is
determining the return and the risk of each
potential trade, and then making a decision to
buy or sell (see the sketch after this list).
• Crunching Through Complex Interrelated Data
• Intraday Risk Analytics, a Constant Flow of
Big Data
• Calculating Risk in Marketing
• Other Industries Benefit from Financial
Services’ Risk Experience
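• As a toy illustration of the return-and-risk decision mentioned above (not an actual trading algorithm), the sketch below estimates expected return and volatility from a few recent prices and buys only when the reward-to-risk ratio clears an arbitrary threshold.

```python
# Toy return/risk decision: estimate expected return and volatility from
# recent prices, then buy only if reward-to-risk clears a threshold.
# Prices and the 0.5 threshold are made-up illustration values.
from statistics import mean, stdev

prices = [100.0, 100.8, 101.5, 100.9, 102.2, 103.0]
returns = [(b - a) / a for a, b in zip(prices, prices[1:])]

expected_return = mean(returns)  # average per-period return
risk = stdev(returns)            # volatility of those returns

decision = "BUY" if risk > 0 and expected_return / risk > 0.5 else "HOLD"
print(round(expected_return, 4), round(risk, 4), decision)
```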
Big Data and Advances in Health Care
• Big Data promises an enormous revolution in health
care, with important advancements in everything
from the management of chronic disease to the
delivery of personalized medicine.
• Health care challenges are forcing the pharmaceutical
business model to undergo rapid change.
Big Data Technology

• Technology is radically changing the way data is produced, processed, analyzed, and consumed.
• The Elephant in the Room: Hadoop’s Parallel World
• There are many Big Data technologies that have been
making an impact on the new technology stacks for
handling Big Data, but Apache Hadoop is one
technology that has been the darling of Big Data talk.
• Hadoop is an open-source platform for storage and
processing of diverse data types that enables data-
driven enterprises to rapidly derive the complete value
from all their data.
• The original creators of Hadoop are Doug Cutting
(formerly at Yahoo!, now at Cloudera) and Mike Cafarella
(now teaching at the University of Michigan in Ann
Arbor).
• Hadoop handles a variety of workloads, including
search, log processing, recommendation systems, data
warehousing, and video/image analysis.
• Apache Hadoop is an open-source project administered
by the Apache Software Foundation.
• Unlike traditional, structured platforms, Hadoop is able
to store any kind of data in its native format and to
perform a wide variety of analyses and transformations
on that data.
• Hadoop stores terabytes, and even petabytes, of data
inexpensively
The two critical components of Hadoop are:
• 1. The Hadoop Distributed File System (HDFS). HDFS is the
storage system for a Hadoop cluster. When data lands in the
cluster, HDFS breaks it into pieces and distributes those pieces
among the different servers participating in the cluster. Each
server stores just a small fragment of the complete data set, and
each piece of data is replicated on more than one server.
• 2. MapReduce. Because Hadoop stores the entire dataset in
small pieces across a collection of servers, analytical jobs can
be distributed, in parallel, to each of the servers storing part of
the data. Each server evaluates the question against its local
fragment simultaneously and reports its results back for
collation into a comprehensive answer. MapReduce is the agent
that distributes the work and collects the results.
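• A conceptual, in-process sketch of that flow is shown below; it mimics the map, shuffle, and reduce phases on a couple of text records, whereas a real Hadoop job would run them in parallel on the servers holding the data fragments.

```python
# Conceptual word-count sketch of the MapReduce flow described above:
# map each record to key/value pairs, shuffle pairs by key, then reduce
# each key's values. Here everything runs in one process for clarity.
from collections import defaultdict

records = ["big data is big", "hadoop stores big data"]

# Map: each "server" would emit (word, 1) pairs for its local fragment.
mapped = [(word, 1) for line in records for word in line.split()]

# Shuffle: group intermediate pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: collate each group into a comprehensive answer.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 3, 'data': 2, ...}
```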
• The new approach is based on two foundational
concepts.
⮚ Data needs to be stored in a system in which the hardware
is infinitely scalable. In other words, you cannot allow
hardware (storage and network) to become the bottleneck.
⮚ Data must be processed and converted into usable
business intelligence where it sits. Put simply, you must
move the code to the data and not the other way around.
• Today we can run the algorithm, look at the
results, extract the results, and feed the business
process—automatically and at massive scale,
using all of the data available.
Data Discovery: Work the Way People’s
Minds Work
• There is a lot of buzz in the industry about data
discovery, the term used to describe the new wave of
business intelligence that enables users to explore
data, make discoveries, and uncover insights in a
dynamic and intuitive way versus predefined queries
and preconfigured drill-down dashboards.
• This approach has resonated with many business
users who are looking for the freedom and flexibility
to view Big Data.
• In fact, there are two software companies that stand
out in the crowd by growing their businesses at
unprecedented rates in this space: Tableau Software
and QlikTech International.
Open-Source Technology for Big
Data Analytics

• One of the great benefits of open source lies in the flexibility of the adoption model: you download and deploy it when you need it.
• An open-source stack is defined by its
community of users and contributors. No one
“controls” an open-source stack, and no one
can predict exactly how it will evolve.
The Cloud and Big Data

• It is important to remember that, for all kinds of reasons—technical, political, social, regulatory, and cultural—cloud computing has not yet been widely adopted as a model for enterprises to store their Big Data assets.
• With a cloud model, you pay on a subscription basis
with no upfront capital expense.
• The ability to build massively scalable platforms—
platforms where you have the option to keep adding
new products and services for zero additional cost—
is giving rise to business models that weren’t possible
before.
Software as a Service BI

• The software industry has seen some companies excel at the software-as-a-service (SaaS) model, such as salesforce.com.
• The basic principle is to make it easy for companies
to gain access to solutions without the headache of
building and maintaining their own onsite
implementation.
• Another common buying factor for SaaS is the
immediate access to talent, especially in the world of
information management, business intelligence (BI),
and predictive analytics.
Three elements that have impacted the viability of mobile BI:
1. Location—the GPS component tells you where the user is at a point in time, as well as how they are moving.
2. It’s not just about pushing data; you can transact with your
smart phone based on information you get.
3. Multimedia functionality allows the visualization pieces to
really come into play.
Three challenges with mobile BI include:
1. Managing standards for rolling out these devices.
2. Managing security (always a big challenge).
3. Managing “bring your own device,” where you have
devices both owned by the company and devices owned by the
individual, both contributing to productivity.
Crowdsourcing Analytics
• Crowdsourcing is a recognition that you can’t
possibly always have the best and brightest internal
people to solve all your big problems.
• Crowdsourcing is a great way to capitalize on the
resources that can build algorithms and predictive
models.
• It takes years of learning and experience to get the
knowledge to create algorithms and predictive
models.
• So crowdsourcing is a way to capitalize on the
limited resources that are available in the
marketplace.
• Crowdsourcing is a disruptive business model whose
roots are in technology but which is extending beyond
technology into other areas.
• There are various types of crowdsourcing, such as
crowd voting, crowd purchasing, wisdom of crowds,
crowd funding, and contests.
• For example:
– 99designs.com/, which does crowdsourcing of graphic
design
– agentanything.com/, which posts “missions” where agents
vie to run errands
– 33needs.com/, which allows people to contribute to
charitable programs that make a social impact
Inter- and Trans-Firewall Analytics

• Decision science is witnessing a similar trend as enterprises are beginning to collaborate on insights across the value chain.
• For instance, in the health care industry, rich
consumer insights can be generated by collaborating
on data and insights from the health insurance
provider, pharmacy delivering the drugs, and the drug
manufacturer.
• For example, there are instances where a retailer and
a social media company can come together to share
insights on consumer behavior that will benefit both
players.
• Some of the more progressive companies are taking
this a step further and working on leveraging the
large volumes of data outside the firewall such as
social data, location data, and so forth.
• It will not be very long before internal data and
insights from within the firewall are no longer a
differentiator. We see this trend as the move from
intra- to inter- and trans-firewall analytics.
• Today, most companies are doing intra-firewall analytics with
data within the firewall.
• Tomorrow they will be collaborating on insights with
other companies to do inter-firewall analytics, as well
as leveraging public domain spaces to do trans-firewall
analytics.
Value Chain for Inter-Firewall and
Trans-Firewall Analytics
