Digital Notes
[Department of Electronics Engineering]
Subject Name : Data Analytics
Subject Code : KCS-051
Course : B. Tech
Branch : CSE
Semester : V
Prepared by : Mr. Anand Prakash Dwivedi
Unit – 1
Introduction to Data Analytics
Characteristics of data
Big data is a term that is used to describe data that is high volume, high
velocity, and/or high variety; requires new technologies and techniques to
capture, store, and analyze it; and is used to enhance decision making, provide
insight and discovery, and support and optimize processes.
For example, every customer e-mail, customer-service chat, and social media post can be captured, stored, and analyzed to better understand customer behaviour and support decisions.
Variety: This refers to the wide variety of input data, which in turn generates a large amount of data as output.
a. Various formats, types, and structures
b. Text, numerical, images, audio, video, sequences,
time series, social media data, multi-dim arrays,
etc…
c. Static data vs. streaming data
Big data involves the data produced by many different devices and applications; fields such as social media, stock exchanges, search engines, and sensor-equipped devices all come under the umbrella of Big Data.
Thus Big Data includes huge volume, high velocity, and an extensible variety of data. The data in it will be of three types:
Structured Data
o It can be defined as the data that has a defined repeating pattern.
o This pattern makes it easier for any program to sort, read, and process the
data.
o Processing structured data is much faster and easier than processing data
without any specific repeating pattern.
o Is organised data in a prescribed format.
o Is stored in tabular form.
o Is the data that resides in fixed fields within a record or file.
o Is formatted data in which entities and their attributes are properly mapped.
o Is used to query and report against predetermined data types.
o Sources: DBMS/RDBMS, Flat files, Multidimensional databases, Legacy databases (a small sketch follows this list).
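As a small illustration (not part of the original notes), the sketch below stores structured records with fixed fields in an SQLite table and queries them with SQL; the table name orders and its columns are invented for the example.

import sqlite3

# Structured data: every record has the same fixed fields (id, customer, amount),
# so it can be stored in a table and queried directly.
conn = sqlite3.connect(":memory:")   # throwaway in-memory database for the demo
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Asha", 250.0), (2, "Ravi", 120.5), (3, "Meena", 310.0)],
)

# Because the pattern repeats, sorting and filtering are straightforward.
for row in conn.execute(
    "SELECT customer, amount FROM orders WHERE amount > 200 ORDER BY amount"
):
    print(row)

conn.close()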
Unstructured Data
• It is a set of data that might or might not have any logical or repeating
patterns.
• Typically accompanied by metadata, i.e., additional information related to the data.
• Inconsistent data (files, social media websites, satellites, etc.)
• Data in different formats (e-mails, text, audio, video, or images).
• Sources: Social media, mobile data, and text both internal and external to an organization (a small sketch follows this list).
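As a small illustration (not part of the original notes), the snippet below takes free-form support messages, which have no fixed field layout, and uses simple patterns to pull out pieces of structure such as order numbers and e-mail addresses; the messages and patterns are invented.

import re

# Unstructured data: free-form text with no fixed, repeating field layout.
support_messages = [
    "Customer wrote: my order #4521 never arrived, contact me at priya@example.com",
    "Tweet: loving the new phone!! battery could be better though",
]

# To analyze it, some structure usually has to be extracted first,
# e.g. pulling out order numbers and e-mail addresses with patterns.
for msg in support_messages:
    order_ids = re.findall(r"#(\d+)", msg)
    emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", msg)
    print({"order_ids": order_ids, "emails": emails, "length": len(msg)})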
Semi-Structured Data
• Having a schema-less or self-describing structure, it refers to a form of structured data that contains tags or markup elements in order to separate elements and generate hierarchies of records and fields in the given data.
• In other words, the data does not fit consistently into the rows and columns of a database.
• Sources: File systems such as Web data in the form of cookies, and data exchange formats (a small sketch follows this list).
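A minimal sketch of semi-structured data (not part of the original notes): the JSON below is self-describing, the keys act as tags that create a hierarchy of records and fields, and different records may carry different fields. The field names are invented.

import json

# Semi-structured data: self-describing keys create a hierarchy of
# records and fields, but different records may carry different fields.
raw = """
[
  {"user": "anita", "likes": ["cricket", "music"], "age": 28},
  {"user": "vikram", "location": {"city": "Lucknow", "pin": "226001"}}
]
"""

records = json.loads(raw)          # the markup itself describes the structure
for rec in records:
    # Fields vary per record, so access them defensively.
    print(rec.get("user"), "->", sorted(rec.keys()))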
3. Predictive Analytics:
o This stage involves predicting possible future events based on the information obtained from the Descriptive and/or Discovery Analytics stages. In this stage, possible risks can also be identified. Eg: What will the sales improvement be next year (making insights for the future)?
4. Prescriptive Analytics:
o It involves planning actions or making decisions to improve the business based on the predictive analytics. Eg: How much material should be procured to increase production? (A small forecasting sketch follows below.)
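As a minimal sketch (not from the notes; the yearly sales figures and the procurement ratio are invented), the following Python code fits a straight-line trend to past sales, projects next year's value (the predictive step), and converts that forecast into a rough procurement plan (the prescriptive step).

# Illustrative only: the yearly sales figures below are made up.
years = [2019, 2020, 2021, 2022, 2023]
sales = [110.0, 125.0, 138.0, 151.0, 167.0]   # e.g. in lakhs of rupees

# Fit a straight line (least squares) to the historical data.
n = len(years)
mean_x = sum(years) / n
mean_y = sum(sales) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(years, sales))
         / sum((x - mean_x) ** 2 for x in years))
intercept = mean_y - slope * mean_x

# Predictive step: project next year's sales from the fitted trend.
forecast_2024 = slope * 2024 + intercept
print(f"Forecast for 2024: {forecast_2024:.1f}")

# Prescriptive step (simplified): plan procurement as a function of the forecast.
material_per_unit_sales = 0.8                  # assumed ratio, invented for the example
print(f"Plan to procure about {forecast_2024 * material_per_unit_sales:.1f} units of material")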
• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization
• Commodity Hardware (see the sketch below)
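Big data platforms achieve massive parallelism by spreading work across many commodity machines and the cores within each machine. The toy sketch below (not part of the original notes) mimics that divide-process-combine pattern on a single machine using Python's standard multiprocessing module; the data and chunk size are invented.

from multiprocessing import Pool

def summarize(chunk):
    """Compute a partial result (here just a sum) for one chunk of data."""
    return sum(chunk)

if __name__ == "__main__":
    # Pretend this is a large dataset split into chunks across commodity cores.
    data = list(range(1_000_000))
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    with Pool() as pool:                        # one worker per available core
        partials = pool.map(summarize, chunks)  # chunks processed in parallel

    print("total =", sum(partials))             # combine the partial results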
Big data is really critical to our lives and is emerging as one of the most important technologies in the modern world. Following are just a few benefits which are well known to all of us:
• Using the information kept in social networks like Facebook, marketing agencies are learning about the response to their campaigns, promotions, and other advertising media.
• Using information from social media, such as the preferences and product perceptions of their consumers, product companies and retail organizations are planning their production.
• Using data regarding the previous medical history of patients, hospitals are providing better and quicker service.
Big data technologies are important in providing more accurate analysis, which
may lead to more concrete decision-making resulting in greater operational
efficiencies, cost reductions, and reduced risks for the business.
There are various technologies in the market from different vendors including
Amazon, IBM, Microsoft, etc., to handle big data. While looking into the technologies
that handle big data, we examine the following two classes of technology:
The first class, Operational Big Data, includes systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
NoSQL Big Data systems are designed to take advantage of new cloud
computing architectures that have emerged over the past decade to allow
massive computations to be run inexpensively and efficiently. This makes
operational big data workloads much easier to manage, cheaper, and faster to
implement.
Some NoSQL systems can provide insights into patterns and trends based on
real-time data with minimal coding and without the need for data scientists
and additional infrastructure.
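As a minimal sketch of such an operational, real-time workload (not part of the original notes), the snippet below captures and queries a document with MongoDB. It assumes the pymongo driver is installed and a MongoDB server is reachable on the default local port; the database and collection names (shop, clickstream) are invented.

from pymongo import MongoClient   # assumes the pymongo driver is installed

# Assumes a MongoDB server is running locally on the default port.
client = MongoClient("mongodb://localhost:27017/")
db = client["shop"]                      # database and collection names are made up
events = db["clickstream"]

# Operational workload: capture an interactive event as it happens...
events.insert_one({"user": "u101", "page": "/checkout", "action": "click"})

# ...and read it back with a simple real-time query.
for doc in events.find({"user": "u101"}):
    print(doc)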
The major challenges associated with big data include:
o Capturing data
o Curation
o Storage
o Searching
o Sharing
o Transfer
o Analysis
o Presentation
To address the above challenges, organizations normally take the help of enterprise servers.
It's important that we differentiate reporting from analysis, because some organizations might be selling themselves short in one area and not reap the benefits which web analytics can bring to the table. The first core component of web analytics, reporting, is
merely organizing data into summaries. On the other hand, analysis is the process
of inspecting, cleaning, transforming, and modeling these summaries (reports) with
the goal of highlighting useful information.
Simply put, reporting translates data into information while analysis turns
information into insights. Also, reporting should enable users to ask “What?”
questions about the information, whereas analysis should answer “Why?” and “What can we do about it?”
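To make the distinction concrete, here is a minimal sketch (not from the notes; it assumes the pandas library is available and the channel figures are invented). The grouped summary table is reporting; computing and interpreting the conversion rate to ask why a channel underperforms is a small step toward analysis.

import pandas as pd   # assumes pandas is available

# Invented web-analytics data for illustration.
visits = pd.DataFrame({
    "channel": ["email", "email", "social", "social", "search"],
    "sessions": [120, 90, 300, 280, 150],
    "conversions": [12, 9, 6, 5, 15],
})

# Reporting: organize the data into a summary ("What happened?").
report = visits.groupby("channel")[["sessions", "conversions"]].sum()
print(report)

# Analysis: probe the summary for the "Why?" behind it,
# e.g. social brings the most sessions but converts the worst.
report["conversion_rate"] = report["conversions"] / report["sessions"]
print(report.sort_values("conversion_rate"))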
1. Purpose
Reporting has helped companies monitor their data since even before digital technology boomed. Various organizations have depended on the information it brings to their business, as reporting extracts that information and makes it easier to understand.
Analysis interprets data at a deeper level. While reporting can link data across channels, provide comparisons, and make information easier to understand (think of dashboards, charts, and graphs, which are reporting tools and not analysis outputs), analysis interprets this information and provides recommendations on actions.
2. Tasks
Because a very fine line divides reporting and analysis, it is sometimes easy to label a task as analysis when all it really does is reporting. Hence, ensure that your analytics team keeps a healthy balance of both.
3. Outputs
Reporting and analysis have a push-and-pull effect on their users through their outputs. Reporting has a push approach: it pushes information to users, and its outputs come in the form of canned reports, dashboards, and alerts.
Analysis has a pull approach, where a data analyst draws out information to probe further and to answer business questions. Its outputs can take the form of ad hoc responses and analysis presentations. Analysis presentations are composed of insights, recommended actions, and a forecast of their impact on the company, all in a language that is easy to understand at the level of the user who will be reading and deciding on it.
It is important for organizations to realize the true value of data and to recognize that a standard report is not the same as meaningful analytics.
4. Delivery
5. Value
This isn't about identifying which one brings more value, but rather about understanding that both are indispensable when looking at the big picture. Together, they should help businesses grow, expand, move forward, and make more profit or increase their value.
[Comparison table: Reporting vs. Analysis]
Phases of the Data Analytics Lifecycle
Phase 1: Discovery
In this phase,
The data science team must learn and investigate the problem,
Develop context and understanding, and
Learn about the data sources needed and available for the project.
In addition, the team formulates initial hypotheses that can later be tested
with data.
The team should perform five main activities during this step of the discovery
phase:
Identify data sources: Make a list of data sources the team may need to test
the initial hypotheses outlined in this phase.
o In ETL, the team extracts data from a datastore, performs data transformations, and loads the data back into the datastore.
o The ELT and ETL are sometimes abbreviated as ETLT. Data should be
transformed in the ETLT process so the team can work with it and
analyze it.
Rules for Analytics Sandbox
When developing the analytic sandbox, collect all kinds of data there, as team
members need access to high volumes and varieties of data for a Big Data
analytics project.
This can include everything from summary-level aggregated data and structured data to raw data feeds and unstructured text data from call logs or web logs, depending on the kind of analysis the team plans to undertake.
A good rule is to plan for the sandbox to be at least 5 to 10 times the size of the original datasets, partly because copies of the data may be created to serve as specific tables or data stores for specific kinds of analysis in the project. (A quick sizing example follows below.)
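A quick worked example of that rule of thumb (not part of the original notes; the dataset names and sizes below are invented):

# Rough sandbox sizing based on the 5-10x rule of thumb from the notes.
original_datasets_gb = {"web_logs": 400, "crm_extract": 120, "call_logs": 80}  # invented sizes

total_gb = sum(original_datasets_gb.values())
low_estimate = 5 * total_gb
high_estimate = 10 * total_gb

print(f"Original data: {total_gb} GB")
print(f"Plan sandbox capacity of roughly {low_estimate}-{high_estimate} GB")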
Performing ETLT
As part of the ETLT step, it is advisable to make an inventory of the data and
compare the data currently available with datasets the team needs.
Performing this sort of gap analysis provides a framework for understanding
which datasets the team can take advantage of today and where the team
needs to initiate projects for data collection or access to new datasets
currently unavailable.
A component of this subphase involves extracting data from the available
sources and determining data connections for raw data, online transaction
processing (OLTP) databases, online analytical processing (OLAP)
cubes, or other data feeds.
Data conditioning refers to the process of cleaning data, normalizing
datasets, and performing transformations on the data.
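As a minimal sketch of data conditioning (not part of the original notes; it assumes the pandas library is available and the raw feed is invented), the code below cleans incomplete records, fixes data types, and normalizes values:

import pandas as pd   # assumes pandas is available

# Invented raw feed with the kinds of problems data conditioning fixes.
raw = pd.DataFrame({
    "customer": ["Asha", "Ravi", None, "Meena"],
    "amount": ["250", "120.5", "310", "not available"],
    "region": ["north", "North", "NORTH", "south"],
})

clean = raw.dropna(subset=["customer"]).copy()                     # cleaning: drop incomplete records
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")  # transformation: fix types
clean["region"] = clean["region"].str.lower()                      # normalization: consistent values

# Normalize the numeric column to a 0-1 scale for later analysis.
clean["amount_scaled"] = (clean["amount"] - clean["amount"].min()) / \
                         (clean["amount"].max() - clean["amount"].min())
print(clean)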
Common Tools for the Data Preparation Phase
Several tools are commonly used for this phase:
Hadoop can perform massively parallel ingest and custom analysis for web traffic
analysis, GPS location analytics, and combining of massive unstructured data feeds
from multiple sources.
Alpine Miner provides a graphical user interface (GUI) for creating analytic
workflows, including data manipulations and a series of analytic events such as
staged data-mining techniques (for example, first select the top 100 customers, and
then run descriptive statistics and clustering).
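A rough Python equivalent of that staged workflow (not part of the original notes; it assumes pandas, NumPy, and scikit-learn are installed, and the customer data is randomly generated for illustration):

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans   # assumes scikit-learn is installed

# Invented customer data standing in for a staged data-mining workflow.
rng = np.random.default_rng(0)
customers = pd.DataFrame({
    "customer_id": range(1, 501),
    "annual_spend": rng.gamma(2.0, 500.0, 500),
    "visits": rng.poisson(12, 500),
})

# Step 1: select the top 100 customers by spend.
top100 = customers.nlargest(100, "annual_spend").copy()

# Step 2: descriptive statistics on that subset.
print(top100[["annual_spend", "visits"]].describe())

# Step 3: cluster the subset into a few segments.
top100["segment"] = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    top100[["annual_spend", "visits"]]
)
print(top100["segment"].value_counts())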
OpenRefine (formerly called Google Refine) is “a free, open source, powerful tool for working with messy data.” It is a GUI-based tool for performing data transformations, and it is one of the most robust free tools currently available.
Similar to OpenRefine, Data Wrangler is an interactive tool for data cleaning and
transformation. Wrangler was developed at Stanford University and can be used to
perform many transformations on a given dataset.