
MAHARANA PRATAP GROUP OF INSTITUTIONS

KOTHI MANDHANA, KANPUR


(Approved by AICTE, New Delhi and Affiliated to Dr. AKTU, Lucknow)

Digital Notes
[Department of Electronics Engineering]
Subject Name : Data Analytics
Subject Code : KCS-051
Course : B. Tech
Branch : CSE
Semester : V
Prepared by : Mr. Anand Prakash Dwivedi
Unit – 1

Index:

1. Introduction to Data Analytics
2. Characteristics of Data
3. Classification of Data (Structured, Semi-Structured, Unstructured)
4. Stages of Big Data Business Analytics
5. Big Data Challenges
6. Reporting and Analysis
7. Data Analytics Lifecycle
8. Key Roles for Successful Analytic Projects

Introduction to Data Analytics

Analytics is the discovery, interpretation, and communication of meaningful patterns in data, and the application of those patterns to effective decision making. It is an encompassing, multidimensional field that uses mathematics, statistics, predictive modeling, and machine learning techniques to find meaningful patterns and knowledge in recorded data.

Data analysis is the process of inspecting, cleansing, transforming, and modeling data. Data analytics refers to the qualitative and quantitative techniques and processes used to enhance productivity and business gain.
Why Data Analytics
Data Analytics is needed in Business-to-Consumer (B2C) applications. Organisations collect data from customers, businesses, the economy, and practical experience. The data is then processed, categorised as per requirements, and analysed to study purchase patterns and similar behaviour.
The process of Data Analysis
Analysis refers to breaking a whole into its separate components for individual examination. Data analysis is a process for obtaining raw data and converting it into information useful for decision-making by users. Several phases can be distinguished: data requirements, data collection, data processing, data cleaning, exploratory data analysis, modeling and algorithms, data products, and communication.
Scope of Data Analytics
Because of the bright future of data analytics, many professionals and students are interested in a career in this field. Anyone who likes to work with numbers, thinks logically, and can understand figures and turn them into actionable insights has a good future in it. Proper training in the tools of data analytics is required to begin with. Since the subject requires effort to learn and to get certified in, there is always a dearth of qualified professionals. Being a relatively new field, the demand for such professionals also exceeds the current supply. Higher demand, in turn, means higher salaries.
Importance of Data Analytics
● Predict customer trends and behaviours
● Analyse, interpret and deliver data in meaningful ways
● Increase business productivity
● Drive effective decision-making
Skills Required for Data Analytics
1.) Analytical Skills
2.) Numeracy Skills
3.) Technical and Computer Skills
4.) Attention to Details
5.) Business Skills
6.) Communication Skills

The Truth About Data Analytics:


Data analytics matters for businesses that want to make good use of the data they take in. Businesses that can use data analytics properly are more likely than others to succeed and thrive. Among all of the advantages of data analytics, the key benefits can be described in this way:
● Data analytics reduces the costs associated with running a business.
● It cuts down on the time needed to come to strategy-defining decisions.
● Data analytics helps to more accurately define customer trends.

Determining the Effectiveness of Your Analytics Program

Given the growing familiarity and popularity of data analytics, there are a number of advanced analytics programs available on the market. As such, there are certain traits to look for in any analytics solution that will help you gauge just how effective it will be in improving your business.

Characteristics of data
Big data is a term that is used to describe data that is high volume, high
velocity, and/or high variety; requires new technologies and techniques to
capture, store, and analyze it; and is used to enhance decision making, provide
insight and discovery, and support and optimize processes.
 For example, every customer e-mail, customer-service chat, and social media comment may be captured, stored, and analyzed to better understand customers’ sentiments. Web browsing data may capture every mouse movement in order to better understand customers’ shopping behaviors.
 Radio frequency identification (RFID) tags may be placed on every
single piece of merchandise in order to assess the condition and
location of every item.
 Volume: Machine-generated data is produced in larger quantities than non-traditional data.
a. Data volume is increasing exponentially
b. A 44x increase from 2009 to 2020
c. From 0.8 zettabytes (ZB) to 35 ZB
 Velocity: This refers to the speed of data generation and processing.
Data is being generated fast and needs to be processed fast.
Online Data Analytics
Late decisions → missed opportunities
Examples
• E-Promotions: Based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurement requires an immediate reaction
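The velocity idea above, reacting to each reading as it arrives rather than after batch storage, can be sketched in Python. The heart-rate band and the readings below are hypothetical, for illustration only:

```python
# Sketch of velocity: inspect each reading as it arrives and react
# immediately to abnormal values. Thresholds are illustrative, not clinical.

def monitor(readings, low=60, high=100):
    """Yield an alert for every reading outside the normal band."""
    for timestamp, value in readings:
        if not (low <= value <= high):
            yield (timestamp, value)  # abnormal: needs immediate reaction

stream = [("09:00", 72), ("09:01", 118), ("09:02", 75), ("09:03", 45)]
alerts = list(monitor(stream))
print(alerts)  # [('09:01', 118), ('09:03', 45)]
```

Because `monitor` is a generator, it works equally well on an unbounded stream: alerts are produced as readings arrive, not after the whole dataset has been collected.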

 Variety: This refers to the large variety of input data, which in turn generates a large amount of data as output.
a. Various formats, types, and structures
b. Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
c. Static data vs. streaming data

A single application can be generating and collecting many types of data.

Classification of data (structured, semi-structured, unstructured)


Big data has many sources. For example, every mouse click on a web site can be captured in Web log files and analyzed in order to better understand shoppers’ buying behaviors and to influence their shopping by dynamically recommending products.

Social media sources such as Facebook and Twitter generate tremendous amounts of comments and tweets. This data can be captured and analyzed to understand, for example, what people think about new product introductions.
Machines, such as smart meters, generate data. These meters continuously stream data about electricity, water, or gas consumption that can be shared with customers and combined with pricing plans to motivate customers to move some of their energy consumption, such as washing clothes, to non-peak hours. There is a tremendous amount of geospatial (e.g., GPS) data, such as that created by cell phones, that can be used by applications like Foursquare to help you know the locations of friends and to receive offers from nearby stores and restaurants. Image, voice, and audio data can be analyzed for applications such as facial recognition in security systems.

What Comes Under Big Data?

Big data involves the data produced by different devices and applications. Given
below are some of the fields that come under the umbrella of Big Data.

 Black Box Data: It is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings of microphones and earphones, and the performance information of the aircraft.
 Social Media Data: Social media sites such as Facebook and Twitter hold information and views posted by millions of people across the globe.
 Stock Exchange Data: Stock exchange data holds information about the ‘buy’ and ‘sell’ decisions made by customers on shares of different companies.
 Power Grid Data: Power grid data holds information about the power consumed by a particular node with respect to a base station.
 Transport Data: Transport data includes the model, capacity, distance, and availability of a vehicle.
 Search Engine Data: Search engines retrieve lots of data from different databases.

Thus Big Data includes huge volume, high velocity, and extensible variety of data.
The data in it will be of three types.

o Structured data : Relational data.


o Semi Structured data : XML data.
o Unstructured data : Word, PDF, Text, Media Logs.

Structured Data
o It can be defined as data that has a defined, repeating pattern.
o This pattern makes it easier for any program to sort, read, and process the data.
o Processing structured data is much faster and easier than processing data without any specific repeating pattern.
o It is organised data in a prescribed format.
o It is stored in tabular form.
o It is data that resides in fixed fields within a record or file.
o It is formatted data in which entities and their attributes are properly mapped.
o It is used to query and report against predetermined data types.
o Sources: DBMS/RDBMS, flat files, multidimensional databases, legacy databases
Unstructured Data
• It is a set of data that might or might not have any logical or repeating patterns.
• It typically consists of metadata, i.e., additional information related to the data.
• Inconsistent data (files, social media websites, satellites, etc.)
• Data in different formats (e-mails, text, audio, video, or images)
• Sources: social media, mobile data, and text both internal and external to an organization
Semi-Structured Data
• Having a schema-less or self-describing structure, it refers to a form of structured data that contains tags or markup elements in order to separate elements and generate hierarchies of records and fields in the given data.
• In other words, the data is not stored consistently in the rows and columns of a database.
Sources: file systems such as Web data in the form of cookies, and data-exchange
formats
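A small sketch of what "self-describing" means in practice: the XML fragment below (invented for illustration) carries its own tags, and a parser recovers the hierarchy even though one record omits a field.

```python
# Semi-structured data: tags mark up the hierarchy instead of fixed
# rows and columns, so records need not all share the same fields.
import xml.etree.ElementTree as ET

doc = """
<customers>
  <customer id="1"><name>Asha</name><city>Kanpur</city></customer>
  <customer id="2"><name>Ravi</name></customer>
</customers>
"""

root = ET.fromstring(doc)
records = [
    {"id": c.get("id"),
     "name": c.findtext("name"),
     "city": c.findtext("city")}   # None where the element is absent
    for c in root.findall("customer")
]
print(records)
```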

STAGES OF BIG DATA BUSINESS ANALYTICS

The different stages of business analytics are:


1. Descriptive Analytics:
o Here the information present in the data is obtained and summarized. It is primarily concerned with finding the statistics that describe the data. E.g.: How many buyers bought an A.C. in the month of December in previous years?

2. Diagnostic/Discovery Analytics:
o This stage involves finding the reasons for the statistics determined in the previous stage; in other words, why did those statistics occur? E.g.: Why was there an increase or decrease in the sales of A.C.s in the month of December?

3. Predictive Analytics:
o This stage involves predicting possible future events based on the information obtained from the Descriptive and/or Discovery Analytics stages. Possible risks can also be identified in this stage. E.g.: What will the sales improvement be next year (making insights for the future)?
4. Prescriptive Analytics:
o It involves planning actions or making decisions to improve the business based on the predictive analytics. E.g.: How much material should be procured to increase production?
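The four stages can be walked through on a toy dataset; the monthly A.C. sales figures below are invented for illustration:

```python
# Descriptive -> diagnostic -> predictive -> prescriptive on toy sales data.

sales = {"Oct": 120, "Nov": 150, "Dec": 210}

# Descriptive: summarize what happened.
total = sum(sales.values())
avg = total / len(sales)

# Diagnostic: quantify the December jump relative to November.
growth = sales["Dec"] / sales["Nov"]          # 1.4, i.e. a 40% rise

# Predictive: naively extend the same growth rate one month ahead.
predicted_jan = round(sales["Dec"] * growth)

# Prescriptive: decide an action, e.g. stock a 10% buffer over the forecast.
units_to_stock = round(predicted_jan * 1.1)
print(total, avg, predicted_jan, units_to_stock)  # 480 160.0 294 323
```

A real project would replace the naive growth extrapolation with a proper forecasting model, but the division of labour between the four stages is the same.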

What made Big Data needed?

Key Computing Resources for Big Data

 Processing capability: CPU, processor, or node.


 Memory
 Storage
 Network
Techniques towards Big Data

• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization

Why Big Data now?



 More data are being collected and stored


 Open source code

 Commodity hardware

What’s driving Big Data

Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More of a real-time focus
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

Benefits of Big Data

Big data is really critical to our life and is emerging as one of the most important technologies in the modern world. Following are just a few benefits which are well known to all of us:

 Using the information kept in the social network like Facebook, the
marketing agencies are learning about the response for their campaigns,
promotions, and other advertising mediums.
 Using the information in the social media like preferences and product
perception of their consumers, product companies and retail organizations
are planning their production.
 Using the data regarding the previous medical history of patients, hospitals
are providing better and quick service.

BIG DATA ANALYTICS

 Accumulation of raw data captured from various sources (e.g., discussion boards, emails, exam logs, chat logs in e-learning systems) can be used to identify fruitful patterns and relationships.
 By itself, stored data does not generate business value, and this is true of traditional databases, data warehouses, and new technologies such as Hadoop for storing big data. Once the data is appropriately stored, however, it can be analyzed, which can create tremendous value. A variety of analysis technologies, approaches, and products have emerged that are especially applicable to big data, such as in-memory analytics, in-database analytics, and appliances.

Big Data Technologies

Big data technologies are important in providing more accurate analysis, which
may lead to more concrete decision-making resulting in greater operational
efficiencies, cost reductions, and reduced risks for the business.

There are various technologies in the market from different vendors including
Amazon, IBM, Microsoft, etc., to handle big data. While looking into the technologies
that handle big data, we examine the following two classes of technology:

o Operational Big Data

o Analytical Big Data

Operational Big Data

 These include systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
 NoSQL Big Data systems are designed to take advantage of new cloud computing architectures that have emerged over the past decade to allow massive computations to be run inexpensively and efficiently. This makes operational big data workloads much easier to manage, and cheaper and faster to implement.
 Some NoSQL systems can provide insights into patterns and trends based on real-time data with minimal coding and without the need for data scientists and additional infrastructure.

Analytical Big Data

 These include systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.
 MapReduce provides a method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.
 These two classes of technology are complementary and frequently deployed
together.
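The MapReduce model mentioned above can be illustrated with the classic word-count example, reduced here to a single-process Python sketch; a real deployment would distribute the map and reduce steps across many machines:

```python
# Word count in the MapReduce style: map emits (key, 1) pairs, the pairs
# are grouped by key (the "shuffle"), and reduce sums each group.
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)   # map: emit one pair per word

def reduce_phase(pairs):
    groups = defaultdict(int)
    for key, count in pairs:          # shuffle/group by key, then
        groups[key] += count          # reduce: sum counts per key
    return dict(groups)

counts = reduce_phase(map_phase(["big data", "Big Data analytics"]))
print(counts)  # {'big': 2, 'data': 2, 'analytics': 1}
```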

Big Data Challenges

The major challenges associated with big data are as follows:

o Capturing data
o Curation
o Storage
o Searching
o Sharing
o Transfer
o Analysis
o Presentation

To address the above challenges, organizations normally take the help of enterprise servers.

Reporting and Analysis


Living in the era of digital technology and big data has made organizations dependent on the wealth of information data can bring. You might have seen how reporting and analysis are used interchangeably, especially in the manner in which outsourcing companies market their services. While both areas are part of web analytics (note that analytics isn't the same as analysis), there's a vast difference between them, and it's more than just spelling.

It's important that we differentiate the two, because some organizations might be selling themselves short in one area and not reaping the benefits that web analytics can bring to the table. The first core component of web analytics, reporting, is merely organizing data into summaries. On the other hand, analysis is the process of inspecting, cleaning, transforming, and modeling these summaries (reports) with the goal of highlighting useful information.

Simply put, reporting translates data into information, while analysis turns information into insights. Also, reporting should enable users to ask "What?" questions about the information, whereas analysis should answer "Why?" and "What can we do about it?"

Here are five differences between reporting and analysis:

1. Purpose

Reporting has helped companies monitor their data since even before digital technology boomed. Various organizations have been dependent on the information it brings to their business, as reporting extracts that information and makes it easier to understand.

Analysis interprets data at a deeper level. While reporting can link cross-channels of data, provide comparisons, and make information easier to understand (think of dashboards, charts, and graphs, which are reporting tools and not analysis reports), analysis interprets this information and provides recommendations on actions.
2. Tasks

As reporting and analysis have a very fine line dividing them, it's sometimes easy to confuse tasks that have "analysis" labeled on top of them when all they do is reporting. Hence, ensure that your analytics team keeps a healthy balance of both.

Here's a great differentiator to keep in mind for deciding whether what you're doing is reporting or analysis:

Reporting includes building, configuring, consolidating, organizing, formatting, and summarizing. It's very similar to the examples mentioned above, like turning data into charts and graphs, and linking data across multiple channels.

Analysis consists of questioning, examining, interpreting, comparing, and confirming. With big data, predicting is possible as well.

3. Outputs

Reporting and analysis have a push-and-pull effect on their users through their outputs. Reporting has a push approach, as it pushes information to users, and its outputs come in the form of canned reports, dashboards, and alerts.

Analysis has a pull approach, where a data analyst draws information to probe further and to answer business questions. Its outputs can take the form of ad hoc responses and analysis presentations. Analysis presentations comprise insights, recommended actions, and a forecast of their impact on the company, all in a language that's easy to understand at the level of the user who'll be reading and deciding on them.

This is important for organizations to truly realize the value of data: a standard report is not the same as meaningful analytics.

4. Delivery

Considering that reporting involves repetitive tasks, often with truckloads of data, automation has been a lifesaver, especially now with big data. It's not surprising that the first things outsourced were data entry services, since outsourcing companies are perceived as data reporting experts.

Analysis requires a more custom approach, with human minds doing superior reasoning and analytical thinking to extract insights, and technical skills to provide efficient steps toward accomplishing a specific goal. This is why data analysts and scientists are so in demand these days, as organizations depend on them to come up with recommendations from which leaders and business executives can make decisions about their businesses.

5. Value

This isn't about identifying which one brings more value, but rather about understanding that both are indispensable when looking at the big picture. Both should help businesses grow, expand, move forward, and make more profit or increase their value.

Reporting                    | Analysis
-----------------------------|---------------------------
Provides data                | Provides answers
Provides what is asked for   | Provides what is needed
Is typically standardized    | Is typically customized
Does not involve a person    | Involves a person
Is fairly inflexible         | Is extremely flexible

Data Analytics Lifecycle


 Big Data analysis differs from traditional data analysis primarily due to the volume, velocity, and variety characteristics of the data being processed.
 To address the distinct requirements for performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved with acquiring, processing, analyzing, and repurposing data.

Phases of the Data Analytics Lifecycle
Phase 1: Discovery
In this phase,
 The data science team must learn and investigate the problem,
 Develop context and understanding, and
 Learn about the data sources needed and available for the project.
 In addition, the team formulates initial hypotheses that can later be tested
with data.

The team should perform five main activities during this step of the discovery
phase:
 Identify data sources: Make a list of data sources the team may need to test the initial hypotheses outlined in this phase.
o Make an inventory of the datasets currently available and those that can be purchased or otherwise acquired for the tests the team wants to perform.
 Capture aggregate data sources: This is for previewing the data and providing a high-level understanding.
o It enables the team to gain a quick overview of the data and perform further exploration on specific areas.
 Review the raw data: Begin understanding the interdependencies among the data attributes.
o Become familiar with the content of the data, its quality, and its limitations.
 Evaluate the data structures and tools needed: The data type and
structure dictate which tools the team can use to analyze the data.
 Scope the sort of data infrastructure needed for this type of problem: In
addition to the tools needed, the data influences the kind of infrastructure
that's required, such as disk storage and network capacity.
 Unlike many traditional stage-gate processes, in which the team can advance
only when specific criteria are met, the Data Analytics Lifecycle is intended to
accommodate more ambiguity.
 For each phase of the process, it is recommended to pass certain checkpoints
as a way of gauging whether the team is ready to move to the next phase of
the Data Analytics Lifecycle.

Phase 2: Data preparation

 This phase includes steps to explore, preprocess, and condition data prior to modeling and analysis.
 It requires the presence of an analytic sandbox (workspace) in which the team can work with data and perform analytics for the duration of the project.
o The team needs to execute Extract, Load, and Transform (ELT) or Extract, Transform, and Load (ETL) to get data into the sandbox.
o In ETL, users perform processes to extract data from a datastore, perform data transformations, and load the data back into the datastore.
o ELT and ETL are sometimes combined and abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it.
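The extract-transform-load flow described above can be sketched in miniature; plain Python lists and dicts stand in for the source datastore and the analytic sandbox, and all names and figures here are invented:

```python
# Minimal ETL sketch: extract raw rows, transform (clean and cast), and
# load into the sandbox. Real pipelines swap in databases for these lists.

source = [{"name": " Asha ", "spend": "1200"},
          {"name": "ravi", "spend": "950"}]

def extract(store):
    return list(store)                              # E: pull raw rows

def transform(rows):
    return [{"name": r["name"].strip().title(),     # T: clean text
             "spend": int(r["spend"])}              #    and cast types
            for r in rows]

sandbox = []                                        # stands in for the sandbox
sandbox.extend(transform(extract(source)))          # L: load transformed rows
print(sandbox)
```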
Rules for Analytics Sandbox
 When developing the analytic sandbox, collect all kinds of data there, as team
members need access to high volumes and varieties of data for a Big Data
analytics project.
 This can include everything from summary-level aggregated data,
structured data , raw data feeds, and unstructured text data from call
logs or web logs, depending on the kind of analysis the team plans to
undertake.
 A good rule is to plan for the sandbox to be at least 5 to 10 times the size of the original datasets, partly because copies of the data may be created to serve as specific tables or data stores for specific kinds of analysis in the project.
Performing ETLT
 As part of the ETLT step, it is advisable to make an inventory of the data and
compare the data currently available with datasets the team needs.
 Performing this sort of gap analysis provides a framework for understanding
which datasets the team can take advantage of today and where the team
needs to initiate projects for data collection or access to new datasets
currently unavailable.
 A component of this subphase involves extracting data from the available
sources and determining data connections for raw data, online transaction
processing (OLTP) databases, online analytical processing (OLAP)
cubes, or other data feeds.
 Data conditioning refers to the process of cleaning data, normalizing
datasets, and performing transformations on the data.
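Data conditioning as just defined can be sketched on a toy dataset: drop records with missing values, then min-max normalize a numeric column. The smart-meter readings below are invented for illustration:

```python
# Conditioning sketch: cleaning (drop missing values) followed by a
# normalizing transformation (min-max scaling to the [0, 1] range).

raw = [{"id": 1, "kwh": 30.0}, {"id": 2, "kwh": None}, {"id": 3, "kwh": 90.0}]

clean = [r for r in raw if r["kwh"] is not None]    # cleaning step

lo = min(r["kwh"] for r in clean)
hi = max(r["kwh"] for r in clean)
for r in clean:
    r["kwh_norm"] = (r["kwh"] - lo) / (hi - lo)     # min-max normalization

print([(r["id"], r["kwh_norm"]) for r in clean])    # [(1, 0.0), (3, 1.0)]
```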
Common Tools for the Data Preparation Phase
Several tools are commonly used for this phase:
Hadoop can perform massively parallel ingest and custom analysis for web traffic
analysis, GPS location analytics, and combining of massive unstructured data feeds
from multiple sources.
Alpine Miner provides a graphical user interface (GUI) for creating analytic
workflows, including data manipulations and a series of analytic events such as
staged data-mining techniques (for example, first select the top 100 customers, and
then run descriptive statistics and clustering).
OpenRefine (formerly called Google Refine) is “a free, open source, powerful tool for working with messy data.” It is a GUI-based tool for performing data transformations, and it's one of the most robust free tools currently available.

Similar to OpenRefine, Data Wrangler is an interactive tool for data cleaning and transformation. Wrangler was developed at Stanford University and can be used to perform many transformations on a given dataset.

Phase 3: Model Planning


Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase.
 The team explores the data to learn about the relationships between
variables and subsequently selects key variables and the most suitable
models.
 It is during this phase that the team refers to the hypotheses developed in Phase 1, when they first became acquainted with the data and began understanding the business problem or domain area.
Common Tools for the Model Planning Phase
Here are several of the more common ones:
 R has a complete set of modeling capabilities and provides a good
environment for building interpretive models with high-quality code. In
addition, it has the ability to interface with databases via an ODBC connection
and execute statistical tests.
 SQL Analysis services can perform in-database analytics of common data
mining functions, involved aggregations, and basic predictive models.
 SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB. SAS itself is generally used on file extracts, but with SAS/ACCESS, users can connect to relational databases (such as Oracle or Teradata).
Phase 4: Model Building
 In this phase, the data science team develops datasets for training, testing, and production purposes. These datasets enable the data scientist to develop the analytical model and train it ("training data"), while holding aside some of the data ("holdout data" or "test data") for testing the model.
o In addition, in this phase the team builds and executes models based on the work done in the model planning phase.
o The team also considers whether its existing tools will suffice for running the models, or if it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).
 Free or open source tools: R and PL/R, Octave, WEKA, Python
 Commercial tools: Matlab, STATISTICA
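The training/holdout idea in this phase can be sketched with the standard library alone (scikit-learn's `train_test_split` does the same job; the records and the 80/20 ratio here are arbitrary):

```python
# Hold-out split sketch: shuffle with a fixed seed for reproducibility,
# then keep 80% for training and hold aside 20% as test/holdout data.
import random

records = list(range(10))              # stand-in for labelled examples
random.Random(42).shuffle(records)     # fixed seed: repeatable split

cut = int(len(records) * 0.8)
train, holdout = records[:cut], records[cut:]
print(len(train), len(holdout))        # 8 2
```

The model is fit on `train` only; `holdout` is scored once, at the end, so the performance estimate is not biased by data the model has already seen.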
Phase 5: Communicate Results
 In Phase 5, after executing the model, the team needs to compare the outcomes of the modeling to the criteria established for success and failure.
 The team considers how best to articulate the findings and outcomes to the various team members and stakeholders, taking into account caveats, assumptions, and any limitations of the results.
 The team should identify key findings, quantify the business value, and
develop a narrative to summarize and convey findings to stakeholders.
Phase 6: Operationalize
 In the final phase (Operationalize), the team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of users.
 This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before a full deployment.
 The team delivers final reports, briefings, code, and technical documents. In
addition, the team may run a pilot project to implement the models in a
production environment.
Common Tools for the Model Building Phase
Free or Open Source tools:
 R and PL/R: R was described earlier in the model planning phase; PL/R is a procedural language for PostgreSQL with R. Using this approach means that R commands can be executed in-database.
 Octave, a free software programming language for computational modeling, has some of the functionality of Matlab. Because it is freely available, Octave is used in major universities when teaching machine learning.
 WEKA is a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.
 Python is a programming language that provides toolkits for machine learning and analysis, such as scikit-learn, numpy, scipy, and pandas, and related data visualization using matplotlib.
 SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.
 MADlib provides an open-source machine learning library of algorithms that can be executed in-database, for PostgreSQL or Greenplum.

Key Roles for a Successful Analytics Project


 Business User – understands the domain area
 Project Sponsor – provides requirements
 Project Manager – ensures meeting objectives
 Business Intelligence Analyst – provides business domain expertise based
on deep understanding of the data
 Database Administrator (DBA) – creates DB environment
 Data Engineer – provides technical skills, assists data management and
extraction, supports analytic sandbox
 Data Scientist – provides analytic techniques and modeling
