Chapter 2: Data Science
Chapter contents :
As an academic discipline and profession, data science continues to evolve as one of
the most promising and in-demand career paths for skilled professionals.
Data professionals understand that they must advance beyond the traditional skills of
analyzing large amounts of data, data mining, and programming.
What are data and information?
Data is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9),
or special characters (+, -, /, *, <, >, =, etc.).
Information is defined as :- processed and organized data that is meaningful and useful to
the person receiving it.
Data Processing Cycle
Data Processing Cycle is the sequence of steps or operations used to
transform raw data into useful information.
I. Input
II. Processing – in this step, the input data is changed to produce data in a more useful
form; that is, raw data is transformed into a more usable form.
For example, interest can be calculated on a deposit to a bank, or a summary of
sales for the month can be calculated from the sales orders.
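To make the cycle concrete, here is a minimal Python sketch of the three steps applied to the bank-interest example above; the account names, deposit amounts, and the 5% annual rate are made-up illustrative values, not figures from the text.

```python
# A minimal sketch of the data processing cycle: input -> processing -> output.
# The deposit figures and the 5% interest rate are made-up illustrative values.

def input_step():
    # Input: raw data is collected in a form convenient for processing.
    return [("acct-001", 1000.00), ("acct-002", 2500.00), ("acct-003", 400.00)]

def processing_step(deposits, annual_rate=0.05):
    # Processing: raw data is transformed into a more useful form,
    # here by calculating the interest earned on each deposit.
    return [(acct, amount, round(amount * annual_rate, 2)) for acct, amount in deposits]

def output_step(results):
    # Output: the processed data (information) is presented to the user.
    for acct, amount, interest in results:
        print(f"{acct}: deposit={amount:.2f}, interest={interest:.2f}")

output_step(processing_step(input_step()))
```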
Data types can be described from diverse perspectives. Here are some of the perspectives:-
i. Data types from a computer programming perspective
Common data types from a programmer's perspective include:
Integers (int) – used to store whole numbers.
For instance, Integers = { ..., −4, −3, −2, −1, 0, 1, 2, 3, 4, ... }
Booleans (bool) – used to represent a value restricted to one of two values: true or false.
Characters (char) – used to store a single character (a letter, digit, symbol, etc.).
Floating-point numbers (float) – used to store real numbers.
Alphanumeric strings (string) – used to store a combination of characters and numbers.
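As a quick illustration, the snippet below shows each of these types in Python; the variable names and values are arbitrary examples (note that Python has no separate char type, so a one-character string stands in for it).

```python
# Common data types from a programming perspective, illustrated in Python.
age = 25                 # int: whole numbers (..., -2, -1, 0, 1, 2, ...)
is_enrolled = True       # bool: restricted to one of two values, True or False
grade = 'A'              # char: a single character (here a 1-character string)
gpa = 3.75               # float: real (floating-point) numbers
student_id = "DS-2024"   # string: a combination of letters, digits, and symbols

for value in (age, is_enrolled, grade, gpa, student_id):
    print(type(value).__name__, value)
```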
ii. Data types from a data analytics perspective
From a data analytics point of view, there are three common types of data.
A. Structured Data:- adheres to/follows a pre-defined data model and is
therefore highly organized and straightforward to analyze.
It conforms to a tabular format, i.e., it is organized in rows and columns.
e.g. Excel files or SQL databases
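As a small illustration of structured data, the sketch below builds a tiny SQL table with Python's built-in sqlite3 module; the table name, columns, and rows are made up for the example.

```python
import sqlite3

# Structured data follows a pre-defined model: every row has the same columns.
conn = sqlite3.connect(":memory:")           # throw-away in-memory SQL database
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "Jan", 1200.0), ("South", "Jan", 950.0), ("North", "Feb", 1340.0)],
)

# Because the data is organized in rows and columns, it is straightforward to analyze.
for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)
```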
Data Value Chain
• It is introduced to describe the information flow within a big data system as a
series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level activities:
Data Acquisition :- is the process of gathering, filtering, and cleaning data before it is put in a data
warehouse or any other storage solution on which data analysis can be carried out.
Data acquisition is one of the major big data challenges in terms of infrastructure requirements. Why?
Data Analysis :- making the acquired raw data amenable to use in decision-making as well as domain-
specific usage.
It involves exploring, transforming, and modeling data with the goal of highlighting relevant data, and
synthesizing and extracting useful hidden information with high potential from a business point of view.
Data Curation :- It is the active management of data over its life cycle to ensure it
meets the necessary data quality requirements for its effective usage.
Data Usage:- It covers the data-driven business activities that need access to data, its
analysis, and the tools needed to integrate the data analysis within the business activity.
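A minimal Python sketch of these four activities on a toy data set is given below; the sensor-style readings, the cleaning rule, and the decision threshold are all made-up assumptions used only to illustrate the flow.

```python
import statistics

# A toy walk through the Big Data Value Chain; all values are made up.

# Data Acquisition: gather and filter/clean raw records before storing them.
raw_records = ["21.5", "22.0", "", "23.1", "n/a", "22.7"]
acquired = [r for r in raw_records if r not in ("", "n/a")]

# Data Analysis: explore, transform, and model the data to extract useful information.
readings = [float(r) for r in acquired]
summary = {"mean": statistics.mean(readings), "max": max(readings)}

# Data Curation: manage the data over its life cycle, e.g. attach metadata/provenance.
curated = {"values": readings, "unit": "celsius", "source": "example sensor"}

# Data Usage: the analysis feeds a (toy) business decision.
if summary["mean"] > 22.0:
    print("Average reading above threshold:", summary)
```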
Basic concepts of big data
• Big data is the term for a collection of data sets so large and complex that they become
difficult to process using traditional data processing tools.
Veracity: can we trust the data? How accurate is it?
(Figure 2.4: Characteristics of big data)
Cluster computing :
• Big data clustering software combines the resources of many smaller machines,
seeking to provide a number of benefits:
Resource Pooling:
High Availability:
Easy Scalability:
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier.
is a framework that allows for the distributed processing of large datasets
across clusters of computers using simple programming models.
Characteristics of Hadoop
i. Economical:- its systems are highly economical, as ordinary computers can be used for data processing.
ii. Reliable:- it stores copies of the data on different machines and is resistant to
hardware failure.
iii. Scalable/Accessible:- it is easily scalable, both horizontally and vertically.
iv. Flexible:- you can store as much structured and unstructured data as you need.
• Hadoop has an ecosystem that has evolved from its four core components:
data management, access, processing, and storage.
It comprises the following components, and many others:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming-based data processing
Spark: In-memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: Machine learning algorithm libraries
Solr, Lucene: Searching and indexing
Zookeeper: Managing the cluster
Oozie: Job scheduling
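To illustrate the MapReduce model that several of these components build on, here is a minimal word-count sketch in plain Python; it only mimics the map, shuffle, and reduce phases conceptually and is not Hadoop's actual MapReduce API.

```python
from collections import defaultdict

# A conceptual word count in the MapReduce style (not Hadoop's actual API).
documents = ["big data needs big clusters", "hadoop processes big data"]

# Map phase: emit (key, value) pairs, here (word, 1) for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all values belonging to the same key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine the grouped values into one result per key.
reduced = {word: sum(counts) for word, counts in groups.items()}
print(reduced)   # e.g. {'big': 3, 'data': 2, ...}
```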
Big Data Life Cycle with Hadoop
There are different stages of Big Data processing, some of which are:-
I. Ingesting/feeding data into the system :- data is ingested or transferred to Hadoop from
various sources.
II. Processing the data in storage :- the data is stored and processed.
The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
III. Computing and analyzing data :- data is analyzed by processing frameworks such as Pig,
Hive, and Impala.
Pig converts the data using map and reduce operations and then analyzes it.
Hive is also based on map and reduce programming and is most suitable
for structured data.
IV. Visualizing the results :- this is performed by tools such as Hue and Cloudera Search.
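As a rough illustration of this kind of query-based analysis on structured data, the sketch below uses PySpark (listed in the ecosystem above) rather than Pig or Hive; it assumes the pyspark package and a Java runtime are installed, and the sales figures are made up.

```python
from pyspark.sql import SparkSession

# Assumes the pyspark package and a Java runtime are installed locally.
# Spark stands in here for the query-based frameworks (Pig/Hive/Impala) named above.
spark = SparkSession.builder.appName("toy-structured-analysis").getOrCreate()

# A tiny structured dataset; the values are made up for illustration.
sales = spark.createDataFrame(
    [("North", "Jan", 1200.0), ("South", "Jan", 950.0), ("North", "Feb", 1340.0)],
    ["region", "month", "amount"],
)

# Hive-style structured analysis expressed as a SQL query.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

spark.stop()
```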