Chapter 2: Data Science
Chapter contents :
As an academic discipline and profession, data science continues to evolve as one of
the most promising and in-demand career paths for skilled professionals.
Data professionals understand that they must advance beyond the traditional skills of
analyzing large amounts of data, data mining, and programming.
What are data and information?
Data is represented with the help of characters such as alphabets (A-Z, a-z), digits (0-9),
or special characters (+, -, /, *, <, >, =, etc.).
Information is defined as :- processed and organized data that is meaningful and useful to
the person receiving it.
Data Processing Cycle
Data Processing Cycle is the sequence of steps or operations used to
transform raw data into useful information.
I. Input
II. Processing – in this step, the input data is changed to produce data in a more useful
form; that is, raw data is transformed into a more usable form.
For example, interest can be calculated on a deposit to a bank, or a summary of
sales for the month can be calculated from the sales orders.
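To make the cycle concrete, here is a minimal Python sketch of the three steps applied to the bank-interest example above; the account names, deposit amounts, and the 5% annual rate are made-up illustrative values, not figures from the text.

```python
# A minimal sketch of the data processing cycle: input -> processing -> output.
# The deposit figures and the 5% interest rate are made-up illustrative values.

def input_step():
    # Input: raw data is collected in a form convenient for processing.
    return [("acct-001", 1000.00), ("acct-002", 2500.00), ("acct-003", 400.00)]

def processing_step(deposits, annual_rate=0.05):
    # Processing: raw data is transformed into a more useful form,
    # here by calculating the interest earned on each deposit.
    return [(acct, amount, round(amount * annual_rate, 2)) for acct, amount in deposits]

def output_step(results):
    # Output: the processed data (information) is presented to the user.
    for acct, amount, interest in results:
        print(f"{acct}: deposit={amount:.2f}, interest={interest:.2f}")

output_step(processing_step(input_step()))
```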
Data types can be described from diverse perspectives. Here are some of the perspectives:-
i. Data types from a computer programming perspective
Common data types from a programmer's perspective include:
Integers (int) – used to store whole numbers.
For instance, Integers = { ..., −4, −3, −2, −1, 0, 1, 2, 3, 4, ... }
Booleans (bool) – used to represent a value restricted to one of two values: true or false.
Characters (char) – used to store a single character (a letter, digit, symbol, etc.).
Floating-point numbers (float) – used to store real numbers.
Alphanumeric strings (string) – used to store a combination of characters and numbers.
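As a quick illustration, the snippet below shows each of these types in Python; the variable names and values are arbitrary examples (note that Python has no separate char type, so a one-character string stands in for it).

```python
# Common data types from a programming perspective, illustrated in Python.
age = 25                 # int: whole numbers (..., -2, -1, 0, 1, 2, ...)
is_enrolled = True       # bool: restricted to one of two values, True or False
grade = 'A'              # char: a single character (here a 1-character string)
gpa = 3.75               # float: real (floating-point) numbers
student_id = "DS-2024"   # string: a combination of letters, digits, and symbols

for value in (age, is_enrolled, grade, gpa, student_id):
    print(type(value).__name__, value)
```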
ii. Data types from a data analytics perspective
From a data analytics point of view, there are three common types of data.
A. Structured Data:- adheres to/follows a pre-defined data model and is
therefore highly organized and straightforward to analyze.
It conforms to a tabular format, i.e., it is organized in rows and columns.
e.g. Excel files or SQL databases
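As a small illustration of structured data, the sketch below builds a tiny SQL table with Python's built-in sqlite3 module; the table name, columns, and rows are made up for the example.

```python
import sqlite3

# Structured data follows a pre-defined model: every row has the same columns.
conn = sqlite3.connect(":memory:")           # throw-away in-memory SQL database
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "Jan", 1200.0), ("South", "Jan", 950.0), ("North", "Feb", 1340.0)],
)

# Because the data is organized in rows and columns, it is straightforward to analyze.
for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)
```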
Data Value Chain
• It is introduced to describe the information flow within a big data system as a
series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level activities:
Data Acquisition :- is the process of gathering, filtering, and cleaning data before it is put in a data
warehouse or any other storage solution on which data analysis can be carried out.
Data acquisition is one of the major big data challenges in terms of infrastructure requirements. Why?
Data Analysis :- making the acquired raw data amenable to use in decision-making as well as domain-
specific usage.
It involves exploring, transforming, and modeling data with the goal of highlighting relevant data, and
synthesizing and extracting useful hidden information with high potential from a business point of view.
Data Curation :- It is the active management of data over its life cycle to ensure it
meets the necessary data quality requirements for its effective usage.
Data Usage:- It covers the data-driven business activities that need access to data, its
analysis, and the tools needed to integrate the data analysis within the business activity.
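A minimal Python sketch of these four activities on a toy data set is given below; the sensor-style readings, the cleaning rule, and the decision threshold are all made-up assumptions used only to illustrate the flow.

```python
import statistics

# A toy walk through the Big Data Value Chain; all values are made up.

# Data Acquisition: gather and filter/clean raw records before storing them.
raw_records = ["21.5", "22.0", "", "23.1", "n/a", "22.7"]
acquired = [r for r in raw_records if r not in ("", "n/a")]

# Data Analysis: explore, transform, and model the data to extract useful information.
readings = [float(r) for r in acquired]
summary = {"mean": statistics.mean(readings), "max": max(readings)}

# Data Curation: manage the data over its life cycle, e.g. attach metadata/provenance.
curated = {"values": readings, "unit": "celsius", "source": "example sensor"}

# Data Usage: the analysis feeds a (toy) business decision.
if summary["mean"] > 22.0:
    print("Average reading above threshold:", summary)
```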
Basic concepts of big data
• Big data is the term for a collection of data sets so large and complex that they become
difficult to process using traditional data processing tools.
Veracity: can we trust the data? How accurate is it?
(Figure 2.4: Characteristics of big data)
Cluster computing :
• Big data clustering software combines the resources of many smaller machines,
seeking to provide a number of benefits:
Resource Pooling:
High Availability:
Easy Scalability:
Hadoop and its Ecosystem
• Hadoop is an open-source framework intended to make interaction with big data easier.
is a framework that allows for the distributed processing of large datasets
across clusters of computers using simple programming models.
Characteristics of Hadoop
i. Economical:- its systems are highly economical, as ordinary computers can be used for data processing.
ii. Reliable:- it stores copies of the data on different machines and is resistant to
hardware failure.
iii. Scalable/Accessible:- it is easily scalable, both horizontally and vertically.
iv. Flexible:- you can store as much structured and unstructured data as you need.
• Hadoop has an ecosystem that has evolved from its four core components:
data management, access, processing, and storage.
It comprises the following components, and many others:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: Programming-based data processing
Spark: In-memory data processing
PIG, HIVE: Query-based processing of data services
HBase: NoSQL database
Mahout, Spark MLlib: Machine learning algorithm libraries
Solr, Lucene: Searching and indexing
Zookeeper: Managing the cluster
Oozie: Job scheduling
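To illustrate the MapReduce model that several of these components build on, here is a minimal word-count sketch in plain Python; it only mimics the map, shuffle, and reduce phases conceptually and is not Hadoop's actual MapReduce API.

```python
from collections import defaultdict

# A conceptual word count in the MapReduce style (not Hadoop's actual API).
documents = ["big data needs big clusters", "hadoop processes big data"]

# Map phase: emit (key, value) pairs, here (word, 1) for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group all values belonging to the same key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: combine the grouped values into one result per key.
reduced = {word: sum(counts) for word, counts in groups.items()}
print(reduced)   # e.g. {'big': 3, 'data': 2, ...}
```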
Big Data Life Cycle with Hadoop
There are different stages of Big Data processing, some of which are:-
I. Ingesting/feeding data into the system :- data is ingested or transferred to Hadoop from
various sources.
II. Processing the data in storage :- the data is stored and processed.
The data is stored in the distributed file system, HDFS, and in the NoSQL distributed database, HBase.
III. Computing and analyzing data :- data is analyzed by processing frameworks such as Pig,
Hive, and Impala.
Pig converts the data using map and reduce operations and then analyzes it.
Hive is also based on map and reduce programming and is most suitable
for structured data.
IV. Visualizing the results :- this is performed by tools such as Hue and Cloudera Search.
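As a rough illustration of this kind of query-based analysis on structured data, the sketch below uses PySpark (listed in the ecosystem above) rather than Pig or Hive; it assumes the pyspark package and a Java runtime are installed, and the sales figures are made up.

```python
from pyspark.sql import SparkSession

# Assumes the pyspark package and a Java runtime are installed locally.
# Spark stands in here for the query-based frameworks (Pig/Hive/Impala) named above.
spark = SparkSession.builder.appName("toy-structured-analysis").getOrCreate()

# A tiny structured dataset; the values are made up for illustration.
sales = spark.createDataFrame(
    [("North", "Jan", 1200.0), ("South", "Jan", 950.0), ("North", "Feb", 1340.0)],
    ["region", "month", "amount"],
)

# Hive-style structured analysis expressed as a SQL query.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()

spark.stop()
```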