Emerging Chapter 2
CHAPTER TWO
Kassahun T. (MSc.)
Data Science
Objectives:
§ Describe what data science is and the role of data scientists.
§ Differentiate data and information.
§ Describe the data processing life cycle.
§ Describe the data value chain in the emerging era of big data.
§ Understand the basics of big data.
§ Describe the purpose of the Hadoop ecosystem components.
§ Data can be described as unprocessed facts and figures that, on their own, cannot help in decision-making.
§ Information is interpreted data: it is created from organized, structured, and processed data in a particular context.
§ Thus, data is the raw material that is organized, structured, and interpreted to create useful information.
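The data-versus-information distinction above can be sketched in a few lines of Python; the readings, units, and the averaging step are made-up illustrations, not from the slides:

```python
# Raw, unprocessed facts with no context: data.
raw = [23.1, 22.8, 24.0, 23.5]

# Interpreting the same values in a context (daily temperatures in
# Celsius, summarized over the period) turns them into information.
average_temp_c = sum(raw) / len(raw)
report = f"Average temperature over {len(raw)} days: {average_temp_c:.1f} C"
print(report)
```

The numbers alone answer no question; the organized, labeled summary supports a decision.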
Data types from a Data Analytics perspective:
Structured Data
§ Structured data has been organized into a formatted repository, typically a
database, so that its elements are addressable for more effective
processing and analysis, e.g., Excel files or SQL databases.
Semi-structured Data
§ Semi-structured data does not obey the formal structure of data models
associated with relational databases or other forms of data tables, but
contains tags or other markers to separate semantic elements and fields
within the data, e.g., JSON and XML.
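A short sketch of what "tags or other markers" means in practice, using JSON and Python's standard-library `json` module (the record names and fields are hypothetical):

```python
import json

# Semi-structured records: keys act as markers separating the fields,
# but the two records need not share an identical schema.
doc = '''
[
  {"name": "Abebe", "age": 30},
  {"name": "Sara", "email": "sara@example.com"}
]
'''
records = json.loads(doc)
for r in records:
    # Each record carries its own markers; missing fields are tolerated.
    print(r.get("name"), r.get("age", "n/a"))
```

A relational table would force both records into one fixed schema; semi-structured data does not.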
Unstructured Data
§ Unstructured data has no predefined data model or organization, e.g.,
plain text, images, audio, and video.
§ Data acquisition (DAQ): the process of gathering, filtering, and
cleaning data before it is put into a data warehouse. It is one of the
major big data challenges in terms of infrastructure requirements.
§ Data analysis: the process of evaluating data using analytical and
statistical tools to discover useful information and aid business
decision-making. It involves exploring, transforming, and modeling data
with the help of data mining, business intelligence, and machine learning.
§ Data curation: the active management of data over its life cycle to
ensure it meets the quality requirements for effective usage, e.g.,
content creation, selection, classification, and validation.
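The gather-filter-clean steps of data acquisition can be sketched in plain Python; the sensor feed, the `-999` sentinel, and the `acquire` function are hypothetical illustrations, not part of any real DAQ system:

```python
# Hypothetical raw feed: gathered records, some malformed or missing.
raw_feed = [
    {"sensor": "s1", "value": "42"},
    {"sensor": "s2", "value": "oops"},  # malformed -> filtered out
    {"sensor": "s1", "value": "-999"},  # sentinel for 'missing' -> cleaned out
    {"sensor": "s3", "value": "17"},
]

def acquire(feed):
    """Filter and clean records before loading them into storage."""
    cleaned = []
    for rec in feed:
        try:
            v = float(rec["value"])  # filtering: drop unparseable values
        except ValueError:
            continue
        if v == -999:                # cleaning: drop sentinel values
            continue
        cleaned.append({"sensor": rec["sensor"], "value": v})
    return cleaned

print(acquire(raw_feed))
```

At big data scale the same logic runs distributed over a cluster, which is why acquisition is called out as an infrastructure challenge.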
§ Data curators (scientific curators or data annotators) hold the
responsibility of ensuring that data are trustworthy, discoverable,
accessible, reusable and fit their purpose.
§ Data storage is the recording (storing) of information (data) in a
storage medium. Relational Database Management Systems (RDBMS) have been
the main solution to the storage paradigm.
§ However, the ACID (Atomicity, Consistency, Isolation, and Durability)
properties that guarantee database transactions are hard to preserve
across distributed machines, which makes relational systems unsuitable
for big data.
§ Data usage: covers the data-driven business activities that need access
to data, and enhances competitiveness, for example through the reduction
of costs.
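The ACID guarantees mentioned above can be seen in miniature with SQLite, which ships in Python's standard library; the account names and amounts are made up for illustration. A failure mid-transaction rolls everything back atomically — exactly the property that is expensive to preserve across the distributed clusters big data requires:

```python
import sqlite3

# In-memory database; each sqlite3 transaction is ACID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    with conn:  # one atomic transaction: both updates or neither
        conn.execute("UPDATE accounts SET balance = balance - 60 WHERE name = 'a'")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back.
balance_a = conn.execute("SELECT balance FROM accounts WHERE name = 'a'").fetchone()[0]
print(balance_a)  # 100
```

NoSQL stores such as HBase relax some of these guarantees to scale horizontally.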
Characteristics of big data (the four V's):
Volume: large amounts of data (zettabytes / massive datasets)
Velocity: Data is live streaming or in motion
Variety: data comes in many different forms from diverse sources
Veracity: can we trust the data? How accurate is it? etc.
Clustered Computing
§ Because of the qualities of big data, individual
computers are often inadequate for handling the data at
most stages.
§ To better address the high storage and computational
needs of big data, computer clusters are a better fit.
§ Big data clustering software combines the resources of
many smaller machines, seeking to provide a number of
benefits:
§ Resource Pooling: Combining the available storage space to
hold data is a clear benefit, but CPU and memory pooling are
also extremely important.
§ High Availability: Clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or
software failures from affecting access to data and
processing.
§ Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group.
§ Cluster membership and resource allocation can be handled by
software like Hadoop’s YARN (which stands for Yet
Another Resource Negotiator).
The Hadoop ecosystem comprises the following components, among many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
• Pig, Hive: query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• ZooKeeper: cluster management
• Oozie: job scheduling
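To make the MapReduce entry concrete, here is a tiny in-process word-count sketch of the programming model — map emits (key, 1) pairs, a shuffle groups them by key, and reduce sums each group. This is a plain-Python illustration of the idea, not the Hadoop API; the function names and sample lines are invented:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values under their key, as the framework would
    # when routing mapper output to reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big clusters", "clusters run hadoop"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["clusters"])  # 2 2
```

On a real cluster, the map and reduce calls run in parallel on many machines over HDFS blocks, with YARN allocating the resources.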
§ Hadoop has an ecosystem that has evolved from
its four core components: data management,
access, processing, and storage.