
DATA SCIENCE

ADDIS ABABA UNIVERSITY

CHAPTER TWO

Kassahun T. (MSc.)

Objectives:
§ Describe what data science is and the role of data scientists.
§ Differentiate data and information.
§ Describe the data processing life cycle.
§ Describe the data value chain in the emerging era of big data.
§ Understand the basics of Big Data.
§ Describe the purpose of the Hadoop ecosystem components.

An Overview of Data Science


§ Data science is a multi-disciplinary field that uses
scientific methods, processes, algorithms, and
systems to extract knowledge and insights from
structured, semi-structured and unstructured data.
§ Data scientists must possess a strong quantitative background in statistics and linear algebra, as well as programming knowledge with a focus on data warehousing, mining, and modeling, in order to build and analyze algorithms.

Definition of data and information


§ Data: facts and statistics collected together for reference or analysis.
§ Data is represented with the help of characters such as letters (A-Z, a-z), digits (0-9), or special characters (+, -, /, *, <, >, =, etc.).
§ Information: information is organized or classified data, which has some meaningful value for the receiver. Information is the processed data on which decisions and actions are based.

Cont ..
§ Data can be described as unprocessed facts and figures that, on their own, cannot help in decision-making.
§ Information is interpreted data: it is created from organized, structured, and processed data in a particular context.
§ In other words, data is the raw material that is organized, structured, and interpreted to create useful information.

Cont ..

For a decision to be meaningful, the processed data must satisfy the following characteristics:
1. Timely − Information should be available when required.
2. Accuracy − Information should be accurate.
3. Completeness − Information should be complete.

Data Processing Cycle

§ Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.
§ Data processing passes through three steps (a minimal sketch follows the list):
§ Input: collecting data from the external world. It may be done in many forms.
§ Processing: the input data is changed to produce data in a more useful form.
§ Output: the result of the preceding processing step is collected.
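
A minimal Python sketch of the three steps; the raw readings are made up for the example:

```python
# Input: collect raw data (hard-coded here; in practice it may come
# from keyboards, sensors, files, etc.)
raw_readings = ["21.5", "19.0", "bad", "23.7"]

# Processing: change the input data into a more useful form (floats),
# discarding values that cannot be parsed
def to_float(value):
    try:
        return float(value)
    except ValueError:
        return None

temperatures = [t for t in (to_float(r) for r in raw_readings) if t is not None]

# Output: collect and present the result of the processing step
print("Average temperature:", sum(temperatures) / len(temperatures))
```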

Data types and their representation


§ In computer programming, a data type is an attribute of data which tells the compiler or interpreter how the programmer intends to use the data.
§ Common data types include (illustrated in the sketch below):
§ Integers, e.g. 1, 2, 3
§ Booleans: true or false
§ Characters, e.g. a, b, c, d, A
§ Floating-point numbers, e.g. 0.35, 1.75
§ Alphanumeric strings, e.g. ab12cd
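
A brief Python illustration of these types (note that Python has no separate character type; a character is a string of length one):

```python
count = 3        # integer
is_valid = True  # Boolean
grade = "A"      # character (a length-1 string in Python)
price = 1.75     # floating-point number
code = "ab12cd"  # alphanumeric string

# Show each value together with the type the interpreter assigns to it
for value in (count, is_valid, grade, price, code):
    print(value, type(value).__name__)
```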

Cont ..
Data types from a Data Analytics perspective:
Structured Data
§ Structured data is data that has been organized into a formatted repository, typically a database, so that its elements can be made addressable for more effective processing and analysis, e.g., Excel files or SQL databases.
Semi-structured Data
§ Semi-structured data is a form of structured data that does not obey the formal structure of data models associated with relational databases or other forms of data tables, but contains tags or other markers to separate semantic elements and fields within the data, e.g., JSON and XML (see the sketch below).
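
A small, hedged sketch of handling semi-structured JSON data in Python; the record and its fields are invented for the example:

```python
import json

# A hypothetical semi-structured record: tags (the keys) separate the
# fields, but there is no fixed relational schema - records may vary.
raw = '{"name": "Hamza", "age": 25, "skills": ["Python", "SQL"]}'

record = json.loads(raw)             # parse the JSON text into a dict
print(record["name"])                # access a named field
print(", ".join(record["skills"]))   # nested values are allowed
```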

Cont ..
Unstructured Data
§ Unstructured data is information that either does not have a pre-defined data model or is not organized in a pre-defined manner.
§ Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. Common examples of unstructured data include audio, video files, or NoSQL databases.

Metadata – Data about Data

Metadata is data that describes other data. "Meta" is a prefix that, in most information technology usages, means "an underlying definition or description."
Ø Metadata summarizes basic information about data, which can make finding and working with particular instances of data easier.
Ø Metadata is frequently used by Big Data solutions for initial analysis.
E.g., in a set of photographs, metadata could describe when and where the photos were taken (see the sketch below).
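
A hedged illustration of reading photo metadata with the Pillow library (assumed installed); the file name photo.jpg is hypothetical:

```python
from PIL import Image            # third-party library: Pillow
from PIL.ExifTags import TAGS

# Open a (hypothetical) photo and read its EXIF metadata block
img = Image.open("photo.jpg")
exif = img.getexif()

# Print each tag under a readable name, e.g. DateTime, Make, Model
for tag_id, value in exif.items():
    print(TAGS.get(tag_id, tag_id), ":", value)
```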

Data Value Chain

The data value chain describes the process of data creation and use, from first identifying a need for data to its final use and possible reuse.

Cont ..
§ Data acquisition (DAQ): the process of gathering, filtering, and cleaning data before it is put in a data warehouse. It is one of the major big data challenges in terms of infrastructure requirements.
§ Data analysis: the process of evaluating data using analytical and statistical tools to discover useful information and aid in business decision-making. It involves exploring, transforming, and modeling data with the help of data mining, business intelligence, and machine learning (a small sketch follows below).
§ Data curation: the active management of data over its life cycle to ensure that the data quality requirements for its effective usage are met, e.g., content creation, selection, classification, validation, etc.
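
A minimal, hedged sketch of the acquisition → curation → analysis steps using pandas (assumed installed); the records are invented for the example:

```python
import pandas as pd

# Acquisition: gather raw records (hard-coded here; in practice they
# would be collected from files, APIs, or sensors)
raw = pd.DataFrame({
    "region": ["north", "south", "north", None],
    "sales":  [120.0, 95.5, None, 80.0],
})

# Curation: validate and clean - drop records with missing fields
clean = raw.dropna()

# Analysis: a simple statistical summary to support a business decision
print(clean.groupby("region")["sales"].mean())
```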

Cont ..
§ Data curators (scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.
§ Data storage: the recording (storing) of information (data) in a storage medium. Relational Database Management Systems (RDBMS) have been the main solution to the storage paradigm. However, the ACID (Atomicity, Consistency, Isolation, and Durability) properties that guarantee database transactions also make relational systems hard to scale out, leaving them unsuitable for big data (a small transaction sketch follows below).
§ Data usage: covers the data-driven business activities that need access to data, e.g., to enhance competitiveness through the reduction of costs.
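
To make the ACID idea concrete, a minimal sketch of an atomic transaction using Python's built-in sqlite3 module; the table and values are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

# Atomicity: both updates below succeed together or not at all
try:
    with conn:   # opens a transaction; commits on success, rolls back on error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
except sqlite3.Error:
    print("transaction rolled back")

print(conn.execute("SELECT * FROM accounts").fetchall())
```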

Basic concepts of big data


§ Big data is the term for a collection of data sets so large
and complex that it becomes difficult to process using
on-hand database management tools or traditional data
processing applications.
§ "Big Data" refers to the evolution and use of
technologies that provide the right user at the right time
with the right information from a mass of data.
§ Big data is characterized by the 3 Vs and more:

Cont..
Volume: large amounts of data (zettabytes; massive datasets).
Velocity: data is live-streamed or in motion.
Variety: data comes in many different forms from diverse sources.
Veracity: can we trust the data? How accurate is it? etc.

Clustered Computing
§ Because of the qualities of big data, individual
computers are often inadequate for handling the data at
most stages.
§ To better address the high storage and computational
needs of big data, computer clusters are a better fit.
§ Big data clustering software combines the resources of
many smaller machines, seeking to provide a number of
benefits:

Cont ..
§ Resource Pooling: Combining the available storage space to
hold data is a clear benefit, but CPU and memory pooling are
also extremely important.
§ High Availability: Clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or
software failures from affecting access to data and
processing.
§ Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group.
§ Cluster membership and resource allocation can be handled by
software like Hadoop’s YARN (which stands for Yet
Another Resource Negotiator).

Hadoop and its Ecosystem


§ Hadoop is an open-source framework intended to make
interaction with big data easier.
§ It is a framework that allows for the distributed
processing of large datasets across clusters.
§ The four key characteristics of Hadoop are:
§ Economical: its systems are highly economical, as ordinary (commodity) computers can be used.
§ Reliable: it is resistant to hardware failure.
§ Scalable: it is easily scalable, both horizontally and vertically.
§ Flexible: it is flexible, and you can store as much structured and unstructured data as you need.

Cont ..
It comprises the following components and many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing (see the sketch after this list)
• Spark: in-memory data processing
• Pig, Hive: query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: machine-learning algorithm libraries
• Solr, Lucene: searching and indexing
• ZooKeeper: managing the cluster
• Oozie: job scheduling
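
To give a feel for the MapReduce model, a minimal word-count sketch in Python written in the style of Hadoop Streaming mapper and reducer functions, run locally here on made-up input rather than on a cluster:

```python
from collections import defaultdict

def mapper(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1

def reducer(pairs):
    """Reduce step: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Local stand-in for data that would be split across cluster nodes
lines = ["big data needs big clusters", "hadoop processes big data"]
pairs = (pair for line in lines for pair in mapper(line))
print(reducer(pairs))   # {'big': 3, 'data': 2, 'needs': 1, ...}
```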

Cont ..
§ Hadoop has an ecosystem that has evolved from
its four core components: data management,
access, processing, and storage.

Big Data Life Cycle with Hadoop


1. Ingesting data into the system:
• The data is ingested or transferred to Hadoop from various sources, such as relational database systems or local files.
• Sqoop transfers data from an RDBMS to HDFS, whereas Flume transfers event data (a hedged ingestion sketch follows below).
2. Processing the data in storage:
• In this stage, the data is stored and processed.
• The data is stored in the distributed file system.
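
One hedged illustration of ingesting a local file into HDFS from Python, using the third-party hdfs (HdfsCLI) package; the NameNode URL, user, and paths are assumptions for the example:

```python
from hdfs import InsecureClient   # third-party package: hdfs (HdfsCLI)

# Assumed WebHDFS endpoint of the NameNode and a hypothetical user
client = InsecureClient("http://namenode:9870", user="hadoop")

# Upload a (hypothetical) local file into the distributed file system
client.upload("/data/raw/events.csv", "events.csv")

# List the target directory to confirm the file arrived
print(client.list("/data/raw"))
```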

Cont ..

3. Computing and analyzing data:
• The data is analyzed by processing frameworks such as Pig, Hive, and Impala (a hedged query sketch follows below).
4. Visualizing the results:
• This stage is performed by tools such as Hue and Cloudera Search; here the analyzed data can be accessed by users.
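
A hedged sketch of step 3 using the third-party PyHive package to run a Hive query from Python; the server address and the events table are assumptions for the example:

```python
from pyhive import hive   # third-party package: PyHive

# Assumed HiveServer2 endpoint; port 10000 is the conventional default
conn = hive.Connection(host="hiveserver", port=10000)
cursor = conn.cursor()

# Analyze data stored in Hadoop with an SQL-like (HiveQL) query
cursor.execute("SELECT event_type, COUNT(*) FROM events GROUP BY event_type")
for event_type, count in cursor.fetchall():
    print(event_type, count)
```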

End Of Chapter Two
