Emerging Chapter 2
CHAPTER TWO
Kassahun T. (MSc.)
Data Science
Objectives:
§ Describe what data science is and the role of data scientists.
§ Differentiate data and information.
§ Describe the data processing life cycle.
§ Describe the data value chain in the emerging era of big data.
§ Understand the basics of big data.
§ Describe the purpose of the Hadoop ecosystem components.
§ Data can be described as unprocessed facts and figures that, on their own, cannot help in decision-making.
§ Information is interpreted data: it is created from organized, structured, and processed data in a particular context.
§ Thus, data is the raw material that is organized, structured, and interpreted to create useful information.
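The data-versus-information distinction above can be sketched in a few lines of Python; the readings, units, and the averaging step are made-up illustrations, not from the slides:

```python
# Raw, unprocessed facts with no context: data.
raw = [23.1, 22.8, 24.0, 23.5]

# Interpreting the same values in a context (daily temperatures in
# Celsius, summarized over the period) turns them into information.
average_temp_c = sum(raw) / len(raw)
report = f"Average temperature over {len(raw)} days: {average_temp_c:.1f} C"
print(report)
```

The numbers alone answer no question; the organized, labeled summary supports a decision.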
Data types from a Data Analytics perspective:
Structured Data
§ Structured data has been organized into a formatted repository, typically a
database, so that its elements are addressable for more effective
processing and analysis, e.g., Excel files or SQL databases.
Semi-structured Data
§ Semi-structured data does not obey the formal structure of data models
associated with relational databases or other forms of data tables, but
contains tags or other markers to separate semantic elements and fields
within the data, e.g., JSON and XML.
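A short sketch of what "tags or other markers" means in practice, using JSON and Python's standard-library `json` module (the record names and fields are hypothetical):

```python
import json

# Semi-structured records: keys act as markers separating the fields,
# but the two records need not share an identical schema.
doc = '''
[
  {"name": "Abebe", "age": 30},
  {"name": "Sara", "email": "sara@example.com"}
]
'''
records = json.loads(doc)
for r in records:
    # Each record carries its own markers; missing fields are tolerated.
    print(r.get("name"), r.get("age", "n/a"))
```

A relational table would force both records into one fixed schema; semi-structured data does not.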
Unstructured Data
§ Unstructured data has no predefined data model or organization, e.g.,
plain text, images, audio, and video.
§ Data acquisition (DAQ): the process of gathering, filtering, and
cleaning data before it is put into a data warehouse. It is one of the
major big data challenges in terms of infrastructure requirements.
§ Data analysis: the process of evaluating data using analytical and
statistical tools to discover useful information and aid business
decision-making. It involves exploring, transforming, and modeling data
with the help of data mining, business intelligence, and machine learning.
§ Data curation: the active management of data over its life cycle to
ensure it meets the quality requirements for effective usage, e.g.,
content creation, selection, classification, and validation.
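The gather-filter-clean steps of data acquisition can be sketched in plain Python; the sensor feed, the `-999` sentinel, and the `acquire` function are hypothetical illustrations, not part of any real DAQ system:

```python
# Hypothetical raw feed: gathered records, some malformed or missing.
raw_feed = [
    {"sensor": "s1", "value": "42"},
    {"sensor": "s2", "value": "oops"},  # malformed -> filtered out
    {"sensor": "s1", "value": "-999"},  # sentinel for 'missing' -> cleaned out
    {"sensor": "s3", "value": "17"},
]

def acquire(feed):
    """Filter and clean records before loading them into storage."""
    cleaned = []
    for rec in feed:
        try:
            v = float(rec["value"])  # filtering: drop unparseable values
        except ValueError:
            continue
        if v == -999:                # cleaning: drop sentinel values
            continue
        cleaned.append({"sensor": rec["sensor"], "value": v})
    return cleaned

print(acquire(raw_feed))
```

At big data scale the same logic runs distributed over a cluster, which is why acquisition is called out as an infrastructure challenge.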
§ Data curators (scientific curators or data annotators) hold the
responsibility of ensuring that data are trustworthy, discoverable,
accessible, reusable and fit their purpose.
§ Data storage is the recording (storing) of information (data) in a
storage medium. Relational Database Management Systems (RDBMS) have been
the main solution to the storage paradigm.
§ However, the ACID (Atomicity, Consistency, Isolation, and Durability)
properties that guarantee database transactions are hard to preserve
across distributed machines, which makes relational systems unsuitable
for big data.
§ Data usage: covers the data-driven business activities that need access
to data, and enhances competitiveness, for example through the reduction
of costs.
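The ACID guarantees mentioned above can be seen in miniature with SQLite, which ships in Python's standard library; the account names and amounts are made up for illustration. A failure mid-transaction rolls everything back atomically — exactly the property that is expensive to preserve across the distributed clusters big data requires:

```python
import sqlite3

# In-memory database; each sqlite3 transaction is ACID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 0)")
conn.commit()

try:
    with conn:  # one atomic transaction: both updates or neither
        conn.execute("UPDATE accounts SET balance = balance - 60 WHERE name = 'a'")
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass

# Atomicity: the partial debit was rolled back.
balance_a = conn.execute("SELECT balance FROM accounts WHERE name = 'a'").fetchone()[0]
print(balance_a)  # 100
```

NoSQL stores such as HBase relax some of these guarantees to scale horizontally.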
Characteristics of big data (the four V's):
Volume: large amounts of data (zettabytes / massive datasets)
Velocity: Data is live streaming or in motion
Variety: data comes in many different forms from diverse sources
Veracity: can we trust the data? How accurate is it? etc.
Clustered Computing
§ Because of the qualities of big data, individual
computers are often inadequate for handling the data at
most stages.
§ To better address the high storage and computational
needs of big data, computer clusters are a better fit.
§ Big data clustering software combines the resources of
many smaller machines, seeking to provide a number of
benefits:
§ Resource Pooling: Combining the available storage space to
hold data is a clear benefit, but CPU and memory pooling are
also extremely important.
§ High Availability: Clusters can provide varying levels of fault
tolerance and availability guarantees to prevent hardware or
software failures from affecting access to data and
processing.
§ Easy Scalability: Clusters make it easy to scale horizontally by
adding additional machines to the group.
§ Cluster membership and resource allocation can be handled by
software like Hadoop’s YARN (which stands for Yet
Another Resource Negotiator).
The Hadoop ecosystem comprises the following components, among many others:
• HDFS: Hadoop Distributed File System
• YARN: Yet Another Resource Negotiator
• MapReduce: programming-based data processing
• Spark: in-memory data processing
• Pig, Hive: query-based processing of data services
• HBase: NoSQL database
• Mahout, Spark MLlib: machine learning algorithm libraries
• Solr, Lucene: searching and indexing
• ZooKeeper: cluster management
• Oozie: job scheduling
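To make the MapReduce entry concrete, here is a tiny in-process word-count sketch of the programming model — map emits (key, 1) pairs, a shuffle groups them by key, and reduce sums each group. This is a plain-Python illustration of the idea, not the Hadoop API; the function names and sample lines are invented:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the input split.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle: group all values under their key, as the framework would
    # when routing mapper output to reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final count.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big clusters", "clusters run hadoop"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["clusters"])  # 2 2
```

On a real cluster, the map and reduce calls run in parallel on many machines over HDFS blocks, with YARN allocating the resources.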
§ Hadoop has an ecosystem that has evolved from
its four core components: data management,
access, processing, and storage.