
2 Data-Science PDF

The document provides an overview of a data science lesson that covers topics like the definition of data science, the data value chain, and basic concepts of big data. It defines data science as a multi-disciplinary field that uses scientific methods and processes to extract knowledge and insights from data. The key steps of the data value chain are described as data collection, data preparation, data analysis, and data visualization. The data value chain stages involve data acquisition, analysis, curation, storage, and usage.

DATA SCIENCE

Lesson 2
Instructor: Ellen M. Guiñares
TOPICS COVERED

01 Overview of Data Science
Definition of Data and Information
Data types and representation

02 Data Value Chain
Data Acquisition
Data Analysis
Data Curation
Data Storage
Data Usage

03 Basic Concepts of Big Data

What is Data Science?
Data Science

It is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data to drive decision-making and solve complex problems.
KEY STEPS OF DATA SCIENCE

Data Collection
Gather relevant data.

Data Preparation
Put the data into a suitable format for analysis.

Data Analysis
Identify patterns, relationships and insights.

Data Visualization
Communicate the findings.
Implement the findings.
WHAT IS EXPECTED OF A DATA SCIENTIST?

• Data scientists must master the full spectrum of the data science life cycle and possess a level of flexibility and understanding to maximize returns at each phase of the process.
• Data scientists need to be curious and result-oriented.
• Data scientists need a strong quantitative background in statistics and linear algebra as well as programming knowledge.
DATA SCIENCE LIFE CYCLE
Data and Information

Data: a representation of facts, concepts, or instructions.
Information: organized or classified data, which has some meaningful values for the receiver.
Data

● Data can be defined as a representation of facts, concepts, or instructions in a formalized manner, which should be suitable for communication, interpretation, or processing by humans or electronic machines.
● Data is represented with the help of characters such as letters (A-Z, a-z), digits (0-9) or special characters (+, -, /, *, <, >, =, etc.).
Information

● Information is organized or classified data, which has some meaningful values for the receiver. Information is the processed data on which decisions and actions are based.

● Information is data that has been processed into a form that is meaningful to the recipient and is of real or perceived value in the current or prospective action or decision of the recipient.
Information

● For the decision to be meaningful, the processed data must have the following characteristics:
✓ Timeliness − Information should be available when required.
✓ Accuracy − Information should be accurate.
✓ Completeness − Information should be complete.
Data Processing Cycle

● Data processing is the re-structuring or re-ordering of data by people or machines to increase its usefulness and add value for a particular purpose.

● Data processing consists of the following basic steps: input, processing, and output. These three steps constitute the data processing cycle.
Data Processing Cycle

Input Step: The input data is prepared in some convenient form for processing. The form depends on the processing machine.
Processing Step: The input data is changed to produce data in a more useful form.
Output Step: The result of the preceding processing step is collected.
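The three steps can be sketched in a few lines of Python. This is only an illustration: the exam scores and the function names input_step, processing_step, and output_step are made up for the example and do not come from the lesson.

```python
# Hypothetical raw data: exam scores typed in as text, with stray spaces.
raw_input = ["85", " 90", "78 ", "92"]


def input_step(raw):
    """Input: prepare the data in a convenient form for processing."""
    return [int(item.strip()) for item in raw]


def processing_step(scores):
    """Processing: change the input data into a more useful form."""
    return {
        "count": len(scores),
        "average": sum(scores) / len(scores),
        "highest": max(scores),
    }


def output_step(summary):
    """Output: collect the result of the preceding processing step."""
    return (
        f"{summary['count']} scores, "
        f"average {summary['average']:.2f}, "
        f"highest {summary['highest']}"
    )


report = output_step(processing_step(input_step(raw_input)))
print(report)  # 4 scores, average 86.25, highest 92
```

Each function hands its result to the next one, which mirrors the input, processing, output order of the cycle.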
Data Types and Their Representation

● A data type, or simply type, is an attribute of data which tells the compiler or interpreter how the programmer intends to use the data.
● Almost all programming languages explicitly include the notion of data type. Common data types include:
✓ Integers
✓ Booleans
✓ Characters
✓ Floating-point numbers
✓ Alphanumeric strings
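As a small illustration, here is how these common types look in Python. The sample values are arbitrary. Note that Python has no separate character type, so a single character is stored as a string of length one.

```python
# Arbitrary sample values, one for each common data type listed above.
samples = {
    "integer": 42,
    "boolean": True,
    "character": "A",  # Python stores a single character as a str
    "floating-point number": 3.14,
    "alphanumeric string": "Room 101",
}

# Ask the interpreter which type it assigned to each value.
type_names = {label: type(value).__name__ for label, value in samples.items()}

for label, name in type_names.items():
    print(f"{label}: {samples[label]!r} is stored as {name}")
```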
01
Data types / structure
Based on analysis of data
Data Types / structure

● Based on analysis of data:
✓ Structured
✓ Unstructured
✓ Semi-structured
✓ Metadata
Data Types / structure
What is structured data?

● Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze.

● It conforms to a tabular format with relationships between the different rows and columns. Common examples are Excel files or SQL databases.

● Structured data is considered the most ‘traditional’ form of data storage, since the earliest versions of database management systems (DBMS) were able to store, process and access structured data.
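A minimal sketch of structured data, using Python's built-in sqlite3 module with an in-memory database. The students table and its rows are invented for illustration. The point is that every row follows the same pre-defined columns, which makes querying straightforward.

```python
import sqlite3

# An in-memory SQL database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The pre-defined data model: every row has the same three typed columns.
conn.execute(
    "CREATE TABLE students ("
    "id INTEGER PRIMARY KEY, name TEXT NOT NULL, score REAL NOT NULL)"
)

# Hypothetical rows, made up for this example.
conn.executemany(
    "INSERT INTO students (id, name, score) VALUES (?, ?, ?)",
    [(1, "Ana", 91.5), (2, "Ben", 78.0), (3, "Cara", 85.0)],
)

# Because the structure is known in advance, analysis is a simple query.
rows = conn.execute(
    "SELECT name, score FROM students WHERE score >= 85 ORDER BY score DESC"
).fetchall()
conn.close()

print(rows)  # [('Ana', 91.5), ('Cara', 85.0)]
```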
Data Types / structure
What is unstructured data?

● Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner.

● It is without proper formatting and alignment.

● Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

● The ability to extract value from unstructured data is one of the main drivers behind the quick growth of Big Data.
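To show what extracting value from text-heavy data can look like, the sketch below pulls dates and numbers out of a free-text note with regular expressions. The note is invented, and the patterns assume dates are written as YYYY-MM-DD. Real text is rarely this tidy.

```python
import re

# Hypothetical free-text note with no predefined data model.
note = (
    "Customer called on 2023-03-14 about order 48213. "
    "She said the package arrived 5 days late and asked for a refund by 2023-03-20."
)

# Assumed date format for this example: four digits, two digits, two digits.
DATE_PATTERN = r"\b\d{4}-\d{2}-\d{2}\b"

dates = re.findall(DATE_PATTERN, note)

# Blank out the dates first so their digits are not counted as numbers.
text_without_dates = re.sub(DATE_PATTERN, " ", note)
numbers = [int(n) for n in re.findall(r"\b\d+\b", text_without_dates)]

print(dates)    # ['2023-03-14', '2023-03-20']
print(numbers)  # [48213, 5]
```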
Data Types / structure
What is semi-structured data?

● Semi-structured data is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables.

● For example, JSON and XML are forms of semi-structured data.

● The reason that this third category exists (between structured and unstructured data) is that semi-structured data is considerably easier to analyze than unstructured data.
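A short sketch of semi-structured data, using Python's built-in json module. The records are made up. Each one carries its own field names, but the records do not all share the same fields, so they would not fit a fixed table without extra work.

```python
import json

# Hypothetical JSON records: self-describing, but with no fixed set of fields.
raw = """
[
  {"name": "Ana", "email": "ana@example.com", "tags": ["vip"]},
  {"name": "Ben", "phone": "555-0100"},
  {"name": "Cara", "email": "cara@example.com", "address": {"city": "Manila"}}
]
"""

records = json.loads(raw)

# The field names across all records; no single record has all of them.
fields = sorted({key for record in records for key in record})

# Missing fields have to be handled explicitly; here they become None.
emails = [record.get("email") for record in records]
cities = [record.get("address", {}).get("city") for record in records]

print(fields)  # ['address', 'email', 'name', 'phone', 'tags']
print(emails)  # ['ana@example.com', None, 'cara@example.com']
print(cities)  # [None, None, 'Manila']
```

The keys make the data far easier to analyze than free text, even though the records are not uniform.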
Data Types / structure
What is Metadata?

● A last category of data type is metadata. From a technical point of view, this is not a separate data structure, but it is one of the most important elements for Big Data analysis and big data solutions.

● Metadata is data about data.

● It provides additional information about a specific set of data.
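As an illustration of "data about data", the sketch below writes a tiny CSV file to a temporary folder and then collects metadata describing it. The file name, its contents, and the keys of the metadata dictionary are all chosen for this example only.

```python
import os
import tempfile
from datetime import datetime, timezone

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "scores.csv")

    # The data itself: a hypothetical two-row CSV file.
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write("name,score\nAna,91.5\nBen,78.0\n")

    # Metadata kept by the file system (size, modification time).
    info = os.stat(path)

    # Metadata derived from the content (column names, number of rows).
    with open(path, encoding="utf-8") as handle:
        header = handle.readline().strip().split(",")
        row_count = sum(1 for _ in handle)

    metadata = {
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
        "columns": header,
        "row_count": row_count,
    }

print(metadata)
```

None of the values in metadata is a student score. They only describe the file that holds the scores.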
02
Data Value Chain
Information flow within a big data system
Data Value Chain

● The data value chain refers to the entire process of transforming raw data into valuable insights, information, and knowledge that can be used for decision-making, innovation, and business growth.

● It involves a series of interconnected activities that add value to data at each stage of the process, from data collection and processing to analysis and dissemination.
Data Value Chain

● The Data Value Chain is introduced to describe the information flow within a big data system as a series of steps needed to generate value and useful insights from data.

● The Big Data Value Chain identifies the following key high-level activities:
Data Value Chain Stages

● The Data Value Chain typically includes the following stages:
✓ Data Acquisition
✓ Data Analysis
✓ Data Curation
✓ Data Storage
✓ Data Usage
Data Value Chain Stages

1. Data Acquisition/Collection

✓ It is the process of gathering, filtering, and cleaning data before it is put in a data warehouse or any other storage solution on which data analysis can be carried out.
✓ Data acquisition is one of the major big data challenges in terms of infrastructure requirements.
Data Value Chain Stages

2. Data Analysis

✓ It is concerned with making the raw data acquired amenable to use in decision-making as well as domain-specific usage.
✓ Data analysis involves exploring, transforming, and modelling data with the goal of highlighting relevant data, synthesizing and extracting useful hidden information with high potential from a business point of view.
Data Value Chain Stages

3. Data Curation

✓ It is the active management of data over its life cycle to ensure it meets the necessary data quality requirements for its effective usage.
✓ Data curation processes can be categorized into different activities such as content creation, selection, classification, transformation, validation, and preservation.
Data Value Chain Stages

4. Data Storage

✓ It is the persistence and management of data in a scalable way that satisfies the needs of applications that require fast access to the data.
✓ Relational Database Management Systems (RDBMS) have been the main, and almost unique, solution to the storage paradigm for nearly 40 years.
Data Value Chain Stages

5. Data Usage

✓ It covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
✓ Data usage in business decision-making can enhance competitiveness through reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
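The five stages can be sketched as a toy pipeline in Python. Everything here is an assumption made for illustration: the sensor readings, the function names, the 23.0 threshold, and the use of a plain list as "storage". A real big data system would use dedicated tools at each stage.

```python
import json
import statistics


def acquire():
    """1. Acquisition: gather, filter, and clean raw readings."""
    raw = ["21.5", "22.0", "", "bad", "23.5", "22.0"]  # hypothetical sensor text
    cleaned = []
    for item in raw:
        try:
            cleaned.append(float(item))
        except ValueError:
            continue  # filter out values that cannot be read as numbers
    return cleaned


def analyze(readings):
    """2. Analysis: turn raw readings into a summary usable for decisions."""
    return {
        "count": len(readings),
        "mean": statistics.mean(readings),
        "max": max(readings),
    }


def curate(summary):
    """3. Curation: validate the summary and document it for later use."""
    if summary["count"] == 0:
        raise ValueError("no valid readings to curate")
    return {**summary, "unit": "celsius", "validated": True}


def store(record, storage):
    """4. Storage: persist the record (a list stands in for a database)."""
    storage.append(json.dumps(record, sort_keys=True))


def use(storage, threshold=23.0):
    """5. Usage: a business rule that reads the stored result."""
    latest = json.loads(storage[-1])
    return "alert" if latest["max"] > threshold else "ok"


storage = []
store(curate(analyze(acquire())), storage)
decision = use(storage)
print(decision)  # alert
```

The stages run in the order the lesson lists them, and each stage adds something to the data before handing it on.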
03
Basic Concepts of Big Data
Information flow within a big data system
Basic Concepts of Big Data

• Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and gather insights from large datasets.
Basic Concepts of Big Data

• An exact definition of “big data” is difficult to nail down because projects, vendors, practitioners, and business professionals use it quite differently. With that in mind, generally speaking, big data is:
✓ large datasets
✓ the category of computing strategies and technologies that are used to handle large datasets
Basic Concepts of Big Data

• Big data refers to the vast and diverse sets of data that are generated at a high velocity, volume, and variety from various sources. The data may be structured, semi-structured, or unstructured and cannot be easily processed or analyzed using traditional data processing techniques.
Key Components of Big Data

1. Volume: Big Data refers to data that is too large to be processed using traditional data processing tools and techniques. The volume of data can range from terabytes to petabytes and beyond.

2. Velocity: Big Data is generated at an unprecedented speed and needs to be processed in real-time or near real-time to derive meaningful insights. This velocity can be measured in microseconds to seconds or minutes.
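A quick back-of-the-envelope calculation shows how velocity turns into volume. The event rate and event size below are hypothetical figures chosen for the example, and the units are decimal (1 TB = 10^12 bytes, 1 PB = 10^15 bytes).

```python
# Hypothetical workload, for illustration only.
events_per_second = 50_000
bytes_per_event = 2_000

SECONDS_PER_DAY = 86_400
TB = 10**12  # decimal terabyte
PB = 10**15  # decimal petabyte

bytes_per_day = events_per_second * bytes_per_event * SECONDS_PER_DAY
tb_per_day = bytes_per_day / TB
days_to_petabyte = PB / bytes_per_day

print(f"{tb_per_day:.2f} TB per day")
print(f"{days_to_petabyte:.1f} days to reach 1 PB")
```

Under these assumptions the system collects 8.64 TB per day and passes one petabyte in a little under four months.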
Big Data Characteristics
Other Characteristics of Big Data – 6V’s

1. Veracity: The variety of sources and the complexity of the processing can lead to challenges in evaluating the quality of the data (and consequently, the quality of the resulting analysis).

2. Variability: Variation in the data leads to wide variation in quality. Additional resources may be needed to identify, process, or filter low quality data to make it more useful.
Other Characteristics of Big Data – 6V’s

3. Value: The ultimate challenge of big data is delivering value. Sometimes, the systems and processes in place are complex enough that using the data and extracting actual value can become difficult.
Where does big data come from?
Sources of Big Data

1. Social Media: Social media platforms such as Facebook, Twitter, LinkedIn, and Instagram generate vast amounts of data in the form of user interactions, posts, comments, likes, and shares.

2. Internet of Things (IoT) Devices: IoT devices such as sensors, smart appliances, and wearable technology generate huge volumes of data in real-time.
Sources of Big Data

3. E-commerce Transactions: E-commerce platforms generate a significant amount of data related to customer behavior, purchase history, preferences, and trends.

4. Machine-generated Data: Machines and applications generate a massive amount of data, including log files, clickstream data, system-generated data, and more.
Sources of Big Data

5. Mobile Devices: Mobile devices generate a significant amount of data, including location data, usage data, and user behavior data.

6. Customer Feedback: Customer feedback in the form of surveys, reviews, and support tickets generates large volumes of data that can be analyzed to improve customer experience.
Sources of Big Data

7. Business Applications: Business applications such as CRM, ERP, and HRM generate a vast amount of data that can be analyzed to improve business operations.

8. Public Data Sources: Public data sources such as government data, weather data, and census data can be combined with other data sources to create more significant insights.
