Data Science

The document discusses the distinction between data and information, emphasizing that processed data provides meaningful insights for decision-making. It covers topics such as data recovery, data loss, data collection methods, types of data, big data, and the importance of data visualization. Additionally, it highlights ethical guidelines and governance frameworks necessary for responsible data management.

Uploaded by

31samidhanarkar
Copyright
© All Rights Reserved

Introduction:

DATA V/S INFORMATION:

Data can be a number, symbol, or text which may or may not mean anything to individuals on
its own.

When the data is processed and put in context, it bears a meaning. This data can be used for
decision-making, calculations, and discussion. Data then becomes information.

For example, a raw list of temperature readings makes little sense on its own. But when the
list is arranged and organized, it can show that the global temperature is rising. The list
has then turned from data into information.
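
As a rough sketch of this idea, the Python snippet below groups an unordered list of readings by year and averages them. The readings are invented for illustration (they are not real measurements), and the function name `yearly_average` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

# Raw data: (year, temperature in degrees C) pairs in no particular order.
# These numbers are invented for illustration only.
readings = [
    (2003, 14.6), (2001, 14.4), (2002, 14.5),
    (2001, 14.2), (2003, 14.8), (2002, 14.7),
]

def yearly_average(pairs):
    """Group readings by year and return {year: average}, ordered by year."""
    by_year = defaultdict(list)
    for year, temp in pairs:
        by_year[year].append(temp)
    return {year: round(mean(temps), 2) for year, temps in sorted(by_year.items())}

# Processed and put in order, the same data now shows a trend.
information = yearly_average(readings)
for year, avg in information.items():
    print(year, avg)
```

The raw pairs are the data; the ordered yearly averages are the information a reader can act on.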

DATA RECOVERY

Data can be lost, corrupted, damaged, or deleted for reasons such as a system crash, disk
failure, or transaction failure. The process of restoring this inaccessible, corrupted, deleted,
or damaged data is called data recovery.

DATA LOSS

The intentional or unintentional destruction of information by people or processes is called
data loss.

Causes of Data Loss:

Hardware Failure
Software Issues
Natural Disaster
Viruses or Malware
Human Error

TYPES OF DATA LOSS:-

System Failure-
• Hardware failure or crash
• Software crash
• Power failure
Natural Disaster-
• Fire
• Other natural disasters
Crime-
• Theft, hacking, etc.
• Computer virus, ransomware
Unintentional Action-
• Accidental deletion of files
• Loss of pen drives or laptops
Intentional Action-
• Deletion of files or programs

Arranging and Collecting Data-

DATA COLLECTION-

The method of gathering data for calculating and analyzing reliable insights is called data
collection, which is done using standardized, validated techniques. Researchers and scientists
base their work on the collected data. Data collection is the primary and essential step in
most cases. The approach to data collection varies widely across fields.

VARIABLES:-

A variable is an attribute of an object that may take different values in different cases. It
can be a number, a characteristic, or a quantity that can be measured. Variables are of two
types:-

Numerical Variable-

It is a variable whose values are numbers, e.g. heights, weights, ages, etc. It is a
quantifiable characteristic.

Categorical Variable-

It is a variable whose values are words, e.g. name, origin/country of birth, etc. It is not a
quantifiable characteristic.
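
The two variable types can be told apart mechanically, at least in simple cases. The sketch below assumes a hypothetical student record (the field names and values are made up) and labels each field by checking whether its value is a number.

```python
# Hypothetical student record; field names and values are assumptions
# chosen only to illustrate the two variable types.
student = {
    "name": "Asha",
    "country_of_birth": "India",
    "height_cm": 152.5,
    "age": 14,
}

def variable_type(value):
    """Return 'numerical' for numbers and 'categorical' for everything else."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return "numerical"
    return "categorical"

types = {field: variable_type(value) for field, value in student.items()}
print(types)
```

This check is only a rule of thumb: a category stored as a number (a roll number or postal code, for example) would be labelled numerical even though it is not a measurable quantity.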

TYPES OF DATA:-

Quantitative Data-

Quantitative data are numbers or values which can be measured. For e.g.:-

Height, weight and age of a student
No. of times an item is sold in a month
No. of items sold in a month

Since this data can be quantified, it is easier to analyze.

Qualitative Data-

Qualitative data, on the other hand, is subjective. For e.g.:-

Traveller's feedback on a hotel
Feedback on customer service
Opinion on something

This data helps us understand experiences in depth.

SOURCES OF DATA-

Primary Data Sources:-

Physical interviews
Online surveys
Feedback forms
Marketing campaigns

Secondary Data Sources:-

Satellite data
IoT sensor data
Transactional databases
Social media
Web traffic

BIG DATA-

When the data exceeds the capacities of traditional databases and a specialised system is
required to manage the data, it is called Big Data.

Characteristics of Big Data are-

Volume: Refers to the size of the data. It determines whether the data can be classified as
big data or not.
Variety: Data sets are collected from a wide range of sources, including traditional
databases, sensor data, etc., and include images, audio, video, etc. It is an essential
characteristic.
Velocity: Refers to the rate at which data is generated. Data is generally created at rapid
speed, resulting in high volumes very soon. For example, social media generates massive
amounts of data every minute.

RETAIL:
Retail chains are spread across the world. They handle millions of customers every minute.
They store and analyze customer data and transactions using big data systems.

SCIENCE:
The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate simulations and
observations on the Discover supercomputing cluster.

SOCIAL MEDIA:
Popular social media platforms store and analyze petabytes of data every day. They use Big
Data techniques for storage and analysis.

HEALTHCARE:
During Covid-19, many governments used Big Data to locate infected people. Big Data was also
used for case identification and medical treatment.

ALGORITHMS TO INTERPRET DATA


Binary Classification - Is this A or B?
Regression Algorithm - How much or how many? (Recommendations)
Anomaly Detection - Is this odd? (Fraud protection)
Clustering Algorithm - Can I group the data?
Reinforcement Learning - What should I do now? (Robots)
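
To make one of these questions concrete, here is a minimal sketch of the "Is this odd?" question. It flags values that lie far from the mean, measured in standard deviations. This is only one simple approach among many; the function name, the threshold of 2.0, and the spending figures are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def find_odd_values(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    centre = mean(values)
    spread = pstdev(values)  # population standard deviation
    if spread == 0:
        return []  # all values are identical, so nothing stands out
    return [v for v in values if abs(v - centre) / spread > threshold]

# Invented daily card spending; one transaction is far larger than the rest.
daily_spend = [120, 130, 125, 118, 122, 127, 900]
odd = find_odd_values(daily_spend)
print(odd)
```

With very small data sets a single extreme value inflates the standard deviation, so the threshold usually needs tuning for the data at hand.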

UNIVARIATE DATA

Univariate data has a single variable.
e.g. height of a student

MULTIVARIATE DATA

Multivariate data involves relationships between multiple variables.
e.g. sales of umbrellas depend on rainfall.
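
The umbrella example can be sketched in code by measuring how strongly two variables move together. The snippet below computes the Pearson correlation coefficient by hand; the rainfall and sales figures are invented, and the function name is a hypothetical helper.

```python
from statistics import mean

def pearson_correlation(xs, ys):
    """Pearson correlation: +1 means the variables rise together, -1 means opposite."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return covariance / (var_x * var_y) ** 0.5

# Invented figures: rainfall per day and umbrellas sold on the same day.
rainfall_mm = [0, 5, 10, 20, 40]
umbrellas_sold = [2, 4, 7, 12, 25]

r = pearson_correlation(rainfall_mm, umbrellas_sold)
print(round(r, 3))
```

A value of r close to 1 indicates that sales rise with rainfall in this sample; it does not by itself prove that one causes the other.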

DATA VISUALISATION

The mechanism of representing raw data in graphical form, so that users can explore the data
and uncover quick insights, is called data visualization.

IMPORTANCE
It makes complex data simple and enables the human mind to understand its significance.
It helps us recognize the trends, patterns and outliers in seemingly meaningless data records.
Data visualization techniques use visuals as a universal, fast and powerful way to
communicate information.

REAL LIFE EXAMPLES


Monitoring student progress with scorecards.
Identifying usage trends of a website.
Monitoring goals and results of a sales executive.
Visualizing spread and impact of pandemics.

DOT PLOT

A dot plot is a graphical representation of data using dots. Dots are used in a dot plot to
illustrate the quantitative values associated with qualitative (categorical) values.
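
A text-only sketch of a dot plot is shown below: each category gets one row and one dot per unit of its value. The survey counts and the function name `dot_plot` are assumptions made for illustration.

```python
def dot_plot(counts):
    """Return one text line per category, with one 'o' per unit of its value."""
    width = max(len(label) for label in counts)  # align the labels
    return [f"{label.ljust(width)} | {'o' * value}" for label, value in counts.items()]

# Invented survey result: favourite fruit (category) and number of votes (quantity).
favourite_fruit = {"Apple": 4, "Banana": 2, "Mango": 6}
for line in dot_plot(favourite_fruit):
    print(line)
```

The length of each row of dots makes the categories easy to compare at a glance.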

BAR GRAPH

A bar graph is a graphical representation of data using bars of different heights.

The bars can be either vertical or horizontal.

• A vertical bar graph is also called a column graph or column chart.

• In a bar graph, the bars are drawn so that they do not touch each other.

MINIMUM:

The smallest number in a dataset is called the minimum. A data set has only one minimum
value, although that value may occur more than once.

=MIN(Range)

MAXIMUM:

Maximum is the largest number in a dataset. A data set has only one maximum value, although
that value may occur more than once.

=MAX(Range)

FREQUENCY:

The number of times a data value repeats (occurs) in a data set is called the frequency of
the data value.

=COUNTIFS(Range, "criteria")
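
The same three measures (minimum, maximum and frequency) can also be computed in Python. This is a small sketch using only the standard library; the list of marks is invented for illustration.

```python
from collections import Counter

# Invented quiz marks for seven students.
marks = [7, 9, 5, 7, 10, 5, 7]

smallest = min(marks)      # minimum: the smallest value in the data set
largest = max(marks)       # maximum: the largest value in the data set
frequency = Counter(marks) # frequency: how many times each value occurs

print("Minimum:", smallest)
print("Maximum:", largest)
print("Frequency of 7:", frequency[7])
```

`Counter` counts every distinct value in one pass, so `frequency[5]` or `frequency[10]` can be looked up without recounting.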

HISTOGRAM

A histogram is a graphical representation of data that plots frequency against intervals of
values.

In other words, a histogram groups data points into ranges of values called bins to provide a
visual representation of numeric data.
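
The binning step behind a histogram can be sketched as follows. The heights, the bin width of 10 and the starting value of 140 are assumptions chosen for illustration, and `histogram_counts` is a hypothetical helper.

```python
def histogram_counts(values, bin_width, start=0):
    """Count how many values fall into each bin [low, low + bin_width)."""
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)  # which bin the value belongs to
        low = start + index * bin_width        # lower edge of that bin
        counts[low] = counts.get(low, 0) + 1
    return dict(sorted(counts.items()))

# Invented heights of nine students, in centimetres.
heights_cm = [142, 147, 151, 153, 155, 158, 160, 161, 168]
bin_width = 10
bins = histogram_counts(heights_cm, bin_width, start=140)

# Print a simple text histogram: one '#' per data point in the bin.
for low, count in bins.items():
    print(f"{low}-{low + bin_width}: {'#' * count}")
```

Changing `bin_width` changes the shape that appears, which is why the choice of bins matters when reading a histogram.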

SHAPES OF HISTOGRAM:

Normal
Bimodal
Right-Skewed
Left-Skewed
Random

SINGLE VARIABLE

NORMAL DISTRIBUTION-

Normal distribution is a common bell-shaped curve pattern. In a normal distribution, the data
points are equally distributed on either side of the average. Statistical calculations must be
done to confirm that data is normally distributed. It is also known as a symmetrical or
bell-shaped distribution.

DIFFERENCE-

A normal distribution has one peak, which represents the average.

A bimodal distribution has two peaks, which suggests that the data was collected from two
different systems.

BIMODAL (distribution)

Has two peaks; it looks like a combination of two normal histograms.

RANDOM (distribution)

Lacks an apparent pattern and has several peaks.

RANGE

The difference between maximum and minimum values is called range.
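
As a short worked example, the range can be computed directly from the maximum and minimum. The temperature values below are invented, and `value_range` is a hypothetical helper name.

```python
# Invented sample values for illustration.
temperatures_c = [21, 25, 19, 30, 27]

def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

spread = value_range(temperatures_c)
print(spread)  # 30 - 19
```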

FREQUENCY TABLE:-

A frequency table is a tabular representation that summarizes raw categorical data.
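
A frequency table can be built from raw categorical data in a few lines. The sketch below uses an invented list of survey answers; `frequency_table` is a hypothetical helper.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, count) rows sorted from most to least frequent."""
    return Counter(values).most_common()

# Invented survey answers: how each student travels to school.
transport = ["bus", "car", "bus", "walk", "bus", "car"]
table = frequency_table(transport)

print(f"{'Category':<10}Frequency")
for category, count in table:
    print(f"{category:<10}{count}")
```

Six raw answers collapse into three rows, which is the summarizing effect the definition describes.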


DIFFERENCE

Right Skewed: It is also called a positively skewed distribution.
It often occurs when the data has a lower boundary, for example when all the collected values
are more than 0.
In a right-skewed distribution, most data points occur on the left, with fewer on the right.

Left Skewed: It is also called a negatively skewed distribution.
It often occurs when the data has an upper boundary, for example when all the collected values
are less than 0.
In a left-skewed distribution, most data points occur on the right, with fewer on the left.
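
One rough way to tell the two kinds of skew apart is to compare the mean with the median: a long tail on the right pulls the mean above the median, and a long tail on the left pulls it below. The sketch below applies that rule of thumb; it is a simplification that can fail for unusual data, and the sample lists and function name are assumptions made for illustration.

```python
from statistics import mean, median

def skew_direction(values):
    """Classify skew by comparing mean and median (a rough rule of thumb)."""
    m, med = mean(values), median(values)
    if m > med:
        return "right-skewed"  # long tail on the right pulls the mean up
    if m < med:
        return "left-skewed"   # long tail on the left pulls the mean down
    return "symmetric"

# Invented samples: most points on the left, then most points on the right.
print(skew_direction([1, 2, 2, 3, 3, 3, 20]))
print(skew_direction([1, 18, 18, 19, 19, 20, 20]))
print(skew_direction([1, 2, 3, 4, 5]))
```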

Ethics in Data Science

ETHICAL GUIDELINES:

Data governance is critical.
Protect your customer.
Do not lie.
Understand the link of data quality.
Private identity and information should remain private.
Shared private information should be treated confidentially.

NEED FOR ETHICAL GUIDELINES:

To collect minimal data.
To identify and scrub sensitive data.
To have a backup plan in case the insights backfire.

GOAL FOR ETHICAL GUIDELINES:

To protect customers' private information.
To distinguish between legal and ethical policies.
To consolidate data collection methods.
To follow well-instructed and approved rules.
To maintain integrity of data and methods: this means ensuring the accuracy, consistency and
reliability of both the data and the process used to collect, store and analyze it.
To implement compliance requirements: this means actively putting into practice the rules,
regulations and standards set by the government.
To establish internal rules for data use: this means creating a set of guidelines within an
organization that defines how employees access the system to collect, store and process data
properly.

KEY GOALS OF ETHICAL GUIDELINES


Professional integrity and accountability.
Integrity of data and methods.
Follow informed consent rules.
Respect confidentiality and privacy.

DATA GOVERNANCE FRAMEWORK


A data governance framework provides a comprehensive approach to managing, storing,
securing and collecting data.
Data governance means cleaner, leaner and better data, which means better analytics, better
decisions and better results.

GOALS OF DATA GOVERNANCE


To improve external and internal communication
To reduce cost
To minimize risks
To increase the value of data
To increase revenue.
To implement compliance requirements
To establish internal rules for data.
