9-1 Data analysis and pre-processing part 1.pdf

Chapter 2 discusses various aspects of data, including types of data sets, characteristics of data, and statistical descriptions. It covers data objects, feature types, and methods for measuring central tendency and dispersion. The chapter emphasizes the importance of understanding data through visualization and similarity measurements.

Chapter 2: Getting to Know Your Data

Dong-Kyu Chae

PI of the Data Intelligence Lab @HYU


Department of Computer Science & Data Science
Hanyang University
Contents
❑ Data Objects and Feature Types

❑ Basic Statistical Descriptions of Data

❑ Data Visualization

❑ Measuring Data Similarity and Dissimilarity

❑ Summary
Types of Data Sets
❑ Tabular
  ▪ Data matrix / table
  ▪ E.g., a set of term-frequency vectors
  ▪ Transaction data
❑ Graph and network
  ▪ Social networks
  ▪ World Wide Web
  ▪ Molecular structures
❑ Time-series (ordered)
  ▪ Video data: sequence of images
  ▪ Temporal data: time-series of trajectories
  ▪ Sequential data: transaction sequences
  ▪ Genetic sequence data
❑ Spatial, image, and multimodal
  ▪ Spatial data: maps
  ▪ Image data
  ▪ Multimodal data (video + image + text + …)

Data matrix (term-frequency vectors):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data:

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
Characteristics of Data
❑ Dimensionality
❑ # of features
❑ Curse of dimensionality

❑ Sparsity
❑ Only a small fraction of the entries are present (non-zero)

❑ Resolution
❑ Patterns depend on the scale

❑ Distribution
❑ Centrality and dispersion
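
As a rough illustration of dimensionality and sparsity, the sketch below counts the zero entries in the three term-frequency vectors of the data matrix. The variable names, and the choice to define sparsity as the fraction of zero entries, are assumptions made for this example.

```python
# Term-frequency vectors for Documents 1-3 (values copied from the data matrix).
doc_term_matrix = [
    [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
]

# Dimensionality: number of features (columns) per data object.
dimensionality = len(doc_term_matrix[0])

# Sparsity (assumed definition): fraction of entries that are zero.
total_entries = sum(len(row) for row in doc_term_matrix)
zero_entries = sum(value == 0 for row in doc_term_matrix for value in row)
sparsity = zero_entries / total_entries

print(f"dimensionality = {dimensionality}, sparsity = {sparsity:.2f}")
```

Real document collections have thousands of term columns, so the fraction of zeros is usually far higher than in this toy matrix.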
Data Objects
❑ Data sets are made up of data objects
❑ A data object represents a real-world entity

❑ Examples:
❑ Sales database: customers, store items, sales
❑ Medical database: patients, treatments
❑ University database: students, professors, courses

❑ Also called tuples, samples, examples, instances, data points, etc.
❑ Data objects are typically described by features
❑ Database rows -> data objects; columns -> features
Features
❑ Features (or dimensions, attributes, variables, etc.):
❑ A measurable property or characteristic of each data object
❑ E.g., customer_ID, name, address, age, occupation, etc.

❑ Types:

❑ Nominal

❑ Binary

❑ Ordinal

❑ Numeric: quantitative
▪ Ratio-scaled
▪ Interval-scaled
Feature Types
❑ Nominal: categories, states, or “names of things”
❑ Has a finite number of values
❑ E.g., Hair_color = {black, blond, brown, grey, red, white, …}
❑ Other examples: marital status, occupation, ID numbers, zip codes, …

❑ Binary
❑ Special case of a nominal feature with only 2 states (0 and 1)
❑ Symmetric case and asymmetric case (will be mentioned later)

❑ Ordinal
❑ Values have a meaningful order (ranking)
❑ However, the magnitude of the difference between successive values is not known
❑ E.g., Size = {small, medium, large}


Numeric Feature Types
❑ Numeric (integer or real-valued)
❑ Ratio-scaled
▪ Ratio is meaningful
▪ Inherent zero-point (0 means absence)
▪ We can speak of values as being an order of magnitude larger than the unit of measurement
▪ E.g., 6 kg is twice as heavy as 3 kg
▪ E.g., temperature in Kelvin, length, counts, money, etc.

❑ Interval-scaled
▪ Only difference is meaningful
▪ Measured on a scale of equal-sized units
▪ Values have order
▪ E.g., temperature in °C or °F
▪ No true zero-point
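
The difference between the two numeric scales can be checked with a short sketch. It compares two Celsius readings (interval-scaled) with the same readings converted to Kelvin (ratio-scaled). The function name and the sample temperatures are made up for illustration.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert an interval-scaled Celsius reading to ratio-scaled Kelvin."""
    return celsius + 273.15


warm_c, cool_c = 20.0, 10.0

# Differences are meaningful on an interval scale: the readings are 10 degrees apart.
difference_c = warm_c - cool_c

# The naive Celsius ratio is 2.0, but it carries no physical meaning
# because 0 °C does not mean "no temperature".
naive_ratio = warm_c / cool_c

# Kelvin has a true zero point, so this ratio is meaningful.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

print(difference_c, naive_ratio, round(kelvin_ratio, 3))
```

The Kelvin ratio comes out only slightly above 1, which shows that 20 °C is not "twice as hot" as 10 °C.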
Basic Statistical Descriptions of Data
❑ Motivation
❑ To better understand the data: central tendency, variation, and spread

❑ Data dispersion characteristics
❑ Median, max, min, quartiles, outliers, variance, etc.
Measuring the Central Tendency
❑ Mean (sample vs. population):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$

Note: n is sample size and N is population size.

❑ Weighted arithmetic mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

❑ Trimmed mean:
▪ The mean taken after chopping off extreme values
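
The three means above can be sketched in a few lines of Python. The helper names `weighted_mean` and `trimmed_mean`, the `trim_fraction` parameter, and the sample data are assumptions chosen for this example. The slides do not say how much to trim.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


def trimmed_mean(values, trim_fraction):
    """Mean after dropping trim_fraction of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number of values cut per end
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("trim_fraction removes every value")
    return mean(kept)


data = [1, 3, 4, 4, 5, 5, 6, 6, 7, 59]  # made-up data with one extreme value

print(mean(data))                      # plain mean, pulled up by 59
print(trimmed_mean(data, 0.1))         # drops the smallest and largest value
print(weighted_mean([80, 90], [1, 3]))  # second value counts three times
```

With this data the single extreme value doubles the plain mean relative to the trimmed mean, which is the motivation for trimming.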
Measuring the Central Tendency
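❑ Median:
❑ Middle value if there is an odd number of values; otherwise, the average of the middle two values
❑ Computing the exact median requires sorting, which does not suit dynamically changing data
❑ Solution: estimation via interpolation (for grouped data):

$$\text{median} = L_1 + \left(\frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}}\right) \times \text{width}$$

▪ L_1: lower boundary of the median group; (Σ freq)_l: sum of the frequencies of all groups below the median group; freq_median: frequency of the median group; width: width of the median group
❑ Example
▪ n = 3194, n/2 = 1597, freq_median = 1500, L_1 = 21
▪ Numerator = 1597 − (200 + 450 + 300) = 647
▪ width = (50 − 21) = 29
▪ Median = 21 + (647/1500) × 29 ≈ 33.5
▪ The median is located in the group 21 ~ 50; within this range, we approximate the exact position of the median.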
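
The interpolation can be written as a small function and checked against the worked example. The function and parameter names are hypothetical; the numbers are the ones from the example, including its width of 29.

```python
def grouped_median(lower_bound, n, freq_below, freq_median, width):
    """Interpolated median for grouped data.

    lower_bound: lower boundary L1 of the group containing the median
    n:           total number of values
    freq_below:  frequencies of all groups below the median group
    freq_median: frequency of the median group
    width:       width of the median group
    """
    return lower_bound + (n / 2 - sum(freq_below)) / freq_median * width


estimate = grouped_median(
    lower_bound=21,
    n=3194,
    freq_below=[200, 450, 300],
    freq_median=1500,
    width=29,
)
print(round(estimate, 2))
```

Only the group boundaries and frequencies are needed, so the estimate can be refreshed without re-sorting the raw values.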
Measuring the Central Tendency
❑ Mode

❑ Value that occurs most frequently in the data


▪ Usually defined on discrete features

❑ Unimodal (1 mode), bimodal (2 modes), trimodal (3 modes)
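
A minimal sketch of finding the mode(s) of a discrete feature, using `statistics.multimode` from the Python standard library. The helper `describe_modality` and its labels are assumptions made for this example.

```python
from statistics import multimode


def describe_modality(values):
    """Return the modes and a label based on how many modes there are."""
    modes = multimode(values)  # all most-frequent values, in first-seen order
    labels = {1: "unimodal", 2: "bimodal", 3: "trimodal"}
    return modes, labels.get(len(modes), "multimodal")


print(describe_modality([1, 2, 2, 3]))
print(describe_modality([1, 1, 2, 3, 3]))
```

When every value occurs exactly once, `multimode` returns all of them, so the label is only informative for data with repeated values.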


Symmetric vs. Skewed Data
❑ Median, mean, and mode of symmetric, positively skewed, and negatively skewed data
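
As a rough illustration, the sketch below compares the three measures on a small positively skewed sample and on its mirror image. The data are made up. The ordering shown (mode < median < mean for positive skew, reversed for negative skew) is a typical pattern, not a guarantee for every data set.

```python
from statistics import mean, median, mode

right_skewed = [1, 2, 2, 2, 3, 3, 4, 5, 9, 15]  # long tail to the right
left_skewed = [16 - x for x in right_skewed]    # mirror image: tail to the left

for name, data in [("positive skew", right_skewed), ("negative skew", left_skewed)]:
    print(f"{name}: mode={mode(data)}, median={median(data)}, mean={mean(data)}")
```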
Measuring the Dispersion of Data
❑ Quartiles, outliers and boxplots
❑ Quartiles: Q1 (25th percentile), Q3 (75th percentile)

❑ Inter-quartile range (IQR): IQR = Q3 – Q1

❑ Five number summary: min, Q1, median, Q3, max

❑ Boxplot: visualization of the above five numbers

▪ The ends of the box are Q1 and Q3; the median is marked; whiskers extend to the min and max; outliers are plotted individually

❑ Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1


Measuring the Dispersion of Data
❑ Box plot: Five-number summary of a distribution
❑ Minimum, Q1, Median, Q3, Maximum

❑ Boxplot
❑ Data is represented with a box
❑ The ends of the box are at Q1 and Q3
❑ => The height of the box is the IQR
❑ The median is marked by a line within the box
❑ Whiskers: two lines outside the box, extended to the Minimum and Maximum
❑ Outliers: points beyond a specified outlier threshold, plotted individually
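
The numbers behind a boxplot can be computed with a short sketch. It assumes the common 1.5 × IQR outlier rule and the "inclusive" quartile method of Python's `statistics.quantiles`. Other quartile conventions give slightly different values. The function names and the data are made up for illustration.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
summary = five_number_summary(data)
outliers = iqr_outliers(data)

print(summary)
print(outliers)
```

Here the value 100 lies far above Q3 + 1.5 × IQR, so a boxplot would draw it as an individual point.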
Measuring the Dispersion of Data
❑ Variance and standard deviation (sample: s, population: σ)
❑ Variance:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

❑ Standard deviation s (or σ) is the square root of variance s² (or σ²)
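
The two variance formulas differ only in the divisor (n − 1 for a sample, N for a population). A minimal check using Python's `statistics` module, on made-up data:

```python
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32

sample_var = variance(data)       # divides by n - 1 = 7
population_var = pvariance(data)  # divides by N = 8

sample_sd = stdev(data)           # square root of the sample variance
population_sd = pstdev(data)      # square root of the population variance

print(sample_var, population_var, sample_sd, population_sd)
```

The sample variance is always the larger of the two for the same data, and the gap shrinks as n grows.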
Measuring the Dispersion of Data
❑ Using the normal distribution property
❑ From μ − σ to μ + σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
❑ From μ − 2σ to μ + 2σ: contains about 95% of them
❑ From μ − 3σ to μ + 3σ: contains about 99.7% of them
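
These percentages can be reproduced from the normal cumulative distribution function. The sketch uses `statistics.NormalDist` from the Python standard library. The helper name `coverage` is an assumption made for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def coverage(k):
    """Probability mass of a normal distribution within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
```

The same fractions hold for any mean and standard deviation, because the intervals are expressed in units of σ around μ.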
Thank You
