9-1 Data analysis and pre-processing part 1.pdf
Dong-Kyu Chae
Contents
❑ Data Objects and Feature Types
❑ Basic Statistical Descriptions of Data
❑ Data Visualization
❑ Summary
Types of Data Sets
❑ Tabular
  ❑ Data matrix / table
    ▪ E.g., a set of term-frequency vectors
  ❑ Transaction data
❑ Graph and network
  ❑ Social networks
  ❑ World Wide Web
  ❑ Molecular structures
❑ Time-series (ordered)
  ❑ Video data: sequence of images
  ❑ Temporal data: time-series of trajectories
  ❑ Sequential data: transaction sequences
  ❑ Genetic sequence data
❑ Spatial, image, and multimodal
  ❑ Spatial data: maps
  ❑ Image data
  ❑ Multimodal data (video + image + text + …)

Data matrix
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk
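The two tabular forms above can be sketched in a few lines of Python. This is only an illustration: the vocabulary order, the function name `term_frequency_vector`, and the sample sentence are assumptions made for the sketch, while the five transactions are copied from the slide.

```python
from collections import Counter

# Vocabulary of the slide's example matrix (the column order is an assumption).
VOCAB = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]

def term_frequency_vector(text, vocab=VOCAB):
    """Count how often each vocabulary term occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

# Hypothetical document, invented for illustration only.
doc = "coach said the team lost the game but the team will play again"
row = term_frequency_vector(doc)  # one row of a data matrix

# Transaction data: each transaction ID (TID) maps to a set of items.
transactions = {
    1: {"Bread", "Coke", "Milk"},
    2: {"Beer", "Bread"},
    3: {"Beer", "Coke", "Diaper", "Milk"},
    4: {"Beer", "Bread", "Diaper", "Milk"},
    5: {"Coke", "Diaper", "Milk"},
}

# Number of transactions that contain each item.
item_counts = Counter(item for items in transactions.values() for item in items)

print(row)
print(item_counts["Milk"])
```

A data matrix has one fixed-length row per object, whereas a transaction is a variable-length set, which is why the two are stored differently here.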
Characteristics of Data
❑ Dimensionality
❑ # of features
❑ Curse of dimensionality
❑ Sparsity
❑ Only a small portion of the possible values is actually present (most entries are zero)
❑ Resolution
❑ Patterns depend on the scale
❑ Distribution
❑ Centrality and dispersion

Data matrix (term-frequency vectors)
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
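As a rough illustration of dimensionality and sparsity, the sketch below measures both on the three-document matrix from the slide. Defining sparsity as the fraction of zero entries is an assumption made here; other definitions are in use.

```python
# Term-frequency matrix from the slide: 3 documents x 10 terms.
matrix = [
    [3, 0, 5, 0, 2, 6, 0, 2, 0, 2],
    [0, 7, 0, 2, 1, 0, 0, 3, 0, 0],
    [0, 1, 0, 0, 1, 2, 2, 0, 3, 0],
]

n_objects = len(matrix)          # number of data objects (rows)
dimensionality = len(matrix[0])  # number of features (columns)
n_cells = n_objects * dimensionality

# Count the entries that are actually present (nonzero).
n_nonzero = sum(1 for row in matrix for value in row if value != 0)

# Assumed definition: sparsity = fraction of entries that are zero.
sparsity = 1 - n_nonzero / n_cells

print(dimensionality, n_nonzero, sparsity)  # 10 15 0.5
```

Real term-frequency matrices have vocabularies of many thousands of terms, so the fraction of zeros is typically far higher than in this toy example.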
Data Objects
❑ Data sets are made up of data objects
❑ A data object represents a real-world entity
❑ Examples:
❑ Sales database: customers, store items, sales
❑ Medical database: patients, treatments
❑ University database: students, professors, courses
❑ Types of features that describe a data object:
❑ Nominal
❑ Binary
❑ Ordinal
❑ Numeric: quantitative
▪ Interval-scaled
▪ Ratio-scaled
Feature Types
❑ Nominal: categories, states, or “names of things”
❑ Has a finite number of values
❑ e.g., Hair_color = {black, blond, brown, grey, red, white, …}
❑ Other examples: marital status, occupation, ID numbers, zip codes, …
❑ Binary
❑ Special case of a nominal feature with only 2 states (0 and 1)
❑ Symmetric and asymmetric cases (covered later)
❑ Ordinal
❑ Values have a meaningful order (ranking)
❑ Interval-scaled (numeric)
▪ Measured on a scale of equal-sized units
▪ Values have order
▪ Only differences between values are meaningful, because there is no true zero-point
▪ e.g., temperature in °C or °F
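A small numeric check can make the interval-scale point concrete. The sketch below uses two made-up temperatures. The difference between them converts consistently between units, while their ratio changes with the unit, which is why ratios are not meaningful on an interval scale.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

# Two hypothetical temperatures, chosen only for illustration.
a_c, b_c = 10.0, 20.0
a_f, b_f = c_to_f(a_c), c_to_f(b_c)  # 50.0 and 68.0

# Differences are meaningful: 10 degrees C always corresponds to 18 degrees F.
diff_c = b_c - a_c
diff_f = b_f - a_f

# Ratios are not: "twice as hot" in Celsius is not "twice as hot" in Fahrenheit.
ratio_c = b_c / a_c
ratio_f = b_f / a_f

print(diff_c, diff_f)    # 10.0 18.0
print(ratio_c, ratio_f)  # 2.0 1.36
```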
Contents
❑ Data Objects and Feature Types
❑ Data Visualization
❑ Summary
Basic Statistical Descriptions of Data
❑ Motivation
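❑ Mean (n is the sample size, N is the population size):
\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\qquad
\mu = \frac{1}{N} \sum_{i=1}^{N} x_i
\]
❑ Weighted mean:
\[
\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}
\]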
❑ Trimmed mean:
▪ The mean taken after chopping off the extreme values at both ends
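The three kinds of mean can be sketched in plain Python as follows. The function names, the `proportion` parameter (the fraction trimmed from each end), and the sample values are all assumptions made for illustration and do not come from the slides.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values at each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut per end
    return mean(ordered[k:len(ordered) - k])

# Hypothetical data with one extreme value (100).
data = [2, 4, 4, 5, 6, 7, 8, 100]

print(mean(data))                              # 17.0
print(weighted_mean([80, 90, 70], [1, 2, 1]))  # 82.5
print(trimmed_mean(data, 0.125))               # drops 2 and 100 first
```

The single extreme value pulls the plain mean up to 17.0, while the trimmed mean stays near the bulk of the data.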
Measuring the Central Tendency
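❑ Median: for grouped data, approximated by interpolation within the median interval
\[
\text{median} = L_1 + \left( \frac{n/2 - \left(\sum \text{freq}\right)_l}{\text{freq}_{\text{median}}} \right) \times \text{width}
\]
▪ L_1: lower boundary of the median interval; (Σ freq)_l: sum of the frequencies of all intervals below it; freq_median: frequency of the median interval; width: width of the median interval
❑ Example
▪ n = 3194, n/2 = 1597, freq_median = 1500, L_1 = 21
▪ Numerator = 1597 − (200 + 450 + 300) = 647
▪ width = (50 − 21) = 29
▪ Median = 21 + (647/1500) × 29 ≈ 33.5
▪ The median is located in the group 21 ~ 50; within this range, the formula approximates its exact position.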
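The interpolation can be checked with a short sketch that plugs in the numbers from the example. The function and parameter names are assumptions made here; the values are the ones on the slide, including its width of 29.

```python
def grouped_median(lower, width, n, cum_freq_below, freq_median):
    """Approximate the median of grouped data by linear interpolation.

    lower          -- lower boundary of the median interval (L1)
    width          -- width of the median interval
    n              -- total number of values
    cum_freq_below -- sum of the frequencies of all lower intervals
    freq_median    -- frequency of the median interval
    """
    return lower + (n / 2 - cum_freq_below) / freq_median * width

approx = grouped_median(
    lower=21,
    width=29,
    n=3194,
    cum_freq_below=200 + 450 + 300,
    freq_median=1500,
)
print(round(approx, 2))  # 33.51
```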
Measuring the Central Tendency
❑ Mode
❑ Boxplot
  ❑ Data is represented with a box
  ❑ The ends of the box are at Q1 and Q3, so the height of the box is the IQR (Q3 − Q1)
  ❑ The median is marked by a line within the box
  ❑ Whiskers: two lines outside the box, extended to the minimum and maximum
  ❑ Outliers: points beyond a specified outlier threshold, plotted individually
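The quantities a boxplot draws can be computed as in the sketch below, using Python's standard `statistics` module. Several choices here are assumptions, not something the slides specify: the threshold multiplier `k = 1.5`, the "inclusive" quartile method, and stopping the whiskers at the most extreme non-outlier values (which equal the minimum and maximum when there are no outliers). The data is invented.

```python
import statistics

def boxplot_summary(values, k=1.5):
    """Return the quantities needed to draw a boxplot.

    k is the outlier-threshold multiplier; 1.5 is an assumed,
    conventional choice, not a value given in the slides.
    """
    ordered = sorted(values)
    # Quartile conventions differ between tools; "inclusive" is one option.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr
    high_fence = q3 + k * iqr
    inside = [v for v in ordered if low_fence <= v <= high_fence]
    outliers = [v for v in ordered if v < low_fence or v > high_fence]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "whisker_low": inside[0], "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Hypothetical data, invented for illustration.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 40]
summary = boxplot_summary(data)
print(summary)
```

For this data the box spans 3.25 to 7.75, the whiskers reach 1 and 9, and 40 is reported as an outlier.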
Measuring the Dispersion of Data
❑ Variance and standard deviation (sample: s, population: σ)
❑ Variance:
\[
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]
\[
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2
\]
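The difference between the two variance formulas is only the divisor, which the following sketch makes explicit. The function names and the data are assumptions made for illustration.

```python
import math

def sample_variance(values):
    """s^2: squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """sigma^2: squared deviations from the population mean, divided by N."""
    N = len(values)
    mu = sum(values) / N
    return sum((x - mu) ** 2 for x in values) / N

# Hypothetical data, invented for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

s2 = sample_variance(data)
sigma2 = population_variance(data)
print(s2, math.sqrt(s2))          # sample variance and s
print(sigma2, math.sqrt(sigma2))  # 4.0 2.0
```

The standard deviation is the square root of the variance in both cases; the standard library's `statistics.variance` and `statistics.pvariance` compute the same two quantities.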