Lec 5
IS314
■ Quantitative Attributes
■ Numeric -> A measurable quantity, represented by integer or real values.
■ Discrete -> Takes a finite (or countably infinite) set of values, which may
be numeric.
■ Continuous -> Has an infinite number of possible states, i.e. there are
infinitely many values between 6 and 7. For example, the Height attribute is
continuous and may take values such as 1.66, 1.76, 1.77, etc. It is always
represented as a real number.
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
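The three ratios above all have the form (r + s) / (q + r + s), which matches the usual asymmetric binary dissimilarity. A minimal sketch, assuming that is the measure intended here: q (attributes that are 1 in both objects), r and s (attributes that are 1 in only one of the two objects) are names introduced for illustration, and the counts are read directly off the ratios rather than from the underlying records.

```python
def asymmetric_binary_dissimilarity(q: int, r: int, s: int) -> float:
    """Return (r + s) / (q + r + s); 0/0 matches are ignored by design."""
    total = q + r + s
    if total == 0:
        raise ValueError("at least one attribute must be 1 in one of the objects")
    return (r + s) / total


# Counts taken from the numerators and denominators of the worked ratios.
d_jack_mary = asymmetric_binary_dissimilarity(q=2, r=0, s=1)
d_jack_jim = asymmetric_binary_dissimilarity(q=1, r=1, s=1)
d_jim_mary = asymmetric_binary_dissimilarity(q=1, r=1, s=2)

print(round(d_jack_mary, 2), round(d_jack_jim, 2), round(d_jim_mary, 2))
```

The smallest value, d(jack, mary), marks the most similar pair of the three.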
Example:
Data Matrix and Dissimilarity Matrix
Data Matrix
Dissimilarity Matrix (with Euclidean Distance)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    5.1   5.1   0
x4    4.24  1     5.39  0
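A lower-triangular dissimilarity matrix like the one above can be built from any data matrix by computing the distance for each pair of objects once. The sketch below uses a small hypothetical set of 2-D points (not the data matrix from the slide) and Python's standard math.dist for the Euclidean distance.

```python
import math


def dissimilarity_matrix(points):
    """Lower-triangular matrix of pairwise Euclidean distances."""
    return [
        [math.dist(points[i], points[j]) for j in range(i + 1)]
        for i in range(len(points))
    ]


# Hypothetical data matrix: one row per object, one column per attribute.
points = [(0, 0), (3, 4), (6, 8)]
matrix = dissimilarity_matrix(points)

for label, row in zip(("p1", "p2", "p3"), matrix):
    print(label, " ".join(f"{d:.2f}" for d in row))
```

Only the lower triangle is stored because the distance is symmetric and every diagonal entry is 0.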
Minkowski distance:
d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h)^(1/h)
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called the L-h norm)
■ Properties
■ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
■ d(i, j) = d(j, i) (Symmetry)
■ d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
■ A distance that satisfies these properties is a metric
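The three properties above can be checked numerically on a finite sample of points. This is only a sanity check under assumed sample data, not a proof; is_metric_on and the sample points are hypothetical names and values chosen for illustration. Squared Euclidean distance is included as an example of a measure that fails the triangle inequality.

```python
import itertools
import math


def is_metric_on(points, dist, tol=1e-9):
    """Check the metric properties on a finite sample of distinct points."""
    for i in points:
        if abs(dist(i, i)) > tol:          # d(i, i) = 0
            return False
    for i, j in itertools.combinations(points, 2):
        if dist(i, j) <= 0:                # d(i, j) > 0 if i != j
            return False
        if abs(dist(i, j) - dist(j, i)) > tol:   # symmetry
            return False
    for i, j, k in itertools.permutations(points, 3):
        if dist(i, j) > dist(i, k) + dist(k, j) + tol:   # triangle inequality
            return False
    return True


def squared_euclidean(a, b):
    return math.dist(a, b) ** 2


sample = [(0, 0), (1, 0), (2, 0), (1, 3)]
euclid_ok = is_metric_on(sample, math.dist)
squared_ok = is_metric_on(sample, squared_euclidean)
print(euclid_ok, squared_ok)
```

For the squared distance, going from (0, 0) to (2, 0) costs 4, while the detour through (1, 0) costs 1 + 1 = 2, so the triangle inequality is violated.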
■ h = 1: (L1 norm) Manhattan (city block) distance
d(i, j) = |xi1 − xj1| + |xi2 − xj2| + ... + |xip − xjp|
■ h = 2: (L2 norm) Euclidean distance
d(i, j) = sqrt(|xi1 − xj1|^2 + |xi2 − xj2|^2 + ... + |xip − xjp|^2)
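Both special cases can be obtained from one function parameterised by the order h. A minimal sketch, where the function name minkowski and the two example points are assumptions made for illustration:

```python
def minkowski(i, j, h):
    """L-h norm distance between two equal-length numeric sequences."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of attributes")
    if h < 1:
        raise ValueError("h must be at least 1 for the result to be a metric")
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)


# Hypothetical 2-D objects: attribute differences are 3 and 4.
i, j = (1, 2), (4, 6)
manhattan = minkowski(i, j, 1)   # |3| + |4|
euclidean = minkowski(i, j, 2)   # sqrt(3^2 + 4^2)
print(manhattan, euclidean)
```

The Manhattan distance is never smaller than the Euclidean distance for the same pair of objects, as the example values 7 and 5 illustrate.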
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = d1•d2 / (||d1|| × ||d2||) = 25 / (6.481 × 4.12) = 0.94
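The worked example can be reproduced in a few lines. The sketch below uses the same vectors d1 and d2; the function name cosine_similarity is an assumption made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))  # 0.94
```

A value close to 1 means the two vectors point in nearly the same direction, regardless of their lengths.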
Data Preprocessing
Data Cleaning
■ Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
due to faulty instruments, human or computer error, or transmission error
■ incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
■ e.g., Occupation=“ ” (missing data)
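Incomplete records of this kind can be flagged before any analysis. A minimal sketch, assuming records are stored as dictionaries; the function name find_missing, the attribute names, and the sample records are all hypothetical.

```python
def find_missing(records, required):
    """Map row index -> list of required attributes that are absent or blank."""
    problems = {}
    for idx, record in enumerate(records):
        missing = [
            attr for attr in required
            if record.get(attr) is None or str(record[attr]).strip() == ""
        ]
        if missing:
            problems[idx] = missing
    return problems


# Hypothetical customer records; row 1 has a blank value, row 2 lacks the attribute.
records = [
    {"name": "A", "occupation": "engineer"},
    {"name": "B", "occupation": " "},
    {"name": "C"},
]
report = find_missing(records, ["name", "occupation"])
print(report)
```

Treating a blank string and an absent attribute the same way covers both "lacking attribute values" and "lacking certain attributes of interest".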
time of entry
■ history or changes of the data were not registered
class: smarter
■ the most probable value: inference-based such as
Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
■ faulty data collection instruments
■ technology limitation
■ incomplete data
■ inconsistent data
■ Binning
■ first sort data and partition into (equal-frequency) bins
■ Clustering
■ detect and remove outliers
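The binning step listed above (sort, then partition into equal-frequency bins) can be sketched as follows. The function name equal_frequency_bins and the sample values are assumptions for illustration; when the values do not divide evenly, this sketch gives the extra items to the first bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort values and split them into n_bins bins of (nearly) equal size."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        end = start + size + (1 if b < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


# Hypothetical unsorted measurements.
prices = [21, 4, 28, 15, 8, 34, 24, 9, 25, 26, 21, 29]
bins = equal_frequency_bins(prices, 3)
for number, contents in enumerate(bins, start=1):
    print(f"Bin {number}: {contents}")
```

Each bin holds the same number of values, so the bin boundaries follow the data distribution rather than fixed-width intervals.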
Thank You.
Questions?