Lec 5

Data Mining

IS314

Dr. Ayman Alhelbawy, 26th March 2022
Basic Statistical Descriptions of Data

Measuring Data Similarity and Dissimilarity

Data Attribute Types


■ Qualitative Attributes
■ Nominal —> A form of names of things, e.g. colours; categorical data
■ Ordinal —> Values have a meaningful sequence or order between them, but the
magnitude between values is not actually known. E.g. grades (A, B, C, D)
■ Binary —> Has only two values, e.g. True/False, Male/Female, Pass/Fail
■ Symmetric :- Both values are equally important, e.g. Gender: "Male/Female"
■ Asymmetric :- The two values are not equally important, e.g. test result:
"Pass/Fail"

■ Quantitative Attributes
■ Numeric —> A measurable quantity, represented as integer or real values.
■ Discrete —> Has a finite or countably infinite set of values; the values may
be numeric.
■ Continuous —> Has an infinite number of states, i.e. there are infinitely
many values between 6 and 7. For example, Height is a continuous attribute and
may be 1.66, 1.76, 1.77, etc. It is always a real number value.

Similarity and Dissimilarity


■ Similarity
■ Numerical measure of how alike two data objects are

■ Value is higher when objects are more alike

■ Often falls in the range [0,1]

■ Dissimilarity (e.g., distance)


■ Numerical measure of how different two data objects are

■ Lower when objects are more alike

■ Minimum dissimilarity is often 0

■ Upper limit varies

■ Proximity refers to either a similarity or a dissimilarity measure

Data Matrix and Dissimilarity Matrix


■ Data matrix
■ n data points with p dimensions
■ Two modes

⎡ x11 ... x1f ... x1p ⎤
⎢ ... ... ... ... ... ⎥
⎢ xi1 ... xif ... xip ⎥
⎢ ... ... ... ... ... ⎥
⎣ xn1 ... xnf ... xnp ⎦

■ Dissimilarity matrix
■ n data points, but registers only the distance
■ A triangular matrix
■ Single mode

⎡ 0                           ⎤
⎢ d(2,1)  0                   ⎥
⎢ d(3,1)  d(3,2)  0           ⎥
⎢ :       :       :           ⎥
⎣ d(n,1)  d(n,2)  ...  ...  0 ⎦

Proximity Measure for Nominal Attributes

■ Can take 2 or more states, e.g., red, yellow, blue,
green (generalisation of a binary attribute)
■ Method 1: Simple matching
■ m: # of matches, p: total # of variables

d(i, j) = (p − m) / p

■ Method 2: Use a large number of binary attributes


■ creating a new binary attribute for each of the M nominal states
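
A minimal Python sketch of Method 1 (simple matching); the colour/size/shape
records are made up for illustration:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p."""
    p = len(obj_i)                                      # total # of variables
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p

# Two objects described by three nominal attributes (colour, size, shape)
x = ("red", "small", "round")
y = ("red", "large", "round")
print(nominal_dissimilarity(x, y))  # (3 - 2) / 3 = 0.333...
```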

Proximity Measure for Binary Attributes


■ A contingency table for binary data:

              Object j
                1      0
Object i   1    q      r
           0    s      t

(q = # attributes where both are 1, r = i is 1 and j is 0,
s = i is 0 and j is 1, t = both are 0)

■ Distance measure for symmetric binary variables:

d(i, j) = (r + s) / (q + r + s + t)

■ Distance measure for asymmetric binary variables:

d(i, j) = (r + s) / (q + r + s)

■ Jaccard coefficient (similarity measure for asymmetric binary variables):

sim_Jaccard(i, j) = q / (q + r + s)

■ Note: the Jaccard coefficient is the same as "coherence"
Dissimilarity between Binary Variables


■ Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
■ Gender is a symmetric attribute
■ The remaining attributes are asymmetric binary
■ Let the values Y and P be 1, and the value N 0

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
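A short Python sketch reproducing these three distances, assuming the Y/P → 1,
N → 0 encoding above (the symmetric Gender attribute is excluded):

```python
def asymmetric_binary_distance(i, j):
    """d(i, j) = (r + s) / (q + r + s); 0-0 matches (t) are ignored."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # 1-1 matches
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

#       Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asymmetric_binary_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_distance(jim, mary), 2))   # 0.75
```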

Example: Data Matrix and Dissimilarity Matrix

Data Matrix

point attribute1 attribute2


x1 1 2
x2 3 5
x3 2 0
x4 4 5

Dissimilarity Matrix
(with Euclidean Distance)

x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
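
A minimal sketch reproducing this matrix with scipy's pdist and squareform:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2],   # x1
              [3, 5],   # x2
              [2, 0],   # x3
              [4, 5]])  # x4

# Pairwise Euclidean distances, expanded to a square matrix
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))
# [[0.   3.61 2.24 4.24]
#  [3.61 0.   5.1  1.  ]
#  [2.24 5.1  0.   5.39]
#  [4.24 1.   5.39 0.  ]]
```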

Distance on Numeric Data: Minkowski Distance


■ Minkowski distance: A popular distance measure

d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + … + |xip − xjp|^h)^(1/h)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called the L-h norm)
■ Properties
■ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
■ d(i, j) = d(j, i) (Symmetry)
■ d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
■ A distance that satisfies these properties is a metric
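
A minimal Python sketch of this definition; h selects the norm (as h → ∞ the
value approaches the supremum distance discussed on the next slide):

```python
def minkowski(i, j, h):
    """L-h norm distance between two p-dimensional points."""
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))            # 5.0  (h = 1, Manhattan)
print(round(minkowski(x1, x2, 2), 2))  # 3.61 (h = 2, Euclidean)
```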


Special Cases of Minkowski Distance

■ h = 1: Manhattan (city block, L1 norm) distance


■ E.g., the Hamming distance: the number of bits that are different
between two binary vectors

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
■ h = 2: (L2 norm) Euclidean distance

d(i, j) = √(|xi1 − xj1|² + |xi2 − xj2|² + … + |xip − xjp|²)

■ h → ∞: "supremum" (Lmax norm, L∞ norm) distance


■ This is the maximum difference between any component
(attribute) of the vectors


Example: Minkowski Distance


Data Matrix

point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

(The slide also plots the four points in the attribute-1 / attribute-2 plane.)

Dissimilarity Matrices

Manhattan (L1)

L1   x1   x2   x3   x4
x1   0
x2   5    0
x3   3    6    0
x4   6    1    7    0

Euclidean (L2)

L2   x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

Supremum (L∞)

L∞   x1   x2   x3   x4
x1   0
x2   3    0
x3   2    5    0
x4   3    1    5    0
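These matrices can be checked with scipy, whose 'cityblock', 'euclidean' and
'chebyshev' metrics correspond to L1, L2 and L∞:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])  # x1..x4

for name in ("cityblock", "euclidean", "chebyshev"):
    print(name)
    print(np.round(squareform(pdist(X, metric=name)), 2))
```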
Cosine Similarity

■ A document can be represented by thousands of attributes, each recording


the frequency of a particular word (such as keywords) or phrase in the
document.

■ Other vector objects: gene features in micro-arrays, …


■ Applications: information retrieval, biological taxonomy, gene feature
mapping, ...
■ Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors),
then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates vector dot product, ||d||: the Euclidean norm of vector d


Example: Cosine Similarity

■ cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the Euclidean norm of
vector d

■ Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
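
The same computation as a short numpy sketch:

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

# Dot product divided by the product of the Euclidean norms
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))  # 0.94
```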


Data Preprocessing

Major Tasks in Data Preprocessing


■ Data cleaning
■ Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
■ Data integration
■ Integration of multiple databases, data cubes, or files
■ Data reduction
■ Dimensionality reduction
■ Numerosity reduction
■ Data compression
■ Data transformation and data discretization
■ Normalization
■ Concept hierarchy generation


Data Cleaning
■ Data in the Real World Is Dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission errors
■ incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
■ e.g., Occupation=“ ” (missing data)

■ noisy: containing noise, errors, or outliers


■ e.g., Salary=“−10” (an error)

■ inconsistent: containing discrepancies in codes or names, e.g.,


■ Age=“42”, Birthday=“03/07/2010”

■ Was rating “1, 2, 3”, now rating “A, B, C”

■ discrepancy between duplicate records

■ Intentional (e.g., disguised missing data)


■ Jan. 1 as everyone’s birthday?


Incomplete (Missing) Data

■ Data is not always available


■ E.g., many tuples have no recorded value for several

attributes, such as customer income in sales data


■ Missing data may be due to
■ equipment malfunction

■ inconsistent with other recorded data and thus deleted

■ data not entered due to misunderstanding

■ certain data may not be considered important at the

time of entry
■ history or changes of the data were not registered

■ Missing data may need to be inferred


How to Handle Missing Data?


■ Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
■ Fill in the missing value manually: tedious + infeasible?
■ Fill it in automatically with
■ a global constant : e.g., “unknown”, a new class?!

■ the attribute mean

■ the attribute mean for all samples belonging to the same

class: smarter
■ the most probable value: inference-based such as

Bayesian formula or decision tree
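
As an illustrative sketch of the two mean-based strategies, using pandas on a
made-up table (the cls and income columns are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [50.0, np.nan, 30.0, 34.0, np.nan],
})

# Fill with the overall attribute mean: (50 + 30 + 34) / 3 = 38
df["income_global"] = df["income"].fillna(df["income"].mean())

# Smarter: fill with the mean of samples in the same class
# (class A mean = 50, class B mean = 32)
df["income_by_cls"] = df.groupby("cls")["income"].transform(
    lambda s: s.fillna(s.mean()))
print(df)
```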



Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
■ faulty data collection instruments

■ data entry problems

■ data transmission problems

■ technology limitation

■ inconsistency in naming convention

■ Other data problems which require data cleaning


■ duplicate records

■ incomplete data

■ inconsistent data


How to Handle Noisy Data?

■ Binning
■ first sort the data and partition it into (equal-frequency) bins
■ then one can smooth by bin means, smooth by bin
medians, smooth by bin boundaries, etc. (a sketch follows this list)


■ Regression
■ smooth by fitting the data into regression functions

■ Clustering
■ detect and remove outliers

■ Combined computer and human inspection


■ detect suspicious values and check by human (e.g., deal

with possible outliers)
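
A minimal sketch of the binning approach above, smoothing by equal-frequency
bin means; the price values are a made-up example:

```python
def smooth_by_bin_means(values, bins):
    """Sort, split into equal-frequency bins, and replace each value by
    its bin mean. Assumes len(values) is divisible by bins."""
    data = sorted(values)          # first sort the data
    size = len(data) // bins       # depth of each equal-frequency bin
    out = []
    for b in range(bins):
        bin_vals = data[b * size:(b + 1) * size]
        mean = sum(bin_vals) / size
        out.extend([mean] * size)  # every value in the bin becomes the mean
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```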


Data Cleaning as a Process


■ Data discrepancy detection
■ Use metadata (e.g., domain, range, dependency, distribution)
■ Check field overloading
■ Check uniqueness rule, consecutive rule and null rule
■ Use commercial tools
■ Data scrubbing: use simple domain knowledge (e.g., postal code,

spell-check) to detect errors and make corrections


■ Data auditing: analyzing the data to discover rules and relationships
to detect violators (e.g., correlation and clustering to find outliers)


■ Data migration and integration
■ Data migration tools: allow transformations to be specified
■ ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
■ Integration of the two processes
■ Iterative and interactive (e.g., Potter's Wheel)


Thank You.
Questions????
