Lec 5

Data Mining

IS314

Dr. Ayman Alhelbawy, 26th March 2022
Basic Statistical Descriptions of Data

Measuring Data Similarity and Dissimilarity

Data Attribute Types


■ Qualitative Attributes
■ Nominal —> A form of names of things, e.g. colours; categorical data
■ Ordinal —> Values have a meaningful sequence or order between them, but the
magnitude between values is not actually known. E.g. grades (A, B, C, D)
■ Binary —> Has only two values, e.g. True/False, Male/Female, Pass/Fail
■ Symmetric :- Both values are equally important, e.g. Gender: "Male/Female"
■ Asymmetric :- The two values are not equally important, e.g. test result:
"Pass/Fail"

■ Quantitative Attributes
■ Numeric —> A measurable quantity, represented as integer or real values.
■ Discrete —> Has a finite or countably infinite set of values; the values may
be numeric.
■ Continuous —> Has an infinite number of states, i.e. there are infinitely
many values between 6 and 7. For example, Height is a continuous attribute and
may be 1.66, 1.76, 1.77, etc. It is always a real number value.

Similarity and Dissimilarity


■ Similarity
■ Numerical measure of how alike two data objects are

■ Value is higher when objects are more alike

■ Often falls in the range [0,1]

■ Dissimilarity (e.g., distance)


■ Numerical measure of how different two data objects are

■ Lower when objects are more alike

■ Minimum dissimilarity is often 0

■ Upper limit varies

■ Proximity refers to either a similarity or a dissimilarity measure

Data Matrix and Dissimilarity Matrix


■ Data matrix
■ n data points with p dimensions
■ Two modes

⎡ x11 ... x1f ... x1p ⎤
⎢ ... ... ... ... ... ⎥
⎢ xi1 ... xif ... xip ⎥
⎢ ... ... ... ... ... ⎥
⎣ xn1 ... xnf ... xnp ⎦

■ Dissimilarity matrix
■ n data points, but registers only the distance
■ A triangular matrix
■ Single mode

⎡ 0                           ⎤
⎢ d(2,1)  0                   ⎥
⎢ d(3,1)  d(3,2)  0           ⎥
⎢ :       :       :           ⎥
⎣ d(n,1)  d(n,2)  ...  ...  0 ⎦

Proximity Measure for Nominal Attributes

■ Can take 2 or more states, e.g., red, yellow, blue,
green (generalisation of a binary attribute)
■ Method 1: Simple matching
■ m: # of matches, p: total # of variables

d(i, j) = (p − m) / p

■ Method 2: Use a large number of binary attributes


■ creating a new binary attribute for each of the M nominal states
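
A minimal Python sketch of Method 1 (simple matching); the colour/size/shape
records are made up for illustration:

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p."""
    p = len(obj_i)                                      # total # of variables
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p

# Two objects described by three nominal attributes (colour, size, shape)
x = ("red", "small", "round")
y = ("red", "large", "round")
print(nominal_dissimilarity(x, y))  # (3 - 2) / 3 = 0.333...
```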

Proximity Measure for Binary Attributes


■ A contingency table for binary data:

              Object j
                1      0
Object i   1    q      r
           0    s      t

(q = # attributes where both are 1, r = i is 1 and j is 0,
s = i is 0 and j is 1, t = both are 0)

■ Distance measure for symmetric binary variables:

d(i, j) = (r + s) / (q + r + s + t)

■ Distance measure for asymmetric binary variables:

d(i, j) = (r + s) / (q + r + s)

■ Jaccard coefficient (similarity measure for asymmetric binary variables):

sim_Jaccard(i, j) = q / (q + r + s)

■ Note: the Jaccard coefficient is the same as "coherence"
Dissimilarity between Binary Variables


■ Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
■ Gender is a symmetric attribute
■ The remaining attributes are asymmetric binary
■ Let the values Y and P be 1, and the value N 0

d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(Jack, Jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(Jim, Mary) = (1 + 2) / (1 + 1 + 2) = 0.75
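A short Python sketch reproducing these three distances, assuming the Y/P → 1,
N → 0 encoding above (the symmetric Gender attribute is excluded):

```python
def asymmetric_binary_distance(i, j):
    """d(i, j) = (r + s) / (q + r + s); 0-0 matches (t) are ignored."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)  # 1-1 matches
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    return (r + s) / (q + r + s)

#       Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asymmetric_binary_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_binary_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_binary_distance(jim, mary), 2))   # 0.75
```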

Example: Data Matrix and Dissimilarity Matrix

Data Matrix

point attribute1 attribute2


x1 1 2
x2 3 5
x3 2 0
x4 4 5

Dissimilarity Matrix
(with Euclidean Distance)

x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
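
A minimal sketch reproducing this matrix with scipy's pdist and squareform:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2],   # x1
              [3, 5],   # x2
              [2, 0],   # x3
              [4, 5]])  # x4

# Pairwise Euclidean distances, expanded to a square matrix
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))
# [[0.   3.61 2.24 4.24]
#  [3.61 0.   5.1  1.  ]
#  [2.24 5.1  0.   5.39]
#  [4.24 1.   5.39 0.  ]]
```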

Distance on Numeric Data: Minkowski Distance


■ Minkowski distance: A popular distance measure

d(i, j) = (|xi1 − xj1|^h + |xi2 − xj2|^h + … + |xip − xjp|^h)^(1/h)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and h is the order (the
distance so defined is also called the L-h norm)
■ Properties
■ d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
■ d(i, j) = d(j, i) (Symmetry)
■ d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
■ A distance that satisfies these properties is a metric
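
A minimal Python sketch of this definition; h selects the norm (as h → ∞ the
value approaches the supremum distance discussed on the next slide):

```python
def minkowski(i, j, h):
    """L-h norm distance between two p-dimensional points."""
    return sum(abs(a - b) ** h for a, b in zip(i, j)) ** (1 / h)

x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))            # 5.0  (h = 1, Manhattan)
print(round(minkowski(x1, x2, 2), 2))  # 3.61 (h = 2, Euclidean)
```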


Special Cases of Minkowski Distance

■ h = 1: Manhattan (city block, L1 norm) distance


■ E.g., the Hamming distance: the number of bits that are different
between two binary vectors

d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
■ h = 2: (L2 norm) Euclidean distance

d(i, j) = √(|xi1 − xj1|² + |xi2 − xj2|² + … + |xip − xjp|²)

■ h → ∞: "supremum" (Lmax norm, L∞ norm) distance


■ This is the maximum difference between any component
(attribute) of the vectors


Example: Minkowski Distance


Data Matrix

point  attribute 1  attribute 2
x1     1            2
x2     3            5
x3     2            0
x4     4            5

(The slide also plots the four points in the attribute-1 / attribute-2 plane.)

Dissimilarity Matrices

Manhattan (L1)

L1   x1   x2   x3   x4
x1   0
x2   5    0
x3   3    6    0
x4   6    1    7    0

Euclidean (L2)

L2   x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

Supremum (L∞)

L∞   x1   x2   x3   x4
x1   0
x2   3    0
x3   2    5    0
x4   3    1    5    0
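These matrices can be checked with scipy, whose 'cityblock', 'euclidean' and
'chebyshev' metrics correspond to L1, L2 and L∞:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[1, 2], [3, 5], [2, 0], [4, 5]])  # x1..x4

for name in ("cityblock", "euclidean", "chebyshev"):
    print(name)
    print(np.round(squareform(pdist(X, metric=name)), 2))
```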
Cosine Similarity

■ A document can be represented by thousands of attributes, each recording


the frequency of a particular word (such as keywords) or phrase in the
document.

■ Other vector objects: gene features in micro-arrays, …


■ Applications: information retrieval, biological taxonomy, gene feature
mapping, ...
■ Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors),
then
cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates vector dot product, ||d||: the Euclidean norm of vector d


Example: Cosine Similarity

■ cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates the vector dot product and ||d|| is the Euclidean norm of
vector d

■ Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1 • d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
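
The same computation as a short numpy sketch:

```python
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0])
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1])

# Dot product divided by the product of the Euclidean norms
cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(cos, 2))  # 0.94
```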


Data Preprocessing

Major Tasks in Data Preprocessing


■ Data cleaning
■ Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
■ Data integration
■ Integration of multiple databases, data cubes, or files
■ Data reduction
■ Dimensionality reduction
■ Numerosity reduction
■ Data compression
■ Data transformation and data discretization
■ Normalization
■ Concept hierarchy generation


Data Cleaning
■ Data in the Real World Is Dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission errors
■ incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
■ e.g., Occupation=“ ” (missing data)

■ noisy: containing noise, errors, or outliers


■ e.g., Salary=“−10” (an error)

■ inconsistent: containing discrepancies in codes or names, e.g.,


■ Age=“42”, Birthday=“03/07/2010”

■ Was rating “1, 2, 3”, now rating “A, B, C”

■ discrepancy between duplicate records

■ Intentional (e.g., disguised missing data)


■ Jan. 1 as everyone’s birthday?


Incomplete (Missing) Data

■ Data is not always available


■ E.g., many tuples have no recorded value for several

attributes, such as customer income in sales data


■ Missing data may be due to
■ equipment malfunction

■ inconsistent with other recorded data and thus deleted

■ data not entered due to misunderstanding

■ certain data may not be considered important at the

time of entry
■ history or changes of the data were not registered

■ Missing data may need to be inferred


How to Handle Missing Data?


■ Ignore the tuple: usually done when class label is missing
(when doing classification)—not effective when the % of
missing values per attribute varies considerably
■ Fill in the missing value manually: tedious + infeasible?
■ Fill it in automatically with
■ a global constant : e.g., “unknown”, a new class?!

■ the attribute mean

■ the attribute mean for all samples belonging to the same

class: smarter
■ the most probable value: inference-based such as

Bayesian formula or decision tree
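
As an illustrative sketch of the two mean-based strategies, using pandas on a
made-up table (the cls and income columns are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "cls":    ["A", "A", "B", "B", "B"],
    "income": [50.0, np.nan, 30.0, 34.0, np.nan],
})

# Fill with the overall attribute mean: (50 + 30 + 34) / 3 = 38
df["income_global"] = df["income"].fillna(df["income"].mean())

# Smarter: fill with the mean of samples in the same class
# (class A mean = 50, class B mean = 32)
df["income_by_cls"] = df.groupby("cls")["income"].transform(
    lambda s: s.fillna(s.mean()))
print(df)
```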



Noisy Data
■ Noise: random error or variance in a measured variable
■ Incorrect attribute values may be due to
■ faulty data collection instruments

■ data entry problems

■ data transmission problems

■ technology limitation

■ inconsistency in naming convention

■ Other data problems which require data cleaning


■ duplicate records

■ incomplete data

■ inconsistent data


How to Handle Noisy Data?

■ Binning
■ first sort the data and partition it into (equal-frequency) bins
■ then one can smooth by bin means, smooth by bin
medians, smooth by bin boundaries, etc. (a sketch follows this list)


■ Regression
■ smooth by fitting the data into regression functions

■ Clustering
■ detect and remove outliers

■ Combined computer and human inspection


■ detect suspicious values and check by human (e.g., deal

with possible outliers)
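
A minimal sketch of the binning approach above, smoothing by equal-frequency
bin means; the price values are a made-up example:

```python
def smooth_by_bin_means(values, bins):
    """Sort, split into equal-frequency bins, and replace each value by
    its bin mean. Assumes len(values) is divisible by bins."""
    data = sorted(values)          # first sort the data
    size = len(data) // bins       # depth of each equal-frequency bin
    out = []
    for b in range(bins):
        bin_vals = data[b * size:(b + 1) * size]
        mean = sum(bin_vals) / size
        out.extend([mean] * size)  # every value in the bin becomes the mean
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(prices, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```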


Data Cleaning as a Process


■ Data discrepancy detection
■ Use metadata (e.g., domain, range, dependency, distribution)
■ Check field overloading
■ Check uniqueness rule, consecutive rule and null rule
■ Use commercial tools
■ Data scrubbing: use simple domain knowledge (e.g., postal code,

spell-check) to detect errors and make corrections


■ Data auditing: analyzing the data to discover rules and relationships
to detect violators (e.g., correlation and clustering to find outliers)


■ Data migration and integration
■ Data migration tools: allow transformations to be specified
■ ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
■ Integration of the two processes
■ Iterative and interactive (e.g., Potter's Wheel)


Thank You.
Questions????
