CIS 467 - Topic 2 - Data Exploration and Preprocessing
Topics
Data Visualization
Summary
Introduction
Motivation: Why Know the Data First
Data Exploration
Example of a Flat Dataset
The Buys Computer Dataset: This follows an example from Quinlan’s ID3
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to the most important outcome (e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking) but magnitude
between successive values is not known.
Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).
From Book
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a collection of documents
Sometimes represented as integer variables
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and represented using a finite number of digits
Continuous attributes are typically represented as floating-point variables
Basic Statistical Descriptions of Data
Motivation
For data preprocessing to be successful, it is essential to
have an overall picture of your data.
Basic statistical descriptions can be used to identify
properties of the data and highlight which data values
should be treated as noise or outliers.
To better understand the data: central tendency,
variation and spread
Measuring the Central Tendency
Mean (algebraic measure; sample vs. population), where n is the sample size and N is the population size:
x̄ = (1/n) Σ x_i          μ = (Σ x) / N
Weighted arithmetic mean:
x̄ = (Σ w_i x_i) / (Σ w_i)
Trimmed mean: chop off extreme values before computing the mean
Median:
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data):
median = L1 + ((n/2 − (Σ freq)_l) / freq_median) × width
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula: mean − mode ≈ 3 × (mean − median)
Symmetric vs. Skewed Data
Measuring the Dispersion of Data
Variance and standard deviation (sample: s, population: σ)
Variance (algebraic, scalable computation):
s² = (1/(n−1)) Σ (x_i − x̄)² = (1/(n−1)) [Σ x_i² − (1/n)(Σ x_i)²]
σ² = (1/N) Σ (x_i − μ)² = (1/N) Σ x_i² − μ²
Standard deviation s (or σ) is the square root of the variance s² (or σ²).
(A single-pass computation based on the second form of s² is sketched below.)
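"Scalable computation" refers to the algebraic identity above: the sample variance can be accumulated in a single scan of the data. A minimal Python sketch (function names and data are illustrative, not from the slides):

# A minimal sketch of the two-pass definition vs. the single-pass ("scalable") form.
def variance_two_pass(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_one_pass(xs):
    # Accumulate the sum and the sum of squares in one scan, then apply
    # s^2 = (1/(n-1)) * [sum(x_i^2) - (1/n) * (sum(x_i))^2].
    n, s, sq = 0, 0.0, 0.0
    for x in xs:
        n += 1
        s += x
        sq += x * x
    return (sq - s * s / n) / (n - 1)

data = [90, 85, 75, 45, 95, 66, 74, 83, 66, 57]   # grades reused from the exercise later on
print(variance_two_pass(data), variance_one_pass(data))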
Positively and Negatively Correlated Data
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0,1]
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects are
Value is lower when objects are more alike
Minimum dissimilarity is often 0; the upper limit varies
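As an illustration (our own, not from the slides), two common ways to quantify proximity between numeric data objects:

# Euclidean distance as a dissimilarity and cosine similarity as a similarity
# in [0, 1] for non-negative vectors.
import math

def euclidean_distance(x, y):
    # Dissimilarity: 0 when the objects are identical, larger when they differ.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    # Similarity: higher when the objects are more alike.
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms

obj1, obj2 = [3, 2, 0, 5], [1, 0, 0, 2]          # hypothetical data objects
print(euclidean_distance(obj1, obj2), cosine_similarity(obj1, obj2))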
Data Visualization
Why data visualization?
Gain insight into an information space by mapping data onto
graphical primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, relationships
among data
Help find interesting regions and suitable parameters for further
quantitative analysis
Provide a visual proof of computer representations derived
Categorization of visualization methods:
Pixel-oriented visualization techniques
Geometric projection visualization techniques
Icon-based visualization techniques
Hierarchical visualization techniques
Visualizing complex data and relations
Pixel-oriented visualization: four attributes shown by sorting all customers in ascending order of income.
Icon-based visualization
Geometric projection techniques: scatter plot
Simple Data Visualization using Excel
Think and Discuss
Discuss how and where we can use the visualization concepts in the three phases of the Knowledge Discovery (KDD) process:
1) Preprocessing Phase
2) Data Mining Phase
3) Post Processing Phase
Statistics Tools
Students are encouraged to practice data
exploration using one of these tools:
SPSS
Excel
SAS
R
Python
Exercise
Given the following numbers that represent students grades
in a given course, find the following:
90, 85, 75, 45, 95, 66, 74, 83, 66, 57
Arithmetic Mean
Trimmed Mean
Mode
Variance
Median
Standard Deviation
Range
Plot the grades using Line Chart
Use Excel/SPSS to find all previous metrics.
Write your own notes
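As an alternative to Excel/SPSS, a possible Python sketch for the same metrics (the libraries and the trimmed-mean convention of dropping one value from each end are our assumptions, not prescribed by the slides):

import statistics
import matplotlib.pyplot as plt

grades = [90, 85, 75, 45, 95, 66, 74, 83, 66, 57]

mean      = statistics.mean(grades)
trimmed   = statistics.mean(sorted(grades)[1:-1])   # drop smallest and largest value
mode      = statistics.mode(grades)
variance  = statistics.variance(grades)             # sample variance
median    = statistics.median(grades)
std_dev   = statistics.stdev(grades)                # sample standard deviation
rng       = max(grades) - min(grades)

print(mean, trimmed, mode, variance, median, std_dev, rng)

plt.plot(grades, marker="o")                         # line chart of the grades
plt.title("Student grades")
plt.show()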
Exercise
Given the following numbers that represent students grades
in a given course, find the following:
85, 85, 85, 85, 85, 85, 85, 85, 85, 85
Arithmetic Mean
Trimmed Mean
Mode
Variance
Median
Standard Deviation
Range
Plot the grades using Line Chart
Use Excel/SPSS to find all previous metrics.
Write your own notes
Break for FUN: Amazing mathematics
If: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z =
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Then
H+A+R+D+W+O+R+K ; 8+1+18+4+23+15+18+11=98%
K+N+O+W+L+E+D+G+E ; 11+14+15+23+12+5+4+7+5=96%
L+O+V+E; 12+15+22+5 = 54%
L+U+C+K ; 12+21+3+11 = 47%
None of them makes 100%. Then what makes 100% ???
Is it Money? NO !!!
M+O+N+E+Y= 13+15+14+5+25=72%
Leadership? NO !!!
L+E+A+D+E+R+S+H+I+P= 12+5+1+4+5+18+19+8+9+16=97%
It is therefore OUR ATTITUDE towards Life and Work that makes OUR Life 100% successful.
Data Preprocessing
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Cleaning
Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names,
e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time
of entry
history or changes of the data were not registered
Missing data may need to be inferred.
How to Handle Missing Data?
1. Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not very effective,
unless the tuple contains several attributes with missing values. It is especially poor
when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming
and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute
values by the same constant, such as a label like “Unknown” or ??. If missing values
are replaced by, say, “Unknown,” then the mining program may mistakenly think that
they form an interesting concept, since they all have a value in common—that of
“Unknown.” Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: For example, suppose that
the average income of AllElectronics customers is $56,000. Use this value to replace
the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the
given tuple (Stratified Mean): For example, if classifying customers according to
credit risk, replace the missing value with the average income value for customers in
the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian formalism, or
decision tree induction. For example, using the other customer attributes in your data
set, you may construct a decision tree to predict the missing values for income.
7. Use a domain specific method.
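A possible pandas sketch of methods 3-5, using the data from the exercise that follows (pandas is an assumption; any tool with similar group-by support would do):

# Global-constant, attribute-mean and class-wise ("stratified") mean imputation
# for the A1 column (methods 3-5 above).
import pandas as pd

df = pd.DataFrame({
    "A1":    [10, 12, 15, 20, None, 9, 5, 19, 17, 16],
    "A2":    [650, 150, 200, 200, 350, 400, 500, None, 100, 300],
    "Class": ["Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes"],
})

a1_constant = df["A1"].fillna(10)                       # method 3: global constant
a1_mean     = df["A1"].fillna(df["A1"].mean())          # method 4: attribute mean
a1_strat    = df.groupby("Class")["A1"].transform(      # method 5: mean within the class
    lambda s: s.fillna(s.mean()))
print(a1_constant[4], a1_mean[4], a1_strat[4])          # approx. 10.0, 13.67, 14.4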
Exercise
Fill in the missing values (??X and ??Y) using the following methods:
Attribute Mean
Stratified Attribute Mean (attribute mean within the same class)
Global Constant (assume the global constant for any missing A1 value is 10 and for any missing A2 value is 250)
Linear Regression (wait: covered later)

A1    A2    Class
10    650   Yes
12    150   No
15    200   No
20    200   Yes
??X   350   Yes
9     400   Yes
5     500   No
19    ??Y   No
17    100   Yes
16    300   Yes

Worked values:
Attribute mean: A1 = 13.66, A2 = 316.66
Stratified mean: A1 (class Yes) = (10 + 20 + 9 + 17 + 16) / 5 = 14.4; A2 (class No) = (150 + 200 + 500) / 3 = 283.3
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can
smooth by bin means,
smooth by bin median,
smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with
possible outliers)
Binning
Example: Binning
S = {28, 25, 21, 8, 4, 34, 24, 15, 21} (a possible worked solution is sketched below)
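A possible worked sketch in Python, assuming three equal-frequency bins (the number of bins is our choice, not given on the slide):

# Equal-frequency binning of S, then smoothing by bin means and bin boundaries.
S = [28, 25, 21, 8, 4, 34, 24, 15, 21]

data = sorted(S)                                  # [4, 8, 15, 21, 21, 24, 25, 28, 34]
k = 3                                             # number of bins (assumption)
size = len(data) // k
bins = [data[i * size:(i + 1) * size] for i in range(k)]

smooth_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]
# Boundary smoothing: each value becomes whichever bin boundary is closer.
smooth_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(bins)           # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]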
Data Preprocessing to Build a Data Warehouse
Data Integration
Data integration:
Combines data from multiple sources into a coherent
store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data
sources,
e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from
different sources are different
Possible reasons: different representations, different scales,
e.g., metric vs. British units, mm vs cm
Handling Redundancy in Data Integration
Correlation Analysis (Numeric Data)
Correlation coefficient:
r_A,B = Σ (a_i − Ā)(b_i − B̄) / ((n−1) σ_A σ_B) = (Σ a_i b_i − n·Ā·B̄) / ((n−1) σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the AB cross-product.
If r_A,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
r_A,B = 0: independent; r_A,B < 0: negatively correlated.
A1 A2 Class
10 650 Yes
12 550 No
15 200 No
20 100 Yes
14 350 Yes
9 700 Yes
5 900 No
19 250 No
17 200 Yes
16 300 Yes
=CORREL(A2:A11,B2:B11)
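A possible Python sketch that evaluates the same coefficient as the Excel formula above (our own code, not from the slides):

# Pearson correlation of the A1 and A2 columns in the table above.
import math

A1 = [10, 12, 15, 20, 14, 9, 5, 19, 17, 16]
A2 = [650, 550, 200, 100, 350, 700, 900, 250, 200, 300]

n = len(A1)
mean_a, mean_b = sum(A1) / n, sum(A2) / n
cross = sum((a - mean_a) * (b - mean_b) for a, b in zip(A1, A2))
norm_a = math.sqrt(sum((a - mean_a) ** 2 for a in A1))
norm_b = math.sqrt(sum((b - mean_b) ** 2 for b in A2))

r = cross / (norm_a * norm_b)     # the (n - 1) factors in the formula cancel
print(r)                          # strongly negative here (roughly -0.96)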
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set
that is much smaller in volume but yet produces the same (or almost
the same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it: Data Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
Duplicate much or all of the information contained in one or
more other attributes
E.g., purchase price of a product and the amount of sales tax
paid
Irrelevant attributes
Contain no information that is useful for the data mining task at
hand
E.g., students' ID is often irrelevant to the task of predicting
students' GPA
Heuristic Search in Attribute Selection
There are 2^d possible attribute combinations of d attributes
Typical heuristic attribute selection methods:
Best single attribute under the attribute independence
assumption: choose by significance tests
Best step-wise feature selection:
The best single-attribute is picked first
Then the next best attribute conditioned on the first, ...
Step-wise attribute elimination:
Repeatedly eliminate the worst attribute
Best combined attribute selection and elimination
Optimal branch and bound:
Use attribute elimination and backtracking
The 2^d Problem
Assume we have 5 attributes (A, B, C, D, E)
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
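A small sketch (ours) that makes the blow-up concrete by enumerating every subset of the five attributes in the lattice above:

# Enumerate all 2^d attribute subsets for d = 5 attributes.
from itertools import combinations

attrs = ["A", "B", "C", "D", "E"]
subsets = [c for r in range(len(attrs) + 1) for c in combinations(attrs, r)]

print(len(subsets))   # 2**5 = 32, including the empty set and the full set
print(subsets[:7])    # (), ('A',), ('B',), ('C',), ('D',), ('E',), ('A', 'B')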
Attribute Creation (Feature Generation)
Clustering
Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is
“smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
Sampling: obtaining a small sample s to represent the
whole data set N
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Key principle: Choose a representative subset of the data
Simple random sampling may have very poor
performance in the presence of skew
Develop adaptive sampling methods, e.g., stratified sampling
Types of Sampling Methods
Simple random sampling
There is an equal probability of
selecting any particular item
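An illustrative sketch (ours, with made-up data) contrasting simple random sampling with stratified sampling on class-labelled records:

# Simple random vs. stratified sampling; stratified sampling keeps every class
# represented even when the class distribution is skewed.
import random

rows = [("Yes", 10), ("No", 12), ("No", 15), ("Yes", 20), ("Yes", 14),
        ("Yes", 9), ("No", 5), ("No", 19), ("Yes", 17), ("Yes", 16)]

def simple_random_sample(data, k):
    return random.sample(data, k)                 # equal probability for every item

def stratified_sample(data, fraction):
    sample = []
    for label in {cls for cls, _ in data}:
        stratum = [row for row in data if row[0] == label]
        k = max(1, round(fraction * len(stratum)))
        sample.extend(random.sample(stratum, k))  # same fraction from each stratum
    return sample

print(simple_random_sample(rows, 4))
print(stratified_sample(rows, 0.4))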
Think, Remember, and Discuss
Data Compression
Original data can be transformed into a compressed representation. With lossless compression the original data can be reconstructed exactly; with lossy compression only an approximation of the original data can be recovered.
Data Transformation
A function that maps the entire set of values of a given attribute to a new
set of replacement values such that each old value can be identified with
one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Normalization
Min-max normalization: maps [min_A, max_A] to [new_min_A, new_max_A]
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
v' = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
Normalization
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ_A) / σ_A
Ex. Let income range $12,000 to $98,000, with μ = 54,000 and σ = 16,000. Then $73,600 is mapped to
v' = (73,600 − 54,000) / 16,000 = 1.225
Normalization
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
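A possible sketch (ours) applying all three normalizations to the income values used in the exercise below; the sample standard deviation is assumed for the z-scores:

# Min-max, z-score and decimal-scaling normalization of an income list.
incomes = [15500, 20000, 24000, 23000, 12000, 73600, 45000, 98000, 78000, 62000]

lo, hi = min(incomes), max(incomes)
min_max = [(v - lo) / (hi - lo) for v in incomes]               # maps to [0.0, 1.0]

n = len(incomes)
mean = sum(incomes) / n
std = (sum((v - mean) ** 2 for v in incomes) / (n - 1)) ** 0.5  # sample std dev
z_score = [(v - mean) / std for v in incomes]

j = len(str(max(incomes)))                 # smallest j with max(|v'|) < 1 (positive integers)
dec_scale = [v / 10 ** j for v in incomes]

print(round(min_max[5], 3), round(z_score[5], 3), dec_scale[5])  # income 73,600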
Exercise
Normalize the following income values using min-max, z-score, and decimal scaling (decimal scaling shown for j = 2 and j = 3):

Income   Min-Max   Z-Score   Dec Scale j=2   Dec Scale j=3
15500    0.041     -0.963    155.00          15.50
20000    0.093     -0.816    200.00          20.00
24000    0.140     -0.686    240.00          24.00
23000    0.128     -0.719    230.00          23.00
12000    0.000     -1.076    120.00          12.00
73600    0.716      0.926    736.00          73.60
45000    0.384     -0.004    450.00          45.00
98000    1.000      1.720    980.00          98.00
78000    0.767      1.069    780.00          78.00
62000    0.581      0.549    620.00          62.00
Discretization
Three types of attributes
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic
rank
Numeric—real numbers, e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into
intervals
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Prepare for further analysis, e.g., classification
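A minimal sketch (ours) of unsupervised, top-down equal-width discretization that replaces numeric values with interval labels (the data and labels are hypothetical):

# Replace continuous values with equal-width interval labels.
def equal_width_discretize(values, k=3, labels=("low", "medium", "high")):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    def label(v):
        idx = min(int((v - lo) / width), k - 1)   # clamp the maximum into the last bin
        return labels[idx]
    return [label(v) for v in values]

ages = [13, 15, 22, 25, 33, 35, 46, 52, 70]       # hypothetical age values
print(equal_width_discretize(ages))
# ['low', 'low', 'low', 'low', 'medium', 'medium', 'medium', 'high', 'high']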
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Simple Discretization: Binning
Concept Hierarchy Generation
Concept Hierarchy: Examples
Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
Concept hierarchies facilitate drilling and rolling in data warehouses
to view data in multiple granularity.
Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric values
for age) by higher level concepts (such as youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
Reminder: Data Cube
An Example of a Data Cube
Concept Hierarchy Generation for Nominal Data
Specification of a partial/total ordering of attributes explicitly at the
schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data
grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy (a small sketch follows below)
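A small sketch (ours, with made-up records) of this heuristic: count distinct values per attribute and order the hierarchy accordingly:

# Order attributes by their number of distinct values; fewer distinct values
# means a higher level in the concept hierarchy.
records = [
    {"street": "Main St", "city": "Urbana",    "state": "IL", "country": "USA"},
    {"street": "Oak Ave", "city": "Chicago",   "state": "IL", "country": "USA"},
    {"street": "Elm St",  "city": "Chicago",   "state": "IL", "country": "USA"},
    {"street": "Pine Rd", "city": "Madison",   "state": "WI", "country": "USA"},
    {"street": "1st St",  "city": "Vancouver", "state": "BC", "country": "Canada"},
]

counts = {attr: len({r[attr] for r in records}) for attr in records[0]}
order = sorted(counts, key=counts.get)            # highest hierarchy level first
print(counts)                                     # {'street': 5, 'city': 4, 'state': 3, 'country': 2}
print(" < ".join(reversed(order)))                # street < city < state < country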
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
Many types of data sets, e.g., numerical, text, graph, Web, image.
Gain insight into the data by:
Basic statistical data description: central tendency, dispersion,
graphical displays
Data visualization: map data onto graphical primitives
Measure data similarity
Above steps are the beginning of data preprocessing.
Many methods have been developed but still an active area of
research.
Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Relax and Think