CIS 467 - Topic 2 - Data Exploration and Preprocessing
Topics
Data Visualization
Summary
Introduction
Motivation: Why Know the Data First
Data Exploration
Example of a Flat Dataset
The Buys Computer Dataset: This follows an example from Quinlan’s ID3
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to the most important outcome (e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking) but magnitude
between successive values is not known.
Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in °C or °F, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).
From Book
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a collection of documents
Sometimes represented as integer variables
Note: binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and represented using a finite number of digits
Continuous attributes are typically represented as floating-point variables
Basic Statistical Descriptions of Data
Motivation
For data preprocessing to be successful, it is essential to
have an overall picture of your data.
Basic statistical descriptions can be used to identify
properties of the data and highlight which data values
should be treated as noise or outliers.
To better understand the data: central tendency,
variation and spread
Measuring the Central Tendency
Mean (algebraic measure; sample vs. population), where n is the sample size and N is the population size:
x̄ = (1/n) Σ x_i          μ = (Σ x) / N
Weighted arithmetic mean:
x̄ = (Σ w_i x_i) / (Σ w_i)
Trimmed mean: chop off extreme values before computing the mean
Median:
Middle value if odd number of values, or average of the middle two values otherwise
Estimated by interpolation (for grouped data):
median = L1 + ((n/2 − (Σ freq)_l) / freq_median) × width
Mode
Value that occurs most frequently in the data
Unimodal, bimodal, trimodal
Empirical formula: mean − mode ≈ 3 × (mean − median)
Symmetric vs. Skewed Data
Measuring the Dispersion of Data
Variance and standard deviation (sample: s, population: σ)
Variance (algebraic, scalable computation):
s² = (1/(n−1)) Σ (x_i − x̄)² = (1/(n−1)) [Σ x_i² − (1/n)(Σ x_i)²]
σ² = (1/N) Σ (x_i − μ)² = (1/N) Σ x_i² − μ²
Standard deviation s (or σ) is the square root of the variance s² (or σ²).
(A single-pass computation based on the second form of s² is sketched below.)
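"Scalable computation" refers to the algebraic identity above: the sample variance can be accumulated in a single scan of the data. A minimal Python sketch (function names and data are illustrative, not from the slides):

# A minimal sketch of the two-pass definition vs. the single-pass ("scalable") form.
def variance_two_pass(xs):
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def variance_one_pass(xs):
    # Accumulate the sum and the sum of squares in one scan, then apply
    # s^2 = (1/(n-1)) * [sum(x_i^2) - (1/n) * (sum(x_i))^2].
    n, s, sq = 0, 0.0, 0.0
    for x in xs:
        n += 1
        s += x
        sq += x * x
    return (sq - s * s / n) / (n - 1)

data = [90, 85, 75, 45, 95, 66, 74, 83, 66, 57]   # grades reused from the exercise later on
print(variance_two_pass(data), variance_one_pass(data))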
Positively and Negatively Correlated Data
Similarity and Dissimilarity
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0,1]
Dissimilarity (e.g., distance)
Numerical measure of how different two data objects are
Value is lower when objects are more alike
Minimum dissimilarity is often 0; the upper limit varies
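As an illustration (our own, not from the slides), two common ways to quantify proximity between numeric data objects:

# Euclidean distance as a dissimilarity and cosine similarity as a similarity
# in [0, 1] for non-negative vectors.
import math

def euclidean_distance(x, y):
    # Dissimilarity: 0 when the objects are identical, larger when they differ.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    # Similarity: higher when the objects are more alike.
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norms

obj1, obj2 = [3, 2, 0, 5], [1, 0, 0, 2]          # hypothetical data objects
print(euclidean_distance(obj1, obj2), cosine_similarity(obj1, obj2))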
Data Visualization
Why data visualization?
Gain insight into an information space by mapping data onto
graphical primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, relationships
among data
Help find interesting regions and suitable parameters for further
quantitative analysis
Provide a visual proof of computer representations derived
Categorization of visualization methods:
Pixel-oriented visualization techniques
Geometric projection visualization techniques
Icon-based visualization techniques
Hierarchical visualization techniques
Visualizing complex data and relations
Pixel-oriented visualization: four attributes shown by sorting all customers in ascending order of income.
Icon-based visualization
Geometric projection techniques: scatter plot
Simple Data Visualization using Excel
Think and Discuss
Discuss how and where we can use the visualization concepts in the three phases of the Knowledge Discovery (KDD) process:
1) Preprocessing Phase
2) Data Mining Phase
3) Post Processing Phase
Statistics Tools
Students are encouraged to practice data
exploration using one of these tools:
SPSS
Excel
SAS
R
Python
Exercise
Given the following numbers that represent students grades
in a given course, find the following:
90, 85, 75, 45, 95, 66, 74, 83, 66, 57
Arithmetic Mean
Trimmed Mean
Mode
Variance
Median
Standard Deviation
Range
Plot the grades using Line Chart
Use Excel/SPSS to find all previous metrics.
Write your own notes
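As an alternative to Excel/SPSS, a possible Python sketch for the same metrics (the libraries and the trimmed-mean convention of dropping one value from each end are our assumptions, not prescribed by the slides):

import statistics
import matplotlib.pyplot as plt

grades = [90, 85, 75, 45, 95, 66, 74, 83, 66, 57]

mean      = statistics.mean(grades)
trimmed   = statistics.mean(sorted(grades)[1:-1])   # drop smallest and largest value
mode      = statistics.mode(grades)
variance  = statistics.variance(grades)             # sample variance
median    = statistics.median(grades)
std_dev   = statistics.stdev(grades)                # sample standard deviation
rng       = max(grades) - min(grades)

print(mean, trimmed, mode, variance, median, std_dev, rng)

plt.plot(grades, marker="o")                         # line chart of the grades
plt.title("Student grades")
plt.show()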
Exercise
Given the following numbers that represent students grades
in a given course, find the following:
85, 85, 85, 85, 85, 85, 85, 85, 85, 85
Arithmetic Mean
Trimmed Mean
Mode
Variance
Median
Standard Deviation
Range
Plot the grades using Line Chart
Use Excel/SPSS to find all previous metrics.
Write your own notes
Break for FUN: Amazing mathematics
If: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z =
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
Then
H+A+R+D+W+O+R+K ; 8+1+18+4+23+15+18+11=98%
K+N+O+W+L+E+D+G+E ; 11+14+15+23+12+5+4+7+5=96%
L+O+V+E; 12+15+22+5 = 54%
L+U+C+K ; 12+21+3+11 = 47%
None of them makes 100%. Then what makes 100% ???
Is it Money? NO !!!
M+O+N+E+Y= 13+15+14+5+25=72%
Leadership? NO !!!
L+E+A+D+E+R+S+H+I+P= 12+5+1+4+5+18+19+8+9+16=97%
It is therefore OUR ATTITUDE towards Life and Work that makes OUR Life 100% successful.
Data Preprocessing
Data Quality: Why Preprocess the Data?
Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Data Cleaning
Data in the Real World Is Dirty: lots of potentially incorrect data, e.g., faulty instruments, human or computer error, transmission errors
incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
e.g., Occupation=“ ” (missing data)
noisy: containing noise, errors, or outliers
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names,
e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time
of entry
history or changes of the data were not registered
Missing data may need to be inferred.
How to Handle Missing Data?
1. Ignore the tuple: This is usually done when the class label is missing
(assuming the mining task involves classification). This method is not very effective,
unless the tuple contains several attributes with missing values. It is especially poor
when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming
and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute
values by the same constant, such as a label like “Unknown” or ??. If missing values
are replaced by, say, “Unknown,” then the mining program may mistakenly think that
they form an interesting concept, since they all have a value in common—that of
“Unknown.” Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: For example, suppose that
the average income of AllElectronics customers is $56,000. Use this value to replace
the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the
given tuple (Stratified Mean): For example, if classifying customers according to
credit risk, replace the missing value with the average income value for customers in
the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian formalism, or
decision tree induction. For example, using the other customer attributes in your data
set, you may construct a decision tree to predict the missing values for income.
7. Use a domain specific method.
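A possible pandas sketch of methods 3-5, using the data from the exercise that follows (pandas is an assumption; any tool with similar group-by support would do):

# Global-constant, attribute-mean and class-wise ("stratified") mean imputation
# for the A1 column (methods 3-5 above).
import pandas as pd

df = pd.DataFrame({
    "A1":    [10, 12, 15, 20, None, 9, 5, 19, 17, 16],
    "A2":    [650, 150, 200, 200, 350, 400, 500, None, 100, 300],
    "Class": ["Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes"],
})

a1_constant = df["A1"].fillna(10)                       # method 3: global constant
a1_mean     = df["A1"].fillna(df["A1"].mean())          # method 4: attribute mean
a1_strat    = df.groupby("Class")["A1"].transform(      # method 5: mean within the class
    lambda s: s.fillna(s.mean()))
print(a1_constant[4], a1_mean[4], a1_strat[4])          # approx. 10.0, 13.67, 14.4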
Exercise
Fill in the missing values (??X and ??Y) using the following methods:
Attribute Mean
Stratified Attribute Mean (attribute mean within the same class)
Global Constant (assume the global constant for any missing A1 value is 10 and for any missing A2 value is 250)
Linear Regression (wait: covered later)

A1    A2    Class
10    650   Yes
12    150   No
15    200   No
20    200   Yes
??X   350   Yes
9     400   Yes
5     500   No
19    ??Y   No
17    100   Yes
16    300   Yes

Worked values:
Attribute mean: A1 = 13.66, A2 = 316.66
Stratified mean: A1 (class Yes) = (10 + 20 + 9 + 17 + 16) / 5 = 14.4; A2 (class No) = (150 + 200 + 500) / 3 = 283.3
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can
smooth by bin means,
smooth by bin median,
smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g., deal with
possible outliers)
Binning
Example: Binning
S = {28, 25, 21, 8, 4, 34, 24, 15, 21} (a possible worked solution is sketched below)
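A possible worked sketch in Python, assuming three equal-frequency bins (the number of bins is our choice, not given on the slide):

# Equal-frequency binning of S, then smoothing by bin means and bin boundaries.
S = [28, 25, 21, 8, 4, 34, 24, 15, 21]

data = sorted(S)                                  # [4, 8, 15, 21, 21, 24, 25, 28, 34]
k = 3                                             # number of bins (assumption)
size = len(data) // k
bins = [data[i * size:(i + 1) * size] for i in range(k)]

smooth_means = [[round(sum(b) / len(b)) for _ in b] for b in bins]
# Boundary smoothing: each value becomes whichever bin boundary is closer.
smooth_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
                 for b in bins]

print(bins)           # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_means)   # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_bounds)  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]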
Data Preprocessing to Build a Data Warehouse
Data Integration
Data integration:
Combines data from multiple sources into a coherent
store
Schema integration: e.g., A.cust-id ≡ B.cust-#
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data
sources,
e.g., Bill Clinton = William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from
different sources are different
Possible reasons: different representations, different scales,
e.g., metric vs. British units, mm vs cm
Handling Redundancy in Data Integration
Correlation Analysis (Numeric Data)
Correlation coefficient:
r_A,B = Σ (a_i − Ā)(b_i − B̄) / ((n−1) σ_A σ_B) = (Σ a_i b_i − n·Ā·B̄) / ((n−1) σ_A σ_B)
where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the AB cross-product.
If r_A,B > 0, A and B are positively correlated (A's values increase as B's do). The higher the value, the stronger the correlation.
r_A,B = 0: independent; r_A,B < 0: negatively correlated.
A1 A2 Class
10 650 Yes
12 550 No
15 200 No
20 100 Yes
14 350 Yes
9 700 Yes
5 900 No
19 250 No
17 200 Yes
16 300 Yes
=CORREL(A2:A11,B2:B11)
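A possible Python sketch that evaluates the same coefficient as the Excel formula above (our own code, not from the slides):

# Pearson correlation of the A1 and A2 columns in the table above.
import math

A1 = [10, 12, 15, 20, 14, 9, 5, 19, 17, 16]
A2 = [650, 550, 200, 100, 350, 700, 900, 250, 200, 300]

n = len(A1)
mean_a, mean_b = sum(A1) / n, sum(A2) / n
cross = sum((a - mean_a) * (b - mean_b) for a, b in zip(A1, A2))
norm_a = math.sqrt(sum((a - mean_a) ** 2 for a in A1))
norm_b = math.sqrt(sum((b - mean_b) ** 2 for b in A2))

r = cross / (norm_a * norm_b)     # the (n - 1) factors in the formula cancel
print(r)                          # strongly negative here (roughly -0.96)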
Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set
that is much smaller in volume but yet produces the same (or almost
the same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
Data reduction strategies
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it: Data Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
Attribute Subset Selection
Another way to reduce dimensionality of data
Redundant attributes
Duplicate much or all of the information contained in one or
more other attributes
E.g., purchase price of a product and the amount of sales tax
paid
Irrelevant attributes
Contain no information that is useful for the data mining task at
hand
E.g., students' ID is often irrelevant to the task of predicting
students' GPA
Heuristic Search in Attribute Selection
There are 2^d possible attribute combinations of d attributes
Typical heuristic attribute selection methods:
Best single attribute under the attribute independence
assumption: choose by significance tests
Best step-wise feature selection:
The best single-attribute is picked first
Then the next best attribute conditioned on the first, ...
Step-wise attribute elimination:
Repeatedly eliminate the worst attribute
Best combined attribute selection and elimination
Optimal branch and bound:
Use attribute elimination and backtracking
The 2^d Problem
Assume we have 5 attributes (A, B, C, D, E)
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
ABCDE
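A small sketch (ours) that makes the blow-up concrete by enumerating every subset of the five attributes in the lattice above:

# Enumerate all 2^d attribute subsets for d = 5 attributes.
from itertools import combinations

attrs = ["A", "B", "C", "D", "E"]
subsets = [c for r in range(len(attrs) + 1) for c in combinations(attrs, r)]

print(len(subsets))   # 2**5 = 32, including the empty set and the full set
print(subsets[:7])    # (), ('A',), ('B',), ('C',), ('D',), ('E',), ('A', 'B')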
Attribute Creation (Feature Generation)
Clustering
Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is
“smeared”
Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
Sampling: obtaining a small sample s to represent the
whole data set N
Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
Key principle: Choose a representative subset of the data
Simple random sampling may have very poor
performance in the presence of skew
Develop adaptive sampling methods, e.g., stratified sampling
Types of Sampling Methods
Simple random sampling
There is an equal probability of
selecting any particular item
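An illustrative sketch (ours, with made-up data) contrasting simple random sampling with stratified sampling on class-labelled records:

# Simple random vs. stratified sampling; stratified sampling keeps every class
# represented even when the class distribution is skewed.
import random

rows = [("Yes", 10), ("No", 12), ("No", 15), ("Yes", 20), ("Yes", 14),
        ("Yes", 9), ("No", 5), ("No", 19), ("Yes", 17), ("Yes", 16)]

def simple_random_sample(data, k):
    return random.sample(data, k)                 # equal probability for every item

def stratified_sample(data, fraction):
    sample = []
    for label in {cls for cls, _ in data}:
        stratum = [row for row in data if row[0] == label]
        k = max(1, round(fraction * len(stratum)))
        sample.extend(random.sample(stratum, k))  # same fraction from each stratum
    return sample

print(simple_random_sample(rows, 4))
print(stratified_sample(rows, 0.4))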
Think, Remember, and Discuss
Data Compression
Original data can be transformed into a compressed representation. With lossless compression the original data can be reconstructed exactly; with lossy compression only an approximation of the original data can be recovered.
Data Transformation
A function that maps the entire set of values of a given attribute to a new
set of replacement values such that each old value can be identified with
one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
Normalization
Min-max normalization: maps [min_A, max_A] to [new_min_A, new_max_A]
v' = ((v − min_A) / (max_A − min_A)) × (new_max_A − new_min_A) + new_min_A
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
v' = ((73,600 − 12,000) / (98,000 − 12,000)) × (1.0 − 0) + 0 = 0.716
Normalization
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v − μ_A) / σ_A
Ex. Let income range $12,000 to $98,000, with μ = 54,000 and σ = 16,000. Then $73,600 is mapped to
v' = (73,600 − 54,000) / 16,000 = 1.225
Normalization
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
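A possible sketch (ours) applying all three normalizations to the income values used in the exercise below; the sample standard deviation is assumed for the z-scores:

# Min-max, z-score and decimal-scaling normalization of an income list.
incomes = [15500, 20000, 24000, 23000, 12000, 73600, 45000, 98000, 78000, 62000]

lo, hi = min(incomes), max(incomes)
min_max = [(v - lo) / (hi - lo) for v in incomes]               # maps to [0.0, 1.0]

n = len(incomes)
mean = sum(incomes) / n
std = (sum((v - mean) ** 2 for v in incomes) / (n - 1)) ** 0.5  # sample std dev
z_score = [(v - mean) / std for v in incomes]

j = len(str(max(incomes)))                 # smallest j with max(|v'|) < 1 (positive integers)
dec_scale = [v / 10 ** j for v in incomes]

print(round(min_max[5], 3), round(z_score[5], 3), dec_scale[5])  # income 73,600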
Exercise
Normalize the following income values using min-max, z-score, and decimal scaling (decimal scaling shown for j = 2 and j = 3):

Income   Min-Max   Z-Score   Dec Scale j=2   Dec Scale j=3
15500    0.041     -0.963    155.00          15.50
20000    0.093     -0.816    200.00          20.00
24000    0.140     -0.686    240.00          24.00
23000    0.128     -0.719    230.00          23.00
12000    0.000     -1.076    120.00          12.00
73600    0.716      0.926    736.00          73.60
45000    0.384     -0.004    450.00          45.00
98000    1.000      1.720    980.00          98.00
78000    0.767      1.069    780.00          78.00
62000    0.581      0.549    620.00          62.00
Discretization
Three types of attributes
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic
rank
Numeric—real numbers, e.g., integer or real numbers
Discretization: Divide the range of a continuous attribute into
intervals
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Prepare for further analysis, e.g., classification
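A minimal sketch (ours) of unsupervised, top-down equal-width discretization that replaces numeric values with interval labels (the data and labels are hypothetical):

# Replace continuous values with equal-width interval labels.
def equal_width_discretize(values, k=3, labels=("low", "medium", "high")):
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    def label(v):
        idx = min(int((v - lo) / width), k - 1)   # clamp the maximum into the last bin
        return labels[idx]
    return [label(v) for v in values]

ages = [13, 15, 22, 25, 33, 35, 46, 52, 70]       # hypothetical age values
print(equal_width_discretize(ages))
# ['low', 'low', 'low', 'low', 'medium', 'medium', 'medium', 'high', 'high']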
Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or
bottom-up merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ²) analysis (unsupervised, bottom-up merge)
Simple Discretization: Binning
Concept Hierarchy Generation
Concept Hierarchy: Examples
Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
Concept hierarchies facilitate drilling and rolling in data warehouses
to view data in multiple granularity.
Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric values
for age) by higher level concepts (such as youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
Reminder: Data Cube
An Example of a Data Cube
Concept Hierarchy Generation for Nominal Data
Specification of a partial/total ordering of attributes explicitly at the
schema level by users or experts
street < city < state < country
Specification of a hierarchy for a set of values by explicit data
grouping
{Urbana, Champaign, Chicago} < Illinois
Specification of only a partial set of attributes
E.g., only street < city, not others
Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
E.g., for a set of attributes: {street, city, state, country}
Automatic Concept Hierarchy Generation
Some hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the data set
The attribute with the most distinct values is placed at the lowest level of the hierarchy (a small sketch follows below)
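A small sketch (ours, with made-up records) of this heuristic: count distinct values per attribute and order the hierarchy accordingly:

# Order attributes by their number of distinct values; fewer distinct values
# means a higher level in the concept hierarchy.
records = [
    {"street": "Main St", "city": "Urbana",    "state": "IL", "country": "USA"},
    {"street": "Oak Ave", "city": "Chicago",   "state": "IL", "country": "USA"},
    {"street": "Elm St",  "city": "Chicago",   "state": "IL", "country": "USA"},
    {"street": "Pine Rd", "city": "Madison",   "state": "WI", "country": "USA"},
    {"street": "1st St",  "city": "Vancouver", "state": "BC", "country": "Canada"},
]

counts = {attr: len({r[attr] for r in records}) for attr in records[0]}
order = sorted(counts, key=counts.get)            # highest hierarchy level first
print(counts)                                     # {'street': 5, 'city': 4, 'state': 3, 'country': 2}
print(" < ".join(reversed(order)))                # street < city < state < country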
Summary
Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
Many types of data sets, e.g., numerical, text, graph, Web, image.
Gain insight into the data by:
Basic statistical data description: central tendency, dispersion,
graphical displays
Data visualization: map data onto graphical primitives
Measure data similarity
Above steps are the beginning of data preprocessing.
Many methods have been developed but still an active area of
research.
Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
Relax and Think