
CIS 467: Data Mining

Department of Computer Information Systems
Faculty of Information Technology
Yarmouk University – Jordan

Data Exploration and Preprocessing
(1) Getting to Know Your Data

2
Topics

 Introduction and Motivation

 Data Objects and Attribute Types

 Basic Statistical Descriptions of Data

 Data Visualization

 Measuring Data Similarity and Dissimilarity

 Summary

3
Introduction

 It’s tempting to jump straight into mining, but first we need to get the data ready. This involves having a closer look at attributes and data values.
 Real-world data are typically noisy, enormous in volume (often several gigabytes or more), and may originate from heterogeneous sources. (Data challenges)
 This topic is about getting familiar with your data. Knowledge about your data is useful for data preprocessing, the first major task of the data mining process.
 You will want to know the following: (1) What are the types of attributes or fields that make up your data? (2) What kind of values does each attribute have? (3) Which attributes are discrete, and which are continuous-valued? (4) What do the data look like?

4
Motivation: Why Know the Data First

 Data Exploration

 Basic statistical descriptions can be used to learn more


about each attribute’s values.
 Given an attribute, we can determine its mean (average value), median (middle value), and mode (most common value).

 These are measures of central tendency, which give us


an idea of the “middle” or center of distribution.

 Knowing such basic statistics regarding each attribute


makes it easier to fill in missing values, smooth noisy
values, and spot outliers during data preprocessing.

 Knowledge of the attributes and attribute values can also


help in fixing inconsistencies incurred during data
integration.
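
As a quick illustration of these basic statistical descriptions, the sketch below computes the three central-tendency measures for a single numeric attribute with Python's standard library (the sample values are made up for illustration):

```python
import statistics

# Hypothetical values of a single numeric attribute (e.g., customer ages)
ages = [23, 25, 25, 27, 30, 31, 31, 31, 45, 52]

mean = statistics.mean(ages)       # average value
median = statistics.median(ages)   # middle value of the sorted data
mode = statistics.mode(ages)       # most common value

print(f"mean={mean}, median={median}, mode={mode}")
# A large gap between mean and median hints at skew or outliers,
# which is useful to know before filling missing values or smoothing.
```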
5
2.1 Data Objects

 Data sets are made up of data objects.


 A data object represents an entity.
 Examples:
 sales database: customers, store items, sales
 medical database: patients, treatments
 university database: students, professors, courses
 Also called samples , examples, instances, data points,
objects, tuples.
 Data objects are described by attributes.
 In a database: rows -> data objects; columns -> attributes.
6
Attributes

 An attribute is a data field representing a characteristic or feature of a data object.
 Other names: dimensions, features, variables, columns.

7
Example of a Flat Dataset

The Buys Computer Dataset: This follows an example from Quinlan’s ID3
Attribute Types
 Nominal: categories, states, or “names of things”
 Hair_color = {auburn, black, blond, brown, grey, red, white}
 marital status, occupation, ID numbers, zip codes
 Binary
 Nominal attribute with only 2 states (0 and 1)
 Symmetric binary: both outcomes equally important

e.g., gender
 Asymmetric binary: outcomes not equally important.

e.g., medical test (positive vs. negative)

Convention: assign 1 to most important outcome (e.g., HIV
positive)
 Ordinal
 Values have a meaningful order (ranking) but magnitude
between successive values is not known.
 Size = {small, medium, large}, grades, army rankings

9
Numeric Attribute Types
 Quantity (integer or real-valued)

 Interval

Measured on a scale of equal-sized units

Values have order
 E.g., temperature in C˚or F˚, calendar dates

No true zero-point
 Ratio

Inherent zero-point

We can speak of values as being an order of
magnitude larger than the unit of measurement (10
K˚ is twice as high as 5 K˚).

10
From Book

11
Discrete vs. Continuous
Attributes
 Discrete Attribute
 Has only a finite or countably infinite set of values


E.g., zip codes, profession, or the set of words in a
collection of documents
 Sometimes, represented as integer variables

 Note: Binary attributes are a special case of discrete

attributes
 Continuous Attribute
 Has real numbers as attribute values


E.g., temperature, height, or weight
 Practically, real values can only be measured and

represented using a finite number of digits


 Continuous attributes are typically represented as floating-

point variables

12
Basic Statistical Descriptions of Data
 Motivation
 For data preprocessing to be successful, it is essential to
have an overall picture of your data.
 Basic statistical descriptions can be used to identify
properties of the data and highlight which data values
should be treated as noise or outliers.
 To better understand the data: central tendency,
variation and spread

 Data dispersion characteristics


 median, max, min, quantiles, outliers, variance, etc.
 Numerical dimensions correspond to sorted intervals
 Data dispersion: analyzed with multiple granularities of
precision
 Boxplot or quantile analysis on sorted intervals
 Dispersion analysis on computed measures
 Folding measures into numerical dimensions
 Boxplot or quantile analysis on the transformed cube

13
Measuring the Central Tendency
 Mean (algebraic measure) (sample vs. population), where n is the sample size and N is the population size:

  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$        $\mu = \frac{\sum x}{N}$

 Weighted arithmetic mean:

  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

 Trimmed mean: chopping extreme values before averaging
 Median:
  Middle value if odd number of values, or average of the middle two values otherwise
  Estimated by interpolation (for grouped data):

  $median = L_1 + \left(\frac{n/2 - (\sum freq)_l}{freq_{median}}\right) \times width$

 Mode
  Value that occurs most frequently in the data
  Unimodal, bimodal, trimodal
  Empirical formula: $mean - mode \approx 3 \times (mean - median)$
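
A minimal Python sketch of these measures (the grade values reuse the exercise later in this topic; the weights and the amount of trimming are assumptions for illustration only):

```python
import statistics

values = [90, 85, 75, 45, 95, 66, 74, 83, 66, 57]   # hypothetical grades
weights = [1, 1, 2, 1, 1, 1, 1, 1, 2, 1]            # hypothetical weights

mean = statistics.mean(values)
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
median = statistics.median(values)
mode = statistics.mode(values)

# Trimmed mean: drop the smallest and largest value before averaging
trimmed = sorted(values)[1:-1]
trimmed_mean = statistics.mean(trimmed)

print(mean, weighted_mean, median, mode, trimmed_mean)
```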

14
Symmetric vs. Skewed
Data

 Median, mean, and mode of symmetric, positively skewed, and negatively skewed data (figure).
15
Measuring the Dispersion of
Data
 Variance and standard deviation (sample: s, population: σ)
 Variance: (algebraic, scalable computation)

  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$

  $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2$

 Standard deviation s (or σ) is the square root of the variance s² (or σ²)
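
A short Python sketch (sample values are assumptions) showing that the definition form and the single-pass "scalable" form of the sample variance agree:

```python
import math

x = [90, 85, 75, 45, 95, 66, 74, 83, 66, 57]   # hypothetical sample
n = len(x)
mean = sum(x) / n

# Definition form of the sample variance
s2_def = sum((xi - mean) ** 2 for xi in x) / (n - 1)

# Algebraically equivalent form computed from running sums
s2_scalable = (sum(xi ** 2 for xi in x) - (sum(x) ** 2) / n) / (n - 1)

s = math.sqrt(s2_def)
print(s2_def, s2_scalable, s)   # both variance forms agree
```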

16
Positively and Negatively Correlated Data

 The left half fragment is positively correlated
 The right half is negatively correlated
18
Uncorrelated Data

19
Similarity and Dissimilarity

 Similarity

Numerical measure of how alike two data objects are

Value is higher when objects are more alike

Often falls in the range [0,1]

 Dissimilarity (e.g., distance)



Numerical measure of how different two data objects are

Lower when objects are more alike

Minimum dissimilarity is often 0

Upper limit varies

 Proximity refers to either a similarity or a dissimilarity measure
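
To make the distinction concrete, here is a small sketch (the data objects and the 1/(1 + d) conversion are illustrative choices, not the only ones) that treats Euclidean distance as a dissimilarity and maps it to a similarity score:

```python
import math

def euclidean(p, q):
    """Dissimilarity: Euclidean distance between two numeric objects."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def similarity(p, q):
    """One common conversion to a similarity in (0, 1]: closer objects score higher."""
    return 1.0 / (1.0 + euclidean(p, q))

x1 = [1.0, 2.0]   # hypothetical data objects
x2 = [4.0, 6.0]
print(euclidean(x1, x2))    # 5.0   -> dissimilarity
print(similarity(x1, x2))   # ~0.167 -> similarity
```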

20
Similarity

21
Data Visualization
 Why data visualization?
 Gain insight into an information space by mapping data onto
graphical primitives
 Provide qualitative overview of large data sets
 Search for patterns, trends, structure, irregularities, relationships
among data
 Help find interesting regions and suitable parameters for further
quantitative analysis
 Provide a visual proof of computer representations derived
 Categorization of visualization methods:
 Pixel-oriented visualization techniques
 Geometric projection visualization techniques
 Icon-based visualization techniques
 Hierarchical visualization techniques
 Visualizing complex data and relations

22
Pixel-oriented visualization of 4 attributes, sorting all customers in ascending order of income (figure).

Icon-based visualization (figure).

Geometric projection techniques: scatter plot (figure).

23
Simple Data Visualization using Excel

24
Think and Discuss
 Discuss how and where we can use the visualization concepts in the three phases of the Knowledge Discovery (KDD) process:

1) Preprocessing Phase
2) Data Mining Phase
3) Post Processing Phase

25
Statistics Tools
 Students are encouraged to practice data
exploration using one of these tools:

 SPSS
 Excel
 SAS
 R
 PYTHON

Don’t fear, just try one.

26
Exercise
 Given the following numbers that represent students grades
in a given course, find the following:
90, 85, 75, 45, 95, 66, 74, 83, 66, 57
 Arithmetic Mean
 Trimmed Mean
 Mode
 Variance
 Median
 Standard Deviation
 Range
 Plot the grades using Line Chart
 Use Excel/SPSS to find all previous metrics.
 Write your own notes
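
One possible way to check your answers is the Python sketch below, using only the standard library (trimming exactly one value from each end for the trimmed mean is an assumption):

```python
import statistics

grades = [90, 85, 75, 45, 95, 66, 74, 83, 66, 57]

print("mean:", statistics.mean(grades))
print("median:", statistics.median(grades))
print("mode:", statistics.mode(grades))
print("sample variance:", statistics.variance(grades))
print("sample std dev:", statistics.stdev(grades))
print("range:", max(grades) - min(grades))

# Trimmed mean: drop the single smallest and largest grade before averaging
trimmed = sorted(grades)[1:-1]
print("trimmed mean:", statistics.mean(trimmed))

# For the line chart, matplotlib's plt.plot(grades) followed by plt.show() is one option.
```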

27
Exercise
 Given the following numbers that represent students grades
in a given course, find the following:
85, 85, 85, 85, 85, 85, 85, 85, 85, 85
 Arithmetic Mean
 Trimmed Mean
 Mode
 Variance
 Median
 Standard Deviation
 Range
 Plot the grades using Line Chart
 Use Excel/SPSS to find all previous metrics.
 Write your own notes

28
Break for FUN: Amazing mathematics
 If: A B C D E F G H I J K L M N O P Q R S T U V W X Y Z =
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
 Then
 H+A+R+D+W+O+R+K ; 8+1+18+4+23+15+18+11=98%
 K+N+O+W+L+E+D+G+E ; 11+14+15+23+12+5+4+7+5=96%
 L+O+V+E; 12+15+22+5 = 54%
 L+U+C+K ; 12+21+3+11 = 47%
 None of them makes 100%. Then what makes 100% ???
 Is it Money? NO !!!
M+O+N+E+Y= 13+15+14+5+25=72%
 Leadership? NO !!!
L+E+A+D+E+R+S+H+I+P= 12+5+1+4+5+18+19+8+9+16=97%

 Every problem has a solution, only if we perhaps change our "ATTITUDE".


 A+T+T+I+T+U+D+E ; 1+20+20+9+20+21+4+5 = 100%


It is therefore OUR ATTITUDE towards Life and Work that makes OUR Life 100%
Successful..
Topic 2

(2) Data Preprocessing

30
Topics

 Data Preprocessing: An Overview


 Data Quality
 Major Tasks in Data Preprocessing
 Data Cleaning
 Data Integration
 Data Reduction
 Data Transformation and Data Discretization
 Summary

31
Data Preprocessing

32
Data Quality: Why Preprocess the Data?

 Data in the real world is dirty (GIGO: garbage in, garbage out …)


 Measures for data quality: A multidimensional view
 Accuracy: correct or wrong, accurate or not
 Completeness: not recorded, unavailable, …
 Consistency: some modified but some not, dangling, …
 Timeliness: timely update?
 Believability: how much the data are trusted to be correct?
 Interpretability: how easily the data can be understood?

33
Major Tasks in Data Preprocessing
 Data cleaning
 Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
 Data integration
 Integration of multiple databases, data cubes, or files
 Data reduction
 Dimensionality reduction
 Numerosity reduction
 Data compression
 Data transformation and data discretization
 Normalization
 Concept hierarchy generation

34
Data Cleaning
 Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., instrument faulty, human or computer error, transmission
error
 incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data

e.g., Occupation=“ ” (missing data)
 noisy: containing noise, errors, or outliers

e.g., Salary=“−10” (an error)
 inconsistent: containing discrepancies in codes or names,
e.g.,

Age=“42”, Birthday=“03/07/2010”

Was rating “1, 2, 3”, now rating “A, B, C”

discrepancy between duplicate records
 Intentional (e.g., disguised missing data)

Jan. 1 as everyone’s birthday?
35
Incomplete (Missing) Data
 Data is not always available
 E.g., many tuples have no recorded value for several
attributes, such as customer income in sales data
 Missing data may be due to
 equipment malfunction
 inconsistent with other recorded data and thus deleted
 data not entered due to misunderstanding
 certain data may not be considered important at the time
of entry
 not register history or changes of the data
 Missing data may need to be inferred …..

36
How to Handle Missing Data?
1. Ignore the tuple: This is usually done when the class label is missing:
(assuming the mining task involves classification). This method is not very effective,
unless the tuple contains several attributes with missing values. It is especially poor
when the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming
and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute
values by the same constant, such as a label like “Unknown” or ??. If missing values
are replaced by, say, “Unknown,” then the mining program may mistakenly think that
they form an interesting concept, since they all have a value in common—that of
“Unknown.” Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: For example, suppose that
the average income of AllElectronics customers is $56,000. Use this value to replace
the missing value for income.
5. Use the attribute mean for all samples belonging to the same class as the
given tuple (Stratified Mean): For example, if classifying customers according to
credit risk, replace the missing value with the average income value for customers in
the same credit risk category as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian formalism, or
decision tree induction. For example, using the other customer attributes in your data
set, you may construct a decision tree to predict the missing values for income.
7. Use a domain specific method.

Optional Assignment: Write a program that implements these methods (a pandas sketch follows as a starting point).
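
A starting point for the optional assignment: a pandas sketch (the column names A1, A2, Class mirror the exercise on the next slide; treat this as one possible implementation, not a reference solution) that fills missing values with the overall attribute mean, the class-stratified mean, and a global constant:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "A1": [10, 12, 15, 20, np.nan, 9, 5, 19, 17, 16],
    "A2": [650, 150, 200, 200, 350, 400, 500, np.nan, 100, 300],
    "Class": ["Yes", "No", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "Yes"],
})

# Method 4: fill with the overall attribute mean
mean_filled = df.fillna({"A1": df["A1"].mean(), "A2": df["A2"].mean()})

# Method 5: fill with the mean of samples in the same class (stratified mean)
strat_filled = df.copy()
for col in ["A1", "A2"]:
    strat_filled[col] = df.groupby("Class")[col].transform(lambda s: s.fillna(s.mean()))

# Method 3: fill with a global constant per attribute
const_filled = df.fillna({"A1": 10, "A2": 250})
```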


37
Exercise
 Fill in the missing values (??X and ??Y) using the following methods:
   Attribute Mean
   Stratified Attribute Mean
   Global Constant
   Use Linear Regression (wait)

  A1    A2    Class
  10    650   Yes
  12    150   No
  15    200   No
  20    200   Yes
  ??X   350   Yes
  9     400   Yes
  5     500   No
  19    ??Y   No
  17    100   Yes
  16    300   Yes

38
Exercise
 Fill in the missing values (??X and ??Y) using the following methods (same table as the previous slide):
   Attribute Mean: x = 13.66, y = 316.66
   Stratified Attribute Mean: x = (10 + 20 + 9 + 17 + 16) / 5 = 14.4; y = (150 + 200 + 500) / 3 = 283.3
   Global Constant: assume any missing value for A1 is the global constant 10, and any missing value for A2 is the global constant 250
   Use Linear Regression (wait)

  A1    A2    Class
  10    650   Yes
  12    150   No
  15    200   No
  20    200   Yes
  ??X   350   Yes
  9     400   Yes
  5     500   No
  19    ??Y   No
  17    100   Yes
  16    300   Yes

39
Noisy Data
 Noise: random error or variance in a measured variable
 Incorrect attribute values may be due to

faulty data collection instruments

data entry problems

data transmission problems

technology limitation

inconsistency in naming convention
 Other data problems which require data cleaning

duplicate records

incomplete data

inconsistent data

40
How to Handle Noisy Data?
 Binning

first sort data and partition into (equal-frequency) bins

then one can

smooth by bin means,
smooth by bin median,
smooth by bin boundaries, etc.

 Regression

smooth by fitting the data into regression functions
 Clustering

detect and remove outliers
 Combined computer and human inspection

detect suspicious values and check by human (e.g., deal with
possible outliers)

41
Binning

42
Example: Binning
S={28, 25, 21, 8, 4, 34, 24, 15, 21 }
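
A sketch of equal-frequency binning and smoothing by bin means for this set (three bins of three values each; the bin count is an assumption):

```python
S = [28, 25, 21, 8, 4, 34, 24, 15, 21]

# Sort first, then partition into 3 equal-frequency bins
data = sorted(S)                      # [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = [data[0:3], data[3:6], data[6:9]]

# Smooth by bin means: every value in a bin is replaced by the bin's mean
smoothed = [[round(sum(b) / len(b), 1)] * len(b) for b in bins]
print(bins)      # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smoothed)  # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
```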

43
Data Preprocessing to build a Data
Warehouse

44
Data Integration
 Data integration:
 Combines data from multiple sources into a coherent
store
 Schema integration: e.g., A.cust-id  B.cust-#
 Integrate metadata from different sources
 Entity identification problem:
 Identify real world entities from multiple data
sources,
e.g., Bill Clinton = William Clinton
 Detecting and resolving data value conflicts
 For the same real world entity, attribute values from
different sources are different
 Possible reasons: different representations, different scales,
e.g., metric vs. British units, mm vs cm

45
Handling Redundancy in Data Integration

 Redundant data occur often when integrating multiple databases

  Object identification: The same attribute or object may have different names in different databases
  Derivable data: One attribute may be a “derived” attribute in another table, e.g., annual revenue

 Redundant attributes may be detectable by correlation analysis and covariance analysis
 Careful integration of the data from multiple sources may help to:
  reduce/avoid redundancies and inconsistencies, and
  improve mining speed and quality

46
Correlation Analysis (Numeric Data)

 Correlation coefficient (also called Pearson’s product-moment coefficient):

  $r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}$

  where n is the number of tuples, $\bar{A}$ and $\bar{B}$ are the respective means of A and B, $\sigma_A$ and $\sigma_B$ are the respective standard deviations of A and B, and $\sum a_i b_i$ is the sum of the AB cross-product.

 If $r_{A,B}$ > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.
 $r_{A,B}$ = 0: independent
 $r_{A,B}$ < 0: negatively correlated


Exercise
 Find the correlation between A1 and A2.

A1 A2 Class
10 650 Yes
12 550 No
15 200 No
20 100 Yes
14 350 Yes
9 700 Yes
5 900 No
19 250 No
17 200 Yes
16 300 Yes

=CORREL(A2:A11,B2:B11)
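
A quick cross-check of the Excel CORREL formula above, as a Python sketch (numpy's corrcoef returns the full correlation matrix; the [0, 1] entry is r between A1 and A2):

```python
import numpy as np

a1 = [10, 12, 15, 20, 14, 9, 5, 19, 17, 16]
a2 = [650, 550, 200, 100, 350, 700, 900, 250, 200, 300]

r = np.corrcoef(a1, a2)[0, 1]
print(r)   # strongly negative: as A1 grows, A2 tends to shrink
```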
48
Data Reduction Strategies
 Data reduction: Obtain a reduced representation of the data set
that is much smaller in volume but yet produces the same (or almost
the same) analytical results
 Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time
to run on the complete data set.
 Data reduction strategies
 Dimensionality reduction, e.g., remove unimportant attributes

Wavelet transforms

Principal Components Analysis (PCA)

Feature subset selection, feature creation
 Numerosity reduction (some simply call it: Data Reduction)

Regression and Log-Linear Models

Histograms, clustering, sampling

Data cube aggregation
 Data compression

49
Data Reduction
1: Dimensionality Reduction
 Curse of dimensionality
 When dimensionality increases, data becomes increasingly sparse
 Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
 The possible combinations of subspaces will grow exponentially
 Dimensionality reduction
 Avoid the curse of dimensionality
 Help eliminate irrelevant features and reduce noise
 Reduce time and space required in data mining
 Allow easier visualization
 Dimensionality reduction techniques
 Wavelet transforms
 Principal Component Analysis
 Supervised and nonlinear techniques (e.g., feature selection)

50
Attribute Subset Selection
 Another way to reduce dimensionality of data
 Redundant attributes
 Duplicate much or all of the information contained in one or
more other attributes
 E.g., purchase price of a product and the amount of sales tax
paid

 Irrelevant attributes
 Contain no information that is useful for the data mining task at
hand
 E.g., students' ID is often irrelevant to the task of predicting
students' GPA

51
Heuristic Search in Attribute
Selection
 There are 2^d possible attribute combinations of d attributes
 Typical heuristic attribute selection methods:
 Best single attribute under the attribute independence
assumption: choose by significance tests
 Best step-wise feature selection:

The best single-attribute is picked first

Then the next best attribute conditioned on the first, ...
 Step-wise attribute elimination:

Repeatedly eliminate the worst attribute
 Best combined attribute selection and elimination
 Optimal branch and bound:

Use attribute elimination and backtracking
52
The 2^d problem
Assume we have 5 attributes (A, B, C, D, E)

null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

53
Heuristic Search in Attribute
Selection

54
Attribute Creation (Feature
Generation)

 Create new attributes (features) that can capture the


important information in a data set more effectively than the
original ones

 Three general methodologies



Attribute extraction

Domain-specific

Mapping data to new space (see: data reduction)

E.g., Fourier transformation, wavelet transformation,
manifold approaches (not covered)

Attribute construction

Combining features

Data discretization

55
Clustering
 Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
 Can be very effective if data is clustered but not if data is
“smeared”
 Can have hierarchical clustering and be stored in multi-
dimensional index tree structures
 There are many choices of clustering definitions and
clustering algorithms

Wait until we cover the Cluster


Analysis Topic
56
Sampling


Sampling: obtaining a small sample s to represent the
whole data set N
 Allow a mining algorithm to run in complexity that is
potentially sub-linear to the size of the data
 Key principle: Choose a representative subset of the data
 Simple random sampling may have very poor
performance in the presence of skew
 Develop adaptive sampling methods, e.g., stratified
sampling:

 Note: Sampling may not reduce database I/Os (page at a time)

57
Types of Sampling Methods
 Simple random sampling
  There is an equal probability of selecting any particular item
 Sampling without replacement
  Once an object is selected, it is removed from the population
 Sampling with replacement
  A selected object is not removed from the population
 Stratified sampling
  Partition the data set, and draw samples from each partition (proportionally, i.e., approximately the same percentage of the data)
  Used in conjunction with skewed data
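
A sketch of these sampling methods with pandas (the df contents and the 20% sampling fraction are assumptions; groupby + sample gives a simple proportional stratified sample):

```python
import pandas as pd

df = pd.DataFrame({
    "income": [15, 20, 24, 23, 12, 73, 45, 98, 78, 62],
    "risk":   ["lo", "lo", "lo", "lo", "lo", "hi", "lo", "hi", "hi", "hi"],
})

without_repl = df.sample(frac=0.2, replace=False, random_state=1)  # SRS without replacement
with_repl    = df.sample(frac=0.2, replace=True,  random_state=1)  # SRS with replacement

# Stratified: draw the same fraction from every 'risk' stratum
stratified = df.groupby("risk", group_keys=False).sample(frac=0.2, random_state=1)
```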
58
Data Reduction 3: Data Compression
 String compression

There are extensive theories and well-tuned algorithms

Typically lossless, but only limited manipulation is possible without
expansion
 Audio/video compression

Typically lossy compression, with progressive refinement

Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
 Time sequences are not audio
  Typically short and vary slowly with time

 Note: Dimensionality and numerosity reduction may also be


considered as forms of data compression

59
Think, Remember, and Discuss

 Lossy vs Lossless Techniques


 General Purpose vs Special Purpose
Techniques

 Domain specific methods: Medical, Financial ….


 Name some Tools : WinRar, WinZip, Compress, …..

 Note: This topic is well covered in CIS 280 and CIS


256 Courses.

60
Data Compression

Lossless: the original data can be reconstructed exactly from the compressed data.
Lossy: only an approximation of the original data can be recovered.

61
Data Transformation
 A function that maps the entire set of values of a given attribute to a new
set of replacement values such that each old value can be identified with
one of the new values
 Methods

Smoothing: Remove noise from data

Attribute/feature construction

New attributes constructed from the given ones

Aggregation: Summarization, data cube construction

Normalization: Scaled to fall within a smaller, specified range

min-max normalization

z-score normalization

normalization by decimal scaling

Discretization: Concept hierarchy climbing

62
Normalization
 Min-max normalization: maps [minA, maxA] to [new_minA, new_maxA]

  $v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A$

 Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

  $v' = \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$
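
A one-function Python sketch of min-max normalization (the values reproduce the example above):

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    # Rescale v from [min_a, max_a] into [new_min, new_max]
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

print(min_max(73_600, 12_000, 98_000))   # 0.716...
```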
63
Normalization
 Z-score normalization (μ: mean, σ: standard deviation):

  $v' = \frac{v - \mu_A}{\sigma_A}$

 Ex. Let income range $12,000 to $98,000, and let μ = 54,000, σ = 16,000. Then $73,600 is mapped to

  $v' = \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$
64
Normalization
 Normalization by decimal scaling:

  $v' = \frac{v}{10^{\,j}}$,  where j is the smallest integer such that $\max(|v'|) < 1$

 To normalize by decimal scaling:
  Find the largest number in the given range
  Count the number of digits in the largest number (i.e., j = 3)
  Divide each number by 10^j (i.e., 10^3 = 1000)

 Example: the normalized values for 90 and 150 are 0.09 and 0.15
65
Exercise
Normalize the Income values using min-max, z-score, and decimal scaling:

  Income   Min-Max   Z-Score   Dec Scale j=2   Dec Scale j=3
  15500    0.041     -0.963    155.00          15.50
  20000    0.093     -0.816    200.00          20.00
  24000    0.14      -0.686    240.00          24.00
  23000    0.128     -0.719    230.00          23.00
  12000    0         -1.076    120.00          12.00
  73600    0.716     0.926     736.00          73.60
  45000    0.384     -0.004    450.00          45.00
  98000    1         1.72      980.00          98.00
  78000    0.767     1.069     780.00          78.00
  62000    0.581     0.549     620.00          62.00

Parameters: new min = 0, new max = 1, min = 12000, max = 98000, mean = 45110, SD = 30757.27376 (apply the min-max, z-score, and decimal-scaling formulas above).

66
Discretization
 Three types of attributes
 Nominal—values from an unordered set, e.g., color, profession
 Ordinal—values from an ordered set, e.g., military or academic
rank
 Numeric—real numbers, e.g., integer or real numbers
 Discretization: Divide the range of a continuous attribute into
intervals
 Interval labels can then be used to replace actual data values
 Reduce data size by discretization
 Supervised vs. unsupervised
 Split (top-down) vs. merge (bottom-up)
 Discretization can be performed recursively on an attribute
 Prepare for further analysis, e.g., classification

67
Data Discretization Methods
 Typical methods: All the methods can be applied recursively
 Binning

Top-down split, unsupervised
 Histogram analysis

Top-down split, unsupervised
 Clustering analysis (unsupervised, top-down split or
bottom-up merge)
 Decision-tree analysis (supervised, top-down split)
 Correlation (e.g., 2) analysis (unsupervised, bottom-up
merge)

68
Simple Discretization: Binning

 Equal-width (distance) partitioning


 Divides the range into N intervals of equal size: uniform
grid
 if A and B are the lowest and highest values of the
attribute, the width of intervals will be: W = (B –A)/N.
 The most straightforward, but outliers may dominate
presentation
 Skewed data is not handled well
 Equal-depth (frequency) partitioning
 Divides the range into N intervals, each containing
approximately same number of samples (frequency = size
of the interval). 69
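
A small sketch contrasting the two partitioning schemes (the sample data and N = 3 are assumptions):

```python
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
N = 3

# Equal-width: W = (B - A) / N, bin k covers [A + k*W, A + (k+1)*W)
A, B = data[0], data[-1]
W = (B - A) / N
equal_width = [[x for x in data
                if A + k * W <= x < A + (k + 1) * W or (k == N - 1 and x == B)]
               for k in range(N)]

# Equal-depth: each bin gets roughly the same number of samples
size = len(data) // N
equal_depth = [data[k * size:(k + 1) * size] for k in range(N)]

print(equal_width)  # [[4, 8], [15, 21, 21], [24, 25, 28, 34]]
print(equal_depth)  # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
```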
Example: Binning using equal Frequency
Method

70
Concept Hierarchy Generation

71
Concept Hierarchy:
Examples

72
Concept Hierarchy
Generation
 Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
 Concept hierarchies facilitate drilling and rolling in data warehouses
to view data in multiple granularity.
 Concept hierarchy formation: Recursively reduce the data by
collecting and replacing low level concepts (such as numeric values
for age) by higher level concepts (such as youth, adult, or senior)
 Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
 Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.

73
Reminder: Data Cube

74
An Example of a Data Cube

75
An Example of a Data Cube

76
Concept Hierarchy Generation
for Nominal Data
 Specification of a partial/total ordering of attributes explicitly at the
schema level by users or experts
 street < city < state < country
 Specification of a hierarchy for a set of values by explicit data
grouping
 {Urbana, Champaign, Chicago} < Illinois
 Specification of only a partial set of attributes
 E.g., only street < city, not others
 Automatic generation of hierarchies (or attribute levels) by the
analysis of the number of distinct values
 E.g., for a set of attributes: {street, city, state, country}

77
Automatic Concept Hierarchy
Generation
 Some hierarchies can be automatically generated based on the analysis
of the number of distinct values per attribute in the data set
 The attribute with the most distinct values is placed at the lowest

level of the hierarchy


 Exceptions, e.g., weekday, month, quarter, year

country 15 distinct values

province_or_ state 365 distinct values

city 3567 distinct values

street 674,339 distinct values
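
A sketch of this heuristic (the location table is hypothetical; nunique counts distinct values per column, and sorting puts the attribute with the most distinct values at the lowest hierarchy level):

```python
import pandas as pd

locations = pd.DataFrame({
    "country":  ["Jordan", "Jordan", "Jordan", "Jordan"],
    "province": ["Irbid", "Irbid", "Amman", "Amman"],
    "city":     ["Irbid", "Ramtha", "Amman", "Amman"],
    "street":   ["University St", "Main St", "Queen Rania St", "Gardens St"],
})

distinct = locations.nunique().sort_values()   # fewest distinct values first
# Read bottom-up: most distinct values -> lowest hierarchy level
hierarchy = " < ".join(reversed(distinct.index.tolist()))
print(distinct)
print(hierarchy)   # street < city < province < country
```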

78
Summary
 Data attribute types: nominal, binary, ordinal, interval-scaled, ratio-
scaled
 Many types of data sets, e.g., numerical, text, graph, Web, image.
 Gain insight into the data by:
 Basic statistical data description: central tendency, dispersion,
graphical displays
 Data visualization: map data onto graphical primitives
 Measure data similarity
 Above steps are the beginning of data preprocessing.
 Many methods have been developed but still an active area of
research.

79
Summary
 Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
 Data cleaning: e.g. missing/noisy values, outliers
 Data integration from multiple sources:

Entity identification problem

Remove redundancies

Detect inconsistencies
 Data reduction

Dimensionality reduction

Numerosity reduction

Data compression
 Data transformation and data discretization

Normalization

Concept hierarchy generation

80
Relax and Think
