
Data Mining: Concepts and Techniques
(3rd ed.)

— Chapter 6 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign &
Simon Fraser University

©2011 Han, Kamber & Pei. All rights reserved.
Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern Evaluation Methods
 Summary
What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the
context of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together?— Beer and
diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale
campaign analysis, Web log (click stream) analysis, and DNA
sequence analysis.
Why Is Frequent Pattern Mining Important?

 Frequent pattern: an intrinsic and important property of datasets
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis

 Sequential, structural (e.g., sub-graph) patterns

 Pattern analysis in spatiotemporal, multimedia, time-series, and stream data


 Classification: discriminative, frequent pattern analysis

 Cluster analysis: frequent pattern-based clustering

 Data warehousing: iceberg cube and cube-gradient

 Semantic data compression: fascicles

 Broad applications

Basic Concepts: Frequent Patterns

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

 itemset: a set of one or more items
 k-itemset X = {x1, …, xk}
 (absolute) support, or support count, of X: frequency or number of occurrences of an itemset X
 (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 An itemset X is frequent if X’s support is no less than a minsup threshold

[Figure: Venn diagram of customers buying beer, customers buying diapers, and customers buying both]
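A minimal Python sketch of these definitions over the transaction table above (helper names are my own, not from the book):

    transactions = {
        10: {"Beer", "Nuts", "Diaper"},
        20: {"Beer", "Coffee", "Diaper"},
        30: {"Beer", "Diaper", "Eggs"},
        40: {"Nuts", "Eggs", "Milk"},
        50: {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
    }

    def support_count(itemset, db):
        """Absolute support: number of transactions containing the itemset."""
        return sum(1 for t in db.values() if itemset <= t)

    def support(itemset, db):
        """Relative support: fraction of transactions containing the itemset."""
        return support_count(itemset, db) / len(db)

    # {Beer, Diaper} occurs in 3 of 5 transactions: frequent for any minsup <= 60%
    print(support_count({"Beer", "Diaper"}, transactions))  # 3
    print(support({"Beer", "Diaper"}, transactions))        # 0.6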
Basic Concepts: Association Rules

Tid  Items bought
10   Beer, Nuts, Diaper
20   Beer, Coffee, Diaper
30   Beer, Diaper, Eggs
40   Nuts, Eggs, Milk
50   Nuts, Coffee, Diaper, Eggs, Milk

 Find all the rules X → Y with minimum support and confidence
 support, s, probability that a transaction contains X ∪ Y
 confidence, c, conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3

 Association rules: (many more!)
 Beer → Diaper (60%, 100%)
 Diaper → Beer (60%, 75%)
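Continuing the sketch (the transactions dict and support helpers from the previous snippet are assumed), rule support and confidence follow directly from these definitions:

    def rule_support(X, Y, db):
        """Support of X -> Y: fraction of transactions containing X ∪ Y."""
        return support(X | Y, db)

    def confidence(X, Y, db):
        """Confidence of X -> Y: conditional probability of Y given X."""
        return support_count(X | Y, db) / support_count(X, db)

    # Beer -> Diaper: support 3/5 = 60%, confidence 3/3 = 100%
    print(rule_support({"Beer"}, {"Diaper"}, transactions))  # 0.6
    print(confidence({"Beer"}, {"Diaper"}, transactions))    # 1.0
    # Diaper -> Beer: support 60%, confidence 3/4 = 75%
    print(confidence({"Diaper"}, {"Beer"}, transactions))    # 0.75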
Closed Patterns and Max-Patterns

 A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27×10^30 sub-patterns!
 Solution: Mine closed patterns and max-patterns instead
 An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT’99)
 An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD’98)
 Closed pattern is a lossless compression of freq. patterns
 Reduces the # of patterns and rules
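A minimal sketch of the two definitions, assuming a dict that maps every frequent itemset (as a frozenset) to its support count:

    def closed_patterns(freq):
        """Frequent itemsets with no proper superset of equal support."""
        return {X for X in freq
                if not any(X < Y and freq[Y] == freq[X] for Y in freq)}

    def max_patterns(freq):
        """Frequent itemsets with no frequent proper superset at all."""
        return {X for X in freq if not any(X < Y for Y in freq)}

    # Toy DB = {abc, ab}, min_sup = 1: all frequent itemsets and their counts
    freq = {frozenset(s): c for s, c in
            [("a", 2), ("b", 2), ("c", 1), ("ab", 2),
             ("ac", 1), ("bc", 1), ("abc", 1)]}
    print(closed_patterns(freq))  # {a,b}: 2 and {a,b,c}: 1 are the closed patterns
    print(max_patterns(freq))     # {a,b,c} is the only max-pattern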
Closed Patterns and Max-Patterns

 Exercise. DB = {<a1, …, a100>, <a1, …, a50>}
 Min_sup = 1.
 What is the set of closed itemsets?
 <a1, …, a100>: 1
 <a1, …, a50>: 2
 What is the set of max-patterns?
 <a1, …, a100>: 1
 What is the set of all patterns?
 !! (all 2^100 − 1 nonempty subsets of {a1, …, a100}: far too many to enumerate)
Computational Complexity of Frequent Itemset Mining

 How many itemsets may potentially be generated in the worst case?
 The number of frequent itemsets to be generated is sensitive to the minsup threshold
 When minsup is low, there exist potentially an exponential number of frequent itemsets
 The worst case: M^N where M: # distinct items, and N: max length of transactions
Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern Evaluation Methods
 Summary
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
The Downward Closure Property and
Scalable Mining Methods
 The downward closure property of frequent
patterns
 Any subset of a frequent itemset must be
frequent
 If {beer, diaper, nuts} is frequent, so is
{beer, diaper}
 i.e., every transaction having {beer, diaper,
nuts} also contains {beer, diaper}
 Scalable mining methods: Three major approaches
 Apriori (Agrawal & Srikant @VLDB’94)
 Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
 Vertical data format approach (Eclat: Zaki et al. @KDD’97)
Apriori: A Candidate Generation & Test Approach

 Apriori pruning principle: If there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’94)
 Method:
 Initially, scan DB once to get frequent 1-itemsets
 Generate length (k+1) candidate itemsets from length k frequent itemsets
 Test the candidates against DB
 Terminate when no frequent or candidate set can be generated
The Apriori Algorithm—An Example

Sup_min = 2

Database TDB
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:           L1:
Itemset  sup             Itemset  sup
{A}      2               {A}      2
{B}      3               {B}      3
{C}      3               {C}      3
{D}      1               {E}      3
{E}      3

C2 (from L1), 2nd scan → L2:
Itemset  sup             Itemset  sup
{A, B}   1               {A, C}   2
{A, C}   2               {B, C}   2
{A, E}   1               {B, E}   3
{B, C}   2               {C, E}   2
{B, E}   3
{C, E}   2

C3 (from L2), 3rd scan → L3:
Itemset     sup
{B, C, E}   2
The Apriori Algorithm (Pseudo-Code)

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
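A compact runnable sketch of this pseudo-code in Python (a simplified illustration, not the book's optimized implementation; the pairwise-union candidate generation is a condensed form of the self-join described on the next slide):

    from itertools import combinations

    def apriori(db, min_sup):
        """db: list of transactions (sets of items); min_sup: absolute count."""
        counts = {}
        for t in db:
            for item in t:
                key = frozenset([item])
                counts[key] = counts.get(key, 0) + 1
        Lk = {X: c for X, c in counts.items() if c >= min_sup}
        all_frequent = dict(Lk)
        k = 1
        while Lk:
            # Self-join: unite pairs of frequent k-itemsets sharing k-1 items,
            # then prune any candidate that has an infrequent k-subset
            items = list(Lk)
            Ck1 = set()
            for i in range(len(items)):
                for j in range(i + 1, len(items)):
                    union = items[i] | items[j]
                    if len(union) == k + 1 and all(
                            frozenset(s) in Lk for s in combinations(union, k)):
                        Ck1.add(union)
            # One scan of the database counts all surviving candidates
            counts = {c: 0 for c in Ck1}
            for t in db:
                for c in Ck1:
                    if c <= t:
                        counts[c] += 1
            Lk = {X: c for X, c in counts.items() if c >= min_sup}
            all_frequent.update(Lk)
            k += 1
        return all_frequent

    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    for X, c in sorted(apriori(db, 2).items(), key=lambda p: (len(p[0]), sorted(p[0]))):
        print(sorted(X), c)  # reproduces L1, L2 and L3 = {B, C, E}: 2 from the example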
Implementation of Apriori

 How to generate candidates?


 Step 1: self-joining Lk
 Step 2: pruning
 Example of Candidate-generation
 L3={abc, abd, acd, ace, bcd}
 Self-joining: L3*L3
 abcd from abc and abd
 acde from acd and ace
 Pruning:
 acde is removed because ade is not in L3
 C4 = {abcd}
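A minimal sketch of these two steps on this example, representing itemsets as sorted strings as on the slide (helper names are my own):

    from itertools import combinations

    def self_join(Lk):
        """Join two k-itemsets that agree on the first k-1 (sorted) items."""
        out = set()
        for p in Lk:
            for q in Lk:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:
                    out.add(p + q[-1])
        return out

    def prune(candidates, Lk):
        """Drop any candidate with a k-subset that is not frequent."""
        return {c for c in candidates
                if all("".join(s) in Lk for s in combinations(c, len(c) - 1))}

    L3 = {"abc", "abd", "acd", "ace", "bcd"}
    C4 = prune(self_join(L3), L3)
    print(C4)  # {'abcd'}; 'acde' is removed because 'ade' is not in L3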
How to Count Supports of Candidates?

 Why is counting supports of candidates a problem?
 The total number of candidates can be very huge
 One transaction may contain many candidates
 Method:
 Candidate itemsets are stored in a hash-tree
 Leaf node of hash-tree contains a list of itemsets and counts
 Interior node contains a hash table
 Subset function: finds all the candidates contained in a transaction
Candidate Generation: An SQL Implementation

 SQL implementation of candidate generation
 Suppose the items in Lk-1 are listed in an order
 Step 1: self-joining Lk-1
    insert into Ck
    select p.item1, p.item2, …, p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
 Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck
 Use object-relational extensions like UDFs, BLOBs, and table functions for efficient implementation [S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD’98]
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
Further Improvement of the Apriori
Method

 Major computational challenges


 Multiple scans of transaction database
 Huge number of candidates
 Tedious workload of support counting for
candidates
 Improving Apriori: general ideas
 Reduce passes of transaction database
scans
 Shrink number of candidates
 Facilitate support counting of candidates
Partition: Scan Database Only Twice

 Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
 Scan 1: partition database and find local frequent patterns
 Scan 2: consolidate global frequent patterns
 A. Savasere, E. Omiecinski and S. Navathe, VLDB’95

DB = DB1 ∪ DB2 ∪ … ∪ DBk; if sup_j(i) < σ·|DBj| in every partition DBj, then sup(i) < σ·|DB| in the whole database.
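A minimal two-scan sketch of this idea (reusing the apriori function assumed above; the partitioning scheme and the ceiling on the local threshold are my own choices, not those of the paper):

    import math

    def partition_mining(db, rel_minsup, k=2):
        """Two scans: mine each partition locally, then verify the union globally."""
        parts = [db[i::k] for i in range(k)]
        # Scan 1: any globally frequent itemset is locally frequent somewhere,
        # so the union of local results is a complete global candidate set
        candidates = set()
        for p in parts:
            local_min = max(1, math.ceil(len(p) * rel_minsup))
            candidates |= set(apriori(p, local_min))
        # Scan 2: one pass over the full database counts every candidate
        counts = {c: sum(1 for t in db if c <= t) for c in candidates}
        return {c: s for c, s in counts.items() if s >= rel_minsup * len(db)}

    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(partition_mining(db, 0.5))  # same frequent itemsets as apriori(db, 2)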
DHP: Reduce the Number of Candidates

 A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
 Candidates: a, b, c, d, e
 Hash entries
 {ab, ad, ae}
 {bd, be, de}
 …

Hash table (bucket → count):
{ab, ad, ae}   35
{bd, be, de}   88
…              …
{yz, qs, wt}   102

 Frequent 1-itemset: a, b, d, e
 ab is not a candidate 2-itemset if the sum of count of {ab, ad, ae} is below support threshold
 J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD’95
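A minimal sketch of the bucket-counting idea (a simplified illustration; the hash function and bucket count are arbitrary choices, not those of Park, Chen & Yu):

    from itertools import combinations

    def dhp_prune_pairs(db, freq_items, min_sup, n_buckets=7):
        """While counting 1-itemsets, hash every 2-itemset of each transaction
        into a bucket; a pair can only be frequent if its bucket reaches min_sup."""
        buckets = [0] * n_buckets
        for t in db:
            for pair in combinations(sorted(t), 2):
                buckets[hash(pair) % n_buckets] += 1
        # Survivors: a superset of the frequent pairs; Apriori's counting scan
        # then runs only on these, never on pairs from under-threshold buckets
        return {pair for pair in combinations(sorted(freq_items), 2)
                if buckets[hash(pair) % n_buckets] >= min_sup}

    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    print(dhp_prune_pairs(db, {"A", "B", "C", "E"}, min_sup=2))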
Sampling for Frequent Patterns

 Select a sample of original database, mine frequent patterns within sample using Apriori
 Scan database once to verify frequent itemsets found in sample; only borders of closure of frequent patterns are checked
 Example: check abcd instead of ab, ac, …, etc.
 Scan database again to find missed frequent patterns
 H. Toivonen. Sampling large databases for association rules. VLDB’96
DIC: Reduce Number of Scans

 Once both A and D are determined frequent, the counting of AD begins
 Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins

[Figure: itemset lattice from {} up to ABCD, beside a transaction timeline. Apriori starts counting each itemset length only after a full pass; DIC starts counting 1-, 2-, and 3-itemsets as soon as their subsets qualify, partway through a scan]

 S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD’97
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
Frequent Patterns Without Candidate Generation

 Bottlenecks of the Apriori approach
 Breadth-first (i.e., level-wise) search
 Candidate generation and test
 Often generates a huge number of candidates
 The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’00)
 Depth-first search
 Avoid explicit candidate generation
 Major philosophy: Grow long patterns from short ones using local frequent items only
 “abc” is a frequent pattern
 Get all transactions having “abc”, i.e., project DB on abc: DB|abc
 “abcd” is a frequent pattern if “d” is a local frequent item in DB|abc
Construct FP-tree from a Transaction Database

min_support = 3

TID  Items bought                (ordered) frequent items
100  {f, a, c, d, g, i, m, p}    {f, c, a, m, p}
200  {a, b, c, f, l, m, o}       {f, c, a, b, m}
300  {b, f, h, j, o, w}          {f, b}
400  {b, c, k, s, p}             {c, b, p}
500  {a, f, c, e, l, p, m, n}    {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single-item patterns)
2. Sort frequent items in frequency descending order → F-list = f-c-a-b-m-p
3. Scan DB again, construct FP-tree

Header table: f:4, c:4, a:3, b:3, m:3, p:3

[Figure: FP-tree rooted at {}: path f:4 → c:3 → a:3 → m:2 → p:2, with b:1 → m:1 branching under a, b:1 under f, and c:1 → b:1 → p:1 directly under the root; header-table entries link to the item occurrences]
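A compact Python sketch of this two-pass construction (a simplified illustration; the Node class, header-table layout, and tiebreak parameter are my own choices, with the f/c frequency tie broken to match the slide's F-list):

    from collections import defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count, self.children = 0, {}

    def build_fp_tree(db, min_sup, tiebreak=""):
        # Pass 1: count items; F-list = frequent items in descending frequency
        freq = defaultdict(int)
        for t in db:
            for item in t:
                freq[item] += 1
        flist = [i for i, c in sorted(freq.items(),
                                      key=lambda x: (-x[1], tiebreak.find(x[0]), x[0]))
                 if c >= min_sup]
        rank = {item: r for r, item in enumerate(flist)}
        # Pass 2: insert each transaction's frequent items in F-list order,
        # sharing common prefixes; header keeps the node-links per item
        root = Node(None, None)
        header = defaultdict(list)
        for t in db:
            node = root
            for item in sorted((i for i in t if i in rank), key=rank.get):
                child = node.children.get(item)
                if child is None:
                    child = node.children[item] = Node(item, node)
                    header[item].append(child)
                child.count += 1
                node = child
        return root, header, flist

    db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
    root, header, flist = build_fp_tree(db, 3, tiebreak="fcabmp")
    print(flist)                           # ['f', 'c', 'a', 'b', 'm', 'p']
    print([n.count for n in header["f"]])  # [4]: one shared f-node, as on the slide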
Partition Patterns and Databases

 Frequent patterns can be partitioned into subsets according to f-list
 F-list = f-c-a-b-m-p
 Patterns containing p
 Patterns having m but no p
 …
 Patterns having c but no a nor b, m, p
 Pattern f
 Completeness and non-redundancy
Find Patterns Having P From P-conditional Database

 Starting at the frequent item header table in the FP-tree
 Traverse the FP-tree by following the link of each frequent item p
 Accumulate all of the transformed prefix paths of item p to form p’s conditional pattern base

Conditional pattern bases:
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees

 For each pattern-base
 Accumulate the count for each item in the base
 Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} → f:3 → c:3 → a:3 (b is dropped, since its count 1 < min_support)
All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam
Recursion: Mining Each Conditional FP-tree

 Cond. pattern base of “am”: (fc:3) → am-conditional FP-tree: {} → f:3 → c:3
 Cond. pattern base of “cm”: (f:3) → cm-conditional FP-tree: {} → f:3
 Cond. pattern base of “cam”: (f:3) → cam-conditional FP-tree: {} → f:3

(Each is derived from the m-conditional FP-tree {} → f:3 → c:3 → a:3 of the previous slide.)
A Special Case: Single Prefix Path in FP-tree

 Suppose a (conditional) FP-tree T has a shared single prefix-path P
 Mining can be decomposed into two parts
 Reduction of the single prefix path into one node
 Concatenation of the mining results of the two parts

[Figure: a tree whose single prefix path {} → a1:n1 → a2:n2 → a3:n3 branches into subtrees b1:m1, C1:k1, C2:k2, C3:k3 is decomposed into the prefix path reduced to one node r1, plus the multi-branch part rooted at r1]
Benefits of the FP-tree Structure

 Completeness
 Preserve complete information for frequent
pattern mining
 Never break a long pattern of any
transaction
 Compactness
 Reduce irrelevant info—infrequent items are
gone
 Items in frequency descending order: the
more frequently occurring, the more likely
to be shared
 Never be larger than the original database
The Frequent Pattern Growth Mining
Method

 Idea: Frequent pattern growth


 Recursively grow frequent patterns by
pattern and database partition
 Method
 For each frequent item, construct its
conditional pattern-base, and then its
conditional FP-tree
 Repeat the process on each newly created
conditional FP-tree
 Until the resulting FP-tree is empty, or it
contains only one path—single path will
generate all the combinations of its sub-
paths, each of which is a frequent pattern
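A minimal recursive FP-growth sketch reusing the build_fp_tree/Node helpers assumed above (an illustration of the method, not the authors' optimized implementation; the single-prefix-path shortcut is omitted, and conditional pattern bases are expanded rather than kept as weighted paths):

    def ascend(node):
        """Items on the prefix path above a node (excluding the node and the root)."""
        path, node = [], node.parent
        while node.parent is not None:
            path.append(node.item)
            node = node.parent
        return path

    def fp_growth(db, min_sup, suffix=()):
        """Yield (pattern, support) pairs from a list of transactions (sets)."""
        root, header, flist = build_fp_tree(db, min_sup)
        for item in reversed(flist):  # grow from the bottom of the F-list
            support = sum(n.count for n in header[item])
            pattern = (item,) + suffix
            yield pattern, support
            # Conditional pattern base: one copy of each prefix path per count
            cond_db = [set(ascend(n)) for n in header[item] for _ in range(n.count)]
            yield from fp_growth(cond_db, min_sup, pattern)

    db = [set("facdgimp"), set("abcflmo"), set("bfhjow"), set("bcksp"), set("afcelpmn")]
    patterns = {frozenset(p): s for p, s in fp_growth(db, 3)}
    print(patterns[frozenset("m")], patterns[frozenset("cm")],
          patterns[frozenset("fcam")])  # 3 3 3, matching the m-patterns slide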
Scaling FP-growth by Database
Projection
 What if the FP-tree cannot fit in memory?
 DB projection
 First partition a database into a set of projected DBs
 Then construct and mine FP-tree for each projected
DB
 Parallel projection vs. partition projection techniques
 Parallel projection
 Project the DB in parallel for each frequent
item
 Parallel projection is space costly
 All the partitions can be processed in parallel
 Partition projection
 Partition the DB based on the ordered frequent items
Partition-Based Projection

 Parallel projection needs a lot of disk space
 Partition projection saves it

Tran. DB: fcamp, fcabm, fb, cbp, fcamp

p-proj DB: fcam, cb, fcam
m-proj DB: fcab, fca, fca
b-proj DB: f, cb, …
a-proj DB: fc, …
c-proj DB: f, …
f-proj DB: …

am-proj DB: fc, fc, fc
cm-proj DB: f, f, f
…
Performance of FPGrowth in Large Datasets

[Figure: two plots of runtime (sec.) vs. support threshold (%). Left: D1 FP-growth runtime vs. D1 Apriori runtime on data set T25I20D10K. Right: D2 FP-growth vs. D2 TreeProjection on data set T25I20D100K.]
Advantages of the Pattern Growth
Approach

 Divide-and-conquer:
 Decompose both the mining task and DB
according to the frequent patterns obtained so far
 Lead to focused search of smaller databases
 Other factors
 No candidate generation, no candidate test
 Compressed database: FP-tree structure
 No repeated scan of entire database
 Basic ops: counting local freq items and building
sub FP-tree, no pattern search and matching
 A good open-source implementation and refinement of FPGrowth
 FPGrowth+ (Grahne and Zhu, FIMI’03)
Further Improvements of Mining
Methods

 AFOPT (Liu, et al. @ KDD’03)


 A “push-right” method for mining condensed
frequent pattern (CFP) tree
 Carpenter (Pan, et al. @ KDD’03)
 Mine data sets with small rows but numerous
columns
 Construct a row-enumeration tree for efficient
mining
 FPgrowth+ (Grahne and Zhu, FIMI’03)
 Efficiently Using Prefix-Trees in Mining Frequent Itemsets, Proc. ICDM’03 Int. Workshop on Frequent Itemset Mining Implementations (FIMI’03)
Extension of Pattern Growth Mining
Methodology
 Mining closed frequent itemsets and max-patterns
 CLOSET (DMKD’00), FPclose, and FPMax (Grahne & Zhu, FIMI’03)
 Mining sequential patterns
 PrefixSpan (ICDE’01), CloSpan (SDM’03), BIDE (ICDE’04)

 Mining graph patterns


 gSpan (ICDM’02), CloseGraph (KDD’03)

 Constraint-based mining of frequent patterns


 Convertible constraints (ICDE’01), gPrune (PAKDD’03)

 Computing iceberg data cubes with complex measures


 H-tree, H-cubing, and Star-cubing (SIGMOD’01, VLDB’03)

 Pattern-growth-based Clustering
 MaPle (Pei, et al., ICDM’03)

 Pattern-Growth-Based Classification
 Mining frequent and discriminative patterns (Cheng, et al. @ICDE’07)
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
ECLAT: Mining by Exploring Vertical Data Format

 Vertical format: t(AB) = {T11, T25, …}
 tid-list: list of trans.-ids containing an itemset
 Deriving frequent patterns based on vertical intersections
 t(X) = t(Y): X and Y always happen together
 t(X) ⊂ t(Y): transaction having X always has Y
 Using diffset to accelerate mining
 Only keep track of differences of tids
 t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
 Diffset (XY, X) = {T2}
 Eclat (Zaki et al. @KDD’97)
 Mining closed patterns using vertical format: CHARM (Zaki & Hsiao @SDM’02)
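A minimal Eclat-style sketch (an illustration of tid-list intersection and diffsets, not Zaki's full algorithm; item order and helper names are my own):

    def eclat(tidlists, min_sup, prefix=()):
        """tidlists: {item: set of transaction ids}; yields (itemset, support)."""
        items = sorted(tidlists)
        for i, x in enumerate(items):
            tx = tidlists[x]
            if len(tx) < min_sup:
                continue
            yield prefix + (x,), len(tx)
            # Depth-first: intersect t(prefix + x) with each later item's tid-list
            suffix_lists = {y: tx & tidlists[y] for y in items[i + 1:]}
            yield from eclat(suffix_lists, min_sup, prefix + (x,))

    def diffset(t_x, t_xy):
        """Diffset(XY, X) = t(X) - t(XY); sup(XY) = sup(X) - |Diffset(XY, X)|."""
        return t_x - t_xy

    # Vertical view of the Apriori example database
    db = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
    tidlists = {}
    for tid, t in zip([10, 20, 30, 40], db):
        for item in t:
            tidlists.setdefault(item, set()).add(tid)
    for itemset, sup in eclat(tidlists, 2):
        print(itemset, sup)  # includes ('B', 'C', 'E') 2, matching the example
    print(diffset(tidlists["B"], tidlists["B"] & tidlists["C"]))  # {40}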
Scalable Frequent Itemset Mining Methods

 Apriori: A Candidate Generation-and-Test Approach
 Improving the Efficiency of Apriori
 FPGrowth: A Frequent Pattern-Growth Approach
 ECLAT: Frequent Pattern Mining with Vertical Data Format
Mining Frequent Closed Patterns: CLOSET

Min_sup = 2

TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f

 F-list: list of all frequent items in support ascending order
 F-list: d-a-f-e-c
 Divide search space
 Patterns having d
 Patterns having d but no a, etc.
 Find frequent closed patterns recursively
 Every transaction having d also has cfa → cfad is a frequent closed pattern
 J. Pei, J. Han & R. Mao. “CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets”, DMKD’00
CLOSET+: Mining Closed Itemsets by Pattern-Growth

 Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
 Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), X and all of X’s descendants in the set enumeration tree can be pruned
 Hybrid tree projection
 Bottom-up physical tree-projection
 Top-down pseudo tree-projection
 Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header table at higher levels
MaxMiner: Mining Max-Patterns

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

 1st scan: find frequent items
 A, B, C, D, E
 2nd scan: find support for
 AB, AC, AD, AE, ABCDE
 BC, BD, BE, BCDE
 CD, CE, CDE, DE
 (the long tail itemsets above are the potential max-patterns)
 Since BCDE is a max-pattern, no need to check BCD, BDE, CDE in later scan
 R. Bayardo. Efficiently mining long patterns from databases. SIGMOD’98
CHARM: Mining by Exploring Vertical Data Format

 Vertical format: t(AB) = {T11, T25, …}
 tid-list: list of trans.-ids containing an itemset
 Deriving closed patterns based on vertical intersections
 t(X) = t(Y): X and Y always happen together
 t(X) ⊂ t(Y): transaction having X always has Y
 Using diffset to accelerate mining
 Only keep track of differences of tids
 t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
 Diffset (XY, X) = {T2}
 CHARM (Zaki & Hsiao @SDM’02)
Visualization of Association Rules: Plane Graph

[Figure: plane-graph visualization of association rules]

Visualization of Association Rules: Rule Graph

[Figure: rule-graph visualization of association rules]

Visualization of Association Rules (SGI/MineSet 3.0)

[Figure: MineSet 3.0 association rule visualization]
Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern Evaluation Methods
 Summary
Interestingness Measure: Correlations (Lift)

 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%.
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
 Measure of dependent/correlated events: lift

lift = P(A ∪ B) / (P(A) P(B))

            Basketball   Not basketball   Sum (row)
Cereal      2000         1750             3750
Not cereal  1000         250              1250
Sum (col.)  3000         2000             5000

lift(B, C) = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33
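As a quick check of the numbers above, a few lines of Python arithmetic over the 2×2 table (variable names are my own):

    def lift(n_ab, n_a, n_b, n_total):
        """lift(A, B) = P(A and B) / (P(A) P(B)), estimated from counts."""
        return (n_ab / n_total) / ((n_a / n_total) * (n_b / n_total))

    # B = plays basketball (3000), C = eats cereal (3750), both: 2000, total: 5000
    print(round(lift(2000, 3000, 3750, 5000), 2))  # 0.89: B, C negatively correlated
    print(round(lift(1000, 3000, 1250, 5000), 2))  # 1.33: B, ¬C positively correlated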
Are lift and 2 Good Measures of
Correlation?

 “Buy walnuts
buy milk [1%, 80%]”
is misleading if
85% of customers
buy milk
 Support and
confidence are not
good to indicate
correlations
 Over 20
interestingness
measures have
been proposed
(see Tan, Kumar,
Sritastava @KDD’02) 53
Null-Invariant Measures

[Table: the five null-invariant measures and their definitions]
Comparison of Interestingness Measures

 Null-(transaction) invariance is crucial for correlation analysis
 Lift and χ² are not null-invariant
 5 null-invariant measures

            Milk    No Milk   Sum (row)
Coffee      m, c    ~m, c     c
No Coffee   m, ~c   ~m, ~c    ~c
Sum (col.)  m       ~m        Σ

Null-transactions w.r.t. m and c are those containing neither (~m, ~c). The Kulczynski measure (1927) is null-invariant. Subtle: the measures disagree on some datasets.
Analysis of DBLP Coauthor Relationships

[Table: top coauthor pairs in recent DB conferences, ranked by different measures, after removing balanced associations, low support, etc.]

 Advisor-advisee relation: Kulc: high, coherence: low, cosine: middle
 Tianyi Wu, Yuguo Chen and Jiawei Han, “Association Mining in Large Databases: A Re-Examination of Its Measures”, Proc. 2007 Int. Conf. Principles and Practice of Knowledge Discovery in Databases (PKDD’07), Sept. 2007
Which Null-Invariant Measure Is Better?

 IR (Imbalance Ratio): measures the imbalance of two itemsets A and B in rule implications:
   IR(A, B) = |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(A ∪ B))
 Kulczynski and Imbalance Ratio (IR) together present a clear picture for all the three datasets D4 through D6
 D4 is balanced & neutral
 D5 is imbalanced & neutral
 D6 is very imbalanced & neutral
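A small sketch of these two measures from their definitions (Kulc(A, B) = ½(sup(AB)/sup(A) + sup(AB)/sup(B)); the example counts are hypothetical):

    def kulczynski(sup_a, sup_b, sup_ab):
        """Kulc(A, B) = 0.5 * (sup(AB)/sup(A) + sup(AB)/sup(B)); null-invariant."""
        return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)

    def imbalance_ratio(sup_a, sup_b, sup_ab):
        """IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(AB))."""
        return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)

    # Hypothetical milk/coffee counts: m = 1100, c = 1100, mc = 100
    print(kulczynski(1100, 1100, 100))      # ~0.09: negatively associated
    print(imbalance_ratio(1100, 1100, 100)) # 0.0: perfectly balanced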
Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods

 Basic Concepts
 Frequent Itemset Mining Methods
 Which Patterns Are Interesting?—Pattern Evaluation Methods
 Summary
Summary

 Basic concepts: association rules, support-confidence framework, closed and max-patterns
 Scalable frequent pattern mining methods
 Apriori (candidate generation & test)
 Projection-based (FPgrowth, CLOSET+, …)
 Vertical format approach (ECLAT, CHARM, …)
 Which patterns are interesting? Pattern evaluation methods
Ref: Basic Concepts of Frequent
Pattern Mining

 (Association Rules) R. Agrawal, T. Imielinski, and A. Swami.


Mining association rules between sets of items in large
databases. SIGMOD'93
 (Max-pattern) R. J. Bayardo. Efficiently mining long patterns
from databases. SIGMOD'98
 (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L.
Lakhal. Discovering frequent closed itemsets for association
rules. ICDT'99
 (Sequential pattern) R. Agrawal and R. Srikant. Mining
sequential patterns. ICDE'95

Ref: Apriori and Its Improvements
 R. Agrawal and R. Srikant. Fast algorithms for mining
association rules. VLDB'94
 H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms
for discovering association rules. KDD'94
 A. Savasere, E. Omiecinski, and S. Navathe. An efficient
algorithm for mining association rules in large databases.
VLDB'95
 J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based
algorithm for mining association rules. SIGMOD'95
 H. Toivonen. Sampling large databases for association rules.
VLDB'96
 S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset
counting and implication rules for market basket analysis.
SIGMOD'97
 S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98
Ref: Depth-First, Projection-Based FP
Mining
 R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection
algorithm for generation of frequent itemsets. J. Parallel and
Distributed Computing, 2002.
 G. Grahne and J. Zhu, Efficiently Using Prefix-Trees in Mining Frequent
Itemsets, Proc. FIMI'03
 B. Goethals and M. Zaki. An introduction to workshop on frequent
itemset mining implementations. Proc. ICDM’03 Int. Workshop on
Frequent Itemset Mining Implementations (FIMI’03), Melbourne, FL,
Nov. 2003
 J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate
generation. SIGMOD’ 00
 J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by
Opportunistic Projection. KDD'02
 J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed
Patterns without Minimum Support. ICDM'02
Ref: Vertical Format and Row
Enumeration Methods
 M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel
algorithm for discovery of association rules. DAMI:97.
 M. J. Zaki and C. J. Hsiao. CHARM: An Efficient Algorithm for
Closed Itemset Mining, SDM'02.
 C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A
Dual-Pruning Algorithm for Itemsets with Constraints. KDD’02.
 F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki ,
CARPENTER: Finding Closed Patterns in Long Biological
Datasets. KDD'03.
 H. Liu, J. Han, D. Xin, and Z. Shao, Mining Interesting Patterns
from Very High Dimensional Data: A Top-Down Row
Enumeration Approach, SDM'06.
Ref: Mining Correlations and
Interesting Rules
 S. Brin, R. Motwani, and C. Silverstein. Beyond market basket:
Generalizing association rules to correlations. SIGMOD'97.
 M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I.
Verkamo. Finding interesting rules from large sets of discovered
association rules. CIKM'94.
 R. J. Hilderman and H. J. Hamilton. Knowledge Discovery and
Measures of Interest. Kluwer Academic, 2001.
 C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable
techniques for mining causal structures. VLDB'98.
 P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right
Interestingness Measure for Association Patterns. KDD'02.
 E. Omiecinski. Alternative Interest Measures for Mining
Associations. TKDE’03.
 T. Wu, Y. Chen, and J. Han, “Re-Examination of Interestingness
Measures in Pattern Mining: A Unified Framework", Data Mining and
Knowledge Discovery, 21(3):371-397, 2010.