Data Mining
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Course Description – Module Structure
M1 Introduction to Data Mining
M2 Data Preprocessing:
To understand the need for data preprocessing and various techniques used
in the context of Data Mining
M3 Data Exploration:
A preliminary exploration of the data to better understand its characteristics
BITS Pilani, Pilani Campus
Motivation
Books
Prescribed Text Book
Quiz
2) Data mining can also be applied to other forms of data such as ................
i) Data streams
ii) Sequence data
iii) Networked data
iv) Text data
v) Spatial data
3) Which of the following is not a data mining
functionality?
A) i, ii and iv only
B) ii, iii and iv only
C) i, ii and iii only
D) All i, ii, iii and iv
5) _____________ is the application of data
mining techniques to discover patterns from the
Web.
A. Text Mining.
B. Multimedia Mining.
C. Web Mining.
D. Link Mining
6) __________________ refers to the process of
deriving high-quality information from text.
A. Text Mining.
B. Image Mining.
C. Database Mining.
D. Multimedia Mining
Introduction to Data Mining
Why Data Mining
A search engine (e.g., Google) receives hundreds of millions of queries every day. Each
query can be viewed as a transaction where the user describes her or his information
need.
What novel and useful knowledge can a search engine learn from such a huge
collection of queries collected from users over time?
Some patterns found in user search queries can disclose invaluable knowledge.
For example, Google's Flu Trends uses specific search terms as indicators of flu activity.
Using aggregated Google search data, Flu Trends can estimate flu activity up to two
weeks faster than traditional systems can.
This example shows how data mining can turn a large collection of data into
knowledge that can help meet a current global challenge.
Many Definitions
– Non-trivial extraction of implicit, previously
unknown and potentially useful information from
data.
Data Mining Tasks
• Prediction Methods
– Use some variables to predict unknown or future
values of other variables.
• Description Methods
– Find human-interpretable patterns that describe
the data.
From [Fayyad et al.], Advances in Knowledge Discovery and Data Mining, 1996
• Classification [Predictive]
• Clustering [Descriptive]
• Association Rule Discovery [Descriptive]
• Sequential Pattern Discovery [Descriptive]
• Regression [Predictive]
• Deviation Detection [Predictive]
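As a small illustration of a predictive task from the list above, the following sketch fits a least-squares line and uses it to predict an unseen value. The data values and the names `fit_line` and `predict` are made up for illustration; this is one simple form of regression, not a method taken from the cited text.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict the target value for a new input x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: advertising spend vs. units sold (here y = 2x + 1)
spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

model = fit_line(spend, sales)
print(predict(model, 5.0))  # 11.0
```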
Classification: Applications
• Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
– Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.
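The direct marketing approach above can be sketched with a k-nearest-neighbour classifier. This is only one possible classifier, chosen here because it is short; the customer records, the two pre-scaled attributes, and the name `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k closest training rows.

    train: list of (features, label) pairs; query: a feature tuple.
    """
    by_distance = sorted(train, key=lambda row: math.dist(row[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical earlier campaign for a similar product.
# Features: (age in decades, income in units of 10k); label: the class attribute.
past_customers = [
    ((2.5, 3.0), "buy"),
    ((3.0, 4.0), "buy"),
    ((2.8, 3.5), "buy"),
    ((5.5, 2.0), "don't buy"),
    ((6.0, 1.5), "don't buy"),
    ((5.0, 2.5), "don't buy"),
]

print(knn_predict(past_customers, (2.7, 3.2)))  # buy
print(knn_predict(past_customers, (5.8, 1.8)))  # don't buy
```

Only the customers predicted as "buy" would then be mailed, which is where the cost reduction comes from.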
• Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
• Use credit card transactions and information about the
account holder as attributes.
– When a customer buys, what they buy, how often they pay
on time, etc.
• Label past transactions as fraud or fair transactions. This
forms the class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.
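The steps above (label past transactions, learn a model, apply it to new transactions) can be sketched with a nearest-centroid classifier. The choice of model, the two attributes, and the sample transactions are assumptions made for illustration; real fraud detection uses far richer attributes and models.

```python
import math


def centroids(labeled):
    """Learn the model: the average feature vector of each class."""
    sums, counts = {}, {}
    for features, label in labeled:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify(model, features):
    """Assign the class whose centroid is closest to the new transaction."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Hypothetical labeled history: (amount in hundreds, hour of day / 24)
history = [
    ((0.2, 0.50), "fair"),
    ((0.5, 0.60), "fair"),
    ((0.3, 0.75), "fair"),
    ((9.0, 0.10), "fraud"),
    ((7.5, 0.15), "fraud"),
]

fraud_model = centroids(history)
print(classify(fraud_model, (8.0, 0.12)))  # fraud
print(classify(fraud_model, (0.4, 0.55)))  # fair
```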
• Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be
lost to a competitor.
– Approach:
• Use detailed records of transactions with each of the
past and present customers to find attributes.
– How often the customer calls, where they call, what time of
day they call most, their financial status, marital status, etc.
• Label the customers as loyal or disloyal.
From [Berry & Linoff] Data Mining Techniques, 1997
• Unsupervised Learning
• Document Clustering:
– Goal: To find groups of documents that are similar to
each other based on the important terms appearing in
them.
– Approach: Identify frequently occurring terms in
each document. Form a similarity measure based on
the frequencies of different terms. Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to
relate a new document or search term to clustered
documents.
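The document clustering approach above can be sketched as follows: count term frequencies, compare documents with cosine similarity, and group them. The grouping step here is a simple single-pass threshold scheme picked for brevity, not a method prescribed by the slides; the sample documents, the threshold value, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each whitespace-separated term occurs."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Single pass: join the first cluster whose seed document is similar
    enough, otherwise start a new cluster. Returns lists of document indices."""
    vectors = [term_frequencies(d) for d in docs]
    clusters = []  # the first index in each list is that cluster's seed
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


docs = [
    "stock market shares fall",
    "market shares rise as stock rallies",
    "team wins football match",
    "football team loses match",
]
print(cluster(docs))  # [[0, 1], [2, 3]]
```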
• Sequential Pattern Discovery: given a set of objects, each associated with its own timeline of
events, find rules that predict strong sequential dependencies among different
events.
(A B) (C) → (D E)
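To make the sequential rule concrete, the sketch below measures how often the events (A B) followed by (C) are later followed by (D E) across a set of timelines. The timelines, the support and confidence measures as defined here, and the function names are assumptions for illustration; it evaluates one given rule rather than discovering rules.

```python
def contains(sequence, pattern):
    """True if the pattern's itemsets occur, in order, as subsets of
    distinct events in the sequence."""
    position = 0
    for event in sequence:
        if position < len(pattern) and pattern[position] <= event:
            position += 1
    return position == len(pattern)


def support(sequences, pattern):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


def rule_confidence(sequences, antecedent, consequent):
    """Of the sequences containing the antecedent, the fraction that also
    contain the antecedent followed by the consequent."""
    base = support(sequences, antecedent)
    return support(sequences, antecedent + consequent) / base if base else 0.0


# Hypothetical timelines, one per object; each event is a set of items.
timelines = [
    [{"A", "B"}, {"C"}, {"D", "E"}],
    [{"A", "B", "F"}, {"C"}, {"G"}, {"D", "E"}],
    [{"A", "B"}, {"C"}, {"D"}],
    [{"C"}, {"A", "B"}, {"D", "E"}],
]
antecedent = [{"A", "B"}, {"C"}]
consequent = [{"D", "E"}]

print(support(timelines, antecedent + consequent))  # 0.5
print(rule_confidence(timelines, antecedent, consequent))
```

Here two of the four timelines contain the full pattern, and three contain the antecedent, so the rule holds in two out of three cases where it applies.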
• Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection
Which Technologies Are Used?
Major Issues in Data Mining (1)
• Mining Methodology
– Mining various and new kinds of knowledge
– Mining knowledge in multi-dimensional space
– Data mining: An interdisciplinary effort
– Handling noise, uncertainty, and incompleteness of data
– Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
– Interactive mining
– Incorporation of background knowledge
– Presentation and visualization of data mining results
Major Issues in Data Mining (2)
Homeplay
Summary
• Data mining: Discovering interesting patterns and knowledge from massive
amounts of data
• A natural evolution of database technology, in great demand, with wide
applications
• A KDD process includes data cleaning, data integration, data selection,
transformation, data mining, pattern evaluation, and knowledge
presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination, association,
classification, clustering, outlier and trend analysis, etc.
• Data mining technologies and applications
• Major issues in data mining
Virtual Lab - EDA
• Python
• Weka
Contact Session 2: Data Preprocessing
• RLs on Module 2
– Types of Data (Nominal, Categorical)
– Data Quality
– Data Preprocessing Tasks
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Data compression
• Data transformation and data discretization
– Normalization
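Two of the preprocessing tasks listed above, filling in missing values and normalization, can be sketched in a few lines. Mean imputation and min-max scaling are just one common choice for each task; the sample values and function names are hypothetical.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace missing entries (None) with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Data transformation: rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to rescale.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


# Hypothetical attribute with one missing value (income in thousands)
incomes = [30.0, None, 50.0, 70.0]

cleaned = fill_missing_with_mean(incomes)
print(cleaned)                     # [30.0, 50.0, 50.0, 70.0]
print(min_max_normalize(cleaned))  # [0.0, 0.5, 0.5, 1.0]
```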
Thank You