Data Mining Unit 1

This document discusses data mining, including its definition, importance, types of data that can be mined, functionalities, patterns that can be discovered, and classifications of data mining systems. Data mining involves analyzing large amounts of data to discover useful patterns and trends. It is used in applications like market analysis, fraud detection, and production control.


1.1 Motivation
Data mining is the procedure of finding useful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.

1.2 Importance of Data Mining


The information or knowledge extracted so can be used for any of the following applications −

Market Analysis

Fraud Detection

Customer Retention

Production Control

Science Exploration

Market Analysis and Management


Listed below are the various fields of market where data mining is used −

Customer Profiling − Data mining helps determine what kind of people buy what kind of products.

Identifying Customer Requirements − Data mining helps in identifying the best products for different customers. It uses prediction to find the factors that may
attract new customers.

Cross-Market Analysis − Data mining identifies associations and correlations between the sales of different products.

Target Marketing − Data mining helps to find clusters of model customers who share the same characteristics such as interests, spending habits, income, etc.

Determining Customer Purchasing Patterns − Data mining helps in determining customers' purchasing patterns, i.e., what products customers tend to buy and when.

Providing Summary Information − Data mining provides us various multidimensional summary reports.
Corporate Analysis and Risk Management
Data mining is used in the following fields of the Corporate Sector −

Finance Planning and Asset Evaluation − It involves cash flow analysis and prediction, contingent claim analysis to evaluate assets.

Resource Planning − It involves summarizing and comparing the resources and spending.

Competition − It involves monitoring competitors and market directions.

Fraud Detection
Data mining is also used in the fields of credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it helps analyze the destination of the call, its duration, the time of day or week, etc., and identifies patterns that deviate from expected norms.

1.3 What is Data Mining?


Data Mining is defined as extracting information from huge sets of data. In other words, we can say that data mining is the procedure of mining
knowledge from data.

Figure 1.3 Data mining—searching for knowledge (interesting patterns) in data


❖ “The process of extracting information to identify patterns, trends, and useful data that would allow the business to take data-driven decisions from huge sets of data is called Data Mining.”

❖ “Data mining is the process of analyzing massive volumes of data to discover business intelligence that helps companies solve problems, mitigate risks, and seize new opportunities.”

❖ “Data mining is the process of finding anomalies, patterns and correlations within large data sets to predict outcomes. Using a broad range of techniques, you can use this information to increase revenues, cut costs, improve customer relationships, reduce risks and more.”

1.4 What Kinds of Data Can Be Mined?


Data mining can be performed on the following types of data:

1. Relational Database

A relational database is a collection of multiple data sets formally organized into tables, records, and columns, from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.


2. Data Warehouse

A data warehouse is a technology that collects data from various sources within the organization to provide meaningful business insights. The huge amount of data comes from multiple places such as Marketing and Finance. The extracted data is utilized for analytical purposes and helps in decision-making for a business organization. The data warehouse is designed for the analysis of data rather than for transaction processing.

3. Data Repositories

A data repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases where an organization has stored various kinds of information.

4. Object-Relational Database

A combination of an object-oriented database model and a relational database model is called an object-relational model. It supports classes, objects, inheritance, etc. One of the primary objectives of the object-relational data model is to close the gap between the relational database and the object-oriented model practices frequently utilized in many programming languages, for example, C++, Java, C#, and so on.

5. Transactional Database

A transactional database refers to a database management system (DBMS) that can undo a database transaction if it is not performed appropriately. Although this was a unique capability long ago, today most relational database systems support transactional database activities.

1.5 Data Mining Functionalities


Data mining functionalities are used to specify the kinds of patterns to be discovered in data mining tasks. In general, data mining tasks can be classified into two types: descriptive and predictive. Descriptive mining tasks characterize the general properties of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.

There are various data mining functionalities which are as follows −

Data characterization − It is a summarization of the general characteristics of an object class of data. The data corresponding to the user-specified class is
generally collected by a database query. The output of data characterization can be presented in multiple forms.

Data discrimination − It is a comparison of the general characteristics of target class data objects with the general characteristics of objects from one or a set of contrasting classes. The target and contrasting classes can be specified by the user, and the corresponding data objects retrieved through database queries.

Association Analysis − It analyses the set of items that frequently occur together in a transactional dataset. Two parameters are used for determining association rules −

Support, which identifies how frequently the itemset appears in the database.

Confidence, which is the conditional probability that an item occurs in a transaction given that another item occurs.
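As a sketch, support and confidence for a rule such as {bread} → {butter} can be computed over a handful of invented transactions (the items and counts below are purely illustrative):

```python
# Illustrative computation of support and confidence for the rule
# {bread} -> {butter} over a tiny, made-up set of transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"milk", "eggs"},
]

n = len(transactions)
count_bread = sum(1 for t in transactions if "bread" in t)
count_both = sum(1 for t in transactions if {"bread", "butter"} <= t)

support = count_both / n               # fraction of transactions with both items
confidence = count_both / count_bread  # P(butter | bread)

print(support, confidence)  # support = 0.5, confidence = 2/3
```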

Classification − Classification is the procedure of discovering a model that represents and distinguishes data classes or concepts, with the objective of using the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class labels are known).
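A minimal illustration of the idea of classification, using a simple nearest-neighbour rule rather than any particular algorithm from the text; the training data and labels are invented:

```python
# Minimal 1-nearest-neighbour classifier: the "model" here is just the
# training set itself; a new object takes the class label of its closest
# labelled neighbour. The feature vectors and labels are made up.
import math

train = [  # (feature vector, class label)
    ((1.0, 1.0), "low_risk"),
    ((1.2, 0.8), "low_risk"),
    ((6.0, 6.5), "high_risk"),
    ((5.5, 7.0), "high_risk"),
]

def classify(x):
    """Return the label of the training object nearest to x (Euclidean distance)."""
    return min(train, key=lambda p: math.dist(x, p[0]))[1]

label = classify((5.8, 6.8))
print(label)  # high_risk
```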

Prediction − It is used to predict missing or unavailable data values or pending trends. A value can be anticipated based on the attribute values of the object and the attribute values of similar objects. Prediction may involve missing numerical values or increase/decrease trends in time-related information.
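As a hedged sketch of numeric prediction, a least-squares trend line can be fitted to past values and extrapolated one step ahead (the sales figures below are invented):

```python
# Fit a least-squares line to past values and extrapolate the next one.
xs = [1, 2, 3, 4, 5]                  # e.g., months
ys = [10.0, 12.0, 13.9, 16.1, 18.0]   # e.g., sales (invented)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

next_value = slope * 6 + intercept    # predicted value for month 6
print(round(next_value, 2))           # 20.03
```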

Clustering − Clustering is similar to classification, except that the classes are not predefined; the groups are derived from the data attributes themselves, making it a form of unsupervised learning. Objects are clustered or grouped on the principle of maximizing intraclass similarity and minimizing interclass similarity.
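A bare-bones sketch of the clustering idea, using a tiny hand-rolled k-means on made-up one-dimensional points:

```python
# Tiny k-means sketch (k = 2) on invented 1-D points: no predefined
# classes; groups emerge by maximizing within-cluster similarity.
points = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
centroids = [points[0], points[-1]]   # naive initialisation

for _ in range(10):                   # a few refinement passes
    clusters = [[], []]
    for p in points:
        nearest = min(range(2), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # [1.5, 8.5]
```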
Outlier analysis − Outliers are data elements that cannot be grouped into any given class or cluster; they are data objects whose behaviour differs markedly from the general behaviour of the other data objects. The analysis of such data can be essential for mining knowledge.
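A simple illustration of outlier analysis: flag values that lie far from the mean (the 2-standard-deviation threshold and the data are illustrative only):

```python
# Flag outliers as values more than 2 standard deviations from the mean.
data = [10, 11, 9, 10, 12, 11, 10, 50]

mean = sum(data) / len(data)
std = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5

outliers = [x for x in data if abs(x - mean) > 2 * std]
print(outliers)  # [50]
```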

Evolution analysis − It describes the trends for objects whose behaviour changes over time.

1.6 Kinds of Patterns
Using the most relevant data (which may come from organizational databases or may be obtained from outside sources), data mining builds models to identify
patterns among the attributes (i.e., variables or characteristics) that exist in a data set

Associations find commonly co-occurring groupings of things, such as “beers and diapers” or “bread and butter” commonly purchased and observed together in
a shopping cart (i.e., market-basket analysis). Another type of association pattern captures the sequences of things. These sequential relationships can discover
time-ordered events, such as predicting that an existing banking customer who already has a checking account will open a savings account followed by an
investment account within a year.

Predictions tell the nature of future occurrences of certain events based on what has happened in the past, such as predicting the winner of the Super Bowl or
forecasting the absolute temperature on a particular day.

Clusters identify natural groupings of things based on their known characteristics, such as assigning customers in different segments based on their
demographics and past purchase behaviors.
1.7 Data Mining System Classification
A data mining system can be classified according to the following criteria −

Database Technology

Statistics

Machine Learning

Information Science

Visualization

Other Disciplines

Apart from these, a data mining system can also be classified based on the kind of (a) databases mined, (b) knowledge mined, (c) techniques utilized, and (d)
applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases mined. Database system can be classified according to different criteria such as data
models, types of data, etc. And the data mining system can be classified accordingly.

For example, if we classify a database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining
system.

Classification Based on the kind of Knowledge Mined


We can classify a data mining system according to the kind of knowledge mined. It means the data mining system is classified on the basis of functionalities such
as −

Characterization

Discrimination

Association and Correlation Analysis

Classification

Prediction

Outlier Analysis

Evolution Analysis

Classification Based on the Techniques Utilized


We can classify a data mining system according to the kind of techniques used. We can describe these techniques according to the degree of user interaction
involved or the methods of analysis employed.

Classification Based on the Applications Adapted


We can classify a data mining system according to the applications adapted. These applications are as follows −
Finance

Telecommunications

DNA

Stock Markets

E-mail

1.8 Data Mining Task Primitives


We can specify a data mining task in the form of a data mining query.

This query is input to the system.

A data mining query is defined in terms of data mining task primitives.

Note − These primitives allow us to communicate in an interactive manner with the data mining system. Here is the list of Data Mining Task Primitives −

Set of task relevant data to be mined.

Kind of knowledge to be mined.

Background knowledge to be used in the discovery process.

Interestingness measures and thresholds for pattern evaluation.

Representation for visualizing the discovered patterns.

1.9 Integrating a Data Mining System with a DB/DW System


If a data mining system is not integrated with a database or a data warehouse system, then there will be no system to communicate with. This scheme is known
as the non-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the
available data sets.
The list of Integration Schemes is as follows −
No Coupling − In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular
source and processes that data using some data mining algorithms. The data mining result is stored in another file.

Loose Coupling − In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or data warehouse.

Semi−tight Coupling − In this scheme, the data mining system is linked with a database or a data warehouse system and in addition to that, efficient
implementations of a few data mining primitives can be provided in the database.

Tight coupling − In this coupling scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem
is treated as one functional component of an information system.

1.10 Major Issues in Data Mining


Data mining is not an easy task, as the algorithms used can get very complex and data is not always available at one place. It needs to be integrated from various
heterogeneous data sources. These factors also create some issues. Here in this tutorial, we will discuss the major issues regarding −

Mining Methodology and User Interaction

Performance Issues

Diverse Data Types Issues

The following diagram describes the major issues.


Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −

Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.

Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because it allows users to focus
the search for patterns, providing and refining data mining requests based on the returned results.
Incorporation of background knowledge − To guide discovery process and to express the discovered patterns, the background knowledge can be
used. Background knowledge may be used to express the discovered patterns not only in concise terms but at multiple levels of abstraction.

Data mining query languages and ad hoc data mining − Data Mining Query language that allows the user to describe ad hoc mining tasks, should be
integrated with a data warehouse query language and optimized for efficient and flexible data mining.

Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.

Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without data cleaning methods, the accuracy of the discovered patterns will be poor.

Pattern evaluation − Not all discovered patterns are interesting; some may represent common knowledge or lack novelty, so interestingness measures are needed to evaluate them.

Performance Issues
There can be performance-related issues such as follows −

Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.

Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel; the results from the partitions are then merged. Incremental algorithms update the mining results without re-mining the data from scratch.

Diverse Data Types Issues


Handling of relational and complex types of data − A database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.

Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore mining knowledge from them adds challenges to data mining.

1.11 Types of Data Sets and Attribute Values

Attributes

Attribute (or dimension, feature, variable): a data field representing a characteristic or feature of a data object.

E.g., customer_ID, name, address

Types:
Nominal: “red”, “black”, “blue”, …

Binary: 1/0, TRUE/FALSE

Numeric: quantitative

Interval-scaled

Ratio-scaled
Attribute Types

Nominal: categories, states, or “names of things”

Hair_color = {auburn, black, blond, brown, grey, red, white}

marital status, occupation, ID numbers, zip codes

Binary

Nominal attribute with only 2 states (0 and 1)

Symmetric binary: both outcomes equally important


e.g., gender

Asymmetric binary: outcomes not equally important.


e.g., medical test (positive vs. negative)

Convention: assign 1 to most important outcome (e.g., HIV positive)

Ordinal

Values have a meaningful order (ranking) but magnitude between successive values is not known.

Size = {small, medium, large}, grades, army rankings


Numeric Attribute Types

Quantity (integer or real-valued)


Interval

Measured on a scale of equal-sized units

Values have order

E.g., temperature in °C or °F, calendar dates

No true zero-point

Ratio

Inherent zero-point

We can speak of values as being an order of magnitude larger than the unit of measurement (e.g., 10 K is twice as high as 5 K).

e.g., temperature in Kelvin, length, counts, monetary quantities


Discrete and Continuous Attributes

Discrete Attribute

Has only a finite or countably infinite set of values

E.g., zip codes, profession, or the set of words in a collection of documents

Sometimes, represented as integer variables

Note: Binary attributes are a special case of discrete attributes

Continuous Attribute

Has real numbers as attribute values

E.g., temperature, height, or weight

Practically, real values can only be measured and represented using a finite number of digits

Continuous attributes are typically represented as floating-point variables

1.12.1 Basic Statistical Descriptions of Data

Motivation
To better understand the data: central tendency, variation and spread

Data dispersion characteristics


median, max, min, quantiles, outliers, variance, etc.

Numerical dimensions correspond to sorted intervals


Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals

Dispersion analysis on computed measures


Folding measures into numerical dimensions

Boxplot or quantile analysis on the transformed cube
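A few of the dispersion measures named above can be computed directly with Python's standard library (the data values are invented):

```python
# Central tendency and dispersion measures via the statistics module.
import statistics

data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print("median:", statistics.median(data))          # middle of the sorted data
print("min/max:", min(data), max(data))            # range endpoints
print("quartiles:", statistics.quantiles(data, n=4))  # Q1, median, Q3
print("variance:", round(statistics.variance(data), 1))
```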

1.12.2 Data Visualization

Why data visualization?

Gain insight into an information space by mapping data onto graphical primitives

Provide qualitative overview of large data sets

Search for patterns, trends, structure, irregularities, relationships among data

Help find interesting regions and suitable parameters for further quantitative analysis

Provide a visual proof of computer representations derived

Categorization of visualization methods:

Pixel-oriented visualization techniques

Geometric projection visualization techniques

Icon-based visualization techniques

Hierarchical visualization techniques

Visualizing complex data and relations

1.12.3 Measuring Data Similarity


Similarity
Numerical measure of how alike two data objects are

Value is higher when objects are more alike

Often falls in the range [0,1]
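One concrete similarity measure with exactly these properties is Jaccard similarity between sets, sketched below on invented item sets:

```python
# Jaccard similarity: |A ∩ B| / |A ∪ B|; 1.0 for identical sets,
# falling toward 0 as the sets differ. Item sets are illustrative.
def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

s1 = {"milk", "bread", "eggs"}
s2 = {"milk", "bread", "butter"}

print(jaccard(s1, s2))  # 0.5 (2 shared items out of 4 distinct)
```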


1.13 Preprocessing
1.13.1 Data Quality

Data Quality: Why do we preprocess the data?


Many characteristics act as deciding factors for data quality; incomplete and inconsistent information are common properties of large real-world databases. The factors used for data quality assessment are:
Accuracy:
There are many possible reasons for flawed or inaccurate data, e.g., attributes having incorrect values due to human or computer error.

Completeness:
Incomplete data can occur for a number of reasons; attributes of interest, such as customer information for sales and transaction data, may not always be available.

Consistency:
Incorrect data can also result from inconsistencies in naming conventions or data codes, or from inconsistent input field formats. Duplicate tuples also require cleaning.
Timeliness:
Timeliness also affects data quality. At the end of the month, several sales representatives may fail to file their sales records on time, and several corrections and adjustments flow in after month-end. The data stored in the database are therefore incomplete for a time after each month.
Believability:
It is reflective of how much users trust the data.

Interpretability:
It reflects how easily users can understand the data.
1.13.2 Major Tasks in Data Preprocessing
The major steps involved in data preprocessing are data cleaning, data integration, data reduction, and data transformation.

1. Data Cleaning

Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users believe the data are dirty, they are unlikely to trust the results of any data mining that has been applied. Furthermore, dirty data can cause confusion for the mining procedure, resulting in unreliable output. Although most mining routines have some procedures for dealing with incomplete or noisy data, they are not always robust. Instead, they may concentrate on avoiding overfitting the data to the function being modeled. Therefore, a useful preprocessing step is to run your data through some data cleaning routines.

Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from datasets; it also replaces missing values. Missing values can be handled in the following ways:

✓ Ignore the tuple

✓ Fill the missing value manually

✓ Use global constant to fill the missing values

✓ Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the

missing value.

✓ Use the most probable value to fill in the missing value.
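The central-tendency strategy above can be sketched as follows (the None entries stand for missing values; the ages are invented):

```python
# Fill missing values (None) in one attribute with the column mean,
# one of the strategies listed above.
ages = [23, None, 31, 27, None, 29]

known = [a for a in ages if a is not None]
mean_age = sum(known) / len(known)   # mean of the observed values

filled = [a if a is not None else mean_age for a in ages]
print(filled)  # [23, 27.5, 31, 27, 27.5, 29]
```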

Noisy data can be handled in the following ways:

✓ Binning method

✓ Regression method

✓ Clustering

Figure: Binning methods for data smoothing
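A minimal sketch of one binning method, smoothing by bin means, on an illustrative sorted data set:

```python
# Smoothing by bin means: sort the values, split them into equal-frequency
# bins, and replace each value with its bin's mean. Data is illustrative.
data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

smoothed = []
for i in range(0, len(data), bin_size):
    bin_values = data[i:i + bin_size]          # one equal-frequency bin
    mean = sum(bin_values) / len(bin_values)   # the bin's smoothing value
    smoothed.extend([mean] * len(bin_values))

print(smoothed)  # [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```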

2. Data Integration

Data integration is the process of combining data from multiple sources into a single dataset. It is one of the main components of data management, and it assists when information collected from diverse data origins must be merged to form a consistent whole.
3. Data Transformation

The change made to the format or structure of the data is called data transformation. This step can be simple or complex depending on the requirements. Data transformation involves the following methods:

✓ Smoothing

✓ Aggregation

✓ Normalization

✓ Attribute Selection

✓ Discretization

✓ Concept hierarchy generation
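Normalization, one of the methods listed above, can be sketched with min-max scaling into [0, 1] (the income figures are made up):

```python
# Min-max normalization: rescale values into [0, 1] via
# (x - min) / (max - min).
incomes = [12000, 30000, 54000, 73600]

lo, hi = min(incomes), max(incomes)
normalized = [(x - lo) / (hi - lo) for x in incomes]

print([round(v, 3) for v in normalized])  # smallest maps to 0.0, largest to 1.0
```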

4. Data Reduction

Data mining is a technique used to handle huge amounts of data, and with huge volumes of data, analysis becomes harder. Data reduction techniques address this: they aim to increase storage efficiency and reduce data storage and analysis costs. When the volume of data is huge, databases can become slower, costly to access, and challenging to store properly. Data reduction aims to present a reduced representation of the data in a data warehouse.

The various steps to data reduction are:

✓ Data Cube aggregation

✓ Attribute Subset Selection


✓ Numerosity Reduction

✓ Dimensionality Reduction

Although numerous methods of data preprocessing have been developed, data preprocessing remains
an active area of research, due to the huge amount of inconsistent or dirty data and the complexity of
the problem.

Data Discretization
Data discretization is a method of converting a large number of data values into a smaller set so that the evaluation and management of the data become easier. In other words, data discretization converts the values of a continuous attribute into a finite set of intervals with minimum loss of information.

We can understand this concept with the help of an example. Suppose we have an Age attribute with the following values (table before discretization):

Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77

A concept hierarchy for a given numeric attribute defines a discretization of that attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior). Although detail is lost by such generalization, the generalized data are more meaningful and easier to interpret.
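The Age example above can be discretized into such higher-level concepts; the cut-points below are illustrative, not standard:

```python
# Replace numeric ages with higher-level concepts (illustrative cut-points).
ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19,
        31, 33, 36, 42, 44, 46, 70, 74, 78, 77]

def age_concept(age):
    """Map a numeric age to a concept-hierarchy level."""
    if age < 20:
        return "young"
    elif age < 60:
        return "middle-aged"
    return "senior"

labels = [age_concept(a) for a in ages]
print(labels.count("young"), labels.count("middle-aged"), labels.count("senior"))
# 11 young, 6 middle-aged, 4 senior
```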

Some well-known techniques of data discretization:

➢ Histogram analysis

A histogram is a plot that represents the underlying frequency distribution of a continuous data set. Histograms assist in inspecting the data distribution, for example for outliers, skewness, and deviation from a normal distribution.
