Data Mining Unit 1
1 Motivation
Data mining is the procedure of finding useful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using
pattern recognition technologies as well as statistical and mathematical techniques. Its common application areas include:
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Customer Profiling − Data mining helps determine what kind of people buy what kind of products.
Identifying Customer Requirements − Data mining helps in identifying the best products for different customers. It uses prediction to find the factors that may
attract new customers.
Cross Market Analysis − Data mining identifies associations/correlations between product sales.
Target Marketing − Data mining helps to find clusters of model customers who share the same characteristics such as interests, spending habits, income, etc.
Determining Customer Purchasing Patterns − Data mining helps in determining customers' purchasing patterns.
Providing Summary Information − Data mining provides various multidimensional summary reports.
Corporate Analysis and Risk Management
Data mining is used in the following fields of the Corporate Sector −
Finance Planning and Asset Evaluation − It involves cash flow analysis and prediction, contingent claim analysis to evaluate assets.
Resource Planning − It involves summarizing and comparing the resources and spending.
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunication to detect fraud. For fraudulent telephone calls, it helps to find the
destination of the call, its duration, and the time of day or week, and it analyzes patterns that deviate from expected norms.
❖ “The process of extracting information to identify patterns, trends, and useful data that would allow
the business to take data-driven decisions from huge sets of data is called Data Mining.”
❖ “Data mining is the process of analyzing massive volumes of data to discover business
intelligence that helps companies solve problems, mitigate risks, and seize new opportunities.”
❖ “Data mining is the process of finding anomalies, patterns and correlations within large data sets to
predict outcomes. Using a broad range of techniques, you can use this information to increase
revenues, cut costs, improve customer relationships, reduce risks and more.”
1. Relational Database
A relational database is a collection of multiple data sets formally organized into tables, records,
and columns, from which data can be accessed in various ways without having to reorganize the
database tables. Tables convey and share information, which facilitates data searchability,
organization, and reporting.
2. Data Warehouse
A Data Warehouse is the technology that collects the data from various sources within the
organization to provide meaningful business insights. The huge amount of data comes from
multiple places such as Marketing and Finance. The extracted data is utilized for analytical
purposes and helps in decision-making for a business organization. The data warehouse is
designed for the analysis of data rather than for transaction processing.
3. Data Repositories
The term data repository generally refers to a destination for data storage. However, many IT
professionals use the term more specifically to refer to a particular kind of setup within an IT structure.
For example, a group of databases, where an organization has kept various kinds of information.
4. Object-Relational Database
An object-relational database is based on the object-relational model, which combines the relational
data model with object-oriented concepts. It supports classes, objects, inheritance, etc. One of the primary
objectives of the object-relational data model is to close the gap between the relational database
and the object-oriented modeling practices frequently used in many programming languages, for
example, C++ and Java.
5. Transactional Database
A transactional database refers to a database management system (DBMS) that has the potential
to undo a database transaction if it is not completed appropriately. Although this was a unique
capability long ago, today most relational database systems support
transactional activities (see the sketch below).
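To make the rollback behaviour concrete, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and balances are invented for illustration, and any relational DBMS with transaction support behaves analogously.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()

try:
    # Transfer 30 from alice to bob as one atomic transaction.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    raise RuntimeError("simulated failure mid-transaction")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE name = 'bob'")
    conn.commit()
except RuntimeError:
    conn.rollback()  # undo the partial update

print(conn.execute("SELECT * FROM accounts").fetchall())
# [('alice', 100.0), ('bob', 50.0)] -- the incomplete transfer was undone
```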
Data Mining Functionalities
Data characterization − It is a summarization of the general characteristics of a target class of data. The data corresponding to the user-specified class are
generally collected by a database query. The output of data characterization can be presented in multiple forms.
Data discrimination − It is a comparison of the general characteristics of target-class data objects with the general characteristics of objects from one or a set
of contrasting classes. The target and contrasting classes can be specified by the user, and the corresponding data objects are fetched through database queries.
Association Analysis − It analyses the set of items that generally occur together in a transactional dataset. Two parameters are used for
determining the association rules −
Support is the percentage of transactions in the dataset that contain all the items of the rule together.
Confidence is the conditional probability that an item occurs in a transaction given that another item occurs. A small worked example follows.
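As a rough illustration of the two measures, the sketch below computes support and confidence for a hypothetical rule bread → butter over a toy transaction list; all item names and transactions are invented.

```python
# Toy market-basket data; each set is one transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: bread -> butter
sup = support({"bread", "butter"})   # P(bread and butter together)
conf = sup / support({"bread"})      # P(butter | bread)
print(f"support={sup:.2f}, confidence={conf:.2f}")  # support=0.60, confidence=0.75
```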
Classification − Classification is the procedure of discovering a model that describes and distinguishes data classes or concepts, with the objective of using
the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data
(i.e., data objects whose class labels are known). A minimal sketch follows.
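A minimal classification sketch, assuming scikit-learn is available; the training objects, attributes, and class labels below are invented toy data, and a decision tree stands in for whatever model is derived.

```python
from sklearn.tree import DecisionTreeClassifier

# Training data: [age, income] with known class labels (toy values).
X_train = [[25, 30000], [45, 80000], [35, 60000], [50, 90000], [23, 25000]]
y_train = ["no", "yes", "yes", "yes", "no"]   # e.g., "buys the product?"

model = DecisionTreeClassifier().fit(X_train, y_train)

# Use the derived model to predict the class of an object whose label is unknown.
print(model.predict([[30, 55000]]))
```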
Prediction − It predicts unavailable data values or pending trends. An object can be anticipated based on the attribute values of the object and the
attribute values of the classes. It can be a prediction of missing numerical values or of increase/decrease trends in time-related information.
Clustering − It is similar to classification, but the classes are not predefined; the groups are determined by the data attributes. It is unsupervised learning. The
objects are clustered or grouped based on the principle of maximizing intra-class similarity and minimizing inter-class similarity, as sketched below.
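A minimal clustering sketch using scikit-learn's KMeans (assumed available); note that, unlike classification, no class labels are supplied, and the two groups are discovered from the invented points themselves.

```python
from sklearn.cluster import KMeans

# Unlabeled 2-D points (invented); two natural groups are visible.
X = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.8], [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # one centre per discovered group
```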
Outlier analysis − Outliers are data elements that cannot be grouped into a given class or cluster. These are data objects whose behaviour deviates
from the general behaviour of the other data objects. The analysis of this type of data can be essential for mining knowledge.
Evolution analysis − It describes the trends of objects whose behaviour changes over time.
1.6 Kinds of Patterns
Using the most relevant data (which may come from organizational databases or may be obtained from outside sources), data mining builds models to identify
patterns among the attributes (i.e., variables or characteristics) that exist in a data set
Associations find commonly co-occurring groupings of things, such as “beers and diapers” or “bread and butter” commonly purchased and observed together in
a shopping cart (i.e., market-basket analysis). Another type of association pattern captures the sequences of things. These sequential relationships can discover
time-ordered events, such as predicting that an existing banking customer who already has a checking account will open a savings account followed by an
investment account within a year.
Predictions tell the nature of future occurrences of certain events based on what has happened in the past, such as predicting the winner of the Super Bowl or
forecasting the temperature on a particular day.
Clusters identify natural groupings of things based on their known characteristics, such as assigning customers in different segments based on their
demographics and past purchase behaviors.
1.7 Data Mining System Classification
A data mining system can be classified according to the following criteria −
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Apart from these, a data mining system can also be classified based on the kind of (a) databases mined, (b) knowledge mined, (c) techniques utilized, and (d)
applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data
models, types of data, etc., and the data mining system can be classified accordingly.
For example, if we classify a database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining
system.
Classification Based on the Kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge mined, that is, on the basis of functionalities such as −
Characterization
Discrimination
Classification
Prediction
Outlier Analysis
Evolution Analysis
Classification Based on the Applications Adapted
We can also classify a data mining system according to the applications adapted, for example −
Telecommunications
DNA
Stock Markets
Data Mining Task Primitives
Note − These primitives allow us to communicate in an interactive manner with the data mining system. Here is the list of Data Mining Task Primitives −
Set of task-relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in the discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
Integration of a Data Mining System with a Database or Data Warehouse System
Data mining systems may be integrated with a database or data warehouse system using one of the following coupling schemes −
No Coupling − In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular
source, processes that data using data mining algorithms, and stores the results in another file.
Loose Coupling − In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the
data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a
database or in a data warehouse.
Semi−tight Coupling − In this scheme, the data mining system is linked with a database or a data warehouse system and in addition to that, efficient
implementations of a few data mining primitives can be provided in the database.
Tight coupling − In this coupling scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem
is treated as one functional component of an information system.
Mining Methodology and User Interaction Issues
Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, data mining
needs to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because it allows users to focus
the search for patterns, providing and refining data mining requests based on the returned results.
Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered
patterns, not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining − Data Mining Query language that allows the user to describe ad hoc mining tasks, should be
integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data
regularities. Without data cleaning methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation − The patterns discovered may be uninteresting because they represent common knowledge or lack novelty; interestingness measures are needed to evaluate them.
Performance Issues
There can be performance-related issues such as follows −
Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data
mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of
data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions,
which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update databases
without mining the data again from scratch. A rough partition-and-merge sketch follows.
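To make the partition-and-merge idea concrete, here is a rough sketch using Python's multiprocessing module; the "mining" step is reduced to simple frequency counting over invented data, purely to show partitions being processed in parallel and their results merged.

```python
from collections import Counter
from multiprocessing import Pool

def mine_partition(partition):
    # Mine one partition independently (here: frequency counting).
    return Counter(partition)

if __name__ == "__main__":
    data = ["a", "b", "a", "c", "b", "a", "c", "c", "a", "b"]
    partitions = [data[:5], data[5:]]       # divide the data into partitions

    with Pool(processes=2) as pool:
        partial = pool.map(mine_partition, partitions)  # process in parallel

    merged = sum(partial, Counter())        # merge the partition results
    print(merged)                           # Counter({'a': 4, 'b': 3, 'c': 3})
```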
Attributes
An attribute (or dimension, feature, variable) is a data field representing a characteristic
or feature of a data object.
Types:
Nominal: “red”, “black”, “blue”, …
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Binary
A nominal attribute with only two states, e.g., 0/1 or yes/no.
Ordinal
Values have a meaningful order (ranking) but the magnitude between successive values is not known.
Interval
Measured on a scale of equal-sized units; values have order, but there is no true zero-point.
Ratio
Inherent zero-point.
We can speak of values as being an order of magnitude larger than the unit of measurement (e.g., 10 K is twice as high as 5 K).
Continuous Attribute
Has real numbers as attribute values; practically, real values can only be measured and represented using a finite number of digits.
1.12.1 Basic Statistical Descriptions of Data
Motivation: to better understand the data in terms of central tendency, variation, and spread.
1.12.2 Data Visualization
Gain insight into an information space by mapping data onto graphical primitives
Help find interesting regions and suitable parameters for further quantitative analysis
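As a small illustration, the sketch below (assuming numpy and matplotlib are installed) maps an invented dataset onto two common graphical primitives: a scatter plot for spotting relationships and outliers, and a histogram for inspecting the distribution.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)            # invented attribute values
y = 0.5 * x + rng.normal(0, 5, 200)    # a second, correlated attribute

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.scatter(x, y, s=10)                # interesting regions and outliers stand out
ax1.set_title("Scatter plot")
ax2.hist(x, bins=20)                   # shape of the distribution
ax2.set_title("Histogram")
plt.show()
```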
1.13.1 Data Quality
Data have quality if they satisfy the requirements of their intended use. Factors comprising data
quality include accuracy, completeness, consistency, timeliness, believability, and interpretability.
Accuracy:
There are many possible reasons for flawed or inaccurate data, e.g., attribute
values that are incorrect due to human or computer errors.
Completeness:
Incomplete data can occur for a number of reasons; attributes of interest, such as customer
information for sales and transaction data, may not always be available.
Consistency:
Incorrect data can also result from inconsistencies in naming conventions or data
codes, or from inconsistent formats in input fields. Duplicate tuples also require
data cleaning.
Timeliness:
Timeliness also affects data quality. For example, at the end of the month several sales
representatives fail to file their sales records on time, and several corrections
and adjustments flow in after the month's end. The data stored in the database
are then incomplete for a time after each month.
Believability:
It is reflective of how much users trust the data.
Interpretability:
It reflects how easily users can understand the data.
1.13.2 Major Tasks in Data Preprocessing
The major steps involved in data preprocessing are data cleaning, data
integration, data reduction, and data transformation.
1. Data Cleaning
Data cleaning routines work to “clean” the data by filling in missing values, smoothing
noisy data, identifying or removing outliers, and resolving inconsistencies. If users
believe the data are dirty, they are unlikely to trust the results of any data mining that
has been applied. Furthermore, dirty data can cause confusion for the mining
procedure, resulting in unreliable output. Although most mining routines have some
procedures for dealing with incomplete or noisy data, they are not always robust.
Instead, they may concentrate on avoiding overfitting the data to the function being
modeled. Therefore, a useful preprocessing step is to run your data through some data
cleaning routines.
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data
from datasets; it also replaces missing values. Common methods include the
following (a small sketch follows the list):
✓Ignore the tuple
✓ Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the
missing value.
✓ Binning method
✓ Regression method
✓ Clustering
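A minimal sketch of two of these methods, assuming pandas and numpy are available; the age column and its values are invented.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [23, 25, np.nan, 40, 38, np.nan, 29]})

# Fill missing values with a measure of central tendency (here: the median).
df["age"] = df["age"].fillna(df["age"].median())

# Binning: partition values into equal-width bins, then smooth by bin means.
df["bin"] = pd.cut(df["age"], bins=3)
df["age_smooth"] = df.groupby("bin", observed=True)["age"].transform("mean")
print(df)
```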
2. Data Integration
Data integration is the process of combining multiple sources into a single dataset and is one of
the main components of data management. It assists when information collected from
diversified data origins must be merged to form consistent, unified
information.
3. Data Transformation
The change made in the format or the structure of the data is called data transformation. This step can
be simple or complex based on the requirements. Common transformation methods are listed below; a normalization sketch follows the list.
✓ Smoothing
✓ Aggregation
✓ Normalization
✓ Attribute Selection
✓ Discretization
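For instance, normalization rescales attribute values onto a common range; here is a minimal numpy sketch of min-max and z-score normalization over invented values:

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # invented attribute

# Min-max normalization: rescale into the range [0, 1].
min_max = (values - values.min()) / (values.max() - values.min())

# Z-score normalization: zero mean and unit standard deviation.
z_score = (values - values.mean()) / values.std()

print(min_max)   # [0.    0.125 0.25  0.5   1.   ]
print(z_score)
```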
4. Data Reduction
Data mining is a technique for handling huge amounts of data, and analysis becomes harder when
working with such volumes. To address this, we use data reduction
techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
When the volume of data is huge, databases can become slower, costly to access, and challenging to
properly store. Data reduction aims to present a reduced representation of the data in a data
warehouse.
✓ Dimensionality Reduction
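Principal component analysis (PCA) is one common dimensionality-reduction technique; a minimal sketch assuming scikit-learn is available, with a randomly generated stand-in dataset:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # 100 invented objects with 5 attributes

# Project onto 2 principal components: a reduced representation of the data.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)          # (100, 2)
```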
Although numerous methods of data preprocessing have been developed, data preprocessing remains
an active area of research, due to the huge amount of inconsistent or dirty data and the complexity of
the problem.
Data Discretization
Data discretization refers to a method of converting a huge number of data values into smaller
ones so that the evaluation and management of data become easy. In other words, data
discretization is a method of converting the attribute values of continuous data into a finite set of
intervals with minimal loss of information.
A concept hierarchy for a given numeric attribute defines a discretization of the attribute.
Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such
as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior).
Although detail is lost by such generalization, the generalized data become more meaningful and easier to interpret, as sketched below.
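A minimal sketch of this age generalization, assuming pandas is available; the ages and cut-points are invented.

```python
import pandas as pd

ages = pd.Series([16, 23, 35, 47, 58, 66, 72])

# Replace low-level numeric values with higher-level concepts.
concepts = pd.cut(ages, bins=[0, 30, 60, 120],
                  labels=["young", "middle-aged", "senior"])
print(concepts.tolist())
# ['young', 'young', 'middle-aged', 'middle-aged', 'middle-aged', 'senior', 'senior']
```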
➢ Histogram analysis
A histogram is a plot used to represent the underlying frequency distribution of a continuous
data set. Histograms assist in inspecting the data distribution and can reveal, for example, outliers or skewness.
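A rough sketch of histogram analysis using numpy's histogram function over invented values; the nearly empty high bins make the outlier easy to spot.

```python
import numpy as np

values = np.array([5, 7, 8, 8, 9, 10, 11, 12, 40])  # 40 is an outlier

counts, edges = np.histogram(values, bins=4)
for count, lo, hi in zip(counts, edges, edges[1:]):
    print(f"[{lo:5.1f}, {hi:5.1f}): {'*' * count}")
```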