DM Unit2(Part1)
Techniques, Issues, Applications, Data Objects and Attribute Types, Statistical Descriptions of Data,
Data Preprocessing Techniques, Data Visualization, Data Similarity and Dissimilarity Measures.
Relational data can be accessed by database queries written in a relational query language (e.g.,
SQL) or with the assistance of graphical user interfaces. A given query is transformed into a set of
relational operations, such as join, selection, and projection, and is then optimized for efficient
processing. When mining relational databases, we can go further by searching for trends or data
patterns.
2.Data Warehouses:
A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and usually residing at a single site. Data warehouses are constructed via a process
of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
Figure 2 shows the typical framework for construction and use of a data warehouse for All
Electronics.
Figure 2: Typical framework of a data warehouse for AllElectronics
To facilitate decision making, the data in a data warehouse are organized around major subjects
(e.g., customer, item, supplier, and activity). The data are stored to provide information from a
historical perspective, such as in the past 6 to 12 months, and are typically summarized. For example,
rather than storing the details of each sales transaction, the data warehouse may store a summary of
the transactions per item type for each store or, summarized to a higher level, for each sales region.
A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in
which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell
stores the value of some aggregate measure such as count or sum(sales_amount). A data cube
provides a multidimensional view of data and allows the precomputation and fast access of
summarized data.
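The cell-level aggregation a data cube precomputes can be sketched in a few lines. This is a minimal, hypothetical example (the records and store names are made up, not from the text): each cube cell holds sum(sales_amount) for one (item_type, store) combination, and a roll-up produces the higher-level summary per item type.

```python
from collections import defaultdict

# Hypothetical sales records: (item_type, store, sales_amount)
sales = [
    ("laptop", "store_A", 1200.0),
    ("laptop", "store_B", 900.0),
    ("phone",  "store_A", 600.0),
    ("phone",  "store_A", 650.0),
]

# Each cube cell stores an aggregate measure, here sum(sales_amount),
# for one combination of dimension values.
cube = defaultdict(float)
for item, store, amount in sales:
    cube[(item, store)] += amount

# Roll up to a higher level: total per item_type across all stores.
per_item = defaultdict(float)
for (item, _store), total in cube.items():
    per_item[item] += total

print(cube[("phone", "store_A")])   # 1250.0
print(per_item["laptop"])           # 2100.0
```

A real data cube engine precomputes many such aggregates at multiple levels so that summarized data can be accessed without rescanning the raw transactions.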
3. Transactional Data:
In general, each record in a transactional database captures a transaction, such as a customer’s
purchase, a flight booking, or a user’s clicks on a web page.
A transaction typically includes a unique transaction identity number (trans_ID) and a list of the
items making up the transaction, such as the items purchased in the transaction.
A transactional database may have additional tables, which contain other information related to
the transactions, such as item description, information about the sales person or the branch and so
on.
A mining system may find association rules like
age(X, “20..29”) ∧ income(X, “40K..49K”) ⇒ buys(X, “laptop”) [support = 2%, confidence = 60%].
The rule indicates that of the All Electronics customers under study, 2% are 20 to 29 years old with an income
of $40,000 to $49,000 and have purchased a laptop (computer) at All Electronics. There is a 60%
probability that a customer in this age and income group will purchase a laptop. Note that this is an
association involving more than one attribute or predicate (i.e., age, income, and buys).
Adopting the terminology used in multidimensional databases, where each attribute is referred to as a
dimension, the above rule can be referred to as a multidimensional association rule.
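Support and confidence for such a rule can be computed by simple counting. The sketch below uses four made-up transactions (the predicates mirror the rule above, but the data are hypothetical): support is the fraction of transactions containing all the rule's predicates, and confidence is the conditional probability of the consequent given the antecedent.

```python
# Hypothetical transactions; each is a set of (attribute, value) predicates.
transactions = [
    {("age", "20..29"), ("income", "40K..49K"), ("buys", "laptop")},
    {("age", "20..29"), ("income", "40K..49K"), ("buys", "laptop")},
    {("age", "30..39"), ("income", "40K..49K")},
    {("age", "20..29"), ("income", "50K..59K"), ("buys", "phone")},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every predicate in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

ante = {("age", "20..29"), ("income", "40K..49K")}
cons = {("buys", "laptop")}
print(support(ante | cons, transactions))    # 0.5
print(confidence(ante, cons, transactions))  # 1.0
```

On this toy data the rule holds with support 50% and confidence 100%; the 2% / 60% figures in the text refer to the (much larger) AllElectronics customer data.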
Figure 3: A classification model can be represented in various forms: (a) IF-THEN rules, (b) a decision tree,
or (c) a neural network.
A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute value,
each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
Decision trees can easily be converted to classification rules.
A neural network, when used for classification, is typically a collection of neuron-like processing units with
weighted connections between the units. There are many other methods for constructing classification
models, such as naive Bayesian classification,support vector machines, and k-nearest-neighbor classification.
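IF-THEN rules of the kind produced by converting a decision tree map directly onto conditionals. The rules below are a hypothetical toy example (the attribute names and rule conditions are illustrative, not taken from the text):

```python
def classify(customer):
    """Toy IF-THEN rule classifier; the rules are hypothetical examples
    of what a decision tree might be converted into."""
    if customer["age"] == "youth" and customer["student"] == "yes":
        return "buys_computer"
    if customer["age"] == "middle_aged":
        return "buys_computer"
    if customer["age"] == "senior" and customer["credit"] == "excellent":
        return "buys_computer"
    return "does_not_buy"  # default rule when no condition fires

print(classify({"age": "youth", "student": "yes", "credit": "fair"}))   # buys_computer
print(classify({"age": "senior", "student": "no", "credit": "fair"}))   # does_not_buy
```

Each `if` corresponds to one root-to-leaf path of a decision tree, which is why the conversion between the two representations is straightforward.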
Whereas classification predicts categorical (discrete, unordered) labels, regression models continuous-
valued functions. That is, regression is used to predict missing or unavailable numerical data values rather
than (discrete) class labels. The term prediction refers to both numeric prediction and class label prediction.
Regression analysis is a statistical methodology that is most often used for numeric prediction, although
other methods exist as well. Regression also encompasses the identification of distribution trends based on
the available data.
Classification and regression may need to be preceded by relevance analysis, which attempts to identify
attributes that are significantly relevant to the classification and regression process. Such attributes will be
selected for the classification and regression process. Other attributes, which are irrelevant, can then be
excluded from consideration.
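As a minimal sketch of numeric prediction, the simplest regression model fits a line y = a + b·x by least squares. The numbers below are made up for illustration:

```python
# Least-squares fit of y = a + b*x on a toy data set (made-up numbers).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope b = covariance(x, y) / variance(x); intercept a = mean_y - b*mean_x
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

# Use the fitted model to predict a missing numeric value at x = 6
print(round(a + b * 6, 2))  # 11.94
```

This is exactly the "predict a continuous-valued function" task: the model returns a number rather than a discrete class label.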
Cluster Analysis
Unlike classification and regression, which analyze class-labeled (training) data sets, clustering analyzes data
objects without consulting class labels. In many cases, class-labeled data may simply not exist at the
beginning.
Clustering can be used to generate class labels for a group of data.
The objects are clustered or grouped based on the principle of maximizing the intra class similarity and
minimizing the interclass similarity. That is, clusters of objects are formed so that objects within a cluster
have high similarity in comparison to one another, but are rather dissimilar to objects in other clusters.
Each cluster so formed can be viewed as a class of objects, from which rules can be derived.
Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy
of classes that group similar events together
Figure 1.10 A 2-D plot of customer data with respect to customer locations in a city, showing three data
clusters.
Example 1.9 Cluster analysis. Cluster analysis can be performed on AllElectronics customer data to identify
homogeneous subpopulations of customers. These clusters may represent individual target groups for
marketing. Figure 1.10 shows a 2-D plot of customers with respect to customer locations in a city.
Three clusters of data points are evident.
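The grouping principle above (maximize intra-cluster similarity, minimize inter-cluster similarity) can be sketched with a minimal k-means implementation. The 2-D points below are made-up stand-ins for customer locations, not the data behind Figure 1.10:

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: repeatedly assign each point to its nearest
    center, then move each center to the mean of its assigned points."""
    random.seed(seed)
    centers = random.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[i].append(p)
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]  # keep the old center if a cluster emptied
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Three made-up groups of 2-D customer locations
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9)]
centers, clusters = kmeans(points, k=3)
print(sorted(len(c) for c in clusters))
```

No class labels are consulted anywhere: the three groups emerge purely from the distances between points, which is what distinguishes clustering from classification.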
Outlier Analysis: A data set may contain objects that do not comply with the general behavior or model of
the data. These data objects are outliers. Many data mining methods discard outliers as noise or exceptions.
However, in some applications (e.g., fraud detection) the rare events can be more interesting than the more
regularly occurring ones. The analysis of outlier data is referred to as outlier analysis or anomaly mining.
Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or
using distance measures where objects that are remote from any other cluster are considered outliers. Rather
than using statistical or distance measures, density-based methods may identify outliers that are abnormal
relative to their local region, even though they look normal from a global statistical distribution view.
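The distance-based notion mentioned above can be sketched directly: flag a point as an outlier if too few of the other points lie within a chosen distance of it. The points, distance threshold, and fraction below are all hypothetical choices for illustration:

```python
import math

def distance_outliers(points, d, min_frac=0.1):
    """Simple distance-based outlier sketch: a point is flagged if fewer
    than min_frac of the other points lie within distance d of it."""
    out = []
    for i, p in enumerate(points):
        near = sum(1 for j, q in enumerate(points)
                   if j != i and math.dist(p, q) <= d)
        if near < min_frac * (len(points) - 1):
            out.append(p)
    return out

# Made-up data: a tight cluster plus one far-away point
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
print(distance_outliers(pts, d=2.0))  # [(10, 10)]
```

In a fraud-detection setting, the flagged objects would be exactly the rare events worth inspecting rather than discarding.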
Major Issues in data mining: Data mining is a dynamic and fast-expanding field with great strengths.
The major issues can be divided into five groups:
a) Mining Methodology
b) User Interaction
c) Efficiency and scalability
d) Diverse Data Types Issues
e) Data Mining and Society
a) Mining Methodology:
It refers to the following kinds of issues −
Mining different kinds of knowledge in databases − Different users may be
interested in different kinds of knowledge. Therefore, data mining needs to
cover a broad range of knowledge discovery tasks.
Mining knowledge in multidimensional space − When searching for knowledge in
large data sets, we can explore the data in multidimensional space.
Handling noisy or incomplete data − Data cleaning methods are required to
handle noise and incomplete objects while mining data regularities. Without
such methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation − The patterns discovered should be interesting; patterns that
merely represent common knowledge or lack novelty add little value, so measures
of pattern interestingness are needed to guide the discovery process.
b) User Interaction:
Interactive mining of knowledge at multiple levels of abstraction − The
data mining process needs to be interactive because it allows users to focus
the search for patterns, providing and refining data mining requests based
on the returned results.
Incorporation of background knowledge − To guide discovery process and
to express the discovered patterns, the background knowledge can be
used. Background knowledge may be used to express the discovered
patterns not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining − Data Mining Query
language that allows the user to describe ad hoc mining tasks, should be
integrated with a data warehouse query language and optimized for
efficient and flexible data mining.
Presentation and visualization of data mining results − Once the patterns
are discovered, they need to be expressed in high-level languages and visual
representations. These representations should be easily understandable.
c) Efficiency and Scalability
There can be performance-related issues such as follows −
Efficiency and scalability of data mining algorithms − In order to effectively
extract information from the huge amounts of data in databases, data
mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors such
as the huge size of databases, the wide distribution of data, and the complexity
of data mining methods motivate the development of parallel and distributed
data mining algorithms. These algorithms divide the data into partitions,
which are processed in parallel; the results from the partitions are then
merged. Incremental algorithms incorporate database updates without having
to mine the entire data again from scratch.
d) Diverse Data Types Issues
Handling of relational and complex types of data − The database may
contain complex data objects, multimedia data objects, spatial data,
temporal data, and so on. It is not possible for one system to mine all
these kinds of data.
Mining information from heterogeneous databases and global
information systems − The data are available at different data
sources on LANs or WANs. These data sources may be structured, semi-
structured, or unstructured. Mining knowledge from them therefore
adds challenges to data mining.
e) Data Mining and Society
Social impacts of data mining – With data mining penetrating our everyday
lives, it is important to study the impact of data mining on society.
Privacy-preserving data mining – While data mining helps scientific discovery,
business management, economic recovery, and security protection, it can also
pose a threat to privacy; privacy-preserving data mining methods are therefore needed.
Invisible data mining – we cannot expect everyone in society to learn and
master data mining techniques. More and more systems should have data
mining functions built within so that people can perform data mining or use
data mining results simply by mouse clicking, without any knowledge of
data mining algorithms.
Data Mining Applications: The list of areas where data mining is widely used −
Financial Data Analysis
Retail Industry
Telecommunication Industry
Biological Data Analysis
Other Scientific Applications
Intrusion Detection
Financial Data Analysis: The financial data in banking and financial industry is generally reliable and of
high quality which facilitates systematic data analysis and data mining. Some of the typical cases are as
follows −
Design and construction of data warehouses for multidimensional data analysis and data mining.
Loan payment prediction and customer credit policy analysis.
Classification and clustering of customers for targeted marketing.
Detection of money laundering and other financial crimes.
Retail Industry: Data mining has great application in the retail industry because it collects large amounts of
data on sales, customer purchasing history, goods transportation, consumption, and services. It is natural that
the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and
popularity of the web.
Data mining in retail industry helps in identifying customer buying patterns and trends that lead to improved
quality of customer service and good customer retention and satisfaction. Here is the list of examples of data
mining in the retail industry −
Design and Construction of data warehouses based on the benefits of data mining.
Multidimensional analysis of sales, customers, products, time and region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the fastest-growing industries, providing various
services such as fax, pager, cellular phone, internet messenger, images, e-mail, and web data
transmission. Due to the development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is why data mining has become very
important in helping to understand the business.
Data mining in the telecommunication industry helps to identify telecommunication patterns, catch
fraudulent activities, make better use of resources, and improve quality of service.
2) Binary Attributes: A binary attribute is a nominal attribute with only two categories or states: 0 or 1,
where 0 typically means the attribute is absent and 1 means it is present.
Example 2.2: Suppose a patient undergoes a medical test that has two possible outcomes. The attribute medical
test is binary, where a value of 1 means the result of the test for the patient is positive, while 0 means the
result is negative.
a) Symmetric binary attribute: A binary attribute is symmetric if both of its states are equally
valuable and carry the same weight.
b) Asymmetric binary attribute: A binary attribute is asymmetric if the outcomes of its states are
not equally important.
3)Ordinal Attributes: An ordinal attribute is an attribute with possible values that have a meaningful order or
ranking among them, but the magnitude between successive values is not known.
Example 2.3, Eg 1: Suppose that drink size corresponds to the size of drinks available at a fast-food
restaurant. This ordinal attribute has three possible values: small, medium, and large. The values have a
meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the
values how much bigger, say, a large is than a medium.
Eg 2: Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on) and professional
rank. Professional ranks can be enumerated in a sequential order: for example, assistant, associate, and
full for professors, and private, private first class, specialist, corporal, and sergeant for army ranks.
Eg 3: In one survey, participants were asked to rate how satisfied they were as customers. Customer
satisfaction had the following ordinal categories: 0: very dissatisfied, 1: somewhat dissatisfied, 2:
neutral, 3: satisfied, and 4: very satisfied.
4)Numeric Attributes: A numeric attribute is quantitative; that is, it is a measurable quantity, represented in
integer or real values.
Numeric attributes are of two types: a) interval-scaled and b) ratio-scaled.
a) Interval-Scaled Attributes: An interval-scaled attribute has values whose differences are
interpretable, but it does not have a true reference point, or zero point.
Data measured on an interval scale can be added and subtracted, but not meaningfully multiplied or divided.
Consider temperature in degrees Centigrade: if one day's temperature is twice that of another day,
we cannot say that one day is twice as hot as the other, because 0°C does not mean "no temperature."
b) Ratio-Scaled Attributes: A ratio-scaled attribute is a numeric attribute with a fixed
zero point.
If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
The values are ordered, and we can compute the difference between values, as well as the mean, median,
mode, quantiles, range, and five-number summary.
5) Other attribute types are: a) Discrete Attributes b) Continuous Attributes
a) Discrete Attributes: A discrete attribute has a finite or countably infinite set of values;
the values may be numeric or categorical.
Example: zip codes, the number of words in a document, or drink size (small, medium, large).
b) Continuous Attributes: A continuous attribute has an infinite number of possible states and is typically
represented as a floating-point value; there can be many values between, say, 2 and 3.
Example: height, weight, or temperature.
3. Finally, we can use many graphic displays of basic statistical descriptions to visually
inspect our data. Most statistical or graphical data presentation software packages include bar
charts, pie charts, and line graphs.
Other popular displays of data summaries and distributions include quantile plots,
quantile–quantile plots, histograms, and scatter plots.
Measuring the Central Tendency: Mean, Median, and Mode
Measures of central tendency include the mean, median, mode, and midrange.
Mean: The most common and most effective numerical measure of the “center” of a set of data is the
arithmetic mean.
Let x1, x2, . . . , xN be a set of N values or observations, such as for some numeric attribute X.
Mean: x̄ = (x1 + x2 + · · · + xN) / N = (1/N) Σ xi
Sometimes, each value xi in a set may be associated with a weight wi. The weights reflect the significance
or importance attached to their respective values.
Weighted Mean: x̄ = (w1·x1 + w2·x2 + · · · + wN·xN) / (w1 + w2 + · · · + wN)
Example: Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30,
36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. Using Eq. we have
x̄ = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110) / 12 = 696/12 = 58
Thus, the mean salary is $58,000.
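The mean and weighted mean computations above can be checked in a few lines. The salary values are the ones from the example; the weights in the weighted-mean part are hypothetical, chosen only to illustrate the formula:

```python
# Salary data from the example, in thousands of dollars
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Arithmetic mean: sum of the values divided by their count
mean = sum(salaries) / len(salaries)
print(mean)  # 58.0

# Weighted mean on made-up (value, weight) pairs: sum(w*x) / sum(w)
values  = [30, 50, 70]
weights = [1, 2, 1]
weighted_mean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
print(weighted_mean)  # 50.0
```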
Median: The given data set of N values for an attribute X is sorted in increasing order.
If N is odd, then the median is the middle value of the ordered set.
If N is even, then the median is not unique; it is the two middlemost values and any value in between.
If X is a numeric attribute, by convention the median is taken as the average of the two middlemost values.
Example: The two middlemost values of the salary data are 52 and 56 (the sixth and seventh values in the
sorted list). The median is their average: (52 + 56)/2 = 108/2 = 54. Thus,
the median is $54,000.
Mode: The mode is another measure of central tendency.
The mode for a set of data is the value that occurs most frequently in the set.
Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
In general, a data set with two or more modes is multimodal. At the other extreme, if each data value
occurs only once, then there is no mode.
For unimodal numeric data that are moderately skewed (asymmetrical), we have the following empirical relation:
mean − mode ≈ 3 × (mean − median).
Midrange: The midrange can also be used to assess the central tendency of a numeric data set.
It is the average of the largest and smallest values in the set. This measure is easy to compute using the SQL
aggregate functions, max() and min().
Example: The midrange of the salary data is (30,000 + 110,000)/2 = $70,000.
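The median, mode, and midrange of the salary data can all be computed with the standard library. One detail worth noting: in this data both 52 and 70 occur twice, so the set is actually bimodal, which `statistics.multimode` makes visible:

```python
import statistics

# Salary data from the example, in thousands of dollars
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# N is even, so the median is the average of the two middlemost values
print(statistics.median(salaries))     # 54.0

# Both 52 and 70 occur twice: the data are bimodal
print(statistics.multimode(salaries))  # [52, 70]

# Midrange: average of the largest and smallest values
midrange = (min(salaries) + max(salaries)) / 2
print(midrange)                        # 70.0
```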
In a unimodal frequency curve with perfect symmetric data distribution, the mean, median, and mode
are all at the same center value.
In a positively skewed distribution, the mode occurs at a value that is smaller than the median;
in a negatively skewed distribution, the mode occurs at a value greater than the median.
Range: Let x1, x2, . . . , xN be a set of observations for some numeric attribute, X. The range
of the set is the difference between the largest (max()) and smallest (min()) values.
Quantiles: Quantiles are points taken at regular intervals of a data distribution, dividing it
into essentially equal-size consecutive sets.
Quartiles: The 4-quantiles are three data points that split the distribution into four equal parts,
each representing one-fourth of the data; they are more commonly referred to as quartiles.
Percentiles: The 100-quantiles are more commonly referred to as percentiles; they divide the data
distribution into 100 equal-sized consecutive sets. The median, quartiles, and percentiles are the
most widely used forms of quantiles
The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. The third
quartile, denoted by Q3, is the 75th percentile—it cuts off the lowest 75% (or highest 25%) of the data. The
second quartile is the 50th percentile. As the median, it gives the center of the data distribution.
InterQuartile Range: The distance between the first and third quartiles is a simple measure of spread that
gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR)
and is defined as
IQR = Q3 − Q1 .
Example: Interquartile range. The quartiles are the three values that split the sorted data set into four equal parts.
The data of Example 2.6 contain 12 observations, already sorted in increasing order. Thus, the quartiles for these
data are the third, sixth, and ninth values, respectively, in the sorted list.
Therefore, Q1 = $47,000 and Q3 = $63,000, and IQR = 63 − 47 = $16,000.
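The quartile positions used in the example (third, sixth, and ninth values for N = 12) can be computed directly. Note that this follows the notes' simple convention; other tools (e.g., `statistics.quantiles`) use interpolation and may return slightly different values:

```python
# Salary data from the example, already in increasing order
salaries = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
N = len(salaries)

# The notes' convention for N = 12: quartiles are the 3rd, 6th,
# and 9th values (1-indexed) of the sorted list.
q1 = salaries[N // 4 - 1]        # 3rd value
q2 = salaries[N // 2 - 1]        # 6th value
q3 = salaries[3 * N // 4 - 1]    # 9th value
iqr = q3 - q1
print(q1, q3, iqr)  # 47 63 16
```

The IQR of 16 (i.e., $16,000) is the spread covered by the middle half of the salaries.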
Five-Number Summary, Boxplots and Outliers
Five-Number Summary: This consists of five values: Minimum, Q1, Median (Q2), Q3, and Maximum.
These five numbers are represented graphically as a boxplot.
A boxplot incorporates the five-number summary as follows:
In a boxplot, the data are represented with a box.
The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
The median is marked by a line within the box.
Whiskers: Two lines outside the box extend to Minimum and Maximum.
To show outliers, the whiskers extend to the extreme low and high
observations only if these values are less than 1.5 × IQR beyond the
quartiles; observations beyond that are plotted individually.
Figure Boxplot for the unit price data for items sold at four branches of AllElectronics during a given time
period.
Example: Boxplot. The figure shows boxplots for unit price data for items sold at four branches of
AllElectronics during a given time period. For branch 1, we see that the median price of items sold is
$80, Q1 is $60, and Q3 is $100. Notice that two outlying observations for this branch were plotted
individually, as their values lie more than 1.5 × IQR beyond the quartiles.
Variance and Standard Deviation: The variance of N observations x1, x2, . . . , xN for a numeric attribute X is
σ² = (1/N) Σ (xi − x̄)² = (1/N) Σ xi² − x̄²,
where x̄ is the mean. The standard deviation σ is the square root of the variance.
Example: For the salary data, with mean x̄ = 58:
σ² = (1/12)(30² + 36² + 47² + · · · + 110²) − 58² ≈ 379.17
σ ≈ √379.17 ≈ 19.47
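The variance and standard deviation figures can be verified with the standard library. `pvariance`/`pstdev` compute the population versions (dividing by N), which is what the computation above uses:

```python
import statistics

# Salary data from the example, in thousands of dollars
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Population variance: (1/N) * sum(x_i^2) - mean^2
variance = statistics.pvariance(salaries)
stdev = statistics.pstdev(salaries)
print(round(variance, 2))  # 379.17
print(round(stdev, 2))     # 19.47
```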
The basic properties of the standard deviation, σ , as a measure of spread are as follows:
σ measures spread about the mean and should be considered only when the mean is chosen as the measure of
center
σ = 0 only when there is no spread, that is, when all observations have the same value.
Otherwise, σ > 0.
Graphical Displays of Basic Statistical Descriptions:
These include quantile plots, quantile–quantile plots, histograms, and scatter plots. Such graphs are helpful
for the visual inspection of data, which is useful for data preprocessing.
The first three of these show univariate distributions (i.e., data for one attribute), while scatter plots
show bivariate distributions (i.e., involving two attributes)
Quantile plots:
A quantile plot is a simple and effective way to have a first look at a
univariate data distribution.
Plots quantile information: for data x1, x2, . . . , xN sorted in increasing order, fi indicates that
approximately 100·fi % of the data are below or equal to the value xi, where fi = (i − 0.5)/N.
Note that
the 0.25 quantile corresponds to quartile Q1,
the 0.50 quantile is the median, and
the 0.75 quantile is Q3.
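The (fi, xi) pairs behind a quantile plot can be computed directly. The sketch below uses the salary data from earlier and assumes the common fi = (i − 0.5)/N convention:

```python
# Salary data from the earlier example, sorted in increasing order
data = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
N = len(data)

# f_i = (i - 0.5) / N for the i-th smallest value (1-indexed): roughly
# 100*f_i % of the data lie at or below x_i. Plotting the (f_i, x_i)
# pairs gives the quantile plot.
points = [((i - 0.5) / N, x) for i, x in enumerate(data, start=1)]
for f, x in points[:3]:
    print(round(f, 3), x)
```

Feeding these pairs to any plotting tool (f on the x-axis, the data value on the y-axis) produces the quantile plot; the pair nearest f = 0.5 sits at the median.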
Quantile - Quantile plots:
In statistics, a Q-Q plot is a probability plot, a graphical method for comparing two
probability distributions by plotting their quantiles against each other.
Example: A quantile–quantile plot for unit price data of items sold at two branches of
AllElectronics during a given time period. Each point corresponds to the same quantile for each
data set and shows the unit price of items sold at branch 1 versus branch 2 for that quantile.
Scatter plot: A scatter plot is one of the most effective graphical methods for determining whether there
appears to be a relationship, clusters of points, or outliers between two numeric attributes.
Each pair of values is treated as a pair of coordinates and plotted as points in the plane
The scatter plot is a useful method for providing a first look at bivariate data to see clusters of points
and outliers, or to explore the possibility of correlation relationships.
Two attributes, X, and Y , are correlated if one attribute implies the other. Correlations can be positive,
negative, or null (uncorrelated).
a) If pattern of plotted points slopes from lower left to upper right, this means that the
values of X increase as the values of Y increase, suggesting a positive correlation
b) If the pattern of plotted points slopes from upper left to lower right, the values of X
increase as the values of Y decrease, suggesting a negative correlation
There are three cases for which there is no correlation relationship between the two attributes.
Scatter plots can be extended to n attributes, resulting in a scatter-plot matrix.