DM Unit2(Part1)

The document provides an overview of data mining systems, including the knowledge discovery process, various data mining techniques, and the types of data that can be mined. It discusses the importance of data mining for extracting valuable insights from the increasing volume of information and outlines the different functionalities of data mining, such as classification, regression, clustering, and outlier analysis. Additionally, it describes the data preprocessing steps and the types of data sources applicable for data mining, including databases, data warehouses, and multimedia data.


Unit-2: Introduction to Data Mining System, Knowledge Discovery Process, Data Mining Techniques, Issues, Applications, Data Objects and Attribute Types, Statistical Description of Data, Data Preprocessing Techniques, Data Visualization, Data Similarity and Dissimilarity Measures.

Why Do We Need Data Mining?

The volume of information we have to handle is increasing every day, coming from business transactions, scientific data, sensor data, pictures, videos, and so on. We therefore need a system capable of extracting the essence of the available information and automatically generating reports, views, or summaries of the data for better decision-making.
What Is Data Mining?
 Data mining refers to extracting or "mining" knowledge from large amounts of data.
 In addition, many other terms have a similar meaning to data mining, such as knowledge mining from data, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.
 Many people treat data mining as a synonym for another popularly used term, knowledge discovery from data, or KDD, while others view data mining as merely an essential step in the process of knowledge discovery. The knowledge discovery process consists of the following steps:
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms appropriate for mining by performing summary or aggregation operations)
5. Data mining (an essential process where intelligent methods are applied to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on interestingness measures)
7. Knowledge presentation (where visualization and knowledge representation techniques are used to present mined knowledge to users)
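As a rough illustration, the first few KDD steps above can be sketched in plain Python. The records, field names (such as "amount"), and the two sources are invented for this sketch, not taken from the text:

```python
# Two hypothetical data sources to be integrated
sales_db = [{"id": 1, "item": "computer", "amount": 1200},
            {"id": 2, "item": "printer", "amount": None},   # noisy record
            {"id": 3, "item": "computer", "amount": 900}]
web_db = [{"id": 4, "item": "printer", "amount": 150}]

# 1. Data cleaning: remove records with missing values
cleaned = [r for r in sales_db if r["amount"] is not None]

# 2. Data integration: combine the two sources
integrated = cleaned + web_db

# 3. Data selection: keep only the fields relevant to the task
selected = [(r["item"], r["amount"]) for r in integrated]

# 4. Data transformation: consolidate by aggregation (total sales per item)
summary = {}
for item, amount in selected:
    summary[item] = summary.get(item, 0) + amount

print(summary)  # {'computer': 2100, 'printer': 150}
```

The later steps (mining, pattern evaluation, presentation) would then operate on such consolidated data.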
What Kinds of Data Can Be Mined?
As a general technology, data mining can be applied to any kind of data as long as the data are meaningful for a target application. The most basic forms of data for mining applications are:
1. Database Data
2. Data Warehouses
3. Transactional Data
4. Multimedia Databases
5. Spatial Databases
6. Time Series Databases
7. World Wide Web (WWW)
8. Flat Files
1. Database Data:
 A database system, also called a database management system (DBMS), consists of a collection of
interrelated data, known as a database, and a set of software programs to manage and access the
data. The software programs provide mechanisms for defining database structures and data storage;
for specifying and managing concurrent, shared, or distributed data access; and for ensuring
consistency and security of the information stored despite system crashes or attempts at
unauthorized access.
 A relational database is a collection of tables, each of which is assigned a unique name. Each table
consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or
rows). Each tuple in a relational table represents an object identified by a unique key and described by
a set of attribute values. A semantic data model, such as an entity-relationship (ER) data model, is
often constructed for relational databases. An ER data model represents the database as a set of
entities and their relationships.

 Relational data can be accessed by database queries written in a relational query language (e.g.,
SQL) or with the assistance of graphical user interfaces. A given query is transformed into a set of
relational operations, such as join, selection, and projection, and is then optimized for efficient
processing. When mining relational databases, we can go further by searching for trends or data
patterns.
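The relational access described above can be demonstrated with Python's built-in sqlite3 module, which stands in here for a full DBMS; the tables, columns, and values are made up for illustration:

```python
import sqlite3

# An in-memory relational database with two small tables
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customer (cust_id INTEGER, name TEXT)")
con.execute("CREATE TABLE sales (cust_id INTEGER, item TEXT, amount REAL)")
con.executemany("INSERT INTO customer VALUES (?, ?)",
                [(1, "Smith"), (2, "Lee")])
con.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(1, "computer", 1200.0), (2, "printer", 150.0)])

# Join, selection (WHERE), and projection (the SELECT list) in one query
rows = con.execute("""
    SELECT c.name, s.item, s.amount
    FROM customer c JOIN sales s ON c.cust_id = s.cust_id
    WHERE s.amount > 500
""").fetchall()
print(rows)  # [('Smith', 'computer', 1200.0)]
```

A query optimizer would transform such a query into an efficient plan of relational operations before execution.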

2.Data Warehouses:

 A data warehouse is a repository of information collected from multiple sources, stored under a
unified schema, and usually residing at a single site. Data warehouses are constructed via a process
of data cleaning, data integration, data transformation, data loading, and periodic data refreshing.
 Figure 2 shows the typical framework for construction and use of a data warehouse for AllElectronics.

Figure 2: Typical framework of a data warehouse for AllElectronics.
 To facilitate decision making, the data in a data warehouse are organized around major subjects
(e.g., customer, item, supplier, and activity). The data are stored to provide information from a
historical perspective, such as in the past 6 to 12 months, and are typically summarized. For example,
rather than storing the details of each sales transaction, the data warehouse may store a summary of
the transactions per item type for each store or, summarized to a higher level, for each sales region.
 A data warehouse is usually modeled by a multidimensional data structure, called a data cube, in
which each dimension corresponds to an attribute or a set of attributes in the schema, and each cell
stores the value of some aggregate measure such as count or sum(sales_amount). A data cube
provides a multidimensional view of data and allows the precomputation and fast access of
summarized data.
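A data cube of the kind described above can be sketched in a few lines: precomputing sum(sales_amount) for every cell over two assumed dimensions, item and region, including the "all" roll-ups (the fact records are invented):

```python
from collections import defaultdict
from itertools import product

# Hypothetical fact records: (item, region, sales_amount)
facts = [("computer", "east", 1200), ("computer", "west", 900),
         ("printer", "east", 150), ("printer", "east", 200)]

cube = defaultdict(float)
for item, region, amount in facts:
    # Aggregate every cell, including the "all" roll-up along each dimension
    for i, r in product((item, "all"), (region, "all")):
        cube[(i, r)] += amount

print(cube[("printer", "east")])  # 350.0
print(cube[("computer", "all")])  # 2100.0 (rolled up over region)
print(cube[("all", "all")])       # 2450.0 (grand total)
```

Because the aggregates are precomputed, any cell of the cube can then be read in constant time, which is the point of the multidimensional view.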
3. Transactional Data:
 In general, each record in a transactional database captures a transaction, such as a customer’s
purchase, a flight booking, or a user’s clicks on a web page.
 A transaction typically includes a unique transaction identity number (trans_ID) and a list of the
items making up the transaction, such as the items purchased in the transaction.
 A transactional database may have additional tables, which contain other information related to
the transactions, such as item description, information about the sales person or the branch and so
on.

4. Multimedia Databases:

 Multimedia databases consist of audio, video, image, and text media.
 They can be stored in object-oriented databases.
 They are used to store complex information in pre-specified formats.
 Application: digital libraries, video-on-demand, news-on-demand, musical databases, etc.
5. Spatial Databases:
 Store geographical information, in the form of coordinates, topology, lines, polygons, etc.
 Application: maps, global positioning, etc.
6. Time Series Databases:
 Time series databases contain data such as stock exchange prices and user-logged activities.
 They handle arrays of numbers indexed by time, date, etc.
 They often require real-time analysis.
 Examples: eXtremeDB, Graphite, InfluxDB, etc.
7. WWW:
 The World Wide Web (WWW) is a collection of documents and resources such as audio, video, and text, identified by Uniform Resource Locators (URLs), linked by HTML pages, and accessible through web browsers over the Internet.
 It is the most heterogeneous repository, as it collects data from multiple sources.
 It is dynamic in nature, as the volume of data is continuously increasing and changing.
 Application: online shopping, job search, research, studying, etc.
8. Flat Files:
 Flat files are data files in text or binary form with a structure that can be easily extracted by data mining algorithms.
 Data stored in flat files have no relationships or paths among themselves; if a relational database is stored in flat files, there will be no relations between the tables.
 Flat files are described by a data dictionary. Example: a CSV file.
 Application: used in data warehousing to store data, used for carrying data to and from servers, etc.
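Reading a flat file is straightforward with the standard library; the CSV contents below are invented, and any structure (such as grouping rows by transaction) must be rebuilt by the program, since the file itself stores no relationships:

```python
import csv
import io

# An in-memory stand-in for a CSV flat file with hypothetical contents
flat_file = io.StringIO(
    "trans_ID,item\nT100,computer\nT100,software\nT200,printer\n")
records = list(csv.DictReader(flat_file))

print(records[0])  # {'trans_ID': 'T100', 'item': 'computer'}
print(len(records))  # 3
```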
Data Mining Functionalities - What Kinds of Patterns Can Be Mined? (Data Mining Techniques)
Data mining functionalities are used to specify the kinds of patterns to be found in data mining tasks. In general, such tasks can be classified into two categories: descriptive and predictive.
 Descriptive mining tasks characterize properties of the data in a target dataset.
 Predictive mining tasks perform induction on the current data in order to make predictions.
1. Class/Concept Description: Characterization and Discrimination:
 Data entries can be associated with classes or concepts.
 For example, in the AllElectronics store, classes of items for sale include computers and printers, and concepts of customers include bigSpenders and budgetSpenders.
 It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept descriptions.
 These descriptions can be derived using
1. Data characterization, by summarizing the data of the class under study (often called the target class) in general terms, or
2. Data discrimination, by comparison of the target class with one or a set of comparative classes (often called the contrasting classes), or
3. Both data characterization and discrimination.
Data characterization:
 Is a summarization of the general characteristics or features of a target class of data. The data
corresponding to the user-specified class are typically collected by a query.
 For example, to study the characteristics of software products with sales that increased by 10% in the
previous year, the data related to such products can be collected by executing an SQL query on the
sales database.
 There are several methods for effective data summarization and characterization. Simple data
summaries based on statistical measures and plots. The data cube-based OLAP roll-up operation can
be used to perform user-controlled data summarization along a specified dimension. An attribute-
oriented induction technique can be used to perform data generalization and characterization
without step-by-step user interaction.
 The output of data characterization can be presented in various forms.
 Examples include pie charts, bar charts, curves, multidimensional data cubes, and multidimensional
tables, including crosstabs. The resulting descriptions can also be presented as generalized relations
or in rule form (called characteristic rules).
 Example: A customer relationship manager at All Electronics may order the following data mining task:
Summarize the characteristics of customers who spend more than $5000 a year at All Electronics. The
result is a general profile of these customers, such as that they are 40 to 50 years old, employed, and
have excellent credit ratings. The data mining system should allow the customer relationship manager
to drill down on any dimension, such as on occupation to view these customers according to their type
of employment.
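The characterization task in the example above can be mimicked in a few lines of plain Python; the customer records and field names here are hypothetical:

```python
# Toy customer data; "spend" is annual spending in dollars (invented values)
customers = [
    {"name": "A", "age": 45, "occupation": "engineer", "spend": 6200},
    {"name": "B", "age": 48, "occupation": "manager",  "spend": 7100},
    {"name": "C", "age": 30, "occupation": "student",  "spend": 800},
]

# Collect the user-specified target class (customers spending > $5000)
target = [c for c in customers if c["spend"] > 5000]

# Summarize the general characteristics of that class
ages = [c["age"] for c in target]
profile = {"count": len(target), "age_min": min(ages), "age_max": max(ages)}
print(profile)  # {'count': 2, 'age_min': 45, 'age_max': 48}
```

Drilling down on a dimension such as occupation would simply mean repeating the summary per occupation value.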
Data discrimination
 Is a comparison of the general features of the target class data objects against the general features
of objects from one or multiple contrasting classes.
 The target and contrasting classes can be specified by a user, and the corresponding data objects can
be retrieved through database queries.
 For example, a user may want to compare the general features of software products with sales that
increased by 10% last year against those with sales that decreased by at least 30% during the same
period. The methods used for data discrimination are similar to those used for data characterization.
2. Mining Frequent Patterns, Associations, and Correlations:
 Frequent patterns are patterns that occur frequently in transactional data. The kinds of frequent patterns include:
 Frequent itemset − a set of items that frequently appear together; for example, milk and bread, which are frequently bought together in grocery stores by many customers.
 Frequent subsequence − a sequence of patterns that occurs frequently, such as the pattern that customers tend to purchase first a laptop, followed by a digital camera, and then a memory card; this is a (frequent) sequential pattern.
 Frequent substructure − a substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with itemsets or subsequences.
 Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
Association analysis. Suppose that, as a marketing manager at AllElectronics, you want to know which items are frequently purchased together (i.e., within the same transaction). An example of such a rule, mined from the AllElectronics transactional database, is

buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%],

where X is a variable representing a customer. A confidence, or certainty, of 50% means that if a customer buys a computer, there is a 50% chance that she will buy software as well. A 1% support means that 1% of all the transactions under analysis show that computer and software are purchased together. This association rule involves a single attribute or predicate (i.e., buys) that repeats. Association rules that contain a single predicate are referred to as single-dimensional association rules.
Suppose, instead, that we are given the AllElectronics relational database related to purchases. A data mining system may find association rules like

age(X, "20..29") ∧ income(X, "40K..49K") ⇒ buys(X, "laptop") [support = 2%, confidence = 60%].

The rule indicates that of the AllElectronics customers under study, 2% are 20 to 29 years old with an income of $40,000 to $49,000 and have purchased a laptop (computer) at AllElectronics. There is a 60% probability that a customer in this age and income group will purchase a laptop. Note that this is an association involving more than one attribute or predicate (i.e., age, income, and buys). Adopting the terminology used in multidimensional databases, where each attribute is referred to as a dimension, the above rule can be referred to as a multidimensional association rule.
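Support and confidence for a single rule of the form buys(X, "computer") ⇒ buys(X, "software") can be computed directly from a transaction list; the five transactions below are invented:

```python
# Each transaction is the set of items bought together
transactions = [
    {"computer", "software"}, {"computer"}, {"printer"},
    {"computer", "software", "printer"}, {"software"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / n              # fraction of all transactions with both items
confidence = both / antecedent  # chance of software given computer
print(support, confidence)  # 0.4 0.6666666666666666
```

Real miners (e.g., Apriori-style algorithms) search over all candidate itemsets rather than evaluating one fixed rule, but the two measures are computed exactly this way.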

3. Classification and Regression for Predictive Analysis:

 Classification is the process of finding a model (or function) that describes and distinguishes data classes or concepts.
 The model is derived based on the analysis of a set of training data (i.e., data objects for which the class labels are known). The model is used to predict the class label of objects for which the class label is unknown.
The derived model may be represented in various forms, such as classification rules (i.e., IF-THEN rules), decision trees, mathematical formulae, or neural networks.

Figure 3: A classification model can be represented in various forms: (a) IF-THEN rules, (b) a decision tree, or (c) a neural network.
A decision tree is a flowchart-like tree structure, where each node denotes a test on an attribute value,
each branch represents an outcome of the test, and tree leaves represent classes or class distributions.
Decision trees can easily be converted to classification rules.
A neural network, when used for classification, is typically a collection of neuron-like processing units with weighted connections between the units. There are many other methods for constructing classification models, such as naive Bayesian classification, support vector machines, and k-nearest-neighbor classification.
Whereas classification predicts categorical (discrete, unordered) labels, regression models continuous-
valued functions. That is, regression is used to predict missing or unavailable numerical data values rather
than (discrete) class labels. The term prediction refers to both numeric prediction and class label prediction.
Regression analysis is a statistical methodology that is most often used for numeric prediction, although
other methods exist as well. Regression also encompasses the identification of distribution trends based on
the available data.
Classification and regression may need to be preceded by relevance analysis, which attempts to identify
attributes that are significantly relevant to the classification and regression process. Such attributes will be
selected for the classification and regression process. Other attributes, which are irrelevant, can then be
excluded from consideration.
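A classifier in IF-THEN rule form, of the kind shown in Figure 3(a), can be written directly as code. The rules, attribute names, and thresholds below are invented for illustration, not taken from the figure:

```python
def classify(customer):
    """Predict the class label buys_computer for one data object."""
    # Rule 1: IF age < 30 AND student THEN buys_computer = yes
    if customer["age"] < 30 and customer["student"]:
        return "yes"
    # Rule 2: IF income = high THEN buys_computer = yes
    if customer["income"] == "high":
        return "yes"
    # Default class when no rule fires
    return "no"

print(classify({"age": 25, "student": True, "income": "low"}))   # yes
print(classify({"age": 45, "student": False, "income": "low"}))  # no
```

A learned decision tree converts to exactly such a rule list: one rule per root-to-leaf path.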
4. Cluster Analysis
Unlike classification and regression, which analyze class-labeled (training) data sets, clustering analyzes data objects without consulting class labels. In many cases, class-labeled data may simply not exist at the beginning.
 Clustering can be used to generate class labels for a group of data.
 The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity. That is, clusters of objects are formed so that objects within a cluster have high similarity in comparison to one another, but are rather dissimilar to objects in other clusters. Each cluster so formed can be viewed as a class of objects, from which rules can be derived.
 Clustering can also facilitate taxonomy formation, that is, the organization of observations into a hierarchy of classes that group similar events together.

Figure 1.10 A 2-D plot of customer data with respect to customer locations in a city, showing three data
clusters.
Example 1.9 Cluster analysis. Cluster analysis can be performed on AllElectronics customer data to identify
homogeneous subpopulations of customers. These clusters may represent individual target groups for
marketing. Figure 1.10 shows a 2-D plot of customers with respect to customer locations in a city.
Three clusters of data points are evident.
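One common way to find such groups is k-means; the text does not prescribe a particular algorithm, so the following is only a bare-bones sketch on invented 2-D customer locations:

```python
def kmeans(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center
        groups = {i: [] for i in range(len(centers))}
        for x, y in points:
            nearest = min(range(len(centers)),
                          key=lambda i: (x - centers[i][0]) ** 2
                                        + (y - centers[i][1]) ** 2)
            groups[nearest].append((x, y))
        # Update step: move each center to the mean of its group
        for i, members in groups.items():
            if members:
                centers[i] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
result = kmeans(points, [(0, 0), (10, 10)])
print(result)  # two centers, near (1.33, 1.33) and (8.33, 8.33)
```

Each resulting center summarizes one cluster, maximizing intraclass similarity (points close to their own center) while keeping the clusters far apart.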

5. Outlier Analysis: A data set may contain objects that do not comply with the general behavior or model of the data. These data objects are outliers. Many data mining methods discard outliers as noise or exceptions. However, in some applications (e.g., fraud detection) the rare events can be more interesting than the more regularly occurring ones. The analysis of outlier data is referred to as outlier analysis or anomaly mining.

Outliers may be detected using statistical tests that assume a distribution or probability model for the data, or using distance measures, where objects that are remote from any other cluster are considered outliers. Rather than using statistical or distance measures, density-based methods may identify outliers in a local region even though those objects look normal from a global statistical distribution view.
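A minimal statistical outlier test of the kind mentioned above flags values far from the mean; the "more than two standard deviations" threshold is a common rule of thumb, and the values are invented:

```python
import statistics

values = [10, 12, 11, 9, 10, 11, 10, 95]  # 95 is an injected anomaly
mu = statistics.mean(values)
sigma = statistics.pstdev(values)  # population standard deviation

# Flag any value more than two standard deviations from the mean
outliers = [v for v in values if abs(v - mu) > 2 * sigma]
print(outliers)  # [95]
```

Distance- and density-based detectors follow the same pattern but score each object against its neighbors instead of against a global mean.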

Major Issues in Data Mining: Data mining is a dynamic and fast-expanding field with great strengths. The major issues can be divided into five groups:

a) Mining methodology
b) User interaction
c) Efficiency and scalability
d) Diverse data types
e) Data mining and society
a) Mining Methodology:
 This refers to the following kinds of issues −
 Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore, data mining needs to cover a broad range of knowledge discovery tasks.
 Mining knowledge in multidimensional space − When searching for knowledge in large data sets, we can explore the data in multidimensional space.
 Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
 Pattern evaluation − Not all discovered patterns are interesting; some may represent common knowledge or lack novelty, so interestingness measures are needed to evaluate them.
b) User Interaction:
 Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive, because interactivity allows users to focus the search for patterns, providing and refining data mining requests based on the returned results.
 Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
 Data mining query languages and ad hoc data mining − A data mining query language that allows the user to describe ad hoc mining tasks should be integrated with a data warehouse query language and optimized for efficient and flexible data mining.
 Presentation and visualization of data mining results − Once patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
c) Efficiency and Scalability
There can be performance-related issues such as the following −
 Efficiency and scalability of data mining algorithms − In order to effectively extract information from huge amounts of data in databases, data mining algorithms must be efficient and scalable.
 Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions that are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the entire data again from scratch.
d) Diverse Data Types
 Handling of relational and complex types of data − A database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
 Mining information from heterogeneous databases and global information systems − Data are available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Mining knowledge from them therefore adds challenges to data mining.
e) Data Mining and Society
 Social impacts of data mining − With data mining penetrating our everyday lives, it is important to study the impact of data mining on society.
 Privacy-preserving data mining − Data mining can help scientific discovery, business management, economic recovery, and security protection, but it also risks disclosing personal information, so privacy-preserving methods are needed.
 Invisible data mining − We cannot expect everyone in society to learn and master data mining techniques. More and more systems should have data mining functions built in, so that people can perform data mining or use data mining results simply by clicking a mouse, without any knowledge of data mining algorithms.
Data Mining Applications: The list of areas where data mining is widely used −
 Financial Data Analysis
 Retail Industry
 Telecommunication Industry
 Biological Data Analysis
 Other Scientific Applications
 Intrusion Detection
Financial Data Analysis: Financial data in the banking and finance industry are generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some typical cases are as follows −
 Design and construction of data warehouses for multidimensional data analysis and data mining.
 Loan payment prediction and customer credit policy analysis.
 Classification and clustering of customers for targeted marketing.
 Detection of money laundering and other financial crimes.

Retail Industry: Data mining has great application in the retail industry because the industry collects large amounts of data on sales, customer purchasing history, goods transportation, consumption, and services. The quantity of data collected will naturally continue to expand rapidly because of the increasing ease, availability, and popularity of the web.
Data mining in retail industry helps in identifying customer buying patterns and trends that lead to improved
quality of customer service and good customer retention and satisfaction. Here is the list of examples of data
mining in the retail industry −

 Design and Construction of data warehouses based on the benefits of data mining.
 Multidimensional analysis of sales, customers, products, time and region.
 Analysis of effectiveness of sales campaigns.
 Customer Retention.
 Product recommendation and cross-referencing of items.

Telecommunication Industry
Today the telecommunication industry is one of the most rapidly emerging industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, web data transmission, etc. Due to the development of new computer and communication technologies, the telecommunication industry is expanding quickly. This is why data mining has become very important in helping to understand the business.
Data mining in the telecommunication industry helps to identify telecommunication patterns, catch fraudulent activities, make better use of resources, and improve quality of service. Here is a list of examples for which data mining improves telecommunication services −

 Multidimensional Analysis of Telecommunication data.


 Fraudulent pattern analysis.
 Identification of unusual patterns.
 Multidimensional association and sequential patterns analysis.
 Mobile Telecommunication services.
 Use of visualization tools in telecommunication data analysis.
Biological Data Analysis: In recent times, we have seen tremendous growth in the field of biology, such as genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics. The following are aspects in which data mining contributes to biological data analysis −
 Semantic integration of heterogeneous, distributed genomic and proteomic databases.
 Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.
 Discovery of structural patterns and analysis of genetic networks and protein pathways.
 Association and path analysis.
 Visualization tools in genetic data analysis.

Data Objects and Attribute Types:

 Data sets are made up of data objects.
 A data object represents an entity.
 Example: In a sales database, the objects may be customers, store items, and sales; in a medical database, the objects may be patients; in a university database, the objects may be students, professors, and courses.
 Data objects are typically described by attributes.
 Data objects can also be referred to as samples, examples, instances, data points, or objects. If the data objects are stored in a database, they are data tuples. That is, the rows of a database correspond to the data objects, and the columns correspond to the attributes. Below, we define attributes and look at the various attribute types.
What Is an Attribute?
 An attribute is a data field, representing a characteristic or feature of a data object.
 The nouns attribute, dimension, feature, and variable are often used interchangeably in the literature.
 Attributes describing a customer object can include, for example, customer ID, name, and
address.
 Observed values for a given attribute are known as observations.
Attribute values can be of the following types: 1. nominal, 2. binary, 3. ordinal, 4. numeric, and 5. other attribute types:
a) discrete
b) continuous
1) Nominal Attributes
 Nominal means "relating to names." The values of a nominal attribute are symbols or names of things.
 Each value represents some kind of category, code, or state, and so nominal attributes are also referred to as categorical.
Example Suppose that hair color and marital status are two attributes describing person objects. In our
application, possible values for hair color are black, brown, blond, red, auburn, gray, and white. The attribute
marital status can take on the values single, married, divorced, and widowed. Both hair color and marital
status are nominal attributes. Another example of a nominal attribute is occupation, with the values teacher,
dentist, programmer, farmer, and so on.

Attribute          Values
Color              black, green, brown, red
Marital status     single, married, divorced, widowed
2) Binary Attributes
 A binary attribute is a nominal attribute with only two categories or states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it is present.
 Binary attributes are referred to as Boolean if the two states correspond to true and false.

Example 2.2: Suppose a patient undergoes a medical test that has two possible outcomes. The attribute medical test is binary, where a value of 1 means the result of the test for the patient is positive, while 0 means the result is negative.

a) Symmetric binary attribute: a binary attribute is symmetric if both of its states are equally valuable and carry the same weight.
b) Asymmetric binary attribute: a binary attribute is asymmetric if the outcomes of the states are not equally important.

3) Ordinal Attributes: An ordinal attribute is an attribute with possible values that have a meaningful order or ranking among them, but the magnitude between successive values is not known.

Example 2.3
Eg 1: Suppose that drink size corresponds to the size of drinks available at a fast-food restaurant. This ordinal attribute has three possible values: small, medium, and large. The values have a meaningful sequence (which corresponds to increasing drink size); however, we cannot tell from the values how much bigger, say, a large is than a medium.

Eg 2: Other examples of ordinal attributes include grade (e.g., A+, A, A−, B+, and so on) and professional rank. Professional ranks can be enumerated in a sequential order: for example, assistant, associate, and full for professors, and private, private first class, specialist, corporal, and sergeant for army ranks.

Eg 3: In one survey, participants were asked to rate how satisfied they were as customers. Customer satisfaction had the following ordinal categories: 0: very dissatisfied, 1: somewhat dissatisfied, 2: neutral, 3: satisfied, and 4: very satisfied.

4)Numeric Attributes: A numeric attribute is quantitative; that is, it is a measurable quantity, represented in
integer or real values.
Numeric attributes are two types a) interval-scaled b) ratio-scaled.
a) Interval-Scaled Attributes: An interval-scaled attribute has values, whose differences are
interpretable, but the numerical attributes do not have the correct reference point or we can call zero point.

 Data can be added and subtracted at interval scale but cannot be multiplied or divided.
 Consider an example of temperature in degrees Centigrade. If a day’s temperature of one day is twice
than the other day we cannot say that one day is twice as hot as another day.
b) Ratio-Scaled Attributes: A ratio-scaled attribute is a numeric attribute with a fixed
zero-point.

 If a measurement is ratio-scaled, we can speak of a value as being a multiple (or ratio) of another value.
 The values are ordered, and we can also compute the difference between values, as well as the mean, median,
mode, quantile range, and five-number summary.
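The interval vs. ratio distinction can be illustrated with a short sketch; the temperatures here are hypothetical, chosen only to make the point:

```python
# Interval-scaled: Celsius has an arbitrary zero, so differences are
# meaningful but ratios are not. Kelvin is ratio-scaled (true zero point).
day1_c, day2_c = 10.0, 20.0               # hypothetical temperatures in Celsius

diff = day2_c - day1_c                    # 10.0 -- "10 degrees warmer" is valid
celsius_ratio = day2_c / day1_c           # 2.0 -- misleading: NOT "twice as hot"

# Converting to the ratio scale (Kelvin) gives the physically meaningful ratio:
day1_k, day2_k = day1_c + 273.15, day2_c + 273.15
kelvin_ratio = day2_k / day1_k            # ~1.035

print(diff, celsius_ratio, round(kelvin_ratio, 3))
```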
5) Other attribute types are: a) Discrete attributes, b) Continuous attributes.

a) Discrete Attributes: A discrete attribute has a finite or countably infinite set of values; the values
may be numeric or categorical.
Example: zip codes, the number of words in a document, or a binary attribute such as smoker/non-smoker.

b) Continuous Attributes: A continuous attribute has an infinite number of possible values and is typically
represented as a real (floating-point) number; there can be infinitely many values between 2 and 3.

Example: temperature, height, weight.

Basic Statistical Descriptions of Data


 For data preprocessing to be successful, it is essential to have an overall picture of your data.
 Basic statistical descriptions can be used to identify properties of the data and highlight which data values
should be treated as noise or outliers.
Here we have three areas of basic statistical descriptions.
1. We start with measures of central tendency, which measure the location of the middle or
center of a data distribution.
Measures of central tendency include the mean, median, mode, and midrange.
2. In addition to assessing the central tendency of our data set, we would also like to have an idea of the
dispersion of the data, that is, how the data are spread out.
Measures of data dispersion include range, quartiles, interquartile range (IQR), the five-
number summary, boxplots, and the variance and standard deviation of the data.

3. Finally, we can use many graphic displays of basic statistical descriptions to visually
inspect our data. Most statistical or graphical data presentation software packages include bar
charts, pie charts, and line graphs.

Other popular displays of data summaries and distributions include quantile plots,
quantile–quantile plots, histograms, and scatter plots.
Measuring the Central Tendency: Mean, Median, and Mode
Measures of central tendency include the mean, median, mode, and midrange.
Mean: The most common and most effective numerical measure of the “center” of a set of data is the
arithmetic mean.

Let x1, x2, . . . , xN be a set of N values or observations, such as for some numeric attribute X.

MEAN: x̄ = (x1 + x2 + · · · + xN)/N = (1/N) Σ xi

Sometimes, each value xi in a set may be associated with a weight wi . – The weights reflect the significance
and importance attached to their respective values.

Weighted Mean: x̄ = (w1 x1 + w2 x2 + · · · + wN xN)/(w1 + w2 + · · · + wN) = Σ wi xi / Σ wi

Example: Suppose we have the following values for salary (in thousands of dollars), shown in increasing order: 30,
36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110. Using the mean formula, we have

x̄ = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110)/12 = 696/12 = 58

Thus, the mean salary is $58,000.
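As a quick check, the mean of the salary data, and a weighted mean, can be computed directly; the weights below are hypothetical, not from the text:

```python
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # in $1000s

mean = sum(salaries) / len(salaries)
print(mean)  # 58.0 -> $58,000, as in the example

# Weighted mean: each value x_i carries a weight w_i reflecting its importance.
# Hypothetical weights: count the last observation twice.
weights = [1] * 11 + [2]
weighted_mean = sum(w * x for w, x in zip(weights, salaries)) / sum(weights)
print(weighted_mean)  # 62.0
```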
Median: Suppose the given data set of N values for an attribute X is sorted in increasing order.

 If N is odd, then the median is the middle value of the ordered set.
 If N is even, then the median is not unique; it is the two middlemost values and any value in between.
 If X is a numeric attribute, then by convention the median is taken as the average of the two
middlemost values.
Example: The two middlemost values are 52 and 56 (the sixth and seventh values in the sorted list).
The median is their average: (52 + 56)/2 = 108/2 = 54. Thus,
the median is $54,000.
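The median rule for even and odd N can be verified with Python's statistics module:

```python
import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Even N (12): the median is the average of the two middlemost values, 52 and 56.
even_median = statistics.median(salaries)
print(even_median)                      # 54.0 -> $54,000

# Odd N (11, dropping the last value): the median is the single middle value.
odd_median = statistics.median(salaries[:-1])
print(odd_median)                       # 52
```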
Mode: The mode is another measure of central tendency.
 The mode for a set of data is the value that occurs most frequently in the set.
 Data sets with one, two, or three modes are respectively called unimodal, bimodal, and trimodal.
 In general, a data set with two or more modes is multimodal. At the other extreme, if each data value
occurs only once, then there is no mode.
For unimodal numeric data that are moderately skewed (asymmetrical), we have the following empirical relation:
mean − mode ≈ 3 × (mean − median).

Midrange: The midrange can also be used to assess the central tendency of a numeric data set.
It is the average of the largest and smallest values in the set. This measure is easy to compute using the SQL
aggregate functions, max() and min().
Example: The midrange of the salary data is (30,000 + 110,000)/2 = $70,000.

 In a unimodal frequency curve with a perfectly symmetric data distribution, the mean, median, and mode
are all at the same center value.
 In a positively skewed distribution, the mode occurs at a value smaller than the median.
 In a negatively skewed distribution, the mode occurs at a value greater than the median.
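The mode and midrange of the same salary data can be computed as follows; note that the data happen to be bimodal (52 and 70 each occur twice):

```python
import statistics

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# multimode returns all most-frequently occurring values (Python 3.8+)
modes = statistics.multimode(salaries)
print(modes)                              # [52, 70] -- bimodal

# Midrange: the average of the smallest and largest values
midrange = (min(salaries) + max(salaries)) / 2
print(midrange)                           # 70.0 -> $70,000
```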

Measuring the Dispersion of Data: Range, Quartiles, Variance, Standard Deviation,
and Interquartile Range
 To assess the dispersion or spread of numeric data.
 The measures include range, quantiles, quartiles, percentiles, and the interquartile range. The
five-number summary, which can be displayed as a boxplot, is useful in identifying outliers.
Variance and standard deviation also indicate the spread of a data distribution.

 Range: Let x1, x2, . . . , xN be a set of observations for some numeric attribute, X. The range
of the set is the difference between the largest (max()) and smallest (min()) values.
 Quantiles: Quantiles are points taken at regular intervals of a data distribution, dividing it
into essentially equal-size consecutive sets.
 Quartiles: The 4-quantiles are the three data points that split the distribution into four parts,
each representing one-fourth of the data distribution. They are more commonly
referred to as quartiles.
 Percentiles: The 100-quantiles are more commonly referred to as percentiles; they divide the data
distribution into 100 equal-sized consecutive sets. The median, quartiles, and percentiles are the
most widely used forms of quantiles.

The first quartile, denoted by Q1, is the 25th percentile. It cuts off the lowest 25% of the data. The third
quartile, denoted by Q3, is the 75th percentile—it cuts off the lowest 75% (or highest 25%) of the data. The
second quartile is the 50th percentile. As the median, it gives the center of the data distribution.
InterQuartile Range: The distance between the first and third quartiles is a simple measure of spread that
gives the range covered by the middle half of the data. This distance is called the interquartile range (IQR)
and is defined as

IQR = Q3 − Q1 .
Example: Interquartile range. The quartiles are the three values that split the sorted data set into four equal parts.
The salary data of the earlier example contain 12 observations, already sorted in increasing order. Thus, the quartiles for this
data are the third, sixth, and ninth values, respectively, in the sorted list.

Therefore, Q1 = $47,000 and Q3 = $63,000. Thus, IQR = 63 − 47 = $16,000.
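Following the text's convention (quartiles at the third, sixth, and ninth sorted values), the IQR can be computed in a few lines; note that libraries such as NumPy interpolate between values and may return slightly different quartiles:

```python
data = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
n = len(data)  # 12

# Quartiles at the 3rd, 6th, and 9th sorted values (the text's convention)
q1 = data[n // 4 - 1]        # 3rd value
q2 = data[n // 2 - 1]        # 6th value
q3 = data[3 * n // 4 - 1]    # 9th value

iqr = q3 - q1
print(q1, q2, q3, iqr)       # 47 52 63 16 -> IQR = $16,000
```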
Five-Number Summary, Boxplots and Outliers

 Five-Number Summary: This consists of five values: Minimum, Q1, Median (Q2),
Q3, and Maximum.
These Five numbers are represented as Boxplot in graphical format.
A boxplot incorporates the five-number summary as follows:
 The data are represented with a box.
 The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR.
 The median is marked by a line within the box.
 Whiskers: Two lines outside the box extend to the minimum and maximum observations.
 To show outliers, the whiskers extend to the extreme low and high
observations only if these values lie within 1.5 × IQR of the
quartiles; observations beyond that are plotted individually as outliers.
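The whisker rule can be sketched as code: observations beyond 1.5 × IQR from the quartiles are flagged as outliers (using the quartiles from the salary example):

```python
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q3 = 47, 63                      # quartiles from the salary example
iqr = q3 - q1                        # 16

low_fence = q1 - 1.5 * iqr           # 23.0
high_fence = q3 + 1.5 * iqr          # 87.0

# Points beyond the fences are plotted individually as outliers
outliers = [x for x in data if x < low_fence or x > high_fence]

# Whiskers extend to the most extreme observations still inside the fences
whisker_lo = min(x for x in data if x >= low_fence)
whisker_hi = max(x for x in data if x <= high_fence)
print(outliers, whisker_lo, whisker_hi)  # [110] 30 70
```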

Figure: Boxplot for the unit price data for items sold at four branches of AllElectronics during a given time
period.
Example: Boxplot. The figure shows boxplots for unit price data for items sold at four branches of
AllElectronics during a given time period. For branch 1, we see that the median price of items sold is
$80, Q1 is $60, and Q3 is $100. Notice that two outlying observations for this branch were plotted
individually, as their values fall more than 1.5 × IQR above Q3.

Variance and Standard Deviation


The variance of N observations, x1, x2, . . . , xN, for a numeric attribute X is

σ² = (1/N) Σ (xi − x̄)² = ((1/N) Σ xi²) − x̄²

where x̄ is the mean value of the observations. The standard deviation, σ, of the
observations is the square root of the variance, σ².
Example: Variance and standard deviation. We found x̄ = $58,000 for the mean of the salary data. To determine
the variance and standard deviation, we set N = 12 and obtain

σ² = (1/12)(30² + 36² + 47² + · · · + 110²) − 58² ≈ 379.17

σ = √379.17 ≈ 19.47.
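The shortcut computation above can be verified in a few lines:

```python
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
n = len(data)

mean = sum(data) / n                                # 58.0
# Shortcut form of the population variance (1/N) * sum((x_i - mean)^2)
variance = sum(x * x for x in data) / n - mean ** 2
std_dev = variance ** 0.5

print(round(variance, 2), round(std_dev, 2))        # 379.17 19.47
```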

The basic properties of the standard deviation, σ, as a measure of spread are as follows:
 σ measures spread about the mean and should be considered only when the mean is chosen as the measure of
center.
 σ = 0 only when there is no spread, that is, when all observations have the same value. Otherwise, σ > 0.
Graphical Displays of Basic Statistical Descriptions:
 Graphic displays of basic statistical descriptions include quantile plots, quantile–quantile plots,
histograms, and scatter plots. Such graphs are helpful for the visual inspection of data, which is
useful for data preprocessing.
 The first three of these show univariate distributions (i.e., data for one attribute), while scatter plots
show bivariate distributions (i.e., involving two attributes).

Quantile plots:
 A quantile plot is a simple and effective way to have a first look at a
univariate data distribution.
 Plots quantile information
 For data xi sorted in increasing order, fi indicates that
approximately 100·fi % of the data are below or equal to the value xi.
 Note that the 0.25 quantile corresponds to quartile Q1, the 0.50 quantile is the median, and
the 0.75 quantile is Q3.
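The (fi, xi) pairs of a quantile plot can be generated with the common plotting position fi = (i − 0.5)/N (one of several conventions in use):

```python
data = sorted([30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110])
N = len(data)

# f_i indicates that roughly 100*f_i % of the data lie at or below x_i
points = [((i - 0.5) / N, x) for i, x in enumerate(data, start=1)]

for f, x in points[:3]:
    print(round(f, 3), x)
# The 0.50 quantile (the median) falls between the 6th and 7th points,
# i.e., between the values 52 and 56.
```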
Quantile - Quantile plots:
In statistics, a Q–Q plot is a probability plot: a graphical method for comparing two
probability distributions by plotting their quantiles against each other.
Example: A quantile–quantile plot for unit price data of items sold at two branches of All
Electronics during a given time period. Each point corresponds to the same quantile for each
data set and shows the unit price of items sold at branch 1 versus branch 2 for that quantile.

Histograms or frequency histograms:


 Histograms (or frequency histograms) are at least a century old and are
widely used.
 “Histos” means pole or mast, and “gram” means chart, so a histogram is a chart
of poles.
 Plotting histograms is a graphical method for summarizing the distribution of a
given attribute, X. If X is nominal, such as automobile model or item type, then
a pole or vertical bar is drawn for each known value of X. The height of the
bar indicates the frequency (i.e., count) of that X value. The resulting graph is
more commonly known as a bar chart.
Scatter Plots and Data Correlation

 A scatter plot is one of the most effective graphical methods for determining if there appears to be a
relationship, clusters of points, or outliers between two numeric attributes.
 Each pair of values is treated as a pair of coordinates and plotted as points in the plane

The scatter plot is a useful method for providing a first look at bivariate data to see clusters of points
and outliers, or to explore the possibility of correlation relationships.
Two attributes, X, and Y , are correlated if one attribute implies the other. Correlations can be positive,
negative, or null (uncorrelated).
a) If the pattern of plotted points slopes from lower left to upper right, the
values of Y increase as the values of X increase, suggesting a positive correlation.
b) If the pattern of plotted points slopes from upper left to lower right, the values of Y
decrease as the values of X increase, suggesting a negative correlation.

There are also cases in which there is no correlation relationship between the two attributes in a given data
set. Scatter plots can be extended to n attributes, resulting in a scatter-plot matrix.
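The sign of Pearson's correlation coefficient matches the slope pattern described above; a minimal sketch:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    sx = (sum((x - mx) ** 2 for x in xs) / n) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / n) ** 0.5
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
print(round(pearson(xs, [2, 4, 6, 8, 10]), 4))   # 1.0  -> positive correlation
print(round(pearson(xs, [10, 8, 6, 4, 2]), 4))   # -1.0 -> negative correlation
```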
